Posted to commits@tvm.apache.org by GitBox <gi...@apache.org> on 2021/06/09 04:06:55 UTC

[GitHub] [tvm-rfcs] AndrewZhaoLuo opened a new pull request #6: Automatic Mixed Precision Pass RFC

AndrewZhaoLuo opened a new pull request #6:
URL: https://github.com/apache/tvm-rfcs/pull/6


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm-rfcs] comaniac commented on pull request #6: [RFC] [Relay] Automatic Mixed Precision Pass

Posted by GitBox <gi...@apache.org>.
comaniac commented on pull request #6:
URL: https://github.com/apache/tvm-rfcs/pull/6#issuecomment-904813648


   Thanks @AndrewZhaoLuo


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm-rfcs] comaniac commented on pull request #6: [RFC] [Relay] Automatic Mixed Precision Pass

Posted by GitBox <gi...@apache.org>.
comaniac commented on pull request #6:
URL: https://github.com/apache/tvm-rfcs/pull/6#issuecomment-893007973


   btw, according to #17, please update the RFC number in the file name to align with this PR number.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm-rfcs] AndrewZhaoLuo commented on pull request #6: [RFC] [Relay] Automatic Mixed Precision Pass

Posted by GitBox <gi...@apache.org>.
AndrewZhaoLuo commented on pull request #6:
URL: https://github.com/apache/tvm-rfcs/pull/6#issuecomment-900596242


   PTAL @comaniac 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm-rfcs] AndrewZhaoLuo commented on pull request #6: Automatic Mixed Precision Pass RFC

Posted by GitBox <gi...@apache.org>.
AndrewZhaoLuo commented on pull request #6:
URL: https://github.com/apache/tvm-rfcs/pull/6#issuecomment-857931228


   I don't know Chris Sullivan's github handle so if someone could cc him too that would be great.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm-rfcs] comaniac commented on a change in pull request #6: [RFC] [Relay] Automatic Mixed Precision Pass

Posted by GitBox <gi...@apache.org>.
comaniac commented on a change in pull request #6:
URL: https://github.com/apache/tvm-rfcs/pull/6#discussion_r690760136



##########
File path: rfcs/0001-AMP_pass.md
##########
@@ -0,0 +1,137 @@
+- Feature Name: Automatic Mixed Precision Pass
+- Start Date: 2021-06-08 
+- RFC PR: TODO
+- GitHub Issue: TODO
+
+# Summary
+[summary]: #summary
+
+Many pieces of hardware support operations not only on 32 bit floating point, but also 16 bit floating point. 
+These 16 bit operations typically have higher theoretical throughput and use less memory bandwidth.
+As a result, we can see significant speedups from replacing normal 32 bit operations with 16 bit analogs. 
+Surprisingly, for many operations this has little effect on the results, though some care must be taken when changing 
+operations. Some 16 bit floating point operations, such as `exp` and `log`, are considered less safe 
+due to loss of [numerical precision](https://on-demand.gputechconf.com/gtcdc/2019/pdf/dc91247-automatic-mixed-precision-in-tensorflow.pdf). 
+In general for a function `f`, if `|f(x)| >> |x|` for expected 
+ranges of input we probably do not want to use the 16 bit floating point versions.
+
+This feature will be a relay pass which automatically converts a 32 bit floating point model into a reduced bit 
+floating point analog. For the initial pass IEEE's 16 bit floating point will be targeted though future support
+for bfloat16 should be kept in mind.
+
+# Motivation
+[motivation]: #motivation
+
+Many machine learning models can move significant portions of their computational graphs into the FP16 space 
+without significant loss of accuracy. For many pieces of hardware this also comes with a boost in speed. In 
+the past utilizing FP16 in mixed precision training saw significant [increases in convergence speed](https://pytorch.org/blog/accelerating-training-on-nvidia-gpus-with-pytorch-automatic-mixed-precision/). 
+
+We should expect similar increases for inference. This speed increase without accuracy loss is highly desirable
+for many users.
+
+# Guide-level explanation
+[guide-level-explanation]: #guide-level-explanation
+
+Operations are partitioned into colors denoted "Green", "Red", and "Gray", which represent the benefit 
+of using a reduced floating point version of the operation. "Green" operations are compute intensive
+and almost always see hardware memory and latency savings by utilizing a reduced floating point form.
+Examples of these operations are matrix multiplies and convolutions. "Gray" operations see little to 
+no savings in using reduced floating point forms -- at least not enough to justify the overhead of 
+casting values back and forth from FP32. "Red" operations meanwhile are operations we do not want to 
+use reduced floating point forms on, usually due to numerical precision reasons.
+
+In general we always want to insert casts into reduced floating point space for "Green" operations, 
+are fine with transforming "Gray" operations into reduced floating point space if their inputs are already
+in that form, and want to explicitly cast back into full floating point space for "Red" operations. 
+Each operation will be placed into one of these lists via a "coloring" function which takes in Relay `CallNodes`
+and returns a color. For example, we might have a function which colors only a convolution as "Green" if it 
+has a large enough kernel and "Gray" otherwise. For the default implementation we will keep things simple
+however and do something like place all convolutions in the "Green" list, all element-wise operations in 
+the "Gray" list, and so on. Still, the code will be designed to be easily extensible via overwriting 
+this "coloring" function.
+
+The final variable we must keep in mind is the fact that some hardware platforms can operate on reduced
+floating point types. However, while they may, for example, take two FP16 operands, they may accumulate the 
+result in a 32 bit buffer. Examples of this are the Tensor Cores in Nvidia's Turing architecture. 
+The final knob we give is a control over how operations accumulate their result. For this, we have 
+a function, which maps operation types like `conv2d` to an accumulation datatype as well as an output 
+datatype. The output datatype is the type other operations down the line will likely ingest from the previous
+calculation while the accumulation datatype describes the size of buffer where the results are initially
+stored. For NVidia's tensor cores for example many operations accumulate in FP32 but have an output datatype
+of FP16. The default implementation will follow this guideline closely and will by default have all 
+operations output FP16 and accumulate in FP32 only if TVM supports mixed datatypes for that particular
+operation.
+
+# Reference-level explanation
+[reference-level-explanation]: #reference-level-explanation

Review comment:
       Yeah that would be sufficient for now. We will also need the BF16 support as well.

##########
File path: rfcs/0001-AMP_pass.md
##########
@@ -0,0 +1,137 @@
+- Feature Name: Automatic Mixed Precision Pass
+- Start Date: 2021-06-08 
+- RFC PR: TODO
+- GitHub Issue: TODO
+
+# Summary
+[summary]: #summary
+
+Many pieces of hardware support operations not only on 32 bit floating point, but also 16 bit floating point. 
+These 16 bit operations typically have higher theoretical throughput and use less memory bandwidth.
+As a result, we can see significant speedups from replacing normal 32 bit operations with 16 bit analogs. 
+Surprisingly, for many operations this has little effect on the results, though some care must be taken when changing 
+operations. Some 16 bit floating point operations, such as `exp` and `log`, are considered less safe 
+due to loss of [numerical precision](https://on-demand.gputechconf.com/gtcdc/2019/pdf/dc91247-automatic-mixed-precision-in-tensorflow.pdf). 
+In general for a function `f`, if `|f(x)| >> |x|` for expected 
+ranges of input we probably do not want to use the 16 bit floating point versions.
+
+This feature will be a relay pass which automatically converts a 32 bit floating point model into a reduced bit 
+floating point analog. For the initial pass IEEE's 16 bit floating point will be targeted though future support
+for bfloat16 should be kept in mind.
+
+# Motivation
+[motivation]: #motivation
+
+Many machine learning models can move significant portions of their computational graphs into the FP16 space 
+without significant loss of accuracy. For many pieces of hardware this also comes with a boost in speed. In 
+the past utilizing FP16 in mixed precision training saw significant [increases in convergence speed](https://pytorch.org/blog/accelerating-training-on-nvidia-gpus-with-pytorch-automatic-mixed-precision/). 
+
+We should expect similar increases for inference. This speed increase without accuracy loss is highly desirable
+for many users.
+
+# Guide-level explanation
+[guide-level-explanation]: #guide-level-explanation
+
+Operations are partitioned into colors denoted "Green", "Red", and "Gray", which represent the benefit 
+of using a reduced floating point version of the operation. "Green" operations are compute intensive
+and almost always see hardware memory and latency savings by utilizing a reduced floating point form.
+Examples of these operations are matrix multiplies and convolutions. "Gray" operations see little to 
+no savings in using reduced floating point forms -- at least not enough to justify the overhead of 
+casting values back and forth from FP32. "Red" operations meanwhile are operations we do not want to 
+use reduced floating point forms on, usually due to numerical precision reasons.
+
+In general we always want to insert casts into reduced floating point space for "Green" operations, 
+are fine with transforming "Gray" operations into reduced floating point space if their inputs are already
+in that form, and want to explicitly cast back into full floating point space for "Red" operations. 
+Each operation will be placed into one of these lists via a "coloring" function which takes in Relay `CallNodes`
+and returns a color. For example, we might have a function which colors only a convolution as "Green" if it 
+has a large enough kernel and "Gray" otherwise. For the default implementation we will keep things simple
+however and do something like place all convolutions in the "Green" list, all element-wise operations in 
+the "Gray" list, and so on. Still, the code will be designed to be easily extensible via overwriting 
+this "coloring" function.
+
+The final variable we must keep in mind is the fact that some hardware platforms can operate on reduced
+floating point types. However, while they may, for example, take two FP16 operands, they may accumulate the 
+result in a 32 bit buffer. Examples of this are the Tensor Cores in Nvidia's Turing architecture. 
+The final knob we give is a control over how operations accumulate their result. For this, we have 
+a function, which maps operation types like `conv2d` to an accumulation datatype as well as an output 
+datatype. The output datatype is the type other operations down the line will likely ingest from the previous
+calculation while the accumulation datatype describes the size of buffer where the results are initially
+stored. For NVidia's tensor cores for example many operations accumulate in FP32 but have an output datatype
+of FP16. The default implementation will follow this guideline closely and will by default have all 
+operations output FP16 and accumulate in FP32 only if TVM supports mixed datatypes for that particular
+operation.
+
+# Reference-level explanation
+[reference-level-explanation]: #reference-level-explanation
+
+See [previous discussion thread](https://discuss.tvm.apache.org/t/rfc-relay-fp32-fp16-model-support/9994).
+
+As some have noticed the design can be simplified to a single pass where casting is determined by
+running type inference on mutated nodes. With a post-order traversal we can then check if we need to 
+cast arguments/propagate color.
+
+Part of the associated RFC issue will also be dedicated to creating a tutorial on how to control
+the conversion of ops via the Python interface. Furthermore, some work will be done in benchmarking
+the performance gains from the pass.
+
+# Drawbacks
+[drawbacks]: #drawbacks
+
+If this is not useful, we are just adding an additional pass which will do nothing. Furthermore we 
+will have to make sure it works on a wide range of models or people will be very mad at TVM.
+
+This might not be useful if mixed precision training becomes super popular in the future in which 
+case most models might be in a reduced precision floating point form already.
+
+It also might not be useful if integer quantization becomes super popular, though it may be possible
+to mix integer quantization and mixed floating precision techniques. Floating point does have 
+several advantages still over integer quantization including simplicity and the fact that some 
+operators like `sin` and `erf` are still designed in hardware with floating point in mind.
+
+# Rationale and alternatives
+[rationale-and-alternatives]: #rationale-and-alternatives
+
+- Why is this design the best in the space of possible designs?
+
+Other alternatives require a lot more work and changes and could probably be considered future goals of TVM.
+This includes automatic mixed precision training.
+
+- What other designs have been considered and what is the rationale for not choosing them?
+
+We can support automatic mixed precision retraining though that is a much, much larger future goal. It's
+good to have this in the meantime.
+
+- What is the impact of not doing this?
+
+TVM is not the best tool for making models go fast as we leave a lot of free speedup on the table.
+
+# Prior art
+[prior-art]: #prior-art

Review comment:
       It's from tracing the PyTorch code, so I don't have an article talking about this. The basic idea is that since PyTorch uses an interpreter to execute the model, it dynamically casts an FP32 tensor to FP16 when the op it is about to execute requires FP16 inputs. Accordingly, to avoid casting a tensor from FP32 to FP16 multiple times, it has a tensor cache for the cast tensors, so that they can be reused directly in the future.
   
   With the above content it should be clear why TVM should adopt the XLA-like approach instead of PyTorch's: TVM is more compile-oriented and we are able to embrace more graph-level optimizations.
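   A minimal, purely illustrative sketch of that dynamic cast-plus-cache idea (the tensor objects and helper names below are hypothetical, not PyTorch's actual AMP code):
   
   ```
   class CastCache:
       def __init__(self):
           self._cache = {}  # id(fp32 tensor) -> its fp16 copy
   
       def to_fp16(self, tensor):
           key = id(tensor)
           if key not in self._cache:
               self._cache[key] = tensor.astype("float16")  # cast only once
           return self._cache[key]
   
   def run_op(op, inputs, cache, wants_fp16):
       # Ops that benefit from FP16 receive cached FP16 copies of their inputs;
       # everything else sees the original FP32 tensors.
       if wants_fp16(op):
           inputs = [cache.to_fp16(t) for t in inputs]
       return op(*inputs)
   ```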
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm-rfcs] comaniac commented on pull request #6: Automatic Mixed Precision Pass RFC

Posted by GitBox <gi...@apache.org>.
comaniac commented on pull request #6:
URL: https://github.com/apache/tvm-rfcs/pull/6#issuecomment-857889774


   Thanks for the RFC. I have two questions:
   1. How to mark/set the color (i.e., attribute) of every operator?
   2. It seems to me that if we register a casting checker instead of just a label (color), then we can simplify the algorithm a lot. Taking the case `A(green) - B(gray) - C(green)` as an example, if we could register a casting rule for B as follows, then we just need one traversal to know whether we need casts around B:
   
       ```
       def amp_B(expr, args):
           # Decide B's dtype from its first argument: keep FP16 only if the
           # producer already emits FP16; otherwise fall back to FP32.
           a = args[0]
           if a.dtype == "float16":
               return "float16"
           return "float32"
       ```
   
       After all, we only need the previous nodes to determine 1) whether to use the FP16 implementation, and 2) whether to insert casts. It seems to me that this pass is similar to the layout conversion pass, which uses one traversal to finish everything, so the same might be possible for AMP too.
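       A self-contained toy sketch of that one-traversal idea (the `Node` type, color table, and helpers below are hypothetical illustrations, not the proposed pass's actual code):

       ```
       from dataclasses import dataclass, field
       from typing import List

       GREEN, GRAY, RED = "green", "gray", "red"
       COLORS = {"conv2d": GREEN, "add": GRAY, "exp": RED}  # toy color table

       @dataclass
       class Node:
           op: str                        # "input", "cast", or an operator name
           args: List["Node"] = field(default_factory=list)
           dtype: str = "float32"

       def cast(node, dtype):
           # Insert a cast node only when the dtype actually changes.
           return node if node.dtype == dtype else Node("cast", [node], dtype)

       def rewrite(node):
           if node.op == "input":
               return node
           args = [rewrite(a) for a in node.args]    # post-order: children first
           color = COLORS.get(node.op, GRAY)
           if color == GREEN:
               target = "float16"                    # always convert Green ops
           elif color == RED:
               target = "float32"                    # always restore FP32 for Red ops
           else:                                     # Gray: follow the inputs
               target = "float16" if all(a.dtype == "float16" for a in args) else "float32"
           return Node(node.op, [cast(a, target) for a in args], target)
       ```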
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm-rfcs] AndrewZhaoLuo commented on pull request #6: [RFC] [Relay] Automatic Mixed Precision Pass

Posted by GitBox <gi...@apache.org>.
AndrewZhaoLuo commented on pull request #6:
URL: https://github.com/apache/tvm-rfcs/pull/6#issuecomment-861736780


   So the associated PR is getting closer to a mergeable state. Is this RFC ready to merge?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm-rfcs] gayatripk1 removed a comment on pull request #6: [RFC] [Relay] Automatic Mixed Precision Pass

Posted by GitBox <gi...@apache.org>.
gayatripk1 removed a comment on pull request #6:
URL: https://github.com/apache/tvm-rfcs/pull/6#issuecomment-1024374936


   @AndrewZhaoLuo, when we enable mixed precision arithmetic, the output types of a model depend on the last operation in the Relay IR, i.e., if it is a convolution the output will be float16, for argmax it will be int32, and for softmax it will be float32. So, is this expected? Or should the nodes that are marked as output nodes be treated differently, so that they return the expected out_type regardless of the mixed precision pass? Please let me know your thoughts on this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm-rfcs] gayatripk1 commented on pull request #6: [RFC] [Relay] Automatic Mixed Precision Pass

Posted by GitBox <gi...@apache.org>.
gayatripk1 commented on pull request #6:
URL: https://github.com/apache/tvm-rfcs/pull/6#issuecomment-1024374936


   @AndrewZhaoLuo, when we enable mixed precision arithmetic, the output types of a model depend on the last operation in the Relay IR, i.e., if it is a convolution the output will be float16, for argmax it will be int32, and for softmax it will be float32. So, is this expected? Or should the nodes that are marked as output nodes be treated differently, so that they return the expected out_type regardless of the mixed precision pass? Please let me know your thoughts on this.
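   One possible workaround, sketched under the assumption that the pass is exposed as `relay.transform.ToMixedPrecision` and that the model has a single floating point output (tuple outputs and integer outputs such as argmax would need extra handling), is to cast the final output back after conversion:
   
   ```
   from tvm import relay
   
   def amp_with_fp32_output(mod):
       mod = relay.transform.InferType()(mod)
       mod = relay.transform.ToMixedPrecision("float16")(mod)
       main = mod["main"]
       # Restore the original FP32 output dtype so callers keep the old signature.
       new_body = relay.cast(main.body, "float32")
       mod["main"] = relay.Function(main.params, new_body)
       return relay.transform.InferType()(mod)
   ```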


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm-rfcs] AndrewZhaoLuo commented on pull request #6: [RFC] [Relay] Automatic Mixed Precision Pass

Posted by GitBox <gi...@apache.org>.
AndrewZhaoLuo commented on pull request #6:
URL: https://github.com/apache/tvm-rfcs/pull/6#issuecomment-893004650


   Going to get to this tomorrow 😬. Promise 🤞


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm-rfcs] AndrewZhaoLuo commented on a change in pull request #6: [RFC] [Relay] Automatic Mixed Precision Pass

Posted by GitBox <gi...@apache.org>.
AndrewZhaoLuo commented on a change in pull request #6:
URL: https://github.com/apache/tvm-rfcs/pull/6#discussion_r690674905



##########
File path: rfcs/0001-AMP_pass.md
##########
@@ -0,0 +1,137 @@
+- Feature Name: Automatic Mixed Precision Pass
+- Start Date: 2021-06-08 
+- RFC PR: TODO
+- GitHub Issue: TODO
+
+# Summary
+[summary]: #summary
+
+Many pieces of hardware support operations not only on 32 bit floating point, but also 16 bit floating point. 
+These 16 bit operations typically have higher theoretical throughput and use less memory bandwidth.
+As a result, we can see significant speedups from replacing normal 32 bit operations with 16 bit analogs. 
+Surprisingly, for many operations this has little effect on the results, though some care must be taken when changing 
+operations. Some 16 bit floating point operations, such as `exp` and `log`, are considered less safe 
+due to loss of [numerical precision](https://on-demand.gputechconf.com/gtcdc/2019/pdf/dc91247-automatic-mixed-precision-in-tensorflow.pdf). 
+In general for a function `f`, if `|f(x)| >> |x|` for expected 
+ranges of input we probably do not want to use the 16 bit floating point versions.
+
+This feature will be a relay pass which automatically converts a 32 bit floating point model into a reduced bit 
+floating point analog. For the initial pass IEEE's 16 bit floating point will be targeted though future support
+for bfloat16 should be kept in mind.
+
+# Motivation
+[motivation]: #motivation
+
+Many machine learning models can move significant portions of their computational graphs into the FP16 space 
+without significant loss of accuracy. For many pieces of hardware this also comes with a boost in speed. In 
+the past utilizing FP16 in mixed precision training saw significant [increases in convergence speed](https://pytorch.org/blog/accelerating-training-on-nvidia-gpus-with-pytorch-automatic-mixed-precision/). 
+
+We should expect similar increases for inference. This speed increase without accuracy loss is highly desirable
+for many users.
+
+# Guide-level explanation
+[guide-level-explanation]: #guide-level-explanation
+
+Operations are partitioned into colors denoted "Green", "Red", and "Gray", which represent the benefit 
+of using a reduced floating point version of the operation. "Green" operations are compute intensive
+and almost always see hardware memory and latency savings by utilizing a reduced floating point form.
+Examples of these operations are matrix multiplies and convolutions. "Gray" operations see little to 
+no savings in using reduced floating point forms -- at least not enough to justify the overhead of 
+casting values back and forth from FP32. "Red" operations meanwhile are operations we do not want to 
+use reduced floating point forms on, usually due to numerical precision reasons.
+
+In general we always want to insert casts into reduced floating point space for "Green" operations, 
+are fine with transforming "Gray" operations into reduced floating point space if their inputs are already
+in that form, and want to explicitly cast back into full floating point space for "Red" operations. 
+Each operation will be placed into one of these lists via a "coloring" function which takes in Relay `CallNodes`
+and returns a color. For example, we might have a function which colors only a convolution as "Green" if it 
+has a large enough kernel and "Gray" otherwise. For the default implementation we will keep things simple
+however and do something like place all convolutions in the "Green" list, all element-wise operations in 
+the "Gray" list, and so on. Still, the code will be designed to be easily extensible via overwriting 
+this "coloring" function.
+
+The final variable we must keep in mind is the fact that some hardware platforms can operate on reduced
+floating point types. However, while they may, for example, take two FP16 operands, they may accumulate the 
+result in a 32 bit buffer. Examples of this are the Tensor Cores in Nvidia's Turing architecture. 
+The final knob we give is a control over how operations accumulate their result. For this, we have 
+a function, which maps operation types like `conv2d` to an accumulation datatype as well as an output 
+datatype. The output datatype is the type other operations down the line will likely ingest from the previous
+calculation while the accumulation datatype describes the size of buffer where the results are initially
+stored. For NVidia's tensor cores for example many operations accumulate in FP32 but have an output datatype
+of FP16. The default implementation will follow this guideline closely and will by default have all 
+operations output FP16 and accumulate in FP32 only if TVM supports mixed datatypes for that particular
+operation.
+
+# Reference-level explanation
+[reference-level-explanation]: #reference-level-explanation
+
+See [previous discussion thread](https://discuss.tvm.apache.org/t/rfc-relay-fp32-fp16-model-support/9994).
+
+As some have noticed the design can be simplified to a single pass where casting is determined by
+running type inference on mutated nodes. With a post-order traversal we can then check if we need to 
+cast arguments/propagate color.
+
+Part of the associated RFC issue will also be dedicated to creating a tutorial on how to control
+the conversion of ops via the Python interface. Furthermore, some work will be done in benchmarking
+the performance gains from the pass.
+
+# Drawbacks
+[drawbacks]: #drawbacks
+
+If this is not useful, we are just adding an additional pass which will do nothing. Furthermore we 
+will have to make sure it works on a wide range of models or people will be very mad at TVM.
+
+This might not be useful if mixed precision training becomes super popular in the future in which 
+case most models might be in a reduced precision floating point form already.
+
+It also might not be useful if integer quantization becomes super popular, though it may be possible
+to mix integer quantization and mixed floating precision techniques. Floating point does have 
+several advantages still over integer quantization including simplicity and the fact that some 
+operators like `sin` and `erf` are still designed in hardware with floating point in mind.
+
+# Rationale and alternatives
+[rationale-and-alternatives]: #rationale-and-alternatives
+
+- Why is this design the best in the space of possible designs?
+
+Other alternatives require a lot more work and changes and could probably be considered future goals of TVM.
+This includes automatic mixed precision training.
+
+- What other designs have been considered and what is the rationale for not choosing them?
+
+We can support automatic mixed precision retraining though that is a much, much larger future goal. It's
+good to have this in the meantime.

Review comment:
       Done. Please let me know if this is sufficient. Don't have the best background on some of this stuff.

##########
File path: rfcs/0001-AMP_pass.md
##########
@@ -0,0 +1,137 @@
+- Feature Name: Automatic Mixed Precision Pass
+- Start Date: 2021-06-08 
+- RFC PR: TODO
+- GitHub Issue: TODO
+
+# Summary
+[summary]: #summary
+
+Many pieces of hardware support operations not only on 32 bit floating point, but also 16 bit floating point. 
+These 16 bit operations typically have higher theoretical throughput and use less memory bandwidth.
+As a result, we can see significant speedups from replacing normal 32 bit operations with 16 bit analogs. 
+Surprisingly, for many operations this has little effect on the results, though some care must be taken when changing 
+operations. Some 16 bit floating point operations, such as `exp` and `log`, are considered less safe 
+due to loss of [numerical precision](https://on-demand.gputechconf.com/gtcdc/2019/pdf/dc91247-automatic-mixed-precision-in-tensorflow.pdf). 
+In general for a function `f`, if `|f(x)| >> |x|` for expected 
+ranges of input we probably do not want to use the 16 bit floating point versions.
+
+This feature will be a relay pass which automatically converts a 32 bit floating point model into a reduced bit 
+floating point analog. For the initial pass IEEE's 16 bit floating point will be targeted though future support
+for bfloat16 should be kept in mind.
+
+# Motivation
+[motivation]: #motivation
+
+Many machine learning models can move significant portions of their computational graphs into the FP16 space 
+without significant loss of accuracy. For many pieces of hardware this also comes with a boost in speed. In 
+the past utilizing FP16 in mixed precision training saw significant [increases in convergence speed](https://pytorch.org/blog/accelerating-training-on-nvidia-gpus-with-pytorch-automatic-mixed-precision/). 
+
+We should expect similar increases for inference. This speed increase without accuracy loss is highly desirable
+for many users.
+
+# Guide-level explanation
+[guide-level-explanation]: #guide-level-explanation
+
+Operations are partitioned into colors denoted "Green", "Red", and "Gray", which represent the benefit 
+of using a reduced floating point version of the operation. "Green" operations are compute intensive
+and almost always see hardware memory and latency savings by utilizing a reduced floating point form.
+Examples of these operations are matrix multiplies and convolutions. "Gray" operations see little to 
+no savings in using reduced floating point forms -- at least not enough to justify the overhead of 
+casting values back and forth from FP32. "Red" operations meanwhile are operations we do not want to 
+use reduced floating point forms on, usually due to numerical precision reasons.
+
+In general we always want to insert casts into reduced floating point space for "Green" operations, 
+are fine with transforming "Gray" operations into reduced floating point space if their inputs are already
+in that form, and want to explicitly cast back into full floating point space for "Red" operations. 
+Each operation will be placed into one of these lists via a "coloring" function which takes in Relay `CallNodes`
+and returns a color. For example, we might have a function which colors only a convolution as "Green" if it 
+has a large enough kernel and "Gray" otherwise. For the default implementation we will keep things simple
+however and do something like place all convolutions in the "Green" list, all element-wise operations in 
+the "Gray" list, and so on. Still, the code will be designed to be easily extensible via overwriting 
+this "coloring" function.
+
+The final variable we must keep in mind is the fact that some hardware platforms can operate on reduced
+floating point types. However, while they may, for example, take two FP16 operands, they may accumulate the 
+result in a 32 bit buffer. Examples of this are the Tensor Cores in Nvidia's Turing architecture. 
+The final knob we give is a control over how operations accumulate their result. For this, we have 
+a function, which maps operation types like `conv2d` to an accumulation datatype as well as an output 
+datatype. The output datatype is the type other operations down the line will likely ingest from the previous
+calculation while the accumulation datatype describes the size of buffer where the results are initially
+stored. For NVidia's tensor cores for example many operations accumulate in FP32 but have an output datatype
+of FP16. The default implementation will follow this guideline closely and will by default have all 
+operations output FP16 and accumulate in FP32 only if TVM supports mixed datatypes for that particular
+operation.
+
+# Reference-level explanation
+[reference-level-explanation]: #reference-level-explanation
+
+See [previous discussion thread](https://discuss.tvm.apache.org/t/rfc-relay-fp32-fp16-model-support/9994).
+
+As some have noticed the design can be simplified to a single pass where casting is determined by
+running type inference on mutated nodes. With a post-order traversal we can then check if we need to 
+cast arguments/propagate color.
+
+Part of the associated RFC issue will also be dedicated to creating a tutorial on how to control
+the conversion of ops via the Python interface. Furthermore, some work will be done in benchmarking
+the performance gains from the pass.
+
+# Drawbacks
+[drawbacks]: #drawbacks
+
+If this is not useful, we are just adding an additional pass which will do nothing. Furthermore we 
+will have to make sure it works on a wide range of models or people will be very mad at TVM.
+
+This might not be useful if mixed precision training becomes super popular in the future in which 
+case most models might be in a reduced precision floating point form already.
+
+It also might not be useful if integer quantization becomes super popular, though it may be possible
+to mix integer quantization and mixed floating precision techniques. Floating point does have 
+several advantages still over integer quantization including simplicity and the fact that some 
+operators like `sin` and `erf` are still designed in hardware with floating point in mind.
+
+# Rationale and alternatives
+[rationale-and-alternatives]: #rationale-and-alternatives
+
+- Why is this design the best in the space of possible designs?
+
+Other alternatives require a lot more work and changes and could probably be considered future goals of TVM.
+This includes automatic mixed precision training.
+
+- What other designs have been considered and what is the rationale for not choosing them?
+
+We can support automatic mixed precision retraining though that is a much, much larger future goal. It's
+good to have this in the meantime.
+
+- What is the impact of not doing this?
+
+TVM is not the best tool for making models go fast as we leave a lot of free speedup on the table.
+
+# Prior art
+[prior-art]: #prior-art

Review comment:
       I am not familiar with this and cannot find a good article. Do you have a link?

##########
File path: rfcs/0001-AMP_pass.md
##########
@@ -0,0 +1,137 @@
+- Feature Name: Automatic Mixed Precision Pass
+- Start Date: 2021-06-08 
+- RFC PR: TODO
+- GitHub Issue: TODO
+
+# Summary
+[summary]: #summary
+
+Many pieces of hardware support operations not only on 32 bit floating point, but also 16 bit floating point. 
+These 16 bit operations typically have higher theoretical throughput and use less memory bandwidth.
+As a result, we can see significant speedups from replacing normal 32 bit operations with 16 bit analogs. 
+Surprisingly, for many operations this has little effect on the results, though some care must be taken when changing 
+operations. Some 16 bit floating point operations, such as `exp` and `log`, are considered less safe 
+due to loss of [numerical precision](https://on-demand.gputechconf.com/gtcdc/2019/pdf/dc91247-automatic-mixed-precision-in-tensorflow.pdf). 
+In general for a function `f`, if `|f(x)| >> |x|` for expected 
+ranges of input we probably do not want to use the 16 bit floating point versions.
+
+This feature will be a relay pass which automatically converts a 32 bit floating point model into a reduced bit 
+floating point analog. For the initial pass IEEE's 16 bit floating point will be targeted though future support
+for bfloat16 should be kept in mind.
+
+# Motivation
+[motivation]: #motivation
+
+Many machine learning models can move significant portions of their computational graphs into the FP16 space 
+without significant loss of accuracy. For many pieces of hardware this also comes with a boost in speed. In 
+the past utilizing FP16 in mixed precision training saw significant [increases in convergence speed](https://pytorch.org/blog/accelerating-training-on-nvidia-gpus-with-pytorch-automatic-mixed-precision/). 
+
+We should expect similar increases for inference. This speed increase without accuracy loss is highly desirable
+for many users.
+
+# Guide-level explanation
+[guide-level-explanation]: #guide-level-explanation
+
+Operations are partitioned into colors denoted "Green", "Red", and "Gray", which represent the benefit 
+of using a reduced floating point version of the operation. "Green" operations are compute intensive
+and almost always see hardware memory and latency savings by utilizing a reduced floating point form.
+Examples of these operations are matrix multiplies and convolutions. "Gray" operations see little to 
+no savings in using reduced floating point forms -- at least not enough to justify the overhead of 
+casting values back and forth from FP32. "Red" operations meanwhile are operations we do not want to 
+use reduced floating point forms on, usually due to numerical precision reasons.
+
+In general we always want to insert casts into reduced floating point space for "Green" operations, 
+are fine with transforming "Gray" operations into reduced floating point space if their inputs are already
+in that form, and want to explicitly cast back into full floating point space for "Red" operations. 
+Each operation will be placed into one of these lists via a "coloring" function which takes in Relay `CallNodes`
+and returns a color. For example, we might have a function which colors only a convolution as "Green" if it 
+has a large enough kernel and "Gray" otherwise. For the default implementation we will keep things simple
+however and do something like place all convolutions in the "Green" list, all element-wise operations in 
+the "Gray" list, and so on. Still, the code will be designed to be easily extensible via overwriting 
+this "coloring" function.
+
+The final variable we must keep in mind is the fact that some hardware platforms can operate on reduced
+floating point types. However, while they may, for example, take two FP16 operands, they may accumulate the 
+result in a 32 bit buffer. Examples of this are the Tensor Cores in Nvidia's Turing architecture. 
+The final knob we give is a control over how operations accumulate their result. For this, we have 
+a function, which maps operation types like `conv2d` to an accumulation datatype as well as an output 
+datatype. The output datatype is the type other operations down the line will likely ingest from the previous
+calculation while the accumulation datatype describes the size of buffer where the results are initially
+stored. For NVidia's tensor cores for example many operations accumulate in FP32 but have an output datatype
+of FP16. The default implementation will follow this guideline closely and will by default have all 
+operations output FP16 and accumulate in FP32 only if TVM supports mixed datatypes for that particular
+operation.
+
+# Reference-level explanation
+[reference-level-explanation]: #reference-level-explanation
+
+See [previous discussion thread](https://discuss.tvm.apache.org/t/rfc-relay-fp32-fp16-model-support/9994).
+
+As some have noticed the design can be simplified to a single pass where casting is determined by
+running type inference on mutated nodes. With a post-order traversal we can then check if we need to 
+cast arguments/propagate color.
+
+Part of the associated RFC issue will also be dedicated to creating a tutorial on how to control
+the conversion of ops via the Python interface. Furthermore, some work will be done in benchmarking
+the performance gains from the pass.
+
+# Drawbacks
+[drawbacks]: #drawbacks
+
+If this is not useful, we are just adding an additional pass which will do nothing. Furthermore we 
+will have to make sure it works on a wide range of models or people will be very mad at TVM.
+
+This might not be useful if mixed precision training becomes super popular in the future in which 
+case most models might be in a reduced precision floating point form already.
+
+It also might not be useful if integer quantization becomes super popular, though it may be possible
+to mix integer quantization and mixed floating precision techniques. Floating point does have 
+several advantages still over integer quantization including simplicity and the fact that some 
+operators like `sin` and `erf` are still designed in hardware with floating point in mind.
+
+# Rationale and alternatives
+[rationale-and-alternatives]: #rationale-and-alternatives
+
+- Why is this design the best in the space of possible designs?
+
+Other alternatives require a lot more work and changes and could probably be considered future goals of TVM.
+This includes automatic mixed precision training.
+
+- What other designs have been considered and what is the rationale for not choosing them?
+
+We can support automatic mixed precision retraining though that is a much, much larger future goal. It's
+good to have this in the meantime.
+
+- What is the impact of not doing this?
+
+TVM is not the best tool for making models go fast as we leave a lot of free speedup on the table.
+
+# Prior art
+[prior-art]: #prior-art

Review comment:
       I am not familiar with this and cannot find this concept. Do you have a link?

##########
File path: rfcs/0001-AMP_pass.md
##########
@@ -0,0 +1,137 @@
+- Feature Name: Automatic Mixed Precision Pass
+- Start Date: 2021-06-08 
+- RFC PR: TODO
+- GitHub Issue: TODO
+
+# Summary
+[summary]: #summary
+
+Many pieces of hardware support operations not only on 32 bit floating point, but also 16 bit floating point. 
+These 16 bit operations typically have higher theoretical throughput and use less memory bandwidth.
+As a result, we can see significant speedups from replacing normal 32 bit operations with 16 bit analogs. 
+Surprisingly, for many operations this has little effect on the results, though some care must be taken when changing 
+operations. Some 16 bit floating point operations, such as `exp` and `log`, are considered less safe 
+due to loss of [numerical precision](https://on-demand.gputechconf.com/gtcdc/2019/pdf/dc91247-automatic-mixed-precision-in-tensorflow.pdf). 
+In general for a function `f`, if `|f(x)| >> |x|` for expected 
+ranges of input we probably do not want to use the 16 bit floating point versions.
+
+This feature will be a relay pass which automatically converts a 32 bit floating point model into a reduced bit 
+floating point analog. For the initial pass IEEE's 16 bit floating point will be targeted though future support
+for bfloat16 should be kept in mind.
+
+# Motivation
+[motivation]: #motivation
+
+Many machine learning models can move significant portions of their computational graphs into the FP16 space 
+without significant loss of accuracy. For many pieces of hardware this also comes with a boost in speed. In 
+the past utilizing FP16 in mixed precision training saw significant [increases in convergence speed](https://pytorch.org/blog/accelerating-training-on-nvidia-gpus-with-pytorch-automatic-mixed-precision/). 
+
+We should expect similar increases for inference. This speed increase without accuracy loss is highly desirable
+for many users.
+
+# Guide-level explanation
+[guide-level-explanation]: #guide-level-explanation
+
+Operations are partitioned into colors denoted "Green", "Red", and "Gray", which represent the benefit 
+of using a reduced floating point version of the operation. "Green" operations are compute intensive
+and almost always see hardware memory and latency savings by utilizing a reduced floating point form.
+Examples of these operations are matrix multiplies and convolutions. "Gray" operations see little to 
+no savings in using reduced floating point forms -- at least not enough to justify the overhead of 
+casting values back and forth from FP32. "Red" operations meanwhile are operations we do not want to 
+use reduced floating point forms on, usually due to numerical precision reasons.
+
+In general we always want to insert casts into reduced floating point space for "Green" operations, 
+are fine with transforming "Gray" operations into reduced floating point space if their inputs are already
+in that form, and want to explicitly cast back into full floating point space for "Red" operations. 
+Each operation will be placed into one of these lists via a "coloring" function which takes in Relay `CallNodes`
+and returns a color. For example, we might have a function which colors only a convolution as "Green" if it 
+has a large enough kernel and "Gray" otherwise. For the default implementation we will keep things simple
+however and do something like place all convolutions in the "Green" list, all element-wise operations in 
+the "Gray" list, and so on. Still, the code will be designed to be easily extensible via overwriting 
+this "coloring" function.
+
+The final variable we must keep in mind is the fact that some hardware platforms can operate on reduced
+floating point types. However, while they may, for example, take two FP16 operands, they may accumulate the 
+result in a 32 bit buffer. Examples of this are the Tensor Cores in Nvidia's Turing architecture. 
+The final knob we give is a control over how operations accumulate their result. For this, we have 
+a function, which maps operation types like `conv2d` to an accumulation datatype as well as an output 
+datatype. The output datatype is the type other operations down the line will likely ingest from the previous
+calculation while the accumulation datatype describes the size of buffer where the results are initially
+stored. For NVidia's tensor cores for example many operations accumulate in FP32 but have an output datatype
+of FP16. The default implementation will follow this guideline closely and will by default have all 
+operations output FP16 and accumulate in FP32 only if TVM supports mixed datatypes for that particular
+operation.
+
+# Reference-level explanation
+[reference-level-explanation]: #reference-level-explanation

Review comment:
       Hey all, in the case where the hardware does not have an instruction for accumulating FP16 results in FP32, the user has an interface to turn off accumulating in FP32. 
   
   Right now the default is designed with NVidia GPUs in mind, but example pass settings for other targets can easily be made.
   
   As for the representation being suitable for your device, do you have any specific concerns? As long as the codegen path to the device is good and clear, I do not think the representation will be a problem.
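   For illustration, the accumulation/output datatype knob can be thought of as a per-op table like the hypothetical helper below (the function name and op set are made up for this sketch; they are not the pass's actual registration API):
   
   ```
   def accumulation_dtypes(op_name, allow_fp32_accumulation=True):
       """Return (accumulation_dtype, output_dtype) for an operator."""
       fp32_accumulating_ops = {"nn.conv2d", "nn.dense", "nn.batch_matmul"}
       if allow_fp32_accumulation and op_name in fp32_accumulating_ops:
           # Tensor-Core style: multiply in FP16, accumulate into an FP32 buffer,
           # and hand an FP16 result to downstream ops.
           return ("float32", "float16")
       # Hardware without mixed accumulation keeps everything in FP16.
       return ("float16", "float16")
   ```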
   

##########
File path: rfcs/0001-AMP_pass.md
##########
@@ -0,0 +1,137 @@
+- Feature Name: Automatic Mixed Precision Pass
+- Start Date: 2021-06-08 
+- RFC PR: TODO
+- GitHub Issue: TODO
+
+# Summary
+[summary]: #summary
+
+Many pieces of hardware support operations not only on 32 bit floating point, but also 16 bit floating point. 
+These 16 bit operations typically have higher theoretical throughput and use less memory bandwidth.
+As a result, we can see significant speedups from replacing normal 32 bit operations with 16 bit analogs. 
+Surprisingly, for many operations this has little effect on the results, though some care must be taken when changing 
+operations. Some 16 bit floating point operations, such as `exp` and `log`, are considered less safe 
+due to loss of [numerical precision](https://on-demand.gputechconf.com/gtcdc/2019/pdf/dc91247-automatic-mixed-precision-in-tensorflow.pdf). 
+In general for a function `f`, if `|f(x)| >> |x|` for expected 
+ranges of input we probably do not want to use the 16 bit floating point versions.
+
+This feature will be a relay pass which automatically converts a 32 bit floating point model into a reduced bit 
+floating point analog. For the initial pass IEEE's 16 bit floating point will be targeted though future support
+for bfloat16 should be kept in mind.
+
+# Motivation
+[motivation]: #motivation
+
+Many machine learning models can move significant portions of their computational graphs into the FP16 space 
+without significant loss of accuracy. For many pieces of hardware this also comes with a boost in speed. In 
+the past utilizing FP16 in mixed precision training saw significant [increases in convergence speed](https://pytorch.org/blog/accelerating-training-on-nvidia-gpus-with-pytorch-automatic-mixed-precision/). 
+
+We should expect similar increases for inference. This speed increase without accuracy loss is highly desirable
+for many users.
+
+# Guide-level explanation
+[guide-level-explanation]: #guide-level-explanation
+
+Operations are partitioned into colors denoted "Green", "Red", and "Gray", which represent the benefit 
+of using a reduced floating point version of the operation. "Green" operations are compute intensive
+and almost always see hardware memory and latency savings by utilizing a reduced floating point form.
+Examples of these operations are matrix multiplies and convolutions. "Gray" operations see little to 
+no savings in using reduced floating point forms -- at least not enough to justify the overhead of 
+casting values back and forth from FP32. "Red" operations meanwhile are operations we do not want to 
+use reduced floating point forms on, usually due to numerical precision reasons.
+
+In general we always want to insert casts into reduced floating point space for "Green" operations, 
+are fine with transforming "Gray" operations into reduced floating point space if their inputs are already
+in that form, and want to explicitly cast back into full floating point space for "Red" operations. 
+Each operation will be placed into one of these lists via a "coloring" function which takes in Relay `CallNodes`
+and returns a color. For example, we might have a function which colors only a convolution as "Green" if it 
+has a large enough kernel and "Gray" otherwise. For the default implementation we will keep things simple
+however and do something like place all convolutions in the "Green" list, all element-wise operations in 
+the "Gray" list, and so on. Still, the code will be designed to be easily extensible via overwriting 
+this "coloring" function.
+
+The final variable we must keep in mind is the fact that some hardware platforms can operate on reduced
+floating point types. However, while they may, for example, take two FP16 operands, they may accumulate the 
+result in a 32 bit buffer. Examples of this are the Tensor Cores in Nvidia's Turing architecture. 
+The final knob we give is a control over how operations accumulate their result. For this, we have 
+a function, which maps operation types like `conv2d` to an accumulation datatype as well as an output 
+datatype. The output datatype is the type other operations down the line will likely ingest from the previous
+calculation while the accumulation datatype describes the size of buffer where the results are initially
+stored. For NVidia's tensor cores for example many operations accumulate in FP32 but have an output datatype
+of FP16. The default implementation will follow this guideline closely and will by default have all 
+operations output FP16 and accumulate in FP32 only if TVM supports mixed datatypes for that particular
+operation.
+
+# Reference-level explanation
+[reference-level-explanation]: #reference-level-explanation

Review comment:
       @jwfromm I believe we also had good results running mixed precision on M6g (graviton) instances?

##########
File path: rfcs/0001-AMP_pass.md
##########
@@ -0,0 +1,137 @@
+- Feature Name: Automatic Mixed Precision Pass
+- Start Date: 2021-06-08 
+- RFC PR: TODO
+- GitHub Issue: TODO
+
+# Summary
+[summary]: #summary
+
+Many pieces of hardware support operations not only on 32 bit floating point, but also 16 bit floating point. 
+These 16 bit operations typically have higher theoretical throughput and use less memory bandwidth.
+As a result, we can see significant speedups from replacing normal 32 bit operations with 16 bit analogs. 
+Surprisingly, for many operations this has little effect on the results, though some care must be taken when changing 
+operations. Some 16 bit floating point operations, such as `exp` and `log`, are considered less safe 
+due to loss of [numerical precision](https://on-demand.gputechconf.com/gtcdc/2019/pdf/dc91247-automatic-mixed-precision-in-tensorflow.pdf). 
+In general for a function `f`, if `|f(x)| >> |x|` for expected 
+ranges of input we probably do not want to use the 16 bit floating point versions.
+
+This feature will be a relay pass which automatically converts a 32 bit floating point model into a reduced bit 
+floating point analog. For the initial pass IEEE's 16 bit floating point will be targeted though future support
+for bfloat16 should be kept in mind.
+
+# Motivation
+[motivation]: #motivation
+
+Many machine learning models can move significant portions of their computational graphs into the FP16 space 
+without significant loss of accuracy. For many pieces of hardware this also comes with a boost in speed. In 
+the past utilizing FP16 in mixed precision training saw significant [increases in convergence speed](https://pytorch.org/blog/accelerating-training-on-nvidia-gpus-with-pytorch-automatic-mixed-precision/). 
+
+We should expect similar increases for inference. This speed increase without accuracy loss is highly desirable
+for many users.
+
+# Guide-level explanation
+[guide-level-explanation]: #guide-level-explanation
+
+Operations are partitioned into colors denoted "Green", "Red", and "Gray", which represent the benefit 
+of using a reduced floating point version of the operation. "Green" operations are compute intensive
+and almost always see hardware memory and latency savings by utilizing a reduced floating point form.
+Examples of these operations are matrix multiplies and convolutions. "Gray" operations see little to 
+no savings in using reduced floating point forms -- at least not enough to justify the overhead of 
+casting values back and forth from FP32. "Red" operations meanwhile are operations we do not want to 
+use reduced floating point forms on, usually due to numerical precision reasons.
+
+In general we always want to insert casts into reduced floating point space for "Green" operations, 
+are fine with transforming "Gray" operations into reduced floating point space if their inputs are already
+in that form, and want to explicitly cast back into full floating point space for "Red" operations. 
+Each operation will be placed into one of these lists via a "coloring" function which takes in Relay `CallNodes`
+and returns a color. For example, we might have a function which colors only a convolution as "Green" if it 
+has a large enough kernel and "Gray" otherwise. For the default implementation we will keep things simple
+however and do something like place all convolutions in the "Green" list, all element-wise operations in 
+the "Gray" list, and so on. Still, the code will be designed to be easily extensible via overwriting 
+this "coloring" function.
+
+The final variable we must keep in mind is the fact that some hardware platforms can operate on reduced
+floating point types. However, while they may, for example, take two FP16 operands, they may accumulate the 
+result in a 32 bit buffer. Examples of this are the Tensor Cores in Nvidia's Turing architecture. 
+The final knob we give is a control over how operations accumulate their result. For this, we have 
+a function, which maps operation types like `conv2d` to an accumulation datatype as well as an output 
+datatype. The output datatype is the type other operations down the line will likely ingest from the previous
+calculation while the accumulation datatype describes the size of buffer where the results are initially
+stored. For NVidia's tensor cores for example many operations accumulate in FP32 but have an output datatype
+of FP16. The default implementation will follow this guideline closely and will by default have all 
+operations output FP16 and accumulate in FP32 only if TVM supports mixed datatypes for that particular
+operation.
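+
+As a rough illustration only (the names below are placeholders for the hooks described above,
+not the final API), the coloring function and the accumulation/output datatype function might
+look like:
+
+```python
+GREEN, GRAY, RED = "green", "gray", "red"
+
+def default_color_fn(call_node):
+    """Map a Relay CallNode to a color based only on its operator."""
+    op_name = call_node.op.name
+    if op_name in ("nn.conv2d", "nn.dense", "nn.batch_matmul"):
+        return GREEN  # compute intensive, almost always worth converting
+    if op_name in ("exp", "log"):
+        return RED    # numerically sensitive, keep in FP32
+    return GRAY       # follow whatever dtype the inputs already have
+
+def default_dtype_fn(call_node):
+    """Return (accumulation_dtype, output_dtype) for a converted CallNode."""
+    if call_node.op.name in ("nn.conv2d", "nn.dense"):
+        return ("float32", "float16")  # accumulate in FP32, hand FP16 downstream
+    return ("float16", "float16")
+```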
+
+# Reference-level explanation
+[reference-level-explanation]: #reference-level-explanation

Review comment:
       @comaniac x86 is not really the target for this pass as it has poor hardware FP16 support. I'll add a list of supported targets we will aim for.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm-rfcs] MeJerry215 commented on pull request #6: [RFC] [Relay] Automatic Mixed Precision Pass

Posted by GitBox <gi...@apache.org>.
MeJerry215 commented on pull request #6:
URL: https://github.com/apache/tvm-rfcs/pull/6#issuecomment-1002905618


   @AndrewZhaoLuo will it remove the cast of the weights to float16 from the graph and store the weights as float16 when building the lib? 
   In my opinion, that would reduce the bandwidth.
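   A minimal sketch of what that could look like from the user side (the pass entry point
   `ToMixedPrecision` and the helper names are assumptions based on the implementation PR, not
   a settled API); folding constants after the conversion should bake the weight casts into
   FP16 constants so the built lib only carries FP16 weights:

   ```python
   from tvm import relay

   # mod, params: the original FP32 Relay module and its weights (assumed given).
   mod["main"] = relay.build_module.bind_params_by_name(mod["main"], params)
   mod = relay.transform.InferType()(mod)
   mod = relay.transform.ToMixedPrecision("float16")(mod)  # assumed pass name
   # FoldConstant collapses cast(fp32_weight -> fp16) into an FP16 constant,
   # so the serialized weights (and their memory bandwidth) are roughly halved.
   mod = relay.transform.FoldConstant()(mod)
   ```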


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm-rfcs] comaniac merged pull request #6: [RFC] [Relay] Automatic Mixed Precision Pass

Posted by GitBox <gi...@apache.org>.
comaniac merged pull request #6:
URL: https://github.com/apache/tvm-rfcs/pull/6


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm-rfcs] comaniac commented on a change in pull request #6: [RFC] [Relay] Automatic Mixed Precision Pass

Posted by GitBox <gi...@apache.org>.
comaniac commented on a change in pull request #6:
URL: https://github.com/apache/tvm-rfcs/pull/6#discussion_r677866702



##########
File path: rfcs/0001-AMP_pass.md
##########
@@ -0,0 +1,137 @@
+- Feature Name: Automatic Mixed Precision Pass
+- Start Date: 2021-06-08 
+- RFC PR: TODO
+- GitHub Issue: TODO
+
+# Summary
+[summary]: #summary
+
+Many pieces of hardware support operation not only on 32 bit floating point, but also 16 bit floating point. 
+These 16 bit operations typically have higher theoretical throughput and involve less use of memory bandwidth.
+As a result, we can see significant speedups from replacing normal 32 bit operations with 16 bit analogs. 
+Surprisingly, for many operations this has little effect on the results, though some care must be taken when changing 
+operations. Some 16 bit floating point operations such as `exp` and `log` for example are considered less safe 
+due to loss of [numerical precision](https://on-demand.gputechconf.com/gtcdc/2019/pdf/dc91247-automatic-mixed-precision-in-tensorflow.pdf). 
+In general for a function `f`, if `|f(x)| >> |x|` for expected 
+ranges of input we probably do not want to use the 16 bit floating point versions.
+
+This feature will be a relay pass which automatically converts a 32 bit floating point model into a reduced bit 
+floating point analog. For the initial pass, IEEE's 16 bit floating point will be targeted, though future support
+for bfloat16 should be kept in mind.
+
+# Motivation
+[motivation]: #motivation
+
+Many machine learning models can move significant portions of their computational graphs into the FP16 space 
+without significant loss of accuracy. For many pieces of hardware this also comes with a boost in speed. In 
+the past utilizing FP16 in mixed precision training saw significant [increases in convergence speed](https://pytorch.org/blog/accelerating-training-on-nvidia-gpus-with-pytorch-automatic-mixed-precision/). 
+
+We should expect similar increases for inference. This speed increase without accuracy loss is highly desirable
+for many users.
+
+# Guide-level explanation
+[guide-level-explanation]: #guide-level-explanation
+
+Operations are partitioned into colors denoted "Green", "Red", and "Gray", which represent the benefit 
+of using a reduced floating point version of the operation. "Green" operations are compute intensive
+and almost always see hardware memory and latency savings by utilizing a reduced floating point form.
+Examples of these operations are matrix multiplies and convolutions. "Gray" operations see little to 
+no savings in using reduced floating point forms -- at least not enough to justify the overhead of 
+casting values back and forth from FP32. "Red" operations meanwhile are operations we do not want to 
+use reduced floating point forms on, usually due to numerical precision reasons.
+
+In general we always want to insert casts into reduced floating point space for "Green" operations, 
+are fine with transforming "Gray" operations into reduced floating point space if their inputs are already
+in that form, and want to explicitly cast back into full floating point space for "Red" operations. 
+Each operation will be placed into one of these lists via a "coloring" function which takes in Relay `CallNodes`
+and returns a color. For example, we might have a function which colors only a convolution as "Green" if it 
+has a large enough kernel and "Gray" otherwise. For the default implementation we will keep things simple
+however and do something like place all convolutions in the "Green" list, all element-wise operations in 
+the "Gray" list, and so on. Still, the code will be designed to be easily extensible via overwriting 
+this "coloring" function.
+
+The final variable we must keep in mind is the fact that some hardware platforms can operate on reduced
+floating point types. However, while they for example may take two FP16 operands they may accumulate the 
+result in a 32 bit buffer. An example of this are the Tensor Cores in Nvidia's Turing architecture. 
+The final knob we give is a control over how operations accumulate their result. For this, we have 
+a function, which maps operation types like `conv2d` to an accumulation datatype as well as an output 
+datatype. The output datatype is the type other operations down the line will likely ingest from the previous
+calculation while the accumulation datatype describes the size of buffer where the results are initially
+stored. For NVidia's tensor cores for example many operations accumulate in FP32 but have an output datatype
+of FP16. The default implementation will follow this guideline closely and will by default have all 
+operations output FP16 and accumulate in FP32 only if TVM supports mixed datatypes for that particular
+operation.
+
+# Reference-level explanation
+[reference-level-explanation]: #reference-level-explanation

Review comment:
       Thanks for the suggestion. I agree that it would be interesting to see how AMP works on m6g instances. Maybe we can document it in the future possibilities section so that everyone interested in this feature can potentially take over after the AMP pass is stable on x86 platforms.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm-rfcs] comaniac commented on a change in pull request #6: [RFC] [Relay] Automatic Mixed Precision Pass

Posted by GitBox <gi...@apache.org>.
comaniac commented on a change in pull request #6:
URL: https://github.com/apache/tvm-rfcs/pull/6#discussion_r690766936



##########
File path: rfcs/0001-AMP_pass.md
##########
@@ -0,0 +1,137 @@
+- Feature Name: Automatic Mixed Precision Pass
+- Start Date: 2021-06-08 
+- RFC PR: TODO
+- GitHub Issue: TODO
+
+# Summary
+[summary]: #summary
+
+Many pieces of hardware support operation not only on 32 bit floating point, but also 16 bit floating point. 
+These 16 bit operations typically have higher theoretical throughput and involve less use of memory bandwidth.
+As a result, we can see significant speedups from replacing normal 32 bit operations with 16 bit analogs. 
+Surprisingly, for many operations this has little effect on the results, though some care must be taken when changing 
+operations. Some 16 bit floating point operations such as `exp` and `log` for example are considered less safe 
+due to loss of [numerical precision](https://on-demand.gputechconf.com/gtcdc/2019/pdf/dc91247-automatic-mixed-precision-in-tensorflow.pdf). 
+In general for a function `f`, if `|f(x)| >> |x|` for expected 
+ranges of input we probably do not want to use the 16 bit floating point versions.
+
+This feature will be a relay pass which automatically converts a 32 bit floating point model into a reduced bit 
+floating point analog. For the initial pass, IEEE's 16 bit floating point will be targeted, though future support
+for bfloat16 should be kept in mind.
+
+# Motivation
+[motivation]: #motivation
+
+Many machine learning models can move significant portions of their computational graphs into the FP16 space 
+without significant loss of accuracy. For many pieces of hardware this also comes with a boost in speed. In 
+the past utilizing FP16 in mixed precision training saw significant [increases in convergence speed](https://pytorch.org/blog/accelerating-training-on-nvidia-gpus-with-pytorch-automatic-mixed-precision/). 
+
+We should expect similar increases for inference. This speed increase without accuracy loss is highly desirable
+for many users.
+
+# Guide-level explanation
+[guide-level-explanation]: #guide-level-explanation
+
+Operations are partitioned into colors denoted "Green", "Red", and "Gray", which represent the benefit 
+of using a reduced floating point version of the operation. "Green" operations are compute intensive
+and almost always see hardware memory and latency savings by utilizing a reduced floating point form.
+Examples of these operations are matrix multiplies and convolutions. "Gray" operations see little to 
+no savings in using reduced floating point forms -- at least not enough to justify the overhead of 
+casting values back and forth from FP32. "Red" operations meanwhile are operations we do not want to 
+use reduced floating point forms on, usually due to numerical precision reasons.
+
+In general we always want to insert casts into reduced floating point space for "Green" operations, 
+are fine with transforming "Gray" operations into reduced floating point space if their inputs are already
+in that form, and want to explicitly cast back into full floating point space for "Red" operations. 
+Each operation will be placed into one of these lists via a "coloring" function which takes in Relay `CallNodes`
+and returns a color. For example, we might have a function which colors only a convolution as "Green" if it 
+has a large enough kernel and "Gray" otherwise. For the default implementation we will keep things simple
+however and do something like place all convolutions in the "Green" list, all element-wise operations in 
+the "Gray" list, and so on. Still, the code will be designed to be easily extensible via overwriting 
+this "coloring" function.
+
+The final variable we must keep in mind is the fact that some hardware platforms can operate on reduced
+floating point types. However, while they for example may take two FP16 operands they may accumulate the 
+result in a 32 bit buffer. An example of this are the Tensor Cores in Nvidia's Turing architecture. 
+The final knob we give is a control over how operations accumulate their result. For this, we have 
+a function, which maps operation types like `conv2d` to an accumulation datatype as well as an output 
+datatype. The output datatype is the type other operations down the line will likely ingest from the previous
+calculation while the accumulation datatype describes the size of buffer where the results are initially
+stored. For NVidia's tensor cores for example many operations accumulate in FP32 but have an output datatype
+of FP16. The default implementation will follow this guideline closely and will by default have all 
+operations output FP16 and accumulate in FP32 only if TVM supports mixed datatypes for that particular
+operation.
+
+# Reference-level explanation
+[reference-level-explanation]: #reference-level-explanation
+
+See [previous discussion thread](https://discuss.tvm.apache.org/t/rfc-relay-fp32-fp16-model-support/9994).
+
+As some have noticed the design can be simplified to a single pass where casting is determined by
+running type inference on mutated nodes. With a post-order traversal we can then check if we need to 
+cast arguments/propagate color.
+
+Part of the associated RFC issue will also be dedicated to creating a tutorial on how to control
+the conversion of ops via the Python interface. Furthermore, some work will be done in benchmarking
+the performance gains from the pass.
+
+# Drawbacks
+[drawbacks]: #drawbacks
+
+If this is not useful, we are just adding an additional pass which will do nothing. Furthermore we 
+will have to make sure it works on a wide range of models or people will be very mad at TVM.
+
+This might not be useful if mixed precision training becomes super popular in the future in which 
+case most models might be in a reduced precision floating point form already.
+
+It also might not be useful if integer quantization becomes super popular, though it may be possible
+to mix integer quantization and mixed floating precision techniques. Floating point does have 
+several advantages still over integer quantization including simplicity and the fact that some 
+operators like `sin` and `erf` are still designed in hardware with floating point in mind.
+
+# Rationale and alternatives
+[rationale-and-alternatives]: #rationale-and-alternatives
+
+- Why is this design the best in the space of possible designs?
+
+Other alternatives require a lot more work and changes and could probably be considered future goals of TVM.
+These include automatic mixed precision training.
+
+- What other designs have been considered and what is the rationale for not choosing them?
+
+We can support automatic mixed precision retraining though that is a much, much larger future goal. It's
+good to have this in the meantime.
+
+- What is the impact of not doing this?
+
+TVM is not the best tool for making models go fast as we leave a lot of free speedup on the table.
+
+# Prior art
+[prior-art]: #prior-art

Review comment:
       It's by tracing the PyTorch code, so I don't have an article talking about this. The basic idea is that since PyTorch uses an interpreter to execute the model, it dynamically casts an FP32 tensor to FP16 when the op it's going to execute requires FP16 inputs. Accordingly, to avoid casting a tensor from FP32 to FP16 multiple times, it has a tensor cache to cache the casted tensors, so that they can be directly reused in the future.
   
    With the above content it should be clear why TVM should adopt the XLA-like approach instead of PyTorch's: TVM is more compile-oriented and we are able to embrace more graph-level optimizations.
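    A toy illustration of the mechanism described above (this is not PyTorch's actual code,
    just the cast-on-demand-plus-cache idea in a few lines):

    ```python
    import numpy as np

    _fp16_cache = {}  # id(fp32 array) -> cached fp16 copy

    def to_fp16(tensor):
        """Cast an FP32 tensor to FP16 once and reuse the cached copy afterwards."""
        key = id(tensor)
        if key not in _fp16_cache:
            _fp16_cache[key] = tensor.astype(np.float16)
        return _fp16_cache[key]

    def run_op(op, *inputs, wants_fp16=False):
        """Interpreter-style dispatch: cast inputs on demand if the op prefers FP16."""
        if wants_fp16:
            inputs = tuple(to_fp16(x) for x in inputs)
        return op(*inputs)
    ```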
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm-rfcs] AndrewZhaoLuo edited a comment on pull request #6: Automatic Mixed Precision Pass RFC

Posted by GitBox <gi...@apache.org>.
AndrewZhaoLuo edited a comment on pull request #6:
URL: https://github.com/apache/tvm-rfcs/pull/6#issuecomment-857903539


   > Thanks for the RFC. I have two questions:
   > 
   > 1. How to mark/set the color (i.e., attribute) of every operator?
   > 2. It seems to me that if we register a casting checker instead of just a label (color), then we can simplify the algorithm a lot. Taking the case `A(green) - B(gray) - C(green)` as an example, if we could register a casting rule of B as follows, then we just need one traverse to know if we need cast around B:
   >    ```
   >    def amp_B(expr, args):
   >        a = args[0]
   >        if (a.dtype is float16):
   >          return fp16
   >        return fp32
   >    ```
   >    
   >    After all, we only need the previous nodes to determine 1) whether to use FP16 implementation, and 2) whether to insert casts. It seems to me that this pass is similar to the layout conversion pass, which uses one traverse to finish everything, so it might be possible for AMP too.
   
   Yep, that is correct; it is very similar to the layout conversion pass. This RFC has an initial PR here: https://github.com/apache/tvm/pull/8069.
   
   To answer your questions:
   1.  src/relay/transforms/fp32_to_fp16.h -- DefaultFP16Colorer is the default way. But the only thing we need is a callable with type CallNode*(Color). So you could write your own colorer that does arbitrary stuff when only looking at a single node at a time.
   
   2. This is functionally what is done in the PR I link.
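   A minimal Python-level sketch of such a colorer (the exact hook and registration mechanism
   live in the PR above; the color names and the kernel-size heuristic here are illustrative
   only, mirroring the example in the RFC text):

   ```python
   def my_colorer(call_node):
       """Map a single Relay CallNode to a color, looking only at that node."""
       if call_node.op.name == "nn.conv2d":
           kh, kw = call_node.attrs.kernel_size
           # Only large convolutions are worth the cast overhead in this example.
           return "green" if int(kh) * int(kw) >= 9 else "gray"
       if call_node.op.name in ("exp", "log"):
           return "red"
       return "gray"
   ```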


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm-rfcs] comaniac commented on pull request #6: Automatic Mixed Precision Pass RFC

Posted by GitBox <gi...@apache.org>.
comaniac commented on pull request #6:
URL: https://github.com/apache/tvm-rfcs/pull/6#issuecomment-857914672


   Thanks for the answers. I'll review the PR to get more implementation details.
   One more question regarding the extensibility: can this be extended easily to support bfloat16?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm-rfcs] AndrewZhaoLuo commented on a change in pull request #6: [RFC] [Relay] Automatic Mixed Precision Pass

Posted by GitBox <gi...@apache.org>.
AndrewZhaoLuo commented on a change in pull request #6:
URL: https://github.com/apache/tvm-rfcs/pull/6#discussion_r690675566



##########
File path: rfcs/0001-AMP_pass.md
##########
@@ -0,0 +1,137 @@
+- Feature Name: Automatic Mixed Precision Pass
+- Start Date: 2021-06-08 
+- RFC PR: TODO
+- GitHub Issue: TODO
+
+# Summary
+[summary]: #summary
+
+Many pieces of hardware support operation not only on 32 bit floating point, but also 16 bit floating point. 
+These 16 bit operations typically have higher theoretical throughput and involve less use of memory bandwidth.
+As a result, we can see significant speedups from replacing normal 32 bit operations with 16 bit analogs. 
+Surprisingly, for many operations this has little effect on the results, though some care must be taken when changing 
+operations. Some 16 bit floating point operations such as `exp` and `log` for example are considered less safe 
+due to loss of [numerical precision](https://on-demand.gputechconf.com/gtcdc/2019/pdf/dc91247-automatic-mixed-precision-in-tensorflow.pdf). 
+In general for a function `f`, if `|f(x)| >> |x|` for expected 
+ranges of input we probably do not want to use the 16 bit floating point versions.
+
+This feature will be a relay pass which automatically converts a 32 bit floating point model into a reduced bit 
+floating point analog. For the initial pass, IEEE's 16 bit floating point will be targeted, though future support
+for bfloat16 should be kept in mind.
+
+# Motivation
+[motivation]: #motivation
+
+Many machine learning models can move significant portions of their computational graphs into the FP16 space 
+without significant loss of accuracy. For many pieces of hardware this also comes with a boost in speed. In 
+the past utilizing FP16 in mixed precision training saw significant [increases in convergence speed](https://pytorch.org/blog/accelerating-training-on-nvidia-gpus-with-pytorch-automatic-mixed-precision/). 
+
+We should expect similar increases for inference. This speed increase without accuracy loss is highly desirable
+for many users.
+
+# Guide-level explanation
+[guide-level-explanation]: #guide-level-explanation
+
+Operations are partitioned into colors denoted "Green", "Red", and "Gray", which represent the benefit 
+of using a reduced floating point version of the operation. "Green" operations are compute intensive
+and almost always see hardware memory and latency savings by utilizing a reduced floating point form.
+Examples of these operations are matrix multiplies and convolutions. "Gray" operations see little to 
+no savings in using reduced floating point forms -- at least not enough to justify the overhead of 
+casting values back and forth from FP32. "Red" operations meanwhile are operations we do not want to 
+use reduced floating point forms on, usually due to numerical precision reasons.
+
+In general we always want to insert casts into reduced floating point space for "Green" operations, 
+are fine with transforming "Gray" operations into reduced floating point space if their inputs are already
+in that form, and want to explicitly cast back into full floating point space for "Red" operations. 
+Each operation will be placed into one of these lists via a "coloring" function which takes in Relay `CallNodes`
+and returns a color. For example, we might have a function which colors only a convolution as "Green" if it 
+has a large enough kernel and "Gray" otherwise. For the default implementation we will keep things simple
+however and do something like place all convolutions in the "Green" list, all element-wise operations in 
+the "Gray" list, and so on. Still, the code will be designed to be easily extensible via overwriting 
+this "coloring" function.
+
+The final variable we must keep in mind is the fact that some hardware platforms can operate on reduced
+floating point types. However, while they for example may take two FP16 operands they may accumulate the 
+result in a 32 bit buffer. An example of this are the Tensor Cores in Nvidia's Turing architecture. 
+The final knob we give is a control over how operations accumulate their result. For this, we have 
+a function, which maps operation types like `conv2d` to an accumulation datatype as well as an output 
+datatype. The output datatype is the type other operations down the line will likely ingest from the previous
+calculation while the accumulation datatype describes the size of buffer where the results are initially
+stored. For NVidia's tensor cores for example many operations accumulate in FP32 but have an output datatype
+of FP16. The default implementation will follow this guideline closely and will by default have all 
+operations output FP16 and accumulate in FP32 only if TVM supports mixed datatypes for that particular
+operation.
+
+# Reference-level explanation
+[reference-level-explanation]: #reference-level-explanation
+
+See [previous discussion thread](https://discuss.tvm.apache.org/t/rfc-relay-fp32-fp16-model-support/9994).
+
+As some have noticed the design can be simplified to a single pass where casting is determined by
+running type inference on mutated nodes. With a post-order traversal we can then check if we need to 
+cast arguments/propagate color.
+
+Part of the associated RFC issue will also be dedicated to creating a tutorial on how to control
+the conversion of ops via the Python interface. Furthermore, some work will be done in benchmarking
+the performance gains from the pass.
+
+# Drawbacks
+[drawbacks]: #drawbacks
+
+If this is not useful, we are just adding an additional pass which will do nothing. Furthermore we 
+will have to make sure it works on a wide range of models or people will be very mad at TVM.
+
+This might not be useful if mixed precision training becomes super popular in the future in which 
+case most models might be in a reduced precision floating point form already.
+
+It also might not be useful if integer quantization becomes super popular, though it may be possible
+to mix integer quantization and mixed floating precision techniques. Floating point does have 
+several advantages still over integer quantization including simplicity and the fact that some 
+operators like `sin` and `erf` are still designed in hardware with floating point in mind.
+
+# Rationale and alternatives
+[rationale-and-alternatives]: #rationale-and-alternatives
+
+- Why is this design the best in the space of possible designs?
+
+Other alternatives require a lot more work and changes and could probably be considered future goals of TVM.
+These include automatic mixed precision training.
+
+- What other designs have been considered and what is the rationale for not choosing them?
+
+We can support automatic mixed precision retraining though that is a much, much larger future goal. It's
+good to have this in the meantime.
+
+- What is the impact of not doing this?
+
+TVM is not the best tool for making models go fast as we leave a lot of free speedup on the table.
+
+# Prior art
+[prior-art]: #prior-art

Review comment:
       I am not familiar with this and cannot find a good article; do you have a link?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm-rfcs] AndrewZhaoLuo commented on pull request #6: [RFC] [Relay] Automatic Mixed Precision Pass

Posted by GitBox <gi...@apache.org>.
AndrewZhaoLuo commented on pull request #6:
URL: https://github.com/apache/tvm-rfcs/pull/6#issuecomment-902192178


   If there are no other objections, this will be merged on Monday.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm-rfcs] AndrewZhaoLuo commented on pull request #6: Automatic Mixed Precision Pass RFC

Posted by GitBox <gi...@apache.org>.
AndrewZhaoLuo commented on pull request #6:
URL: https://github.com/apache/tvm-rfcs/pull/6#issuecomment-857903539


   > Thanks for the RFC. I have two questions:
   > 
   > 1. How to mark/set the color (i.e., attribute) of every operator?
   > 2. It seems to me that if we register a casting checker instead of just a label (color), then we can simplify the algorithm a lot. Taking the case `A(green) - B(gray) - C(green)` as an example, if we could register a casting rule of B as follows, then we just need one traverse to know if we need cast around B:
   >    ```
   >    def amp_B(expr, args):
   >        a = args[0]
   >        if (a.dtype is float16):
   >          return fp16
   >        return fp32
   >    ```
   >    
   >    After all, we only need the previous nodes to determine 1) whether to use FP16 implementation, and 2) whether to insert casts. It seems to me that this pass is similar to the layout conversion pass, which uses one traverse to finish everything, so it might be possible for AMP too.
   
   Yep that is correct. This RFC has an initial PR here: https://github.com/apache/tvm/pull/8069.
   
   To answer your questions:
   1.  src/relay/transforms/fp32_to_fp16.h -- DefaultFP16Colorer is the default way. But the only thing we need is a callable with type CallNode*(Color). So you could write your own colorer that does arbitrary stuff when only looking at a single node at a time.
   
   2. This is functionally what is done in the PR I link.
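   A rough sketch of that single-traversal decision (simplified; `color_of` and the dtype
   bookkeeping are stand-ins for what the linked PR actually implements, not its real code):

   ```python
   from tvm import relay

   def rewrite_call(call, new_args, arg_dtypes, color_of):
       """Pick the dtype for one call during a post-order rewrite (simplified)."""
       color = color_of(call)
       if color == "green":
           wanted = "float16"  # always convert compute-heavy ops
       elif color == "red":
           wanted = "float32"  # never convert numerically sensitive ops
       else:  # gray: follow the arguments
           wanted = "float16" if all(d == "float16" for d in arg_dtypes) else "float32"
       args = [relay.cast(a, wanted) if d != wanted else a
               for a, d in zip(new_args, arg_dtypes)]
       return relay.Call(call.op, args, call.attrs, call.type_args, call.span)
   ```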


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm-rfcs] AndrewZhaoLuo commented on pull request #6: Automatic Mixed Precision Pass RFC

Posted by GitBox <gi...@apache.org>.
AndrewZhaoLuo commented on pull request #6:
URL: https://github.com/apache/tvm-rfcs/pull/6#issuecomment-857922283


   > Thanks for the answers. I'll review the PR to get more implementation details.
   > One more question regarding the extensibility: can this be extended easily to support bfloat16?
   
   It should be trivial (hope I don't eat my words). I'm not 100% sure of the support for bfloat16 in current relay ops however.
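   If the target dtype ends up being a parameter of the pass (an assumption based on the
   direction of the implementation PR, not a settled API), the switch could be as small as:

   ```python
   from tvm import relay

   mod = relay.transform.InferType()(mod)
   mod = relay.transform.ToMixedPrecision("bfloat16")(mod)  # assumed parameterized entry point
   ```

   The remaining work would then be op-level bfloat16 support in Relay, as noted above.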


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm-rfcs] comaniac commented on a change in pull request #6: [RFC] [Relay] Automatic Mixed Precision Pass

Posted by GitBox <gi...@apache.org>.
comaniac commented on a change in pull request #6:
URL: https://github.com/apache/tvm-rfcs/pull/6#discussion_r677616527



##########
File path: rfcs/0001-AMP_pass.md
##########
@@ -0,0 +1,137 @@
+- Feature Name: Automatic Mixed Precision Pass
+- Start Date: 2021-06-08 
+- RFC PR: TODO
+- GitHub Issue: TODO
+
+# Summary
+[summary]: #summary
+
+Many pieces of hardware support operation not only on 32 bit floating point, but also 16 bit floating point. 
+These 16 bit operations typically have higher theoretical throughput and involve less use of memory bandwidth.
+As a result, we can see significant speedups from replacing normal 32 bit operations with 16 bit analogs. 
+Surprisingly, for many operations this has little effect on the results, though some care must be taken when changing 
+operations. Some 16 bit floating point operations such as `exp` and `log` for example are considered less safe 
+due to loss of [numerical precision](https://on-demand.gputechconf.com/gtcdc/2019/pdf/dc91247-automatic-mixed-precision-in-tensorflow.pdf). 
+In general for a function `f`, if `|f(x)| >> |x|` for expected 
+ranges of input we probably do not want to use the 16 bit floating point versions.
+
+This feature will be a relay pass which automatically converts a 32 bit floating point model into a reduced bit 
+floating point analog. For the initial pass, IEEE's 16 bit floating point will be targeted, though future support
+for bfloat16 should be kept in mind.
+
+# Motivation
+[motivation]: #motivation
+
+Many machine learning models can move significant portions of their computational graphs into the FP16 space 
+without significant loss of accuracy. For many pieces of hardware this also comes with a boost in speed. In 
+the past utilizing FP16 in mixed precision training saw significant [increases in convergence speed](https://pytorch.org/blog/accelerating-training-on-nvidia-gpus-with-pytorch-automatic-mixed-precision/). 
+
+We should expect similar increases for inference. This speed increase without accuracy loss is highly desirable
+for many users.
+
+# Guide-level explanation
+[guide-level-explanation]: #guide-level-explanation
+
+Operations are partitioned into colors denoted "Green", "Red", and "Gray", which represent the benefit 
+of using a reduced floating point version of the operation. "Green" operations are compute intensive
+and almost always see hardware memory and latency savings by utilizing a reduced floating point form.
+Examples of these operations are matrix multiplies and convolutions. "Gray" operations see little to 
+no savings in using reduced floating point forms -- at least not enough to justify the overhead of 
+casting values back and forth from FP32. "Red" operations meanwhile are operations we do not want to 
+use reduced floating point forms on, usually due to numerical precision reasons.
+
+In general we always want to insert casts into reduced floating point space for "Green" operations, 
+are fine with transforming "Gray" operations into reduced floating point space if their inputs are already
+in that form, and want to explicitly cast back into full floating point space for "Red" operations. 
+Each operation will be placed into one of these lists via a "coloring" function which takes in Relay `CallNodes`
+and returns a color. For example, we might have a function which colors only a convolution as "Green" if it 
+has a large enough kernel and "Gray" otherwise. For the default implementation we will keep things simple
+however and do something like place all convolutions in the "Green" list, all element-wise operations in 
+the "Gray" list, and so on. Still, the code will be designed to be easily extensible via overwriting 
+this "coloring" function.
+
+The final variable we must keep in mind is the fact that some hardware platforms can operate on reduced
+floating point types. However, while they for example may take two FP16 operands they may accumulate the 
+result in a 32 bit buffer. An example of this are the Tensor Cores in Nvidia's Turing architecture. 
+The final knob we give is a control over how operations accumulate their result. For this, we have 
+a function, which maps operation types like `conv2d` to an accumulation datatype as well as an output 
+datatype. The output datatype is the type other operations down the line will likely ingest from the previous
+calculation while the accumulation datatype describes the size of buffer where the results are initially
+stored. For NVidia's tensor cores for example many operations accumulate in FP32 but have an output datatype
+of FP16. The default implementation will follow this guideline closely and will by default have all 
+operations output FP16 and accumulate in FP32 only if TVM supports mixed datatypes for that particular
+operation.
+
+# Reference-level explanation
+[reference-level-explanation]: #reference-level-explanation
+
+See [previous discussion thread](https://discuss.tvm.apache.org/t/rfc-relay-fp32-fp16-model-support/9994).
+
+As some have noticed the design can be simplified to a single pass where casting is determined by
+running type inference on mutated nodes. With a post-order traversal we can then check if we need to 
+cast arguments/propagate color.
+
+Part of the associated RFC issue will also be dedicated to creating a tutorial on how to control
+the conversion of ops via the Python interface. Furthermore, some work will be done in benchmarking
+the performance gains from the pass.
+
+# Drawbacks
+[drawbacks]: #drawbacks
+
+If this is not useful, we are just adding an additional pass which will do nothing. Furthermore we 
+will have to make sure it works on a wide range of models or people will be very mad at TVM.
+
+This might not be useful if mixed precision training becomes super popular in the future in which 
+case most models might be in a reduced precision floating point form already.
+
+It also might not be useful if integer quantization becomes super popular, though it may be possible
+to mix integer quantization and mixed floating precision techniques. Floating point does have 
+several advantages still over integer quantization including simplicity and the fact that some 
+operators like `sin` and `erf` are still designed in hardware with floating point in mind.
+
+# Rationale and alternatives
+[rationale-and-alternatives]: #rationale-and-alternatives
+
+- Why is this design the best in the space of possible designs?
+
+Other alternatives require a lot more work and changes and could probably be considered future goals of TVM.
+These include automatic mixed precision training.
+
+- What other designs have been considered and what is the rationale for not choosing them?
+
+We can support automatic mixed precision retraining though that is a much, much larger future goal. It's
+good to have this in the meantime.

Review comment:
       The answer to this question should come with a discussion of existing mechanisms used by other frameworks, such as XLA and PyTorch.

##########
File path: rfcs/0001-AMP_pass.md
##########
@@ -0,0 +1,137 @@
+- Feature Name: Automatic Mixed Precision Pass
+- Start Date: 2021-06-08 
+- RFC PR: TODO
+- GitHub Issue: TODO
+
+# Summary
+[summary]: #summary
+
+Many pieces of hardware support operation not only on 32 bit floating point, but also 16 bit floating point. 
+These 16 bit operations typically have higher theoretical throughput and involve less use of memory bandwidth.
+As a result, we can see significant speedups from replacing normal 32 bit operations with 16 bit analogs. 
+Surprisingly, for many operations this has little effect on the results, though some care must be taken when changing 
+operations. Some 16 bit floating point operations such as `exp` and `log` for example are considered less safe 
+due to loss of [numerical precision](https://on-demand.gputechconf.com/gtcdc/2019/pdf/dc91247-automatic-mixed-precision-in-tensorflow.pdf). 
+In general for a function `f`, if `|f(x)| >> |x|` for expected 
+ranges of input we probably do not want to use the 16 bit floating point versions.
+
+This feature will be a relay pass which automatically converts a 32 bit floating point model into a reduced bit 
+floating point analog. For the initial pass, IEEE's 16 bit floating point will be targeted, though future support
+for bfloat16 should be kept in mind.
+
+# Motivation
+[motivation]: #motivation
+
+Many machine learning models can move significant portions of their computational graphs into the FP16 space 
+without significant loss of accuracy. For many pieces of hardware this also comes with a boost in speed. In 
+the past utilizing FP16 in mixed precision training saw significant [increases in convergence speed](https://pytorch.org/blog/accelerating-training-on-nvidia-gpus-with-pytorch-automatic-mixed-precision/). 
+
+We should expect similar increases for inference. This speed increase without accuracy loss is highly desirable
+for many users.
+
+# Guide-level explanation
+[guide-level-explanation]: #guide-level-explanation

Review comment:
       It would be better to provide an example at the end of this section, i.e., how this pass is used and what the resulting IR looks like.

##########
File path: rfcs/0001-AMP_pass.md
##########
@@ -0,0 +1,137 @@
+- Feature Name: Automatic Mixed Precision Pass
+- Start Date: 2021-06-08 
+- RFC PR: TODO
+- GitHub Issue: TODO
+
+# Summary
+[summary]: #summary
+
+Many pieces of hardware support operation not only on 32 bit floating point, but also 16 bit floating point. 
+These 16 bit operations typically have higher theoretical throughput and involve less use of memory bandwidth.
+As a result, we can see significant speedups from replacing normal 32 bit operations with 16 bit analogs. 
+Surprisingly, for many operations this has little effect on the results, though some care must be taken when changing 
+operations. Some 16 bit floating point operations such as `exp` and `log` for example are considered less safe 
+due to loss of [numerical precision](https://on-demand.gputechconf.com/gtcdc/2019/pdf/dc91247-automatic-mixed-precision-in-tensorflow.pdf). 
+In general for a function `f`, if `|f(x)| >> |x|` for expected 
+ranges of input we probably do not want to use the 16 bit floating point versions.
+
+This feature will be a relay pass which automatically converts a 32 bit floating point model into a reduced bit 
+floating point analog. For the initial pass, IEEE's 16 bit floating point will be targeted, though future support
+for bfloat16 should be kept in mind.
+
+# Motivation
+[motivation]: #motivation
+
+Many machine learning models can move significant portions of their computational graphs into the FP16 space 
+without significant loss of accuracy. For many pieces of hardware this also comes with a boost in speed. In 
+the past utilizing FP16 in mixed precision training saw significant [increases in convergence speed](https://pytorch.org/blog/accelerating-training-on-nvidia-gpus-with-pytorch-automatic-mixed-precision/). 
+
+We should expect similar increases for inference. This speed increase without accuracy loss is highly desirable
+for many users.
+
+# Guide-level explanation
+[guide-level-explanation]: #guide-level-explanation
+
+Operations are partitioned into colors denoted "Green", "Red", and "Gray", which represent the benefit 
+of using a reduced floating point version of the operation. "Green" operations are compute intensive
+and almost always see hardware memory and latency savings by utilizing a reduced floating point form.
+Examples of these operations are matrix multiplies and convolutions. "Gray" operations see little to 
+no savings in using reduced floating point forms -- at least not enough to justify the overhead of 
+casting values back and forth from FP32. "Red" operations meanwhile are operations we do not want to 
+use reduced floating point forms on, usually due to numerical precision reasons.
+
+In general we always want to insert casts into reduced floating point space for "Green" operations, 
+are fine with transforming "Gray" operations into reduced floating point space if their inputs are already
+in that form, and want to explicitly cast back into full floating point space for "Red" operations. 
+Each operation will be placed into one of these lists via a "coloring" function which takes in Relay `CallNodes`
+and returns a color. For example, we might have a function which colors only a convolution as "Green" if it 
+has a large enough kernel and "Gray" otherwise. For the default implementation we will keep things simple
+however and do something like place all convolutions in the "Green" list, all element-wise operations in 
+the "Gray" list, and so on. Still, the code will be designed to be easily extensible via overwriting 
+this "coloring" function.
+
+The final variable we must keep in mind is the fact that some hardware platforms can operate on reduced
+floating point types. However, while they for example may take two FP16 operands they may accumulate the 
+result in a 32 bit buffer. An example of this are the Tensor Cores in Nvidia's Turing architecture. 
+The final knob we give is a control over how operations accumulate their result. For this, we have 
+a function, which maps operation types like `conv2d` to an accumulation datatype as well as an output 
+datatype. The output datatype is the type other operations down the line will likely ingest from the previous
+calculation while the accumulation datatype describes the size of buffer where the results are initially
+stored. For NVidia's tensor cores for example many operations accumulate in FP32 but have an output datatype
+of FP16. The default implementation will follow this guideline closely and will by default have all 
+operations output FP16 and accumulate in FP32 only if TVM supports mixed datatypes for that particular
+operation.
+
+# Reference-level explanation
+[reference-level-explanation]: #reference-level-explanation

Review comment:
       This section should also discuss the implementation. Specifically, 1) the interface of annotating an op with color, 2) the coloring algorithm in the pass, 3) some corner cases (i.e., ops) that need more care.

##########
File path: rfcs/0001-AMP_pass.md
##########
@@ -0,0 +1,137 @@
+- Feature Name: Automatic Mixed Precision Pass
+- Start Date: 2021-06-08 
+- RFC PR: TODO

Review comment:
       Update the RFC PR.

##########
File path: rfcs/0001-AMP_pass.md
##########
@@ -0,0 +1,137 @@
+- Feature Name: Automatic Mixed Precision Pass
+- Start Date: 2021-06-08 
+- RFC PR: TODO
+- GitHub Issue: TODO
+
+# Summary
+[summary]: #summary
+
+Many pieces of hardware support operation not only on 32 bit floating point, but also 16 bit floating point. 
+These 16 bit operations typically have higher theoretical throughput and involve less use of memory bandwidth.
+As a result, we can see significant speedups from replacing normal 32 bit operations with 16 bit analogs. 
+Surprisingly, for many operations this has little effect on the results, though some care must be taken when changing 
+operations. Some 16 bit floating point operations such as `exp` and `log` for example are considered less safe 
+due to loss of [numerical precision](https://on-demand.gputechconf.com/gtcdc/2019/pdf/dc91247-automatic-mixed-precision-in-tensorflow.pdf). 
+In general for a function `f`, if `|f(x)| >> |x|` for expected 
+ranges of input we probably do not want to use the 16 bit floating point versions.
+
+This feature will be a relay pass which automatically converts a 32 bit floating point model into a reduced bit 
+floating point analog. For the initial pass, IEEE's 16 bit floating point will be targeted, though future support
+for bfloat16 should be kept in mind.
+
+# Motivation
+[motivation]: #motivation
+
+Many machine learning models can move significant portions of their computational graphs into the FP16 space 
+without significant loss of accuracy. For many pieces of hardware this also comes with a boost in speed. In 
+the past utilizing FP16 in mixed precision training saw significant [increases in convergence speed](https://pytorch.org/blog/accelerating-training-on-nvidia-gpus-with-pytorch-automatic-mixed-precision/). 
+
+We should expect similar increases for inference. This speed increase without accuracy loss is highly desirable
+for many users.
+
+# Guide-level explanation
+[guide-level-explanation]: #guide-level-explanation
+
+Operations are partitioned into colors denoted "Green", "Red", and "Gray", which represent the benefit 
+of using a reduced floating point version of the operation. "Green" operations are compute intensive
+and almost always see hardware memory and latency savings by utilizing a reduced floating point form.
+Examples of these operations are matrix multiplies and convolutions. "Gray" operations see little to 
+no savings in using reduced floating point forms -- at least not enough to justify the overhead of 
+casting values back and forth from FP32. "Red" operations meanwhile are operations we do not want to 
+use reduced floating point forms on, usually due to numerical precision reasons.
+
+In general we always want to insert casts into reduced floating point space for "Green" operations, 
+are fine with transforming "Gray" operations into reduced floating point space if their inputs are already
+in that form, and want to explicitly cast back into full floating point space for "Red" operations. 
+Each operation will be placed into one of these lists via a "coloring" function which takes in Relay `CallNodes`
+and returns a color. For example, we might have a function which colors only a convolution as "Green" if it 
+has a large enough kernel and "Gray" otherwise. For the default implementation we will keep things simple
+however and do something like place all convolutions in the "Green" list, all element-wise operations in 
+the "Gray" list, and so on. Still, the code will be designed to be easily extensible via overwriting 
+this "coloring" function.
+
+The final variable we must keep in mind is the fact that some hardware platforms can operate on reduced
+floating point types. However, while they for example may take two FP16 operands they may accumulate the 
+result in a 32 bit buffer. An example of this are the Tensor Cores in Nvidia's Turing architecture. 
+The final knob we give is a control over how operations accumulate their result. For this, we have 
+a function, which maps operation types like `conv2d` to an accumulation datatype as well as an output 
+datatype. The output datatype is the type other operations down the line will likely ingest from the previous
+calculation while the accumulation datatype describes the size of buffer where the results are initially
+stored. For NVidia's tensor cores for example many operations accumulate in FP32 but have an output datatype
+of FP16. The default implementation will follow this guideline closely and will by default have all 
+operations output FP16 and accumulate in FP32 only if TVM supports mixed datatypes for that particular
+operation.
+
+# Reference-level explanation
+[reference-level-explanation]: #reference-level-explanation
+
+See [previous discussion thread](https://discuss.tvm.apache.org/t/rfc-relay-fp32-fp16-model-support/9994).
+
+As some have noticed the design can be simplified to a single pass where casting is determined by
+running type inference on mutated nodes. With a post-order traversal we can then check if we need to 
+cast arguments/propagate color.
+
+Part of the associated RFC issue will also be dedicated to creating a tutorial on how to control
+the conversion of ops via the Python interface. Furthermore, some work will be done in benchmarking
+the performance gains from the pass.
+
+# Drawbacks
+[drawbacks]: #drawbacks
+
+If this is not useful, we are just adding an additional pass which will do nothing. Furthermore we 
+will have to make sure it works on a wide range of models or people will be very mad at TVM.
+
+This might not be useful if mixed precision training becomes super popular in the future in which 
+case most models might be in a reduced precision floating point form already.
+
+It also might not be useful if integer quantization becomes super popular, though it may be possible
+to mix integer quantization and mixed floating precision techniques. Floating point does have 
+several advantages still over integer quantization including simplicity and the fact that some 
+operators like `sin` and `erf` are still designed in hardware with floating point in mind.
+
+# Rationale and alternatives
+[rationale-and-alternatives]: #rationale-and-alternatives
+
+- Why is this design the best in the space of possible designs?
+
+Other alternatives require a lot more work and changes and could probably be considered future goals of TVM.
+These include automatic mixed precision training.
+
+- What other designs have been considered and what is the rationale for not choosing them?
+
+We can support automatic mixed precision retraining though that is a much, much larger future goal. It's
+good to have this in the meantime.
+
+- What is the impact of not doing this?
+
+TVM is not the best tool for making models go fast as we leave a lot of free speedup on the table.
+
+# Prior art
+[prior-art]: #prior-art
+
+Many of the ideas are taken from Tensorflow's [automatic mixed precision training framework](https://on-demand.gputechconf.com/gtcdc/2019/pdf/dc91247-automatic-mixed-precision-in-tensorflow.pdf)
+and the initial "Green", "Gray", and "Red" lists are based on [similar lists](github.com/tensorflow/tensorflow/blob/v2.5.0/tensorflow/core/grappler/optimizers/auto_mixed_precision_lists.h). 
+
+# Unresolved questions
+[unresolved-questions]: #unresolved-questions
+
+- What parts of the design do you expect to resolve through the RFC process before this gets merged?
+
+We still need to make sure that the current design and knobs exposed provide extensibility to every hardware platform out there.
+
+- What parts of the design do you expect to resolve through the implementation of this feature before stabilization?
+
+Probably a lot of edge cases of operations within TVM.

Review comment:
       This is too vague. It's better to have a quantitative metric toward a stable release. For example, you can set up a benchmark with a set of models, and the goal is to make all of them work well with AMP (in terms of performance and accuracy) on both CPU and GPU.
   
   In addition, it would be better to also investigate how AutoScheduler works with AMP models. Since tuning is an important feature in TVM, the impact of AMP would be moderated if a tuned FP32 model can still run faster than an AMP model.
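    A sketch of the kind of measurement that suggestion implies (the model, input name, and
    shapes are placeholders, and `ToMixedPrecision` is assumed from the implementation PR):

    ```python
    import numpy as np
    import tvm
    from tvm import relay
    from tvm.contrib import graph_executor

    def bench_ms(mod, params, target="cuda", input_name="data", shape=(1, 3, 224, 224)):
        dev = tvm.device(target, 0)
        with tvm.transform.PassContext(opt_level=3):
            lib = relay.build(mod, target=target, params=params)
        m = graph_executor.GraphModule(lib["default"](dev))
        m.set_input(input_name, np.random.rand(*shape).astype("float32"))
        return m.module.time_evaluator("run", dev, number=10, repeat=3)().mean * 1e3

    # fp32_ms = bench_ms(mod, params)
    # amp_mod = relay.transform.ToMixedPrecision("float16")(relay.transform.InferType()(mod))
    # amp_ms = bench_ms(amp_mod, params)
    ```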

##########
File path: rfcs/0001-AMP_pass.md
##########
@@ -0,0 +1,137 @@
+- Feature Name: Automatic Mixed Precision Pass
+- Start Date: 2021-06-08 
+- RFC PR: TODO
+- GitHub Issue: TODO

Review comment:
       As this RFC is guaranteed to be merged and the feature must be landed, it should be fine to open a tracking issue now and update the link here.

##########
File path: rfcs/0001-AMP_pass.md
##########
@@ -0,0 +1,137 @@
+- Feature Name: Automatic Mixed Precision Pass
+- Start Date: 2021-06-08 
+- RFC PR: TODO
+- GitHub Issue: TODO
+
+# Summary
+[summary]: #summary
+
+Many pieces of hardware support operation not only on 32 bit floating point, but also 16 bit floating point. 
+These 16 bit operations typically have higher theoretical throughput and involve less use of memory bandwidth.
+As a result, we can see significant speedups from replacing normal 32 bit operations with 16 bit analogs. 
+Surprisingly, for many operations this has little effect on the results, though some care must be taken when changing 
+operations. Some 16 bit floating point operations such as `exp` and `log` for example are considered less safe 
+due to loss of [numerical precision](https://on-demand.gputechconf.com/gtcdc/2019/pdf/dc91247-automatic-mixed-precision-in-tensorflow.pdf). 
+In general for a function `f`, if `|f(x)| >> |x|` for expected 
+ranges of input we probably do not want to use the 16 bit floating point versions.
+
+This feature will be a relay pass which automatically converts a 32 bit floating point model into a reduced bit 
+floating point analog. For the initial pass, IEEE's 16 bit floating point will be targeted, though future support
+for bfloat16 should be kept in mind.
+
+# Motivation
+[motivation]: #motivation
+
+Many machine learning models can move significant portions of their computational graphs into the FP16 space 
+without significant loss of accuracy. For many pieces of hardware this also comes with a boost in speed. In 
+the past utilizing FP16 in mixed precision training saw significant [increases in convergence speed](https://pytorch.org/blog/accelerating-training-on-nvidia-gpus-with-pytorch-automatic-mixed-precision/). 
+
+We should expect similar increases for inference. This speed increase without accuracy loss is highly desirable
+for many users.
+
+# Guide-level explanation
+[guide-level-explanation]: #guide-level-explanation
+
+Operations are partitioned into colors denoted "Green", "Red", and "Gray", which represent the benefit 
+of using a reduced floating point version of the operation. "Green" operations are compute intensive
+and almost always see hardware memory and latency savings by utilizing a reduced floating point form.
+Examples of these operations are matrix multiplies and convolutions. "Gray" operations see little to 
+no savings in using reduced floating point forms -- at least not enough to justify the overhead of 
+casting values back and forth from FP32. "Red" operations meanwhile are operations we do not want to 
+use reduced floating point forms on, usually due to numerical precision reasons.
+
+In general we always want to insert casts into reduced floating point space for "Green" operations, 
+are fine with transforming "Gray" operations into reduced floating point space if their inputs are already
+in that form, and want to explicitly cast back into full floating point space for "Red" operations. 
+Each operation will be placed into one of these lists via a "coloring" function which takes in Relay `CallNodes`
+and returns a color. For example, we might have a function which colors only a convolution as "Green" if it 
+has a large enough kernel and "Gray" otherwise. For the default implementation we will keep things simple
+however and do something like place all convolutions in the "Green" list, all element-wise operations in 
+the "Gray" list, and so on. Still, the code will be designed to be easily extensible via overwriting 
+this "coloring" function.
+
+The final variable we must keep in mind is that some hardware platforms which operate on reduced
+floating point types may, for example, take two FP16 operands but accumulate the 
+result in a 32 bit buffer. An example of this is the Tensor Cores in Nvidia's Turing architecture. 
+The final knob we give is therefore control over how operations accumulate their results. For this, we have 
+a function which maps operation types like `conv2d` to an accumulation datatype as well as an output 
+datatype. The output datatype is the type that operations down the line will ingest from the previous
+calculation, while the accumulation datatype describes the width of the buffer where the results are initially
+stored. For Nvidia's Tensor Cores, for example, many operations accumulate in FP32 but have an output datatype
+of FP16. The default implementation will follow this guideline closely: by default, all 
+operations output FP16 and accumulate in FP32 only if TVM supports mixed datatypes for that particular
+operation.
+
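A sketch of the accumulation/output datatype hook described above (again, purely illustrative names):

```python
# Ops for which TVM has schedules that accept FP16 inputs but accumulate in FP32.
MIXED_ACCUMULATION_OPS = {"nn.conv2d", "nn.dense", "nn.batch_matmul"}

def accumulation_and_output_dtypes(op_name: str):
    """Return (accumulation_dtype, output_dtype) for a converted operation.

    Mirrors the Tensor Core convention: accumulate in FP32 where supported,
    but hand FP16 to downstream operations.
    """
    if op_name in MIXED_ACCUMULATION_OPS:
        return "float32", "float16"
    return "float16", "float16"
```
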
+# Reference-level explanation
+[reference-level-explanation]: #reference-level-explanation
+
+See [previous discussion thread](https://discuss.tvm.apache.org/t/rfc-relay-fp32-fp16-model-support/9994).
+
+As some have noticed, the design can be simplified to a single pass where casting is determined by
+running type inference on mutated nodes. With a post-order traversal we can then check whether we need to 
+cast arguments or propagate a color.
+
+Part of the associated RFC issue will also be dedicated to creating a tutorial on how to control
+the conversion of ops via the Python interface. Furthermore, some work will be done in benchmarking
+the performance gains from the pass.
+
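For reference, a minimal usage sketch of how the finished pass might be invoked from Python. The pass name
`ToMixedPrecision` is an assumption based on the implementation PR discussed in the review; the exact name and
defaults should be treated as subject to change:

```python
from tvm import relay

def convert_to_fp16(mod):
    """Apply the automatic mixed precision pass to a Relay module (sketch)."""
    mod = relay.transform.InferType()(mod)
    # Convert eligible ops to FP16 and insert any required casts.
    mod = relay.transform.ToMixedPrecision(mixed_precision_type="float16")(mod)
    return mod
```
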
+# Drawbacks
+[drawbacks]: #drawbacks
+
+If this is not useful, we are just adding an additional pass which will do nothing. Furthermore, we 
+will have to make sure it works on a wide range of models, or users will quickly lose confidence in TVM.
+
+This might not be useful if mixed precision training becomes very popular in the future, in which 
+case most models might already be in a reduced precision floating point form.
+
+It also might not be useful if integer quantization becomes very popular, though it may be possible
+to mix integer quantization and mixed floating point precision techniques. Floating point still has 
+several advantages over integer quantization, including simplicity and the fact that some 
+operators like `sin` and `erf` are still designed in hardware with floating point in mind.
+
+# Rationale and alternatives
+[rationale-and-alternatives]: #rationale-and-alternatives
+
+- Why is this design the best in the space of possible designs?
+
+Other alternatives require a lot more work and changes and could probably be considered future goals of TVM.
+These include automatic mixed precision training.
+
+- What other designs have been considered and what is the rationale for not choosing them?
+
+We could support automatic mixed precision retraining, though that is a much, much larger future goal. It's
+good to have this pass in the meantime.
+
+- What is the impact of not doing this?
+
+TVM leaves a lot of essentially free speedup on the table and becomes a less attractive tool for making models run fast.
+
+# Prior art
+[prior-art]: #prior-art
+
+Many of the ideas are taken from Tensorflow's [automatic mixed precision training framework](https://on-demand.gputechconf.com/gtcdc/2019/pdf/dc91247-automatic-mixed-precision-in-tensorflow.pdf),
+and the initial "Green", "Gray", and "Red" lists are based on [Tensorflow's lists](https://github.com/tensorflow/tensorflow/blob/v2.5.0/tensorflow/core/grappler/optimizers/auto_mixed_precision_lists.h). 
+
+# Unresolved questions
+[unresolved-questions]: #unresolved-questions
+
+- What parts of the design do you expect to resolve through the RFC process before this gets merged?
+
+We still need to make sure that the current design, and the knobs it exposes, are extensible enough to cover the hardware platforms users care about.

Review comment:
       This does not seem to answer the question. IIUC, the initial implementation of the pass has been merged, so we should mention that here with the PR link.

##########
File path: rfcs/0001-AMP_pass.md
##########
@@ -0,0 +1,137 @@
+- Feature Name: Automatic Mixed Precision Pass
+- Start Date: 2021-06-08 
+- RFC PR: TODO
+- GitHub Issue: TODO
+
+# Summary
+[summary]: #summary
+
+Many pieces of hardware support operation not only on 32 bit floating point, but also 16 bit floating point. 
+These 16 bit operations typically have higher theoretical throughput and involve less use of memory bandwidth.
+As a result, we can see significant increases from changing normal 32 bit operations with 16 bit analogs. 
+Surprisingly, for many operations this has little effect on the results, though some care must had when changing 
+operations. Some 16 bit floating point operations such as `exp` and `log` for example are considered less safe 
+due to loss of [numerical precision](https://on-demand.gputechconf.com/gtcdc/2019/pdf/dc91247-automatic-mixed-precision-in-tensorflow.pdf). 
+In general for a function `f`, if `|f(x)| >> |x|` for expected 
+ranges of input we probably do not want to use the 16 bit floating point versions.
+
+This feature will be a relay pass which automatically converts a 32 bit floating point model into a reduced bit 
+floating point analog. For the initial pass IEEE's 16 bit floating point will be targeted though future support
+for bfloat16 should be in mind.
+
+# Motivation
+[motivation]: #motivation
+
+Many machine learning models can move significant portions of their computational graphs into the FP16 space 
+without significant loss of accuracy. For many pieces of hardware this also comes with a boost in speed. In 
+the past utilizing FP16 in mixed precision training saw significant [increases in convergence speed](https://pytorch.org/blog/accelerating-training-on-nvidia-gpus-with-pytorch-automatic-mixed-precision/). 
+
+We should expect similar increases for inference. This speed increase without accuracy loss is highly desirable
+for many users.
+
+# Guide-level explanation
+[guide-level-explanation]: #guide-level-explanation
+
+Operations are partitioned into colors denoted "Green", "Red", and "Gray" which represents the benefit 
+of using a reduced floating point version of the operation. "Green" operations are compute intensive
+and almost always see hardware memory and latency savings by utilizing a reduced floating point form.
+Examples of these operations are matrix multiplies and convolutions. "Gray" operations see little to 
+no savings in using reduced floating point forms -- at least not enough to justify the overhead of 
+casting values back and forth from FP32. "Red" operations meanwhile are operations we do not want to 
+use reduced floating point forms on, usually due to numerical precision reasons.
+
+In general we always want to insert casts into reduced floating point space for "Green" operations, 
+are fine with transforming "Gray" operations into reduced floating point space if their inputs are already
+in that form, and want to explicitly cast back into full floating point space for "Red" operations. 
+Each operation will be placed into one of these lists via a "coloring" function which take in Relay `CallNodes`
+and returns a color. For example, we might have a function which colors only a convolution as "Green" if it 
+has a large enough kernel and "Gray" otherwise. For the default implementation we will keep things simple
+however and do something like place all convolutions in the "Green" list, all element-wise operations in 
+the "Gray" list, and so on. Still, the code will be designed to be easily extensible via overwriting 
+this "coloring" function.
+
+The final variable we must keep in mind is the fact that some hardware platforms can operate on reduced
+floating point types. However, while they for example may take two FP16 operands they may accumulate the 
+result in a 32 bit buffer. An example of this are the Tensor Cores in Nvidia's Turing architecture. 
+The final knob we give is a control over how operations accumulate their result. For this, we have 
+a function, which maps operation types like `conv2d` to an accumulation datatype as well as an output 
+datatype. The output datatype is the type other operations down the line will likely ingest from the previous
+calculation while the accumulation datatype describes the size of buffer where the results are initially
+stored. For NVidia's tensor cores for example many operations accumulate in FP32 but have an output datatype
+of FP16. The default implementation will follow this guideline closely and will by default have all 
+operations output FP16 and accumulate in FP32 only if TVM supports mixed datatypes for that particular
+operation.
+
+# Reference-level explanation
+[reference-level-explanation]: #reference-level-explanation
+
+See [previous discussion thread](https://discuss.tvm.apache.org/t/rfc-relay-fp32-fp16-model-support/9994).
+
+As some have noticed the design can be simplified to a single pass where casting is determined by
+running type inference on mutated nodes. With a post-order traversal we can then check if we need to 
+cast arguments/propagate color.
+
+Part of the associated RFC issue will also be used dedicated to creating a tutorial on how to control
+the conversion of ops via the Python interface. Furthermore, some work will be done in benchmarking
+the performance gains from the pass.
+
+# Drawbacks
+[drawbacks]: #drawbacks
+
+If this is not useful, we are just adding an additional pass which will do nothing. Furthermore we 
+will have to make sure it works on a wide range of models or people will be very mad at TVM.
+
+This might not be useful if mixed precision training becomes super popular in the future in which 
+case most models might be in a reduced precision floating point form already.
+
+It also might not be useful if integer quantization becomes super popular, though it may be possible
+to mix integer quantization and mixed floating precision techniques. Floating point does have 
+several advantages still over integer quantization including simplicity and the fact that some 
+operators like `sin` and `erf` are still designed in hardware with floating point in mind.
+
+# Rationale and alternatives
+[rationale-and-alternatives]: #rationale-and-alternatives
+
+- Why is this design the best in the space of possible designs?
+
+Other alternatives require a lot more work and changes and could probably considered future goals of TVM.
+This include automatic mixed precision training.
+
+- What other designs have been considered and what is the rationale for not choosing them?
+
+We can support automatic mixed precision retraining though that is a much, much larger future goal. It's
+good to have this in the meantime.
+
+- What is the impact of not doing this?
+
+TVM is not the best tool for making models go fast as we leave a lot of free speedup on the table.
+
+# Prior art
+[prior-art]: #prior-art

Review comment:
       It's better to also discuss the tensor cache mechanism used in PyTorch.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm-rfcs] comaniac commented on a change in pull request #6: [RFC] [Relay] Automatic Mixed Precision Pass

Posted by GitBox <gi...@apache.org>.
comaniac commented on a change in pull request #6:
URL: https://github.com/apache/tvm-rfcs/pull/6#discussion_r677642360



##########
File path: rfcs/0001-AMP_pass.md
##########
@@ -0,0 +1,137 @@
+- Feature Name: Automatic Mixed Precision Pass
+- Start Date: 2021-06-08 
+- RFC PR: TODO
+- GitHub Issue: TODO

Review comment:
       I found that the issue has been opened, so please just update the link here.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm-rfcs] AndrewZhaoLuo commented on a change in pull request #6: [RFC] [Relay] Automatic Mixed Precision Pass

Posted by GitBox <gi...@apache.org>.
AndrewZhaoLuo commented on a change in pull request #6:
URL: https://github.com/apache/tvm-rfcs/pull/6#discussion_r691486596



##########
File path: rfcs/0001-AMP_pass.md
##########
@@ -0,0 +1,137 @@
+- Feature Name: Automatic Mixed Precision Pass
+- Start Date: 2021-06-08 
+- RFC PR: TODO
+- GitHub Issue: TODO
+
+# Summary
+[summary]: #summary
+
+Many pieces of hardware support operation not only on 32 bit floating point, but also 16 bit floating point. 
+These 16 bit operations typically have higher theoretical throughput and involve less use of memory bandwidth.
+As a result, we can see significant increases from changing normal 32 bit operations with 16 bit analogs. 
+Surprisingly, for many operations this has little effect on the results, though some care must had when changing 
+operations. Some 16 bit floating point operations such as `exp` and `log` for example are considered less safe 
+due to loss of [numerical precision](https://on-demand.gputechconf.com/gtcdc/2019/pdf/dc91247-automatic-mixed-precision-in-tensorflow.pdf). 
+In general for a function `f`, if `|f(x)| >> |x|` for expected 
+ranges of input we probably do not want to use the 16 bit floating point versions.
+
+This feature will be a relay pass which automatically converts a 32 bit floating point model into a reduced bit 
+floating point analog. For the initial pass IEEE's 16 bit floating point will be targeted though future support
+for bfloat16 should be in mind.
+
+# Motivation
+[motivation]: #motivation
+
+Many machine learning models can move significant portions of their computational graphs into the FP16 space 
+without significant loss of accuracy. For many pieces of hardware this also comes with a boost in speed. In 
+the past utilizing FP16 in mixed precision training saw significant [increases in convergence speed](https://pytorch.org/blog/accelerating-training-on-nvidia-gpus-with-pytorch-automatic-mixed-precision/). 
+
+We should expect similar increases for inference. This speed increase without accuracy loss is highly desirable
+for many users.
+
+# Guide-level explanation
+[guide-level-explanation]: #guide-level-explanation
+
+Operations are partitioned into colors denoted "Green", "Red", and "Gray" which represents the benefit 
+of using a reduced floating point version of the operation. "Green" operations are compute intensive
+and almost always see hardware memory and latency savings by utilizing a reduced floating point form.
+Examples of these operations are matrix multiplies and convolutions. "Gray" operations see little to 
+no savings in using reduced floating point forms -- at least not enough to justify the overhead of 
+casting values back and forth from FP32. "Red" operations meanwhile are operations we do not want to 
+use reduced floating point forms on, usually due to numerical precision reasons.
+
+In general we always want to insert casts into reduced floating point space for "Green" operations, 
+are fine with transforming "Gray" operations into reduced floating point space if their inputs are already
+in that form, and want to explicitly cast back into full floating point space for "Red" operations. 
+Each operation will be placed into one of these lists via a "coloring" function which take in Relay `CallNodes`
+and returns a color. For example, we might have a function which colors only a convolution as "Green" if it 
+has a large enough kernel and "Gray" otherwise. For the default implementation we will keep things simple
+however and do something like place all convolutions in the "Green" list, all element-wise operations in 
+the "Gray" list, and so on. Still, the code will be designed to be easily extensible via overwriting 
+this "coloring" function.
+
+The final variable we must keep in mind is the fact that some hardware platforms can operate on reduced
+floating point types. However, while they for example may take two FP16 operands they may accumulate the 
+result in a 32 bit buffer. An example of this are the Tensor Cores in Nvidia's Turing architecture. 
+The final knob we give is a control over how operations accumulate their result. For this, we have 
+a function, which maps operation types like `conv2d` to an accumulation datatype as well as an output 
+datatype. The output datatype is the type other operations down the line will likely ingest from the previous
+calculation while the accumulation datatype describes the size of buffer where the results are initially
+stored. For NVidia's tensor cores for example many operations accumulate in FP32 but have an output datatype
+of FP16. The default implementation will follow this guideline closely and will by default have all 
+operations output FP16 and accumulate in FP32 only if TVM supports mixed datatypes for that particular
+operation.
+
+# Reference-level explanation
+[reference-level-explanation]: #reference-level-explanation
+
+See [previous discussion thread](https://discuss.tvm.apache.org/t/rfc-relay-fp32-fp16-model-support/9994).
+
+As some have noticed the design can be simplified to a single pass where casting is determined by
+running type inference on mutated nodes. With a post-order traversal we can then check if we need to 
+cast arguments/propagate color.
+
+Part of the associated RFC issue will also be used dedicated to creating a tutorial on how to control
+the conversion of ops via the Python interface. Furthermore, some work will be done in benchmarking
+the performance gains from the pass.
+
+# Drawbacks
+[drawbacks]: #drawbacks
+
+If this is not useful, we are just adding an additional pass which will do nothing. Furthermore we 
+will have to make sure it works on a wide range of models or people will be very mad at TVM.
+
+This might not be useful if mixed precision training becomes super popular in the future in which 
+case most models might be in a reduced precision floating point form already.
+
+It also might not be useful if integer quantization becomes super popular, though it may be possible
+to mix integer quantization and mixed floating precision techniques. Floating point does have 
+several advantages still over integer quantization including simplicity and the fact that some 
+operators like `sin` and `erf` are still designed in hardware with floating point in mind.
+
+# Rationale and alternatives
+[rationale-and-alternatives]: #rationale-and-alternatives
+
+- Why is this design the best in the space of possible designs?
+
+Other alternatives require a lot more work and changes and could probably considered future goals of TVM.
+This include automatic mixed precision training.
+
+- What other designs have been considered and what is the rationale for not choosing them?
+
+We can support automatic mixed precision retraining though that is a much, much larger future goal. It's
+good to have this in the meantime.
+
+- What is the impact of not doing this?
+
+TVM is not the best tool for making models go fast as we leave a lot of free speedup on the table.
+
+# Prior art
+[prior-art]: #prior-art

Review comment:
       Done, added discussion on this topic




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm-rfcs] AndrewZhaoLuo commented on a change in pull request #6: [RFC] [Relay] Automatic Mixed Precision Pass

Posted by GitBox <gi...@apache.org>.
AndrewZhaoLuo commented on a change in pull request #6:
URL: https://github.com/apache/tvm-rfcs/pull/6#discussion_r690674905



##########
File path: rfcs/0001-AMP_pass.md
##########
@@ -0,0 +1,137 @@
+- Feature Name: Automatic Mixed Precision Pass
+- Start Date: 2021-06-08 
+- RFC PR: TODO
+- GitHub Issue: TODO
+
+# Summary
+[summary]: #summary
+
+Many pieces of hardware support operation not only on 32 bit floating point, but also 16 bit floating point. 
+These 16 bit operations typically have higher theoretical throughput and involve less use of memory bandwidth.
+As a result, we can see significant increases from changing normal 32 bit operations with 16 bit analogs. 
+Surprisingly, for many operations this has little effect on the results, though some care must had when changing 
+operations. Some 16 bit floating point operations such as `exp` and `log` for example are considered less safe 
+due to loss of [numerical precision](https://on-demand.gputechconf.com/gtcdc/2019/pdf/dc91247-automatic-mixed-precision-in-tensorflow.pdf). 
+In general for a function `f`, if `|f(x)| >> |x|` for expected 
+ranges of input we probably do not want to use the 16 bit floating point versions.
+
+This feature will be a relay pass which automatically converts a 32 bit floating point model into a reduced bit 
+floating point analog. For the initial pass IEEE's 16 bit floating point will be targeted though future support
+for bfloat16 should be in mind.
+
+# Motivation
+[motivation]: #motivation
+
+Many machine learning models can move significant portions of their computational graphs into the FP16 space 
+without significant loss of accuracy. For many pieces of hardware this also comes with a boost in speed. In 
+the past utilizing FP16 in mixed precision training saw significant [increases in convergence speed](https://pytorch.org/blog/accelerating-training-on-nvidia-gpus-with-pytorch-automatic-mixed-precision/). 
+
+We should expect similar increases for inference. This speed increase without accuracy loss is highly desirable
+for many users.
+
+# Guide-level explanation
+[guide-level-explanation]: #guide-level-explanation
+
+Operations are partitioned into colors denoted "Green", "Red", and "Gray" which represents the benefit 
+of using a reduced floating point version of the operation. "Green" operations are compute intensive
+and almost always see hardware memory and latency savings by utilizing a reduced floating point form.
+Examples of these operations are matrix multiplies and convolutions. "Gray" operations see little to 
+no savings in using reduced floating point forms -- at least not enough to justify the overhead of 
+casting values back and forth from FP32. "Red" operations meanwhile are operations we do not want to 
+use reduced floating point forms on, usually due to numerical precision reasons.
+
+In general we always want to insert casts into reduced floating point space for "Green" operations, 
+are fine with transforming "Gray" operations into reduced floating point space if their inputs are already
+in that form, and want to explicitly cast back into full floating point space for "Red" operations. 
+Each operation will be placed into one of these lists via a "coloring" function which take in Relay `CallNodes`
+and returns a color. For example, we might have a function which colors only a convolution as "Green" if it 
+has a large enough kernel and "Gray" otherwise. For the default implementation we will keep things simple
+however and do something like place all convolutions in the "Green" list, all element-wise operations in 
+the "Gray" list, and so on. Still, the code will be designed to be easily extensible via overwriting 
+this "coloring" function.
+
+The final variable we must keep in mind is the fact that some hardware platforms can operate on reduced
+floating point types. However, while they for example may take two FP16 operands they may accumulate the 
+result in a 32 bit buffer. An example of this are the Tensor Cores in Nvidia's Turing architecture. 
+The final knob we give is a control over how operations accumulate their result. For this, we have 
+a function, which maps operation types like `conv2d` to an accumulation datatype as well as an output 
+datatype. The output datatype is the type other operations down the line will likely ingest from the previous
+calculation while the accumulation datatype describes the size of buffer where the results are initially
+stored. For NVidia's tensor cores for example many operations accumulate in FP32 but have an output datatype
+of FP16. The default implementation will follow this guideline closely and will by default have all 
+operations output FP16 and accumulate in FP32 only if TVM supports mixed datatypes for that particular
+operation.
+
+# Reference-level explanation
+[reference-level-explanation]: #reference-level-explanation
+
+See [previous discussion thread](https://discuss.tvm.apache.org/t/rfc-relay-fp32-fp16-model-support/9994).
+
+As some have noticed the design can be simplified to a single pass where casting is determined by
+running type inference on mutated nodes. With a post-order traversal we can then check if we need to 
+cast arguments/propagate color.
+
+Part of the associated RFC issue will also be used dedicated to creating a tutorial on how to control
+the conversion of ops via the Python interface. Furthermore, some work will be done in benchmarking
+the performance gains from the pass.
+
+# Drawbacks
+[drawbacks]: #drawbacks
+
+If this is not useful, we are just adding an additional pass which will do nothing. Furthermore we 
+will have to make sure it works on a wide range of models or people will be very mad at TVM.
+
+This might not be useful if mixed precision training becomes super popular in the future in which 
+case most models might be in a reduced precision floating point form already.
+
+It also might not be useful if integer quantization becomes super popular, though it may be possible
+to mix integer quantization and mixed floating precision techniques. Floating point does have 
+several advantages still over integer quantization including simplicity and the fact that some 
+operators like `sin` and `erf` are still designed in hardware with floating point in mind.
+
+# Rationale and alternatives
+[rationale-and-alternatives]: #rationale-and-alternatives
+
+- Why is this design the best in the space of possible designs?
+
+Other alternatives require a lot more work and changes and could probably considered future goals of TVM.
+This include automatic mixed precision training.
+
+- What other designs have been considered and what is the rationale for not choosing them?
+
+We can support automatic mixed precision retraining though that is a much, much larger future goal. It's
+good to have this in the meantime.

Review comment:
       Done. Please let me know if this is sufficient. Don't have the best background on some of this stuff.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm-rfcs] AndrewZhaoLuo commented on pull request #6: [RFC] [Relay] Automatic Mixed Precision Pass

Posted by GitBox <gi...@apache.org>.
AndrewZhaoLuo commented on pull request #6:
URL: https://github.com/apache/tvm-rfcs/pull/6#issuecomment-887889315


   Thanks for driving this review @comaniac. I'll get to this later in the week.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm-rfcs] AndrewZhaoLuo commented on a change in pull request #6: [RFC] [Relay] Automatic Mixed Precision Pass

Posted by GitBox <gi...@apache.org>.
AndrewZhaoLuo commented on a change in pull request #6:
URL: https://github.com/apache/tvm-rfcs/pull/6#discussion_r690675566



##########
File path: rfcs/0001-AMP_pass.md
##########
@@ -0,0 +1,137 @@
+- Feature Name: Automatic Mixed Precision Pass
+- Start Date: 2021-06-08 
+- RFC PR: TODO
+- GitHub Issue: TODO
+
+# Summary
+[summary]: #summary
+
+Many pieces of hardware support operation not only on 32 bit floating point, but also 16 bit floating point. 
+These 16 bit operations typically have higher theoretical throughput and involve less use of memory bandwidth.
+As a result, we can see significant increases from changing normal 32 bit operations with 16 bit analogs. 
+Surprisingly, for many operations this has little effect on the results, though some care must had when changing 
+operations. Some 16 bit floating point operations such as `exp` and `log` for example are considered less safe 
+due to loss of [numerical precision](https://on-demand.gputechconf.com/gtcdc/2019/pdf/dc91247-automatic-mixed-precision-in-tensorflow.pdf). 
+In general for a function `f`, if `|f(x)| >> |x|` for expected 
+ranges of input we probably do not want to use the 16 bit floating point versions.
+
+This feature will be a relay pass which automatically converts a 32 bit floating point model into a reduced bit 
+floating point analog. For the initial pass IEEE's 16 bit floating point will be targeted though future support
+for bfloat16 should be in mind.
+
+# Motivation
+[motivation]: #motivation
+
+Many machine learning models can move significant portions of their computational graphs into the FP16 space 
+without significant loss of accuracy. For many pieces of hardware this also comes with a boost in speed. In 
+the past utilizing FP16 in mixed precision training saw significant [increases in convergence speed](https://pytorch.org/blog/accelerating-training-on-nvidia-gpus-with-pytorch-automatic-mixed-precision/). 
+
+We should expect similar increases for inference. This speed increase without accuracy loss is highly desirable
+for many users.
+
+# Guide-level explanation
+[guide-level-explanation]: #guide-level-explanation
+
+Operations are partitioned into colors denoted "Green", "Red", and "Gray" which represents the benefit 
+of using a reduced floating point version of the operation. "Green" operations are compute intensive
+and almost always see hardware memory and latency savings by utilizing a reduced floating point form.
+Examples of these operations are matrix multiplies and convolutions. "Gray" operations see little to 
+no savings in using reduced floating point forms -- at least not enough to justify the overhead of 
+casting values back and forth from FP32. "Red" operations meanwhile are operations we do not want to 
+use reduced floating point forms on, usually due to numerical precision reasons.
+
+In general we always want to insert casts into reduced floating point space for "Green" operations, 
+are fine with transforming "Gray" operations into reduced floating point space if their inputs are already
+in that form, and want to explicitly cast back into full floating point space for "Red" operations. 
+Each operation will be placed into one of these lists via a "coloring" function which take in Relay `CallNodes`
+and returns a color. For example, we might have a function which colors only a convolution as "Green" if it 
+has a large enough kernel and "Gray" otherwise. For the default implementation we will keep things simple
+however and do something like place all convolutions in the "Green" list, all element-wise operations in 
+the "Gray" list, and so on. Still, the code will be designed to be easily extensible via overwriting 
+this "coloring" function.
+
+The final variable we must keep in mind is the fact that some hardware platforms can operate on reduced
+floating point types. However, while they for example may take two FP16 operands they may accumulate the 
+result in a 32 bit buffer. An example of this are the Tensor Cores in Nvidia's Turing architecture. 
+The final knob we give is a control over how operations accumulate their result. For this, we have 
+a function, which maps operation types like `conv2d` to an accumulation datatype as well as an output 
+datatype. The output datatype is the type other operations down the line will likely ingest from the previous
+calculation while the accumulation datatype describes the size of buffer where the results are initially
+stored. For NVidia's tensor cores for example many operations accumulate in FP32 but have an output datatype
+of FP16. The default implementation will follow this guideline closely and will by default have all 
+operations output FP16 and accumulate in FP32 only if TVM supports mixed datatypes for that particular
+operation.
+
+# Reference-level explanation
+[reference-level-explanation]: #reference-level-explanation
+
+See [previous discussion thread](https://discuss.tvm.apache.org/t/rfc-relay-fp32-fp16-model-support/9994).
+
+As some have noticed the design can be simplified to a single pass where casting is determined by
+running type inference on mutated nodes. With a post-order traversal we can then check if we need to 
+cast arguments/propagate color.
+
+Part of the associated RFC issue will also be used dedicated to creating a tutorial on how to control
+the conversion of ops via the Python interface. Furthermore, some work will be done in benchmarking
+the performance gains from the pass.
+
+# Drawbacks
+[drawbacks]: #drawbacks
+
+If this is not useful, we are just adding an additional pass which will do nothing. Furthermore we 
+will have to make sure it works on a wide range of models or people will be very mad at TVM.
+
+This might not be useful if mixed precision training becomes super popular in the future in which 
+case most models might be in a reduced precision floating point form already.
+
+It also might not be useful if integer quantization becomes super popular, though it may be possible
+to mix integer quantization and mixed floating precision techniques. Floating point does have 
+several advantages still over integer quantization including simplicity and the fact that some 
+operators like `sin` and `erf` are still designed in hardware with floating point in mind.
+
+# Rationale and alternatives
+[rationale-and-alternatives]: #rationale-and-alternatives
+
+- Why is this design the best in the space of possible designs?
+
+Other alternatives require a lot more work and changes and could probably considered future goals of TVM.
+This include automatic mixed precision training.
+
+- What other designs have been considered and what is the rationale for not choosing them?
+
+We can support automatic mixed precision retraining though that is a much, much larger future goal. It's
+good to have this in the meantime.
+
+- What is the impact of not doing this?
+
+TVM is not the best tool for making models go fast as we leave a lot of free speedup on the table.
+
+# Prior art
+[prior-art]: #prior-art

Review comment:
       I am not familiar with this and cannot find this concept, do you have a link?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm-rfcs] masahi commented on pull request #6: [RFC] [Relay] Automatic Mixed Precision Pass

Posted by GitBox <gi...@apache.org>.
masahi commented on pull request #6:
URL: https://github.com/apache/tvm-rfcs/pull/6#issuecomment-1002908086


   @MeJerry215 Yes, casting of the weights to fp16 is done at compile time by the `FoldConstant` pass, so the weights will be in fp16 at deploy time. 
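   A sketch of how that plays out in practice (illustrative; assumes the pass is exposed as `ToMixedPrecision` and that the FP32 params are bound into the module first):

   ```python
   from tvm import relay

   def convert_and_fold_weights(mod, params):
       """Convert to FP16 and fold the weight casts at compile time (sketch)."""
       # Bind the FP32 weights into the module as constants.
       mod["main"] = relay.build_module.bind_params_by_name(mod["main"], params)
       mod = relay.transform.InferType()(mod)
       mod = relay.transform.ToMixedPrecision(mixed_precision_type="float16")(mod)
       # FoldConstant evaluates each weight's cast to FP16, so the deployed
       # artifact stores FP16 weights directly.
       mod = relay.transform.FoldConstant()(mod)
       return mod
   ```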


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm-rfcs] AndrewZhaoLuo commented on a change in pull request #6: [RFC] [Relay] Automatic Mixed Precision Pass

Posted by GitBox <gi...@apache.org>.
AndrewZhaoLuo commented on a change in pull request #6:
URL: https://github.com/apache/tvm-rfcs/pull/6#discussion_r690680567



##########
File path: rfcs/0001-AMP_pass.md
##########
@@ -0,0 +1,137 @@
+- Feature Name: Automatic Mixed Precision Pass
+- Start Date: 2021-06-08 
+- RFC PR: TODO
+- GitHub Issue: TODO
+
+# Summary
+[summary]: #summary
+
+Many pieces of hardware support operation not only on 32 bit floating point, but also 16 bit floating point. 
+These 16 bit operations typically have higher theoretical throughput and involve less use of memory bandwidth.
+As a result, we can see significant increases from changing normal 32 bit operations with 16 bit analogs. 
+Surprisingly, for many operations this has little effect on the results, though some care must had when changing 
+operations. Some 16 bit floating point operations such as `exp` and `log` for example are considered less safe 
+due to loss of [numerical precision](https://on-demand.gputechconf.com/gtcdc/2019/pdf/dc91247-automatic-mixed-precision-in-tensorflow.pdf). 
+In general for a function `f`, if `|f(x)| >> |x|` for expected 
+ranges of input we probably do not want to use the 16 bit floating point versions.
+
+This feature will be a relay pass which automatically converts a 32 bit floating point model into a reduced bit 
+floating point analog. For the initial pass IEEE's 16 bit floating point will be targeted though future support
+for bfloat16 should be in mind.
+
+# Motivation
+[motivation]: #motivation
+
+Many machine learning models can move significant portions of their computational graphs into the FP16 space 
+without significant loss of accuracy. For many pieces of hardware this also comes with a boost in speed. In 
+the past utilizing FP16 in mixed precision training saw significant [increases in convergence speed](https://pytorch.org/blog/accelerating-training-on-nvidia-gpus-with-pytorch-automatic-mixed-precision/). 
+
+We should expect similar increases for inference. This speed increase without accuracy loss is highly desirable
+for many users.
+
+# Guide-level explanation
+[guide-level-explanation]: #guide-level-explanation
+
+Operations are partitioned into colors denoted "Green", "Red", and "Gray" which represents the benefit 
+of using a reduced floating point version of the operation. "Green" operations are compute intensive
+and almost always see hardware memory and latency savings by utilizing a reduced floating point form.
+Examples of these operations are matrix multiplies and convolutions. "Gray" operations see little to 
+no savings in using reduced floating point forms -- at least not enough to justify the overhead of 
+casting values back and forth from FP32. "Red" operations meanwhile are operations we do not want to 
+use reduced floating point forms on, usually due to numerical precision reasons.
+
+In general we always want to insert casts into reduced floating point space for "Green" operations, 
+are fine with transforming "Gray" operations into reduced floating point space if their inputs are already
+in that form, and want to explicitly cast back into full floating point space for "Red" operations. 
+Each operation will be placed into one of these lists via a "coloring" function which take in Relay `CallNodes`
+and returns a color. For example, we might have a function which colors only a convolution as "Green" if it 
+has a large enough kernel and "Gray" otherwise. For the default implementation we will keep things simple
+however and do something like place all convolutions in the "Green" list, all element-wise operations in 
+the "Gray" list, and so on. Still, the code will be designed to be easily extensible via overwriting 
+this "coloring" function.
+
+The final variable we must keep in mind is the fact that some hardware platforms can operate on reduced
+floating point types. However, while they for example may take two FP16 operands they may accumulate the 
+result in a 32 bit buffer. An example of this are the Tensor Cores in Nvidia's Turing architecture. 
+The final knob we give is a control over how operations accumulate their result. For this, we have 
+a function, which maps operation types like `conv2d` to an accumulation datatype as well as an output 
+datatype. The output datatype is the type other operations down the line will likely ingest from the previous
+calculation while the accumulation datatype describes the size of buffer where the results are initially
+stored. For NVidia's tensor cores for example many operations accumulate in FP32 but have an output datatype
+of FP16. The default implementation will follow this guideline closely and will by default have all 
+operations output FP16 and accumulate in FP32 only if TVM supports mixed datatypes for that particular
+operation.
+
+# Reference-level explanation
+[reference-level-explanation]: #reference-level-explanation

Review comment:
       @comaniac x86 is not really the target for this pass as it has poor hardware FP16 support. I'll add a list of supported targets we will aim for.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm-rfcs] tmoreau89 commented on pull request #6: Automatic Mixed Precision Pass RFC

Posted by GitBox <gi...@apache.org>.
tmoreau89 commented on pull request #6:
URL: https://github.com/apache/tvm-rfcs/pull/6#issuecomment-857933603


   CCing @csullivan


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm-rfcs] comaniac commented on pull request #6: [RFC] [Relay] Automatic Mixed Precision Pass

Posted by GitBox <gi...@apache.org>.
comaniac commented on pull request #6:
URL: https://github.com/apache/tvm-rfcs/pull/6#issuecomment-899816190


   Took a quick pass over the updated RFC. I think it's almost ready to merge as long as the last 3 comments are resolved.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm-rfcs] AndrewZhaoLuo commented on a change in pull request #6: [RFC] [Relay] Automatic Mixed Precision Pass

Posted by GitBox <gi...@apache.org>.
AndrewZhaoLuo commented on a change in pull request #6:
URL: https://github.com/apache/tvm-rfcs/pull/6#discussion_r690678611



##########
File path: rfcs/0001-AMP_pass.md
##########
@@ -0,0 +1,137 @@
+- Feature Name: Automatic Mixed Precision Pass
+- Start Date: 2021-06-08 
+- RFC PR: TODO
+- GitHub Issue: TODO
+
+# Summary
+[summary]: #summary
+
+Many pieces of hardware support operation not only on 32 bit floating point, but also 16 bit floating point. 
+These 16 bit operations typically have higher theoretical throughput and involve less use of memory bandwidth.
+As a result, we can see significant increases from changing normal 32 bit operations with 16 bit analogs. 
+Surprisingly, for many operations this has little effect on the results, though some care must had when changing 
+operations. Some 16 bit floating point operations such as `exp` and `log` for example are considered less safe 
+due to loss of [numerical precision](https://on-demand.gputechconf.com/gtcdc/2019/pdf/dc91247-automatic-mixed-precision-in-tensorflow.pdf). 
+In general for a function `f`, if `|f(x)| >> |x|` for expected 
+ranges of input we probably do not want to use the 16 bit floating point versions.
+
+This feature will be a relay pass which automatically converts a 32 bit floating point model into a reduced bit 
+floating point analog. For the initial pass IEEE's 16 bit floating point will be targeted though future support
+for bfloat16 should be in mind.
+
+# Motivation
+[motivation]: #motivation
+
+Many machine learning models can move significant portions of their computational graphs into the FP16 space 
+without significant loss of accuracy. For many pieces of hardware this also comes with a boost in speed. In 
+the past utilizing FP16 in mixed precision training saw significant [increases in convergence speed](https://pytorch.org/blog/accelerating-training-on-nvidia-gpus-with-pytorch-automatic-mixed-precision/). 
+
+We should expect similar increases for inference. This speed increase without accuracy loss is highly desirable
+for many users.
+
+# Guide-level explanation
+[guide-level-explanation]: #guide-level-explanation
+
+Operations are partitioned into colors denoted "Green", "Red", and "Gray" which represents the benefit 
+of using a reduced floating point version of the operation. "Green" operations are compute intensive
+and almost always see hardware memory and latency savings by utilizing a reduced floating point form.
+Examples of these operations are matrix multiplies and convolutions. "Gray" operations see little to 
+no savings in using reduced floating point forms -- at least not enough to justify the overhead of 
+casting values back and forth from FP32. "Red" operations meanwhile are operations we do not want to 
+use reduced floating point forms on, usually due to numerical precision reasons.
+
+In general we always want to insert casts into reduced floating point space for "Green" operations, 
+are fine with transforming "Gray" operations into reduced floating point space if their inputs are already
+in that form, and want to explicitly cast back into full floating point space for "Red" operations. 
+Each operation will be placed into one of these lists via a "coloring" function which take in Relay `CallNodes`
+and returns a color. For example, we might have a function which colors only a convolution as "Green" if it 
+has a large enough kernel and "Gray" otherwise. For the default implementation we will keep things simple
+however and do something like place all convolutions in the "Green" list, all element-wise operations in 
+the "Gray" list, and so on. Still, the code will be designed to be easily extensible via overwriting 
+this "coloring" function.
+
+The final variable we must keep in mind is the fact that some hardware platforms can operate on reduced
+floating point types. However, while they for example may take two FP16 operands they may accumulate the 
+result in a 32 bit buffer. An example of this are the Tensor Cores in Nvidia's Turing architecture. 
+The final knob we give is a control over how operations accumulate their result. For this, we have 
+a function, which maps operation types like `conv2d` to an accumulation datatype as well as an output 
+datatype. The output datatype is the type other operations down the line will likely ingest from the previous
+calculation while the accumulation datatype describes the size of buffer where the results are initially
+stored. For NVidia's tensor cores for example many operations accumulate in FP32 but have an output datatype
+of FP16. The default implementation will follow this guideline closely and will by default have all 
+operations output FP16 and accumulate in FP32 only if TVM supports mixed datatypes for that particular
+operation.
+
+# Reference-level explanation
+[reference-level-explanation]: #reference-level-explanation

Review comment:
       Hey all, in the case where the hardware does not have an instruction for accumulating fp16 results in fp32, the user has the interface to turn off accumulating in fp32. 
   
   Right now the default is designed with Nvidia GPUs in mind, but example pass settings for other targets can be easily made.
   
   As for concerns about the representation being suitable for your device, do you have any specific concerns? As long as the codegen path to the device is good and clear, I do not think the representation will be a problem.
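   For example, a per-op override could keep accumulation in FP16 on targets without an FP32-accumulate instruction. A sketch, assuming the registration hook (`register_mixed_precision_conversion`) and integer conversion categories from the implementation PR; exact names and signatures may differ:

   ```python
   # Sketch only; assumes the per-op hook from the implementation PR.
   from tvm.relay.op import register_mixed_precision_conversion

   MIXED_PRECISION_ALWAYS = 0  # i.e. a "green" op

   def conv2d_fp16_accumulate(call_node, mixed_precision_type):
       # Convert conv2d, but accumulate in FP16 as well, for targets that
       # lack an FP16-multiply / FP32-accumulate instruction.
       return [MIXED_PRECISION_ALWAYS, mixed_precision_type, mixed_precision_type]

   # A higher level overrides the default registration.
   register_mixed_precision_conversion("nn.conv2d", conv2d_fp16_accumulate, level=11)
   ```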
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm-rfcs] AndrewZhaoLuo edited a comment on pull request #6: Automatic Mixed Precision Pass RFC

Posted by GitBox <gi...@apache.org>.
AndrewZhaoLuo edited a comment on pull request #6:
URL: https://github.com/apache/tvm-rfcs/pull/6#issuecomment-857903539


   > Thanks for the RFC. I have two questions:
   > 
   > 1. How to mark/set the color (i.e., attribute) of every operator?
   > 2. It seems to me that if we register a casting checker instead of just a label (color), then we can simplify the algorithm a lot. Taking the case `A(green) - B(gray) - C(green)` as an example, if we could register a casting rule of B as follows, then we just need one traverse to know if we need cast around B:
   >    ```
   >    def amp_B(expr, args):
   >        a = args[0]
   >        if (a.dtype is float16):
   >          return fp16
   >        return fp32
   >    ```
   >    
   >    After all, we only need the previous nodes to determine 1) whether to use FP16 implementation, and 2) whether to insert casts. It seems to me that this pass is similar to the layout conversion pass, which uses one traverse to finish everything, so it might be possible for AMP too.
   
   Yep, that is correct; it is very similar to the layout conversion pass. This RFC has an initial PR here: https://github.com/apache/tvm/pull/8069.
   
   To answer your questions:
   1.  src/relay/transforms/fp32_to_fp16.h -- DefaultFP16Colorer is the default way. But the only thing we need is a callable that takes a `CallNode*` and returns a `Color`, so you could write your own colorer that does arbitrary things while only looking at a single node at a time.
   
   2. This is functionally what is done in the PR I linked. It's one pass.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm-rfcs] u99127 commented on a change in pull request #6: [RFC] [Relay] Automatic Mixed Precision Pass

Posted by GitBox <gi...@apache.org>.
u99127 commented on a change in pull request #6:
URL: https://github.com/apache/tvm-rfcs/pull/6#discussion_r677863248



##########
File path: rfcs/0001-AMP_pass.md
##########
@@ -0,0 +1,137 @@
+- Feature Name: Automatic Mixed Precision Pass
+- Start Date: 2021-06-08 
+- RFC PR: TODO
+- GitHub Issue: TODO
+
+# Summary
+[summary]: #summary
+
+Many pieces of hardware support operation not only on 32 bit floating point, but also 16 bit floating point. 
+These 16 bit operations typically have higher theoretical throughput and involve less use of memory bandwidth.
+As a result, we can see significant increases from changing normal 32 bit operations with 16 bit analogs. 
+Surprisingly, for many operations this has little effect on the results, though some care must had when changing 
+operations. Some 16 bit floating point operations such as `exp` and `log` for example are considered less safe 
+due to loss of [numerical precision](https://on-demand.gputechconf.com/gtcdc/2019/pdf/dc91247-automatic-mixed-precision-in-tensorflow.pdf). 
+In general for a function `f`, if `|f(x)| >> |x|` for expected 
+ranges of input we probably do not want to use the 16 bit floating point versions.
+
+This feature will be a relay pass which automatically converts a 32 bit floating point model into a reduced bit 
+floating point analog. For the initial pass IEEE's 16 bit floating point will be targeted though future support
+for bfloat16 should be in mind.
+
+# Motivation
+[motivation]: #motivation
+
+Many machine learning models can move significant portions of their computational graphs into the FP16 space 
+without significant loss of accuracy. For many pieces of hardware this also comes with a boost in speed. In 
+the past utilizing FP16 in mixed precision training saw significant [increases in convergence speed](https://pytorch.org/blog/accelerating-training-on-nvidia-gpus-with-pytorch-automatic-mixed-precision/). 
+
+We should expect similar increases for inference. This speed increase without accuracy loss is highly desirable
+for many users.
+
+# Guide-level explanation
+[guide-level-explanation]: #guide-level-explanation
+
+Operations are partitioned into colors denoted "Green", "Red", and "Gray" which represents the benefit 
+of using a reduced floating point version of the operation. "Green" operations are compute intensive
+and almost always see hardware memory and latency savings by utilizing a reduced floating point form.
+Examples of these operations are matrix multiplies and convolutions. "Gray" operations see little to 
+no savings in using reduced floating point forms -- at least not enough to justify the overhead of 
+casting values back and forth from FP32. "Red" operations meanwhile are operations we do not want to 
+use reduced floating point forms on, usually due to numerical precision reasons.
+
+In general we always want to insert casts into reduced floating point space for "Green" operations, 
+are fine with transforming "Gray" operations into reduced floating point space if their inputs are already
+in that form, and want to explicitly cast back into full floating point space for "Red" operations. 
+Each operation will be placed into one of these lists via a "coloring" function which take in Relay `CallNodes`
+and returns a color. For example, we might have a function which colors only a convolution as "Green" if it 
+has a large enough kernel and "Gray" otherwise. For the default implementation we will keep things simple
+however and do something like place all convolutions in the "Green" list, all element-wise operations in 
+the "Gray" list, and so on. Still, the code will be designed to be easily extensible via overwriting 
+this "coloring" function.
+
+The final variable we must keep in mind is the fact that some hardware platforms can operate on reduced
+floating point types. However, while they for example may take two FP16 operands they may accumulate the 
+result in a 32 bit buffer. An example of this are the Tensor Cores in Nvidia's Turing architecture. 
+The final knob we give is a control over how operations accumulate their result. For this, we have 
+a function, which maps operation types like `conv2d` to an accumulation datatype as well as an output 
+datatype. The output datatype is the type other operations down the line will likely ingest from the previous
+calculation while the accumulation datatype describes the size of buffer where the results are initially
+stored. For NVidia's tensor cores for example many operations accumulate in FP32 but have an output datatype
+of FP16. The default implementation will follow this guideline closely and will by default have all 
+operations output FP16 and accumulate in FP32 only if TVM supports mixed datatypes for that particular
+operation.
+
+# Reference-level explanation
+[reference-level-explanation]: #reference-level-explanation

Review comment:
       The M6G, for instance, also has FP16 instructions from the AArch64 ISA, including Advanced SIMD fixed length 128 bit vectors; perhaps also investigate the suitability of the representation to support more than one backend with this? I also expect that with SVE (the Scalable Vector Extension) you'd see FP16 instructions there as well, but that's a different kettle of fish.
   
   I also expect that in the uTVM case it would be interesting at a later date for the MVE instruction set, where FP16 is present as well. I'd like to look at whether there is a multiply-and-accumulate instruction there that matches with an FP32 accumulator. Offhand I'm not sure about the answer to that question.
   
   Just something to consider.
   
   My 2 cents 
   
   Ramana




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm-rfcs] comaniac commented on pull request #6: [RFC] [Relay] Automatic Mixed Precision Pass

Posted by GitBox <gi...@apache.org>.
comaniac commented on pull request #6:
URL: https://github.com/apache/tvm-rfcs/pull/6#issuecomment-893007973


   btw, according to #17, please update the RFC number in the file name to align with this PR number.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm-rfcs] tqchen commented on pull request #6: [RFC] [Relay] Automatic Mixed Precision Pass

Posted by GitBox <gi...@apache.org>.
tqchen commented on pull request #6:
URL: https://github.com/apache/tvm-rfcs/pull/6#issuecomment-886048408


   cc @comaniac would be great if you can help shepherd this RFC


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm-rfcs] comaniac commented on pull request #6: Automatic Mixed Precision Pass RFC

Posted by GitBox <gi...@apache.org>.
comaniac commented on pull request #6:
URL: https://github.com/apache/tvm-rfcs/pull/6#issuecomment-857950295


   > > Thanks for the answers. I'll review the PR to get more implementation details.
   > > One more question regarding the extensibility: can this be extended easily to support bfloat16?
   > 
   > It should be trivial (hope I don't eat my words). I'm not 100% sure of the support for bfloat16 in current relay ops however.
   
   TVM has limited bfloat16 support now but it's on the way, so it would be better for this RFC to also consider this case, even if the initial version may not cover it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm-rfcs] comaniac commented on a change in pull request #6: [RFC] [Relay] Automatic Mixed Precision Pass

Posted by GitBox <gi...@apache.org>.
comaniac commented on a change in pull request #6:
URL: https://github.com/apache/tvm-rfcs/pull/6#discussion_r690760136



##########
File path: rfcs/0001-AMP_pass.md
##########
@@ -0,0 +1,137 @@
+- Feature Name: Automatic Mixed Precision Pass
+- Start Date: 2021-06-08 
+- RFC PR: TODO
+- GitHub Issue: TODO
+
+# Summary
+[summary]: #summary
+
+Many pieces of hardware support operations not only on 32 bit floating point, but also on 16 bit floating point. 
+These 16 bit operations typically have higher theoretical throughput and use less memory bandwidth.
+As a result, we can see significant speedups from replacing normal 32 bit operations with their 16 bit analogs. 
+Surprisingly, for many operations this has little effect on the results, though some care must be taken when 
+changing operations. Some 16 bit floating point operations, such as `exp` and `log`, are considered less safe 
+due to loss of [numerical precision](https://on-demand.gputechconf.com/gtcdc/2019/pdf/dc91247-automatic-mixed-precision-in-tensorflow.pdf). 
+In general, for a function `f`, if `|f(x)| >> |x|` over the expected 
+range of inputs we probably do not want to use the 16 bit floating point version.
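+
+As a concrete illustration of why: the largest finite FP16 value is 65504, so even `exp(12)` (roughly
+162755) overflows when stored in 16 bits. A quick NumPy check, shown only for illustration:
+
+```python
+import numpy as np
+
+fp16_max = np.finfo(np.float16).max  # 65504, the largest finite FP16 value
+result = np.float16(np.exp(12.0))    # exp(12) ~ 162755 overflows to inf in FP16
+print(fp16_max, result)
+```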
+
+This feature will be a relay pass which automatically converts a 32 bit floating point model into a reduced bit 
+floating point analog. For the initial pass, IEEE's 16 bit floating point will be targeted, though future support
+for bfloat16 should be kept in mind.
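+
+From a user's perspective, applying the pass might look roughly like the sketch below; the pass name
+`ToMixedPrecision` and its argument are placeholders here, since the final interface is not yet settled:
+
+```python
+import tvm
+from tvm import relay
+
+def convert_to_fp16(mod):
+    # "mod" is an existing FP32 Relay module; the pass name below is a placeholder.
+    with tvm.transform.PassContext(opt_level=3):
+        return relay.transform.ToMixedPrecision("float16")(mod)
+```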
+
+# Motivation
+[motivation]: #motivation
+
+Many machine learning models can move significant portions of their computational graphs into the FP16 space 
+without significant loss of accuracy. For many pieces of hardware this also comes with a boost in speed. In 
+the past, utilizing FP16 in mixed precision training has produced significant [increases in convergence speed](https://pytorch.org/blog/accelerating-training-on-nvidia-gpus-with-pytorch-automatic-mixed-precision/). 
+
+We should expect similar increases for inference. This speed increase without accuracy loss is highly desirable
+for many users.
+
+# Guide-level explanation
+[guide-level-explanation]: #guide-level-explanation
+
+Operations are partitioned into colors denoted "Green", "Red", and "Gray" which represent the benefit 
+of using a reduced floating point version of the operation. "Green" operations are compute intensive
+and almost always see hardware memory and latency savings by utilizing a reduced floating point form.
+Examples of these operations are matrix multiplies and convolutions. "Gray" operations see little to 
+no savings in using reduced floating point forms -- at least not enough to justify the overhead of 
+casting values back and forth from FP32. "Red" operations meanwhile are operations we do not want to 
+use reduced floating point forms on, usually due to numerical precision reasons.
+
+In general we always want to insert casts into reduced floating point space for "Green" operations, 
+are fine with transforming "Gray" operations into reduced floating point space if their inputs are already
+in that form, and want to explicitly cast back into full floating point space for "Red" operations. 
+Each operation will be placed into one of these lists via a "coloring" function which takes in a Relay `CallNode`
+and returns a color. For example, we might have a function which colors a convolution "Green" only if it 
+has a large enough kernel and "Gray" otherwise. For the default implementation, however, we will keep things 
+simple and, for example, place all convolutions in the "Green" list, all element-wise operations in 
+the "Gray" list, and so on. Still, the code will be designed to be easily extensible by overriding 
+this "coloring" function.
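+
+Purely as an illustration, the simple default partitioning could be expressed as plain lists keyed by
+operator name; the exact contents below are hypothetical and would be tuned during implementation:
+
+```python
+# Illustrative defaults only; the real lists will be refined as the pass matures.
+DEFAULT_GREEN_OPS = ["nn.conv2d", "nn.conv3d", "nn.dense", "nn.batch_matmul"]
+DEFAULT_GRAY_OPS = ["add", "multiply", "nn.relu", "nn.max_pool2d", "transpose"]
+DEFAULT_RED_OPS = ["exp", "log", "nn.softmax"]
+```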
+
+The final consideration is that some hardware platforms which operate on reduced floating point types may, 
+for example, take two FP16 operands yet accumulate the result in a 32 bit buffer. An example of this is the 
+Tensor Cores in NVIDIA's Turing architecture. The final knob we expose is therefore control over how operations 
+accumulate their results. For this, we have a function which maps operation types like `conv2d` to an 
+accumulation datatype as well as an output datatype. The output datatype is the type that operations down the 
+line will ingest from the previous calculation, while the accumulation datatype describes the width of the 
+buffer where results are initially stored. For NVIDIA's Tensor Cores, for example, many operations accumulate 
+in FP32 but have an output datatype of FP16. The default implementation will follow this guideline closely: 
+by default all operations will output FP16 and will accumulate in FP32 only if TVM supports mixed datatypes 
+for that particular operation.
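+
+To make the accumulation behavior concrete, a rewritten convolution under these defaults could be built
+up by hand with the Relay Python API roughly as follows (a hand-written sketch of the expected output,
+not something the pass emits verbatim):
+
+```python
+from tvm import relay
+
+# FP16 inputs, FP32 accumulation via out_dtype, then an explicit cast back to
+# FP16 so that downstream operations consume the reduced precision type.
+x = relay.var("x", shape=(1, 16, 32, 32), dtype="float16")
+w = relay.var("w", shape=(32, 16, 3, 3), dtype="float16")
+conv = relay.nn.conv2d(x, w, kernel_size=(3, 3), out_dtype="float32")
+out = relay.cast(conv, "float16")
+```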
+
+# Reference-level explanation
+[reference-level-explanation]: #reference-level-explanation

Review comment:
       Yeah that would be sufficient for now. We will need BF16 support as well.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm-rfcs] AndrewZhaoLuo commented on pull request #6: [RFC] [Relay] Automatic Mixed Precision Pass

Posted by GitBox <gi...@apache.org>.
AndrewZhaoLuo commented on pull request #6:
URL: https://github.com/apache/tvm-rfcs/pull/6#issuecomment-901318475


   @comaniac, I'll be talking about this at the TVM community meeting tomorrow so put off merging until after.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm-rfcs] AndrewZhaoLuo commented on pull request #6: [RFC] [Relay] Automatic Mixed Precision Pass

Posted by GitBox <gi...@apache.org>.
AndrewZhaoLuo commented on pull request #6:
URL: https://github.com/apache/tvm-rfcs/pull/6#issuecomment-900596242


   PTAL @comaniac 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm-rfcs] AndrewZhaoLuo commented on pull request #6: [RFC] [Relay] Automatic Mixed Precision Pass

Posted by GitBox <gi...@apache.org>.
AndrewZhaoLuo commented on pull request #6:
URL: https://github.com/apache/tvm-rfcs/pull/6#issuecomment-893004650


   Going to get to this tomorrow 😬. Promise 🤞


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm-rfcs] AndrewZhaoLuo commented on a change in pull request #6: [RFC] [Relay] Automatic Mixed Precision Pass

Posted by GitBox <gi...@apache.org>.
AndrewZhaoLuo commented on a change in pull request #6:
URL: https://github.com/apache/tvm-rfcs/pull/6#discussion_r690679600



##########
File path: rfcs/0001-AMP_pass.md
##########
@@ -0,0 +1,137 @@
+- Feature Name: Automatic Mixed Precision Pass
+- Start Date: 2021-06-08 
+- RFC PR: TODO
+- GitHub Issue: TODO
+
+# Summary
+[summary]: #summary
+
+Many pieces of hardware support operations not only on 32 bit floating point, but also on 16 bit floating point. 
+These 16 bit operations typically have higher theoretical throughput and use less memory bandwidth.
+As a result, we can see significant speedups from replacing normal 32 bit operations with their 16 bit analogs. 
+Surprisingly, for many operations this has little effect on the results, though some care must be taken when 
+changing operations. Some 16 bit floating point operations, such as `exp` and `log`, are considered less safe 
+due to loss of [numerical precision](https://on-demand.gputechconf.com/gtcdc/2019/pdf/dc91247-automatic-mixed-precision-in-tensorflow.pdf). 
+In general, for a function `f`, if `|f(x)| >> |x|` over the expected 
+range of inputs we probably do not want to use the 16 bit floating point version.
+
+This feature will be a relay pass which automatically converts a 32 bit floating point model into a reduced bit 
+floating point analog. For the initial pass, IEEE's 16 bit floating point will be targeted, though future support
+for bfloat16 should be kept in mind.
+
+# Motivation
+[motivation]: #motivation
+
+Many machine learning models can move significant portions of their computational graphs into the FP16 space 
+without significant loss of accuracy. For many pieces of hardware this also comes with a boost in speed. In 
+the past, utilizing FP16 in mixed precision training has produced significant [increases in convergence speed](https://pytorch.org/blog/accelerating-training-on-nvidia-gpus-with-pytorch-automatic-mixed-precision/). 
+
+We should expect similar increases for inference. This speed increase without accuracy loss is highly desirable
+for many users.
+
+# Guide-level explanation
+[guide-level-explanation]: #guide-level-explanation
+
+Operations are partitioned into colors denoted "Green", "Red", and "Gray" which represent the benefit 
+of using a reduced floating point version of the operation. "Green" operations are compute intensive
+and almost always see hardware memory and latency savings by utilizing a reduced floating point form.
+Examples of these operations are matrix multiplies and convolutions. "Gray" operations see little to 
+no savings in using reduced floating point forms -- at least not enough to justify the overhead of 
+casting values back and forth from FP32. "Red" operations meanwhile are operations we do not want to 
+use reduced floating point forms on, usually due to numerical precision reasons.
+
+In general we always want to insert casts into reduced floating point space for "Green" operations, 
+are fine with transforming "Gray" operations into reduced floating point space if their inputs are already
+in that form, and want to explicitly cast back into full floating point space for "Red" operations. 
+Each operation will be placed into one of these lists via a "coloring" function which takes in a Relay `CallNode`
+and returns a color. For example, we might have a function which colors a convolution "Green" only if it 
+has a large enough kernel and "Gray" otherwise. For the default implementation, however, we will keep things 
+simple and, for example, place all convolutions in the "Green" list, all element-wise operations in 
+the "Gray" list, and so on. Still, the code will be designed to be easily extensible by overriding 
+this "coloring" function.
+
+The final consideration is that some hardware platforms which operate on reduced floating point types may, 
+for example, take two FP16 operands yet accumulate the result in a 32 bit buffer. An example of this is the 
+Tensor Cores in NVIDIA's Turing architecture. The final knob we expose is therefore control over how operations 
+accumulate their results. For this, we have a function which maps operation types like `conv2d` to an 
+accumulation datatype as well as an output datatype. The output datatype is the type that operations down the 
+line will ingest from the previous calculation, while the accumulation datatype describes the width of the 
+buffer where results are initially stored. For NVIDIA's Tensor Cores, for example, many operations accumulate 
+in FP32 but have an output datatype of FP16. The default implementation will follow this guideline closely: 
+by default all operations will output FP16 and will accumulate in FP32 only if TVM supports mixed datatypes 
+for that particular operation.
+
+# Reference-level explanation
+[reference-level-explanation]: #reference-level-explanation

Review comment:
       @jwfromm I believe we also had good results running mixed precision on M6g (graviton) instances?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org