Posted to dev@mxnet.apache.org by "Lv, Tao A" <ta...@intel.com> on 2019/04/09 08:46:31 UTC

[MXNET 2.0 Wishlist] [DISCUSS] Refine the InferStorageType and memory planning pass


Hi dev,



As we're discussing the roadmap for MXNet 2.0, I would like to start a thread about refining the InferStorageType and memory planning passes in MXNet, and I hope this can happen as part of the 2.0 release.



Thanks to @eric-haibin-lin, part of the proposal has already been discussed in issue #13598 [1].



As mentioned in the description of issue #13598, there are several drawbacks of the existing flow. Please allow me to quote them here:
*        The selection of the MKL/CPU/GPU/CUDNN implementation happens after graph attribute inference and memory planning. Memory planning is thus not aware of the implementation that will be used for execution later, which may lead to sub-optimal results. For example, the memory in-place option may vary depending on the accelerator backend (the new version of CUDNN enables x/dx in-place for _backward_conv).
*        Some sparse operators need to access dtype/shape information to decide which implementation to invoke for execution and whether to perform a fallback. This information is not yet exposed in the existing infer storage type interface (see the sketch below).
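
For reference, here is a rough paraphrase of the current storage type inference interface (written from memory, with stand-in types; please check include/mxnet/op_attr_types.h for the exact definition). The point is that it only sees storage types and the device mask, neither shapes nor dtypes:

    // Rough paraphrase of the existing interface, not the literal definition.
    // Stand-in types are used so the snippet is self-contained.
    #include <functional>
    #include <vector>

    struct NodeAttrs;  // stand-in for nnvm::NodeAttrs
    enum class DispatchMode { kUndefined, kFCompute, kFComputeEx, kFComputeFallback };

    // Storage type inference only receives storage types and the device mask,
    // so a sparse operator cannot use shapes/dtypes here to decide on fallback.
    using FInferStorageType = std::function<bool(const NodeAttrs& attrs,
                                                 const int dev_mask,
                                                 DispatchMode* dispatch_mode,
                                                 std::vector<int>* in_stypes,
                                                 std::vector<int>* out_stypes)>;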



Besides, the existing memory planning pass calculates and then allocates memory strictly according to the input/output tensor shapes (which can be obtained from operators' arithmetic formulas through InferShape). That's no longer true when we come to accelerators like MKL-DNN on CPU, which want to pad input/output tensors to optimal formats (e.g. nchw16c) according to the hardware architecture; this can also be described as shape + stride. As many of you know, MKL-DNN shows great performance on these optimal formats, which are blocked by the vector length of AVX-512 or AVX2. It's very natural for us to pad the channel dimension of those inputs/outputs whose IC or OC is not a multiple of the vector length and leverage the optimal kernels for blocked formats. Unfortunately, this cannot be implemented without changing the logic in the memory planning pass. Currently we always fall back to slow reference kernels for both convolution [2] and deconvolution [3].
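
To make the memory implication concrete, here is a minimal standalone sketch (plain C++, not MXNet or MKL-DNN code) of the padding arithmetic: the channel dimension is rounded up to the SIMD block size before the blocked layout can be used, so the planned buffer must be larger than what InferShape alone would suggest:

    // Minimal sketch of the padding arithmetic (illustration only).
    #include <cstddef>
    #include <iostream>

    // Round c up to the next multiple of the SIMD block (16 fp32 lanes for AVX-512).
    std::size_t PadChannels(std::size_t c, std::size_t block = 16) {
      return ((c + block - 1) / block) * block;
    }

    int main() {
      const std::size_t N = 32, C = 3, H = 256, W = 256;
      const std::size_t paddedC = PadChannels(C);  // 3 -> 16
      std::cout << "arithmetic size: " << N * C * H * W * sizeof(float) << " bytes\n"
                << "planned size:    " << N * paddedC * H * W * sizeof(float) << " bytes\n";
      return 0;
    }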



AFAIK, the padding feature of MKL-DNN has already been used in TensorFlow and other frameworks. We also found that, without support for this feature, many other new features of MKL-DNN cannot be applied to MXNet, such as the deconvolution primitive, Winograd convolution, etc.



Changes for this proposal can be divided into the following parts:
1.      Following the proposal in issue #13598, we need to add new InferStorageTypeEx functions to operators which need to dispatch in a more fine-grained way. This also requires the InferStorageType pass to handle the new -Ex function, as we did for FCompute and FComputeEx (see the sketch after this list).
2.      Attach more information to the computation graph/node, e.g. accelerator-specific information. Currently we add `IsMKLDNN` directly during operator registration if MXNET_USE_MKLDNN == 1, which looks rather crude to me.
3.      Do memory planning according to more information: topology, shapes, data types, in-place options, and more accurate accelerator information (accelerator path, memory size requirements, accelerator-wise attributes).
4.      Improve MKL-DNN operators so they can work on the planned memory, which may be larger than the arithmetic requirement, and use the optimal kernels. Also, with more accurate dispatching in InferStorageTypeEx, there is no need for us to write complicated fallback logic in MKL-DNN operators.
5.      If users feel uncomfortable with the extra memory usage, we can allow this feature to be disabled through an environment variable.
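
As a strawman for item 1, this is what the extended interface could look like. The names, arguments and the hint structure below are only my assumptions for the sake of discussion, not existing MXNet definitions; the point is that storage type inference would also see shapes and dtypes and could record accelerator-specific requirements for the memory planner:

    // Hypothetical sketch of an InferStorageTypeEx signature (assumption, not
    // a current MXNet API). Stand-in types keep the snippet self-contained.
    #include <cstddef>
    #include <cstdint>
    #include <functional>
    #include <vector>

    struct NodeAttrs;  // stand-in for nnvm::NodeAttrs
    enum class DispatchMode { kUndefined, kFCompute, kFComputeEx, kFComputeFallback };

    // Accelerator-specific hints the memory planning pass could consume,
    // e.g. the padded allocation size required by an MKL-DNN blocked format.
    struct AccelHint {
      std::vector<std::size_t> alloc_bytes;  // per-output allocation requirement
      std::vector<int> layout_tags;          // per-output layout, e.g. nchw16c
    };

    using FInferStorageTypeEx = std::function<bool(
        const NodeAttrs& attrs,
        const int dev_mask,
        const std::vector<std::vector<int64_t>>& in_shapes,  // new: shapes
        const std::vector<int>& in_dtypes,                   // new: dtypes
        DispatchMode* dispatch_mode,
        std::vector<int>* in_stypes,
        std::vector<int>* out_stypes,
        AccelHint* hint)>;                                    // new: info for memory planning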



Since the memory planning pass is implemented in NNVM, we also need support from the TVM community.



Please let me know what you think. Thank you.



-tao



[1] https://github.com/apache/incubator-mxnet/issues/13598

[2] https://github.com/apache/incubator-mxnet/blob/master/src/operator/nn/mkldnn/mkldnn_convolution.cc#L194

[3] https://github.com/apache/incubator-mxnet/blob/master/src/operator/nn/mkldnn/mkldnn_deconvolution.cc#L55


Re: [MXNET 2.0 Wishlist] [DISCUSS] Refine the InferStorageType and memory planning pass

Posted by Junru Shao <ju...@gmail.com>.
+1 for this proposal. Probably this is doable prior to 2.0?

While I totally agree with Tianqi that, in the sense of a compiler, we
should make layout transformation a separate pass, I would like to mention
that it will be a non-trivial engineering effort, given that our
current NNVM does not have a pass manager for optionally applying passes.
Moreover, I believe Tao's proposal is essentially equivalent to adding a new
pass in NNVM (but one with the same name).

By the way, making MKLDNN an accelerator is a nice proposal, which I
guess could be a wish for MXNet 2.0.


RE: [MXNET 2.0 Wishlist] [DISCUSS] Refine the InferStorageType and memory planning pass

Posted by "Zhao, Patric" <pa...@intel.com>.
BTW, "maintainability, testability and readability" has always been our design goal from the start of the MKL-DNN integration :)



Re: [MXNET 2.0 Wishlist] [DISCUSS] Refine the InferStorageType and memory planning pass

Posted by Sheng Zha <sz...@gmail.com>.
Relay is NNVM v2. The main difference between NNVM and Relay is that the latter can represent control flow in the graph. Translating the optimization pass suggested in this thread from NNVM to Relay should be straightforward. Given that, I'd also suggest starting early with NNVM.

-sz


RE: [MXNET 2.0 Wishlist] [DISCUSS] Refine the InferStorageType and memory planning pass

Posted by "Lv, Tao A" <ta...@intel.com>.
@Tianqi,

Thank you for the information. I will take a look at it to see whether we can take advantage of it.

@Junru,

The reason why we want to hold this change until 2.0 is that there is a discussion in the TVM community about deprecating NNVM soon, and then I think MXNet will have to move to a new IR, either NNVM v2 or Relay. As most changes in this proposal are related to IR passes, we definitely don't want to spend much effort on something that is being deprecated. 2.0 seems to be a more appropriate time for us to make these changes. But I agree with you that we can start doing some experiments on the existing architecture and the NNVM IR.

-tao


Re: [MXNET 2.0 Wishlist] [DISCUSS] Refine the InferStorageType and memory planning pass

Posted by Junru Shao <ju...@gmail.com>.
Agreed with Tianqi that we could have a better implementation once we have
better TVM NNVM v2 integration. For now, I believe we shouldn't block
the development of the Intel folks.


Re: [MXNET 2.0 Wishlist] [DISCUSS] Refine the InferStorageType and memory planning pass

Posted by Tianqi Chen <tq...@cs.washington.edu>.
Such a conversion can be viewed as an enhanced version of the
AlterOpLayout pass in TVM Relay.


RE: [MXNET 2.0 Wishlist] [DISCUSS] Refine the InferStorageType and memory planning pass

Posted by "Lv, Tao A" <ta...@intel.com>.
Thank you Tianqi and Sam for the kind suggestions.

@Tianqi,

Can you please point me to the code of this pass, or do you think anyone from the TVM community can help educate me on this? I'm very happy to learn from that.

Just one note: we are not only doing layout transformation but also want more memory for the layout transformation.
For example, (N=32, C=3, H=256, W=256) will be padded to (N=32, C=16, H=256, W=256) on the channel dimension, and then (N=32, C=16, H=256, W=256) will be converted to nchw16c so we can leverage the corresponding optimal computation kernels.
That's why we also need changes to the memory planning pass.
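
In case it helps to picture the blocked buffer, here is a small sketch of the nchw16c addressing (my own illustration, not MKL-DNN code): the padded channel dimension is split into blocks of 16, and the 16 channels of a block are stored contiguously for each (h, w) position:

    // Illustration only (not MKL-DNN code): linear offset of logical element
    // (n, c, h, w) in an nchw16c buffer whose channel dimension has already
    // been padded to a multiple of 16.
    #include <cstddef>

    constexpr std::size_t kBlock = 16;

    std::size_t OffsetNCHW16c(std::size_t n, std::size_t c, std::size_t h,
                              std::size_t w, std::size_t paddedC,
                              std::size_t H, std::size_t W) {
      const std::size_t cb = c / kBlock;             // which channel block
      const std::size_t ci = c % kBlock;             // lane inside the block
      const std::size_t nblocks = paddedC / kBlock;  // e.g. 16 / 16 = 1 when C=3 is padded to 16
      return (((n * nblocks + cb) * H + h) * W + w) * kBlock + ci;
    }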


@Sam,

Yes, definitely we're treating MKL-DNN as an accelerator on CPU. Previously we used it to accelerate certain critical operators in MXNet in certain situations, e.g. FP32 convolution/deconvolution/FullyConnected, etc. But as both MXNet and MKL-DNN evolve, we have started to do more things that are not supported by MXNet's original CPU implementation, such as quantization and graph fusion. So the MKL-DNN backend is also changing from a simple `accelerator` to a `default` backend on CPU. And I totally agree with you that we need to think more about the software architecture for maintainability, testability and readability - that's why I sent out this proposal to get more ideas from the community.


-tao

-----Original Message-----
From: Skalicky, Sam [mailto:sskalic@amazon.com.INVALID] 
Sent: Wednesday, April 10, 2019 2:24 AM
To: dev@mxnet.incubator.apache.org
Subject: Re: [MXNET 2.0 Wishlist] [DISCUSS] Refine the InferStorageType and memory planning pass

I agree with Tianqi. We should let MKLDNN participate in memory planning by first having a separate NNVM pass and then using that info in the regular memory planning phase.

It's starting to sound like MKLDNN should be treated as an accelerator rather than an operator library, as it has explicit needs and can provide acceleration when given extra capabilities in MXNet, such as input to the memory planning NNVM pass. It also has special tensor formatting needs and conversions that could be architected better than they currently are.

We need to think about how we want to architect this for maintainability, testability, and readability.

Sam


> On Apr 9, 2019, at 11:11 AM, Tianqi Chen <tq...@cs.washington.edu> wrote:
> 
> The layout transformation should really be a separate optimization
> pass rather than part of memory planning, as is done in the TVM stack. If we
> want to do a clean-slate solution, I would recommend looking into that instead.
> 
> Tianqi


Re: [MXNET 2.0 Wishlist] [DISCUSS] Refine the InferStorageType and memory planning pass

Posted by "Skalicky, Sam" <ss...@amazon.com.INVALID>.
I agree with Tianqi. We should let MKLDNN participate in memory planning by first having a separate NNVM pass and then using that info in the regular memory planning phase.

It's starting to sound like MKLDNN should be treated as an accelerator rather than an operator library, as it has explicit needs and can provide acceleration when given extra capabilities in MXNet, such as input to the memory planning NNVM pass. It also has special tensor formatting needs and conversions that could be architected better than they currently are.

We need to think about how we want to architect this for maintainability, testability, and readability.
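
To illustrate the idea, here is a minimal sketch in plain C++ (the structs and the pass/attribute names are hypothetical stand-ins, not the real NNVM API): an annotation pass runs first and records the padded byte requirement per node, and memory planning would then read that attribute instead of recomputing sizes from arithmetic shapes alone.

    #include <cstdint>
    #include <cstdio>
    #include <string>
    #include <unordered_map>
    #include <vector>

    // Hypothetical stand-ins for graph structures -- NOT the real NNVM API,
    // only a schematic of running an annotation pass before memory planning.
    struct NodeInfo {
      std::string op;               // e.g. "Convolution"
      std::vector<int64_t> shape;   // arithmetic output shape from InferShape
      bool prefers_blocked;         // would MKL-DNN pick a blocked (nchw16c) kernel?
    };

    struct ToyGraph {
      std::vector<NodeInfo> nodes;
      // graph attributes filled in by passes, keyed by attribute name
      std::unordered_map<std::string, std::vector<int64_t>> attrs;
    };

    static int64_t RoundUp(int64_t x, int64_t v) { return ((x + v - 1) / v) * v; }

    // "Annotation pass": compute the bytes each node really needs (padding the
    // channel dimension for blocked layouts) and attach the result to the graph.
    ToyGraph MKLDNNAnnotateMemory(ToyGraph g) {
      std::vector<int64_t> bytes;
      for (const NodeInfo& n : g.nodes) {
        int64_t elems = 1;
        for (size_t i = 0; i < n.shape.size(); ++i) {
          int64_t d = n.shape[i];
          if (n.prefers_blocked && i == 1) d = RoundUp(d, 16);  // pad channels
          elems *= d;
        }
        bytes.push_back(elems * 4);  // fp32
      }
      g.attrs["mkldnn_mem_bytes"] = bytes;
      return g;
    }

    int main() {
      ToyGraph g;
      g.nodes.push_back({"Convolution", {32, 3, 256, 256}, true});
      g = MKLDNNAnnotateMemory(g);
      // The memory planning phase would consume this attribute instead of the
      // size implied by the arithmetic shape alone.
      std::printf("planned bytes for node 0: %lld\n",
                  static_cast<long long>(g.attrs["mkldnn_mem_bytes"][0]));
      return 0;
    }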

Sam


> On Apr 9, 2019, at 11:11 AM, Tianqi Chen <tq...@cs.washington.edu> wrote:
> 
> The layout transformation should really be a separate optimization pass
> rather than part of memory planning, as is done in the TVM stack. If we want to do
> a clean-slate solution, I would recommend looking into that instead.
> 
> Tianqi


Re: [MXNET 2.0 Wishlist] [DISCUSS] Refine the InferStorageType and memory planning pass

Posted by Tianqi Chen <tq...@cs.washington.edu>.
The layout transformation should really be a separate optimization pass
rather than part of memory planning, as is done in the TVM stack. If we want to do
a clean-slate solution, I would recommend looking into that instead.

Tianqi
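
For concreteness, a minimal sketch of that idea in plain C++ (the toy IR, node names, and pass structure are hypothetical, not TVM's or NNVM's actual pass infrastructure): the layout pass rewrites the graph first, inserting explicit layout-conversion nodes around operators that prefer a blocked format, so every later pass, including memory planning, already sees the real layouts and padded shapes.

    #include <cstdio>
    #include <string>
    #include <vector>

    // Hypothetical toy IR: a graph is just an ordered list of ops, each tagged
    // with the layout it wants to compute in. NOT real NNVM/TVM data structures.
    struct ToyOp {
      std::string name;
      std::string layout;  // "nchw" or "nchw16c"
    };

    // Layout-transformation pass: insert explicit conversion nodes around ops
    // that prefer the blocked layout, before memory planning ever runs.
    std::vector<ToyOp> LayoutTransformPass(const std::vector<ToyOp>& ops) {
      std::vector<ToyOp> out;
      std::string current = "nchw";
      for (const ToyOp& op : ops) {
        if (op.layout != current) {
          out.push_back({"layout_convert(" + current + "->" + op.layout + ")",
                         op.layout});
          current = op.layout;
        }
        out.push_back(op);
      }
      if (current != "nchw")  // convert back to the framework-default layout
        out.push_back({"layout_convert(" + current + "->nchw)", "nchw"});
      return out;
    }

    int main() {
      std::vector<ToyOp> graph = {
          {"data", "nchw"}, {"conv0", "nchw16c"}, {"relu0", "nchw16c"},
          {"flatten", "nchw"}, {"fc0", "nchw"}};
      for (const ToyOp& op : LayoutTransformPass(graph))
        std::printf("%s\n", op.name.c_str());
      return 0;
    }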
