Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2018/12/11 20:40:30 UTC

[GitHub] marcoabreu edited a comment on issue #13598: More fine-grained operator implementation dispatch & memory planning flow

marcoabreu edited a comment on issue #13598: More fine-grained operator implementation dispatch & memory planning flow 
URL: https://github.com/apache/incubator-mxnet/issues/13598#issuecomment-446353433
 
 
   Thanks for your very good questions! 
   
   For the operator selection I would think about a design which has something similar to a "tuning" or warm-up stage which evaluates the different possibilities. Initially, since that revamp would be quite big and experimental, I would hardcode an order (e.g. CUDA->AMD->MKLDNN->CPU) which is then evaluated and certain backends dropped if they don't support that operator or they're simply not present. Later on, there would ideally be a benchmark step which evaluates the different possibilities and then chooses the most efficient representation of the graph. This evaluation would first start with simple benchmarks (with different strategies like memory footprint, power consumption, throughput, etc) of each operator backend and then in the next stage go one level higher and evaluate groups of operators (up to evaluating the entire graph) to accomodate for layout conversion and memcopy overhead.  In the last iteration, we would have a graph which is most efficienct, but also runnable on that hardware, for the requested graph.
   
   There are two ways I can think of in which backends could conflict:
   1. Mismatching memory layouts
   2. Impossible/unlikely combinations (CUDA & AMD HIP, or MKL & ARM)
   
   To solve number one, I would extend the design to not only abstract the operators, but also their memory layouts. In the same way as we would have an operator registry, we would have a memory layout registry where each backend announces its memory layouts (this could mean rearranging data or moving it to different memory locations such as GPU memory) as well as converters. Each operator implementation would specify a desired layout (most likely the one its backend registered itself). Now imagine you have a graph with three operators:
   ```
   Input -> Operator1_CUDA -> Operator2_MKL -> Operator3_MKL -> Output
   ```
   These three operators are from two entirely different backends and have their own implementations and memory layouts. During the initial analysis of the graph (this step comes after the optional graph optimization, and we assume the graph is final at that point), our engine would analyse the desired layout of each operator (in this case CUDA and MKL, but it could also go a level deeper, like CUDA_NHWC etc.) and then check whether adjacent operators are compatible. If they are not, the engine would request a converter from the memory layout registry. These converters would then be inserted into the graph, and the final graph would look as follows:
   ```
   Input -> Convert_Standard_CUDA -> Operator1_CUDA -> Convert_CUDA_MKL -> Operator2_MKL -> Operator3_MKL -> Convert_MKL_Standard -> Output
   ```
   This way, you always have compatibility between the different layouts, while neither the operators nor the engine have to care about the different backends, because the conversion happens in between. When an operator receives and outputs data, it can assume it is operating in its own "isolated" world. If adjacent operators are from the same backend and use the same layout, the conversion is skipped and a performance advantage is achieved.
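   A rough sketch of that insertion pass over a linear graph could look like this (the registry interface is just a placeholder for illustration):
   ```python
   # Illustrative sketch only -- not existing MXNet code.
   def insert_converters(ops, registry, boundary_layout="Standard"):
       """ops: list of (name, layout) pairs in execution order. Returns a new
       node list with converter nodes added wherever the layouts of consecutive
       nodes (or of a node and the graph boundary) do not match."""
       nodes = []
       current = boundary_layout                    # graph input arrives in the standard layout
       for name, layout in ops:
           if layout != current:
               nodes.append(registry.get_converter(current, layout))
           nodes.append(name)
           current = layout
       if current != boundary_layout:               # convert back for the graph output
           nodes.append(registry.get_converter(current, boundary_layout))
       return nodes
   
   # For the example above:
   # insert_converters([("Operator1", "CUDA"), ("Operator2", "MKL"), ("Operator3", "MKL")], registry)
   # -> [Convert_Standard_CUDA, "Operator1", Convert_CUDA_MKL, "Operator2", "Operator3", Convert_MKL_Standard]
   ```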
   Now at this point you could end up needing O(N^2) converters if you wanted a direct converter between every single pair of memory layouts. The trick here is to have a standard layout (which we basically already have, and which is used to input and output data from the graphs). Each memory layout has to register at least two converters: TO_STANDARD and FROM_STANDARD. This guarantees compatibility between backends where no direct conversion exists. Since this path requires two conversions (FROM_MEMLAYOUT1_TO_STANDARD and FROM_STANDARD_TO_MEMLAYOUT2), it adds some overhead but keeps compatibility high. For common cases, there would probably be direct converters. 
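   A minimal sketch of such a registry with the standard-layout fallback (again, the class and method names are only illustrative):
   ```python
   # Illustrative sketch only -- not existing MXNet code.
   class LayoutRegistry:
       def __init__(self):
           self._converters = {}                       # (src, dst) -> conversion function
   
       def register(self, src, dst, fn):
           self._converters[(src, dst)] = fn
   
       def get_converter(self, src, dst):
           direct = self._converters.get((src, dst))
           if direct is not None:
               return direct                           # cheap path: a direct converter exists
           # Fall back to the two mandatory converters via the standard layout.
           to_std = self._converters[(src, "Standard")]
           from_std = self._converters[("Standard", dst)]
           return lambda data: from_std(to_std(data))  # two conversions, but always available
   ```
   With N layouts this needs only 2N mandatory converters instead of a direct converter for every pair, at the cost of a second conversion whenever no direct converter was registered.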
   
   For the second case, where conflicting backends exist, they would simply be skipped during the evaluation stage when the engine checks whether an operator implementation is actually eligible. So if CUDA is not present, the CUDA implementation of an operator will simply not be considered for that graph.
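   Sketched out, that eligibility check is just a filter that runs before any tuning (the availability set is a stand-in for whatever the runtime detects at startup):
   ```python
   # Illustrative sketch only.
   AVAILABLE_BACKENDS = {"MKLDNN", "CPU"}        # e.g. no CUDA device or runtime found
   
   def eligible_candidates(candidates):
       """Drop implementations whose backend is not present on this machine."""
       return {b: impl for b, impl in candidates.items() if b in AVAILABLE_BACKENDS}
   ```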
   
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services