Posted to commits@tvm.apache.org by GitBox <gi...@apache.org> on 2021/10/04 22:33:31 UTC

[GitHub] [tvm-rfcs] areusch commented on a change in pull request #38: [RFC] Improved multi-target handling

areusch commented on a change in pull request #38:
URL: https://github.com/apache/tvm-rfcs/pull/38#discussion_r721652382



##########
File path: rfcs/00xx-improved-multi-target-handling.md
##########
@@ -0,0 +1,176 @@
+- Feature Name: improved-multi-target-handling
+- Start Date: 2021-09-20
+- RFC PR: [apache/tvm-rfcs#0000](https://github.com/apache/tvm-rfcs/pull/0000)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+
+# Summary
+[summary]: #summary
+
+TVM supports 'heterogeneous' execution, whereby primitive operators may be (sequentially) evaluated on more than
+one device (GPU, CPU, accelerator, etc). For the non-BYOC flow this works as follows:
+1. Relay programs may contain "on_device" annotations which specify that a sub-expression's result should
+   reside on a device with a given `DLDeviceType` (kDLCPU, kDLCUDA, etc).
+2. The device planning pass uses those annotations to decide on the unique device for every Relay sub-expression,
+   including every primitive operator call. Sub-expressions which are unconstrained are assigned to the 'default'
+   device. The pass then inserts "device_copy" operators whenever tensors need to cross device boundaries.
+3. The user/driver must also supply a list of `Target` objects. The compiler uses that list to build a `TargetMap`
+   from `DLDeviceType` to `Target` for all of those objects.
+4. Each call to a primitive operator for a particular `DLDeviceType` signals we need to compile ('lower') that
+   primitive for that device. The `Target` to use for that compilation is found from the `TargetMap`.
+
+This approach has 5 problems:

Review comment:
       5!
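
As context for the problems that follow: steps 3-4 above hinge on a map keyed by `DLDeviceType`. A minimal stand-alone sketch of that shape, using illustrative stand-in types rather than TVM's actual `TargetMap`:

```
#include <map>
#include <string>

// Illustrative stand-ins for the real TVM/dlpack types.
enum DLDeviceType { kDLCPU = 1, kDLCUDA = 2 };
using Target = std::string;  // e.g. "llvm -mcpu=cortex-a72"

int main() {
  // Step 3: at most one Target per DLDeviceType.
  std::map<DLDeviceType, Target> target_map;
  target_map[kDLCPU] = "llvm -mcpu=cortex-a72";  // 'big' core
  // A second CPU entry silently overwrites the first: kDLCPU alone
  // cannot distinguish the two clusters of a big.LITTLE system.
  target_map[kDLCPU] = "llvm -mcpu=cortex-a53";  // 'LITTLE' core
  // Step 4: lowering looks up the Target by device type.
  Target t = target_map.at(kDLCPU);
  (void)t;  // lowering would use t here
  return 0;
}
```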

##########
File path: rfcs/00xx-improved-multi-target-handling.md
##########
@@ -0,0 +1,176 @@
+- Feature Name: improved-multi-target-handling
+- Start Date: 2021-09-20
+- RFC PR: [apache/tvm-rfcs#0000](https://github.com/apache/tvm-rfcs/pull/0000)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+
+# Summary
+[summary]: #summary
+
+TVM supports 'heterogeneous' execution, whereby primitive operators may be (sequentially) evaluated on more than
+one device (GPU, CPU, accelerator, etc). For the non-BYOC flow this works as follows:
+1. Relay programs may contain "on_device" annotations which specify that a sub-expression's result should
+   reside on a device with a given `DLDeviceType` (kDLCPU, kDLCUDA, etc).
+2. The device planning pass uses those annotations to decide on the unique device for every Relay sub-expression,
+   including every primitive operator call. Sub-expressions which are unconstrained are assigned to the 'default'
+   device. The pass then inserts "device_copy" operators whenever tensors need to cross device boundaries.
+3. The user/driver must also supply a list of `Target` objects. The compiler uses that list to build a `TargetMap`
+   from `DLDeviceType` to `Target` for all of those objects.
+4. Each call to a primitive operator for a particular `DLDeviceType` signals we need to compile ('lower') that
+   primitive for that device. The `Target` to use for that compilation is found from the `TargetMap`.
+
+This approach has 5 problems:
+1. TVM is being targeted to environments with multiple CPUs (eg Arm 'Big.LITTLE') and multiple tensor-friendly
+   devices (eg a GPU as well as an accelerator such as Arm 'Ethos-U'). This means a `DLDeviceType` no longer
+   uniquely determines a `Target`.
+2. Though TVM's `Device` abstraction (an alias for `dlpack`'s `DLDevice`) is a pair of a `DLDeviceType` and an
+   arbitrary 'device id', TVM does not consistently plumb the device id through annotations, passes and operators.
+   Thus currently we cannot use 'device id' to distinguish, eg, two CPUs in the same system.
+3. The codebase still uses an older `target` and `target_host` convention for distinguishing the main `Target` for
+   primitive operators from the `Target` for residual tensor computation, shape computation, and (for AOT) the
+   overall Relay control-flow. There's a fair bit of 'target normalization' scattered throughout the codebase to
+   deal with these different conventions.
+4. `Target`s are often manufactured on-the-fly (eg to represent the default 'CPU' target on which shape computations
+   should be hosted). However there's no guarantee those default `Target`s will match up with the user-supplied
+   `Target`s, thus it's possible to end up with `"llvm"` and `"llvm -m ..."` `Target`s coexisting. Now that
+   `IRModule` uses `Target` objects themselves to distinguish which `PrimFunc`s are intended for which targets,
+   it is particularly important to ensure there's a single source of truth for available `Target`s.
+5. TVM also supports a 'BYOC' extension mechanism. This allows `"target.<target name>"` annotations to be placed on
+   primitive operations to indicate they should possibly be compiled with the matching BYOC toolchain. A target
+   annotation pass uses those annotations to decide on a target name for every Relay sub-expression. A partition graph
+   pass then inserts function call boundaries whenever execution needs to cross target boundaries. However this
+   machinery is separate from and incompatible with the "on_device" mechanism, and 'target names' are a separate
+   concept from `Target` objects.
+
+In this RFC we tackle problems 1-4. We won't directly take on 5 since it involves more moving parts, but our hope
+is for this RFC to clear the way to taking on 5 in the future.
+
+Our proposal is:
+1. Extend `Target` to have a `DLDeviceType` attribute.
+2. Allow `Target` objects to be registered under a globally unique target label. Registration may be 'static' (ie
+   built into the TVM compiler via another REGISTER macro) or 'dynamic' (ie injected for a particular run of the
+   compiler, eg as part of `tvmc` command line processing). (This machinery should be reconciled with the existing
+   CUDA-specific target registration map.)
+3. Change the "on_device" call attributes to use a string instead of an integer (ie a `DLDeviceType`). The string
+   can be of the form `<target label>` or `<target label>:<device id>`. The former simply implies a device id of 0.
+4. Rework device planning to use a pair of `Target` and 'device id' instead of `DLDeviceType`:
+   ```
+   class TargetDevice {
+    public:
+     Target target;
+     int device_id;
+   };
+   ```
+   (We could also use a `Device` and accept the redundant `DLDeviceType` specification.) It is trivial
+   to go from an "on_device" label to a `TargetDevice` and back using the global `Target` registry.
+5. Remove all uses of `TargetMap`. For example, in `LowerTEPass` we simply use the `TargetDevice` associated with

Review comment:
       do you propose any replacement in case we do need a map-like struct? `Map<target_label, Target>`?
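
Rough sketch of what such a label-keyed replacement could look like; the names here are illustrative, not a proposed API:

```
#include <string>
#include <unordered_map>
#include <utility>

struct Target { std::string spec; };  // stand-in for tvm::Target

// Hypothetical label-keyed registry standing in for the old
// DLDeviceType-keyed TargetMap.
class TargetRegistry {
 public:
  void Register(const std::string& label, Target target) {
    targets_[label] = std::move(target);
  }
  const Target& Lookup(const std::string& label) const {
    return targets_.at(label);  // labels are assumed globally unique
  }

 private:
  std::unordered_map<std::string, Target> targets_;
};

// Two distinct CPU targets can now coexist, which the
// DLDeviceType-keyed map could not express:
//   reg.Register("big_cpu",    Target{"llvm -mcpu=cortex-a72"});
//   reg.Register("little_cpu", Target{"llvm -mcpu=cortex-a53"});
```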

##########
File path: rfcs/00xx-improved-multi-target-handling.md
##########
@@ -0,0 +1,176 @@
+- Feature Name: improved-multi-target-handling
+- Start Date: 2021-09-20
+- RFC PR: [apache/tvm-rfcs#0000](https://github.com/apache/tvm-rfcs/pull/0000)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+
+# Summary
+[summary]: #summary
+
+TVM supports 'heterogeneous' execution, whereby primitive operators may be (sequentially) evaluated on more than
+one device (GPU, CPU, accelerator, etc). For the non-BYOC flow this works as follows:
+1. Relay programs may contain "on_device" annotations which specify that a sub-expression's result should
+   reside on a device with a given `DLDeviceType` (kDLCPU, kDLCUDA, etc).
+2. The device planning pass uses those annotations to decide on the unique device for every Relay sub-expression,
+   including every primitive operator call. Sub-expressions which are unconstrained are assigned to the 'default'
+   device. The pass then inserts "device_copy" operators whenever tensors need to cross device boundaries.
+3. The user/driver must also supply a list of `Target` objects. The compiler uses that list to build a `TargetMap`
+   from `DLDeviceType` to `Target` for all of those objects.
+4. Each call to a primitive operator for a particular `DLDeviceType` signals we need to compile ('lower') that
+   primitive for that device. The `Target` to use for that compilation is found from the `TargetMap`.
+
+This approach has 5 problems:
+1. TVM is being targeted to environments with multiple CPUs (eg Arm 'Big.LITTLE') and multiple tensor-friendly
+   devices (eg a GPU as well as an accelerator such as Arm 'Ethos-U'). This means a `DLDeviceType` no longer
+   uniquely determines a `Target`.
+2. Though TVM's `Device` abstraction (an alias for `dlpack`'s `DLDevice`) is a pair of a `DLDeviceType` and an
+   arbitrary 'device id', TVM does not consistently plumb the device id through annotations, passes and operators.
+   Thus currently we cannot use 'device id' to distinguish, eg, two CPUs in the same system.
+3. The codebase still uses an older `target` and `target_host` convention for distinguishing the main `Target` for
+   primitive operators from the `Target` for residual tensor computation, shape computation, and (for AOT) the
+   overall Relay control-flow. There's a fair bit of 'target normalization' scattered throughout the codebase to
+   deal with these different conventions.
+4. `Target`s are often manufactured on-the-fly (eg to represent the default 'CPU' target on which shape computations
+   should be hosted). However there's no guarantee those default `Target`s will match up with the user-supplied
+   `Target`s, thus it's possible to end up with `"llvm"` and `"llvm -m ..."` `Target`s coexisting. Now that
+   `IRModule` uses `Target` objects themselves to distinguish which `PrimFunc`s are intended for which targets,
+   it is particularly important to ensure there's a single source of truth for available `Target`s.
+5. TVM also supports a 'BYOC' extension mechanism. This allows `"target.<target name>"` annotations to be placed on
+   primitive operations to indicate they should possibly be compiled with the matching BYOC toolchain. A target
+   annotation pass uses those annotations to decide on a target name for every Relay sub-expression. A partition graph
+   pass then inserts function call boundaries whenever execution needs to cross target boundaries. However this
+   machinery is separate from and incompatible with the "on_device" mechanism, and 'target names' are a separate
+   concept from `Target` objects.
+
+In this RFC we tackle problems 1-4. We won't directly take on 5 since it involves more moving parts, but our hope
+is for this RFC to clear the way to taking on 5 in the future.
+
+Our proposal is:
+1. Extend `Target` to have a `DLDeviceType` attribute.

Review comment:
       as i'm reading this first time: curious how this will evolve when we take on 5 above.
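
As a strawman only (not the RFC's actual `Target` schema), the attribute from point 1 might look like the following; presumably a future take on problem 5 would extend this with BYOC compiler information:

```
#include <string>

enum DLDeviceType { kDLCPU = 1, kDLCUDA = 2 };  // stand-in for dlpack

// Strawman: a Target that carries its DLDeviceType, so the Target
// alone says which class of runtime device it lowers for.
struct Target {
  std::string spec;          // e.g. "llvm -mcpu=cortex-a72"
  DLDeviceType device_type;  // point 1: the new attribute
};

// Two CPU targets both report kDLCPU yet remain distinct objects,
// registered under distinct labels.
```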

##########
File path: rfcs/00xx-improved-multi-target-handling.md
##########
@@ -0,0 +1,176 @@
+- Feature Name: improved-multi-target-handling
+- Start Date: 2021-09-20
+- RFC PR: [apache/tvm-rfcs#0000](https://github.com/apache/tvm-rfcs/pull/0000)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+
+# Summary
+[summary]: #summary
+
+TVM supports 'heterogeneous' execution, whereby primitive operators may be (sequentially) evaluated on more than
+one device (GPU, CPU, accelerator, etc). For the non-BYOC flow this works as follows:
+1. Relay programs may contain "on_device" annotations which specify that a sub-expression's result should
+   reside on a device with a given `DLDeviceType` (kDLCPU, kDLCUDA, etc).
+2. The device planning pass uses those annotations to decide on the unique device for every Relay sub-expression,
+   including every primitive operator call. Sub-expressions which are unconstrained are assigned to the 'default'
+   device. The pass then inserts "device_copy" operators whenever tensors need to cross device boundaries.
+3. The user/driver must also supply a list of `Target` objects. The compiler uses that list to build a `TargetMap`
+   from `DLDeviceType` to `Target` for all of those objects.
+4. Each call to a primitive operator for a particular `DLDeviceType` signals we need to compile ('lower') that
+   primitive for that device. The `Target` to use for that compilation is found from the `TargetMap`.
+
+This approach has 5 problems:
+1. TVM is being targeted to environments with multiple CPUs (eg Arm 'Big.LITTLE') and multiple tensor-friendly
+   devices (eg a GPU as well as an accelerator such as Arm 'Ethos-U'). This means a `DLDeviceType` no longer
+   uniquely determines a `Target`.
+2. Though TVM's `Device` abstraction (an alias for `dlpack`'s `DLDevice`) is a pair of a `DLDeviceType` and an
+   arbitrary 'device id', TVM does not consistently plumb the device id through annotations, passes and operators.
+   Thus currently we cannot use 'device id' to distinguish, eg, two CPUs in the same system.
+3. The codebase still uses an older `target` and `target_host` convention for distinguishing the main `Target` for
+   primitive operators from the `Target` for residual tensor computation, shape computation, and (for AOT) the
+   overall Relay control-flow. There's a fair bit of 'target normalization' scattered throughout the codebase to
+   deal with these different conventions.
+4. `Target`s are often manufactured on-the-fly (eg to represent the default 'CPU' target on which shape computations
+   should be hosted). However there's no guarantee those default `Target`s will match up with the user-supplied
+   `Target`s, thus it's possible to end up with `"llvm"` and `"llvm -m ..."` `Target`s coexisting. Now that
+   `IRModule` uses `Target` objects themselves to distinguish which `PrimFunc`s are intended for which targets,
+   it is particularly important to ensure there's a single source of truth for available `Target`s.
+5. TVM also supports a 'BYOC' extension mechanism. This allows `"target.<target name>"` annotations to be placed on
+   primitive operations to indicate they should possibly be compiled with the matching BYOC toolchain. A target
+   annotation pass uses those annotations to decide on a target name for every Relay sub-expression. A partition graph
+   pass then inserts function call boundaries whenever execution needs to cross target boundaries. However this
+   machinery is separate from and incompatible with the "on_device" mechanism, and 'target names' are a separate
+   concept from `Target` objects.
+
+In this RFC we tackle problems 1-4. We won't directly take on 5 since it involves more moving parts, but our hope
+is for this RFC to clear the way to taking on 5 in the future.
+
+Our proposal is:
+1. Extend `Target` to have a `DLDeviceType` attribute.
+2. Allow `Target` objects to be registered under a globally unique target label. Registration may be 'static' (ie
+   built into the TVM compiler via another REGISTER macro) or 'dynamic' (ie injected for a particular run of the
+   compiler, eg as part of `tvmc` command line processing). (This machinery should be reconciled with the existing
+   CUDA-specific target registration map.)
+3. Change the "on_device" call attributes to use a string instead of an integer (ie a `DLDeviceType`). The string
+   can be of the form `<target label>` or `<target label>:<device id>`. The former simply implies a device id of 0.
+4. Rework device planning to use a pair of `Target` and 'device id' instead of `DLDeviceType`:
+   ```
+   class TargetDevice {
+    public:
+     Target target;
+     int device_id;
+   };
+   ```
+   (We could also use a `Device` and accept the redundant `DLDeviceType` specification.) It is trivial
+   to go from an "on_device" label to a `TargetDevice` and back using the global `Target` registry.
+5. Remove all uses of `TargetMap`. For example, in `LowerTEPass` we simply use the `TargetDevice` associated with
+   every primitive operator call already found by device planning.
+6. Bind two `TargetDevice`s as attributes on every `IRModule`:
+    - The default for primitive operators not otherwise constrained by "on_device" annotations.

Review comment:
       and it seems like we could call this `target_executor` e.g. describing the Target where the executor should run. we should formally notate that now that we have an AOT flow.
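
A rough sketch of the two module-level defaults from point 6 with the suggested name; the field names are illustrative, not a proposed API:

```
#include <string>

struct Target { std::string spec; };  // stand-in for tvm::Target
struct TargetDevice {
  Target target;
  int device_id;
};

// Strawman for the two IRModule-level defaults in point 6.
struct ModuleTargetDefaults {
  // Default for primitive operators without "on_device" annotations.
  TargetDevice default_primitive;
  // Default for everything else: Relay control flow, shape
  // computation, and (for AOT) the executor itself.
  TargetDevice target_executor;
};
```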

##########
File path: rfcs/00xx-improved-multi-target-handling.md
##########
@@ -0,0 +1,176 @@
+- Feature Name: improved-multi-target-handling
+- Start Date: 2021-09-20
+- RFC PR: [apache/tvm-rfcs#0000](https://github.com/apache/tvm-rfcs/pull/0000)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+
+# Summary
+[summary]: #summary
+
+TVM supports 'heterogeneous' execution, whereby primitive operators may be (sequentially) evaluated on more than
+one device (GPU, CPU, accelerator, etc). For the non-BYOC flow this works as follows:
+1. Relay programs may contain "on_device" annotations which specify that a sub-expression's result should
+   reside on a device with a given `DLDeviceType` (kDLCPU, kDLCUDA, etc).
+2. The device planning pass uses those annotations to decide on the unique device for every Relay sub-expression,
+   including every primitive operator call. Sub-expressions which are unconstrained are assigned to the 'default'
+   device. The pass then inserts "device_copy" operators whenever tensors need to cross device boundaries.
+3. The user/driver must also supply a list of `Target` objects. The compiler uses that list to build a `TargetMap`
+   from `DLDeviceType` to `Target` for all of those objects.
+4. Each call to a primitive operator for a particular `DLDeviceType` signals we need to compile ('lower') that
+   primitive for that device. The `Target` to use for that compilation is found from the `TargetMap`.
+
+This approach has 5 problems:
+1. TVM is being targeted to environments with multiple CPUs (eg Arm 'Big.LITTLE') and multiple tensor-friendly
+   devices (eg a GPU as well as an accelerator such as Arm 'Ethos-U'). This means a `DLDeviceType` no longer
+   uniquely determines a `Target`.
+2. Though TVM's `Device` abstraction (an alias for `dlpack`'s `DLDevice`) is a pair of a `DLDeviceType` and an
+   arbitrary 'device id', TVM does not consistently plumb the device id through annotations, passes and operators.
+   Thus currently we cannot use 'device id' to distinguish, eg, two CPUs in the same system.
+3. The codebase still uses an older `target` and `target_host` convention for distinguishing the main `Target` for
+   primitive operators from the `Target` for residual tensor computation, shape computation, and (for AOT) the
+   overall Relay control-flow. There's a fair bit of 'target normalization' scattered throughout the codebase to
+   deal with these different conventions.
+4. `Target`s are often manufactured on-the-fly (eg to represent the default 'CPU' target on which shape computations
+   should be hosted). However there's no guarantee those default `Target`s will match up with the user-supplied
+   `Target`s, thus it's possible to end up with `"llvm"` and `"llvm -m ..."` `Target`s coexisting. Now that
+   `IRModule` uses `Target` objects themselves to distinguish which `PrimFunc`s are intended for which targets,
+   it is particularly important to ensure there's a single source of truth for available `Target`s.
+5. TVM also supports a 'BYOC' extension mechanism. This allows `"target.<target name>"` annotations to be placed on
+   primitive operations to indicate they should possibly be compiled with the matching BYOC toolchain. A target
+   annotation pass uses those annotations to decide on a target name for every Relay sub-expression. A partition graph
+   pass then inserts function call boundaries whenever execution needs to cross target boundaries. However this
+   machinery is separate from and incompatible with the "on_device" mechanism, and 'target names' are a separate
+   concept from `Target` objects.
+
+In this RFC we tackle problems 1-4. We won't directly take on 5 since it involves more moving parts, but our hope
+is for this RFC to clear the way to taking on 5 in the future.
+
+Our proposal is:
+1. Extend `Target` to have a `DLDeviceType` attribute.
+2. Allow `Target` objects to be registered under a globally unique target label. Registration may be 'static' (ie

Review comment:
       i'm curious about static vs dynamic Targets. is this the distinction we want to point out here, or do we just want to say there are Target aliases which may be mapped to pre-canned Targets?
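
For concreteness, a hypothetical sketch of the two registration paths (the macro and registry names are made up): either way the result is just an entry in one label-to-`Target` map, which is why 'aliases mapped to pre-canned Targets' may indeed be the simpler framing:

```
#include <string>
#include <unordered_map>
#include <utility>

struct Target { std::string spec; };  // stand-in for tvm::Target

std::unordered_map<std::string, Target>& GlobalTargets() {
  static std::unordered_map<std::string, Target> registry;
  return registry;
}

// 'Dynamic' registration: called at runtime, e.g. while tvmc
// processes its command line.
void RegisterTarget(const std::string& label, Target t) {
  GlobalTargets()[label] = std::move(t);
}

// 'Static' registration: a made-up macro running at program start,
// in the spirit of TVM's other REGISTER macros.
#define REGISTER_TARGET(label, spec) \
  static const bool label##_registered = \
      (RegisterTarget(#label, Target{spec}), true);

REGISTER_TARGET(big_cpu, "llvm -mcpu=cortex-a72")
```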

##########
File path: rfcs/00xx-improved-multi-target-handling.md
##########
@@ -0,0 +1,176 @@
+- Feature Name: improved-multi-target-handling
+- Start Date: 2021-09-20
+- RFC PR: [apache/tvm-rfcs#0000](https://github.com/apache/tvm-rfcs/pull/0000)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+
+# Summary
+[summary]: #summary
+
+TVM supports 'heterogeneous' execution, whereby primitive operators may be (sequentially) evaluated on more than
+one device (GPU, CPU, accelerator, etc). For the non-BYOC flow this works as follows:
+1. Relay programs may contain "on_device" annotations which specify that a sub-expression's result should
+   reside on a device with a given `DLDeviceType` (kDLCPU, kDLCUDA, etc).
+2. The device planning pass uses those annotations to decide on the unique device for every Relay sub-expression,
+   including every primitive operator call. Sub-expressions which are unconstrained are assigned to the 'default'
+   device. The pass then inserts "device_copy" operators whenever tensors need to cross device boundaries.
+3. The user/driver must also supply a list of `Target` objects. The compiler uses that list to build a `TargetMap`
+   from `DLDeviceType` to `Target` for all of those objects.
+4. Each call to a primitive operator for a particular `DLDeviceType` signals we need to compile ('lower') that
+   primitive for that device. The `Target` to use for that compilation is found from the `TargetMap`.
+
+This approach has 5 problems:
+1. TVM is being targeted to environments with multiple CPUs (eg Arm 'Big.LITTLE') and multiple tensor-friendly
+   devices (eg a GPU as well as an accelerator such as Arm 'Ethos-U'). This means a `DLDeviceType` no longer
+   uniquely determines a `Target`.
+2. Though TVM's `Device` abstraction (an alias for `dlpack`'s `DLDevice`) is a pair of a `DLDeviceType` and an
+   arbitrary 'device id', TVM does not consistently plumb the device id through annotations, passes and operators.
+   Thus currently we cannot use 'device id' to distinguish, eg, two CPUs in the same system.
+3. The codebase still uses an older `target` and `target_host` convention for distinguishing the main `Target` for
+   primitive operators from the `Target` for residual tensor computation, shape computation, and (for AOT) the
+   overall Relay control-flow. There's a fair bit of 'target normalization' scattered throughout the codebase to
+   deal with these different conventions.
+4. `Target`s are often manufactured on-the-fly (eg to represent the default 'CPU' target on which shape computations
+   should be hosted). However there's no guarantee those default `Target`s will match up with the user-supplied
+   `Target`s, thus it's possible to end up with `"llvm"` and `"llvm -m ..."` `Target`s coexisting. Now that
+   `IRModule` uses `Target` objects themselves to distinguish which `PrimFunc`s are intended for which targets,
+   it is particularly important to ensure there's a single source of truth for available `Target`s.
+5. TVM also supports a 'BYOC' extension mechanism. This allows `"target.<target name>"` annotations to be placed on
+   primitive operations to indicate they should possibly be compiled with the matching BYOC toolchain. A target
+   annotation pass uses those annotations to decide on a target name for every Relay sub-expression. A partition graph
+   pass then inserts function call boundaries whenever execution needs to cross target boundaries. However this
+   machinery is separate from and incompatible with the "on_device" mechanism, and 'target names' are a separate
+   concept from `Target` objects.
+
+In this RFC we tackle problems 1-4. We won't directly take on 5 since it involves more moving parts, but our hope
+is for this RFC to clear the way to taking on 5 in the future.
+
+Our proposal is:
+1. Extend `Target` to have a `DLDeviceType` attribute.
+2. Allow `Target` objects to be registered under a globally unique target label. Registration may be 'static' (ie
+   built into the TVM compiler via another REGISTER macro) or 'dynamic' (ie injected for a particular run of the
+   compiler, eg as part of `tvmc` command line processing). (This machinery should be reconciled with the existing
+   CUDA-specific target registration map.)
+3. Change the "on_device" call attributes to use a string instead of an integer (ie a `DLDeviceType`). The string
+   can be of the form `<target label>` or `<target label>:<device id>`. The former simply implies a device id of 0.
+4. Rework device planning to use a pair of `Target` and 'device id' instead of `DLDeviceType`:
+   ```
+   class TargetDevice {
+    public:
+     Target target;
+     int device_id;
+   };
+   ```
+   (We could also use a `Device` and accept the redundant `DLDeviceType` specification.) It is trivial
+   to go from an "on_device" label to a `TargetDevice` and back using the global `Target` registry.
+5. Remove all uses of `TargetMap`. For example, in `LowerTEPass` we simply use the `TargetDevice` associated with
+   every primitive operator call already found by device planning.
+6. Bind two `TargetDevice`s as attributes on every `IRModule`:
+    - The default for primitive operators not otherwise constrained by "on_device" annotations.
+    - The default for non-primitive operators, such as Relay control flow and shape computation.
+7. Remove the various copies of target/target_host reconciliation, `TargetMap`
+   construction and 'default/fallback' device calculation from the codebase.
+
+This proposal tackles the original problems:
+1. There's now no ambiguity about `Target`s since we propagate them from the global registry directly.
+2. We support device ids.

Review comment:
       it would be great to clarify the distinction between device IDs and target_label. Device ID is very much a runtime thing and we can't always control which GPU, e.g., gets mapped to 0 or 1.
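
For what it's worth, the label/id split in point 3 is purely syntactic; a strawman parser, illustrative only:

```
#include <cstdlib>
#include <string>
#include <utility>

// Strawman parser for the "on_device" string of point 3:
// "<target label>" or "<target label>:<device id>".
// A bare label implies device id 0; which physical device that id
// names is still resolved by the runtime, per the comment above.
std::pair<std::string, int> ParseOnDeviceLabel(const std::string& s) {
  auto colon = s.rfind(':');
  if (colon == std::string::npos) return {s, 0};
  return {s.substr(0, colon), std::atoi(s.c_str() + colon + 1)};
}

// ParseOnDeviceLabel("big_cpu") -> {"big_cpu", 0}
// ParseOnDeviceLabel("cuda:1")  -> {"cuda", 1}
```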



