Posted to dev@tvm.apache.org by Cody Hao Yu <no...@github.com> on 2019/10/24 00:16:18 UTC

[dmlc/tvm] [RFC][AutoTVM] Selective Tuning (#4188)

Overview
---------
When a user wants to use AutoTVM to tune a model, she often lets AutoTVM tune every task extracted from the model sequentially. Assuming each task requires an hour or so, tuning a model with 10 to 100+ tasks takes days. This RFC proposes a lightweight solution that reduces the tuning time for a model while still achieving decent model performance.

We achieve this goal by proposing a selective tuning mechanism for AutoTVM. It is mainly composed of two parts: task selection and schedule sharing.

Task Selection
--------------
Task selection selects a few representative tasks from all tasks in a model. The idea behind it is that if a schedule works well for a conv2d on a certain platform, it may also work for another conv2d on the same platform. This implies that we can identify a few representative tasks and apply their top schedules to the other tasks, saving the tuning time of those tasks.

The task selection algorithm is based on the similarity rate (SM). The similarity rate is defined as the ratio of tuning space overlap between two tasks. By computing the similarity rate between every pair of tasks, we create a pairwise similarity matrix (PSM). With the PSM, we create a graph with tasks as nodes and SMs as edge weights, and then find all maximal cliques in the graph to cluster the tasks. For each cluster, we select the task with the highest sum of edge weights to all other tasks in the same cluster.
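
To make the clustering concrete, here is a hedged, dependency-free sketch of the steps above (the names and data modeling are assumptions: tuning spaces are plain config sets, and a small Bron-Kerbosch search stands in for the clique enumeration that the actual implementation delegates to networkx):

```python
import itertools

def similarity_rate(space_a, space_b):
    """SM: ratio of overlap between two tuning spaces (modeled as sets)."""
    return len(space_a & space_b) / len(space_a | space_b)

def maximal_cliques(adj):
    """Bron-Kerbosch maximal-clique enumeration over an adjacency dict."""
    found = []
    def expand(r, p, x):
        if not p and not x:
            found.append(r)
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p.discard(v)
            x.add(v)
    expand(set(), set(adj), set())
    return found

def select_representatives(spaces, threshold=0.5):
    """Cluster tasks by maximal cliques; pick one representative per clique."""
    adj = {t: set() for t in spaces}
    weight = {}
    for a, b in itertools.combinations(spaces, 2):
        sm = similarity_rate(spaces[a], spaces[b])
        if sm >= threshold:  # connect sufficiently similar tasks
            adj[a].add(b)
            adj[b].add(a)
            weight[a, b] = weight[b, a] = sm
    reps = []
    for clique in maximal_cliques(adj):
        # keep the task with the highest total similarity to its peers
        reps.append(max(clique, key=lambda t: sum(
            weight.get((t, o), 0.0) for o in clique if o != t)))
    return reps

spaces = {
    "conv2d_a": {"t1", "t2", "t3", "t4"},
    "conv2d_b": {"t2", "t3", "t4", "t5"},  # 60% overlap with conv2d_a
    "dense_c":  {"u1", "u2", "u3"},        # disjoint tuning space
}
print(select_representatives(spaces))  # one conv2d representative plus dense_c
```

With these toy spaces, the two conv2d tasks form one cluster and `dense_c` its own, so only two of the three tasks would actually be tuned.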

The API of using task selection is straightforward. A user only needs to call `mark_depend` after tasks have been created:

```python
tasks = autotvm.task.extract_from_program(...)
autotvm.task.mark_depend(tasks)
```

After the function call, unselected tasks will have an attribute `depend` referring to the selected task.
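
As a toy illustration (with a stand-in class, not the real autotvm task objects), the `depend` field partitions tasks into representatives and followers:

```python
class Task:
    """Stand-in for an AutoTVM task; only the `depend` field matters here."""
    def __init__(self, name):
        self.name = name
        self.depend = self  # a selected (representative) task refers to itself

a, b, c = Task("conv2d_a"), Task("conv2d_b"), Task("dense_c")
b.depend = a  # suppose mark_depend decided b can reuse a's schedules

selected = [t.name for t in (a, b, c) if t.depend == t]
others = [t.name for t in (a, b, c) if t.depend != t]
print(selected)  # -> ['conv2d_a', 'dense_c']
print(others)    # -> ['conv2d_b']
```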

Schedule Sharing
------------------
We follow the current AutoTVM process to tune the representative tasks first, and then tune the other tasks. When tuning the other tasks, a user can specify how many schedules should be shared from the dependent task. For example, `top3` means sharing the best 3 schedules; `5%` means sharing the best 5% of schedules.
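
For illustration, such a sharing spec could be resolved to a concrete number of schedules like this (a minimal sketch; the function name is an assumption, not the actual AutoTVM API):

```python
def num_shared_schedules(spec, space_size):
    """Resolve a "topN" or "N%" sharing spec against a tuning-space size."""
    if spec.startswith("top"):
        return min(int(spec[3:]), space_size)
    if spec.endswith("%"):
        return max(1, space_size * int(spec[:-1]) // 100)
    raise ValueError("unrecognized sharing spec: " + spec)

print(num_shared_schedules("top3", 1000))  # -> 3
print(num_shared_schedules("5%", 1000))    # -> 50
```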

An example of tuning tasks after selection is shown as follows:

```python
sel_tasks = [t for t in tasks if t.depend == t]
other_tasks = [t for t in tasks if t.depend != t]

# the first pass tunes the representative tasks
for i, tsk in enumerate(reversed(sel_tasks)):
    prefix = "[Sel Task %2d/%2d] " %(i+1, len(sel_tasks))
    tuner_obj = create_tuner(tuner)

    # do tuning
    curr_trial = min(n_trial, len(tsk.config_space))
    tuner_obj.tune(n_trial=curr_trial,
                   early_stopping=early_stopping,
                   measure_option=measure_option,
                   callbacks=[
                       autotvm.callback.progress_bar(curr_trial, prefix=prefix),
                       autotvm.callback.log_to_file(log_file)])

# the second pass tunes the rest tasks
for i, tsk in enumerate(reversed(other_tasks)):
    prefix = "[Other Task %2d/%2d] " %(i+1, len(other_tasks))
    tuner_obj = create_tuner(tuner)

    # do tuning
    curr_trial = n_trial
    tuner_obj.tune(n_trial=curr_trial,
                   depend_mode='top10', # share the 10 best schedules
                   early_stopping=early_stopping,
                   measure_option=measure_option,
                   callbacks=[
                       autotvm.callback.progress_bar(10, prefix=prefix),
                       autotvm.callback.log_to_file(log_file)])
```

In the above example, the second loop tunes the unselected tasks. Since the tuned schedules are cached in the selected tasks, the tuner uses those schedules as the tuning space, whose size is 10 in this example.

This mechanism has two important advantages:

1. It is fully compatible with the current AutoTVM use case. Existing AutoTVM scripts still work without any issue.

2. Even if an unselected task fails to achieve decent performance with the shared schedules, users can still re-tune it with the normal AutoTVM process. This costs little, because the time spent on the shared schedules is just minutes.

Results
-------

Here are the experimental results of using selective tuning for a set of models on an EC2 P3 instance. We evaluated 7 models from the Gluon CV model zoo, tuning the selected tasks for 2,000 trials each; the results are shown below. Selective tuning achieves **on average 113% of the baseline performance while using only 28% of the tuning time**. We believe it can outperform the original AutoTVM because the generated tuning space is not general enough to cover better schedules with non-factor tile sizes.


| Model | w/o Selection: Time (mins) | w/o Selection: Latency (ms) | w/ Selection: Time (mins) | w/ Selection: Latency (ms) | Used Time | Achieved Perf. |
|:-----:|:--------------------------:|:---------------------------:|:-------------------------:|:-------------------------:|:---------:|:--------------:|
| MobileNet V2 1.0       | 1185 | 0.74  | 404 | 0.78 | 34% | 95%  |
| ResNet 50 V1           | 833  | 2.69  | 179 | 3.7  | 21% | 73%  |
| VGG 19 BN              | 479  | 5.08  | 169 | 6.36 | 35% | 80%  |
| SqueezeNet 1.1         | 574  | 0.54  | 167 | 0.5  | 29% | 108% |
| DenseNet 121           | 2670 | 2.99  | 377 | 3.02 | 14% | 99%  |
| Yolo3 MobileNet1.0 voc | 1392 | 5.4   | 387 | 3.63 | 28% | 149% |
| SSD512 ResNet50 V1 voc | 1713 | 10.67 | 575 | 5.65 | 34% | 189% |
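
As a sanity check, the headline averages can be recomputed from the two rightmost columns of the table:

```python
# Values copied from the "Used Time" and "Achieve Perf." columns above.
used_time = [34, 21, 35, 29, 14, 28, 34]    # percent of baseline tuning time
achieved = [95, 73, 80, 108, 99, 149, 189]  # percent of baseline performance
print(round(sum(used_time) / len(used_time)))  # -> 28
print(round(sum(achieved) / len(achieved)))    # -> 113
```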

Note
----
1. The task selection implementation uses the Python graph package networkx, which introduces a new dependency to AutoTVM.

2. This mechanism works well on GPU but not on CPU, because the NCHWc layout limits the choices for the channel dimension C.

Tasks
------

- [x] Implementation #4187
- [ ] Tutorial

cc @kevinthesun @icemelon9 @tqchen 

-- 
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/dmlc/tvm/issues/4188

Re: [dmlc/tvm] [RFC][AutoTVM] Selective Tuning (#4188)

Posted by Cody Hao Yu <no...@github.com>.
Thanks for the suggestion. networkx is now imported only when the selective tuning API is invoked. The implementation is [here](https://github.com/dmlc/tvm/pull/4187/files#diff-752c9c125c8aafe01ed2c02743a56099R109). Is this what you meant?

-- 
https://github.com/dmlc/tvm/issues/4188#issuecomment-547086966

Re: [dmlc/tvm] [RFC][AutoTVM] Selective Tuning (#4188)

Posted by Yao Wang <no...@github.com>.
Thanks for this proposal! For Yolo and SSD, does the performance advantage mainly come from the larger tuning space? If so, I suggest we also do full auto-tuning with the expanded tuning space, so that we have an apples-to-apples comparison and a clearer picture of the tuning time vs. performance trade-off. It would also be interesting to see how this method can be used to select representative schedules for dynamic shape workloads.

-- 
https://github.com/dmlc/tvm/issues/4188#issuecomment-546216492

Re: [dmlc/tvm] [RFC][AutoTVM] Selective Tuning (#4188)

Posted by Cody Hao Yu <no...@github.com>.
While I am testing whether tuning more trials could make the result more intuitive, I would like to first ask for feedback on the naming. Here are my thoughts based on Tianqi's comments.

- select.py -> pass.py
As suggested, this module is more like a pass over a set of tasks, so we can treat it as a pass. I can implement it as a pass table so that we can add other passes in the future.

- Select representative tasks. Original: `autotvm.task.mark_depend(tasks)`
With the pass implemented above, this API becomes `autotvm.task.pass.FindReference(tasks)`, meaning that this pass is going to find the reference task for each task.

- task.depend -> task.cfg_ref_task
`cfg_ref_task` points to the task whose tuned configs we can refer to when tuning.

- tuner.depend_mode -> tuner.cfg_ref_mode
`cfg_ref_mode` is a string in the `<mode>-<cfg>` format. `<mode>` can be either "only" or "start"; `<cfg>` can be either "topN" or "N%". Here are some examples:
    - "only-top10": Stop tuning after trying the top 10 configs in the `cfg_ref_task`.
    - "start-5%": First try the top 5% configs in the `cfg_ref_task` and back to the normal tuning.
    - The default is set to "only-top10", meaning that we will stop tuning after 10 trials if the task we are tuning has a tuned reference task.
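
A hypothetical sketch of how such a `<mode>-<cfg>` string could be parsed (the function name is made up; only the grammar comes from the description above):

```python
def parse_cfg_ref_mode(spec):
    """Split a "<mode>-<cfg>" spec into (mode, cfg kind, value)."""
    mode, _, cfg = spec.partition("-")
    if mode not in ("only", "start"):
        raise ValueError("unknown mode: " + mode)
    if cfg.startswith("top"):
        return mode, "top", int(cfg[3:])       # "topN": N best configs
    if cfg.endswith("%"):
        return mode, "percent", int(cfg[:-1])  # "N%": best N percent
    raise ValueError("unknown cfg: " + cfg)

print(parse_cfg_ref_mode("only-top10"))  # -> ('only', 'top', 10)
print(parse_cfg_ref_mode("start-5%"))    # -> ('start', 'percent', 5)
```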

@kevinthesun @eqy @icemelon9 please let me know what you guys think about these names, and you're welcome to propose better ones. Thanks.


-- 
https://github.com/dmlc/tvm/issues/4188#issuecomment-548992976

Re: [dmlc/tvm] [RFC][AutoTVM] Selective Tuning (#4188)

Posted by Yao Wang <no...@github.com>.
I agree that we can think a bit more about the names "select" and "depend". These two names have rich semantics in different contexts. Maybe we would like an API more tightly bound to autotvm specifically.

-- 
https://github.com/dmlc/tvm/issues/4188#issuecomment-548049836

Re: [dmlc/tvm] [RFC][AutoTVM] Selective Tuning (#4188)

Posted by Tianqi Chen <no...@github.com>.
Yeah, if that is the case, I think we should be fine for now. Let us focus the discussion on the high-level API design of the task schedule (planning) API so that it works for other generic strategies.

-- 
https://github.com/dmlc/tvm/issues/4188#issuecomment-547096679

Re: [dmlc/tvm] [RFC][AutoTVM] Selective Tuning (#4188)

Posted by Tianqi Chen <no...@github.com>.
Thanks for the proposal. I think overall this can be categorized as scheduling of the tasks. Also cc @eqy. It would be nice if we can come up with abstractions for a generic task schedule strategy, then add it as a modular plugin.

The networkx dependency is something worth discussing. Ideally we want to minimize the dependencies of the core autotvm package.

-- 
https://github.com/dmlc/tvm/issues/4188#issuecomment-546362936

Re: [dmlc/tvm] [RFC][AutoTVM] Selective Tuning (#4188)

Posted by Tianqi Chen <no...@github.com>.
If we make the code modular enough via a high-level scheduling (planning) API, the networkx dep could be an optional one for those who want to use the component, which might help alleviate the issue.

-- 
https://github.com/dmlc/tvm/issues/4188#issuecomment-546498960

Re: [dmlc/tvm] [RFC][AutoTVM] Selective Tuning (#4188)

Posted by Cody Hao Yu <no...@github.com>.
All tasks are done, including a tutorial.
@tqchen @kevinthesun @eqy @icemelon9, please review the PR, and we can discuss if you have any suggestions on the API or design.

Thanks.

-- 
https://github.com/dmlc/tvm/issues/4188#issuecomment-547763438

Re: [dmlc/tvm] [RFC][AutoTVM] Selective Tuning (#4188)

Posted by Yao Wang <no...@github.com>.
Also, after further discussion with @comaniac, we thought it is worth digging a bit further into the reason why full tuning for SSD and Yolo only achieves 60%-70% of the performance of selective tuning. Maybe increasing the number of trials of full tuning can give us a clearer picture.

-- 
https://github.com/dmlc/tvm/issues/4188#issuecomment-548051813

Re: [dmlc/tvm] [RFC][AutoTVM] Selective Tuning (#4188)

Posted by Tianqi Chen <no...@github.com>.
Thanks @comaniac. 

It would be great if people can discuss a bit more in terms of API choices. I think we achieved the proposed goals. 

Specifically, here are some "complaints" I have from quickly looking into the API:

- select.py was a bit confusing (it reminds me of the select API of socket networking, but it is not?)
   - It is more like an annotation API that annotates the tasks; can we treat it as a pass over the set of tasks (if that is what we want)?
- The depend field in autotvm should be documented, and we need to think about a better name (the currently specified behavior seems to be to stop tuning task x and make use of the dependent data)

I do not have time to think too deeply about this, but perhaps others can do a review and discuss the possible candidates?


-- 
https://github.com/dmlc/tvm/issues/4188#issuecomment-548005028

Re: [dmlc/tvm] [RFC][AutoTVM] Selective Tuning (#4188)

Posted by Cody Hao Yu <no...@github.com>.
The leftmost two columns in the table are the total tuning time of 2,000 trials per op and the final inference latency, respectively. With the XGBoost tuner, I suppose the result after 2,000 trials is sufficient to illustrate the usability of selective tuning. Comparing with the full auto-tuning result could definitely be interesting, but it is time-consuming since the tuning space is on the order of 10^5 configs for each op. I'll further study the tuning space coverage in the near future for sure, along with the dynamic shape codegen support. (In fact, this RFC is more like a side application of the dynamic shape codegen investigation 😄)

-- 
https://github.com/dmlc/tvm/issues/4188#issuecomment-546220405

Re: [dmlc/tvm] [RFC][AutoTVM] Selective Tuning (#4188)

Posted by Cody Hao Yu <no...@github.com>.
@tqchen Thanks for the comments, and you're right. One important message behind this investigation is that a schedule can be shared across ops with different attributes.

For the networkx dependency, I have the same concern as well. I used it to build a graph and find maximum cliques in the graph. Unfortunately, I didn't find any existing TVM dependency that achieves the desired functionality. On the other hand, we could also use a different algorithm such as K-Means to cluster tasks, so that the existing dependency, `sklearn`, would be sufficient, although the result of using K-Means is not as good as the one I showed. I would be happy to take your suggestion if any.

-- 
https://github.com/dmlc/tvm/issues/4188#issuecomment-546430352