Posted to commits@tvm.apache.org by GitBox <gi...@apache.org> on 2020/11/12 20:57:21 UTC

[GitHub] [incubator-tvm] TaylorZowtuk opened a new pull request #6909: Make AutoScheduler handling of errors during measure consistent with AutoTvm

TaylorZowtuk opened a new pull request #6909:
URL: https://github.com/apache/incubator-tvm/pull/6909


   While running scripts using both AutoScheduler and AutoTvm to consecutively search for schedules for a number of operators/shapes, I observed different behaviors during measurement following the output “Too many errors happened during tuning.”
   
   After looking into the code, I determined that the difference arises because AutoScheduler and AutoTvm respond differently when the number of errors accumulated during measurement exceeds a threshold.
   
   I observed that while using AutoTvm, the program would switch to debug-level logging and continue searching.
   ```
   Too many errors happen in the tuning. Now is in debug mode
   No: 217	GFLOPS: 0.00/0.00	result: MeasureResult(costs=(RuntimeError('Traceback (most recent call last):\n  [bt] (5) /home/tanvir/tvm/build/libtvm.so(TVMFuncCall+0x63) [0x7fd9b685ee13]\n  [bt] (4) /home/tanvir/tvm/build/libtvm.so(+0x1309037) [0x7fd9b68c8037]\n  [bt] (3) /home/tanvir/tvm/build/libtvm.so(tvm::runtime::RPCWrappedFunc::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) const+0x3fa) [0x7fd9b68cc86a]\n  [bt] (2) /home/tanvir/tvm/build/libtvm.so(tvm::runtime::RPCClientSession::CallFunc(void*, TVMValue const*, int const*, int, std::function<void (tvm::runtime::TVMArgs)> const&)+0x57) [0x7fd9b68c0217]\n  [bt] (1) /home/tanvir/tvm/build/libtvm.so(tvm::runtime::RPCEndpoint::CallFunc(void*, TVMValue const*, int const*, int, std::function<void (tvm::runtime::TVMArgs)>)+0x6bd) [0x7fd9b68b546d]\n  [bt] (0) /home/tanvir/tvm/build/libtvm.so(+0x12f3668) [0x7fd9b68b2668]\n  File "/home/tanvir/tvm/src/runtime/rpc/rpc_endpoint.cc", line 807\nTVMError: Check failed: code == RPCCode::kReturn: code=1'),), error_no=4, all_cost=10.765872716903687, timestamp=1604092331.4940712)	[('tile_f', [-1, 16]), ('tile_y', [-1, 2]), ('tile_x', [-1, 2]), ('tile_z', [-1, 16])],None,1719
   …
   <continues>
   ```
   While using AutoScheduler, the program would crash after throwing an uncaught error.
   ```
   Traceback (most recent call last):
     …
     File "runner.py", line 124, in fig_6
       m = run_operator(
     File "runner.py", line 58, in run_operator
       sch, args = auto_scheduler.auto_schedule(task, tuning_options=tune_option)
     File "/home/taylor/tvm/python/tvm/auto_scheduler/auto_schedule.py", line 213, in auto_schedule
       sch, tensors = _ffi_api.AutoSchedule(search_policy, tuning_options)
     File "/home/taylor/tvm/python/tvm/_ffi/_ctypes/packed_func.py", line 237, in __call__
       raise get_last_ffi_error()
   tvm._ffi.base.TVMError: Traceback (most recent call last):
     [bt] (5) /home/taylor/tvm/build/libtvm.so(TVMFuncCall+0x63) [0x7f11e187d7b3]
     [bt] (4) /home/taylor/tvm/build/libtvm.so(+0x6965ab) [0x7f11e0c755ab]
     [bt] (3) /home/taylor/tvm/build/libtvm.so(tvm::auto_scheduler::AutoSchedule(tvm::auto_scheduler::SearchPolicy, tvm::auto_scheduler::TuningOptions)+0x11a) [0x7f11e0c74cca]
     [bt] (2) /home/taylor/tvm/build/libtvm.so(tvm::auto_scheduler::SketchPolicyNode::Search(int, int, int, tvm::auto_scheduler::ProgramMeasurer)+0x760) [0x7f11e0cfb3d0]
     [bt] (1) /home/taylor/tvm/build/libtvm.so(tvm::auto_scheduler::ProgramMeasurerNode::Measure(tvm::auto_scheduler::SearchTask const&, tvm::auto_scheduler::SearchPolicy const&, tvm::runtime::Array<tvm::auto_scheduler::MeasureInput, void> const&, tvm::runtime::Array<tvm::auto_scheduler::MeasureResult, void>*, int)+0x11ed) [0x7f11e0cd7b2d]
     [bt] (0) /home/taylor/tvm/build/libtvm.so(+0x6f4af8) [0x7f11e0cd3af8]
     File "/home/taylor/tvm/src/auto_scheduler/measure.cc", line 268
   TVMError: Too many errors happened during tuning
   ```
   
   In my particular case, AutoScheduler crashing rather than continuing the search meant that my script terminated prematurely, even though it might have recovered from whatever was causing the errors during search.
   In addition, it was unclear to me why this behavior occurred only in AutoScheduler and not in AutoTvm. This discrepancy can be confusing to new users who may want to explore both methods of schedule search. This PR proposes bringing AutoScheduler's handling of measurement errors in line with AutoTvm's.
   
   By removing the LOG(FATAL) and changing the verbosity for AutoScheduler in the same way the logging level is changed in AutoTvm, the two programs behave the same. In addition, I changed the default verbosity of AutoScheduler to 0 (silent) in order to match the default logging level of AutoTvm.
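
For readers who want the proposed control flow in isolation, here is a minimal Python sketch of it. This is not the code in `measure.cc`: `MeasureOutcome`, `measure_all`, `measure_one`, and the `log` callback are hypothetical names invented for illustration. The sketch assumes that a successful measurement resets the error counter and that debug mode corresponds to `verbose = 2`, as the review comment later in this thread states.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class MeasureOutcome:
    """Hypothetical stand-in for one measurement result."""
    error_no: int      # 0 means success; any other value is an error code
    detail: str = ""


def measure_all(
    candidates: List[str],
    measure_one: Callable[[str], MeasureOutcome],
    max_continuous_error: int = 3,
    verbose: int = 1,
    log: Callable[[str], None] = print,
) -> Tuple[List[MeasureOutcome], int]:
    """Measure every candidate; on too many consecutive errors, warn and
    raise the verbosity instead of aborting the whole search."""
    error_ct = 0
    outcomes = []
    for cand in candidates:
        outcome = measure_one(cand)
        outcomes.append(outcome)

        # Assumption of this sketch: a success resets the counter.
        if outcome.error_no == 0:
            error_ct = 0
        else:
            error_ct += 1

        # Previously this condition was fatal; here it only warns once
        # and switches to debug-level output.
        if error_ct > max_continuous_error and verbose < 2:
            log("Too many errors happened during tuning. Switching to debug mode.")
            verbose = 2

        if verbose >= 2:
            log(f"{cand}: error_no={outcome.error_no} {outcome.detail}")
    return outcomes, verbose
```

The function returns the final verbosity alongside the results so that a caller can tell whether the threshold was crossed during the run.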
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-tvm] TaylorZowtuk commented on pull request #6909: Make AutoScheduler handling of errors during measure consistent with AutoTvm

Posted by GitBox <gi...@apache.org>.

TaylorZowtuk commented on pull request #6909:
URL: https://github.com/apache/incubator-tvm/pull/6909#issuecomment-728307739


   > How do you hit this part of the code? Generally, it means you have some fatal errors in the code.
   > It is very rare to recover from a case where you have so many continuous errors.
   
   I'm not entirely certain what causes us to hit this condition. In our case, we observed from the AutoTvm debug prints that it was due to error_no=4, which is a RUNTIME_DEVICE error (as you can see from the excerpt of the AutoTvm log I included previously). I think the main issue is that by terminating the program as soon as we meet this condition, we don't allow for the chance to recover, and additionally, we won't get this useful, precise feedback about which error we are hitting while using the auto_scheduler.
   
   I'll do the rebasing and try to fix the CI issue.


[GitHub] [incubator-tvm] merrymercy edited a comment on pull request #6909: Make AutoScheduler handling of errors during measure consistent with AutoTvm

Posted by GitBox <gi...@apache.org>.

merrymercy edited a comment on pull request #6909:
URL: https://github.com/apache/incubator-tvm/pull/6909#issuecomment-727626290


   Please rebase and fix the CI error.


[GitHub] [incubator-tvm] merrymercy commented on a change in pull request #6909: Make AutoScheduler handling of errors during measure consistent with AutoTvm

Posted by GitBox <gi...@apache.org>.

merrymercy commented on a change in pull request #6909:
URL: https://github.com/apache/incubator-tvm/pull/6909#discussion_r523804433



##########
File path: src/auto_scheduler/measure.cc
##########
@@ -267,7 +269,11 @@ Array<MeasureResult> ProgramMeasurerNode::Measure(const SearchTask& task,
     }
 
     if (error_ct > max_continuous_error) {
-      LOG(FATAL) << "Too many errors happened during tuning";
+      LOG(WARNING) << "Too many errors happened during tuning. Switching to debug mode."
+                   << std::endl;
+      verbose = 1;

Review comment:
       ```suggestion
         verbose = 2;
   ```
   In this PR (https://github.com/apache/incubator-tvm/pull/6882), we changed the verbosity level. Now the debug mode is `verbose=2`.




[GitHub] [incubator-tvm] TaylorZowtuk commented on pull request #6909: Make AutoScheduler handling of errors during measure consistent with AutoTvm

Posted by GitBox <gi...@apache.org>.

TaylorZowtuk commented on pull request #6909:
URL: https://github.com/apache/incubator-tvm/pull/6909#issuecomment-726338624


   @tqchen @merrymercy Thoughts and review please?


[GitHub] [incubator-tvm] merrymercy commented on pull request #6909: Make AutoScheduler handling of errors during measure consistent with AutoTvm

Posted by GitBox <gi...@apache.org>.

merrymercy commented on pull request #6909:
URL: https://github.com/apache/incubator-tvm/pull/6909#issuecomment-727626290


   Also, please rebase.


[GitHub] [incubator-tvm] merrymercy commented on pull request #6909: Make AutoScheduler handling of errors during measure consistent with AutoTvm

Posted by GitBox <gi...@apache.org>.

merrymercy commented on pull request #6909:
URL: https://github.com/apache/incubator-tvm/pull/6909#issuecomment-728764382


   Thanks, @TaylorZowtuk. It is merged.


[GitHub] [incubator-tvm] tqchen commented on pull request #6909: Make AutoScheduler handling of errors during measure consistent with AutoTvm

Posted by GitBox <gi...@apache.org>.

tqchen commented on pull request #6909:
URL: https://github.com/apache/incubator-tvm/pull/6909#issuecomment-726411462


   cc @jcf94 @merrymercy 


[GitHub] [incubator-tvm] merrymercy merged pull request #6909: Make AutoScheduler handling of errors during measure consistent with AutoTvm

Posted by GitBox <gi...@apache.org>.

merrymercy merged pull request #6909:
URL: https://github.com/apache/incubator-tvm/pull/6909


   


[GitHub] [incubator-tvm] merrymercy commented on a change in pull request #6909: Make AutoScheduler handling of errors during measure consistent with AutoTvm

Posted by GitBox <gi...@apache.org>.

merrymercy commented on a change in pull request #6909:
URL: https://github.com/apache/incubator-tvm/pull/6909#discussion_r524836173



##########
File path: python/tvm/auto_scheduler/auto_schedule.py
##########
@@ -89,7 +89,7 @@ def __init__(
         num_measure_trials=0,
         early_stopping=None,
         num_measures_per_round=64,
-        verbose=1,
+        verbose=0,

Review comment:
       ```suggestion
           verbose=1,
   ```

##########
File path: python/tvm/auto_scheduler/auto_schedule.py
##########
@@ -72,7 +72,7 @@ class TuningOptions(Object):
         The number of schedules to be measured at each search round.
         The whole schedule search process will try a total number of `num_measure_trials` in several
         rounds.
-    verbose: int = 1
+    verbose: int = 0

Review comment:
       ```suggestion
       verbose: int = 1
   ```




[GitHub] [incubator-tvm] TaylorZowtuk edited a comment on pull request #6909: Make AutoScheduler handling of errors during measure consistent with AutoTvm

Posted by GitBox <gi...@apache.org>.

TaylorZowtuk edited a comment on pull request #6909:
URL: https://github.com/apache/incubator-tvm/pull/6909#issuecomment-728307739


   > How do you hit this part of the code? Generally, it means you have some fatal errors in the code.
   > It is very rare to recover from a case where you have so many continuous errors.
   
   I'm not entirely certain what causes us to hit this condition. In our case, we observed from the AutoTvm debug prints that it was due to error_no=4, which is a RUNTIME_DEVICE error (as you can see from the excerpt of the AutoTvm log I included previously). We hit this condition only intermittently. We could run a particular op/shape once and hit the condition, and without changing anything it would work the next time. In addition, having one op/shape reach this condition didn't mean that the rest of the op/shapes we were running in the same script would fail, meaning the system overall was able to recover. I think the main issue is that by terminating the program as soon as we meet this condition, we don't allow for the chance to recover, and additionally, we won't get this useful, precise feedback about which error we are hitting while using the auto_scheduler.
   
   I'll do the rebasing and try to fix the CI issue.
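
To illustrate the multi-workload scenario described above, here is a rough sketch of a driver loop that isolates each op/shape so that one failing workload does not stop the rest. It is only an illustration: `tune_all`, `tune_one`, and `log` are hypothetical names, `tune_one` stands in for whatever per-workload search call a script uses, and catching `Exception` is a simplification rather than a statement about which exception types TVM raises.

```python
from typing import Callable, Dict, Iterable, Optional


def tune_all(
    workloads: Iterable[str],
    tune_one: Callable[[str], object],
    log: Callable[[str], None] = print,
) -> Dict[str, Optional[object]]:
    """Run a (hypothetical) per-workload tuning function over all workloads.

    A failure in one workload is recorded as None and logged, and the loop
    moves on to the next workload instead of terminating the script.
    """
    results: Dict[str, Optional[object]] = {}
    for name in workloads:
        try:
            results[name] = tune_one(name)
        except Exception as err:  # simplification: catch any tuning failure
            log(f"{name}: tuning failed, continuing with next workload: {err}")
            results[name] = None
    return results
```

Workloads recorded as `None` can be retried afterwards, which matches the observation that a rerun without any changes often succeeds.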

