Posted to commits@tvm.apache.org by GitBox <gi...@apache.org> on 2022/05/24 19:36:42 UTC

[GitHub] [tvm] tkonolige opened a new pull request, #11434: [POPEN POOL] Use multiprocessing to kill workers after timeout

tkonolige opened a new pull request, #11434:
URL: https://github.com/apache/tvm/pull/11434

   The popen pool timeout did not work when the worker was running long-running C++ code. It relied on a Python `threading.Timer` to interrupt the process after a set amount of time. However, Python timers cannot interrupt C++ code (or any other non-Python code). Instead, we now use a subprocess (via Python's `multiprocessing`) to kill the worker process after the timeout expires. This works no matter what code is running.
   
   Note that this means `TimeoutError`s are never reported, because we have no way of distinguishing between a timeout and a subprocess dying. I think this is a worthwhile tradeoff, because otherwise autotvm can take forever to tune while it waits on a program that is very slow to compile.
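
The approach described above can be illustrated with a small standalone sketch. This is not the code from the PR: the helper names `_kill_after` and `run_with_watchdog` are hypothetical, the worker is an ordinary `subprocess.Popen` child rather than a TVM `PopenWorker`, and the `fork` start method is assumed, so it only applies to POSIX systems.

```python
import multiprocessing
import os
import signal
import subprocess
import sys
import time


def _kill_after(pid, timeout):
    """Watchdog body: sleep for `timeout` seconds, then kill process `pid`."""
    time.sleep(timeout)
    try:
        os.kill(pid, signal.SIGKILL)
    except ProcessLookupError:
        pass  # the worker already exited on its own


def run_with_watchdog(cmd, timeout):
    """Run `cmd` and kill it from a separate process once `timeout` expires.

    Returns the worker's exit code, which is negative if a signal killed it.
    """
    # Assumption: POSIX. "fork" keeps the sketch free of spawn-related guards.
    ctx = multiprocessing.get_context("fork")
    worker = subprocess.Popen(cmd)
    watchdog = ctx.Process(target=_kill_after, args=(worker.pid, timeout), daemon=True)
    watchdog.start()
    returncode = worker.wait()
    # Stop the watchdog so it cannot fire after the worker has finished.
    watchdog.terminate()
    watchdog.join()
    return returncode


# The worker sleeps far longer than the timeout, so the watchdog kills it.
rc_slow = run_with_watchdog([sys.executable, "-c", "import time; time.sleep(30)"], timeout=0.5)
# The worker finishes immediately, so the watchdog never fires.
rc_fast = run_with_watchdog([sys.executable, "-c", "pass"], timeout=30)
print(rc_slow, rc_fast)
```

The caller only learns that the worker died from a signal, not who sent it, which mirrors the tradeoff noted above: a timeout looks the same as a worker that was killed for any other reason.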
   
   @tqchen 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [tvm] tqchen commented on pull request #11434: [POPEN POOL] Use multiprocessing to kill workers after timeout

Posted by GitBox <gi...@apache.org>.

tqchen commented on PR #11434:
URL: https://github.com/apache/tvm/pull/11434#issuecomment-1136477607

   Coming back after some digging.
   
   My initial reaction to the comments was that yes, Python may indeed be unable to interrupt C++: a Python signal handler is not run while C++ code is executing, only once control returns to the Python interpreter.
   
   Here is the fun part: I ended up trying to write a solution using C++ multi-threading, and after writing test cases to confirm the behavior I realized that the original implementation might actually work fine.
   
   Here is why: Python's multi-threading is backed by real system threads, which are guarded by the GIL. So although one thread enters the FFI and runs a long-running function, another thread (the watcher) can continue to run in the Python interpreter without a problem, because the long-running function has released the GIL. As a result, the timeout signal back to the parent process still works, and the parent process then kills the popen worker.
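
The flow described above can be sketched without TVM. Everything here is a stand-in rather than the `popen_pool` implementation: the worker is a plain Python child process, the "long-running function" is libc's `sleep` called through `ctypes.CDLL` (which releases the GIL around foreign calls), and the `TIMEOUT` line written to stdout is a hypothetical protocol. It assumes a POSIX system where `ctypes.CDLL(None)` exposes libc symbols.

```python
import subprocess
import sys
import time

# Worker source: a watcher thread reports the timeout while the main thread
# is stuck inside a blocking C call made through ctypes.
WORKER = """
import ctypes, sys, threading

def report():
    sys.stdout.write("TIMEOUT\\n")
    sys.stdout.flush()

threading.Timer(0.2, report).start()
ctypes.CDLL(None).sleep(30)  # blocking C call; the GIL is released here
"""


def run_worker():
    """Start the worker, wait for its status line, and kill it on timeout."""
    start = time.monotonic()
    proc = subprocess.Popen(
        [sys.executable, "-c", WORKER], stdout=subprocess.PIPE, text=True
    )
    status = proc.stdout.readline().strip()  # blocks until the watcher reports
    if status == "TIMEOUT":
        proc.kill()  # the parent, not the worker, ends the long-running call
    proc.wait()
    proc.stdout.close()
    return status, time.monotonic() - start


status, elapsed = run_worker()
print(f"status={status!r} after {elapsed:.2f}s")
```

If the watcher thread could not run while the main thread was inside C, the parent would only see the status line after the full 30 seconds.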
   
   The two test cases I created for the alternative solution ended up showing that the original implementation works as intended.
   
   https://github.com/apache/tvm/compare/main...tqchen:popen2
   
   See the following busy counting function (in C++), used as a proxy for long-running C++ functions:
   ```c++
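   // Busy loop registered as a global function so it can be called through the FFI.
   TVM_REGISTER_GLOBAL("testing.busy_counting").set_body_typed([](double value, int repeat, int number) {
     double sum = 0.0;
     for (int i = 0; i < number; ++i) {
       // The inner loop burns CPU time without returning to the Python interpreter.
       for (int j = 0; j < repeat; ++j) {
         sum += value;
       }
       LOG(INFO) << "Finished counting for " << (i + 1) << " iter";
     }
   });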
   ```
   
   Running the following debug script yields:
   ```python
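   from tvm.contrib.popen_pool import PopenWorker
   import tvm.testing._ffi_api
   
   proc = PopenWorker()
   
   # First call: generous timeout, so all 10 iterations complete.
   proc.send(tvm.testing.busy_counting, [0.1, 1<<25, 10], timeout=100)
   proc.recv()
   
   # Second call: 0.2 s timeout, expected to expire while the C++ loop is still running.
   proc.send(tvm.testing.busy_counting, [0.1, 1<<25, 100], timeout=0.2)
   proc.recv()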
   ```
   
   ```
   [18:06:34] ffi_testing.cc:180: Finished counting for 1 iter
   [18:06:34] ffi_testing.cc:180: Finished counting for 2 iter
   [18:06:34] ffi_testing.cc:180: Finished counting for 3 iter
   [18:06:34] ffi_testing.cc:180: Finished counting for 4 iter
   [18:06:35] ffi_testing.cc:180: Finished counting for 5 iter
   [18:06:35] ffi_testing.cc:180: Finished counting for 6 iter
   [18:06:35] ffi_testing.cc:180: Finished counting for 7 iter
   [18:06:35] ffi_testing.cc:180: Finished counting for 8 iter
   [18:06:35] ffi_testing.cc:180: Finished counting for 9 iter
   [18:06:35] ffi_testing.cc:180: Finished counting for 10 iter
   --------- the second call starts here; the timeout is triggered during the C++ run ----
   [18:06:35] ffi_testing.cc:180: Finished counting for 1 iter
   [18:06:35] ffi_testing.cc:180: Finished counting for 2 iter
   Traceback (most recent call last):
     File "debug_popen.py", line 12, in <module>
       proc.recv()
     File "/home/tqchen/github/tvm/python/tvm/contrib/popen_pool.py", line 297, in recv
       raise TimeoutError()
   TimeoutError
   
   ```
   
   
   So perhaps there was some over-speculation in this case. @tkonolige, if you have a repro case, it would be good to dig deeper and see whether there are other reasons behind it.



[GitHub] [tvm] tqchen commented on pull request #11434: [POPEN POOL] Use multiprocessing to kill workers after timeout

Posted by GitBox <gi...@apache.org>.

tqchen commented on PR #11434:
URL: https://github.com/apache/tvm/pull/11434#issuecomment-1137809883

   Thanks @tkonolige , that is another fun surprise :) 
   
   I was eventually able to reproduce the problem under Cython, which revealed an issue in the Cython FFI implementation (it does not occur with ctypes, which automatically releases the GIL around foreign calls).
   
   https://github.com/apache/tvm/pull/11461 should fix the problem
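
The difference between an FFI that releases the GIL and one that holds it can be reproduced outside TVM. The sketch below is only an analogy for the Cython issue, not the code from #11461: it uses `ctypes.CDLL` (releases the GIL around foreign calls) and `ctypes.PyDLL` (keeps the GIL held), with libc's `sleep` standing in for a long-running C++ function. The helper name `timer_delay` is hypothetical, and a POSIX system with a standard GIL-enabled CPython build is assumed.

```python
import ctypes
import threading
import time


def timer_delay(dll_type, timeout=0.2, c_seconds=1):
    """Seconds until a threading.Timer fires while the main thread is in C."""
    libc = dll_type(None)  # assumption: POSIX, libc symbols are visible
    fired = []
    start = time.monotonic()
    timer = threading.Timer(timeout, lambda: fired.append(time.monotonic() - start))
    timer.start()
    libc.sleep(c_seconds)  # blocking C call made from the main thread
    timer.join()
    return fired[0]


released = timer_delay(ctypes.CDLL)  # GIL released: the timer fires on time
held = timer_delay(ctypes.PyDLL)     # GIL held: the timer waits for the C call
print(f"GIL released: {released:.2f}s, GIL held: {held:.2f}s")
```

When the GIL is held, the timer callback cannot run until the C call returns, which matches the log where `TimeoutError` appears only after the C++ function finishes.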
   



[GitHub] [tvm] tkonolige closed pull request #11434: [POPEN POOL] Use multiprocessing to kill workers after timeout

Posted by GitBox <gi...@apache.org>.

tkonolige closed pull request #11434: [POPEN POOL] Use multiprocessing to kill workers after timeout
URL: https://github.com/apache/tvm/pull/11434



[GitHub] [tvm] tkonolige commented on pull request #11434: [POPEN POOL] Use multiprocessing to kill workers after timeout

Posted by GitBox <gi...@apache.org>.

tkonolige commented on PR #11434:
URL: https://github.com/apache/tvm/pull/11434#issuecomment-1140048913

   Closing in favor of #11461



[GitHub] [tvm] tkonolige commented on pull request #11434: [POPEN POOL] Use multiprocessing to kill workers after timeout

Posted by GitBox <gi...@apache.org>.

tkonolige commented on PR #11434:
URL: https://github.com/apache/tvm/pull/11434#issuecomment-1137505533

   I think something is different between our environments :). Here is what I get from your script (I modified it a little so that it runs slower):
   
   ```
   [09:23:44] /home/tristan/octoml/tvm/src/support/ffi_testing.cc:184: Finished counting for 1 iter1.1259e+15
   [09:23:44] /home/tristan/octoml/tvm/src/support/ffi_testing.cc:184: Finished counting for 2 iter2.2518e+15
   [09:23:44] /home/tristan/octoml/tvm/src/support/ffi_testing.cc:184: Finished counting for 3 iter3.3777e+15
   [09:23:44] /home/tristan/octoml/tvm/src/support/ffi_testing.cc:184: Finished counting for 4 iter4.5036e+15
   [09:23:44] /home/tristan/octoml/tvm/src/support/ffi_testing.cc:184: Finished counting for 5 iter5.6295e+15
   [09:23:44] /home/tristan/octoml/tvm/src/support/ffi_testing.cc:184: Finished counting for 6 iter6.7554e+15
   [09:23:44] /home/tristan/octoml/tvm/src/support/ffi_testing.cc:184: Finished counting for 7 iter7.8813e+15
   [09:23:44] /home/tristan/octoml/tvm/src/support/ffi_testing.cc:184: Finished counting for 8 iter9.0072e+15
   [09:23:44] /home/tristan/octoml/tvm/src/support/ffi_testing.cc:184: Finished counting for 9 iter1.01331e+16
   [09:23:44] /home/tristan/octoml/tvm/src/support/ffi_testing.cc:184: Finished counting for 10 iter1.1259e+16
   [09:23:44] /home/tristan/octoml/tvm/src/support/ffi_testing.cc:184: Finished counting for 1 iter1.1259e+16
   [09:23:44] /home/tristan/octoml/tvm/src/support/ffi_testing.cc:184: Finished counting for 2 iter2.2518e+16
   [09:23:45] /home/tristan/octoml/tvm/src/support/ffi_testing.cc:184: Finished counting for 3 iter3.3777e+16
   [09:23:45] /home/tristan/octoml/tvm/src/support/ffi_testing.cc:184: Finished counting for 4 iter4.5036e+16
   [09:23:45] /home/tristan/octoml/tvm/src/support/ffi_testing.cc:184: Finished counting for 5 iter5.6295e+16
   [09:23:45] /home/tristan/octoml/tvm/src/support/ffi_testing.cc:184: Finished counting for 6 iter6.7554e+16
   [09:23:45] /home/tristan/octoml/tvm/src/support/ffi_testing.cc:184: Finished counting for 7 iter7.8813e+16
   -------------------------- snip -----------------------------------
   [09:23:47] /home/tristan/octoml/tvm/src/support/ffi_testing.cc:184: Finished counting for 96 iter1.08086e+18
   [09:23:47] /home/tristan/octoml/tvm/src/support/ffi_testing.cc:184: Finished counting for 97 iter1.09212e+18
   [09:23:47] /home/tristan/octoml/tvm/src/support/ffi_testing.cc:184: Finished counting for 98 iter1.10338e+18
   [09:23:47] /home/tristan/octoml/tvm/src/support/ffi_testing.cc:184: Finished counting for 99 iter1.11464e+18
   [09:23:47] /home/tristan/octoml/tvm/src/support/ffi_testing.cc:184: Finished counting for 100 iter1.1259e+18
   Traceback (most recent call last):
     File "timer-debug.py", line 11, in <module>
       proc.recv()
     File "/home/tristan/octoml/tvm/python/tvm/contrib/popen_pool.py", line 297, in recv
       raise TimeoutError()
   TimeoutError
   ```
   The timeout error is raised only after the C++ function finishes.
   
   This is Python 3.8.12 on Pop!_OS 21.04.
   
   > Here is why: python's multi-threading is backed by real system threads and they are guarded by GIL. So although one thread enters the FFI as a long running function, another thread(the watcher) can continue to run in python interpreter (as the GIL has been released by the long running function) without a problem, as a result the timeout signal back to the parent process will continue to function, then the parent process signals kill to the popen worker.
   
   Is this still true when the call chain is Python -> C++ -> Python -> C++?



[GitHub] [tvm] tqchen commented on pull request #11434: [POPEN POOL] Use multiprocessing to kill workers after timeout

Posted by GitBox <gi...@apache.org>.

tqchen commented on PR #11434:
URL: https://github.com/apache/tvm/pull/11434#issuecomment-1136412112

   Thanks @tkonolige, great catch of the problem (long-running C++ functions). This is indeed something we overlooked. It would be nice to still get the timeout information back.
   Let me also spend some time digging to see what we can do here.

