Posted to commits@tvm.apache.org by GitBox <gi...@apache.org> on 2021/04/09 20:15:04 UTC

[GitHub] [tvm] echuraev opened a new pull request #7819: [METAL] Fix issue with GPU fails

echuraev opened a new pull request #7819:
URL: https://github.com/apache/tvm/pull/7819


   Added a first run to the auto scheduler. This run is necessary to check
   that the generated kernel is correct. If we run the time evaluator
   directly with an incorrect kernel, our application on an iOS device can
   be added to an ignore list because of the large number of incorrect
   kernels committed to the GPU. One run before auto scheduling starts
   helps us avoid this problem.
   
   Added completed handlers to all command buffers in the Metal runtime.
   They let us handle GPU errors and report them to the host application.
   
   When an error happens, we have to create a new stream. Added a
   mechanism for error handling and stream creation to the Python
   interface.
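
   For illustration, handling a GPU fault and recreating the stream from
   Python might look like the sketch below. The method names
   (`create_raw_stream`, `set_raw_stream`, `free_raw_stream`) and the use
   of `RuntimeError` are assumptions for this sketch, not necessarily the
   final interface.

```python
# Hedged sketch: recover from a Metal GPU fault by replacing the broken
# stream. The device method names and the exception type are assumptions.

def run_with_stream_recovery(dev, run_kernel):
    """Run a kernel once; if the device reports an error, replace the
    current stream with a fresh one so later work can proceed."""
    stream = dev.create_raw_stream()
    dev.set_raw_stream(stream)
    try:
        run_kernel()
    except RuntimeError:
        # The old command queue is in an error state; discard it and
        # make a fresh stream current.
        dev.free_raw_stream(stream)
        stream = dev.create_raw_stream()
        dev.set_raw_stream(stream)
    return stream
```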
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm] echuraev commented on a change in pull request #7819: [METAL] Fix issue with GPU fails

Posted by GitBox <gi...@apache.org>.
echuraev commented on a change in pull request #7819:
URL: https://github.com/apache/tvm/pull/7819#discussion_r613968357



##########
File path: src/runtime/metal/metal_device_api.mm
##########
@@ -183,16 +187,25 @@ int GetWarpSize(id<MTLDevice> dev) {
   }
 }
 
+Stream* getStream(TVMStreamHandle stream, int device_id) {

Review comment:
       Done







[GitHub] [tvm] echuraev commented on a change in pull request #7819: [METAL] Fix issue with GPU fails

echuraev commented on a change in pull request #7819:
URL: https://github.com/apache/tvm/pull/7819#discussion_r612950582



##########
File path: src/runtime/minrpc/rpc_reference.h
##########
@@ -49,7 +49,10 @@ enum class RPCCode : int {
   kDevGetAttr,
   kDevAllocData,
   kDevFreeData,
+  kDevCreateStream,

Review comment:
       You are right. Thank you.







[GitHub] [tvm] masahi commented on pull request #7819: [METAL] Fix issue with GPU fails

masahi commented on pull request #7819:
URL: https://github.com/apache/tvm/pull/7819#issuecomment-830798957


   hmm it seems this commit broke auto scheduling on vulkan. Removing the change in `auto_scheduler/measure.py` fixes it. I'll take a look but @echuraev do you know what could be wrong?





[GitHub] [tvm] echuraev commented on a change in pull request #7819: [METAL] Fix issue with GPU fails

echuraev commented on a change in pull request #7819:
URL: https://github.com/apache/tvm/pull/7819#discussion_r613968904



##########
File path: src/runtime/metal/metal_common.h
##########
@@ -45,15 +45,39 @@
 namespace tvm {
 namespace runtime {
 namespace metal {
+/*!
+ * \brief Structure for error handling in queues
+ */
+class Stream {
+ public:
+  explicit Stream(id<MTLDevice> device) : error_happened_(false) {
+    queue_ = [device newCommandQueue];
+  }
+  ~Stream() { [queue_ release]; }
+  id<MTLCommandBuffer> GetCommandBuffer() {
+    id<MTLCommandBuffer> cb = [queue_ commandBuffer];
+    [cb addCompletedHandler:^(id<MTLCommandBuffer> buffer) {
+      if (buffer.status == MTLCommandBufferStatusError) SetErrorStatus();
+    }];
+    return cb;
+  }
+  bool IsErrorHappened() { return error_happened_; }

Review comment:
       Renamed. Thank you.










[GitHub] [tvm] masahi commented on pull request #7819: [METAL] Fix issue with GPU fails

masahi commented on pull request #7819:
URL: https://github.com/apache/tvm/pull/7819#issuecomment-821533080


   @tqchen blocked by your change request





[GitHub] [tvm] masahi commented on a change in pull request #7819: [METAL] Fix issue with GPU fails

masahi commented on a change in pull request #7819:
URL: https://github.com/apache/tvm/pull/7819#discussion_r613868402



##########
File path: src/runtime/metal/metal_common.h
##########
@@ -45,15 +45,39 @@
 namespace tvm {
 namespace runtime {
 namespace metal {
+/*!
+ * \brief Structure for error handling in queues
+ */
+class Stream {
+ public:
+  explicit Stream(id<MTLDevice> device) : error_happened_(false) {
+    queue_ = [device newCommandQueue];
+  }
+  ~Stream() { [queue_ release]; }
+  id<MTLCommandBuffer> GetCommandBuffer() {
+    id<MTLCommandBuffer> cb = [queue_ commandBuffer];
+    [cb addCompletedHandler:^(id<MTLCommandBuffer> buffer) {
+      if (buffer.status == MTLCommandBufferStatusError) SetErrorStatus();
+    }];
+    return cb;
+  }
+  bool IsErrorHappened() { return error_happened_; }

Review comment:
       `HasErrorHappened` would be more correct I believe







[GitHub] [tvm] tqchen commented on a change in pull request #7819: [METAL] Fix issue with GPU fails

tqchen commented on a change in pull request #7819:
URL: https://github.com/apache/tvm/pull/7819#discussion_r613236203



##########
File path: python/tvm/_ffi/runtime_ctypes.py
##########
@@ -262,9 +262,23 @@ def max_thread_dimensions(self):
         """
         return json.loads(self._GetDeviceAttr(self.device_type, self.device_id, 8))
 
-    def sync(self):
+    def create_stream(self):
+        """Create a new runtime stream at the context."""

Review comment:
       document the return value per numpy docs convention (see documentation for max_thread_dimensions)

##########
File path: python/tvm/_ffi/runtime_ctypes.py
##########
@@ -262,9 +262,23 @@ def max_thread_dimensions(self):
         """
         return json.loads(self._GetDeviceAttr(self.device_type, self.device_id, 8))
 
-    def sync(self):
+    def create_stream(self):
+        """Create a new runtime stream at the context."""
+        stream = ctypes.c_void_p()
+        check_call(_LIB.TVMStreamCreate(self.device_type, self.device_id, ctypes.byref(stream)))
+        return stream
+
+    def free_stream(self, stream):
+        """Free a created stream handle."""
+        check_call(_LIB.TVMStreamFree(self.device_type, self.device_id, stream))
+
+    def set_stream(self, stream):
+        """Set a created stream handle."""

Review comment:
       document all arguments and return value
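
   A sketch of the numpydoc-style shape being requested, modeled on the
   documentation for `max_thread_dimensions` (the wording is illustrative,
   not the merged text; the real method body calls into the C API):

```python
# Illustrative numpydoc-style docstring for set_stream; body omitted.

def set_stream(self, stream):
    """Set a created stream as the current stream of the device.

    Parameters
    ----------
    stream : ctypes.c_void_p
        The stream handle returned by ``create_stream``.

    Returns
    -------
    None
    """
```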

##########
File path: python/tvm/_ffi/runtime_ctypes.py
##########
@@ -262,9 +262,23 @@ def max_thread_dimensions(self):
         """
         return json.loads(self._GetDeviceAttr(self.device_type, self.device_id, 8))
 
-    def sync(self):
+    def create_stream(self):
+        """Create a new runtime stream at the context."""
+        stream = ctypes.c_void_p()
+        check_call(_LIB.TVMStreamCreate(self.device_type, self.device_id, ctypes.byref(stream)))
+        return stream
+
+    def free_stream(self, stream):

Review comment:
       document the return value per numpy docs convention (see documentation for max_thread_dimensions)

##########
File path: python/tvm/_ffi/runtime_ctypes.py
##########
@@ -262,9 +262,23 @@ def max_thread_dimensions(self):
         """
         return json.loads(self._GetDeviceAttr(self.device_type, self.device_id, 8))
 
-    def sync(self):
+    def create_stream(self):
+        """Create a new runtime stream at the context."""

Review comment:
       Given that the stream we have is quite low-level (not managed by python) we might want to think about how to introduce a better pythonic wrapper later. As a result, we might want to name the API more carefully to indicate it is a low-level api.
   
   How about we do `create_raw_stream`, and add a note that the user expects to free the stream after use. Do the same thing for the other apis
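
   A possible pythonic wrapper along these lines could be a context
   manager that frees the raw stream on exit. This is only a sketch: the
   method names `create_raw_stream`, `set_raw_stream`, and
   `free_raw_stream` follow the naming proposed above and are assumptions,
   not the merged API.

```python
from contextlib import contextmanager

@contextmanager
def raw_stream(dev):
    """Create a raw stream on dev, make it current, free it on exit."""
    stream = dev.create_raw_stream()
    dev.set_raw_stream(stream)
    try:
        yield stream
    finally:
        # Raw streams are not managed by Python, so release explicitly.
        dev.free_raw_stream(stream)
```

   A caller could then write `with raw_stream(dev) as s: ...` and never
   leak the low-level handle.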
   







[GitHub] [tvm] tqchen edited a comment on pull request #7819: [METAL] Fix issue with GPU fails

tqchen edited a comment on pull request #7819:
URL: https://github.com/apache/tvm/pull/7819#issuecomment-830799247


   @masahi this could be due to the stream management introduced in this PR (explicit calls to set stream and new stream/free stream). I believe in vk we should always allocate and return an indicator of the default stream






[GitHub] [tvm] masahi merged pull request #7819: [METAL] Fix issue with GPU fails

masahi merged pull request #7819:
URL: https://github.com/apache/tvm/pull/7819


   





[GitHub] [tvm] masahi commented on pull request #7819: [METAL] Fix issue with GPU fails

masahi commented on pull request #7819:
URL: https://github.com/apache/tvm/pull/7819#issuecomment-821676012


   thanks @echuraev @tqchen 





[GitHub] [tvm] tqchen commented on a change in pull request #7819: [METAL] Fix issue with GPU fails

tqchen commented on a change in pull request #7819:
URL: https://github.com/apache/tvm/pull/7819#discussion_r612412020



##########
File path: src/runtime/minrpc/rpc_reference.h
##########
@@ -49,7 +49,10 @@ enum class RPCCode : int {
   kDevGetAttr,
   kDevAllocData,
   kDevFreeData,
+  kDevCreateStream,

Review comment:
       please put these new codes after everything else so that the RPC code stays backward compatible
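
   The concern can be illustrated with a small sketch (in Python for
   brevity; the member names echo the diff, but the numeric values are
   illustrative, not the real `RPCCode` values):

```python
from enum import IntEnum

# Old protocol: codes an already-deployed client knows about.
class RPCCodeOld(IntEnum):
    kDevGetAttr = 0
    kDevAllocData = 1
    kDevFreeData = 2

# Backward-compatible extension: the new code is appended at the end,
# so every existing code keeps its wire value.
class RPCCodeNew(IntEnum):
    kDevGetAttr = 0
    kDevAllocData = 1
    kDevFreeData = 2
    kDevCreateStream = 3

# Inserting kDevCreateStream in the middle of the enum instead would
# shift all later values and break old clients on the wire.
```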







[GitHub] [tvm] masahi commented on a change in pull request #7819: [METAL] Fix issue with GPU fails

masahi commented on a change in pull request #7819:
URL: https://github.com/apache/tvm/pull/7819#discussion_r613866381



##########
File path: src/runtime/metal/metal_device_api.mm
##########
@@ -183,16 +187,25 @@ int GetWarpSize(id<MTLDevice> dev) {
   }
 }
 
+Stream* getStream(TVMStreamHandle stream, int device_id) {

Review comment:
       I think we prefer `GetStream`? I don't know the convention of Metal though







[GitHub] [tvm] tqchen commented on pull request #7819: [METAL] Fix issue with GPU fails

tqchen commented on pull request #7819:
URL: https://github.com/apache/tvm/pull/7819#issuecomment-830808263


   We can let create stream return nullptr, and implement set stream/free stream as no-ops for nullptr
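
   One way to read this suggestion, as a Python stand-in for the C++
   `DeviceAPI` methods (the class name and error type here are
   hypothetical):

```python
# Sketch: on backends without real streams, create_stream returns a null
# handle meaning "the default stream", and set/free on that handle are
# no-ops. This mirrors the suggestion above; names are illustrative.

class DefaultStreamOnlyDeviceAPI:
    def create_stream(self):
        # None plays the role of nullptr: "use the default stream".
        return None

    def set_stream(self, stream):
        if stream is None:
            return  # nop: the default stream is always current
        raise ValueError("this backend only supports the default stream")

    def free_stream(self, stream):
        if stream is None:
            return  # nop: nothing was allocated for the default stream
        raise ValueError("this backend only supports the default stream")
```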





[GitHub] [tvm] tqchen commented on pull request #7819: [METAL] Fix issue with GPU fails

tqchen commented on pull request #7819:
URL: https://github.com/apache/tvm/pull/7819#issuecomment-819511707


   also cc @masahi @csullivan @ZihengJiang please help to review this PR





[GitHub] [tvm] tqchen commented on pull request #7819: [METAL] Fix issue with GPU fails

tqchen commented on pull request #7819:
URL: https://github.com/apache/tvm/pull/7819#issuecomment-830799247


   @masahi this could be due to the stream management





[GitHub] [tvm] masahi commented on pull request #7819: [METAL] Fix issue with GPU fails

masahi commented on pull request #7819:
URL: https://github.com/apache/tvm/pull/7819#issuecomment-821012327


   @echuraev please kick CI again





[GitHub] [tvm] masahi commented on pull request #7819: [METAL] Fix issue with GPU fails

masahi commented on pull request #7819:
URL: https://github.com/apache/tvm/pull/7819#issuecomment-830801201


   ok I see https://github.com/apache/tvm/blob/46e0634fb703a611b0e0f7407f10b303ccc6af81/src/runtime/vulkan/vulkan.cc#L397-L399

