Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2021/10/04 22:21:09 UTC

[GitHub] [incubator-mxnet] mk-61 opened a new pull request #20635: Port convolutions to cuDNN v8 API

mk-61 opened a new pull request #20635:
URL: https://github.com/apache/incubator-mxnet/pull/20635


   ## Description ##
   This change ports the Convolution and Deconvolution operators to the cuDNN v8 API.
   Legacy API support is dropped, as per this RFC: https://github.com/apache/incubator-mxnet/issues/20618.
   
   The change also includes general cuDNN v8 API support code, to be reused as more operations are ported to the v8 API.
   
   Finally, auto-tuning functionality moves from cuDNN into MXNet, which required some memory-management changes.
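   
   For reviewers unfamiliar with the v8 backend API, here is a minimal sketch of the descriptor-based flow this port builds on (raw cuDNN calls, error handling elided; an illustration, not code from this PR):
   ```
   #include <cudnn.h>
   
   // Sketch: build and validate a v8 backend tensor descriptor.
   void MakeExampleTensorDesc() {
     cudnnBackendDescriptor_t tensor;
     cudnnBackendCreateDescriptor(CUDNN_BACKEND_TENSOR_DESCRIPTOR, &tensor);
     int64_t uid = 1;  // unique id chosen by the caller
     cudnnBackendSetAttribute(tensor, CUDNN_ATTR_TENSOR_UNIQUE_ID,
                              CUDNN_TYPE_INT64, 1, &uid);
     // ... likewise set CUDNN_ATTR_TENSOR_DATA_TYPE, _DIMENSIONS, _STRIDES,
     // and _BYTE_ALIGNMENT, then finalize:
     cudnnBackendFinalize(tensor);  // validates the attribute set
   }
   ```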
   
   ## Checklist ##
   ### Essentials ###
   - [X] Changes are complete (i.e. I finished coding on this PR)
   - [X] All changes have test coverage
   - [X] Code is well-documented
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@mxnet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-mxnet] mxnet-bot commented on pull request #20635: Port convolutions to cuDNN v8 API

Posted by GitBox <gi...@apache.org>.
mxnet-bot commented on pull request #20635:
URL: https://github.com/apache/incubator-mxnet/pull/20635#issuecomment-949183741


   Jenkins CI successfully triggered : [centos-gpu]





[GitHub] [incubator-mxnet] mxnet-bot commented on pull request #20635: Port convolutions to cuDNN v8 API

Posted by GitBox <gi...@apache.org>.
mxnet-bot commented on pull request #20635:
URL: https://github.com/apache/incubator-mxnet/pull/20635#issuecomment-966567465


   Jenkins CI successfully triggered : [unix-cpu]





[GitHub] [incubator-mxnet] mxnet-bot commented on pull request #20635: Port convolutions to cuDNN v8 API

Posted by GitBox <gi...@apache.org>.
mxnet-bot commented on pull request #20635:
URL: https://github.com/apache/incubator-mxnet/pull/20635#issuecomment-947040841


   Jenkins CI successfully triggered : [unix-cpu]





[GitHub] [incubator-mxnet] mxnet-bot commented on pull request #20635: Port convolutions to cuDNN v8 API

Posted by GitBox <gi...@apache.org>.
mxnet-bot commented on pull request #20635:
URL: https://github.com/apache/incubator-mxnet/pull/20635#issuecomment-946113642


   Jenkins CI successfully triggered : [centos-gpu, unix-gpu]





[GitHub] [incubator-mxnet] mxnet-bot commented on pull request #20635: Port convolutions to cuDNN v8 API

Posted by GitBox <gi...@apache.org>.
mxnet-bot commented on pull request #20635:
URL: https://github.com/apache/incubator-mxnet/pull/20635#issuecomment-946926444


   Jenkins CI successfully triggered : [clang, centos-cpu, windows-cpu, edge, miscellaneous, unix-cpu, website]





[GitHub] [incubator-mxnet] mxnet-bot commented on pull request #20635: Port convolutions to cuDNN v8 API

Posted by GitBox <gi...@apache.org>.
mxnet-bot commented on pull request #20635:
URL: https://github.com/apache/incubator-mxnet/pull/20635#issuecomment-939188130


   Jenkins CI successfully triggered : [unix-gpu]





[GitHub] [incubator-mxnet] mxnet-bot commented on pull request #20635: Port convolutions to cuDNN v8 API

Posted by GitBox <gi...@apache.org>.
mxnet-bot commented on pull request #20635:
URL: https://github.com/apache/incubator-mxnet/pull/20635#issuecomment-933900469


   Hey @mk-61, thanks for submitting the PR.
   All tests are already queued to run once. If tests fail, you can trigger one or more tests again with the following commands: 
   - To trigger all jobs: @mxnet-bot run ci [all] 
   - To trigger specific jobs: @mxnet-bot run ci [job1, job2] 
   *** 
   **CI supported jobs**: [website, sanity, unix-cpu, windows-cpu, edge, windows-gpu, centos-cpu, centos-gpu, unix-gpu, clang, miscellaneous]
   *** 
   _Note_: 
    Only the following 3 categories can trigger CI: PR Author, MXNet Committer, Jenkins Admin. 
   All CI tests must pass before the PR can be merged. 
   





[GitHub] [incubator-mxnet] mxnet-bot commented on pull request #20635: Port convolutions to cuDNN v8 API

Posted by GitBox <gi...@apache.org>.
mxnet-bot commented on pull request #20635:
URL: https://github.com/apache/incubator-mxnet/pull/20635#issuecomment-946019697


   Jenkins CI successfully triggered : [centos-gpu, unix-gpu, windows-gpu]





[GitHub] [incubator-mxnet] mxnet-bot commented on pull request #20635: Port convolutions to cuDNN v8 API

Posted by GitBox <gi...@apache.org>.
mxnet-bot commented on pull request #20635:
URL: https://github.com/apache/incubator-mxnet/pull/20635#issuecomment-949886423


   Jenkins CI successfully triggered : [windows-gpu, centos-gpu, unix-gpu]





[GitHub] [incubator-mxnet] mk-61 commented on pull request #20635: Port convolutions to cuDNN v8 API

Posted by GitBox <gi...@apache.org>.
mk-61 commented on pull request #20635:
URL: https://github.com/apache/incubator-mxnet/pull/20635#issuecomment-946326744


   @mxnet-bot run ci [centos-gpu, unix-gpu]





[GitHub] [incubator-mxnet] ptrendx commented on a change in pull request #20635: Port convolutions to cuDNN v8 API

Posted by GitBox <gi...@apache.org>.
ptrendx commented on a change in pull request #20635:
URL: https://github.com/apache/incubator-mxnet/pull/20635#discussion_r747672691



##########
File path: src/operator/cudnn_ops.h
##########
@@ -0,0 +1,255 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+/*!
+ * \file  cudnn_ops.h
+ * \brief cuDNN v8 ops
+ */
+#ifndef MXNET_OPERATOR_CUDNN_OPS_H_
+#define MXNET_OPERATOR_CUDNN_OPS_H_
+
+#include <mxnet/base.h>
+#if MXNET_USE_CUDNN == 1
+
+#include <mxnet/op_attr_types.h>
+
+#include <mutex>
+#include <tuple>
+#include <unordered_map>
+#include <utility>
+#include <vector>
+
+#include "nn/convolution-inl.h"
+#include "nn/deconvolution-inl.h"
+
+#include "../common/cuda/cudnn_cxx.h"
+
+namespace mxnet {
+namespace tuple_util {

Review comment:
       Why is it here and not in the cudnn_cxx.h?







[GitHub] [incubator-mxnet] mxnet-bot commented on pull request #20635: Port convolutions to cuDNN v8 API

Posted by GitBox <gi...@apache.org>.
mxnet-bot commented on pull request #20635:
URL: https://github.com/apache/incubator-mxnet/pull/20635#issuecomment-966497098


   Jenkins CI successfully triggered : [centos-gpu, unix-gpu]





[GitHub] [incubator-mxnet] mk-61 commented on a change in pull request #20635: Port convolutions to cuDNN v8 API

Posted by GitBox <gi...@apache.org>.
mk-61 commented on a change in pull request #20635:
URL: https://github.com/apache/incubator-mxnet/pull/20635#discussion_r747698022



##########
File path: src/operator/cudnn_ops.h
##########
@@ -0,0 +1,255 @@
+/* ... (same cudnn_ops.h license header and includes as quoted above) ... */
+namespace mxnet {
+namespace tuple_util {

Review comment:
       Just by default, I try to make things as local/private as possible. For now I'm treating it as an implementation detail of cudnn_ops; I would have put it into cudnn_ops.cc had it not been used in cudnn_ops.h.
   
   On the other hand, there's nothing cuDNN-related here. So, if we do want to make it generic and reusable, the proper place would probably be dmlc, somewhere close to where dmlc::HashCombine is defined. I can prepare a separate PR to move it there.
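   
   For context, a sketch of the kind of tuple hashing helper under discussion (illustrative; the name and exact signature are assumptions, not the PR's code):
   ```
   #include <cstddef>
   #include <tuple>
   
   #include <dmlc/common.h>  // dmlc::HashCombine
   
   namespace tuple_util {
   // Fold dmlc::HashCombine over every element of a std::tuple (C++17).
   template <typename... Ts>
   size_t Hash(const std::tuple<Ts...>& t) {
     size_t seed = 0;
     std::apply([&seed](const auto&... vs) { ((seed = dmlc::HashCombine(seed, vs)), ...); }, t);
     return seed;
   }
   }  // namespace tuple_util
   ```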







[GitHub] [incubator-mxnet] ptrendx commented on a change in pull request #20635: Port convolutions to cuDNN v8 API

Posted by GitBox <gi...@apache.org>.
ptrendx commented on a change in pull request #20635:
URL: https://github.com/apache/incubator-mxnet/pull/20635#discussion_r747720842



##########
File path: src/operator/cudnn_ops.h
##########
@@ -0,0 +1,255 @@
+/* ... (same cudnn_ops.h license header and includes as quoted above) ... */
+namespace mxnet {
+namespace tuple_util {

Review comment:
       I think that's fine. dmlc-core is a separate repo, so it's probably unnecessary to move this there.







[GitHub] [incubator-mxnet] DickJC123 commented on a change in pull request #20635: Port convolutions to cuDNN v8 API

Posted by GitBox <gi...@apache.org>.
DickJC123 commented on a change in pull request #20635:
URL: https://github.com/apache/incubator-mxnet/pull/20635#discussion_r746768324



##########
File path: src/common/cudnn_cxx.cc
##########
@@ -0,0 +1,336 @@
+/*

Review comment:
       Shouldn't ./operator/nn/cudnn/cudnn_deconvolution-inl.h be removed?  I don't believe it's referenced.







[GitHub] [incubator-mxnet] mk-61 commented on pull request #20635: Port convolutions to cuDNN v8 API

Posted by GitBox <gi...@apache.org>.
mk-61 commented on pull request #20635:
URL: https://github.com/apache/incubator-mxnet/pull/20635#issuecomment-947040787


   @mxnet-bot run ci [unix-cpu]





[GitHub] [incubator-mxnet] mxnet-bot commented on pull request #20635: Port convolutions to cuDNN v8 API

Posted by GitBox <gi...@apache.org>.
mxnet-bot commented on pull request #20635:
URL: https://github.com/apache/incubator-mxnet/pull/20635#issuecomment-949325937


   Jenkins CI successfully triggered : [centos-gpu]





[GitHub] [incubator-mxnet] mk-61 commented on a change in pull request #20635: Port convolutions to cuDNN v8 API

Posted by GitBox <gi...@apache.org>.
mk-61 commented on a change in pull request #20635:
URL: https://github.com/apache/incubator-mxnet/pull/20635#discussion_r744999498



##########
File path: src/operator/cudnn_ops.cc
##########
@@ -0,0 +1,765 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+/*!
+ * Copyright (c) 2021 by Contributors
+ * \file  cudnn_ops.cc
+ * \brief cuDNN v8 ops
+ */
+
+#include "cudnn_ops.h"
+
+#include <mxnet/base.h>
+#if MXNET_USE_CUDNN == 1
+
+#include <dmlc/parameter.h>
+
+#include <algorithm>
+#include <cstdlib>
+#include <iomanip>
+#include <iterator>
+#include <limits>
+#include <numeric>
+#include <sstream>
+#include <string>
+#include <utility>
+
+namespace mxnet {
+namespace op {
+
+using cudnn_cxx::Descriptor;
+using cudnn_cxx::GetAttr;
+using cudnn_cxx::GetSomeAttrs;
+using cudnn_cxx::IsCompatible;
+using cudnn_cxx::MakeAvgSampler;
+using cudnn_cxx::MakeFinalized;
+using cudnn_cxx::PackedStrides;
+using cudnn_cxx::PlanStr;
+
+namespace cudnn {
+
+cudnnDataType_t CudnnType(mshadow::TypeFlag dtype) {
+  static std::unordered_map<mshadow::TypeFlag, cudnnDataType_t> type_map {
+    {mshadow::kFloat32, CUDNN_DATA_FLOAT}, {mshadow::kFloat64, CUDNN_DATA_DOUBLE},
+        {mshadow::kFloat16, CUDNN_DATA_HALF}, {mshadow::kUint8, CUDNN_DATA_UINT8},
+        {mshadow::kInt8, CUDNN_DATA_INT8}, {mshadow::kInt32, CUDNN_DATA_INT32},
+#if CUDNN_VERSION >= 8100
+        {mshadow::kInt64, CUDNN_DATA_INT64},
+#endif  // CUDNN_VERSION >= 8100
+  };
+  auto it = type_map.find(dtype);
+  CHECK(it != type_map.end()) << "Unsupported type: " << dtype;
+  return it->second;
+}
+
+std::vector<size_t> LayoutInfo::Order() const {
+  std::vector<size_t> ret(n_space_dims + 2);
+  std::iota(ret.begin(), ret.end(), 0);
+  if (channel_last)
+    std::rotate(ret.begin() + 1, ret.begin() + 2, ret.end());
+  return ret;
+}
+
+size_t LayoutInfo::ChannelIdx() const {
+  return channel_last ? 1 + n_space_dims : 1;
+}
+
+std::vector<int64_t> LayoutInfo::Strides(const std::vector<int64_t>& dims) const {
+  return PackedStrides(Order(), dims);
+}
+
+LayoutInfo GetLayoutInfo(mshadow::LayoutFlag layout) {
+  static std::unordered_map<mshadow::LayoutFlag, LayoutInfo> layout_map{
+      {mshadow::kNCW, {1, false}},
+      {mshadow::kNWC, {1, true}},
+      {mshadow::kNCHW, {2, false}},
+      {mshadow::kNHWC, {2, true}},
+      {mshadow::kNCDHW, {3, false}},
+      {mshadow::kNDHWC, {3, true}},
+  };
+  auto it = layout_map.find(layout);
+  CHECK(it != layout_map.end()) << "Unsupported layout: " << layout;
+  return it->second;
+}
+
+TShape ExpandChannelDims(mshadow::LayoutFlag layout, int c) {
+  auto li = GetLayoutInfo(layout);
+  std::vector<int> dims(li.n_space_dims + 2, 1);
+  dims[li.ChannelIdx()] = c;
+  return TShape(dims.begin(), dims.end());
+}
+
+std::vector<size_t> ReverseOrder(const std::vector<size_t>& o) {
+  std::vector<size_t> ret(o.size());
+  for (size_t i = 0; i < ret.size(); ++i)
+    ret[o[i]] = i;
+  return ret;
+}
+
+std::vector<cudnnBackendNumericalNote_t> RequireNumerics() {
+  std::vector<cudnnBackendNumericalNote_t> ret;
+  return ret;
+}
+
+std::vector<cudnnBackendNumericalNote_t> ExcludeNumerics() {
+  std::vector<cudnnBackendNumericalNote_t> ret;
+  if (!dmlc::GetEnv("MXNET_CUDA_ALLOW_TENSOR_CORE", true))
+    ret.push_back(CUDNN_NUMERICAL_NOTE_TENSOR_CORE);
+  if (!dmlc::GetEnv("MXNET_CUDA_TENSOR_OP_MATH_ALLOW_CONVERSION", false))
+    ret.push_back(CUDNN_NUMERICAL_NOTE_DOWN_CONVERT_INPUTS);
+  if (!dmlc::GetEnv("MXNET_CUDNN_ALLOW_REDUCED_PRECISION_REDUCTION", true))
+    ret.push_back(CUDNN_NUMERICAL_NOTE_REDUCED_PRECISION_REDUCTION);
+  if (!dmlc::GetEnv("MXNET_CUDNN_ALLOW_FFT", true))
+    ret.push_back(CUDNN_NUMERICAL_NOTE_FFT);
+  if (dmlc::GetEnv("MXNET_ENFORCE_DETERMINISM", false))
+    ret.push_back(CUDNN_NUMERICAL_NOTE_NONDETERMINISTIC);
+  if (!dmlc::GetEnv("MXNET_CUDNN_ALLOW_WINOGRAD", true))
+    ret.push_back(CUDNN_NUMERICAL_NOTE_WINOGRAD);
+  return ret;
+}
+
+Descriptor MakeTensorDesc(int64_t uid,
+                          cudnnDataType_t dtype,
+                          const std::vector<int64_t>& dims,
+                          const std::vector<int64_t>& strides,
+                          bool is_virtual) {
+  int64_t alignment = 16;  // TODO(vcherepanov): ?
+  return MakeFinalized(CUDNN_BACKEND_TENSOR_DESCRIPTOR,
+                       CUDNN_ATTR_TENSOR_UNIQUE_ID,
+                       uid,
+                       CUDNN_ATTR_TENSOR_DATA_TYPE,
+                       dtype,
+                       CUDNN_ATTR_TENSOR_BYTE_ALIGNMENT,
+                       alignment,
+                       CUDNN_ATTR_TENSOR_DIMENSIONS,
+                       dims,
+                       CUDNN_ATTR_TENSOR_STRIDES,
+                       strides,
+                       CUDNN_ATTR_TENSOR_IS_VIRTUAL,
+                       is_virtual);
+}
+
+Descriptor MakeTensorDesc(int64_t uid,
+                          const TBlob& blob,
+                          const LayoutInfo& li,
+                          bool expand_1d,
+                          bool is_virtual) {
+  std::vector<int64_t> dims(blob.shape_.ndim());
+  CHECK_EQ(dims.size(), li.n_space_dims + 2);
+  auto rev_order = ReverseOrder(li.Order());
+  for (size_t i = 0; i < dims.size(); ++i)
+    dims[i] = blob.shape_[rev_order[i]];
+  auto strides = li.Strides(dims);
+  if (li.n_space_dims == 1 && expand_1d) {
+    dims.insert(dims.begin() + 2, 1);
+    std::vector<size_t> order(dims.size());
+    std::iota(order.begin(), order.end(), 0);
+    if (li.channel_last)
+      std::rotate(order.begin() + 1, order.begin() + 2, order.end());
+    strides = PackedStrides(order, dims);
+  }
+  return MakeTensorDesc(
+      uid, CudnnType(static_cast<mshadow::TypeFlag>(blob.type_flag_)), dims, strides, is_virtual);
+}
+
+Descriptor MakeCTensorDescExpandDims(int64_t uid,
+                                     const TBlob& b,
+                                     const LayoutInfo& li,
+                                     bool is_virtual) {
+  std::vector<int64_t> dims(li.n_space_dims + 2, 1);
+  dims[1]    = b.shape_[0];
+  auto dtype = CudnnType(static_cast<mshadow::TypeFlag>(b.type_flag_));
+  return MakeTensorDesc(uid, dtype, dims, li.Strides(dims), is_virtual);
+}
+
+Descriptor MakeConvDesc(const ConvParam& param, mshadow::TypeFlag dtype) {
+  int64_t sdims = param.kernel.ndim();
+  std::vector<int64_t> stride(param.stride.begin(), param.stride.end());
+  std::vector<int64_t> dilate(param.dilate.begin(), param.dilate.end());
+  std::vector<int64_t> pad(param.pad.begin(), param.pad.end());
+
+  auto comp_type = CudnnType(dtype);
+  if (comp_type == CUDNN_DATA_HALF)
+    comp_type = CUDNN_DATA_FLOAT;
+
+  if (sdims == 1) {
+    // TODO(vcherepanov): remove this once cuDNN properly supports 1D convolutions.
+    // For now, making spatial dims 2D: 1 x W.
+    ++sdims;
+    stride.insert(stride.begin(), 1);
+    dilate.insert(dilate.begin(), 1);
+    pad.insert(pad.begin(), 0);
+  }
+  return MakeFinalized(CUDNN_BACKEND_CONVOLUTION_DESCRIPTOR,
+                       CUDNN_ATTR_CONVOLUTION_SPATIAL_DIMS,
+                       sdims,
+                       CUDNN_ATTR_CONVOLUTION_COMP_TYPE,
+                       comp_type,
+                       CUDNN_ATTR_CONVOLUTION_CONV_MODE,
+                       CUDNN_CROSS_CORRELATION,
+                       CUDNN_ATTR_CONVOLUTION_FILTER_STRIDES,
+                       stride,
+                       CUDNN_ATTR_CONVOLUTION_DILATIONS,
+                       dilate,
+                       CUDNN_ATTR_CONVOLUTION_PRE_PADDINGS,
+                       pad,
+                       CUDNN_ATTR_CONVOLUTION_POST_PADDINGS,
+                       pad);
+}
+
+Descriptor MakeConvFwdOp(const Descriptor& conv,
+                         const Descriptor& x,
+                         const Descriptor& w,
+                         const Descriptor& y,
+                         bool add_to) {
+  auto ret = Make(CUDNN_BACKEND_OPERATION_CONVOLUTION_FORWARD_DESCRIPTOR,
+                  CUDNN_ATTR_OPERATION_CONVOLUTION_FORWARD_CONV_DESC,
+                  conv,
+                  CUDNN_ATTR_OPERATION_CONVOLUTION_FORWARD_X,
+                  x,
+                  CUDNN_ATTR_OPERATION_CONVOLUTION_FORWARD_W,
+                  w,
+                  CUDNN_ATTR_OPERATION_CONVOLUTION_FORWARD_Y,
+                  y);
+  if (GetAttr<cudnnDataType_t>(x, CUDNN_ATTR_TENSOR_DATA_TYPE) == CUDNN_DATA_DOUBLE) {
+    SetAttrs(ret,
+             CUDNN_ATTR_OPERATION_CONVOLUTION_FORWARD_ALPHA,
+             1.0,
+             CUDNN_ATTR_OPERATION_CONVOLUTION_FORWARD_BETA,
+             add_to ? 1.0 : 0.0);
+  } else {
+    SetAttrs(ret,
+             CUDNN_ATTR_OPERATION_CONVOLUTION_FORWARD_ALPHA,
+             1.0f,
+             CUDNN_ATTR_OPERATION_CONVOLUTION_FORWARD_BETA,
+             add_to ? 1.0f : 0.0f);
+  }
+  CUDNN_CALL(cudnnBackendFinalize(ret.get()));
+  return ret;
+}
+
+Descriptor MakeConvDgradOp(const Descriptor& conv,
+                           const Descriptor& w,
+                           const Descriptor& dy,
+                           const Descriptor& dx,
+                           bool add_to) {
+  auto ret = Make(CUDNN_BACKEND_OPERATION_CONVOLUTION_BACKWARD_DATA_DESCRIPTOR,
+                  CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_DATA_CONV_DESC,
+                  conv,
+                  CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_DATA_W,
+                  w,
+                  CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_DATA_DY,
+                  dy,
+                  CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_DATA_DX,
+                  dx);
+  if (GetAttr<cudnnDataType_t>(w, CUDNN_ATTR_TENSOR_DATA_TYPE) == CUDNN_DATA_DOUBLE) {
+    SetAttrs(ret,
+             CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_DATA_ALPHA,
+             1.0,
+             CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_DATA_BETA,
+             add_to ? 1.0 : 0.0);
+  } else {
+    SetAttrs(ret,
+             CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_DATA_ALPHA,
+             1.0f,
+             CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_DATA_BETA,
+             add_to ? 1.0f : 0.0f);
+  }
+  CUDNN_CALL(cudnnBackendFinalize(ret.get()));
+  return ret;
+}
+
+Descriptor MakeConvWgradOp(const Descriptor& conv,
+                           const Descriptor& x,
+                           const Descriptor& dy,
+                           const Descriptor& dw,
+                           bool add_to) {
+  auto ret = Make(CUDNN_BACKEND_OPERATION_CONVOLUTION_BACKWARD_FILTER_DESCRIPTOR,
+                  CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_FILTER_CONV_DESC,
+                  conv,
+                  CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_FILTER_X,
+                  x,
+                  CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_FILTER_DY,
+                  dy,
+                  CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_FILTER_DW,
+                  dw);
+  if (GetAttr<cudnnDataType_t>(x, CUDNN_ATTR_TENSOR_DATA_TYPE) == CUDNN_DATA_DOUBLE) {
+    SetAttrs(ret,
+             CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_FILTER_ALPHA,
+             1.0,
+             CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_FILTER_BETA,
+             add_to ? 1.0 : 0.0);
+  } else {
+    SetAttrs(ret,
+             CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_FILTER_ALPHA,
+             1.0f,
+             CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_FILTER_BETA,
+             add_to ? 1.0f : 0.0f);
+  }
+  CUDNN_CALL(cudnnBackendFinalize(ret.get()));
+  return ret;
+}
+
+Descriptor MakeOpGraph(cudnnHandle_t handle, const std::vector<Descriptor>& ops) {
+  return MakeFinalized(CUDNN_BACKEND_OPERATIONGRAPH_DESCRIPTOR,
+                       CUDNN_ATTR_OPERATIONGRAPH_HANDLE,
+                       handle,
+                       CUDNN_ATTR_OPERATIONGRAPH_OPS,
+                       ops);
+}
+
+ConvParam::ConvParam(const ConvolutionParam& p, bool add_to)
+    : kernel(p.kernel),
+      stride(p.stride),
+      dilate(p.dilate),
+      pad(p.pad),
+      num_filter(p.num_filter),
+      num_group(p.num_group),
+      workspace(p.workspace),
+      cudnn_tune(p.cudnn_tune),
+      layout(p.layout),
+      add_to(add_to) {}
+
+ConvParam::ConvParam(const DeconvolutionParam& p, bool add_to)
+    : kernel(p.kernel),
+      stride(p.stride),
+      dilate(p.dilate),
+      pad(p.pad),
+      num_filter(p.num_filter),
+      num_group(p.num_group),
+      workspace(p.workspace),
+      cudnn_tune(p.cudnn_tune),
+      layout(p.layout),
+      add_to(add_to) {}
+
+void TuneWarnOnce() {
+  thread_local bool done = false;
+  if (!done) {
+    LOG(INFO) << "Auto-tuning cuDNN op, set MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable";
+    done = true;
+  }
+}
+
+std::vector<Descriptor> MakeFallbackPlans(
+    const std::vector<int64_t>& ixs,
+    cudnnHandle_t handle,
+    const Descriptor& op_graph,
+    size_t workspace_limit,
+    size_t* max_workspace,
+    const std::unordered_set<int64_t>& excl_engines,
+    const std::vector<cudnnBackendNumericalNote_t>& req_numeric,
+    const std::vector<cudnnBackendNumericalNote_t>& excl_numeric
+#if CUDNN_VERSION >= 8200
+    ,
+    const std::vector<cudnnBackendBehaviorNote_t>& req_behavior,
+    const std::vector<cudnnBackendBehaviorNote_t>& excl_behavior
+#endif  // CUDNN_VERSION >= 8200
+) {
+  std::vector<Descriptor> plans;
+  if (max_workspace)
+    *max_workspace = 0;
+  for (auto ix : ixs) {
+    if (excl_engines.count(ix))
+      continue;
+    auto engine = Make(CUDNN_BACKEND_ENGINE_DESCRIPTOR,
+                       CUDNN_ATTR_ENGINE_OPERATION_GRAPH,
+                       op_graph,
+                       CUDNN_ATTR_ENGINE_GLOBAL_INDEX,
+                       ix);
+    auto err    = cudnnBackendFinalize(engine.get());
+    if (err == CUDNN_STATUS_NOT_SUPPORTED || err == CUDNN_STATUS_ARCH_MISMATCH)
+      continue;
+    if (err != CUDNN_STATUS_SUCCESS) {
+      LOG(WARNING) << "Unexpected cuDNN status: " << err << ": " << cudnnGetErrorString(err);
+      continue;
+    }
+    auto cfg =
+        MakeFinalized(CUDNN_BACKEND_ENGINECFG_DESCRIPTOR, CUDNN_ATTR_ENGINECFG_ENGINE, engine);
+    auto plan = Make(CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR,
+                     CUDNN_ATTR_EXECUTION_PLAN_HANDLE,
+                     handle,
+                     CUDNN_ATTR_EXECUTION_PLAN_ENGINE_CONFIG,
+                     cfg);
+    err       = cudnnBackendFinalize(plan.get());
+    if (err == CUDNN_STATUS_NOT_SUPPORTED || err == CUDNN_STATUS_ARCH_MISMATCH)
+      continue;
+    if (err != CUDNN_STATUS_SUCCESS) {
+      LOG(WARNING) << "Unexpected cuDNN status: " << err << ": " << cudnnGetErrorString(err);
+      continue;
+    }
+    auto workspace = GetAttr<int64_t>(plan, CUDNN_ATTR_EXECUTION_PLAN_WORKSPACE_SIZE);
+    if (workspace > workspace_limit)
+      continue;
+    auto numerical = GetSomeAttrs<cudnnBackendNumericalNote_t>(
+        CUDNN_NUMERICAL_NOTE_TYPE_COUNT, engine, CUDNN_ATTR_ENGINE_NUMERICAL_NOTE);
+    if (!IsCompatible(numerical, req_numeric, excl_numeric))
+      continue;
+#if CUDNN_VERSION >= 8200
+    auto behavior = GetSomeAttrs<cudnnBackendBehaviorNote_t>(
+        CUDNN_BEHAVIOR_NOTE_TYPE_COUNT, engine, CUDNN_ATTR_ENGINE_BEHAVIOR_NOTE);
+    if (!IsCompatible(behavior, req_behavior, excl_behavior))
+      continue;
+#endif  // CUDNN_VERSION >= 8200
+    plans.push_back(std::move(plan));
+    if (max_workspace)
+      *max_workspace = std::max(*max_workspace, static_cast<size_t>(workspace));
+  }
+  return plans;
+}
+
+cudnnBackendHeurMode_t HeurMode() {
+#if CUDNN_VERSION >= 8100
+  auto minor       = cudnnGetVersion() / 100 % 10;
+  int default_mode = minor < 2 ? CUDNN_HEUR_MODE_INSTANT : CUDNN_HEUR_MODE_B;

Review comment:
       I fixed the code to account for future major version numbers.
   The current logic: MODE_B is the default if both the compile-time and runtime versions are at least 8100 (i.e., 8.1.0); otherwise it's MODE_INSTANT.
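   
   A minimal sketch of that version check, assuming the logic described above (not the PR's exact code):
   ```
   #include <cudnn.h>
   
   cudnnBackendHeurMode_t HeurMode() {
   #if CUDNN_VERSION >= 8100  // compile-time gate
     // cudnnGetVersion() reports the runtime library version, e.g. 8100 for 8.1.0.
     if (cudnnGetVersion() >= 8100)
       return CUDNN_HEUR_MODE_B;
   #endif
     return CUDNN_HEUR_MODE_INSTANT;
   }
   ```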







[GitHub] [incubator-mxnet] marcoabreu commented on pull request #20635: Port convolutions to cuDNN v8 API

Posted by GitBox <gi...@apache.org>.
marcoabreu commented on pull request #20635:
URL: https://github.com/apache/incubator-mxnet/pull/20635#issuecomment-962427479


   The number of environment variables seems a bit excessive to me (although I like that a lot more than preprocessor directives). Is there no neater way? I can't imagine most people playing around with a dozen environment variables just to get the desired result. Also, this basically makes the code untestable due to the sheer number of combinations. 





[GitHub] [incubator-mxnet] mk-61 commented on pull request #20635: Port convolutions to cuDNN v8 API

Posted by GitBox <gi...@apache.org>.
mk-61 commented on pull request #20635:
URL: https://github.com/apache/incubator-mxnet/pull/20635#issuecomment-962546584


   > The number of environment variables seems a bit excessive to me (although I like that a lot more than preprocessor directives). Is there no neater way? I can't imagine most people playing around with a dozen environment variables just to get the desired result. Also, this basically makes the code untestable due to the sheer number of combinations.
   
   It is expected that in the vast majority of cases users can just leave all the new environment variables alone - the defaults have been chosen with care. However, in some corner cases it can be useful to have a bit more control, so I was wary of just hard-coding the defaults. Please note that not all of these variables are new - in some cases I reused already existing ones.






[GitHub] [incubator-mxnet] mk-61 commented on pull request #20635: Port convolutions to cuDNN v8 API

Posted by GitBox <gi...@apache.org>.
mk-61 commented on pull request #20635:
URL: https://github.com/apache/incubator-mxnet/pull/20635#issuecomment-937997854


   @mxnet-bot run ci [centos-gpu]





[GitHub] [incubator-mxnet] mk-61 commented on pull request #20635: Port convolutions to cuDNN v8 API

Posted by GitBox <gi...@apache.org>.
mk-61 commented on pull request #20635:
URL: https://github.com/apache/incubator-mxnet/pull/20635#issuecomment-966567429


   @mxnet-bot run ci [unix-cpu]






[GitHub] [incubator-mxnet] mk-61 commented on pull request #20635: Port convolutions to cuDNN v8 API

Posted by GitBox <gi...@apache.org>.
mk-61 commented on pull request #20635:
URL: https://github.com/apache/incubator-mxnet/pull/20635#issuecomment-946113577


   @mxnet-bot run ci [centos-gpu, unix-gpu]





[GitHub] [incubator-mxnet] mk-61 commented on pull request #20635: Port convolutions to cuDNN v8 API

Posted by GitBox <gi...@apache.org>.
mk-61 commented on pull request #20635:
URL: https://github.com/apache/incubator-mxnet/pull/20635#issuecomment-946266034


   @mxnet-bot run ci [centos-gpu, unix-gpu, windows-gpu]





[GitHub] [incubator-mxnet] mxnet-bot commented on pull request #20635: Port convolutions to cuDNN v8 API

Posted by GitBox <gi...@apache.org>.
mxnet-bot commented on pull request #20635:
URL: https://github.com/apache/incubator-mxnet/pull/20635#issuecomment-946326785


   Jenkins CI successfully triggered : [centos-gpu, unix-gpu]





[GitHub] [incubator-mxnet] mk-61 commented on pull request #20635: Port convolutions to cuDNN v8 API

Posted by GitBox <gi...@apache.org>.
mk-61 commented on pull request #20635:
URL: https://github.com/apache/incubator-mxnet/pull/20635#issuecomment-946019607


   @mxnet-bot run ci [centos-gpu, unix-gpu, windows-gpu]





[GitHub] [incubator-mxnet] mk-61 commented on pull request #20635: Port convolutions to cuDNN v8 API

Posted by GitBox <gi...@apache.org>.
mk-61 commented on pull request #20635:
URL: https://github.com/apache/incubator-mxnet/pull/20635#issuecomment-946926302


   @mxnet-bot run ci [centos-cpu, clang, edge, miscellaneous, unix-cpu, website, windows-cpu]





[GitHub] [incubator-mxnet] ptrendx commented on pull request #20635: Port convolutions to cuDNN v8 API

Posted by GitBox <gi...@apache.org>.
ptrendx commented on pull request #20635:
URL: https://github.com/apache/incubator-mxnet/pull/20635#issuecomment-966460678


   I agree that the number of env variables is quite large. The best solution, I think, would be to have some singleton object storing all settings that the user could write to, e.g. something like
   ```
   mxnet.settings.cudnn.only_deterministic_algos = True
   ```
   That said, this could not be a simple Python struct, as it would need to be connected to the backend (otherwise there would be issues like needing to drain the engine before setting it, to avoid races, and potentially having to grab a lock when accessing the values). I think the best way of doing such a thing would be to have it as a resource (similar to workspace or random state). For all affected operators it could be read-only (in order to not introduce false dependencies), and setting the value on the user side would invoke an engine operation that writes to it.
   
   That said, this would require changing the resource implementation on the backend side (currently all resources are used as mutated engine vars; we would need to allow resource accesses to be marked read-only), so it is probably not in the scope of this PR. It would be a nice improvement to the MXNet API, though, I believe, as a lot of env variables could be dropped that way.
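   
   For illustration, a minimal sketch of such a backend-side settings singleton (names assumed; the engine/resource integration described above is omitted):
   ```
   #include <atomic>
   
   // Hypothetical process-wide cuDNN settings object (not an existing MXNet API).
   struct CudnnSettings {
     std::atomic<bool> only_deterministic_algos{false};
   
     static CudnnSettings& Get() {
       static CudnnSettings inst;  // thread-safe initialization since C++11
       return inst;
     }
   };
   
   // A frontend setter would route through the engine to avoid races, e.g.:
   //   CudnnSettings::Get().only_deterministic_algos = true;
   ```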





[GitHub] [incubator-mxnet] mk-61 commented on pull request #20635: Port convolutions to cuDNN v8 API

Posted by GitBox <gi...@apache.org>.
mk-61 commented on pull request #20635:
URL: https://github.com/apache/incubator-mxnet/pull/20635#issuecomment-966605582


   @mxnet-bot run ci [miscellaneous, unix-gpu]





[GitHub] [incubator-mxnet] mk-61 commented on pull request #20635: Port convolutions to cuDNN v8 API

Posted by GitBox <gi...@apache.org>.
mk-61 commented on pull request #20635:
URL: https://github.com/apache/incubator-mxnet/pull/20635#issuecomment-967552837


   @mxnet-bot run ci [unix-gpu]





[GitHub] [incubator-mxnet] DickJC123 commented on pull request #20635: Port convolutions to cuDNN v8 API

Posted by GitBox <gi...@apache.org>.
DickJC123 commented on pull request #20635:
URL: https://github.com/apache/incubator-mxnet/pull/20635#issuecomment-963747883


   > The number of environment variables seems a bit excessive to me (although I like that a lot more than preprocessor directives). Is there no neater way? I can't imagine most people playing around with a dozen environment variables just to get the desired result. Also, this basically makes the code untestable due to the sheer number of combinations.
   
   I'm not so worried about the combinations of the new env vars, since most control independent filters on the list of potential convolution configs. We could, however, think about dividing the env var descriptions into "basic" and "advanced" usage sections, so that new users don't feel overwhelmed when reviewing the list.





[GitHub] [incubator-mxnet] mxnet-bot commented on pull request #20635: Port convolutions to cuDNN v8 API

Posted by GitBox <gi...@apache.org>.
mxnet-bot commented on pull request #20635:
URL: https://github.com/apache/incubator-mxnet/pull/20635#issuecomment-963838596


   Jenkins CI successfully triggered : [windows-gpu]





[GitHub] [incubator-mxnet] mk-61 commented on pull request #20635: Port convolutions to cuDNN v8 API

Posted by GitBox <gi...@apache.org>.
mk-61 commented on pull request #20635:
URL: https://github.com/apache/incubator-mxnet/pull/20635#issuecomment-949241210


   @mxnet-bot run ci [centos-gpu]





[GitHub] [incubator-mxnet] ptrendx commented on a change in pull request #20635: Port convolutions to cuDNN v8 API

Posted by GitBox <gi...@apache.org>.
ptrendx commented on a change in pull request #20635:
URL: https://github.com/apache/incubator-mxnet/pull/20635#discussion_r746078540



##########
File path: src/common/cudnn_cxx.cc
##########
@@ -0,0 +1,336 @@
+/*

Review comment:
       I think it would be better if this file was inside src/common/cuda instead of top level common







[GitHub] [incubator-mxnet] mxnet-bot commented on pull request #20635: Port convolutions to cuDNN v8 API

Posted by GitBox <gi...@apache.org>.
mxnet-bot commented on pull request #20635:
URL: https://github.com/apache/incubator-mxnet/pull/20635#issuecomment-937997892


   Jenkins CI successfully triggered : [centos-gpu]









[GitHub] [incubator-mxnet] DickJC123 merged pull request #20635: Port convolutions to cuDNN v8 API

Posted by GitBox <gi...@apache.org>.
DickJC123 merged pull request #20635:
URL: https://github.com/apache/incubator-mxnet/pull/20635


   





[GitHub] [incubator-mxnet] mxnet-bot commented on pull request #20635: Port convolutions to cuDNN v8 API

Posted by GitBox <gi...@apache.org>.
mxnet-bot commented on pull request #20635:
URL: https://github.com/apache/incubator-mxnet/pull/20635#issuecomment-949241235


   Jenkins CI successfully triggered : [centos-gpu]





[GitHub] [incubator-mxnet] mk-61 commented on pull request #20635: Port convolutions to cuDNN v8 API

Posted by GitBox <gi...@apache.org>.
mk-61 commented on pull request #20635:
URL: https://github.com/apache/incubator-mxnet/pull/20635#issuecomment-949183707


   @mxnet-bot run ci [centos-gpu]





[GitHub] [incubator-mxnet] mxnet-bot commented on pull request #20635: Port convolutions to cuDNN v8 API

Posted by GitBox <gi...@apache.org>.
mxnet-bot commented on pull request #20635:
URL: https://github.com/apache/incubator-mxnet/pull/20635#issuecomment-966655736


   Jenkins CI successfully triggered : [unix-gpu]





[GitHub] [incubator-mxnet] mk-61 commented on pull request #20635: Port convolutions to cuDNN v8 API

Posted by GitBox <gi...@apache.org>.
mk-61 commented on pull request #20635:
URL: https://github.com/apache/incubator-mxnet/pull/20635#issuecomment-966655686


   @mxnet-bot run ci [unix-gpu]





[GitHub] [incubator-mxnet] DickJC123 commented on a change in pull request #20635: Port convolutions to cuDNN v8 API

Posted by GitBox <gi...@apache.org>.
DickJC123 commented on a change in pull request #20635:
URL: https://github.com/apache/incubator-mxnet/pull/20635#discussion_r744059733



##########
File path: include/mxnet/storage.h
##########
@@ -71,18 +71,18 @@ class Storage {
    * \param ctx Context information about the device and ID.
    * \return Handle struct.
    */
-  Handle Alloc(size_t size, Context ctx) {
+  Handle Alloc(size_t size, Context ctx, bool failsafe = false) {

Review comment:
       Where you've added the failsafe parameter (here and elsewhere), please add documentation. I suggest:
   
    * \param failsafe Return a handle with a null dptr if out of memory, rather than exit.
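   
   A hedged usage sketch of the proposed semantics (a null dptr on out-of-memory instead of a process exit; variable names assumed, not code from this PR):
   ```
   // Illustrative only: allocate an auto-tuning workspace without aborting
   // the process when the GPU is out of memory.
   Storage::Handle hdl = Storage::Get()->Alloc(workspace_bytes, ctx, /*failsafe=*/true);
   if (hdl.dptr == nullptr) {
     // Allocation failed: skip this candidate plan instead of exiting.
   } else {
     // ... measure the plan, then release the workspace ...
     Storage::Get()->Free(hdl);
   }
   ```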

##########
File path: src/operator/cudnn_ops.cc
##########
@@ -0,0 +1,765 @@
+/* ... (license header, includes, and code identical to the cudnn_ops.cc hunk quoted earlier, up to RequireNumerics) ... */
+
+std::vector<cudnnBackendNumericalNote_t> ExcludeNumerics() {
+  std::vector<cudnnBackendNumericalNote_t> ret;
+  if (!dmlc::GetEnv("MXNET_CUDA_ALLOW_TENSOR_CORE", true))
+    ret.push_back(CUDNN_NUMERICAL_NOTE_TENSOR_CORE);
+  if (!dmlc::GetEnv("MXNET_CUDA_TENSOR_OP_MATH_ALLOW_CONVERSION", false))
+    ret.push_back(CUDNN_NUMERICAL_NOTE_DOWN_CONVERT_INPUTS);
+  if (!dmlc::GetEnv("MXNET_CUDNN_ALLOW_REDUCED_PRECISION_REDUCTION", true))
+    ret.push_back(CUDNN_NUMERICAL_NOTE_REDUCED_PRECISION_REDUCTION);
+  if (!dmlc::GetEnv("MXNET_CUDNN_ALLOW_FFT", true))
+    ret.push_back(CUDNN_NUMERICAL_NOTE_FFT);
+  if (dmlc::GetEnv("MXNET_ENFORCE_DETERMINISM", false))
+    ret.push_back(CUDNN_NUMERICAL_NOTE_NONDETERMINISTIC);
+  if (!dmlc::GetEnv("MXNET_CUDNN_ALLOW_WINOGRAD", true))
+    ret.push_back(CUDNN_NUMERICAL_NOTE_WINOGRAD);
+  return ret;

Review comment:
       All supported env vars should have a description in ./docs/static_site/src/pages/api/faq/env_var.md
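
       For illustration, a sketch of what a couple of the new entries could look like, with
       defaults taken from ExcludeNumerics() above (the wording is a suggestion, not final
       doc text):

       * MXNET_CUDNN_ALLOW_FFT
         - Values: 0 or 1 (default: 1)
         - If set to 0, engines carrying the cuDNN FFT numerical note are excluded from plan selection.
       * MXNET_CUDNN_ALLOW_WINOGRAD
         - Values: 0 or 1 (default: 1)
         - If set to 0, engines carrying the cuDNN Winograd numerical note are excluded from plan selection.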

##########
File path: src/operator/cudnn_ops.cc
##########
@@ -0,0 +1,765 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+/*!
+ * Copyright (c) 2021 by Contributors
+ * \file  cudnn_ops.cc
+ * \brief cuDNN v8 ops
+ */

Review comment:
       Feel free to add an \author tag
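
       For example (handle taken from the TODO tags elsewhere in this file; the actual
       attribution is of course the author's call):

        * \author vcherepanov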

##########
File path: src/operator/cudnn_ops.cc
##########
@@ -0,0 +1,765 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+/*!
+ * Copyright (c) 2021 by Contributors
+ * \file  cudnn_ops.cc
+ * \brief cuDNN v8 ops
+ */
+
+#include "cudnn_ops.h"
+
+#include <mxnet/base.h>
+#if MXNET_USE_CUDNN == 1
+
+#include <dmlc/parameter.h>
+
+#include <algorithm>
+#include <cstdlib>
+#include <iomanip>
+#include <iterator>
+#include <limits>
+#include <numeric>
+#include <sstream>
+#include <string>
+#include <utility>
+
+namespace mxnet {
+namespace op {
+
+using cudnn_cxx::Descriptor;
+using cudnn_cxx::GetAttr;
+using cudnn_cxx::GetSomeAttrs;
+using cudnn_cxx::IsCompatible;
+using cudnn_cxx::MakeAvgSampler;
+using cudnn_cxx::MakeFinalized;
+using cudnn_cxx::PackedStrides;
+using cudnn_cxx::PlanStr;
+
+namespace cudnn {
+
+cudnnDataType_t CudnnType(mshadow::TypeFlag dtype) {
+  static std::unordered_map<mshadow::TypeFlag, cudnnDataType_t> type_map {
+    {mshadow::kFloat32, CUDNN_DATA_FLOAT}, {mshadow::kFloat64, CUDNN_DATA_DOUBLE},
+        {mshadow::kFloat16, CUDNN_DATA_HALF}, {mshadow::kUint8, CUDNN_DATA_UINT8},
+        {mshadow::kInt8, CUDNN_DATA_INT8}, {mshadow::kInt32, CUDNN_DATA_INT32},
+#if CUDNN_VERSION >= 8100
+        {mshadow::kInt64, CUDNN_DATA_INT64},
+#endif  // CUDNN_VERSION >= 8100
+  };
+  auto it = type_map.find(dtype);
+  CHECK(it != type_map.end()) << "Unsupported type: " << dtype;
+  return it->second;
+}
+
+std::vector<size_t> LayoutInfo::Order() const {
+  std::vector<size_t> ret(n_space_dims + 2);
+  std::iota(ret.begin(), ret.end(), 0);
+  if (channel_last)
+    std::rotate(ret.begin() + 1, ret.begin() + 2, ret.end());
+  return ret;
+}
+
+size_t LayoutInfo::ChannelIdx() const {
+  return channel_last ? 1 + n_space_dims : 1;
+}
+
+std::vector<int64_t> LayoutInfo::Strides(const std::vector<int64_t>& dims) const {
+  return PackedStrides(Order(), dims);
+}
+
+LayoutInfo GetLayoutInfo(mshadow::LayoutFlag layout) {
+  static std::unordered_map<mshadow::LayoutFlag, LayoutInfo> layout_map{
+      {mshadow::kNCW, {1, false}},
+      {mshadow::kNWC, {1, true}},
+      {mshadow::kNCHW, {2, false}},
+      {mshadow::kNHWC, {2, true}},
+      {mshadow::kNCDHW, {3, false}},
+      {mshadow::kNDHWC, {3, true}},
+  };
+  auto it = layout_map.find(layout);
+  CHECK(it != layout_map.end()) << "Unsupported layout: " << layout;
+  return it->second;
+}
+
+TShape ExpandChannelDims(mshadow::LayoutFlag layout, int c) {
+  auto li = GetLayoutInfo(layout);
+  std::vector<int> dims(li.n_space_dims + 2, 1);
+  dims[li.ChannelIdx()] = c;
+  return TShape(dims.begin(), dims.end());
+}
+
+std::vector<size_t> ReverseOrder(const std::vector<size_t>& o) {
+  std::vector<size_t> ret(o.size());
+  for (size_t i = 0; i < ret.size(); ++i)
+    ret[o[i]] = i;
+  return ret;
+}
+
+std::vector<cudnnBackendNumericalNote_t> RequireNumerics() {
+  std::vector<cudnnBackendNumericalNote_t> ret;
+  return ret;
+}
+
+std::vector<cudnnBackendNumericalNote_t> ExcludeNumerics() {
+  std::vector<cudnnBackendNumericalNote_t> ret;
+  if (!dmlc::GetEnv("MXNET_CUDA_ALLOW_TENSOR_CORE", true))
+    ret.push_back(CUDNN_NUMERICAL_NOTE_TENSOR_CORE);
+  if (!dmlc::GetEnv("MXNET_CUDA_TENSOR_OP_MATH_ALLOW_CONVERSION", false))
+    ret.push_back(CUDNN_NUMERICAL_NOTE_DOWN_CONVERT_INPUTS);
+  if (!dmlc::GetEnv("MXNET_CUDNN_ALLOW_REDUCED_PRECISION_REDUCTION", true))
+    ret.push_back(CUDNN_NUMERICAL_NOTE_REDUCED_PRECISION_REDUCTION);
+  if (!dmlc::GetEnv("MXNET_CUDNN_ALLOW_FFT", true))
+    ret.push_back(CUDNN_NUMERICAL_NOTE_FFT);
+  if (dmlc::GetEnv("MXNET_ENFORCE_DETERMINISM", false))
+    ret.push_back(CUDNN_NUMERICAL_NOTE_NONDETERMINISTIC);
+  if (!dmlc::GetEnv("MXNET_CUDNN_ALLOW_WINOGRAD", true))
+    ret.push_back(CUDNN_NUMERICAL_NOTE_WINOGRAD);
+  return ret;
+}
+
+Descriptor MakeTensorDesc(int64_t uid,
+                          cudnnDataType_t dtype,
+                          const std::vector<int64_t>& dims,
+                          const std::vector<int64_t>& strides,
+                          bool is_virtual) {
+  int64_t alignment = 16;  // TODO(vcherepanov): ?
+  return MakeFinalized(CUDNN_BACKEND_TENSOR_DESCRIPTOR,
+                       CUDNN_ATTR_TENSOR_UNIQUE_ID,
+                       uid,
+                       CUDNN_ATTR_TENSOR_DATA_TYPE,
+                       dtype,
+                       CUDNN_ATTR_TENSOR_BYTE_ALIGNMENT,
+                       alignment,
+                       CUDNN_ATTR_TENSOR_DIMENSIONS,
+                       dims,
+                       CUDNN_ATTR_TENSOR_STRIDES,
+                       strides,
+                       CUDNN_ATTR_TENSOR_IS_VIRTUAL,
+                       is_virtual);
+}
+
+Descriptor MakeTensorDesc(int64_t uid,
+                          const TBlob& blob,
+                          const LayoutInfo& li,
+                          bool expand_1d,
+                          bool is_virtual) {
+  std::vector<int64_t> dims(blob.shape_.ndim());
+  CHECK_EQ(dims.size(), li.n_space_dims + 2);
+  auto rev_order = ReverseOrder(li.Order());
+  for (size_t i = 0; i < dims.size(); ++i)
+    dims[i] = blob.shape_[rev_order[i]];
+  auto strides = li.Strides(dims);
+  if (li.n_space_dims == 1 && expand_1d) {
+    dims.insert(dims.begin() + 2, 1);
+    std::vector<size_t> order(dims.size());
+    std::iota(order.begin(), order.end(), 0);
+    if (li.channel_last)
+      std::rotate(order.begin() + 1, order.begin() + 2, order.end());
+    strides = PackedStrides(order, dims);
+  }
+  return MakeTensorDesc(
+      uid, CudnnType(static_cast<mshadow::TypeFlag>(blob.type_flag_)), dims, strides, is_virtual);
+}
+
+Descriptor MakeCTensorDescExpandDims(int64_t uid,
+                                     const TBlob& b,
+                                     const LayoutInfo& li,
+                                     bool is_virtual) {
+  std::vector<int64_t> dims(li.n_space_dims + 2, 1);
+  dims[1]    = b.shape_[0];
+  auto dtype = CudnnType(static_cast<mshadow::TypeFlag>(b.type_flag_));
+  return MakeTensorDesc(uid, dtype, dims, li.Strides(dims), is_virtual);
+}
+
+Descriptor MakeConvDesc(const ConvParam& param, mshadow::TypeFlag dtype) {
+  int64_t sdims = param.kernel.ndim();
+  std::vector<int64_t> stride(param.stride.begin(), param.stride.end());
+  std::vector<int64_t> dilate(param.dilate.begin(), param.dilate.end());
+  std::vector<int64_t> pad(param.pad.begin(), param.pad.end());
+
+  auto comp_type = CudnnType(dtype);
+  if (comp_type == CUDNN_DATA_HALF)
+    comp_type = CUDNN_DATA_FLOAT;
+
+  if (sdims == 1) {
+    // TODO(vcherepanov): remove this once cuDNN properly supports 1D convolutions.
+    // For now, making spatial dims 2D: 1 x W.
+    ++sdims;
+    stride.insert(stride.begin(), 1);
+    dilate.insert(dilate.begin(), 1);
+    pad.insert(pad.begin(), 0);
+  }
+  return MakeFinalized(CUDNN_BACKEND_CONVOLUTION_DESCRIPTOR,
+                       CUDNN_ATTR_CONVOLUTION_SPATIAL_DIMS,
+                       sdims,
+                       CUDNN_ATTR_CONVOLUTION_COMP_TYPE,
+                       comp_type,
+                       CUDNN_ATTR_CONVOLUTION_CONV_MODE,
+                       CUDNN_CROSS_CORRELATION,
+                       CUDNN_ATTR_CONVOLUTION_FILTER_STRIDES,
+                       stride,
+                       CUDNN_ATTR_CONVOLUTION_DILATIONS,
+                       dilate,
+                       CUDNN_ATTR_CONVOLUTION_PRE_PADDINGS,
+                       pad,
+                       CUDNN_ATTR_CONVOLUTION_POST_PADDINGS,
+                       pad);
+}
+
+Descriptor MakeConvFwdOp(const Descriptor& conv,
+                         const Descriptor& x,
+                         const Descriptor& w,
+                         const Descriptor& y,
+                         bool add_to) {
+  auto ret = Make(CUDNN_BACKEND_OPERATION_CONVOLUTION_FORWARD_DESCRIPTOR,
+                  CUDNN_ATTR_OPERATION_CONVOLUTION_FORWARD_CONV_DESC,
+                  conv,
+                  CUDNN_ATTR_OPERATION_CONVOLUTION_FORWARD_X,
+                  x,
+                  CUDNN_ATTR_OPERATION_CONVOLUTION_FORWARD_W,
+                  w,
+                  CUDNN_ATTR_OPERATION_CONVOLUTION_FORWARD_Y,
+                  y);
+  if (GetAttr<cudnnDataType_t>(x, CUDNN_ATTR_TENSOR_DATA_TYPE) == CUDNN_DATA_DOUBLE) {
+    SetAttrs(ret,
+             CUDNN_ATTR_OPERATION_CONVOLUTION_FORWARD_ALPHA,
+             1.0,
+             CUDNN_ATTR_OPERATION_CONVOLUTION_FORWARD_BETA,
+             add_to ? 1.0 : 0.0);
+  } else {
+    SetAttrs(ret,
+             CUDNN_ATTR_OPERATION_CONVOLUTION_FORWARD_ALPHA,
+             1.0f,
+             CUDNN_ATTR_OPERATION_CONVOLUTION_FORWARD_BETA,
+             add_to ? 1.0f : 0.0f);
+  }
+  CUDNN_CALL(cudnnBackendFinalize(ret.get()));
+  return ret;
+}
+
+Descriptor MakeConvDgradOp(const Descriptor& conv,
+                           const Descriptor& w,
+                           const Descriptor& dy,
+                           const Descriptor& dx,
+                           bool add_to) {
+  auto ret = Make(CUDNN_BACKEND_OPERATION_CONVOLUTION_BACKWARD_DATA_DESCRIPTOR,
+                  CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_DATA_CONV_DESC,
+                  conv,
+                  CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_DATA_W,
+                  w,
+                  CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_DATA_DY,
+                  dy,
+                  CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_DATA_DX,
+                  dx);
+  if (GetAttr<cudnnDataType_t>(w, CUDNN_ATTR_TENSOR_DATA_TYPE) == CUDNN_DATA_DOUBLE) {
+    SetAttrs(ret,
+             CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_DATA_ALPHA,
+             1.0,
+             CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_DATA_BETA,
+             add_to ? 1.0 : 0.0);
+  } else {
+    SetAttrs(ret,
+             CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_DATA_ALPHA,
+             1.0f,
+             CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_DATA_BETA,
+             add_to ? 1.0f : 0.0f);
+  }
+  CUDNN_CALL(cudnnBackendFinalize(ret.get()));
+  return ret;
+}
+
+Descriptor MakeConvWgradOp(const Descriptor& conv,
+                           const Descriptor& x,
+                           const Descriptor& dy,
+                           const Descriptor& dw,
+                           bool add_to) {
+  auto ret = Make(CUDNN_BACKEND_OPERATION_CONVOLUTION_BACKWARD_FILTER_DESCRIPTOR,
+                  CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_FILTER_CONV_DESC,
+                  conv,
+                  CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_FILTER_X,
+                  x,
+                  CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_FILTER_DY,
+                  dy,
+                  CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_FILTER_DW,
+                  dw);
+  if (GetAttr<cudnnDataType_t>(x, CUDNN_ATTR_TENSOR_DATA_TYPE) == CUDNN_DATA_DOUBLE) {
+    SetAttrs(ret,
+             CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_FILTER_ALPHA,
+             1.0,
+             CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_FILTER_BETA,
+             add_to ? 1.0 : 0.0);
+  } else {
+    SetAttrs(ret,
+             CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_FILTER_ALPHA,
+             1.0f,
+             CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_FILTER_BETA,
+             add_to ? 1.0f : 0.0f);
+  }
+  CUDNN_CALL(cudnnBackendFinalize(ret.get()));
+  return ret;
+}
+
+Descriptor MakeOpGraph(cudnnHandle_t handle, const std::vector<Descriptor>& ops) {
+  return MakeFinalized(CUDNN_BACKEND_OPERATIONGRAPH_DESCRIPTOR,
+                       CUDNN_ATTR_OPERATIONGRAPH_HANDLE,
+                       handle,
+                       CUDNN_ATTR_OPERATIONGRAPH_OPS,
+                       ops);
+}
+
+ConvParam::ConvParam(const ConvolutionParam& p, bool add_to)
+    : kernel(p.kernel),
+      stride(p.stride),
+      dilate(p.dilate),
+      pad(p.pad),
+      num_filter(p.num_filter),
+      num_group(p.num_group),
+      workspace(p.workspace),
+      cudnn_tune(p.cudnn_tune),
+      layout(p.layout),
+      add_to(add_to) {}
+
+ConvParam::ConvParam(const DeconvolutionParam& p, bool add_to)
+    : kernel(p.kernel),
+      stride(p.stride),
+      dilate(p.dilate),
+      pad(p.pad),
+      num_filter(p.num_filter),
+      num_group(p.num_group),
+      workspace(p.workspace),
+      cudnn_tune(p.cudnn_tune),
+      layout(p.layout),
+      add_to(add_to) {}
+
+void TuneWarnOnce() {
+  thread_local bool done = false;
+  if (!done) {
+    LOG(INFO) << "Auto-tuning cuDNN op, set MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable";
+    done = true;
+  }
+}
+
+std::vector<Descriptor> MakeFallbackPlans(
+    const std::vector<int64_t>& ixs,
+    cudnnHandle_t handle,
+    const Descriptor& op_graph,
+    size_t workspace_limit,
+    size_t* max_workspace,
+    const std::unordered_set<int64_t>& excl_engines,
+    const std::vector<cudnnBackendNumericalNote_t>& req_numeric,
+    const std::vector<cudnnBackendNumericalNote_t>& excl_numeric
+#if CUDNN_VERSION >= 8200
+    ,
+    const std::vector<cudnnBackendBehaviorNote_t>& req_behavior,
+    const std::vector<cudnnBackendBehaviorNote_t>& excl_behavior
+#endif  // CUDNN_VERSION >= 8200
+) {
+  std::vector<Descriptor> plans;
+  if (max_workspace)
+    *max_workspace = 0;
+  for (auto ix : ixs) {
+    if (excl_engines.count(ix))
+      continue;
+    auto engine = Make(CUDNN_BACKEND_ENGINE_DESCRIPTOR,
+                       CUDNN_ATTR_ENGINE_OPERATION_GRAPH,
+                       op_graph,
+                       CUDNN_ATTR_ENGINE_GLOBAL_INDEX,
+                       ix);
+    auto err    = cudnnBackendFinalize(engine.get());
+    if (err == CUDNN_STATUS_NOT_SUPPORTED || err == CUDNN_STATUS_ARCH_MISMATCH)
+      continue;
+    if (err != CUDNN_STATUS_SUCCESS) {
+      LOG(WARNING) << "Unexpected cuDNN status: " << err << ": " << cudnnGetErrorString(err);
+      continue;
+    }
+    auto cfg =
+        MakeFinalized(CUDNN_BACKEND_ENGINECFG_DESCRIPTOR, CUDNN_ATTR_ENGINECFG_ENGINE, engine);
+    auto plan = Make(CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR,
+                     CUDNN_ATTR_EXECUTION_PLAN_HANDLE,
+                     handle,
+                     CUDNN_ATTR_EXECUTION_PLAN_ENGINE_CONFIG,
+                     cfg);
+    err       = cudnnBackendFinalize(plan.get());
+    if (err == CUDNN_STATUS_NOT_SUPPORTED || err == CUDNN_STATUS_ARCH_MISMATCH)
+      continue;
+    if (err != CUDNN_STATUS_SUCCESS) {
+      LOG(WARNING) << "Unexpected cuDNN status: " << err << ": " << cudnnGetErrorString(err);
+      continue;
+    }
+    auto workspace = GetAttr<int64_t>(plan, CUDNN_ATTR_EXECUTION_PLAN_WORKSPACE_SIZE);
+    if (workspace > workspace_limit)
+      continue;
+    auto numerical = GetSomeAttrs<cudnnBackendNumericalNote_t>(
+        CUDNN_NUMERICAL_NOTE_TYPE_COUNT, engine, CUDNN_ATTR_ENGINE_NUMERICAL_NOTE);
+    if (!IsCompatible(numerical, req_numeric, excl_numeric))
+      continue;
+#if CUDNN_VERSION >= 8200
+    auto behavior = GetSomeAttrs<cudnnBackendBehaviorNote_t>(
+        CUDNN_BEHAVIOR_NOTE_TYPE_COUNT, engine, CUDNN_ATTR_ENGINE_BEHAVIOR_NOTE);
+    if (!IsCompatible(behavior, req_behavior, excl_behavior))
+      continue;
+#endif  // CUDNN_VERSION >= 8200
+    plans.push_back(std::move(plan));
+    if (max_workspace)
+      *max_workspace = std::max(*max_workspace, static_cast<size_t>(workspace));
+  }
+  return plans;
+}
+
+cudnnBackendHeurMode_t HeurMode() {
+#if CUDNN_VERSION >= 8100
+  auto minor       = cudnnGetVersion() / 100 % 10;
+  int default_mode = minor < 2 ? CUDNN_HEUR_MODE_INSTANT : CUDNN_HEUR_MODE_B;
+#else
+  int default_mode = CUDNN_HEUR_MODE_INSTANT;
+#endif  // CUDNN_VERSION >= 8100
+  return static_cast<cudnnBackendHeurMode_t>(dmlc::GetEnv("MXNET_CUDNN_HEUR_MODE", default_mode));
+}
+
+std::string ConvParamStr(const ConvParam& param) {
+  std::ostringstream ss;
+  ss << " layout: " << param.layout.value();
+  ss << " kernel: " << param.kernel;
+  ss << " stride: " << param.stride;
+  ss << " dilate: " << param.dilate;
+  ss << " pad: " << param.pad;
+  ss << " num_filter: " << param.num_filter;
+  ss << " num_group: " << param.num_group;
+  ss << " workspace: " << param.workspace;
+  return ss.str();
+}
+
+size_t GetWorkspace(const Descriptor& plan) {
+  return GetAttr<int64_t>(plan, CUDNN_ATTR_EXECUTION_PLAN_WORKSPACE_SIZE);
+}
+
+Storage::Handle FailsafeAlloc(size_t workspace_size) {
+  return Storage::Get()->Alloc(workspace_size, Context::GPU(), true);
+}
+
+Storage::Handle AllocWorkspace(std::vector<Descriptor>* plans, size_t* workspace_size) {
+  Storage::Handle workspace;
+  size_t alloc_size = *workspace_size;
+  while ((workspace = FailsafeAlloc(alloc_size)).dptr == nullptr && alloc_size > 0) {
+    // Remove any plan whose workspace_size equals the failed allocation size
+    auto hasMaxWorkspace = [alloc_size](auto const& plan) {
+      return GetWorkspace(plan) == alloc_size;
+    };
+    plans->erase(std::remove_if(plans->begin(), plans->end(), hasMaxWorkspace), plans->end());
+    // Calculate new maximum workspace_size for remaining plans
+    alloc_size = 0;
+    for (auto& plan : *plans)
+      alloc_size = std::max(alloc_size, GetWorkspace(plan));
+  }
+  *workspace_size = alloc_size;
+  return workspace;
+}
+
+std::unordered_set<int64_t> ExcludeEngines(const std::string& env_var) {
+  std::string engines = dmlc::GetEnv(env_var.c_str(), std::string());
+  std::replace(engines.begin(), engines.end(), ',', ' ');
+  std::istringstream ss(engines);
+  return std::unordered_set<int64_t>(std::istream_iterator<int64_t>(ss),
+                                     std::istream_iterator<int64_t>());
+}
+
+Descriptor SelectPlan(const OpContext& ctx,
+                      const ConvParam& param,
+                      Descriptor op,
+                      size_t n_fallbacks,
+                      const std::function<std::string()>& make_op_str,
+                      const std::vector<int64_t>& ids,
+                      const std::vector<void*>& tensor_ptrs,
+                      int64_t out_size,
+                      const std::string& excl_engines_var) {
+  auto s = ctx.get_stream<gpu>();
+  std::vector<Descriptor> ops;
+  ops.push_back(std::move(op));
+  auto op_graph = MakeOpGraph(s->dnn_handle_, ops);
+
+  int verbose = dmlc::GetEnv("MXNET_CUDNN_ALGO_VERBOSE_LEVEL", 0);
+  if (verbose > 0)
+    LOG(INFO) << "Selecting plan for " << make_op_str() << ":";
+
+  auto tune = param.cudnn_tune
+                  ? param.cudnn_tune.value()
+                  : dmlc::GetEnv("MXNET_CUDNN_AUTOTUNE_DEFAULT", static_cast<int>(conv::kLimited));
+  size_t workspace_size = 0;
+  size_t workspace_limit =
+      tune != conv::kFastest ? param.workspace << 20 : std::numeric_limits<size_t>::max();
+  auto excl_engines = ExcludeEngines(excl_engines_var);
+  auto plans        = GetPlans(HeurMode(),
+                        s->dnn_handle_,
+                        op_graph,
+                        workspace_limit,
+                        &workspace_size,
+                        excl_engines,
+                        RequireNumerics(),
+                        ExcludeNumerics(),
+#if CUDNN_VERSION >= 8200
+                        {},
+                        {},
+#endif  // CUDNN_VERSION >= 8200
+                        verbose > 1);
+  Storage::Handle out_space;
+  auto ptrs = tensor_ptrs;
+  if (tune != conv::kOff && param.add_to) {
+    // Cannot trash output tensor while auto-tuning.
+    out_space = FailsafeAlloc(out_size);
+    if (out_space.dptr)
+      ptrs.back() = out_space.dptr;
+  }
+  // TODO:
+  //     - should we be able to ask the tempspace for its current size, then
+  //       alloc the workspace from the tempspace if its current size > workspace_size?
+  auto workspace = AllocWorkspace(&plans, &workspace_size);
+
+  if (plans.empty()) {
+    std::vector<int64_t> ixs(n_fallbacks);
+    std::iota(ixs.begin(), ixs.end(), 0);
+#if CUDNN_VERSION >= 8200
+    plans = MakeFallbackPlans(ixs,
+                              s->dnn_handle_,
+                              op_graph,
+                              workspace_limit,
+                              &workspace_size,
+                              excl_engines,
+                              RequireNumerics(),
+                              ExcludeNumerics(),
+                              {},
+                              {});
+#else
+    plans = MakeFallbackPlans(ixs,
+                              s->dnn_handle_,
+                              op_graph,
+                              workspace_limit,
+                              &workspace_size,
+                              excl_engines,
+                              RequireNumerics(),
+                              ExcludeNumerics());
+#endif  // CUDNN_VERSION >= 8200
+    workspace = AllocWorkspace(&plans, &workspace_size);
+    CHECK(!plans.empty());
+    LOG(WARNING) << "Using fallback engine(s) for " << make_op_str();
+  }
+
+  if (tune == conv::kOff || plans.size() == 1 || (param.add_to && !out_space.dptr)) {
+    if (verbose > 0)
+      LOG(INFO) << " " << PlanStr(plans[0]);
+    Storage::Get()->Free(out_space);
+    Storage::Get()->Free(workspace);
+    return std::move(plans[0]);
+  }
+
+  TuneWarnOnce();
+  size_t n      = verbose > 1 ? plans.size() : 1;
+  auto var_pack = MakeFinalized(CUDNN_BACKEND_VARIANT_PACK_DESCRIPTOR,
+                                CUDNN_ATTR_VARIANT_PACK_UNIQUE_IDS,
+                                ids,
+                                CUDNN_ATTR_VARIANT_PACK_DATA_POINTERS,
+                                ptrs,
+                                CUDNN_ATTR_VARIANT_PACK_WORKSPACE,
+                                workspace.dptr);
+  auto top      = FindTopPlans(std::move(plans), n, s->dnn_handle_, var_pack, MakeAvgSampler(3));
+  Storage::Get()->Free(out_space);
+  Storage::Get()->Free(workspace);
+  auto str_time = [](float t) {
+    std::ostringstream ss;
+    ss << std::fixed << std::setprecision(6) << t;
+    return ss.str();
+  };
+  for (size_t i = 0; verbose > 0 && i < top.size(); ++i) {
+    auto prefix = i == 0 ? " * " : "   ";
+    LOG(INFO) << prefix << top[i].heur_i << ") " << str_time(top[i].time) << "ms "
+              << PlanStr(top[i].plan);
+  }
+  return std::move(top[0].plan);
+}
+
+size_t Size(const TBlob& t) {
+  return t.Size() * mshadow::mshadow_sizeof(t.type_flag_);
+}
+
+// TODO(vcherepanov): remove these, once fallbacks are received as a heuristics mode in 8.3
+enum MaxFallbacks { kMaxConvFallbacks = 58, kMaxDgradFallbacks = 63, kMaxWgradFallbacks = 62 };
+
+cudnn_cxx::Descriptor Conv::Make(const OpContext& ctx,
+                                 const Param& param,
+                                 const TBlob& x,
+                                 const TBlob& w,
+                                 const TBlob& y) {
+  auto conv     = MakeConvDesc(param, static_cast<mshadow::TypeFlag>(x.type_flag_));
+  auto li       = GetLayoutInfo(static_cast<mshadow::LayoutFlag>(param.layout.value()));
+  auto x_desc   = MakeTensorDesc(ID_X, x, li, true, false);
+  auto w_desc   = MakeTensorDesc(ID_W, w, li, true, false);
+  auto y_desc   = MakeTensorDesc(ID_Y, y, li, true, false);
+  auto conv_fwd = MakeConvFwdOp(conv, x_desc, w_desc, y_desc, param.add_to);
+
+  auto make_op_str = [&param, &x]() {
+    std::ostringstream ss;
+    ss << "fprop " << mshadow::dtype_string(x.type_flag_) << " " << ConvParamStr(param);
+    return ss.str();
+  };
+
+  std::vector<int64_t> ids{ID_X, ID_W, ID_Y};
+  std::vector<void*> ptrs{x.dptr_, w.dptr_, y.dptr_};
+
+  return SelectPlan(ctx,
+                    param,
+                    std::move(conv_fwd),
+                    kMaxConvFallbacks,
+                    make_op_str,
+                    ids,
+                    ptrs,
+                    Size(y),
+                    "MXNET_CUDNN_DISABLED_CONV_FWD_ENGINES");
+}
+
+void Conv::Exec(const cudnn_cxx::Descriptor& plan,
+                const OpContext& ctx,
+                const TBlob& x,
+                const TBlob& w,
+                const TBlob& y) {
+  auto s              = ctx.get_stream<gpu>();
+  auto workspace_size = GetAttr<int64_t>(plan, CUDNN_ATTR_EXECUTION_PLAN_WORKSPACE_SIZE);
+  auto workspace      = ctx.requested[0].get_space_internal(workspace_size, "Conv");
+
+  std::vector<int64_t> ids{ID_X, ID_W, ID_Y};
+  std::vector<void*> ptrs{x.dptr_, w.dptr_, y.dptr_};
+  auto var_pack = MakeFinalized(CUDNN_BACKEND_VARIANT_PACK_DESCRIPTOR,
+                                CUDNN_ATTR_VARIANT_PACK_UNIQUE_IDS,
+                                ids,
+                                CUDNN_ATTR_VARIANT_PACK_DATA_POINTERS,
+                                ptrs,
+                                CUDNN_ATTR_VARIANT_PACK_WORKSPACE,
+                                workspace);
+  CUDNN_CALL(cudnnBackendExecute(s->dnn_handle_, plan.get(), var_pack.get()));
+}
+
+cudnn_cxx::Descriptor ConvDgrad::Make(const OpContext& ctx,
+                                      const Param& param,
+                                      const TBlob& w,
+                                      const TBlob& dy,
+                                      const TBlob& dx) {
+  auto conv    = MakeConvDesc(param, static_cast<mshadow::TypeFlag>(w.type_flag_));
+  auto li      = GetLayoutInfo(static_cast<mshadow::LayoutFlag>(param.layout.value()));
+  auto w_desc  = MakeTensorDesc(ID_W, w, li, true, false);
+  auto dy_desc = MakeTensorDesc(ID_DY, dy, li, true, false);
+  auto dx_desc = MakeTensorDesc(ID_DX, dx, li, true, false);
+  auto dgrad   = MakeConvDgradOp(conv, w_desc, dy_desc, dx_desc, param.add_to);
+
+  auto make_op_str = [&param, &dx]() {
+    std::ostringstream ss;
+    ss << "dgrad " << mshadow::dtype_string(dx.type_flag_) << " " << ConvParamStr(param);
+    return ss.str();
+  };
+
+  std::vector<int64_t> ids{ID_W, ID_DY, ID_DX};
+  std::vector<void*> ptrs{w.dptr_, dy.dptr_, dx.dptr_};
+
+  return SelectPlan(ctx,
+                    param,
+                    std::move(dgrad),
+                    kMaxDgradFallbacks,
+                    make_op_str,
+                    ids,
+                    ptrs,
+                    Size(dx),
+                    "MXNET_CUDNN_DISABLED_CONV_DGRAD_ENGINES");

Review comment:
       New env var, needs doc.
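
       A sketch of a possible env_var.md entry (ExcludeEngines() parses the variable as a
       comma-separated list of engine indices; the wording below is a suggestion):

       * MXNET_CUDNN_DISABLED_CONV_DGRAD_ENGINES
         - Values: comma-separated list of cuDNN engine global indices, e.g. "0,3" (default: empty)
         - Engines listed here are skipped when selecting an execution plan for convolution dgrad.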

##########
File path: src/operator/cudnn_ops.cc
##########
@@ -0,0 +1,765 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+/*!
+ * Copyright (c) 2021 by Contributors
+ * \file  cudnn_ops.cc
+ * \brief cuDNN v8 ops
+ */
+
+#include "cudnn_ops.h"
+
+#include <mxnet/base.h>
+#if MXNET_USE_CUDNN == 1
+
+#include <dmlc/parameter.h>
+
+#include <algorithm>
+#include <cstdlib>
+#include <iomanip>
+#include <iterator>
+#include <limits>
+#include <numeric>
+#include <sstream>
+#include <string>
+#include <utility>
+
+namespace mxnet {
+namespace op {
+
+using cudnn_cxx::Descriptor;
+using cudnn_cxx::GetAttr;
+using cudnn_cxx::GetSomeAttrs;
+using cudnn_cxx::IsCompatible;
+using cudnn_cxx::MakeAvgSampler;
+using cudnn_cxx::MakeFinalized;
+using cudnn_cxx::PackedStrides;
+using cudnn_cxx::PlanStr;
+
+namespace cudnn {
+
+cudnnDataType_t CudnnType(mshadow::TypeFlag dtype) {
+  static std::unordered_map<mshadow::TypeFlag, cudnnDataType_t> type_map {
+    {mshadow::kFloat32, CUDNN_DATA_FLOAT}, {mshadow::kFloat64, CUDNN_DATA_DOUBLE},
+        {mshadow::kFloat16, CUDNN_DATA_HALF}, {mshadow::kUint8, CUDNN_DATA_UINT8},
+        {mshadow::kInt8, CUDNN_DATA_INT8}, {mshadow::kInt32, CUDNN_DATA_INT32},
+#if CUDNN_VERSION >= 8100
+        {mshadow::kInt64, CUDNN_DATA_INT64},
+#endif  // CUDNN_VERSION >= 8100
+  };
+  auto it = type_map.find(dtype);
+  CHECK(it != type_map.end()) << "Unsupported type: " << dtype;
+  return it->second;
+}
+
+std::vector<size_t> LayoutInfo::Order() const {
+  std::vector<size_t> ret(n_space_dims + 2);
+  std::iota(ret.begin(), ret.end(), 0);
+  if (channel_last)
+    std::rotate(ret.begin() + 1, ret.begin() + 2, ret.end());
+  return ret;
+}
+
+size_t LayoutInfo::ChannelIdx() const {
+  return channel_last ? 1 + n_space_dims : 1;
+}
+
+std::vector<int64_t> LayoutInfo::Strides(const std::vector<int64_t>& dims) const {
+  return PackedStrides(Order(), dims);
+}
+
+LayoutInfo GetLayoutInfo(mshadow::LayoutFlag layout) {
+  static std::unordered_map<mshadow::LayoutFlag, LayoutInfo> layout_map{
+      {mshadow::kNCW, {1, false}},
+      {mshadow::kNWC, {1, true}},
+      {mshadow::kNCHW, {2, false}},
+      {mshadow::kNHWC, {2, true}},
+      {mshadow::kNCDHW, {3, false}},
+      {mshadow::kNDHWC, {3, true}},
+  };
+  auto it = layout_map.find(layout);
+  CHECK(it != layout_map.end()) << "Unsupported layout: " << layout;
+  return it->second;
+}
+
+TShape ExpandChannelDims(mshadow::LayoutFlag layout, int c) {
+  auto li = GetLayoutInfo(layout);
+  std::vector<int> dims(li.n_space_dims + 2, 1);
+  dims[li.ChannelIdx()] = c;
+  return TShape(dims.begin(), dims.end());
+}
+
+std::vector<size_t> ReverseOrder(const std::vector<size_t>& o) {
+  std::vector<size_t> ret(o.size());
+  for (size_t i = 0; i < ret.size(); ++i)
+    ret[o[i]] = i;
+  return ret;
+}
+
+std::vector<cudnnBackendNumericalNote_t> RequireNumerics() {
+  std::vector<cudnnBackendNumericalNote_t> ret;
+  return ret;
+}
+
+std::vector<cudnnBackendNumericalNote_t> ExcludeNumerics() {
+  std::vector<cudnnBackendNumericalNote_t> ret;
+  if (!dmlc::GetEnv("MXNET_CUDA_ALLOW_TENSOR_CORE", true))
+    ret.push_back(CUDNN_NUMERICAL_NOTE_TENSOR_CORE);
+  if (!dmlc::GetEnv("MXNET_CUDA_TENSOR_OP_MATH_ALLOW_CONVERSION", false))
+    ret.push_back(CUDNN_NUMERICAL_NOTE_DOWN_CONVERT_INPUTS);
+  if (!dmlc::GetEnv("MXNET_CUDNN_ALLOW_REDUCED_PRECISION_REDUCTION", true))
+    ret.push_back(CUDNN_NUMERICAL_NOTE_REDUCED_PRECISION_REDUCTION);
+  if (!dmlc::GetEnv("MXNET_CUDNN_ALLOW_FFT", true))
+    ret.push_back(CUDNN_NUMERICAL_NOTE_FFT);
+  if (dmlc::GetEnv("MXNET_ENFORCE_DETERMINISM", false))
+    ret.push_back(CUDNN_NUMERICAL_NOTE_NONDETERMINISTIC);
+  if (!dmlc::GetEnv("MXNET_CUDNN_ALLOW_WINOGRAD", true))
+    ret.push_back(CUDNN_NUMERICAL_NOTE_WINOGRAD);
+  return ret;
+}
+
+Descriptor MakeTensorDesc(int64_t uid,
+                          cudnnDataType_t dtype,
+                          const std::vector<int64_t>& dims,
+                          const std::vector<int64_t>& strides,
+                          bool is_virtual) {
+  int64_t alignment = 16;  // TODO(vcherepanov): ?
+  return MakeFinalized(CUDNN_BACKEND_TENSOR_DESCRIPTOR,
+                       CUDNN_ATTR_TENSOR_UNIQUE_ID,
+                       uid,
+                       CUDNN_ATTR_TENSOR_DATA_TYPE,
+                       dtype,
+                       CUDNN_ATTR_TENSOR_BYTE_ALIGNMENT,
+                       alignment,
+                       CUDNN_ATTR_TENSOR_DIMENSIONS,
+                       dims,
+                       CUDNN_ATTR_TENSOR_STRIDES,
+                       strides,
+                       CUDNN_ATTR_TENSOR_IS_VIRTUAL,
+                       is_virtual);
+}
+
+Descriptor MakeTensorDesc(int64_t uid,
+                          const TBlob& blob,
+                          const LayoutInfo& li,
+                          bool expand_1d,
+                          bool is_virtual) {
+  std::vector<int64_t> dims(blob.shape_.ndim());
+  CHECK_EQ(dims.size(), li.n_space_dims + 2);
+  auto rev_order = ReverseOrder(li.Order());
+  for (size_t i = 0; i < dims.size(); ++i)
+    dims[i] = blob.shape_[rev_order[i]];
+  auto strides = li.Strides(dims);
+  if (li.n_space_dims == 1 && expand_1d) {
+    dims.insert(dims.begin() + 2, 1);
+    std::vector<size_t> order(dims.size());
+    std::iota(order.begin(), order.end(), 0);
+    if (li.channel_last)
+      std::rotate(order.begin() + 1, order.begin() + 2, order.end());
+    strides = PackedStrides(order, dims);
+  }
+  return MakeTensorDesc(
+      uid, CudnnType(static_cast<mshadow::TypeFlag>(blob.type_flag_)), dims, strides, is_virtual);
+}
+
+Descriptor MakeCTensorDescExpandDims(int64_t uid,
+                                     const TBlob& b,
+                                     const LayoutInfo& li,
+                                     bool is_virtual) {
+  std::vector<int64_t> dims(li.n_space_dims + 2, 1);
+  dims[1]    = b.shape_[0];
+  auto dtype = CudnnType(static_cast<mshadow::TypeFlag>(b.type_flag_));
+  return MakeTensorDesc(uid, dtype, dims, li.Strides(dims), is_virtual);
+}
+
+Descriptor MakeConvDesc(const ConvParam& param, mshadow::TypeFlag dtype) {
+  int64_t sdims = param.kernel.ndim();
+  std::vector<int64_t> stride(param.stride.begin(), param.stride.end());
+  std::vector<int64_t> dilate(param.dilate.begin(), param.dilate.end());
+  std::vector<int64_t> pad(param.pad.begin(), param.pad.end());
+
+  auto comp_type = CudnnType(dtype);
+  if (comp_type == CUDNN_DATA_HALF)
+    comp_type = CUDNN_DATA_FLOAT;
+
+  if (sdims == 1) {
+    // TODO(vcherepanov): remove this once cuDNN properly supports 1D convolutions.
+    // For now, making spatial dims 2D: 1 x W.
+    ++sdims;
+    stride.insert(stride.begin(), 1);
+    dilate.insert(dilate.begin(), 1);
+    pad.insert(pad.begin(), 0);
+  }
+  return MakeFinalized(CUDNN_BACKEND_CONVOLUTION_DESCRIPTOR,
+                       CUDNN_ATTR_CONVOLUTION_SPATIAL_DIMS,
+                       sdims,
+                       CUDNN_ATTR_CONVOLUTION_COMP_TYPE,
+                       comp_type,
+                       CUDNN_ATTR_CONVOLUTION_CONV_MODE,
+                       CUDNN_CROSS_CORRELATION,
+                       CUDNN_ATTR_CONVOLUTION_FILTER_STRIDES,
+                       stride,
+                       CUDNN_ATTR_CONVOLUTION_DILATIONS,
+                       dilate,
+                       CUDNN_ATTR_CONVOLUTION_PRE_PADDINGS,
+                       pad,
+                       CUDNN_ATTR_CONVOLUTION_POST_PADDINGS,
+                       pad);
+}
+
+Descriptor MakeConvFwdOp(const Descriptor& conv,
+                         const Descriptor& x,
+                         const Descriptor& w,
+                         const Descriptor& y,
+                         bool add_to) {
+  auto ret = Make(CUDNN_BACKEND_OPERATION_CONVOLUTION_FORWARD_DESCRIPTOR,
+                  CUDNN_ATTR_OPERATION_CONVOLUTION_FORWARD_CONV_DESC,
+                  conv,
+                  CUDNN_ATTR_OPERATION_CONVOLUTION_FORWARD_X,
+                  x,
+                  CUDNN_ATTR_OPERATION_CONVOLUTION_FORWARD_W,
+                  w,
+                  CUDNN_ATTR_OPERATION_CONVOLUTION_FORWARD_Y,
+                  y);
+  if (GetAttr<cudnnDataType_t>(x, CUDNN_ATTR_TENSOR_DATA_TYPE) == CUDNN_DATA_DOUBLE) {
+    SetAttrs(ret,
+             CUDNN_ATTR_OPERATION_CONVOLUTION_FORWARD_ALPHA,
+             1.0,
+             CUDNN_ATTR_OPERATION_CONVOLUTION_FORWARD_BETA,
+             add_to ? 1.0 : 0.0);
+  } else {
+    SetAttrs(ret,
+             CUDNN_ATTR_OPERATION_CONVOLUTION_FORWARD_ALPHA,
+             1.0f,
+             CUDNN_ATTR_OPERATION_CONVOLUTION_FORWARD_BETA,
+             add_to ? 1.0f : 0.0f);
+  }
+  CUDNN_CALL(cudnnBackendFinalize(ret.get()));
+  return ret;
+}
+
+Descriptor MakeConvDgradOp(const Descriptor& conv,
+                           const Descriptor& w,
+                           const Descriptor& dy,
+                           const Descriptor& dx,
+                           bool add_to) {
+  auto ret = Make(CUDNN_BACKEND_OPERATION_CONVOLUTION_BACKWARD_DATA_DESCRIPTOR,
+                  CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_DATA_CONV_DESC,
+                  conv,
+                  CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_DATA_W,
+                  w,
+                  CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_DATA_DY,
+                  dy,
+                  CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_DATA_DX,
+                  dx);
+  if (GetAttr<cudnnDataType_t>(w, CUDNN_ATTR_TENSOR_DATA_TYPE) == CUDNN_DATA_DOUBLE) {
+    SetAttrs(ret,
+             CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_DATA_ALPHA,
+             1.0,
+             CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_DATA_BETA,
+             add_to ? 1.0 : 0.0);
+  } else {
+    SetAttrs(ret,
+             CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_DATA_ALPHA,
+             1.0f,
+             CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_DATA_BETA,
+             add_to ? 1.0f : 0.0f);
+  }
+  CUDNN_CALL(cudnnBackendFinalize(ret.get()));
+  return ret;
+}
+
+Descriptor MakeConvWgradOp(const Descriptor& conv,
+                           const Descriptor& x,
+                           const Descriptor& dy,
+                           const Descriptor& dw,
+                           bool add_to) {
+  auto ret = Make(CUDNN_BACKEND_OPERATION_CONVOLUTION_BACKWARD_FILTER_DESCRIPTOR,
+                  CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_FILTER_CONV_DESC,
+                  conv,
+                  CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_FILTER_X,
+                  x,
+                  CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_FILTER_DY,
+                  dy,
+                  CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_FILTER_DW,
+                  dw);
+  if (GetAttr<cudnnDataType_t>(x, CUDNN_ATTR_TENSOR_DATA_TYPE) == CUDNN_DATA_DOUBLE) {
+    SetAttrs(ret,
+             CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_FILTER_ALPHA,
+             1.0,
+             CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_FILTER_BETA,
+             add_to ? 1.0 : 0.0);
+  } else {
+    SetAttrs(ret,
+             CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_FILTER_ALPHA,
+             1.0f,
+             CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_FILTER_BETA,
+             add_to ? 1.0f : 0.0f);
+  }
+  CUDNN_CALL(cudnnBackendFinalize(ret.get()));
+  return ret;
+}
+
+Descriptor MakeOpGraph(cudnnHandle_t handle, const std::vector<Descriptor>& ops) {
+  return MakeFinalized(CUDNN_BACKEND_OPERATIONGRAPH_DESCRIPTOR,
+                       CUDNN_ATTR_OPERATIONGRAPH_HANDLE,
+                       handle,
+                       CUDNN_ATTR_OPERATIONGRAPH_OPS,
+                       ops);
+}
+
+ConvParam::ConvParam(const ConvolutionParam& p, bool add_to)
+    : kernel(p.kernel),
+      stride(p.stride),
+      dilate(p.dilate),
+      pad(p.pad),
+      num_filter(p.num_filter),
+      num_group(p.num_group),
+      workspace(p.workspace),
+      cudnn_tune(p.cudnn_tune),
+      layout(p.layout),
+      add_to(add_to) {}
+
+ConvParam::ConvParam(const DeconvolutionParam& p, bool add_to)
+    : kernel(p.kernel),
+      stride(p.stride),
+      dilate(p.dilate),
+      pad(p.pad),
+      num_filter(p.num_filter),
+      num_group(p.num_group),
+      workspace(p.workspace),
+      cudnn_tune(p.cudnn_tune),
+      layout(p.layout),
+      add_to(add_to) {}
+
+void TuneWarnOnce() {
+  thread_local bool done = false;
+  if (!done) {
+    LOG(INFO) << "Auto-tuning cuDNN op, set MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable";
+    done = true;
+  }
+}
+
+std::vector<Descriptor> MakeFallbackPlans(
+    const std::vector<int64_t>& ixs,
+    cudnnHandle_t handle,
+    const Descriptor& op_graph,
+    size_t workspace_limit,
+    size_t* max_workspace,
+    const std::unordered_set<int64_t>& excl_engines,
+    const std::vector<cudnnBackendNumericalNote_t>& req_numeric,
+    const std::vector<cudnnBackendNumericalNote_t>& excl_numeric
+#if CUDNN_VERSION >= 8200
+    ,
+    const std::vector<cudnnBackendBehaviorNote_t>& req_behavior,
+    const std::vector<cudnnBackendBehaviorNote_t>& excl_behavior
+#endif  // CUDNN_VERSION >= 8200
+) {
+  std::vector<Descriptor> plans;
+  if (max_workspace)
+    *max_workspace = 0;
+  for (auto ix : ixs) {
+    if (excl_engines.count(ix))
+      continue;
+    auto engine = Make(CUDNN_BACKEND_ENGINE_DESCRIPTOR,
+                       CUDNN_ATTR_ENGINE_OPERATION_GRAPH,
+                       op_graph,
+                       CUDNN_ATTR_ENGINE_GLOBAL_INDEX,
+                       ix);
+    auto err    = cudnnBackendFinalize(engine.get());
+    if (err == CUDNN_STATUS_NOT_SUPPORTED || err == CUDNN_STATUS_ARCH_MISMATCH)
+      continue;
+    if (err != CUDNN_STATUS_SUCCESS) {
+      LOG(WARNING) << "Unexpected cuDNN status: " << err << ": " << cudnnGetErrorString(err);
+      continue;
+    }
+    auto cfg =
+        MakeFinalized(CUDNN_BACKEND_ENGINECFG_DESCRIPTOR, CUDNN_ATTR_ENGINECFG_ENGINE, engine);
+    auto plan = Make(CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR,
+                     CUDNN_ATTR_EXECUTION_PLAN_HANDLE,
+                     handle,
+                     CUDNN_ATTR_EXECUTION_PLAN_ENGINE_CONFIG,
+                     cfg);
+    err       = cudnnBackendFinalize(plan.get());
+    if (err == CUDNN_STATUS_NOT_SUPPORTED || err == CUDNN_STATUS_ARCH_MISMATCH)
+      continue;
+    if (err != CUDNN_STATUS_SUCCESS) {
+      LOG(WARNING) << "Unexpected cuDNN status: " << err << ": " << cudnnGetErrorString(err);
+      continue;
+    }
+    auto workspace = GetAttr<int64_t>(plan, CUDNN_ATTR_EXECUTION_PLAN_WORKSPACE_SIZE);
+    if (workspace > workspace_limit)
+      continue;
+    auto numerical = GetSomeAttrs<cudnnBackendNumericalNote_t>(
+        CUDNN_NUMERICAL_NOTE_TYPE_COUNT, engine, CUDNN_ATTR_ENGINE_NUMERICAL_NOTE);
+    if (!IsCompatible(numerical, req_numeric, excl_numeric))
+      continue;
+#if CUDNN_VERSION >= 8200
+    auto behavior = GetSomeAttrs<cudnnBackendBehaviorNote_t>(
+        CUDNN_BEHAVIOR_NOTE_TYPE_COUNT, engine, CUDNN_ATTR_ENGINE_BEHAVIOR_NOTE);
+    if (!IsCompatible(behavior, req_behavior, excl_behavior))
+      continue;
+#endif  // CUDNN_VERSION >= 8200
+    plans.push_back(std::move(plan));
+    if (max_workspace)
+      *max_workspace = std::max(*max_workspace, static_cast<size_t>(workspace));
+  }
+  return plans;
+}
+
+cudnnBackendHeurMode_t HeurMode() {
+#if CUDNN_VERSION >= 8100
+  auto minor       = cudnnGetVersion() / 100 % 10;
+  int default_mode = minor < 2 ? CUDNN_HEUR_MODE_INSTANT : CUDNN_HEUR_MODE_B;
+#else
+  int default_mode = CUDNN_HEUR_MODE_INSTANT;
+#endif  // CUDNN_VERSION >= 8100
+  return static_cast<cudnnBackendHeurMode_t>(dmlc::GetEnv("MXNET_CUDNN_HEUR_MODE", default_mode));
+}
+
+std::string ConvParamStr(const ConvParam& param) {
+  std::ostringstream ss;
+  ss << " layout: " << param.layout.value();
+  ss << " kernel: " << param.kernel;
+  ss << " stride: " << param.stride;
+  ss << " dilate: " << param.dilate;
+  ss << " pad: " << param.pad;
+  ss << " num_filter: " << param.num_filter;
+  ss << " num_group: " << param.num_group;
+  ss << " workspace: " << param.workspace;
+  return ss.str();
+}
+
+size_t GetWorkspace(const Descriptor& plan) {
+  return GetAttr<int64_t>(plan, CUDNN_ATTR_EXECUTION_PLAN_WORKSPACE_SIZE);
+}
+
+Storage::Handle FailsafeAlloc(size_t workspace_size) {
+  return Storage::Get()->Alloc(workspace_size, Context::GPU(), true);
+}
+
+Storage::Handle AllocWorkspace(std::vector<Descriptor>* plans, size_t* workspace_size) {
+  Storage::Handle workspace;
+  size_t alloc_size = *workspace_size;
+  while ((workspace = FailsafeAlloc(alloc_size)).dptr == nullptr && alloc_size > 0) {
+    // Remove any plan whose workspace_size equals the failed allocation size
+    auto hasMaxWorkspace = [alloc_size](auto const& plan) {
+      return GetWorkspace(plan) == alloc_size;
+    };
+    plans->erase(std::remove_if(plans->begin(), plans->end(), hasMaxWorkspace), plans->end());
+    // Calculate new maximum workspace_size for remaining plans
+    alloc_size = 0;
+    for (auto& plan : *plans)
+      alloc_size = std::max(alloc_size, GetWorkspace(plan));
+  }
+  *workspace_size = alloc_size;
+  return workspace;
+}
+
+std::unordered_set<int64_t> ExcludeEngines(const std::string& env_var) {
+  std::string engines = dmlc::GetEnv(env_var.c_str(), std::string());
+  std::replace(engines.begin(), engines.end(), ',', ' ');
+  std::istringstream ss(engines);
+  return std::unordered_set<int64_t>(std::istream_iterator<int64_t>(ss),
+                                     std::istream_iterator<int64_t>());
+}
+
+Descriptor SelectPlan(const OpContext& ctx,
+                      const ConvParam& param,
+                      Descriptor op,
+                      size_t n_fallbacks,
+                      const std::function<std::string()>& make_op_str,
+                      const std::vector<int64_t>& ids,
+                      const std::vector<void*>& tensor_ptrs,
+                      int64_t out_size,
+                      const std::string& excl_engines_var) {
+  auto s = ctx.get_stream<gpu>();
+  std::vector<Descriptor> ops;
+  ops.push_back(std::move(op));
+  auto op_graph = MakeOpGraph(s->dnn_handle_, ops);
+
+  int verbose = dmlc::GetEnv("MXNET_CUDNN_ALGO_VERBOSE_LEVEL", 0);
+  if (verbose > 0)
+    LOG(INFO) << "Selecting plan for " << make_op_str() << ":";
+
+  auto tune = param.cudnn_tune
+                  ? param.cudnn_tune.value()
+                  : dmlc::GetEnv("MXNET_CUDNN_AUTOTUNE_DEFAULT", static_cast<int>(conv::kLimited));
+  size_t workspace_size = 0;
+  size_t workspace_limit =
+      tune != conv::kFastest ? param.workspace << 20 : std::numeric_limits<size_t>::max();
+  auto excl_engines = ExcludeEngines(excl_engines_var);
+  auto plans        = GetPlans(HeurMode(),
+                        s->dnn_handle_,
+                        op_graph,
+                        workspace_limit,
+                        &workspace_size,
+                        excl_engines,
+                        RequireNumerics(),
+                        ExcludeNumerics(),
+#if CUDNN_VERSION >= 8200
+                        {},
+                        {},
+#endif  // CUDNN_VERSION >= 8200
+                        verbose > 1);
+  Storage::Handle out_space;
+  auto ptrs = tensor_ptrs;
+  if (tune != conv::kOff && param.add_to) {
+    // Cannot trash output tensor while auto-tuning.
+    out_space = FailsafeAlloc(out_size);
+    if (out_space.dptr)
+      ptrs.back() = out_space.dptr;
+  }
+  // TODO:
+  //     - should we be able to ask the tempspace for its current size, then
+  //       alloc the workspace from the tempspace if its current size > workspace_size?
+  auto workspace = AllocWorkspace(&plans, &workspace_size);
+
+  if (plans.empty()) {
+    std::vector<int64_t> ixs(n_fallbacks);
+    std::iota(ixs.begin(), ixs.end(), 0);
+#if CUDNN_VERSION >= 8200
+    plans = MakeFallbackPlans(ixs,
+                              s->dnn_handle_,
+                              op_graph,
+                              workspace_limit,
+                              &workspace_size,
+                              excl_engines,
+                              RequireNumerics(),
+                              ExcludeNumerics(),
+                              {},
+                              {});
+#else
+    plans = MakeFallbackPlans(ixs,
+                              s->dnn_handle_,
+                              op_graph,
+                              workspace_limit,
+                              &workspace_size,
+                              excl_engines,
+                              RequireNumerics(),
+                              ExcludeNumerics());
+#endif  // CUDNN_VERSION >= 8200
+    workspace = AllocWorkspace(&plans, &workspace_size);
+    CHECK(!plans.empty());
+    LOG(WARNING) << "Using fallback engine(s) for " << make_op_str();
+  }
+
+  if (tune == conv::kOff || plans.size() == 1 || (param.add_to && !out_space.dptr)) {
+    if (verbose > 0)
+      LOG(INFO) << " " << PlanStr(plans[0]);
+    Storage::Get()->Free(out_space);
+    Storage::Get()->Free(workspace);
+    return std::move(plans[0]);
+  }
+
+  TuneWarnOnce();
+  size_t n      = verbose > 1 ? plans.size() : 1;
+  auto var_pack = MakeFinalized(CUDNN_BACKEND_VARIANT_PACK_DESCRIPTOR,
+                                CUDNN_ATTR_VARIANT_PACK_UNIQUE_IDS,
+                                ids,
+                                CUDNN_ATTR_VARIANT_PACK_DATA_POINTERS,
+                                ptrs,
+                                CUDNN_ATTR_VARIANT_PACK_WORKSPACE,
+                                workspace.dptr);
+  auto top      = FindTopPlans(std::move(plans), n, s->dnn_handle_, var_pack, MakeAvgSampler(3));
+  Storage::Get()->Free(out_space);
+  Storage::Get()->Free(workspace);
+  auto str_time = [](float t) {
+    std::ostringstream ss;
+    ss << std::fixed << std::setprecision(6) << t;
+    return ss.str();
+  };
+  for (size_t i = 0; verbose > 0 && i < top.size(); ++i) {
+    auto prefix = i == 0 ? " * " : "   ";
+    LOG(INFO) << prefix << top[i].heur_i << ") " << str_time(top[i].time) << "ms "
+              << PlanStr(top[i].plan);
+  }
+  return std::move(top[0].plan);
+}
+
+size_t Size(const TBlob& t) {
+  return t.Size() * mshadow::mshadow_sizeof(t.type_flag_);
+}
+
+// TODO(vcherepanov): remove these, once fallbacks are received as a heuristics mode in 8.3
+enum MaxFallbacks { kMaxConvFallbacks = 58, kMaxDgradFallbacks = 63, kMaxWgradFallbacks = 62 };
+
+cudnn_cxx::Descriptor Conv::Make(const OpContext& ctx,
+                                 const Param& param,
+                                 const TBlob& x,
+                                 const TBlob& w,
+                                 const TBlob& y) {
+  auto conv     = MakeConvDesc(param, static_cast<mshadow::TypeFlag>(x.type_flag_));
+  auto li       = GetLayoutInfo(static_cast<mshadow::LayoutFlag>(param.layout.value()));
+  auto x_desc   = MakeTensorDesc(ID_X, x, li, true, false);
+  auto w_desc   = MakeTensorDesc(ID_W, w, li, true, false);
+  auto y_desc   = MakeTensorDesc(ID_Y, y, li, true, false);
+  auto conv_fwd = MakeConvFwdOp(conv, x_desc, w_desc, y_desc, param.add_to);
+
+  auto make_op_str = [&param, &x]() {
+    std::ostringstream ss;
+    ss << "fprop " << mshadow::dtype_string(x.type_flag_) << " " << ConvParamStr(param);
+    return ss.str();
+  };
+
+  std::vector<int64_t> ids{ID_X, ID_W, ID_Y};
+  std::vector<void*> ptrs{x.dptr_, w.dptr_, y.dptr_};
+
+  return SelectPlan(ctx,
+                    param,
+                    std::move(conv_fwd),
+                    kMaxConvFallbacks,
+                    make_op_str,
+                    ids,
+                    ptrs,
+                    Size(y),
+                    "MXNET_CUDNN_DISABLED_CONV_FWD_ENGINES");
+}
+
+void Conv::Exec(const cudnn_cxx::Descriptor& plan,
+                const OpContext& ctx,
+                const TBlob& x,
+                const TBlob& w,
+                const TBlob& y) {
+  auto s              = ctx.get_stream<gpu>();
+  auto workspace_size = GetAttr<int64_t>(plan, CUDNN_ATTR_EXECUTION_PLAN_WORKSPACE_SIZE);
+  auto workspace      = ctx.requested[0].get_space_internal(workspace_size, "Conv");
+
+  std::vector<int64_t> ids{ID_X, ID_W, ID_Y};
+  std::vector<void*> ptrs{x.dptr_, w.dptr_, y.dptr_};
+  auto var_pack = MakeFinalized(CUDNN_BACKEND_VARIANT_PACK_DESCRIPTOR,
+                                CUDNN_ATTR_VARIANT_PACK_UNIQUE_IDS,
+                                ids,
+                                CUDNN_ATTR_VARIANT_PACK_DATA_POINTERS,
+                                ptrs,
+                                CUDNN_ATTR_VARIANT_PACK_WORKSPACE,
+                                workspace);
+  CUDNN_CALL(cudnnBackendExecute(s->dnn_handle_, plan.get(), var_pack.get()));
+}
+
+cudnn_cxx::Descriptor ConvDgrad::Make(const OpContext& ctx,
+                                      const Param& param,
+                                      const TBlob& w,
+                                      const TBlob& dy,
+                                      const TBlob& dx) {
+  auto conv    = MakeConvDesc(param, static_cast<mshadow::TypeFlag>(w.type_flag_));
+  auto li      = GetLayoutInfo(static_cast<mshadow::LayoutFlag>(param.layout.value()));
+  auto w_desc  = MakeTensorDesc(ID_W, w, li, true, false);
+  auto dy_desc = MakeTensorDesc(ID_DY, dy, li, true, false);
+  auto dx_desc = MakeTensorDesc(ID_DX, dx, li, true, false);
+  auto dgrad   = MakeConvDgradOp(conv, w_desc, dy_desc, dx_desc, param.add_to);
+
+  auto make_op_str = [&param, &dx]() {
+    std::ostringstream ss;
+    ss << "dgrad " << mshadow::dtype_string(dx.type_flag_) << " " << ConvParamStr(param);
+    return ss.str();
+  };
+
+  std::vector<int64_t> ids{ID_W, ID_DY, ID_DX};
+  std::vector<void*> ptrs{w.dptr_, dy.dptr_, dx.dptr_};
+
+  return SelectPlan(ctx,
+                    param,
+                    std::move(dgrad),
+                    kMaxDgradFallbacks,
+                    make_op_str,
+                    ids,
+                    ptrs,
+                    Size(dx),
+                    "MXNET_CUDNN_DISABLED_CONV_DGRAD_ENGINES");
+}
+
+void ConvDgrad::Exec(const cudnn_cxx::Descriptor& plan,
+                     const OpContext& ctx,
+                     const TBlob& w,
+                     const TBlob& dy,
+                     const TBlob& dx) {
+  auto s              = ctx.get_stream<gpu>();
+  auto workspace_size = GetAttr<int64_t>(plan, CUDNN_ATTR_EXECUTION_PLAN_WORKSPACE_SIZE);
+  auto workspace      = ctx.requested[0].get_space_internal(workspace_size, "ConvDgrad");
+
+  std::vector<int64_t> ids{ID_W, ID_DY, ID_DX};
+  std::vector<void*> ptrs{w.dptr_, dy.dptr_, dx.dptr_};
+  auto var_pack = MakeFinalized(CUDNN_BACKEND_VARIANT_PACK_DESCRIPTOR,
+                                CUDNN_ATTR_VARIANT_PACK_UNIQUE_IDS,
+                                ids,
+                                CUDNN_ATTR_VARIANT_PACK_DATA_POINTERS,
+                                ptrs,
+                                CUDNN_ATTR_VARIANT_PACK_WORKSPACE,
+                                workspace);
+  CUDNN_CALL(cudnnBackendExecute(s->dnn_handle_, plan.get(), var_pack.get()));
+}
+
+cudnn_cxx::Descriptor ConvWgrad::Make(const OpContext& ctx,
+                                      const Param& param,
+                                      const TBlob& x,
+                                      const TBlob& dy,
+                                      const TBlob& dw) {
+  auto conv    = MakeConvDesc(param, static_cast<mshadow::TypeFlag>(x.type_flag_));
+  auto li      = GetLayoutInfo(static_cast<mshadow::LayoutFlag>(param.layout.value()));
+  auto x_desc  = MakeTensorDesc(ID_X, x, li, true, false);
+  auto dy_desc = MakeTensorDesc(ID_DY, dy, li, true, false);
+  auto dw_desc = MakeTensorDesc(ID_DW, dw, li, true, false);
+  auto wgrad   = MakeConvWgradOp(conv, x_desc, dy_desc, dw_desc, param.add_to);
+
+  auto make_op_str = [&param, &x]() {
+    std::ostringstream ss;
+    ss << "wgrad " << mshadow::dtype_string(x.type_flag_) << " " << ConvParamStr(param);
+    return ss.str();
+  };
+
+  std::vector<int64_t> ids{ID_X, ID_DY, ID_DW};
+  std::vector<void*> ptrs{x.dptr_, dy.dptr_, dw.dptr_};
+
+  return SelectPlan(ctx,
+                    param,
+                    std::move(wgrad),
+                    kMaxWgradFallbacks,
+                    make_op_str,
+                    ids,
+                    ptrs,
+                    Size(dw),
+                    "MXNET_CUDNN_DISABLED_CONV_WGRAD_ENGINES");

Review comment:
       New env var; please add a description to the env_var.md file (likewise for the FWD and DGRAD variants above).
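   
       For reference, a hypothetical setting (the variable is parsed as a comma- or space-separated list of engine indices, per ExcludeEngines() below; the values here are made up):
    ```
    export MXNET_CUDNN_DISABLED_CONV_WGRAD_ENGINES="1,3"
    ```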

##########
File path: src/operator/cudnn_ops.cc
##########
@@ -0,0 +1,765 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+/*!
+ * Copyright (c) 2021 by Contributors

Review comment:
       I believe the Copyright ... by Contributors line should be removed.

##########
File path: src/operator/cudnn_ops.cc
##########
@@ -0,0 +1,765 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+/*!
+ * Copyright (c) 2021 by Contributors
+ * \file  cudnn_ops.cc
+ * \brief cuDNN v8 ops
+ */
+
+#include "cudnn_ops.h"
+
+#include <mxnet/base.h>
+#if MXNET_USE_CUDNN == 1
+
+#include <dmlc/parameter.h>
+
+#include <algorithm>
+#include <cstdlib>
+#include <iomanip>
+#include <iterator>
+#include <limits>
+#include <numeric>
+#include <sstream>
+#include <string>
+#include <utility>
+
+namespace mxnet {
+namespace op {
+
+using cudnn_cxx::Descriptor;
+using cudnn_cxx::GetAttr;
+using cudnn_cxx::GetSomeAttrs;
+using cudnn_cxx::IsCompatible;
+using cudnn_cxx::MakeAvgSampler;
+using cudnn_cxx::MakeFinalized;
+using cudnn_cxx::PackedStrides;
+using cudnn_cxx::PlanStr;
+
+namespace cudnn {
+
+cudnnDataType_t CudnnType(mshadow::TypeFlag dtype) {
+  static std::unordered_map<mshadow::TypeFlag, cudnnDataType_t> type_map {
+    {mshadow::kFloat32, CUDNN_DATA_FLOAT}, {mshadow::kFloat64, CUDNN_DATA_DOUBLE},
+        {mshadow::kFloat16, CUDNN_DATA_HALF}, {mshadow::kUint8, CUDNN_DATA_UINT8},
+        {mshadow::kInt8, CUDNN_DATA_INT8}, {mshadow::kInt32, CUDNN_DATA_INT32},
+#if CUDNN_VERSION >= 8100
+        {mshadow::kInt64, CUDNN_DATA_INT64},
+#endif  // CUDNN_VERSION >= 8100
+  };
+  auto it = type_map.find(dtype);
+  CHECK(it != type_map.end()) << "Unsupported type: " << dtype;
+  return it->second;
+}
+
+std::vector<size_t> LayoutInfo::Order() const {
+  std::vector<size_t> ret(n_space_dims + 2);
+  std::iota(ret.begin(), ret.end(), 0);
+  if (channel_last)
+    std::rotate(ret.begin() + 1, ret.begin() + 2, ret.end());
+  return ret;
+}
+
+size_t LayoutInfo::ChannelIdx() const {
+  return channel_last ? 1 + n_space_dims : 1;
+}
+
+std::vector<int64_t> LayoutInfo::Strides(const std::vector<int64_t>& dims) const {
+  return PackedStrides(Order(), dims);
+}
+
+LayoutInfo GetLayoutInfo(mshadow::LayoutFlag layout) {
+  static std::unordered_map<mshadow::LayoutFlag, LayoutInfo> layout_map{
+      {mshadow::kNCW, {1, false}},
+      {mshadow::kNWC, {1, true}},
+      {mshadow::kNCHW, {2, false}},
+      {mshadow::kNHWC, {2, true}},
+      {mshadow::kNCDHW, {3, false}},
+      {mshadow::kNDHWC, {3, true}},
+  };
+  auto it = layout_map.find(layout);
+  CHECK(it != layout_map.end()) << "Unsupported layout: " << layout;
+  return it->second;
+}
+
+TShape ExpandChannelDims(mshadow::LayoutFlag layout, int c) {
+  auto li = GetLayoutInfo(layout);
+  std::vector<int> dims(li.n_space_dims + 2, 1);
+  dims[li.ChannelIdx()] = c;
+  return TShape(dims.begin(), dims.end());
+}
+
+std::vector<size_t> ReverseOrder(const std::vector<size_t>& o) {
+  std::vector<size_t> ret(o.size());
+  for (size_t i = 0; i < ret.size(); ++i)
+    ret[o[i]] = i;
+  return ret;
+}
+
+std::vector<cudnnBackendNumericalNote_t> RequireNumerics() {
+  std::vector<cudnnBackendNumericalNote_t> ret;
+  return ret;
+}
+
+std::vector<cudnnBackendNumericalNote_t> ExcludeNumerics() {
+  std::vector<cudnnBackendNumericalNote_t> ret;
+  if (!dmlc::GetEnv("MXNET_CUDA_ALLOW_TENSOR_CORE", true))
+    ret.push_back(CUDNN_NUMERICAL_NOTE_TENSOR_CORE);
+  if (!dmlc::GetEnv("MXNET_CUDA_TENSOR_OP_MATH_ALLOW_CONVERSION", false))
+    ret.push_back(CUDNN_NUMERICAL_NOTE_DOWN_CONVERT_INPUTS);
+  if (!dmlc::GetEnv("MXNET_CUDNN_ALLOW_REDUCED_PRECISION_REDUCTION", true))
+    ret.push_back(CUDNN_NUMERICAL_NOTE_REDUCED_PRECISION_REDUCTION);
+  if (!dmlc::GetEnv("MXNET_CUDNN_ALLOW_FFT", true))
+    ret.push_back(CUDNN_NUMERICAL_NOTE_FFT);
+  if (dmlc::GetEnv("MXNET_ENFORCE_DETERMINISM", false))
+    ret.push_back(CUDNN_NUMERICAL_NOTE_NONDETERMINISTIC);
+  if (!dmlc::GetEnv("MXNET_CUDNN_ALLOW_WINOGRAD", true))
+    ret.push_back(CUDNN_NUMERICAL_NOTE_WINOGRAD);
+  return ret;
+}
+
+Descriptor MakeTensorDesc(int64_t uid,
+                          cudnnDataType_t dtype,
+                          const std::vector<int64_t>& dims,
+                          const std::vector<int64_t>& strides,
+                          bool is_virtual) {
+  int64_t alignment = 16;  // TODO(vcherepanov): ?
+  return MakeFinalized(CUDNN_BACKEND_TENSOR_DESCRIPTOR,
+                       CUDNN_ATTR_TENSOR_UNIQUE_ID,
+                       uid,
+                       CUDNN_ATTR_TENSOR_DATA_TYPE,
+                       dtype,
+                       CUDNN_ATTR_TENSOR_BYTE_ALIGNMENT,
+                       alignment,
+                       CUDNN_ATTR_TENSOR_DIMENSIONS,
+                       dims,
+                       CUDNN_ATTR_TENSOR_STRIDES,
+                       strides,
+                       CUDNN_ATTR_TENSOR_IS_VIRTUAL,
+                       is_virtual);
+}
+
+Descriptor MakeTensorDesc(int64_t uid,
+                          const TBlob& blob,
+                          const LayoutInfo& li,
+                          bool expand_1d,
+                          bool is_virtual) {
+  std::vector<int64_t> dims(blob.shape_.ndim());
+  CHECK_EQ(dims.size(), li.n_space_dims + 2);
+  auto rev_order = ReverseOrder(li.Order());
+  for (size_t i = 0; i < dims.size(); ++i)
+    dims[i] = blob.shape_[rev_order[i]];
+  auto strides = li.Strides(dims);
+  if (li.n_space_dims == 1 && expand_1d) {
+    dims.insert(dims.begin() + 2, 1);
+    std::vector<size_t> order(dims.size());
+    std::iota(order.begin(), order.end(), 0);
+    if (li.channel_last)
+      std::rotate(order.begin() + 1, order.begin() + 2, order.end());
+    strides = PackedStrides(order, dims);
+  }
+  return MakeTensorDesc(
+      uid, CudnnType(static_cast<mshadow::TypeFlag>(blob.type_flag_)), dims, strides, is_virtual);
+}
+
+Descriptor MakeCTensorDescExpandDims(int64_t uid,
+                                     const TBlob& b,
+                                     const LayoutInfo& li,
+                                     bool is_virtual) {
+  std::vector<int64_t> dims(li.n_space_dims + 2, 1);
+  dims[1]    = b.shape_[0];
+  auto dtype = CudnnType(static_cast<mshadow::TypeFlag>(b.type_flag_));
+  return MakeTensorDesc(uid, dtype, dims, li.Strides(dims), is_virtual);
+}
+
+Descriptor MakeConvDesc(const ConvParam& param, mshadow::TypeFlag dtype) {
+  int64_t sdims = param.kernel.ndim();
+  std::vector<int64_t> stride(param.stride.begin(), param.stride.end());
+  std::vector<int64_t> dilate(param.dilate.begin(), param.dilate.end());
+  std::vector<int64_t> pad(param.pad.begin(), param.pad.end());
+
+  auto comp_type = CudnnType(dtype);
+  if (comp_type == CUDNN_DATA_HALF)
+    comp_type = CUDNN_DATA_FLOAT;
+
+  if (sdims == 1) {
+    // TODO(vcherepanov): remove this once cuDNN properly supports 1D convolutions.
+    // For now, making spatial dims 2D: 1 x W.
+    ++sdims;
+    stride.insert(stride.begin(), 1);
+    dilate.insert(dilate.begin(), 1);
+    pad.insert(pad.begin(), 0);
+  }
+  return MakeFinalized(CUDNN_BACKEND_CONVOLUTION_DESCRIPTOR,
+                       CUDNN_ATTR_CONVOLUTION_SPATIAL_DIMS,
+                       sdims,
+                       CUDNN_ATTR_CONVOLUTION_COMP_TYPE,
+                       comp_type,
+                       CUDNN_ATTR_CONVOLUTION_CONV_MODE,
+                       CUDNN_CROSS_CORRELATION,
+                       CUDNN_ATTR_CONVOLUTION_FILTER_STRIDES,
+                       stride,
+                       CUDNN_ATTR_CONVOLUTION_DILATIONS,
+                       dilate,
+                       CUDNN_ATTR_CONVOLUTION_PRE_PADDINGS,
+                       pad,
+                       CUDNN_ATTR_CONVOLUTION_POST_PADDINGS,
+                       pad);
+}
+
+Descriptor MakeConvFwdOp(const Descriptor& conv,
+                         const Descriptor& x,
+                         const Descriptor& w,
+                         const Descriptor& y,
+                         bool add_to) {
+  auto ret = Make(CUDNN_BACKEND_OPERATION_CONVOLUTION_FORWARD_DESCRIPTOR,
+                  CUDNN_ATTR_OPERATION_CONVOLUTION_FORWARD_CONV_DESC,
+                  conv,
+                  CUDNN_ATTR_OPERATION_CONVOLUTION_FORWARD_X,
+                  x,
+                  CUDNN_ATTR_OPERATION_CONVOLUTION_FORWARD_W,
+                  w,
+                  CUDNN_ATTR_OPERATION_CONVOLUTION_FORWARD_Y,
+                  y);
+  if (GetAttr<cudnnDataType_t>(x, CUDNN_ATTR_TENSOR_DATA_TYPE) == CUDNN_DATA_DOUBLE) {
+    SetAttrs(ret,
+             CUDNN_ATTR_OPERATION_CONVOLUTION_FORWARD_ALPHA,
+             1.0,
+             CUDNN_ATTR_OPERATION_CONVOLUTION_FORWARD_BETA,
+             add_to ? 1.0 : 0.0);
+  } else {
+    SetAttrs(ret,
+             CUDNN_ATTR_OPERATION_CONVOLUTION_FORWARD_ALPHA,
+             1.0f,
+             CUDNN_ATTR_OPERATION_CONVOLUTION_FORWARD_BETA,
+             add_to ? 1.0f : 0.0f);
+  }
+  CUDNN_CALL(cudnnBackendFinalize(ret.get()));
+  return ret;
+}
+
+Descriptor MakeConvDgradOp(const Descriptor& conv,
+                           const Descriptor& w,
+                           const Descriptor& dy,
+                           const Descriptor& dx,
+                           bool add_to) {
+  auto ret = Make(CUDNN_BACKEND_OPERATION_CONVOLUTION_BACKWARD_DATA_DESCRIPTOR,
+                  CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_DATA_CONV_DESC,
+                  conv,
+                  CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_DATA_W,
+                  w,
+                  CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_DATA_DY,
+                  dy,
+                  CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_DATA_DX,
+                  dx);
+  if (GetAttr<cudnnDataType_t>(w, CUDNN_ATTR_TENSOR_DATA_TYPE) == CUDNN_DATA_DOUBLE) {
+    SetAttrs(ret,
+             CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_DATA_ALPHA,
+             1.0,
+             CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_DATA_BETA,
+             add_to ? 1.0 : 0.0);
+  } else {
+    SetAttrs(ret,
+             CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_DATA_ALPHA,
+             1.0f,
+             CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_DATA_BETA,
+             add_to ? 1.0f : 0.0f);
+  }
+  CUDNN_CALL(cudnnBackendFinalize(ret.get()));
+  return ret;
+}
+
+Descriptor MakeConvWgradOp(const Descriptor& conv,
+                           const Descriptor& x,
+                           const Descriptor& dy,
+                           const Descriptor& dw,
+                           bool add_to) {
+  auto ret = Make(CUDNN_BACKEND_OPERATION_CONVOLUTION_BACKWARD_FILTER_DESCRIPTOR,
+                  CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_FILTER_CONV_DESC,
+                  conv,
+                  CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_FILTER_X,
+                  x,
+                  CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_FILTER_DY,
+                  dy,
+                  CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_FILTER_DW,
+                  dw);
+  if (GetAttr<cudnnDataType_t>(x, CUDNN_ATTR_TENSOR_DATA_TYPE) == CUDNN_DATA_DOUBLE) {
+    SetAttrs(ret,
+             CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_FILTER_ALPHA,
+             1.0,
+             CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_FILTER_BETA,
+             add_to ? 1.0 : 0.0);
+  } else {
+    SetAttrs(ret,
+             CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_FILTER_ALPHA,
+             1.0f,
+             CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_FILTER_BETA,
+             add_to ? 1.0f : 0.0f);
+  }
+  CUDNN_CALL(cudnnBackendFinalize(ret.get()));
+  return ret;
+}
+
+Descriptor MakeOpGraph(cudnnHandle_t handle, const std::vector<Descriptor>& ops) {
+  return MakeFinalized(CUDNN_BACKEND_OPERATIONGRAPH_DESCRIPTOR,
+                       CUDNN_ATTR_OPERATIONGRAPH_HANDLE,
+                       handle,
+                       CUDNN_ATTR_OPERATIONGRAPH_OPS,
+                       ops);
+}
+
+ConvParam::ConvParam(const ConvolutionParam& p, bool add_to)
+    : kernel(p.kernel),
+      stride(p.stride),
+      dilate(p.dilate),
+      pad(p.pad),
+      num_filter(p.num_filter),
+      num_group(p.num_group),
+      workspace(p.workspace),
+      cudnn_tune(p.cudnn_tune),
+      layout(p.layout),
+      add_to(add_to) {}
+
+ConvParam::ConvParam(const DeconvolutionParam& p, bool add_to)
+    : kernel(p.kernel),
+      stride(p.stride),
+      dilate(p.dilate),
+      pad(p.pad),
+      num_filter(p.num_filter),
+      num_group(p.num_group),
+      workspace(p.workspace),
+      cudnn_tune(p.cudnn_tune),
+      layout(p.layout),
+      add_to(add_to) {}
+
+void TuneWarnOnce() {
+  thread_local bool done = false;
+  if (!done) {
+    LOG(INFO) << "Auto-tuning cuDNN op, set MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable";
+    done = true;
+  }
+}
+
+std::vector<Descriptor> MakeFallbackPlans(
+    const std::vector<int64_t>& ixs,
+    cudnnHandle_t handle,
+    const Descriptor& op_graph,
+    size_t workspace_limit,
+    size_t* max_workspace,
+    const std::unordered_set<int64_t>& excl_engines,
+    const std::vector<cudnnBackendNumericalNote_t>& req_numeric,
+    const std::vector<cudnnBackendNumericalNote_t>& excl_numeric
+#if CUDNN_VERSION >= 8200
+    ,
+    const std::vector<cudnnBackendBehaviorNote_t>& req_behavior,
+    const std::vector<cudnnBackendBehaviorNote_t>& excl_behavior
+#endif  // CUDNN_VERSION >= 8200
+) {
+  std::vector<Descriptor> plans;
+  if (max_workspace)
+    *max_workspace = 0;
+  for (auto ix : ixs) {
+    if (excl_engines.count(ix))
+      continue;
+    auto engine = Make(CUDNN_BACKEND_ENGINE_DESCRIPTOR,
+                       CUDNN_ATTR_ENGINE_OPERATION_GRAPH,
+                       op_graph,
+                       CUDNN_ATTR_ENGINE_GLOBAL_INDEX,
+                       ix);
+    auto err    = cudnnBackendFinalize(engine.get());
+    if (err == CUDNN_STATUS_NOT_SUPPORTED || err == CUDNN_STATUS_ARCH_MISMATCH)
+      continue;
+    if (err != CUDNN_STATUS_SUCCESS) {
+      LOG(WARNING) << "Unexpected cuDNN status: " << err << ": " << cudnnGetErrorString(err);
+      continue;
+    }
+    auto cfg =
+        MakeFinalized(CUDNN_BACKEND_ENGINECFG_DESCRIPTOR, CUDNN_ATTR_ENGINECFG_ENGINE, engine);
+    auto plan = Make(CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR,
+                     CUDNN_ATTR_EXECUTION_PLAN_HANDLE,
+                     handle,
+                     CUDNN_ATTR_EXECUTION_PLAN_ENGINE_CONFIG,
+                     cfg);
+    err       = cudnnBackendFinalize(plan.get());
+    if (err == CUDNN_STATUS_NOT_SUPPORTED || err == CUDNN_STATUS_ARCH_MISMATCH)
+      continue;
+    if (err != CUDNN_STATUS_SUCCESS) {
+      LOG(WARNING) << "Unexpected cuDNN status: " << err << ": " << cudnnGetErrorString(err);
+      continue;
+    }
+    auto workspace = GetAttr<int64_t>(plan, CUDNN_ATTR_EXECUTION_PLAN_WORKSPACE_SIZE);
+    if (workspace > workspace_limit)
+      continue;
+    auto numerical = GetSomeAttrs<cudnnBackendNumericalNote_t>(
+        CUDNN_NUMERICAL_NOTE_TYPE_COUNT, engine, CUDNN_ATTR_ENGINE_NUMERICAL_NOTE);
+    if (!IsCompatible(numerical, req_numeric, excl_numeric))
+      continue;
+#if CUDNN_VERSION >= 8200
+    auto behavior = GetSomeAttrs<cudnnBackendBehaviorNote_t>(
+        CUDNN_BEHAVIOR_NOTE_TYPE_COUNT, engine, CUDNN_ATTR_ENGINE_BEHAVIOR_NOTE);
+    if (!IsCompatible(behavior, req_behavior, excl_behavior))
+      continue;
+#endif  // CUDNN_VERSION >= 8200
+    plans.push_back(std::move(plan));
+    if (max_workspace)
+      *max_workspace = std::max(*max_workspace, static_cast<size_t>(workspace));
+  }
+  return plans;
+}
+
+cudnnBackendHeurMode_t HeurMode() {
+#if CUDNN_VERSION >= 8100
+  auto minor       = cudnnGetVersion() / 100 % 10;
+  int default_mode = minor < 2 ? CUDNN_HEUR_MODE_INSTANT : CUDNN_HEUR_MODE_B;
+#else
+  int default_mode = CUDNN_HEUR_MODE_INSTANT;
+#endif  // CUDNN_VERSION >= 8100
+  return static_cast<cudnnBackendHeurMode_t>(dmlc::GetEnv("MXNET_CUDNN_HEUR_MODE", default_mode));

Review comment:
       Again, please add a description of MXNET_CUDNN_HEUR_MODE in the env_var.md file.
   
   I'm OK with leaving it as an int tied to the cudnn.h definitions. Technically, though, that ties constants users put in training scripts to those definitions, and those definitions could change. An alternative would be to parse the env var as a string, e.g. if set via:
   ```
   export MXNET_CUDNN_HEUR_MODE="B"
   ```
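   
   A minimal sketch of what string parsing might look like (hypothetical helper `HeurModeFromEnv`, not part of this PR; assumes the includes already present in this file and only the mode names defined in cudnn.h):
    ```
    // Hypothetical: map user-facing mode names to cudnnBackendHeurMode_t values.
    cudnnBackendHeurMode_t HeurModeFromEnv(cudnnBackendHeurMode_t default_mode) {
      static const std::unordered_map<std::string, cudnnBackendHeurMode_t> modes{
          {"INSTANT", CUDNN_HEUR_MODE_INSTANT},
    #if CUDNN_VERSION >= 8100
          {"B", CUDNN_HEUR_MODE_B},
    #endif  // CUDNN_VERSION >= 8100
      };
      std::string name = dmlc::GetEnv("MXNET_CUDNN_HEUR_MODE", std::string());
      if (name.empty())
        return default_mode;
      auto it = modes.find(name);
      CHECK(it != modes.end()) << "Unsupported MXNET_CUDNN_HEUR_MODE: " << name;
      return it->second;
    }
    ```
   That way a user's script would keep working even if the numeric values of the enum were ever reshuffled in a future cudnn.h.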
   

##########
File path: src/operator/nn/deconvolution.cu
##########
@@ -18,72 +18,23 @@
  */
 
 /*!
+ * Copyright (c) 2015 by Contributors

Review comment:
       Please remove this line.

##########
File path: src/operator/cudnn_ops.cc
##########
@@ -0,0 +1,765 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+/*!
+ * Copyright (c) 2021 by Contributors
+ * \file  cudnn_ops.cc
+ * \brief cuDNN v8 ops
+ */
+
+#include "cudnn_ops.h"
+
+#include <mxnet/base.h>
+#if MXNET_USE_CUDNN == 1
+
+#include <dmlc/parameter.h>
+
+#include <algorithm>
+#include <cstdlib>
+#include <iomanip>
+#include <iterator>
+#include <limits>
+#include <numeric>
+#include <sstream>
+#include <string>
+#include <utility>
+
+namespace mxnet {
+namespace op {
+
+using cudnn_cxx::Descriptor;
+using cudnn_cxx::GetAttr;
+using cudnn_cxx::GetSomeAttrs;
+using cudnn_cxx::IsCompatible;
+using cudnn_cxx::MakeAvgSampler;
+using cudnn_cxx::MakeFinalized;
+using cudnn_cxx::PackedStrides;
+using cudnn_cxx::PlanStr;
+
+namespace cudnn {
+
+cudnnDataType_t CudnnType(mshadow::TypeFlag dtype) {
+  static std::unordered_map<mshadow::TypeFlag, cudnnDataType_t> type_map {
+    {mshadow::kFloat32, CUDNN_DATA_FLOAT}, {mshadow::kFloat64, CUDNN_DATA_DOUBLE},
+        {mshadow::kFloat16, CUDNN_DATA_HALF}, {mshadow::kUint8, CUDNN_DATA_UINT8},
+        {mshadow::kInt8, CUDNN_DATA_INT8}, {mshadow::kInt32, CUDNN_DATA_INT32},
+#if CUDNN_VERSION >= 8100
+        {mshadow::kInt64, CUDNN_DATA_INT64},
+#endif  // CUDNN_VERSION >= 8100
+  };
+  auto it = type_map.find(dtype);
+  CHECK(it != type_map.end()) << "Unsupported type: " << dtype;
+  return it->second;
+}
+
+std::vector<size_t> LayoutInfo::Order() const {
+  std::vector<size_t> ret(n_space_dims + 2);
+  std::iota(ret.begin(), ret.end(), 0);
+  if (channel_last)
+    std::rotate(ret.begin() + 1, ret.begin() + 2, ret.end());
+  return ret;
+}
+
+size_t LayoutInfo::ChannelIdx() const {
+  return channel_last ? 1 + n_space_dims : 1;
+}
+
+std::vector<int64_t> LayoutInfo::Strides(const std::vector<int64_t>& dims) const {
+  return PackedStrides(Order(), dims);
+}
+
+LayoutInfo GetLayoutInfo(mshadow::LayoutFlag layout) {
+  static std::unordered_map<mshadow::LayoutFlag, LayoutInfo> layout_map{
+      {mshadow::kNCW, {1, false}},
+      {mshadow::kNWC, {1, true}},
+      {mshadow::kNCHW, {2, false}},
+      {mshadow::kNHWC, {2, true}},
+      {mshadow::kNCDHW, {3, false}},
+      {mshadow::kNDHWC, {3, true}},
+  };
+  auto it = layout_map.find(layout);
+  CHECK(it != layout_map.end()) << "Unsupported layout: " << layout;
+  return it->second;
+}
+
+TShape ExpandChannelDims(mshadow::LayoutFlag layout, int c) {
+  auto li = GetLayoutInfo(layout);
+  std::vector<int> dims(li.n_space_dims + 2, 1);
+  dims[li.ChannelIdx()] = c;
+  return TShape(dims.begin(), dims.end());
+}
+
+std::vector<size_t> ReverseOrder(const std::vector<size_t>& o) {
+  std::vector<size_t> ret(o.size());
+  for (size_t i = 0; i < ret.size(); ++i)
+    ret[o[i]] = i;
+  return ret;
+}
+
+std::vector<cudnnBackendNumericalNote_t> RequireNumerics() {
+  std::vector<cudnnBackendNumericalNote_t> ret;
+  return ret;
+}
+
+std::vector<cudnnBackendNumericalNote_t> ExcludeNumerics() {
+  std::vector<cudnnBackendNumericalNote_t> ret;
+  if (!dmlc::GetEnv("MXNET_CUDA_ALLOW_TENSOR_CORE", true))
+    ret.push_back(CUDNN_NUMERICAL_NOTE_TENSOR_CORE);
+  if (!dmlc::GetEnv("MXNET_CUDA_TENSOR_OP_MATH_ALLOW_CONVERSION", false))
+    ret.push_back(CUDNN_NUMERICAL_NOTE_DOWN_CONVERT_INPUTS);
+  if (!dmlc::GetEnv("MXNET_CUDNN_ALLOW_REDUCED_PRECISION_REDUCTION", true))
+    ret.push_back(CUDNN_NUMERICAL_NOTE_REDUCED_PRECISION_REDUCTION);
+  if (!dmlc::GetEnv("MXNET_CUDNN_ALLOW_FFT", true))
+    ret.push_back(CUDNN_NUMERICAL_NOTE_FFT);
+  if (dmlc::GetEnv("MXNET_ENFORCE_DETERMINISM", false))
+    ret.push_back(CUDNN_NUMERICAL_NOTE_NONDETERMINISTIC);
+  if (!dmlc::GetEnv("MXNET_CUDNN_ALLOW_WINOGRAD", true))
+    ret.push_back(CUDNN_NUMERICAL_NOTE_WINOGRAD);
+  return ret;
+}
+
+Descriptor MakeTensorDesc(int64_t uid,
+                          cudnnDataType_t dtype,
+                          const std::vector<int64_t>& dims,
+                          const std::vector<int64_t>& strides,
+                          bool is_virtual) {
+  int64_t alignment = 16;  // TODO(vcherepanov): ?
+  return MakeFinalized(CUDNN_BACKEND_TENSOR_DESCRIPTOR,
+                       CUDNN_ATTR_TENSOR_UNIQUE_ID,
+                       uid,
+                       CUDNN_ATTR_TENSOR_DATA_TYPE,
+                       dtype,
+                       CUDNN_ATTR_TENSOR_BYTE_ALIGNMENT,
+                       alignment,
+                       CUDNN_ATTR_TENSOR_DIMENSIONS,
+                       dims,
+                       CUDNN_ATTR_TENSOR_STRIDES,
+                       strides,
+                       CUDNN_ATTR_TENSOR_IS_VIRTUAL,
+                       is_virtual);
+}
+
+Descriptor MakeTensorDesc(int64_t uid,
+                          const TBlob& blob,
+                          const LayoutInfo& li,
+                          bool expand_1d,
+                          bool is_virtual) {
+  std::vector<int64_t> dims(blob.shape_.ndim());
+  CHECK_EQ(dims.size(), li.n_space_dims + 2);
+  auto rev_order = ReverseOrder(li.Order());
+  for (size_t i = 0; i < dims.size(); ++i)
+    dims[i] = blob.shape_[rev_order[i]];
+  auto strides = li.Strides(dims);
+  if (li.n_space_dims == 1 && expand_1d) {
+    dims.insert(dims.begin() + 2, 1);
+    std::vector<size_t> order(dims.size());
+    std::iota(order.begin(), order.end(), 0);
+    if (li.channel_last)
+      std::rotate(order.begin() + 1, order.begin() + 2, order.end());
+    strides = PackedStrides(order, dims);
+  }
+  return MakeTensorDesc(
+      uid, CudnnType(static_cast<mshadow::TypeFlag>(blob.type_flag_)), dims, strides, is_virtual);
+}
+
+Descriptor MakeCTensorDescExpandDims(int64_t uid,
+                                     const TBlob& b,
+                                     const LayoutInfo& li,
+                                     bool is_virtual) {
+  std::vector<int64_t> dims(li.n_space_dims + 2, 1);
+  dims[1]    = b.shape_[0];
+  auto dtype = CudnnType(static_cast<mshadow::TypeFlag>(b.type_flag_));
+  return MakeTensorDesc(uid, dtype, dims, li.Strides(dims), is_virtual);
+}
+
+Descriptor MakeConvDesc(const ConvParam& param, mshadow::TypeFlag dtype) {
+  int64_t sdims = param.kernel.ndim();
+  std::vector<int64_t> stride(param.stride.begin(), param.stride.end());
+  std::vector<int64_t> dilate(param.dilate.begin(), param.dilate.end());
+  std::vector<int64_t> pad(param.pad.begin(), param.pad.end());
+
+  auto comp_type = CudnnType(dtype);
+  if (comp_type == CUDNN_DATA_HALF)
+    comp_type = CUDNN_DATA_FLOAT;
+
+  if (sdims == 1) {
+    // TODO(vcherepanov): remove this once cuDNN properly supports 1D convolutions.
+    // For now, making spatial dims 2D: 1 x W.
+    ++sdims;
+    stride.insert(stride.begin(), 1);
+    dilate.insert(dilate.begin(), 1);
+    pad.insert(pad.begin(), 0);
+  }
+  return MakeFinalized(CUDNN_BACKEND_CONVOLUTION_DESCRIPTOR,
+                       CUDNN_ATTR_CONVOLUTION_SPATIAL_DIMS,
+                       sdims,
+                       CUDNN_ATTR_CONVOLUTION_COMP_TYPE,
+                       comp_type,
+                       CUDNN_ATTR_CONVOLUTION_CONV_MODE,
+                       CUDNN_CROSS_CORRELATION,
+                       CUDNN_ATTR_CONVOLUTION_FILTER_STRIDES,
+                       stride,
+                       CUDNN_ATTR_CONVOLUTION_DILATIONS,
+                       dilate,
+                       CUDNN_ATTR_CONVOLUTION_PRE_PADDINGS,
+                       pad,
+                       CUDNN_ATTR_CONVOLUTION_POST_PADDINGS,
+                       pad);
+}
+
+Descriptor MakeConvFwdOp(const Descriptor& conv,
+                         const Descriptor& x,
+                         const Descriptor& w,
+                         const Descriptor& y,
+                         bool add_to) {
+  auto ret = Make(CUDNN_BACKEND_OPERATION_CONVOLUTION_FORWARD_DESCRIPTOR,
+                  CUDNN_ATTR_OPERATION_CONVOLUTION_FORWARD_CONV_DESC,
+                  conv,
+                  CUDNN_ATTR_OPERATION_CONVOLUTION_FORWARD_X,
+                  x,
+                  CUDNN_ATTR_OPERATION_CONVOLUTION_FORWARD_W,
+                  w,
+                  CUDNN_ATTR_OPERATION_CONVOLUTION_FORWARD_Y,
+                  y);
+  if (GetAttr<cudnnDataType_t>(x, CUDNN_ATTR_TENSOR_DATA_TYPE) == CUDNN_DATA_DOUBLE) {
+    SetAttrs(ret,
+             CUDNN_ATTR_OPERATION_CONVOLUTION_FORWARD_ALPHA,
+             1.0,
+             CUDNN_ATTR_OPERATION_CONVOLUTION_FORWARD_BETA,
+             add_to ? 1.0 : 0.0);
+  } else {
+    SetAttrs(ret,
+             CUDNN_ATTR_OPERATION_CONVOLUTION_FORWARD_ALPHA,
+             1.0f,
+             CUDNN_ATTR_OPERATION_CONVOLUTION_FORWARD_BETA,
+             add_to ? 1.0f : 0.0f);
+  }
+  CUDNN_CALL(cudnnBackendFinalize(ret.get()));
+  return ret;
+}
+
+Descriptor MakeConvDgradOp(const Descriptor& conv,
+                           const Descriptor& w,
+                           const Descriptor& dy,
+                           const Descriptor& dx,
+                           bool add_to) {
+  auto ret = Make(CUDNN_BACKEND_OPERATION_CONVOLUTION_BACKWARD_DATA_DESCRIPTOR,
+                  CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_DATA_CONV_DESC,
+                  conv,
+                  CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_DATA_W,
+                  w,
+                  CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_DATA_DY,
+                  dy,
+                  CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_DATA_DX,
+                  dx);
+  if (GetAttr<cudnnDataType_t>(w, CUDNN_ATTR_TENSOR_DATA_TYPE) == CUDNN_DATA_DOUBLE) {
+    SetAttrs(ret,
+             CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_DATA_ALPHA,
+             1.0,
+             CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_DATA_BETA,
+             add_to ? 1.0 : 0.0);
+  } else {
+    SetAttrs(ret,
+             CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_DATA_ALPHA,
+             1.0f,
+             CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_DATA_BETA,
+             add_to ? 1.0f : 0.0f);
+  }
+  CUDNN_CALL(cudnnBackendFinalize(ret.get()));
+  return ret;
+}
+
+Descriptor MakeConvWgradOp(const Descriptor& conv,
+                           const Descriptor& x,
+                           const Descriptor& dy,
+                           const Descriptor& dw,
+                           bool add_to) {
+  auto ret = Make(CUDNN_BACKEND_OPERATION_CONVOLUTION_BACKWARD_FILTER_DESCRIPTOR,
+                  CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_FILTER_CONV_DESC,
+                  conv,
+                  CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_FILTER_X,
+                  x,
+                  CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_FILTER_DY,
+                  dy,
+                  CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_FILTER_DW,
+                  dw);
+  if (GetAttr<cudnnDataType_t>(x, CUDNN_ATTR_TENSOR_DATA_TYPE) == CUDNN_DATA_DOUBLE) {
+    SetAttrs(ret,
+             CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_FILTER_ALPHA,
+             1.0,
+             CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_FILTER_BETA,
+             add_to ? 1.0 : 0.0);
+  } else {
+    SetAttrs(ret,
+             CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_FILTER_ALPHA,
+             1.0f,
+             CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_FILTER_BETA,
+             add_to ? 1.0f : 0.0f);
+  }
+  CUDNN_CALL(cudnnBackendFinalize(ret.get()));
+  return ret;
+}
+
+Descriptor MakeOpGraph(cudnnHandle_t handle, const std::vector<Descriptor>& ops) {
+  return MakeFinalized(CUDNN_BACKEND_OPERATIONGRAPH_DESCRIPTOR,
+                       CUDNN_ATTR_OPERATIONGRAPH_HANDLE,
+                       handle,
+                       CUDNN_ATTR_OPERATIONGRAPH_OPS,
+                       ops);
+}
+
+ConvParam::ConvParam(const ConvolutionParam& p, bool add_to)
+    : kernel(p.kernel),
+      stride(p.stride),
+      dilate(p.dilate),
+      pad(p.pad),
+      num_filter(p.num_filter),
+      num_group(p.num_group),
+      workspace(p.workspace),
+      cudnn_tune(p.cudnn_tune),
+      layout(p.layout),
+      add_to(add_to) {}
+
+ConvParam::ConvParam(const DeconvolutionParam& p, bool add_to)
+    : kernel(p.kernel),
+      stride(p.stride),
+      dilate(p.dilate),
+      pad(p.pad),
+      num_filter(p.num_filter),
+      num_group(p.num_group),
+      workspace(p.workspace),
+      cudnn_tune(p.cudnn_tune),
+      layout(p.layout),
+      add_to(add_to) {}
+
+void TuneWarnOnce() {
+  thread_local bool done = false;
+  if (!done) {
+    LOG(INFO) << "Auto-tuning cuDNN op, set MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable";
+    done = true;
+  }
+}
+
+std::vector<Descriptor> MakeFallbackPlans(
+    const std::vector<int64_t>& ixs,
+    cudnnHandle_t handle,
+    const Descriptor& op_graph,
+    size_t workspace_limit,
+    size_t* max_workspace,
+    const std::unordered_set<int64_t>& excl_engines,
+    const std::vector<cudnnBackendNumericalNote_t>& req_numeric,
+    const std::vector<cudnnBackendNumericalNote_t>& excl_numeric
+#if CUDNN_VERSION >= 8200
+    ,
+    const std::vector<cudnnBackendBehaviorNote_t>& req_behavior,
+    const std::vector<cudnnBackendBehaviorNote_t>& excl_behavior
+#endif  // CUDNN_VERSION >= 8200
+) {
+  std::vector<Descriptor> plans;
+  if (max_workspace)
+    *max_workspace = 0;
+  for (auto ix : ixs) {
+    if (excl_engines.count(ix))
+      continue;
+    auto engine = Make(CUDNN_BACKEND_ENGINE_DESCRIPTOR,
+                       CUDNN_ATTR_ENGINE_OPERATION_GRAPH,
+                       op_graph,
+                       CUDNN_ATTR_ENGINE_GLOBAL_INDEX,
+                       ix);
+    auto err    = cudnnBackendFinalize(engine.get());
+    if (err == CUDNN_STATUS_NOT_SUPPORTED || err == CUDNN_STATUS_ARCH_MISMATCH)
+      continue;
+    if (err != CUDNN_STATUS_SUCCESS) {
+      LOG(WARNING) << "Unexpected cuDNN status: " << err << ": " << cudnnGetErrorString(err);
+      continue;
+    }
+    auto cfg =
+        MakeFinalized(CUDNN_BACKEND_ENGINECFG_DESCRIPTOR, CUDNN_ATTR_ENGINECFG_ENGINE, engine);
+    auto plan = Make(CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR,
+                     CUDNN_ATTR_EXECUTION_PLAN_HANDLE,
+                     handle,
+                     CUDNN_ATTR_EXECUTION_PLAN_ENGINE_CONFIG,
+                     cfg);
+    err       = cudnnBackendFinalize(plan.get());
+    if (err == CUDNN_STATUS_NOT_SUPPORTED || err == CUDNN_STATUS_ARCH_MISMATCH)
+      continue;
+    if (err != CUDNN_STATUS_SUCCESS) {
+      LOG(WARNING) << "Unexpected cuDNN status: " << err << ": " << cudnnGetErrorString(err);
+      continue;
+    }
+    auto workspace = GetAttr<int64_t>(plan, CUDNN_ATTR_EXECUTION_PLAN_WORKSPACE_SIZE);
+    if (workspace > workspace_limit)
+      continue;
+    auto numerical = GetSomeAttrs<cudnnBackendNumericalNote_t>(
+        CUDNN_NUMERICAL_NOTE_TYPE_COUNT, engine, CUDNN_ATTR_ENGINE_NUMERICAL_NOTE);
+    if (!IsCompatible(numerical, req_numeric, excl_numeric))
+      continue;
+#if CUDNN_VERSION >= 8200
+    auto behavior = GetSomeAttrs<cudnnBackendBehaviorNote_t>(
+        CUDNN_BEHAVIOR_NOTE_TYPE_COUNT, engine, CUDNN_ATTR_ENGINE_BEHAVIOR_NOTE);
+    if (!IsCompatible(behavior, req_behavior, excl_behavior))
+      continue;
+#endif  // CUDNN_VERSION >= 8200
+    plans.push_back(std::move(plan));
+    if (max_workspace)
+      *max_workspace = std::max(*max_workspace, static_cast<size_t>(workspace));
+  }
+  return plans;
+}
+
+cudnnBackendHeurMode_t HeurMode() {
+#if CUDNN_VERSION >= 8100
+  auto minor       = cudnnGetVersion() / 100 % 10;
+  int default_mode = minor < 2 ? CUDNN_HEUR_MODE_INSTANT : CUDNN_HEUR_MODE_B;
+#else
+  int default_mode = CUDNN_HEUR_MODE_INSTANT;
+#endif  // CUDNN_VERSION >= 8100
+  return static_cast<cudnnBackendHeurMode_t>(dmlc::GetEnv("MXNET_CUDNN_HEUR_MODE", default_mode));
+}
+
+std::string ConvParamStr(const ConvParam& param) {
+  std::ostringstream ss;
+  ss << " layout: " << param.layout.value();
+  ss << " kernel: " << param.kernel;
+  ss << " stride: " << param.stride;
+  ss << " dilate: " << param.dilate;
+  ss << " pad: " << param.pad;
+  ss << " num_filter: " << param.num_filter;
+  ss << " num_group: " << param.num_group;
+  ss << " workspace: " << param.workspace;
+  return ss.str();
+}
+
+size_t GetWorkspace(const Descriptor& plan) {
+  return GetAttr<int64_t>(plan, CUDNN_ATTR_EXECUTION_PLAN_WORKSPACE_SIZE);
+}
+
+Storage::Handle FailsafeAlloc(size_t workspace_size) {
+  return Storage::Get()->Alloc(workspace_size, Context::GPU(), true);
+}
+
+Storage::Handle AllocWorkspace(std::vector<Descriptor>* plans, size_t* workspace_size) {
+  Storage::Handle workspace;
+  size_t alloc_size = *workspace_size;
+  while ((workspace = FailsafeAlloc(alloc_size)).dptr == nullptr && alloc_size > 0) {
+    // Remove any plan whose workspace_size equals the failed allocation size
+    auto hasMaxWorkspace = [alloc_size](auto const& plan) {
+      return GetWorkspace(plan) == alloc_size;
+    };
+    plans->erase(std::remove_if(plans->begin(), plans->end(), hasMaxWorkspace), plans->end());
+    // Calculate new maximum workspace_size for remaining plans
+    alloc_size = 0;
+    for (auto& plan : *plans)
+      alloc_size = std::max(alloc_size, GetWorkspace(plan));
+  }
+  *workspace_size = alloc_size;
+  return workspace;
+}
+
+std::unordered_set<int64_t> ExcludeEngines(const std::string& env_var) {
+  std::string engines = dmlc::GetEnv(env_var.c_str(), std::string());
+  std::replace(engines.begin(), engines.end(), ',', ' ');
+  std::istringstream ss(engines);
+  return std::unordered_set<int64_t>(std::istream_iterator<int64_t>(ss),
+                                     std::istream_iterator<int64_t>());
+}
+
+Descriptor SelectPlan(const OpContext& ctx,
+                      const ConvParam& param,
+                      Descriptor op,
+                      size_t n_fallbacks,
+                      const std::function<std::string()>& make_op_str,
+                      const std::vector<int64_t>& ids,
+                      const std::vector<void*>& tensor_ptrs,
+                      int64_t out_size,
+                      const std::string& excl_engines_var) {
+  auto s = ctx.get_stream<gpu>();
+  std::vector<Descriptor> ops;
+  ops.push_back(std::move(op));
+  auto op_graph = MakeOpGraph(s->dnn_handle_, ops);
+
+  int verbose = dmlc::GetEnv("MXNET_CUDNN_ALGO_VERBOSE_LEVEL", 0);
+  if (verbose > 0)
+    LOG(INFO) << "Selecting plan for " << make_op_str() << ":";
+
+  auto tune = param.cudnn_tune
+                  ? param.cudnn_tune.value()
+                  : dmlc::GetEnv("MXNET_CUDNN_AUTOTUNE_DEFAULT", static_cast<int>(conv::kLimited));
+  size_t workspace_size = 0;
+  size_t workspace_limit =
+      tune != conv::kFastest ? param.workspace << 20 : std::numeric_limits<size_t>::max();
+  auto excl_engines = ExcludeEngines(excl_engines_var);
+  auto plans        = GetPlans(HeurMode(),
+                        s->dnn_handle_,
+                        op_graph,
+                        workspace_limit,
+                        &workspace_size,
+                        excl_engines,
+                        RequireNumerics(),
+                        ExcludeNumerics(),
+#if CUDNN_VERSION >= 8200
+                        {},
+                        {},
+#endif  // CUDNN_VERSION >= 8200
+                        verbose > 1);
+  Storage::Handle out_space;
+  auto ptrs = tensor_ptrs;
+  if (tune != conv::kOff && param.add_to) {
+    // Cannot trash output tensor while auto-tuning.
+    out_space = FailsafeAlloc(out_size);
+    if (out_space.dptr)
+      ptrs.back() = out_space.dptr;
+  }
+  // TODO:
+  //     - should we be able to ask the tempspace for its current size, then
+  //       alloc the workspace from the tempspace if its current size > workspace_size?
+  auto workspace = AllocWorkspace(&plans, &workspace_size);
+
+  if (plans.empty()) {
+    std::vector<int64_t> ixs(n_fallbacks);
+    std::iota(ixs.begin(), ixs.end(), 0);
+#if CUDNN_VERSION >= 8200
+    plans = MakeFallbackPlans(ixs,
+                              s->dnn_handle_,
+                              op_graph,
+                              workspace_limit,
+                              &workspace_size,
+                              excl_engines,
+                              RequireNumerics(),
+                              ExcludeNumerics(),
+                              {},
+                              {});
+#else
+    plans = MakeFallbackPlans(ixs,
+                              s->dnn_handle_,
+                              op_graph,
+                              workspace_limit,
+                              &workspace_size,
+                              excl_engines,
+                              RequireNumerics(),
+                              ExcludeNumerics());
+#endif  // CUDNN_VERSION >= 8200
+    workspace = AllocWorkspace(&plans, &workspace_size);
+    CHECK(!plans.empty());
+    LOG(WARNING) << "Using fallback engine(s) for " << make_op_str();
+  }
+
+  if (tune == conv::kOff || plans.size() == 1 || (param.add_to && !out_space.dptr)) {
+    if (verbose > 0)
+      LOG(INFO) << " " << PlanStr(plans[0]);
+    Storage::Get()->Free(out_space);
+    Storage::Get()->Free(workspace);

Review comment:
       I believe DirectFree() would be better here; it would match the behavior of the legacy v7 cuDNNFind(). If we leave this as Free(), auto-tuning has the side effect of allocating lots of GPU memory and then releasing it to the MXNet allocator's free pool. In theory, with ReleaseAll() triggered as needed, this would work fine. In practice, though, we might leave all memory up to the reserve limit (so 95% of memory) allocated, and then go out of memory if DALI asks for too many image buffers. Having users set MXNET_GPU_MEM_POOL_RESERVE to a value larger than the default of 5 would probably correct the problem, but I'd rather not put users to that trouble if we have a workaround.
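   
       A minimal sketch of the suggested change (assuming Storage::DirectFree() releases the memory back to the device rather than to the pooled allocator, matching the legacy path):
    ```
    Storage::Get()->DirectFree(out_space);
    Storage::Get()->DirectFree(workspace);
    ```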

##########
File path: src/operator/cudnn_ops.cc
##########
@@ -0,0 +1,765 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+/*!
+ * Copyright (c) 2021 by Contributors
+ * \file  cudnn_ops.cc
+ * \brief cuDNN v8 ops
+ */
+
+#include "cudnn_ops.h"
+
+#include <mxnet/base.h>
+#if MXNET_USE_CUDNN == 1
+
+#include <dmlc/parameter.h>
+
+#include <algorithm>
+#include <cstdlib>
+#include <iomanip>
+#include <iterator>
+#include <limits>
+#include <numeric>
+#include <sstream>
+#include <string>
+#include <utility>
+
+namespace mxnet {
+namespace op {
+
+using cudnn_cxx::Descriptor;
+using cudnn_cxx::GetAttr;
+using cudnn_cxx::GetSomeAttrs;
+using cudnn_cxx::IsCompatible;
+using cudnn_cxx::MakeAvgSampler;
+using cudnn_cxx::MakeFinalized;
+using cudnn_cxx::PackedStrides;
+using cudnn_cxx::PlanStr;
+
+namespace cudnn {
+
+cudnnDataType_t CudnnType(mshadow::TypeFlag dtype) {
+  static std::unordered_map<mshadow::TypeFlag, cudnnDataType_t> type_map {
+    {mshadow::kFloat32, CUDNN_DATA_FLOAT}, {mshadow::kFloat64, CUDNN_DATA_DOUBLE},
+        {mshadow::kFloat16, CUDNN_DATA_HALF}, {mshadow::kUint8, CUDNN_DATA_UINT8},
+        {mshadow::kInt8, CUDNN_DATA_INT8}, {mshadow::kInt32, CUDNN_DATA_INT32},
+#if CUDNN_VERSION >= 8100
+        {mshadow::kInt64, CUDNN_DATA_INT64},
+#endif  // CUDNN_VERSION >= 8100
+  };
+  auto it = type_map.find(dtype);
+  CHECK(it != type_map.end()) << "Unsupported type: " << dtype;
+  return it->second;
+}
+
+std::vector<size_t> LayoutInfo::Order() const {
+  std::vector<size_t> ret(n_space_dims + 2);
+  std::iota(ret.begin(), ret.end(), 0);
+  if (channel_last)
+    std::rotate(ret.begin() + 1, ret.begin() + 2, ret.end());
+  return ret;
+}
+
+size_t LayoutInfo::ChannelIdx() const {
+  return channel_last ? 1 + n_space_dims : 1;
+}
+
+std::vector<int64_t> LayoutInfo::Strides(const std::vector<int64_t>& dims) const {
+  return PackedStrides(Order(), dims);
+}
+
+LayoutInfo GetLayoutInfo(mshadow::LayoutFlag layout) {
+  static std::unordered_map<mshadow::LayoutFlag, LayoutInfo> layout_map{
+      {mshadow::kNCW, {1, false}},
+      {mshadow::kNWC, {1, true}},
+      {mshadow::kNCHW, {2, false}},
+      {mshadow::kNHWC, {2, true}},
+      {mshadow::kNCDHW, {3, false}},
+      {mshadow::kNDHWC, {3, true}},
+  };
+  auto it = layout_map.find(layout);
+  CHECK(it != layout_map.end()) << "Unsupported layout: " << layout;
+  return it->second;
+}
+
+TShape ExpandChannelDims(mshadow::LayoutFlag layout, int c) {
+  auto li = GetLayoutInfo(layout);
+  std::vector<int> dims(li.n_space_dims + 2, 1);
+  dims[li.ChannelIdx()] = c;
+  return TShape(dims.begin(), dims.end());
+}
+
+std::vector<size_t> ReverseOrder(const std::vector<size_t>& o) {
+  std::vector<size_t> ret(o.size());
+  for (size_t i = 0; i < ret.size(); ++i)
+    ret[o[i]] = i;
+  return ret;
+}
+
+std::vector<cudnnBackendNumericalNote_t> RequireNumerics() {
+  std::vector<cudnnBackendNumericalNote_t> ret;
+  return ret;
+}
+
+std::vector<cudnnBackendNumericalNote_t> ExcludeNumerics() {
+  std::vector<cudnnBackendNumericalNote_t> ret;
+  if (!dmlc::GetEnv("MXNET_CUDA_ALLOW_TENSOR_CORE", true))
+    ret.push_back(CUDNN_NUMERICAL_NOTE_TENSOR_CORE);
+  if (!dmlc::GetEnv("MXNET_CUDA_TENSOR_OP_MATH_ALLOW_CONVERSION", false))
+    ret.push_back(CUDNN_NUMERICAL_NOTE_DOWN_CONVERT_INPUTS);
+  if (!dmlc::GetEnv("MXNET_CUDNN_ALLOW_REDUCED_PRECISION_REDUCTION", true))
+    ret.push_back(CUDNN_NUMERICAL_NOTE_REDUCED_PRECISION_REDUCTION);
+  if (!dmlc::GetEnv("MXNET_CUDNN_ALLOW_FFT", true))
+    ret.push_back(CUDNN_NUMERICAL_NOTE_FFT);
+  if (dmlc::GetEnv("MXNET_ENFORCE_DETERMINISM", false))
+    ret.push_back(CUDNN_NUMERICAL_NOTE_NONDETERMINISTIC);
+  if (!dmlc::GetEnv("MXNET_CUDNN_ALLOW_WINOGRAD", true))
+    ret.push_back(CUDNN_NUMERICAL_NOTE_WINOGRAD);
+  return ret;
+}
+
+Descriptor MakeTensorDesc(int64_t uid,
+                          cudnnDataType_t dtype,
+                          const std::vector<int64_t>& dims,
+                          const std::vector<int64_t>& strides,
+                          bool is_virtual) {
+  int64_t alignment = 16;  // TODO(vcherepanov): ?
+  return MakeFinalized(CUDNN_BACKEND_TENSOR_DESCRIPTOR,
+                       CUDNN_ATTR_TENSOR_UNIQUE_ID,
+                       uid,
+                       CUDNN_ATTR_TENSOR_DATA_TYPE,
+                       dtype,
+                       CUDNN_ATTR_TENSOR_BYTE_ALIGNMENT,
+                       alignment,
+                       CUDNN_ATTR_TENSOR_DIMENSIONS,
+                       dims,
+                       CUDNN_ATTR_TENSOR_STRIDES,
+                       strides,
+                       CUDNN_ATTR_TENSOR_IS_VIRTUAL,
+                       is_virtual);
+}
+
+Descriptor MakeTensorDesc(int64_t uid,
+                          const TBlob& blob,
+                          const LayoutInfo& li,
+                          bool expand_1d,
+                          bool is_virtual) {
+  std::vector<int64_t> dims(blob.shape_.ndim());
+  CHECK_EQ(dims.size(), li.n_space_dims + 2);
+  auto rev_order = ReverseOrder(li.Order());
+  for (size_t i = 0; i < dims.size(); ++i)
+    dims[i] = blob.shape_[rev_order[i]];
+  auto strides = li.Strides(dims);
+  if (li.n_space_dims == 1 && expand_1d) {
+    dims.insert(dims.begin() + 2, 1);
+    std::vector<size_t> order(dims.size());
+    std::iota(order.begin(), order.end(), 0);
+    if (li.channel_last)
+      std::rotate(order.begin() + 1, order.begin() + 2, order.end());
+    strides = PackedStrides(order, dims);
+  }
+  return MakeTensorDesc(
+      uid, CudnnType(static_cast<mshadow::TypeFlag>(blob.type_flag_)), dims, strides, is_virtual);
+}
+
+Descriptor MakeCTensorDescExpandDims(int64_t uid,
+                                     const TBlob& b,
+                                     const LayoutInfo& li,
+                                     bool is_virtual) {
+  std::vector<int64_t> dims(li.n_space_dims + 2, 1);
+  dims[1]    = b.shape_[0];
+  auto dtype = CudnnType(static_cast<mshadow::TypeFlag>(b.type_flag_));
+  return MakeTensorDesc(uid, dtype, dims, li.Strides(dims), is_virtual);
+}
+
+Descriptor MakeConvDesc(const ConvParam& param, mshadow::TypeFlag dtype) {
+  int64_t sdims = param.kernel.ndim();
+  std::vector<int64_t> stride(param.stride.begin(), param.stride.end());
+  std::vector<int64_t> dilate(param.dilate.begin(), param.dilate.end());
+  std::vector<int64_t> pad(param.pad.begin(), param.pad.end());
+
+  auto comp_type = CudnnType(dtype);
+  if (comp_type == CUDNN_DATA_HALF)
+    comp_type = CUDNN_DATA_FLOAT;
+
+  if (sdims == 1) {
+    // TODO(vcherepanov): remove this once cuDNN properly supports 1D convolutions.
+    // For now, making spatial dims 2D: 1 x W.
+    ++sdims;
+    stride.insert(stride.begin(), 1);
+    dilate.insert(dilate.begin(), 1);
+    pad.insert(pad.begin(), 0);
+  }
+  return MakeFinalized(CUDNN_BACKEND_CONVOLUTION_DESCRIPTOR,
+                       CUDNN_ATTR_CONVOLUTION_SPATIAL_DIMS,
+                       sdims,
+                       CUDNN_ATTR_CONVOLUTION_COMP_TYPE,
+                       comp_type,
+                       CUDNN_ATTR_CONVOLUTION_CONV_MODE,
+                       CUDNN_CROSS_CORRELATION,
+                       CUDNN_ATTR_CONVOLUTION_FILTER_STRIDES,
+                       stride,
+                       CUDNN_ATTR_CONVOLUTION_DILATIONS,
+                       dilate,
+                       CUDNN_ATTR_CONVOLUTION_PRE_PADDINGS,
+                       pad,
+                       CUDNN_ATTR_CONVOLUTION_POST_PADDINGS,
+                       pad);
+}
+
+Descriptor MakeConvFwdOp(const Descriptor& conv,
+                         const Descriptor& x,
+                         const Descriptor& w,
+                         const Descriptor& y,
+                         bool add_to) {
+  auto ret = Make(CUDNN_BACKEND_OPERATION_CONVOLUTION_FORWARD_DESCRIPTOR,
+                  CUDNN_ATTR_OPERATION_CONVOLUTION_FORWARD_CONV_DESC,
+                  conv,
+                  CUDNN_ATTR_OPERATION_CONVOLUTION_FORWARD_X,
+                  x,
+                  CUDNN_ATTR_OPERATION_CONVOLUTION_FORWARD_W,
+                  w,
+                  CUDNN_ATTR_OPERATION_CONVOLUTION_FORWARD_Y,
+                  y);
+  if (GetAttr<cudnnDataType_t>(x, CUDNN_ATTR_TENSOR_DATA_TYPE) == CUDNN_DATA_DOUBLE) {
+    SetAttrs(ret,
+             CUDNN_ATTR_OPERATION_CONVOLUTION_FORWARD_ALPHA,
+             1.0,
+             CUDNN_ATTR_OPERATION_CONVOLUTION_FORWARD_BETA,
+             add_to ? 1.0 : 0.0);
+  } else {
+    SetAttrs(ret,
+             CUDNN_ATTR_OPERATION_CONVOLUTION_FORWARD_ALPHA,
+             1.0f,
+             CUDNN_ATTR_OPERATION_CONVOLUTION_FORWARD_BETA,
+             add_to ? 1.0f : 0.0f);
+  }
+  CUDNN_CALL(cudnnBackendFinalize(ret.get()));
+  return ret;
+}
+
+Descriptor MakeConvDgradOp(const Descriptor& conv,
+                           const Descriptor& w,
+                           const Descriptor& dy,
+                           const Descriptor& dx,
+                           bool add_to) {
+  auto ret = Make(CUDNN_BACKEND_OPERATION_CONVOLUTION_BACKWARD_DATA_DESCRIPTOR,
+                  CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_DATA_CONV_DESC,
+                  conv,
+                  CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_DATA_W,
+                  w,
+                  CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_DATA_DY,
+                  dy,
+                  CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_DATA_DX,
+                  dx);
+  if (GetAttr<cudnnDataType_t>(w, CUDNN_ATTR_TENSOR_DATA_TYPE) == CUDNN_DATA_DOUBLE) {
+    SetAttrs(ret,
+             CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_DATA_ALPHA,
+             1.0,
+             CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_DATA_BETA,
+             add_to ? 1.0 : 0.0);
+  } else {
+    SetAttrs(ret,
+             CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_DATA_ALPHA,
+             1.0f,
+             CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_DATA_BETA,
+             add_to ? 1.0f : 0.0f);
+  }
+  CUDNN_CALL(cudnnBackendFinalize(ret.get()));
+  return ret;
+}
+
+Descriptor MakeConvWgradOp(const Descriptor& conv,
+                           const Descriptor& x,
+                           const Descriptor& dy,
+                           const Descriptor& dw,
+                           bool add_to) {
+  auto ret = Make(CUDNN_BACKEND_OPERATION_CONVOLUTION_BACKWARD_FILTER_DESCRIPTOR,
+                  CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_FILTER_CONV_DESC,
+                  conv,
+                  CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_FILTER_X,
+                  x,
+                  CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_FILTER_DY,
+                  dy,
+                  CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_FILTER_DW,
+                  dw);
+  if (GetAttr<cudnnDataType_t>(x, CUDNN_ATTR_TENSOR_DATA_TYPE) == CUDNN_DATA_DOUBLE) {
+    SetAttrs(ret,
+             CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_FILTER_ALPHA,
+             1.0,
+             CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_FILTER_BETA,
+             add_to ? 1.0 : 0.0);
+  } else {
+    SetAttrs(ret,
+             CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_FILTER_ALPHA,
+             1.0f,
+             CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_FILTER_BETA,
+             add_to ? 1.0f : 0.0f);
+  }
+  CUDNN_CALL(cudnnBackendFinalize(ret.get()));
+  return ret;
+}
+
+Descriptor MakeOpGraph(cudnnHandle_t handle, const std::vector<Descriptor>& ops) {
+  return MakeFinalized(CUDNN_BACKEND_OPERATIONGRAPH_DESCRIPTOR,
+                       CUDNN_ATTR_OPERATIONGRAPH_HANDLE,
+                       handle,
+                       CUDNN_ATTR_OPERATIONGRAPH_OPS,
+                       ops);
+}
+
+ConvParam::ConvParam(const ConvolutionParam& p, bool add_to)
+    : kernel(p.kernel),
+      stride(p.stride),
+      dilate(p.dilate),
+      pad(p.pad),
+      num_filter(p.num_filter),
+      num_group(p.num_group),
+      workspace(p.workspace),
+      cudnn_tune(p.cudnn_tune),
+      layout(p.layout),
+      add_to(add_to) {}
+
+ConvParam::ConvParam(const DeconvolutionParam& p, bool add_to)
+    : kernel(p.kernel),
+      stride(p.stride),
+      dilate(p.dilate),
+      pad(p.pad),
+      num_filter(p.num_filter),
+      num_group(p.num_group),
+      workspace(p.workspace),
+      cudnn_tune(p.cudnn_tune),
+      layout(p.layout),
+      add_to(add_to) {}
+
+void TuneWarnOnce() {
+  thread_local bool done = false;
+  if (!done) {
+    LOG(INFO) << "Auto-tuning cuDNN op, set MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable";
+    done = true;
+  }
+}
+
+std::vector<Descriptor> MakeFallbackPlans(
+    const std::vector<int64_t>& ixs,
+    cudnnHandle_t handle,
+    const Descriptor& op_graph,
+    size_t workspace_limit,
+    size_t* max_workspace,
+    const std::unordered_set<int64_t>& excl_engines,
+    const std::vector<cudnnBackendNumericalNote_t>& req_numeric,
+    const std::vector<cudnnBackendNumericalNote_t>& excl_numeric
+#if CUDNN_VERSION >= 8200
+    ,
+    const std::vector<cudnnBackendBehaviorNote_t>& req_behavior,
+    const std::vector<cudnnBackendBehaviorNote_t>& excl_behavior
+#endif  // CUDNN_VERSION >= 8200
+) {
+  std::vector<Descriptor> plans;
+  if (max_workspace)
+    *max_workspace = 0;
+  for (auto ix : ixs) {
+    if (excl_engines.count(ix))
+      continue;
+    auto engine = Make(CUDNN_BACKEND_ENGINE_DESCRIPTOR,
+                       CUDNN_ATTR_ENGINE_OPERATION_GRAPH,
+                       op_graph,
+                       CUDNN_ATTR_ENGINE_GLOBAL_INDEX,
+                       ix);
+    auto err    = cudnnBackendFinalize(engine.get());
+    if (err == CUDNN_STATUS_NOT_SUPPORTED || err == CUDNN_STATUS_ARCH_MISMATCH)
+      continue;
+    if (err != CUDNN_STATUS_SUCCESS) {
+      LOG(WARNING) << "Unexpected cuDNN status: " << err << ": " << cudnnGetErrorString(err);
+      continue;
+    }
+    auto cfg =
+        MakeFinalized(CUDNN_BACKEND_ENGINECFG_DESCRIPTOR, CUDNN_ATTR_ENGINECFG_ENGINE, engine);
+    auto plan = Make(CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR,
+                     CUDNN_ATTR_EXECUTION_PLAN_HANDLE,
+                     handle,
+                     CUDNN_ATTR_EXECUTION_PLAN_ENGINE_CONFIG,
+                     cfg);
+    err       = cudnnBackendFinalize(plan.get());
+    if (err == CUDNN_STATUS_NOT_SUPPORTED || err == CUDNN_STATUS_ARCH_MISMATCH)
+      continue;
+    if (err != CUDNN_STATUS_SUCCESS) {
+      LOG(WARNING) << "Unexpected cuDNN status: " << err << ": " << cudnnGetErrorString(err);
+      continue;
+    }
+    auto workspace = GetAttr<int64_t>(plan, CUDNN_ATTR_EXECUTION_PLAN_WORKSPACE_SIZE);
+    if (workspace > workspace_limit)
+      continue;
+    auto numerical = GetSomeAttrs<cudnnBackendNumericalNote_t>(
+        CUDNN_NUMERICAL_NOTE_TYPE_COUNT, engine, CUDNN_ATTR_ENGINE_NUMERICAL_NOTE);
+    if (!IsCompatible(numerical, req_numeric, excl_numeric))
+      continue;
+#if CUDNN_VERSION >= 8200
+    auto behavior = GetSomeAttrs<cudnnBackendBehaviorNote_t>(
+        CUDNN_BEHAVIOR_NOTE_TYPE_COUNT, engine, CUDNN_ATTR_ENGINE_BEHAVIOR_NOTE);
+    if (!IsCompatible(behavior, req_behavior, excl_behavior))
+      continue;
+#endif  // CUDNN_VERSION >= 8200
+    plans.push_back(std::move(plan));
+    if (max_workspace)
+      *max_workspace = std::max(*max_workspace, static_cast<size_t>(workspace));
+  }
+  return plans;
+}
+
+cudnnBackendHeurMode_t HeurMode() {
+#if CUDNN_VERSION >= 8100
+  auto minor       = cudnnGetVersion() / 100 % 10;
+  int default_mode = minor < 2 ? CUDNN_HEUR_MODE_INSTANT : CUDNN_HEUR_MODE_B;

Review comment:
       Once we get to cuDNN 9.0, this calculation based on `minor` will fail: for 9.0 the computed `minor` comes out as 0, so CUDNN_HEUR_MODE_INSTANT would silently be selected again.  It looks like CUDNN_HEUR_MODE_INSTANT is the choice for all CUDNN_VERSION < 8200, so this can be simplified.
   
   In the code and documentation, can we move to CUDNN_HEUR_MODE_A instead of CUDNN_HEUR_MODE_INSTANT, since the latter is deprecated?  With which cuDNN version was CUDNN_HEUR_MODE_A first defined?
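
   For illustration, a minimal sketch of the suggested simplification (compile-time check only, keeping the env-var override; whether CUDNN_HEUR_MODE_A should replace CUDNN_HEUR_MODE_INSTANT is left open per the question above):

   ```cpp
   // Sketch only: drops the runtime `minor` arithmetic that breaks at cuDNN 9.0.
   cudnnBackendHeurMode_t HeurMode() {
   #if CUDNN_VERSION >= 8200
     int default_mode = CUDNN_HEUR_MODE_B;
   #else
     int default_mode = CUDNN_HEUR_MODE_INSTANT;  // or CUDNN_HEUR_MODE_A, see above
   #endif  // CUDNN_VERSION >= 8200
     return static_cast<cudnnBackendHeurMode_t>(
         dmlc::GetEnv("MXNET_CUDNN_HEUR_MODE", default_mode));
   }
   ```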
   

##########
File path: src/operator/cudnn_ops.cc
##########
@@ -0,0 +1,765 @@
+cudnnBackendHeurMode_t HeurMode() {
+#if CUDNN_VERSION >= 8100
+  auto minor       = cudnnGetVersion() / 100 % 10;
+  int default_mode = minor < 2 ? CUDNN_HEUR_MODE_INSTANT : CUDNN_HEUR_MODE_B;
+#else
+  int default_mode = CUDNN_HEUR_MODE_INSTANT;
+#endif  // CUDNN_VERSION >= 8100
+  return static_cast<cudnnBackendHeurMode_t>(dmlc::GetEnv("MXNET_CUDNN_HEUR_MODE", default_mode));
+}
+
+std::string ConvParamStr(const ConvParam& param) {
+  std::ostringstream ss;
+  ss << " layout: " << param.layout.value();
+  ss << " kernel: " << param.kernel;
+  ss << " stride: " << param.stride;
+  ss << " dilate: " << param.dilate;
+  ss << " pad: " << param.pad;
+  ss << " num_filter: " << param.num_filter;
+  ss << " num_group: " << param.num_group;
+  ss << " workspace: " << param.workspace;
+  return ss.str();
+}
+
+size_t GetWorkspace(const Descriptor& plan) {
+  return GetAttr<int64_t>(plan, CUDNN_ATTR_EXECUTION_PLAN_WORKSPACE_SIZE);
+}
+
+Storage::Handle FailsafeAlloc(size_t workspace_size) {
+  return Storage::Get()->Alloc(workspace_size, Context::GPU(), true);
+}
+
+Storage::Handle AllocWorkspace(std::vector<Descriptor>* plans, size_t* workspace_size) {
+  Storage::Handle workspace;
+  size_t alloc_size = *workspace_size;
+  while ((workspace = FailsafeAlloc(alloc_size)).dptr == nullptr && alloc_size > 0) {
+    // Remove any plan whose workspace_size equals the failed allocation size
+    auto hasMaxWorkspace = [alloc_size](auto const& plan) {
+      return GetWorkspace(plan) == alloc_size;
+    };
+    plans->erase(std::remove_if(plans->begin(), plans->end(), hasMaxWorkspace), plans->end());
+    // Calculate new maximum workspace_size for remaining plans
+    alloc_size = 0;
+    for (auto& plan : *plans)
+      alloc_size = std::max(alloc_size, GetWorkspace(plan));
+  }
+  *workspace_size = alloc_size;
+  return workspace;
+}
+
+std::unordered_set<int64_t> ExcludeEngines(const std::string& env_var) {
+  std::string engines = dmlc::GetEnv(env_var.c_str(), std::string());
+  std::replace(engines.begin(), engines.end(), ',', ' ');
+  std::istringstream ss(engines);
+  return std::unordered_set<int64_t>(std::istream_iterator<int64_t>(ss),
+                                     std::istream_iterator<int64_t>());
+}
+
+Descriptor SelectPlan(const OpContext& ctx,
+                      const ConvParam& param,
+                      Descriptor op,
+                      size_t n_fallbacks,
+                      const std::function<std::string()>& make_op_str,
+                      const std::vector<int64_t>& ids,
+                      const std::vector<void*>& tensor_ptrs,
+                      int64_t out_size,
+                      const std::string& excl_engines_var) {
+  auto s = ctx.get_stream<gpu>();
+  std::vector<Descriptor> ops;
+  ops.push_back(std::move(op));
+  auto op_graph = MakeOpGraph(s->dnn_handle_, ops);
+
+  int verbose = dmlc::GetEnv("MXNET_CUDNN_ALGO_VERBOSE_LEVEL", 0);

Review comment:
       Please add env var description.
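
   As a starting point, a possible description inferred from how `verbose` is used in `SelectPlan` (wording is a suggestion only):

   ```cpp
   // MXNET_CUDNN_ALGO_VERBOSE_LEVEL
   //   Values: 0, 1, or 2 (default 0).
   //   0: silent.
   //   1: log each op being tuned and the execution plan finally selected.
   //   2: additionally log every candidate plan and its measured run time.
   int verbose = dmlc::GetEnv("MXNET_CUDNN_ALGO_VERBOSE_LEVEL", 0);
   ```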

##########
File path: src/operator/cudnn_ops.cc
##########
@@ -0,0 +1,765 @@
+Descriptor SelectPlan(const OpContext& ctx,
+                      const ConvParam& param,
+                      Descriptor op,
+                      size_t n_fallbacks,
+                      const std::function<std::string()>& make_op_str,
+                      const std::vector<int64_t>& ids,
+                      const std::vector<void*>& tensor_ptrs,
+                      int64_t out_size,
+                      const std::string& excl_engines_var) {
+  auto s = ctx.get_stream<gpu>();
+  std::vector<Descriptor> ops;
+  ops.push_back(std::move(op));
+  auto op_graph = MakeOpGraph(s->dnn_handle_, ops);
+
+  int verbose = dmlc::GetEnv("MXNET_CUDNN_ALGO_VERBOSE_LEVEL", 0);
+  if (verbose > 0)
+    LOG(INFO) << "Selecting plan for " << make_op_str() << ":";
+
+  auto tune = param.cudnn_tune
+                  ? param.cudnn_tune.value()
+                  : dmlc::GetEnv("MXNET_CUDNN_AUTOTUNE_DEFAULT", static_cast<int>(conv::kLimited));
+  size_t workspace_size = 0;
+  size_t workspace_limit =
+      tune != conv::kFastest ? param.workspace << 20 : std::numeric_limits<size_t>::max();
+  auto excl_engines = ExcludeEngines(excl_engines_var);
+  auto plans        = GetPlans(HeurMode(),
+                        s->dnn_handle_,
+                        op_graph,
+                        workspace_limit,
+                        &workspace_size,
+                        excl_engines,
+                        RequireNumerics(),
+                        ExcludeNumerics(),
+#if CUDNN_VERSION >= 8200
+                        {},
+                        {},
+#endif  // CUDNN_VERSION >= 8200
+                        verbose > 1);
+  Storage::Handle out_space;
+  auto ptrs = tensor_ptrs;
+  if (tune != conv::kOff && param.add_to) {
+    // Cannot trash output tensor while auto-tuning.
+    out_space = FailsafeAlloc(out_size);
+    if (out_space.dptr)
+      ptrs.back() = out_space.dptr;
+  }
+  // TODO:
+  //     - should we be able to ask the tempspace for its current size, then
+  //       alloc the workspace from the tempspace if its current size > workspace_size?
+  auto workspace = AllocWorkspace(&plans, &workspace_size);
+
+  if (plans.empty()) {
+    std::vector<int64_t> ixs(n_fallbacks);
+    std::iota(ixs.begin(), ixs.end(), 0);
+#if CUDNN_VERSION >= 8200
+    plans = MakeFallbackPlans(ixs,
+                              s->dnn_handle_,
+                              op_graph,
+                              workspace_limit,
+                              &workspace_size,
+                              excl_engines,
+                              RequireNumerics(),
+                              ExcludeNumerics(),
+                              {},
+                              {});
+#else
+    plans = MakeFallbackPlans(ixs,
+                              s->dnn_handle_,
+                              op_graph,
+                              workspace_limit,
+                              &workspace_size,
+                              excl_engines,
+                              RequireNumerics(),
+                              ExcludeNumerics());
+#endif  // CUDNN_VERSION >= 8200
+    workspace = AllocWorkspace(&plans, &workspace_size);
+    CHECK(!plans.empty());
+    LOG(WARNING) << "Using fallback engine(s) for " << make_op_str();
+  }
+
+  if (tune == conv::kOff || plans.size() == 1 || (param.add_to && !out_space.dptr)) {
+    if (verbose > 0)
+      LOG(INFO) << " " << PlanStr(plans[0]);
+    Storage::Get()->Free(out_space);
+    Storage::Get()->Free(workspace);
+    return std::move(plans[0]);
+  }
+
+  TuneWarnOnce();
+  size_t n      = verbose > 1 ? plans.size() : 1;
+  auto var_pack = MakeFinalized(CUDNN_BACKEND_VARIANT_PACK_DESCRIPTOR,
+                                CUDNN_ATTR_VARIANT_PACK_UNIQUE_IDS,
+                                ids,
+                                CUDNN_ATTR_VARIANT_PACK_DATA_POINTERS,
+                                ptrs,
+                                CUDNN_ATTR_VARIANT_PACK_WORKSPACE,
+                                workspace.dptr);
+  auto top      = FindTopPlans(std::move(plans), n, s->dnn_handle_, var_pack, MakeAvgSampler(3));
+  Storage::Get()->Free(out_space);
+  Storage::Get()->Free(workspace);
+  auto str_time = [](float t) {
+    std::ostringstream ss;
+    ss << std::fixed << std::setprecision(6) << t;
+    return ss.str();
+  };
+  for (size_t i = 0; verbose > 0 && i < top.size(); ++i) {
+    auto prefix = i == 0 ? " * " : "   ";
+    LOG(INFO) << prefix << top[i].heur_i << ") " << str_time(top[i].time) << "ms "
+              << PlanStr(top[i].plan);
+  }
+  return std::move(top[0].plan);
+}
+
+size_t Size(const TBlob& t) {
+  return t.Size() * mshadow::mshadow_sizeof(t.type_flag_);
+}
+
+// TODO(vcherepanov): remove these, once fallbacks are received as a heuristics mode in 8.3
+enum MaxFallbacks { kMaxConvFallbacks = 58, kMaxDgradFallbacks = 63, kMaxWgradFallbacks = 62 };
+
+cudnn_cxx::Descriptor Conv::Make(const OpContext& ctx,
+                                 const Param& param,
+                                 const TBlob& x,
+                                 const TBlob& w,
+                                 const TBlob& y) {
+  auto conv     = MakeConvDesc(param, static_cast<mshadow::TypeFlag>(x.type_flag_));
+  auto li       = GetLayoutInfo(static_cast<mshadow::LayoutFlag>(param.layout.value()));
+  auto x_desc   = MakeTensorDesc(ID_X, x, li, true, false);
+  auto w_desc   = MakeTensorDesc(ID_W, w, li, true, false);
+  auto y_desc   = MakeTensorDesc(ID_Y, y, li, true, false);
+  auto conv_fwd = MakeConvFwdOp(conv, x_desc, w_desc, y_desc, param.add_to);
+
+  auto make_op_str = [&param, &x]() {
+    std::ostringstream ss;
+    ss << "fprop " << mshadow::dtype_string(x.type_flag_) << " " << ConvParamStr(param);
+    return ss.str();
+  };
+
+  std::vector<int64_t> ids{ID_X, ID_W, ID_Y};
+  std::vector<void*> ptrs{x.dptr_, w.dptr_, y.dptr_};
+
+  return SelectPlan(ctx,
+                    param,
+                    std::move(conv_fwd),
+                    kMaxConvFallbacks,
+                    make_op_str,
+                    ids,
+                    ptrs,
+                    Size(y),
+                    "MXNET_CUDNN_DISABLED_CONV_FWD_ENGINES");

Review comment:
       New env var, needs doc.
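
   A possible entry, based on how `ExcludeEngines` parses the value (a sketch only; exact wording and doc location are open):

   ```cpp
   // MXNET_CUDNN_DISABLED_CONV_FWD_ENGINES
   //   Comma-separated list of cuDNN engine global indices that must never be
   //   selected for the convolution forward pass, e.g. "0,4,17".
   //   Unset or empty (the default) excludes nothing.
   ```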

##########
File path: src/operator/nn/convolution.cu
##########
@@ -18,6 +18,7 @@
  */
 
 /*!
+ * Copyright (c) 2017 by Contributors

Review comment:
       These `by Contributors` lines are no longer included.

##########
File path: src/operator/cudnn_ops.cc
##########
@@ -0,0 +1,765 @@
+                       handle,
+                       CUDNN_ATTR_OPERATIONGRAPH_OPS,
+                       ops);
+}
+
+ConvParam::ConvParam(const ConvolutionParam& p, bool add_to)
+    : kernel(p.kernel),
+      stride(p.stride),
+      dilate(p.dilate),
+      pad(p.pad),
+      num_filter(p.num_filter),
+      num_group(p.num_group),
+      workspace(p.workspace),
+      cudnn_tune(p.cudnn_tune),
+      layout(p.layout),
+      add_to(add_to) {}
+
+ConvParam::ConvParam(const DeconvolutionParam& p, bool add_to)
+    : kernel(p.kernel),
+      stride(p.stride),
+      dilate(p.dilate),
+      pad(p.pad),
+      num_filter(p.num_filter),
+      num_group(p.num_group),
+      workspace(p.workspace),
+      cudnn_tune(p.cudnn_tune),
+      layout(p.layout),
+      add_to(add_to) {}
+
+void TuneWarnOnce() {
+  thread_local bool done = false;
+  if (!done) {
+    LOG(INFO) << "Auto-tuning cuDNN op, set MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable";
+    done = true;
+  }
+}
+
+std::vector<Descriptor> MakeFallbackPlans(
+    const std::vector<int64_t>& ixs,
+    cudnnHandle_t handle,
+    const Descriptor& op_graph,
+    size_t workspace_limit,
+    size_t* max_workspace,
+    const std::unordered_set<int64_t>& excl_engines,
+    const std::vector<cudnnBackendNumericalNote_t>& req_numeric,
+    const std::vector<cudnnBackendNumericalNote_t>& excl_numeric
+#if CUDNN_VERSION >= 8200
+    ,
+    const std::vector<cudnnBackendBehaviorNote_t>& req_behavior,
+    const std::vector<cudnnBackendBehaviorNote_t>& excl_behavior
+#endif  // CUDNN_VERSION >= 8200
+) {
+  std::vector<Descriptor> plans;
+  if (max_workspace)
+    *max_workspace = 0;
+  for (auto ix : ixs) {
+    if (excl_engines.count(ix))
+      continue;
+    auto engine = Make(CUDNN_BACKEND_ENGINE_DESCRIPTOR,
+                       CUDNN_ATTR_ENGINE_OPERATION_GRAPH,
+                       op_graph,
+                       CUDNN_ATTR_ENGINE_GLOBAL_INDEX,
+                       ix);
+    auto err    = cudnnBackendFinalize(engine.get());
+    if (err == CUDNN_STATUS_NOT_SUPPORTED || err == CUDNN_STATUS_ARCH_MISMATCH)
+      continue;
+    if (err != CUDNN_STATUS_SUCCESS) {
+      LOG(WARNING) << "Unexpected cuDNN status: " << err << ": " << cudnnGetErrorString(err);
+      continue;
+    }
+    auto cfg =
+        MakeFinalized(CUDNN_BACKEND_ENGINECFG_DESCRIPTOR, CUDNN_ATTR_ENGINECFG_ENGINE, engine);
+    auto plan = Make(CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR,
+                     CUDNN_ATTR_EXECUTION_PLAN_HANDLE,
+                     handle,
+                     CUDNN_ATTR_EXECUTION_PLAN_ENGINE_CONFIG,
+                     cfg);
+    err       = cudnnBackendFinalize(plan.get());
+    if (err == CUDNN_STATUS_NOT_SUPPORTED || err == CUDNN_STATUS_ARCH_MISMATCH)
+      continue;
+    if (err != CUDNN_STATUS_SUCCESS) {
+      LOG(WARNING) << "Unexpected cuDNN status: " << err << ": " << cudnnGetErrorString(err);
+      continue;
+    }
+    auto workspace = GetAttr<int64_t>(plan, CUDNN_ATTR_EXECUTION_PLAN_WORKSPACE_SIZE);
+    if (workspace > workspace_limit)
+      continue;
+    auto numerical = GetSomeAttrs<cudnnBackendNumericalNote_t>(
+        CUDNN_NUMERICAL_NOTE_TYPE_COUNT, engine, CUDNN_ATTR_ENGINE_NUMERICAL_NOTE);
+    if (!IsCompatible(numerical, req_numeric, excl_numeric))
+      continue;
+#if CUDNN_VERSION >= 8200
+    auto behavior = GetSomeAttrs<cudnnBackendBehaviorNote_t>(
+        CUDNN_BEHAVIOR_NOTE_TYPE_COUNT, engine, CUDNN_ATTR_ENGINE_BEHAVIOR_NOTE);
+    if (!IsCompatible(behavior, req_behavior, excl_behavior))
+      continue;
+#endif  // CUDNN_VERSION >= 8200
+    plans.push_back(std::move(plan));
+    if (max_workspace)
+      *max_workspace = std::max(*max_workspace, static_cast<size_t>(workspace));
+  }
+  return plans;
+}
+
+cudnnBackendHeurMode_t HeurMode() {
+#if CUDNN_VERSION >= 8100
+  auto minor       = cudnnGetVersion() / 100 % 10;
+  int default_mode = minor < 2 ? CUDNN_HEUR_MODE_INSTANT : CUDNN_HEUR_MODE_B;
+#else
+  int default_mode = CUDNN_HEUR_MODE_INSTANT;
+#endif  // CUDNN_VERSION >= 8100
+  return static_cast<cudnnBackendHeurMode_t>(dmlc::GetEnv("MXNET_CUDNN_HEUR_MODE", default_mode));
+}
+
+std::string ConvParamStr(const ConvParam& param) {
+  std::ostringstream ss;
+  ss << " layout: " << param.layout.value();
+  ss << " kernel: " << param.kernel;
+  ss << " stride: " << param.stride;
+  ss << " dilate: " << param.dilate;
+  ss << " pad: " << param.pad;
+  ss << " num_filter: " << param.num_filter;
+  ss << " num_group: " << param.num_group;
+  ss << " workspace: " << param.workspace;
+  return ss.str();
+}
+
+size_t GetWorkspace(const Descriptor& plan) {
+  return GetAttr<int64_t>(plan, CUDNN_ATTR_EXECUTION_PLAN_WORKSPACE_SIZE);
+}
+
+Storage::Handle FailsafeAlloc(size_t workspace_size) {
+  return Storage::Get()->Alloc(workspace_size, Context::GPU(), true);
+}
+
+Storage::Handle AllocWorkspace(std::vector<Descriptor>* plans, size_t* workspace_size) {
+  Storage::Handle workspace;
+  size_t alloc_size = *workspace_size;
+  while ((workspace = FailsafeAlloc(alloc_size)).dptr == nullptr && alloc_size > 0) {
+    // Remove any plan whose workspace_size equals the failed allocation size
+    auto hasMaxWorkspace = [alloc_size](auto const& plan) {
+      return GetWorkspace(plan) == alloc_size;
+    };
+    plans->erase(std::remove_if(plans->begin(), plans->end(), hasMaxWorkspace), plans->end());
+    // Calculate new maximum workspace_size for remaining plans
+    alloc_size = 0;
+    for (auto& plan : *plans)
+      alloc_size = std::max(alloc_size, GetWorkspace(plan));
+  }
+  *workspace_size = alloc_size;
+  return workspace;
+}
+
+std::unordered_set<int64_t> ExcludeEngines(const std::string& env_var) {
+  std::string engines = dmlc::GetEnv(env_var.c_str(), std::string());
+  std::replace(engines.begin(), engines.end(), ',', ' ');
+  std::istringstream ss(engines);
+  return std::unordered_set<int64_t>(std::istream_iterator<int64_t>(ss),
+                                     std::istream_iterator<int64_t>());
+}
+
+Descriptor SelectPlan(const OpContext& ctx,
+                      const ConvParam& param,
+                      Descriptor op,
+                      size_t n_fallbacks,
+                      const std::function<std::string()>& make_op_str,
+                      const std::vector<int64_t>& ids,
+                      const std::vector<void*>& tensor_ptrs,
+                      int64_t out_size,
+                      const std::string& excl_engines_var) {
+  auto s = ctx.get_stream<gpu>();
+  std::vector<Descriptor> ops;
+  ops.push_back(std::move(op));
+  auto op_graph = MakeOpGraph(s->dnn_handle_, ops);
+
+  int verbose = dmlc::GetEnv("MXNET_CUDNN_ALGO_VERBOSE_LEVEL", 0);
+  if (verbose > 0)
+    LOG(INFO) << "Selecting plan for " << make_op_str() << ":";
+
+  auto tune = param.cudnn_tune
+                  ? param.cudnn_tune.value()
+                  : dmlc::GetEnv("MXNET_CUDNN_AUTOTUNE_DEFAULT", static_cast<int>(conv::kLimited));
+  size_t workspace_size = 0;
+  size_t workspace_limit =
+      tune != conv::kFastest ? param.workspace << 20 : std::numeric_limits<size_t>::max();
+  auto excl_engines = ExcludeEngines(excl_engines_var);
+  auto plans        = GetPlans(HeurMode(),
+                        s->dnn_handle_,
+                        op_graph,
+                        workspace_limit,
+                        &workspace_size,
+                        excl_engines,
+                        RequireNumerics(),
+                        ExcludeNumerics(),
+#if CUDNN_VERSION >= 8200
+                        {},
+                        {},
+#endif  // CUDNN_VERSION >= 8200
+                        verbose > 1);
+  Storage::Handle out_space;
+  auto ptrs = tensor_ptrs;
+  if (tune != conv::kOff && param.add_to) {
+    // Cannot trash output tensor while auto-tuning.
+    out_space = FailsafeAlloc(out_size);
+    if (out_space.dptr)
+      ptrs.back() = out_space.dptr;
+  }
+  // TODO:
+  //     - should we be able to ask the tempspace for its current size, then
+  //       alloc the workspace from the tempspace if its current size > workspace_size?
+  auto workspace = AllocWorkspace(&plans, &workspace_size);
+
+  if (plans.empty()) {
+    std::vector<int64_t> ixs(n_fallbacks);
+    std::iota(ixs.begin(), ixs.end(), 0);
+#if CUDNN_VERSION >= 8200
+    plans = MakeFallbackPlans(ixs,
+                              s->dnn_handle_,
+                              op_graph,
+                              workspace_limit,
+                              &workspace_size,
+                              excl_engines,
+                              RequireNumerics(),
+                              ExcludeNumerics(),
+                              {},
+                              {});
+#else
+    plans = MakeFallbackPlans(ixs,
+                              s->dnn_handle_,
+                              op_graph,
+                              workspace_limit,
+                              &workspace_size,
+                              excl_engines,
+                              RequireNumerics(),
+                              ExcludeNumerics());
+#endif  // CUDNN_VERSION >= 8200
+    workspace = AllocWorkspace(&plans, &workspace_size);
+    CHECK(!plans.empty());
+    LOG(WARNING) << "Using fallback engine(s) for " << make_op_str();
+  }
+
+  if (tune == conv::kOff || plans.size() == 1 || (param.add_to && !out_space.dptr)) {
+    if (verbose > 0)
+      LOG(INFO) << " " << PlanStr(plans[0]);
+    Storage::Get()->Free(out_space);
+    Storage::Get()->Free(workspace);
+    return std::move(plans[0]);
+  }
+
+  TuneWarnOnce();
+  size_t n      = verbose > 1 ? plans.size() : 1;
+  auto var_pack = MakeFinalized(CUDNN_BACKEND_VARIANT_PACK_DESCRIPTOR,
+                                CUDNN_ATTR_VARIANT_PACK_UNIQUE_IDS,
+                                ids,
+                                CUDNN_ATTR_VARIANT_PACK_DATA_POINTERS,
+                                ptrs,
+                                CUDNN_ATTR_VARIANT_PACK_WORKSPACE,
+                                workspace.dptr);
+  auto top      = FindTopPlans(std::move(plans), n, s->dnn_handle_, var_pack, MakeAvgSampler(3));
+  Storage::Get()->Free(out_space);
+  Storage::Get()->Free(workspace);

Review comment:
       Recommending DirectFree() here, per earlier comment.
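
       A minimal sketch of that suggestion (assuming the usual MXNet Storage interface; not the PR's final code):

           // Hand the one-off tuning buffers straight back to the device,
           // bypassing the pooled allocator, since they are never reused:
           Storage::Get()->DirectFree(out_space);
           Storage::Get()->DirectFree(workspace);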

##########
File path: src/storage/gpu_device_storage.h
##########
@@ -39,21 +39,28 @@ class GPUDeviceStorage {
    * \brief Allocation.
    * \param handle Handle struct.
    */
-  inline static void Alloc(Storage::Handle* handle);
+  inline static void Alloc(Storage::Handle* handle, bool failsafe = false);

Review comment:
       Add doc of new param `failsafe` per earlier comment.
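
       For example, a doc line of this shape (suggested wording only, matching the null-dptr-on-failure semantics of FailsafeAlloc in this PR):

           * \param failsafe If true, return a handle with a null dptr on
           *        allocation failure instead of raising an out-of-memory error.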

##########
File path: src/storage/pinned_memory_storage.h
##########
@@ -36,7 +36,7 @@ class PinnedMemoryStorage {
    * \brief Allocation.
    * \param handle Handle struct.
    */
-  inline static void Alloc(Storage::Handle* handle);
+  inline static void Alloc(Storage::Handle* handle, bool failsafe);

Review comment:
       Add doc of new param `failsafe` per earlier comment.

##########
File path: tests/python/gpu/test_gluon_model_zoo_gpu.py
##########
@@ -39,7 +39,8 @@ def download_data():
 
 @mx.util.use_np
 @pytest.mark.serial
-@pytest.mark.parametrize('model_name', ['resnet50_v1', 'vgg19_bn', 'alexnet', 'densenet201', 'squeezenet1.0', 'mobilenet0.25'])
+# TODO(vcherepanov): mobilenet0.25 fails this test
+@pytest.mark.parametrize('model_name', ['resnet50_v1', 'vgg19_bn', 'alexnet', 'densenet201', 'squeezenet1.0'])

Review comment:
       I'm OK with removing mobilenet0.25.  I too experienced failures with this test on mobilenet and did not completely troubleshoot it.  I found that it failed non-deterministically on P100 & TU102 when MXNET_GPU_WORKER_NTHREADS=2.
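
       A repro of that shape (illustrative; the exact pytest selector may differ):

           MXNET_GPU_WORKER_NTHREADS=2 pytest -k mobilenet tests/python/gpu/test_gluon_model_zoo_gpu.py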

##########
File path: src/storage/cpu_device_storage.h
##########
@@ -38,7 +38,7 @@ class CPUDeviceStorage {
    * \brief Aligned allocation on CPU.
    * \param handle Handle struct.
    */
-  inline static void Alloc(Storage::Handle* handle);
+  inline static void Alloc(Storage::Handle* handle, bool failsafe = false);

Review comment:
       Add doc of new param `failsafe` per earlier comment.

##########
File path: src/storage/storage_manager.h
##########
@@ -40,7 +40,7 @@ class StorageManager {
    * \brief Allocation.
    * \param handle Handle struct.
    */
-  virtual void Alloc(Storage::Handle* handle) = 0;
+  virtual void Alloc(Storage::Handle* handle, bool failsafe = false) = 0;

Review comment:
       Add doc of new param `failsafe` per earlier comment.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@mxnet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-mxnet] mxnet-bot commented on pull request #20635: Port convolutions to cuDNN v8 API

Posted by GitBox <gi...@apache.org>.
mxnet-bot commented on pull request #20635:
URL: https://github.com/apache/incubator-mxnet/pull/20635#issuecomment-948987082


   Jenkins CI successfully triggered : [centos-gpu]


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@mxnet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-mxnet] mk-61 commented on pull request #20635: Port convolutions to cuDNN v8 API

Posted by GitBox <gi...@apache.org>.
mk-61 commented on pull request #20635:
URL: https://github.com/apache/incubator-mxnet/pull/20635#issuecomment-948987014


   @mxnet-bot run ci [centos-gpu]


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@mxnet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-mxnet] mk-61 commented on pull request #20635: Port convolutions to cuDNN v8 API

Posted by GitBox <gi...@apache.org>.
mk-61 commented on pull request #20635:
URL: https://github.com/apache/incubator-mxnet/pull/20635#issuecomment-939188098


    @mxnet-bot run ci [greeting, unix-gpu]


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@mxnet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-mxnet] mk-61 commented on pull request #20635: Port convolutions to cuDNN v8 API

Posted by GitBox <gi...@apache.org>.
mk-61 commented on pull request #20635:
URL: https://github.com/apache/incubator-mxnet/pull/20635#issuecomment-949325898


   @mxnet-bot run ci [centos-gpu]


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@mxnet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-mxnet] mk-61 commented on pull request #20635: Port convolutions to cuDNN v8 API

Posted by GitBox <gi...@apache.org>.
mk-61 commented on pull request #20635:
URL: https://github.com/apache/incubator-mxnet/pull/20635#issuecomment-966497049


   @mxnet-bot run ci [unix-gpu, centos-gpu]


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@mxnet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-mxnet] DickJC123 commented on pull request #20635: Port convolutions to cuDNN v8 API

Posted by GitBox <gi...@apache.org>.
DickJC123 commented on pull request #20635:
URL: https://github.com/apache/incubator-mxnet/pull/20635#issuecomment-969231589


   Thanks again for your responsiveness to our review comments, and patience throughout these recent CI issues (e.g. https://github.com/apache/incubator-mxnet/issues/20738).  The problem with the apt repo server seems to have been solved, and this PR has passed without new work-arounds added.  LGTM.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@mxnet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-mxnet] mxnet-bot commented on pull request #20635: Port convolutions to cuDNN v8 API

Posted by GitBox <gi...@apache.org>.
mxnet-bot commented on pull request #20635:
URL: https://github.com/apache/incubator-mxnet/pull/20635#issuecomment-966538819


   Jenkins CI successfully triggered : [website, unix-cpu, miscellaneous]


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@mxnet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-mxnet] mk-61 commented on pull request #20635: Port convolutions to cuDNN v8 API

Posted by GitBox <gi...@apache.org>.
mk-61 commented on pull request #20635:
URL: https://github.com/apache/incubator-mxnet/pull/20635#issuecomment-966538746


   @mxnet-bot run ci [unix-cpu, website, miscellaneous]


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@mxnet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-mxnet] mxnet-bot commented on pull request #20635: Port convolutions to cuDNN v8 API

Posted by GitBox <gi...@apache.org>.
mxnet-bot commented on pull request #20635:
URL: https://github.com/apache/incubator-mxnet/pull/20635#issuecomment-966605632


   Jenkins CI successfully triggered : [unix-gpu, miscellaneous]


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@mxnet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-mxnet] mxnet-bot commented on pull request #20635: Port convolutions to cuDNN v8 API

Posted by GitBox <gi...@apache.org>.
mxnet-bot commented on pull request #20635:
URL: https://github.com/apache/incubator-mxnet/pull/20635#issuecomment-967553055


   Jenkins CI successfully triggered : [unix-gpu]


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@mxnet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-mxnet] DickJC123 commented on a change in pull request #20635: Port convolutions to cuDNN v8 API

Posted by GitBox <gi...@apache.org>.
DickJC123 commented on a change in pull request #20635:
URL: https://github.com/apache/incubator-mxnet/pull/20635#discussion_r745831518



##########
File path: src/operator/cudnn_ops.cc
##########
@@ -0,0 +1,765 @@
+cudnnBackendHeurMode_t HeurMode() {
+#if CUDNN_VERSION >= 8100
+  auto minor       = cudnnGetVersion() / 100 % 10;
+  int default_mode = minor < 2 ? CUDNN_HEUR_MODE_INSTANT : CUDNN_HEUR_MODE_B;
+#else
+  int default_mode = CUDNN_HEUR_MODE_INSTANT;
+#endif  // CUDNN_VERSION >= 8100
+  return static_cast<cudnnBackendHeurMode_t>(dmlc::GetEnv("MXNET_CUDNN_HEUR_MODE", default_mode));

Review comment:
       The conclusion here is that it's simpler to define this as integers that align with the current definition.  If future versions of cudnn*.h invalidate this, we can choose to insert a remapping function at that time if we feel it's important to maintain backward compatibility.
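
       If that remapping ever becomes necessary, it could look like this (hypothetical sketch; RemapHeurMode does not exist in this PR):

           // Map the legacy integer values onto a renumbered enum:
           cudnnBackendHeurMode_t RemapHeurMode(int user_value) {
             switch (user_value) {
               case 0:  return CUDNN_HEUR_MODE_INSTANT;  // current meaning of "0"
               case 1:  return CUDNN_HEUR_MODE_B;        // current meaning of "1"
               default: return static_cast<cudnnBackendHeurMode_t>(user_value);
             }
           }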




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@mxnet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-mxnet] mk-61 commented on pull request #20635: Port convolutions to cuDNN v8 API

Posted by GitBox <gi...@apache.org>.
mk-61 commented on pull request #20635:
URL: https://github.com/apache/incubator-mxnet/pull/20635#issuecomment-963838569


   @mxnet-bot run ci [windows-gpu]


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@mxnet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-mxnet] mxnet-bot commented on pull request #20635: Port convolutions to cuDNN v8 API

Posted by GitBox <gi...@apache.org>.
mxnet-bot commented on pull request #20635:
URL: https://github.com/apache/incubator-mxnet/pull/20635#issuecomment-965842320


   Jenkins CI successfully triggered : [website, unix-cpu, miscellaneous, unix-gpu]


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@mxnet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-mxnet] mk-61 commented on pull request #20635: Port convolutions to cuDNN v8 API

Posted by GitBox <gi...@apache.org>.
mk-61 commented on pull request #20635:
URL: https://github.com/apache/incubator-mxnet/pull/20635#issuecomment-965842279


   @mxnet-bot run ci [miscellaneous, unix-cpu, unix-gpu, website]


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@mxnet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-mxnet] mk-61 commented on pull request #20635: Port convolutions to cuDNN v8 API

Posted by GitBox <gi...@apache.org>.
mk-61 commented on pull request #20635:
URL: https://github.com/apache/incubator-mxnet/pull/20635#issuecomment-937997854


   @mxnet-bot run ci [centos-gpu]


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@mxnet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-mxnet] mxnet-bot commented on pull request #20635: Port convolutions to cuDNN v8 API

Posted by GitBox <gi...@apache.org>.
mxnet-bot commented on pull request #20635:
URL: https://github.com/apache/incubator-mxnet/pull/20635#issuecomment-946266068


   Jenkins CI successfully triggered : [unix-gpu, windows-gpu, centos-gpu]


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@mxnet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-mxnet] mxnet-bot commented on pull request #20635: Port convolutions to cuDNN v8 API

Posted by GitBox <gi...@apache.org>.
mxnet-bot commented on pull request #20635:
URL: https://github.com/apache/incubator-mxnet/pull/20635#issuecomment-946352577


   Jenkins CI successfully triggered : [windows-gpu, unix-gpu, centos-gpu]


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@mxnet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-mxnet] mk-61 commented on pull request #20635: Port convolutions to cuDNN v8 API

Posted by GitBox <gi...@apache.org>.
mk-61 commented on pull request #20635:
URL: https://github.com/apache/incubator-mxnet/pull/20635#issuecomment-946352542


   @mxnet-bot run ci [centos-gpu, unix-gpu, windows-gpu]


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@mxnet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-mxnet] mk-61 commented on pull request #20635: Port convolutions to cuDNN v8 API

Posted by GitBox <gi...@apache.org>.
mk-61 commented on pull request #20635:
URL: https://github.com/apache/incubator-mxnet/pull/20635#issuecomment-949886381


   @mxnet-bot run ci [centos-gpu, unix-gpu, windows-gpu]


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@mxnet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org