
Posted to commits@tvm.apache.org by GitBox <gi...@apache.org> on 2021/05/05 21:13:14 UTC

[GitHub] [tvm] tkonolige opened a new pull request #7983: [PROFILING] Use PAPI to collect hardware performance counters on CPU and CUDA

tkonolige opened a new pull request #7983:
URL: https://github.com/apache/tvm/pull/7983


   This PR adds an optional dependency on PAPI (https://bitbucket.org/icl/papi/) to collect hardware performance counters on CPU and CUDA. These counters include data such as total cycles, instructions executed, and cache misses. Users can control which counters are collected by setting the TVM_PAPI_${DEVICE}_METRICS environment variable to a semicolon-separated list of metrics.
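As a rough illustration of the environment variable described above, the following Python sketch sets `TVM_PAPI_CPU_METRICS` and splits it into metric names. The helper `parse_metrics` is hypothetical and not part of TVM. The metric names are borrowed from the CPU defaults in this PR's `papi.cc` and may not exist on every machine (`papi_native_avail` lists the ones that do). Dropping empty entries is a choice made for this sketch only.

```python
import os


def parse_metrics(device_name, environ=os.environ):
    """Hypothetical helper: read TVM_PAPI_<DEVICE>_METRICS as a list of names."""
    raw = environ.get("TVM_PAPI_" + device_name.upper() + "_METRICS")
    if raw is None:
        return None  # the caller would fall back to its default metrics
    # Split on ';' and drop empty entries such as a trailing separator.
    return [name for name in raw.split(";") if name]


# The variable presumably has to be set before the profiler is constructed.
os.environ["TVM_PAPI_CPU_METRICS"] = "perf::CYCLES;perf::INSTRUCTIONS;perf::CACHE-MISSES"
print(parse_metrics("cpu"))
```

Returning `None` when the variable is unset keeps "not configured" distinct from "configured as empty", which mirrors the fallback to default metrics described in the PR.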
   
   @leandron @areusch @junrushao1994 @icemelon9 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [tvm] areusch commented on pull request #7983: [PROFILING] Use PAPI to collect hardware performance counters on CPU and CUDA

Posted by GitBox <gi...@apache.org>.

areusch commented on pull request #7983:
URL: https://github.com/apache/tvm/pull/7983#issuecomment-875759198


   @leandron @tqchen please take a look and explicitly approve



[GitHub] [tvm] areusch commented on a change in pull request #7983: [PROFILING] Use PAPI to collect hardware performance counters on CPU and CUDA

Posted by GitBox <gi...@apache.org>.

areusch commented on a change in pull request #7983:
URL: https://github.com/apache/tvm/pull/7983#discussion_r662484447



##########
File path: src/runtime/thread_pool.cc
##########
@@ -278,6 +278,22 @@ class ThreadPool {
     }
     threads_.reset();
   }
+  void Reset() {
+    for (std::unique_ptr<SpscTaskQueue>& q : queues_) {
+      q->SignalForKill();
+    }
+    queues_.clear();
+    threads_.reset();
+    for (int i = 0; i < num_workers_; ++i) {
+      // The SpscTaskQueue only hosts ONE item at a time
+      queues_.emplace_back(std::unique_ptr<SpscTaskQueue>(new SpscTaskQueue()));
+    }
+    threads_ = std::unique_ptr<tvm::runtime::threading::ThreadGroup>(

Review comment:
       It would be great if this logic could be shared with the constructor.
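One way to read this suggestion, sketched in Python rather than the C++ of `thread_pool.cc`: put the queue and thread creation in a single private method that both the constructor and `reset()` call. `WorkerPool` and all of its members are hypothetical names for illustration. This is not TVM's `ThreadPool`.

```python
import queue
import threading


class WorkerPool:
    """Toy pool whose constructor and reset() share one setup path."""

    def __init__(self, num_workers):
        self.num_workers = num_workers
        self._queues = []
        self._threads = []
        self._launch()

    def _launch(self):
        # The only place that creates queues and worker threads.
        self._queues = [queue.Queue(maxsize=1) for _ in range(self.num_workers)]
        self._threads = [
            threading.Thread(target=self._run, args=(q,), daemon=True)
            for q in self._queues
        ]
        for t in self._threads:
            t.start()

    @staticmethod
    def _run(q):
        # Execute tasks until the kill signal (None) arrives.
        while True:
            task = q.get()
            if task is None:
                return
            task()

    def shutdown(self):
        # Signal every worker to exit, then wait for all of them.
        for q in self._queues:
            q.put(None)
        for t in self._threads:
            t.join()

    def reset(self):
        # Tear down the old workers, then reuse the constructor's setup code.
        self.shutdown()
        self._launch()
```

With this layout, a change to how workers are created only has to be made in `_launch`.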

##########
File path: src/runtime/profiling.cc
##########
@@ -100,16 +102,37 @@ TVM_REGISTER_GLOBAL("profiling.start_timer").set_body_typed(Timer::Start);
 
 namespace profiling {
 
-void Profiler::Start(const std::vector<Device>& devs) {
-  CHECK(global_timers_.empty()) << "You can only call Start once per Profiler.";
+Profiler::Profiler(std::vector<Device> devs, std::vector<MetricCollector> metric_collectors)
+    : devs_(devs), collectors_(metric_collectors) {
+  is_running_ = false;
+  std::vector<DeviceWrapper> wrapped_devs;
   for (auto dev : devs) {
-    global_timers_.emplace_back(dev, Timer::Start(dev));
+    wrapped_devs.push_back(DeviceWrapper(make_object<DeviceWrapperNode>(dev)));
+  }
+  for (auto& x : collectors_) {
+    x->Init(wrapped_devs);
+  }
+  // reset the thread pool so that PAPI eventset hooks are set in all threads.
+  threading::ResetThreadPool();

Review comment:
       is this side effect documented anywhere?

##########
File path: python/tvm/runtime/profiler_vm.py
##########
@@ -50,14 +50,17 @@ def get_stat(self, sort_by_time=True):  # pylint: disable=unused-argument
         warnings.warn("get_stat has been removed, use profile instead")
         return ""
 
-    def profile(self, *args, func_name="main", **kwargs):
+    def profile(self, *args, func_name="main", collectors=[], **kwargs):

Review comment:
       can you fix this one too?





[GitHub] [tvm] tkonolige commented on pull request #7983: [PROFILING] Use PAPI to collect hardware performance counters on CPU and CUDA

Posted by GitBox <gi...@apache.org>.

tkonolige commented on pull request #7983:
URL: https://github.com/apache/tvm/pull/7983#issuecomment-869949125


   @leandron @tqchen @areusch Can you review?



[GitHub] [tvm] tkonolige commented on a change in pull request #7983: [PROFILING] Use PAPI to collect hardware performance counters on CPU and CUDA

Posted by GitBox <gi...@apache.org>.

tkonolige commented on a change in pull request #7983:
URL: https://github.com/apache/tvm/pull/7983#discussion_r627761077



##########
File path: src/runtime/contrib/papi/papi.cc
##########
@@ -0,0 +1,275 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*!
+ * \brief Performance counters for profiling via the PAPI library.
+ */
+#ifndef TVM_RUNTIME_CONTRIB_PAPI_PAPI_H_
+#define TVM_RUNTIME_CONTRIB_PAPI_PAPI_H_
+
+#include <papi.h>
+#include <tvm/runtime/profiling.h>
+
+#include <string>
+#include <unordered_map>
+#include <vector>
+
+namespace tvm {
+namespace runtime {
+namespace profiling {
+
+#define PAPI_CALL(func)                                                         \
+  {                                                                             \
+    int e = (func);                                                             \
+    if (e < 0) {                                                                \
+      LOG(FATAL) << "PAPIError: " << e << " " << std::string(PAPI_strerror(e)); \
+    }                                                                           \
+  }
+
+static const std::unordered_map<DLDeviceType, std::vector<std::string>> default_metrics = {
+    {kDLCPU,
+     {"perf::CYCLES", "perf::STALLED-CYCLES-FRONTEND", "perf::STALLED-CYCLES-BACKEND",
+      "perf::INSTRUCTIONS", "perf::CACHE-MISSES"}},
+    {kDLGPU, {"cuda:::event:elapsed_cycles_sm:device=0"}}};
+
+/*! \brief Object that holds the values of counters at the start of a function call. */
+struct PAPIEventSetNode : public Object {
+  /*! \brief The starting values of counters for all metrics of a specific device. */
+  std::vector<long_long> start_values;
+  /*! \brief The device these counters are for. */
+  Device dev;
+
+  explicit PAPIEventSetNode(std::vector<long_long> start_values, Device dev)
+      : start_values(start_values), dev(dev) {}
+
+  static constexpr const char* _type_key = "PAPIEventSetNode";
+  TVM_DECLARE_FINAL_OBJECT_INFO(PAPIEventSetNode, Object);
+};
+
+int component_for_device(Device dev) {
+  std::string component_name;
+  switch (dev.device_type) {
+    case kDLCPU:
+    case kDLCPUPinned:
+      component_name = "perf_event";
+      break;
+    case kDLGPU:
+      component_name = "cuda";
+      break;
+    case kDLROCM:
+      component_name = "rocm";
+      break;
+    default:
+      LOG(WARNING) << "PAPI does not support device " << DeviceName(dev.device_type);
+      return -1;
+  }
+  int cidx = PAPI_get_component_index(component_name.c_str());
+  if (cidx < 0) {
+    LOG(FATAL) << "Cannot find PAPI component \"" << component_name
+               << "\". Maybe you need to build PAPI with support for this component (use "
+                  "`./configure --components="
+               << component_name << "`).";
+  }
+  return cidx;
+}
+
+/*! \brief MetricCollectorNode for PAPI metrics.
+ *
+ * PAPI (Performance Application Programming Interface) collects metrics on a
+ * variety of platforms including cpu, cuda and rocm.
+ *
+ * PAPI is available at https://bitbucket.org/icl/papi/src/master/.
+ *
+ * Users can change the metrics collected by setting the environment
+ * variable `TVM_PAPI_${device_name}_METRICS` with a semicolon separated list
+ * of metrics. Use the `papi_native_avail` tool to find the name of all
+ * available metrics.
+ */
+struct PAPIMetricCollectorNode final : public MetricCollectorNode {
+  explicit PAPIMetricCollectorNode(Array<DeviceWrapper> devices) {
+    if (!PAPI_is_initialized()) {
+      PAPI_CALL(PAPI_library_init(PAPI_VER_CURRENT));
+    }
+
+    // create event sets for each device
+    for (auto wrapped_device : devices) {
+      Device device = wrapped_device->device;
+      int cidx = component_for_device(device);
+      // unknown device, skipping
+      if (cidx < 0) {
+        continue;
+      }
+
+      const PAPI_component_info_t* component;
+      component = PAPI_get_component_info(cidx);
+      if (component->disabled) {
+        std::string help_message = "";
+        switch (device.device_type) {
+          case kDLCPU:
+          case kDLCPUPinned:
+            help_message =
+                "Try setting `sudo sh -c 'echo 1 >/proc/sys/kernel/perf_event_paranoid'`";
+            break;
+          case kDLGPU:
+            help_message =
+                "Try enabling gpu profiling with `modprobe nvidia "
+                "NVreg_RestrictProfilingToAdminUsers=0`. If that does not work, try adding  "
+                "`options nvidia \"NVreg_RestrictProfilingToAdminUsers=0\"` to "
+                "`/etc/modprobe.d/nvidia-kernel-common.conf`.";
+            break;
+          default:
+            break;
+        }
+        LOG(WARNING) << "PAPI could not initialize counters for " << DeviceName(device.device_type)
+                     << ": " << component->disabled_reason << "\n"
+                     << help_message;
+        continue;
+      }
+
+      int event_set = PAPI_NULL;
+      PAPI_CALL(PAPI_create_eventset(&event_set));
+      PAPI_CALL(PAPI_assign_eventset_component(event_set, cidx));
+      if (device.device_type == kDLCPU) {
+        // we set PAPI_INHERIT to make it so threads created after this inherit the event_set.
+        PAPI_option_t opt;
+        memset(&opt, 0x0, sizeof(PAPI_option_t));
+        opt.inherit.inherit = PAPI_INHERIT_ALL;
+        opt.inherit.eventset = event_set;
+        PAPI_CALL(PAPI_set_opt(PAPI_INHERIT, &opt));
+      }
+
+      // load default metrics for device or read them from an environment variable
+      std::vector<std::string> metrics;
+      std::string dev_name = DeviceName(device.device_type);
+      std::transform(dev_name.begin(), dev_name.end(), dev_name.begin(),
+                     [](unsigned char c) { return std::toupper(c); });
+      const char* env_p =
+          std::getenv((std::string("TVM_PAPI_") + dev_name + std::string("_METRICS")).c_str());
+      if (env_p != nullptr) {
+        std::string metric_string = env_p;
+        size_t loc = 0;
+        while (loc < metric_string.size()) {
+          size_t next = metric_string.find(';', loc);
+          if (next == metric_string.npos) {
+            next = metric_string.size();
+          }
+          metrics.push_back(metric_string.substr(loc, next - loc));
+          loc = next + 1;
+        }
+      } else {
+        auto it = default_metrics.find(device.device_type);
+        if (it != default_metrics.end()) {
+          metrics = it->second;
+        } else {
+          LOG(WARNING) << "No default metrics set for " << dev_name
+                       << ". You can specify metrics with the environment variable TVM_PAPI_"
+                       << dev_name << "_METRICS.";
+        }
+      }
+      // skip if no metrics exist
+      if (metrics.size() == 0) {
+        continue;
+      }
+      papi_metrics[device] = metrics;
+
+      if (static_cast<int>(metrics.size()) > PAPI_num_cmp_hwctrs(cidx)) {
+        PAPI_CALL(PAPI_set_multiplex(event_set));
+      }
+
+      // add all the metrics
+      for (auto metric : metrics) {
+        int e = PAPI_add_named_event(event_set, metric.c_str());
+        if (e != PAPI_OK) {
+          LOG(FATAL) << "PAPIError: " << e << " " << std::string(PAPI_strerror(e)) << ": " << metric
+                     << ".";
+        }
+      }
+      // Because we may have multiple calls in flight at the same time, we
+      // start all the timers when we initialize. Then we calculate the metrics
+      // counts for a call by comparing counter values at the start vs end of
+      // the call.
+      PAPI_CALL(PAPI_start(event_set));
+      event_sets[device] = event_set;
+    }
+  }
+
+  /*! \brief Called right before a function call.
+   * \param dev The device the function will be run on.
+   * \returns A `PAPIEventSetNode` containing values for the counters at the
+   * start of the call. Passed to a corresponding `Stop` call.
+   */
+  ObjectRef Start(Device dev) final {
+    // Record counter values at the start of the call, so we can calculate the
+    // metrics for the call by comparing the values at the end of the call.
+    auto it = event_sets.find(dev);
+    if (it != event_sets.end()) {
+      int event_set = it->second;
+      std::vector<long_long> values(papi_metrics[dev].size());
+      PAPI_CALL(PAPI_read(event_set, values.data()));
+      return ObjectRef(make_object<PAPIEventSetNode>(values, dev));
+    } else {
+      return ObjectRef(nullptr);
+    }
+  }
+
+  /*! \brief Called right after a function call.
+   * \param obj `PAPIEventSetNode` created by a call to `Start`.
+   * \returns A mapping from metric name to value.
+   */
+  Map<String, ObjectRef> Stop(ObjectRef obj) final {
+    const PAPIEventSetNode* event_set_node = obj.as<PAPIEventSetNode>();
+    std::vector<long_long> end_values(papi_metrics[event_set_node->dev].size());
+    PAPI_CALL(PAPI_read(event_sets[event_set_node->dev], end_values.data()));
+    std::unordered_map<String, ObjectRef> reported_metrics;
+    for (size_t i = 0; i < end_values.size(); i++) {
+      reported_metrics[papi_metrics[event_set_node->dev][i]] =
+          ObjectRef(make_object<CountNode>(end_values[i] - event_set_node->start_values[i]));
+    }
+    return reported_metrics;
+  }
+
+  ~PAPIMetricCollectorNode() final {
+    for (auto p : event_sets) {

Review comment:
       I think the logic here is pretty simple and makes sense. All it is doing is destroying the created event sets.
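For readers following the quoted `Start`/`Stop` pair: the counters are started once and left running, and each call's metrics are the difference between a snapshot taken at the start and a reading taken at the end. The Python sketch below mimics that idea with fake counters. `DeltaCollector` and `read_counters` are hypothetical stand-ins, not TVM or PAPI APIs.

```python
class DeltaCollector:
    """Toy collector: counters run continuously; each call reports end - start."""

    def __init__(self, read_counters, names):
        self._read = read_counters  # callable returning the current counter values
        self._names = names

    def start(self):
        # Snapshot of the counters when the call begins.
        return list(self._read())

    def stop(self, start_values):
        # Per-call count is the current reading minus the snapshot.
        end_values = self._read()
        return {n: e - s for n, s, e in zip(self._names, start_values, end_values)}


# Fake, monotonically increasing counters standing in for a hardware read.
state = {"cycles": 0, "instructions": 0}


def read_counters():
    return [state["cycles"], state["instructions"]]


collector = DeltaCollector(read_counters, ["cycles", "instructions"])

outer = collector.start()
state["cycles"] += 10
state["instructions"] += 4

inner = collector.start()  # a nested call gets its own snapshot
state["cycles"] += 5
state["instructions"] += 2
inner_metrics = collector.stop(inner)

state["cycles"] += 1
outer_metrics = collector.stop(outer)

print(inner_metrics)
print(outer_metrics)
```

Because each call carries its own snapshot, nested or overlapping calls do not interfere with one another, and the outer call's counts include those of the inner call.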





[GitHub] [tvm] tkonolige commented on a change in pull request #7983: [PROFILING] Use PAPI to collect hardware performance counters on CPU and CUDA

Posted by GitBox <gi...@apache.org>.

tkonolige commented on a change in pull request #7983:
URL: https://github.com/apache/tvm/pull/7983#discussion_r660878769



##########
File path: src/runtime/profiling.cc
##########
@@ -118,14 +141,21 @@ void Profiler::StopCall(std::unordered_map<std::string, ObjectRef> extra_metrics
   for (auto& p : extra_metrics) {
     cf.extra_metrics[p.first] = p.second;
   }
+  // collect the extra metrics from user defined collectors
+  for (const auto& obj : cf.extra_collectors) {
+    auto collector_metrics = obj.first->Stop(obj.second);
+    for (auto& p : collector_metrics) {
+      cf.extra_metrics[p.first] = p.second;
+    }
+  }
   in_flight_.pop();
   calls_.push_back(cf);
 }
 
 void Profiler::Stop() {
-  // Stop all global timers. We wait to synchronize until we are making the report.
-  for (auto p : global_timers_) {
-    p.second->Stop();
+  is_running_ = false;

Review comment:
       I'm not sure what errors would occur here?





[GitHub] [tvm] tkonolige commented on a change in pull request #7983: [PROFILING] Use PAPI to collect hardware performance counters on CPU and CUDA

Posted by GitBox <gi...@apache.org>.

tkonolige commented on a change in pull request #7983:
URL: https://github.com/apache/tvm/pull/7983#discussion_r629711264



##########
File path: src/runtime/vm/profiler/vm.cc
##########
@@ -117,11 +120,11 @@ void VirtualMachineDebug::InvokePacked(Index packed_index, const PackedFunc& fun
     }
     metrics["Argument Shapes"] = profiling::ShapeString(shapes);
 
-    prof_.StartCall(packed_index_map_[packed_index], dev, metrics);
+    prof_.operator*().StartCall(packed_index_map_[packed_index], dev, metrics);

Review comment:
       nope :(  `->` is const only I think.





[GitHub] [tvm] tqchen commented on a change in pull request #7983: [PROFILING] Use PAPI to collect hardware performance counters on CPU and CUDA

Posted by GitBox <gi...@apache.org>.

tqchen commented on a change in pull request #7983:
URL: https://github.com/apache/tvm/pull/7983#discussion_r626971170



##########
File path: include/tvm/runtime/c_backend_api.h
##########
@@ -152,6 +152,16 @@ TVM_DLL int TVMBackendParallelBarrier(int task_id, TVMParallelGroupEnv* penv);
  */
 TVM_DLL int TVMBackendRunOnce(void** handle, int (*f)(void*), void* cdata, int nbytes);
 
+/*!
+ * \brief Reset the threads in the pool. All current threads are destroyed and
+ * new ones are created.
+ *
+ * Note that this does nothing when openmp is used.
+ *
+ * \return 0 when no error is thrown, -1 when failure happens
+ */

Review comment:
       Unless it is needed by backend-generated code, we should avoid adding new functions to the C API, since the C API needs to remain generally stable. Instead, consider registering a packed function to do this.
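The pattern being suggested is a lookup by string name in a global table instead of a new exported C symbol. In TVM's C++ code this is done with `TVM_REGISTER_GLOBAL`, as in the `profiling.cc` hunk quoted earlier in this thread. The Python sketch below is only a toy model of the idea. `register_global`, `get_global`, and the name `example.reset_thread_pool` are all hypothetical.

```python
_GLOBAL_FUNCS = {}


def register_global(name):
    """Toy decorator: register a callable under a string name."""

    def decorator(func):
        if name in _GLOBAL_FUNCS:
            raise ValueError("global function already registered: " + name)
        _GLOBAL_FUNCS[name] = func
        return func

    return decorator


def get_global(name):
    """Look up a previously registered callable by name."""
    return _GLOBAL_FUNCS[name]


@register_global("example.reset_thread_pool")  # hypothetical name
def reset_thread_pool():
    return "thread pool reset"


# Callers depend only on the string name, not on a new exported symbol.
print(get_global("example.reset_thread_pool")())
```

Adding or removing an entry in such a table does not change the set of exported symbols, which is the stability concern raised in the comment.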




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [tvm] tkonolige commented on a change in pull request #7983: [PROFILING] Use PAPI to collect hardware performance counters on CPU and CUDA

Posted by GitBox <gi...@apache.org>.

tkonolige commented on a change in pull request #7983:
URL: https://github.com/apache/tvm/pull/7983#discussion_r627755746



##########
File path: src/runtime/contrib/papi/papi.cc
##########
@@ -0,0 +1,275 @@
+  /*! \brief Called right before a function call.
+   * \param dev The device the function will be run on.
+   * \returns A `PAPIEventSetNode` containing values for the counters at the
+   * start of the call. Passed to a corresponding `Stop` call.
+   */
+  ObjectRef Start(Device dev) final {
+    // Record counter values at the start of the call, so we can calculate the
+    // metrics for the call by comparing the values at the end of the call.
+    auto it = event_sets.find(dev);
+    if (it != event_sets.end()) {
+      int event_set = it->second;
+      std::vector<long_long> values(papi_metrics[dev].size());
+      PAPI_CALL(PAPI_read(event_set, values.data()));
+      return ObjectRef(make_object<PAPIEventSetNode>(values, dev));

Review comment:
       Given that calls can be nested, we cannot create a single buffer and reuse it. Maybe we could preallocate a couple and then allocate more if needed?
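A minimal sketch of the "preallocate a couple, allocate more on demand" idea, written in Python for brevity. `BufferPool` and its methods are hypothetical and not part of the PR. The sketch assumes every buffer has the same length (one slot per metric).

```python
class BufferPool:
    """Toy pool: hands out preallocated buffers and grows only when exhausted."""

    def __init__(self, buffer_len, preallocated=2):
        self._buffer_len = buffer_len
        self._free = [[0] * buffer_len for _ in range(preallocated)]
        self.allocations = preallocated  # total number of buffers ever created

    def acquire(self):
        if self._free:
            return self._free.pop()
        # Nesting went deeper than the preallocated count, so allocate one more.
        self.allocations += 1
        return [0] * self._buffer_len

    def release(self, buf):
        # Return the buffer so that a later call can reuse it.
        self._free.append(buf)


pool = BufferPool(buffer_len=5, preallocated=2)
a, b, c = pool.acquire(), pool.acquire(), pool.acquire()  # three nested calls
for buf in (c, b, a):
    pool.release(buf)
d = pool.acquire()  # reuses an existing buffer
print(pool.allocations)
```

The pool never shrinks in this sketch, so its steady-state size equals the deepest nesting seen so far.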





[GitHub] [tvm] tkonolige commented on a change in pull request #7983: [PROFILING] Use PAPI to collect hardware performance counters on CPU and CUDA

Posted by GitBox <gi...@apache.org>.

tkonolige commented on a change in pull request #7983:
URL: https://github.com/apache/tvm/pull/7983#discussion_r660800768



##########
File path: python/tvm/contrib/debugger/debug_executor.py
##########
@@ -268,14 +268,18 @@ def run_individual(self, number, repeat=1, min_repeat_ms=0):
         ret = self._run_individual(number, repeat, min_repeat_ms)
         return ret.strip(",").split(",") if ret else []
 
-    def profile(self, **input_dict):
+    def profile(self, collectors=[], **input_dict):  # pylint: disable=dangerous-default-value

Review comment:
       This list is never modified, so it should be fine.





[GitHub] [tvm] leandron commented on a change in pull request #7983: [PROFILING] Use PAPI to collect hardware performance counters on CPU and CUDA

Posted by GitBox <gi...@apache.org>.

leandron commented on a change in pull request #7983:
URL: https://github.com/apache/tvm/pull/7983#discussion_r660491006



##########
File path: python/tvm/runtime/profiler_vm.py
##########
@@ -50,14 +50,17 @@ def get_stat(self, sort_by_time=True):  # pylint: disable=unused-argument
         warnings.warn("get_stat has been removed, use profile instead")
         return ""
 
-    def profile(self, *args, func_name="main", **kwargs):
+    def profile(self, *args, func_name="main", collectors=[], **kwargs):

Review comment:
       Interesting pylint didn't complain about this one.

##########
File path: src/runtime/vm/executable.cc
##########
@@ -273,6 +273,13 @@ void Executable::SavePrimitiveOpNames(dmlc::Stream* strm) {
     primitive_names[packed_index] = it.first;
   }
   strm->Write(primitive_names);
+  // TODO(tkonolige): cannot serialize ObjectRefs with dmlc's serializer.

Review comment:
       If this is not needed, maybe can be removed?

##########
File path: python/tvm/contrib/debugger/debug_executor.py
##########
@@ -268,14 +268,18 @@ def run_individual(self, number, repeat=1, min_repeat_ms=0):
         ret = self._run_individual(number, repeat, min_repeat_ms)
         return ret.strip(",").split(",") if ret else []
 
-    def profile(self, **input_dict):
+    def profile(self, collectors=[], **input_dict):  # pylint: disable=dangerous-default-value

Review comment:
       I see you have a `# pylint: disable=dangerous-default-value` after setting `collectors=[]`.
   
   Just wanted to double-check that this is intended: `collectors=[]` behaves like a global, so a single list is shared across all calls to `profile()` in which `collectors` is not provided.
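
A minimal sketch of the behaviour in question, using hypothetical stand-in functions rather than TVM's `profile()`. A default list is created once, when the `def` statement runs, so it only leaks state between calls if the function mutates it:

```python
def profile_mutating(collectors=[]):
    # Hypothetical: appends to the argument, so the one shared default list grows.
    collectors.append("timer")
    return collectors


def profile_read_only(collectors=[]):
    # Hypothetical: only reads the argument, so sharing the default is harmless.
    return len(collectors)


first = profile_mutating()
second = profile_mutating()
shared = first is second  # both calls returned the same list object
leaked = list(second)     # copy of the shared list after two calls
untouched = [profile_read_only(), profile_read_only()]
```

Whether the pylint suppression is safe therefore depends on `profile()` never mutating `collectors`.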
   
   




[GitHub] [tvm] tkonolige commented on pull request #7983: [PROFILING] Use PAPI to collect hardware performance counters on CPU and CUDA

Posted by GitBox <gi...@apache.org>.

tkonolige commented on pull request #7983:
URL: https://github.com/apache/tvm/pull/7983#issuecomment-859077293


   @leandron @tqchen @areusch Could you all re-review? I've pushed a change to the API to make it more closely match PassInstrument.


[GitHub] [tvm] tkonolige commented on pull request #7983: [PROFILING] Use PAPI to collect hardware performance counters on CPU and CUDA

Posted by GitBox <gi...@apache.org>.

tkonolige commented on pull request #7983:
URL: https://github.com/apache/tvm/pull/7983#issuecomment-873147072


   @leandron Does this PR look good to you or would you like some more changes?


[GitHub] [tvm] leandron edited a comment on pull request #7983: [PROFILING] Use PAPI to collect hardware performance counters on CPU and CUDA

Posted by GitBox <gi...@apache.org>.

leandron edited a comment on pull request #7983:
URL: https://github.com/apache/tvm/pull/7983#issuecomment-833365958


   > These performance counters include data like total cycles, instructions executed, and cache misses. Users can control which performance counters are collected by setting the TVM_PAPI_${DEVICE}_METRICS environment variable to a semicolon separated list of metrics.
   
   Can you also document the use of `$TVM_PAPI_${DEVICE}_METRICS` somewhere? @hogepodge can probably advise here.
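
As a possible starting point for such documentation, the sketch below mirrors in Python how the `papi.cc` diff quoted in this thread appears to resolve the metric list: it upper-cases the device name, looks up `TVM_PAPI_<DEVICE>_METRICS`, splits the value on semicolons, and otherwise falls back to built-in defaults. The helper name `metrics_for` and its `environ` parameter are illustrative assumptions; only the variable name pattern and the CPU defaults come from the diff.

```python
import os

# CPU defaults copied from the `default_metrics` table in the quoted diff.
DEFAULT_METRICS = {
    "CPU": [
        "perf::CYCLES",
        "perf::STALLED-CYCLES-FRONTEND",
        "perf::STALLED-CYCLES-BACKEND",
        "perf::INSTRUCTIONS",
        "perf::CACHE-MISSES",
    ],
}


def metrics_for(device_name, environ=None):
    """Hypothetical helper: return the metric names for one device."""
    environ = os.environ if environ is None else environ
    name = device_name.upper()
    raw = environ.get(f"TVM_PAPI_{name}_METRICS")
    if raw is None:
        # Variable not set: fall back to the defaults, if any exist.
        return list(DEFAULT_METRICS.get(name, []))
    metrics = []
    loc = 0
    while loc < len(raw):
        nxt = raw.find(";", loc)
        if nxt == -1:
            nxt = len(raw)
        metrics.append(raw[loc:nxt])
        loc = nxt + 1
    return metrics


custom = metrics_for("cpu", {"TVM_PAPI_CPU_METRICS": "perf::CYCLES;perf::INSTRUCTIONS"})
fallback = metrics_for("cpu", {})
```

Metric names themselves are platform-specific; the diff's doc comment points to the `papi_native_avail` tool for listing them.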


[GitHub] [tvm] areusch merged pull request #7983: [PROFILING] Use PAPI to collect hardware performance counters on CPU and CUDA

Posted by GitBox <gi...@apache.org>.

areusch merged pull request #7983:
URL: https://github.com/apache/tvm/pull/7983


   


[GitHub] [tvm] tkonolige commented on a change in pull request #7983: [PROFILING] Use PAPI to collect hardware performance counters on CPU and CUDA

Posted by GitBox <gi...@apache.org>.

tkonolige commented on a change in pull request #7983:
URL: https://github.com/apache/tvm/pull/7983#discussion_r629711264



##########
File path: src/runtime/vm/profiler/vm.cc
##########
@@ -117,11 +120,11 @@ void VirtualMachineDebug::InvokePacked(Index packed_index, const PackedFunc& fun
     }
     metrics["Argument Shapes"] = profiling::ShapeString(shapes);
 
-    prof_.StartCall(packed_index_map_[packed_index], dev, metrics);
+    prof_.operator*().StartCall(packed_index_map_[packed_index], dev, metrics);

Review comment:
       nope :(




[GitHub] [tvm] tkonolige commented on a change in pull request #7983: [PROFILING] Use PAPI to collect hardware performance counters on CPU and CUDA

Posted by GitBox <gi...@apache.org>.

tkonolige commented on a change in pull request #7983:
URL: https://github.com/apache/tvm/pull/7983#discussion_r627749359



##########
File path: src/runtime/contrib/papi/papi.cc
##########
@@ -0,0 +1,275 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*!
+ * \brief Performance counters for profiling via the PAPI library.
+ */
+#ifndef TVM_RUNTIME_CONTRIB_PAPI_PAPI_H_
+#define TVM_RUNTIME_CONTRIB_PAPI_PAPI_H_
+
+#include <papi.h>
+#include <tvm/runtime/profiling.h>
+
+#include <string>
+#include <unordered_map>
+#include <vector>
+
+namespace tvm {
+namespace runtime {
+namespace profiling {
+
+#define PAPI_CALL(func)                                                         \
+  {                                                                             \
+    int e = (func);                                                             \
+    if (e < 0) {                                                                \
+      LOG(FATAL) << "PAPIError: " << e << " " << std::string(PAPI_strerror(e)); \
+    }                                                                           \
+  }
+
+static const std::unordered_map<DLDeviceType, std::vector<std::string>> default_metrics = {
+    {kDLCPU,
+     {"perf::CYCLES", "perf::STALLED-CYCLES-FRONTEND", "perf::STALLED-CYCLES-BACKEND",
+      "perf::INSTRUCTIONS", "perf::CACHE-MISSES"}},
+    {kDLGPU, {"cuda:::event:elapsed_cycles_sm:device=0"}}};
+
+/*! \brief Object that holds the values of counters at the start of a function call. */
+struct PAPIEventSetNode : public Object {
+  /*! \brief The starting values of counters for all metrics of a specific device. */
+  std::vector<long_long> start_values;
+  /*! \brief The device these counters are for. */
+  Device dev;
+
+  explicit PAPIEventSetNode(std::vector<long_long> start_values, Device dev)
+      : start_values(start_values), dev(dev) {}
+
+  static constexpr const char* _type_key = "PAPIEventSetNode";
+  TVM_DECLARE_FINAL_OBJECT_INFO(PAPIEventSetNode, Object);
+};
+
+int component_for_device(Device dev) {
+  std::string component_name;
+  switch (dev.device_type) {
+    case kDLCPU:
+    case kDLCPUPinned:
+      component_name = "perf_event";
+      break;
+    case kDLGPU:
+      component_name = "cuda";
+      break;
+    case kDLROCM:
+      component_name = "rocm";
+      break;
+    default:
+      LOG(WARNING) << "PAPI does not support device " << DeviceName(dev.device_type);
+      return -1;
+  }
+  int cidx = PAPI_get_component_index(component_name.c_str());
+  if (cidx < 0) {
+    LOG(FATAL) << "Cannot find PAPI component \"" << component_name
+               << "\". Maybe you need to build PAPI with support for this component (use "
+                  "`./configure --components="
+               << component_name << "`).";
+  }
+  return cidx;
+}
+
+/*! \brief MetricCollectorNode for PAPI metrics.
+ *
+ * PAPI (Performance Application Programming Interface) collects metrics on a
+ * variety of platforms including cpu, cuda and rocm.
+ *
+ * PAPI is available at https://bitbucket.org/icl/papi/src/master/.
+ *
+ * Users can change the metrics collected by setting the environment
+ * variable `TVM_PAPI_${device_name}_METRICS` to a semicolon-separated list
+ * of metrics. Use the `papi_native_avail` tool to find the name of all
+ * available metrics.
+ */
+struct PAPIMetricCollectorNode final : public MetricCollectorNode {
+  explicit PAPIMetricCollectorNode(Array<DeviceWrapper> devices) {
+    if (!PAPI_is_initialized()) {
+      PAPI_CALL(PAPI_library_init(PAPI_VER_CURRENT));
+    }
+
+    // create event sets for each device
+    for (auto wrapped_device : devices) {
+      Device device = wrapped_device->device;
+      int cidx = component_for_device(device);
+      // unknown device, skipping
+      if (cidx < 0) {
+        continue;
+      }
+
+      const PAPI_component_info_t* component;
+      component = PAPI_get_component_info(cidx);
+      if (component->disabled) {
+        std::string help_message = "";
+        switch (device.device_type) {
+          case kDLCPU:
+          case kDLCPUPinned:
+            help_message =
+                "Try setting `sudo sh -c 'echo 1 >/proc/sys/kernel/perf_event_paranoid'`";
+            break;
+          case kDLGPU:
+            help_message =
+                "Try enabling gpu profiling with `modprobe nvidia "
+                "NVreg_RestrictProfilingToAdminUsers=0`. If that does not work, try adding  "
+                "`options nvidia \"NVreg_RestrictProfilingToAdminUsers=0\"` to "
+                "`/etc/modprobe.d/nvidia-kernel-common.conf`.";
+            break;
+          default:
+            break;
+        }
+        LOG(WARNING) << "PAPI could not initialize counters for " << DeviceName(device.device_type)
+                     << ": " << component->disabled_reason << "\n"
+                     << help_message;
+        continue;
+      }
+
+      int event_set = PAPI_NULL;
+      PAPI_CALL(PAPI_create_eventset(&event_set));
+      PAPI_CALL(PAPI_assign_eventset_component(event_set, cidx));
+      if (device.device_type == kDLCPU) {
+        // we set PAPI_INHERIT to make it so threads created after this inherit the event_set.
+        PAPI_option_t opt;
+        memset(&opt, 0x0, sizeof(PAPI_option_t));
+        opt.inherit.inherit = PAPI_INHERIT_ALL;
+        opt.inherit.eventset = event_set;
+        PAPI_CALL(PAPI_set_opt(PAPI_INHERIT, &opt));
+      }
+
+      // load default metrics for device or read them from an environment variable
+      std::vector<std::string> metrics;
+      std::string dev_name = DeviceName(device.device_type);
+      std::transform(dev_name.begin(), dev_name.end(), dev_name.begin(),
+                     [](unsigned char c) { return std::toupper(c); });
+      const char* env_p =
+          std::getenv((std::string("TVM_PAPI_") + dev_name + std::string("_METRICS")).c_str());
+      if (env_p != nullptr) {
+        std::string metric_string = env_p;
+        size_t loc = 0;
+        while (loc < metric_string.size()) {
+          size_t next = metric_string.find(';', loc);
+          if (next == metric_string.npos) {
+            next = metric_string.size();
+          }
+          metrics.push_back(metric_string.substr(loc, next - loc));
+          loc = next + 1;
+        }
+      } else {
+        auto it = default_metrics.find(device.device_type);
+        if (it != default_metrics.end()) {
+          metrics = it->second;
+        } else {
+          LOG(WARNING) << "No default metrics set for " << dev_name
+                       << ". You can specify metrics with the environment variable TVM_PAPI_"
+                       << dev_name << "_METRICS.";
+        }
+      }
+      // skip if no metrics exist
+      if (metrics.size() == 0) {
+        continue;
+      }
+      papi_metrics[device] = metrics;
+
+      if (static_cast<int>(metrics.size()) > PAPI_num_cmp_hwctrs(cidx)) {
+        PAPI_CALL(PAPI_set_multiplex(event_set));

Review comment:
       I don't think so. Performance degradation is minimal.




[GitHub] [tvm] tkonolige commented on a change in pull request #7983: [PROFILING] Use PAPI to collect hardware performance counters on CPU and CUDA

Posted by GitBox <gi...@apache.org>.

tkonolige commented on a change in pull request #7983:
URL: https://github.com/apache/tvm/pull/7983#discussion_r662645296



##########
File path: src/runtime/profiling.cc
##########
@@ -100,16 +102,37 @@ TVM_REGISTER_GLOBAL("profiling.start_timer").set_body_typed(Timer::Start);
 
 namespace profiling {
 
-void Profiler::Start(const std::vector<Device>& devs) {
-  CHECK(global_timers_.empty()) << "You can only call Start once per Profiler.";
+Profiler::Profiler(std::vector<Device> devs, std::vector<MetricCollector> metric_collectors)
+    : devs_(devs), collectors_(metric_collectors) {
+  is_running_ = false;
+  std::vector<DeviceWrapper> wrapped_devs;
   for (auto dev : devs) {
-    global_timers_.emplace_back(dev, Timer::Start(dev));
+    wrapped_devs.push_back(DeviceWrapper(make_object<DeviceWrapperNode>(dev)));
+  }
+  for (auto& x : collectors_) {
+    x->Init(wrapped_devs);
+  }
+  // reset the thread pool so that PAPI eventset hooks are set in all threads.
+  threading::ResetThreadPool();

Review comment:
       I added a note about it to the constructor
   




[GitHub] [tvm] areusch commented on a change in pull request #7983: [PROFILING] Use PAPI to collect hardware performance counters on CPU and CUDA

Posted by GitBox <gi...@apache.org>.

areusch commented on a change in pull request #7983:
URL: https://github.com/apache/tvm/pull/7983#discussion_r660750725



##########
File path: src/runtime/profiling.cc
##########
@@ -431,7 +442,29 @@ Report Profiler::Report(bool aggregate, bool sort) {
     rows.push_back(row);
   }
 
-  return profiling::Report(rows, device_metrics);
+  // the last couple of call frames are the overall times
+  double overall_time = 0;

Review comment:
       still suggest this

##########
File path: src/runtime/profiling.cc
##########
@@ -118,14 +141,21 @@ void Profiler::StopCall(std::unordered_map<std::string, ObjectRef> extra_metrics
   for (auto& p : extra_metrics) {
     cf.extra_metrics[p.first] = p.second;
   }
+  // collect the extra metrics from user defined collectors
+  for (const auto& obj : cf.extra_collectors) {
+    auto collector_metrics = obj.first->Stop(obj.second);
+    for (auto& p : collector_metrics) {
+      cf.extra_metrics[p.first] = p.second;
+    }
+  }
   in_flight_.pop();
   calls_.push_back(cf);
 }
 
 void Profiler::Stop() {
-  // Stop all global timers. We wait to synchronize until we are making the report.
-  for (auto p : global_timers_) {
-    p.second->Stop();
+  is_running_ = false;

Review comment:
       any thoughts on how errors should get handled here?

##########
File path: python/tvm/contrib/debugger/debug_executor.py
##########
@@ -268,14 +268,18 @@ def run_individual(self, number, repeat=1, min_repeat_ms=0):
         ret = self._run_individual(number, repeat, min_repeat_ms)
         return ret.strip(",").split(",") if ret else []
 
-    def profile(self, **input_dict):
+    def profile(self, collectors=[], **input_dict):  # pylint: disable=dangerous-default-value

Review comment:
       Yeah, don't use a list as a kwarg default. Instead, use `None`, and write `collectors = collectors if collectors is not None else []` in the function body.
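
The suggested pattern, sketched on a hypothetical stand-in function (not the actual `profile()` method; the `append` is there only to show that mutation becomes safe):

```python
def profile(collectors=None, **input_dict):
    # A fresh list is built on every call that omits `collectors`.
    collectors = collectors if collectors is not None else []
    collectors.append("timer")  # illustrative mutation only
    return collectors


first = profile()
second = profile()
```

With the `None` sentinel the two calls no longer share state, and the `dangerous-default-value` suppression becomes unnecessary.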




[GitHub] [tvm] areusch commented on a change in pull request #7983: [PROFILING] Use PAPI to collect hardware performance counters on CPU and CUDA

Posted by GitBox <gi...@apache.org>.

areusch commented on a change in pull request #7983:
URL: https://github.com/apache/tvm/pull/7983#discussion_r629665952



##########
File path: src/runtime/vm/profiler/vm.cc
##########
@@ -117,11 +120,11 @@ void VirtualMachineDebug::InvokePacked(Index packed_index, const PackedFunc& fun
     }
     metrics["Argument Shapes"] = profiling::ShapeString(shapes);
 
-    prof_.StartCall(packed_index_map_[packed_index], dev, metrics);
+    prof_.operator*().StartCall(packed_index_map_[packed_index], dev, metrics);

Review comment:
       Does `->` not work?

##########
File path: src/runtime/profiling.cc
##########
@@ -431,7 +442,29 @@ Report Profiler::Report(bool aggregate, bool sort) {
     rows.push_back(row);
   }
 
-  return profiling::Report(rows, device_metrics);
+  // the last couple of call frames are the overall times
+  double overall_time = 0;

Review comment:
       suggest unit suffix: `overall_time_us`

##########
File path: src/runtime/contrib/papi/papi.cc
##########
@@ -0,0 +1,275 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*!
+ * \brief Performance counters for profiling via the PAPI library.
+ */
+#ifndef TVM_RUNTIME_CONTRIB_PAPI_PAPI_H_
+#define TVM_RUNTIME_CONTRIB_PAPI_PAPI_H_
+
+#include <papi.h>
+#include <tvm/runtime/profiling.h>
+
+#include <string>
+#include <unordered_map>
+#include <vector>
+
+namespace tvm {
+namespace runtime {
+namespace profiling {
+
+#define PAPI_CALL(func)                                                         \
+  {                                                                             \
+    int e = (func);                                                             \
+    if (e < 0) {                                                                \
+      LOG(FATAL) << "PAPIError: " << e << " " << std::string(PAPI_strerror(e)); \
+    }                                                                           \
+  }
+
+static const std::unordered_map<DLDeviceType, std::vector<std::string>> default_metrics = {
+    {kDLCPU,
+     {"perf::CYCLES", "perf::STALLED-CYCLES-FRONTEND", "perf::STALLED-CYCLES-BACKEND",
+      "perf::INSTRUCTIONS", "perf::CACHE-MISSES"}},
+    {kDLGPU, {"cuda:::event:elapsed_cycles_sm:device=0"}}};
+
+/*! \brief Object that holds the values of counters at the start of a function call. */
+struct PAPIEventSetNode : public Object {
+  /*! \brief The starting values of counters for all metrics of a specific device. */
+  std::vector<long_long> start_values;
+  /*! \brief The device these counters are for. */
+  Device dev;
+
+  explicit PAPIEventSetNode(std::vector<long_long> start_values, Device dev)
+      : start_values(start_values), dev(dev) {}
+
+  static constexpr const char* _type_key = "PAPIEventSetNode";
+  TVM_DECLARE_FINAL_OBJECT_INFO(PAPIEventSetNode, Object);
+};
+
+int component_for_device(Device dev) {
+  std::string component_name;
+  switch (dev.device_type) {
+    case kDLCPU:
+    case kDLCPUPinned:
+      component_name = "perf_event";
+      break;
+    case kDLGPU:
+      component_name = "cuda";
+      break;
+    case kDLROCM:
+      component_name = "rocm";
+      break;
+    default:
+      LOG(WARNING) << "PAPI does not support device " << DeviceName(dev.device_type);
+      return -1;
+  }
+  int cidx = PAPI_get_component_index(component_name.c_str());
+  if (cidx < 0) {
+    LOG(FATAL) << "Cannot find PAPI component \"" << component_name
+               << "\". Maybe you need to build PAPI with support for this component (use "
+                  "`./configure --components="
+               << component_name << "`).";
+  }
+  return cidx;
+}
+
+/*! \brief MetricCollectorNode for PAPI metrics.
+ *
+ * PAPI (Performance Application Programming Interface) collects metrics on a
+ * variety of platforms including cpu, cuda and rocm.
+ *
+ * PAPI is available at https://bitbucket.org/icl/papi/src/master/.
+ *
+ * Users can change the metrics collected by setting the environment
+ * variable `TVM_PAPI_${device_name}_METRICS` to a semicolon-separated list
+ * of metrics. Use the `papi_native_avail` tool to find the name of all
+ * available metrics.
+ */
+struct PAPIMetricCollectorNode final : public MetricCollectorNode {
+  explicit PAPIMetricCollectorNode(Array<DeviceWrapper> devices) {
+    if (!PAPI_is_initialized()) {
+      PAPI_CALL(PAPI_library_init(PAPI_VER_CURRENT));
+    }
+
+    // create event sets for each device
+    for (auto wrapped_device : devices) {
+      Device device = wrapped_device->device;
+      int cidx = component_for_device(device);
+      // unknown device, skipping
+      if (cidx < 0) {
+        continue;
+      }
+
+      const PAPI_component_info_t* component;
+      component = PAPI_get_component_info(cidx);
+      if (component->disabled) {
+        std::string help_message = "";
+        switch (device.device_type) {
+          case kDLCPU:
+          case kDLCPUPinned:
+            help_message =
+                "Try setting `sudo sh -c 'echo 1 >/proc/sys/kernel/perf_event_paranoid'`";
+            break;
+          case kDLGPU:
+            help_message =
+                "Try enabling gpu profiling with `modprobe nvidia "
+                "NVreg_RestrictProfilingToAdminUsers=0`. If that does not work, try adding  "
+                "`options nvidia \"NVreg_RestrictProfilingToAdminUsers=0\"` to "
+                "`/etc/modprobe.d/nvidia-kernel-common.conf`.";
+            break;
+          default:
+            break;
+        }
+        LOG(WARNING) << "PAPI could not initialize counters for " << DeviceName(device.device_type)
+                     << ": " << component->disabled_reason << "\n"
+                     << help_message;
+        continue;
+      }
+
+      int event_set = PAPI_NULL;
+      PAPI_CALL(PAPI_create_eventset(&event_set));
+      PAPI_CALL(PAPI_assign_eventset_component(event_set, cidx));
+      if (device.device_type == kDLCPU) {
+        // we set PAPI_INHERIT to make it so threads created after this inherit the event_set.
+        PAPI_option_t opt;
+        memset(&opt, 0x0, sizeof(PAPI_option_t));
+        opt.inherit.inherit = PAPI_INHERIT_ALL;
+        opt.inherit.eventset = event_set;
+        PAPI_CALL(PAPI_set_opt(PAPI_INHERIT, &opt));
+      }
+
+      // load default metrics for device or read them from an environment variable
+      std::vector<std::string> metrics;
+      std::string dev_name = DeviceName(device.device_type);
+      std::transform(dev_name.begin(), dev_name.end(), dev_name.begin(),
+                     [](unsigned char c) { return std::toupper(c); });
+      const char* env_p =
+          std::getenv((std::string("TVM_PAPI_") + dev_name + std::string("_METRICS")).c_str());
+      if (env_p != nullptr) {
+        std::string metric_string = env_p;
+        size_t loc = 0;
+        while (loc < metric_string.size()) {
+          size_t next = metric_string.find(';', loc);
+          if (next == metric_string.npos) {
+            next = metric_string.size();
+          }
+          metrics.push_back(metric_string.substr(loc, next - loc));
+          loc = next + 1;
+        }
+      } else {
+        auto it = default_metrics.find(device.device_type);
+        if (it != default_metrics.end()) {
+          metrics = it->second;
+        } else {
+          LOG(WARNING) << "No default metrics set for " << dev_name
+                       << ". You can specify metrics with the environment variable TVM_PAPI_"
+                       << dev_name << "_METRICS.";
+        }
+      }
+      // skip if no metrics exist
+      if (metrics.size() == 0) {
+        continue;
+      }
+      papi_metrics[device] = metrics;
+
+      if (static_cast<int>(metrics.size()) > PAPI_num_cmp_hwctrs(cidx)) {
+        PAPI_CALL(PAPI_set_multiplex(event_set));
+      }
+
+      // add all the metrics
+      for (auto metric : metrics) {
+        int e = PAPI_add_named_event(event_set, metric.c_str());
+        if (e != PAPI_OK) {
+          LOG(FATAL) << "PAPIError: " << e << " " << std::string(PAPI_strerror(e)) << ": " << metric
+                     << ".";
+        }
+      }
+      // Because we may have multiple calls in flight at the same time, we
+      // start all the timers when we initialize. Then we calculate the metrics
+      // counts for a call by comparing counter values at the start vs end of
+      // the call.
+      PAPI_CALL(PAPI_start(event_set));
+      event_sets[device] = event_set;
+    }
+  }
+
+  /*! \brief Called right before a function call.
+   * \param dev The device the function will be run on.
+   * \returns A `PAPIEventSetNode` containing values for the counters at the
+   * start of the call. Passed to a corresponding `Stop` call.
+   */
+  ObjectRef Start(Device dev) final {
+    // Record counter values at the start of the call, so we can calculate the
+    // metrics for the call by comparing the values at the end of the call.
+    auto it = event_sets.find(dev);
+    if (it != event_sets.end()) {
+      int event_set = it->second;
+      std::vector<long_long> values(papi_metrics[dev].size());
+      PAPI_CALL(PAPI_read(event_set, values.data()));
+      return ObjectRef(make_object<PAPIEventSetNode>(values, dev));

Review comment:
       could you give an example of nesting? I just see
   ```
   StartCall()
   // run kernel
   StopCall()
   ```
   
   my concern is that putting a bunch of malloc calls in between time points may inject a bunch of noise in the measurements
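
One way to picture overlapping calls, as a hedged sketch only: the code comment in the quoted diff says the counters are started once at initialisation and each call is measured by comparing readings at its start and end. The classes below are hypothetical stand-ins (a fake counter, not PAPI), with one call beginning while another is still in flight:

```python
class FreeRunningCounter:
    """Stand-in for a hardware counter that only ever increases."""

    def __init__(self):
        self.value = 0

    def tick(self, amount):
        self.value += amount

    def read(self):
        return self.value


class DeltaCollector:
    """Measures a call as (reading at stop) - (reading at start)."""

    def __init__(self, counter):
        self.counter = counter

    def start(self):
        # The snapshot is handed back to the caller and later passed to stop().
        return self.counter.read()

    def stop(self, snapshot):
        return self.counter.read() - snapshot


counter = FreeRunningCounter()
collector = DeltaCollector(counter)

outer = collector.start()
counter.tick(10)                     # work attributed to the outer call only
inner = collector.start()
counter.tick(5)                      # work inside the nested call
inner_count = collector.stop(inner)
counter.tick(3)
outer_count = collector.stop(outer)
```

Because the counter is never reset, the inner call's snapshot does not disturb the outer measurement. The sketch does not model the allocation cost raised above; anything that runs between the two readings is counted.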




[GitHub] [tvm] areusch commented on a change in pull request #7983: [PROFILING] Use PAPI to collect hardware performance counters on CPU and CUDA

Posted by GitBox <gi...@apache.org>.

areusch commented on a change in pull request #7983:
URL: https://github.com/apache/tvm/pull/7983#discussion_r627609538



##########
File path: include/tvm/runtime/profiling.h
##########
@@ -329,6 +386,24 @@ class CountNode : public Object {
  */
 String ShapeString(const std::vector<NDArray>& shapes);
 
+/*! \brief Wrapper for `Device`. */

Review comment:
       explain why we need this

##########
File path: include/tvm/runtime/profiling.h
##########
@@ -210,16 +256,19 @@ struct CallFrame {
   Timer timer;
   /*! Extra performance metrics */
   std::unordered_map<std::string, ObjectRef> extra_metrics;
+  /*! User defined metric collectors */

Review comment:
       any reason this isn't a map? could you explain in a comment? what's the second item in the pair?

##########
File path: src/runtime/contrib/papi/papi.cc
##########
@@ -0,0 +1,275 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*!
+ * \brief Performance counters for profiling via the PAPI library.
+ */
+#ifndef TVM_RUNTIME_CONTRIB_PAPI_PAPI_H_
+#define TVM_RUNTIME_CONTRIB_PAPI_PAPI_H_
+
+#include <papi.h>
+#include <tvm/runtime/profiling.h>
+
+#include <string>
+#include <unordered_map>
+#include <vector>
+
+namespace tvm {
+namespace runtime {
+namespace profiling {
+
+#define PAPI_CALL(func)                                                         \
+  {                                                                             \
+    int e = (func);                                                             \
+    if (e < 0) {                                                                \
+      LOG(FATAL) << "PAPIError: " << e << " " << std::string(PAPI_strerror(e)); \
+    }                                                                           \
+  }
+
+static const std::unordered_map<DLDeviceType, std::vector<std::string>> default_metrics = {
+    {kDLCPU,
+     {"perf::CYCLES", "perf::STALLED-CYCLES-FRONTEND", "perf::STALLED-CYCLES-BACKEND",
+      "perf::INSTRUCTIONS", "perf::CACHE-MISSES"}},
+    {kDLGPU, {"cuda:::event:elapsed_cycles_sm:device=0"}}};
+
+/*! \brief Object that holds the values of counters at the start of a function call. */
+struct PAPIEventSetNode : public Object {
+  /*! \brief The starting values of counters for all metrics of a specific device. */
+  std::vector<long_long> start_values;

Review comment:
       These values will eventually become int64_t. How do we handle a difference between the two types? It may be worth warning the user, e.g. by comparing sizeof(long_long) with sizeof(int64_t).
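
       A minimal sketch of the size check, done at compile time rather than as a runtime warning. To stay compilable without `<papi.h>`, it uses a stand-in alias `papi_long_long` (an assumption: PAPI's `long_long` is taken to be `long long` here), and `ToInt64` is a hypothetical helper name, not something in this PR.

```cpp
#include <cstdint>
#include <type_traits>
#include <vector>

// Stand-in for PAPI's `long_long`, so the sketch compiles without <papi.h>.
// Assumption: with PAPI available, the real type would be used here instead.
using papi_long_long = long long;

// Refuse to build if the counter type cannot be stored losslessly in int64_t.
static_assert(sizeof(papi_long_long) == sizeof(int64_t),
              "PAPI counter type and int64_t differ in size");
static_assert(std::is_signed<papi_long_long>::value,
              "PAPI counter type is expected to be signed");

// Hypothetical helper: copy raw counter values into int64_t storage.
inline std::vector<int64_t> ToInt64(const std::vector<papi_long_long>& raw) {
  return std::vector<int64_t>(raw.begin(), raw.end());
}
```

       A `static_assert` turns a mismatch into a build error on the affected platform, which may be preferable to a warning that users can overlook.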

##########
File path: src/runtime/contrib/papi/papi.cc
##########
@@ -0,0 +1,275 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*!
+ * \brief Performance counters for profiling via the PAPI library.
+ */
+#ifndef TVM_RUNTIME_CONTRIB_PAPI_PAPI_H_
+#define TVM_RUNTIME_CONTRIB_PAPI_PAPI_H_
+
+#include <papi.h>
+#include <tvm/runtime/profiling.h>
+
+#include <string>
+#include <unordered_map>
+#include <vector>
+
+namespace tvm {
+namespace runtime {
+namespace profiling {
+
+#define PAPI_CALL(func)                                                         \
+  {                                                                             \
+    int e = (func);                                                             \
+    if (e < 0) {                                                                \
+      LOG(FATAL) << "PAPIError: " << e << " " << std::string(PAPI_strerror(e)); \
+    }                                                                           \
+  }
+
+static const std::unordered_map<DLDeviceType, std::vector<std::string>> default_metrics = {
+    {kDLCPU,
+     {"perf::CYCLES", "perf::STALLED-CYCLES-FRONTEND", "perf::STALLED-CYCLES-BACKEND",
+      "perf::INSTRUCTIONS", "perf::CACHE-MISSES"}},
+    {kDLGPU, {"cuda:::event:elapsed_cycles_sm:device=0"}}};
+
+/*! \brief Object that holds the values of counters at the start of a function call. */
+struct PAPIEventSetNode : public Object {
+  /*! \brief The starting values of counters for all metrics of a specific device. */
+  std::vector<long_long> start_values;
+  /*! \brief The device these counters are for. */
+  Device dev;
+
+  explicit PAPIEventSetNode(std::vector<long_long> start_values, Device dev)
+      : start_values(start_values), dev(dev) {}
+
+  static constexpr const char* _type_key = "PAPIEventSetNode";
+  TVM_DECLARE_FINAL_OBJECT_INFO(PAPIEventSetNode, Object);
+};
+
+int component_for_device(Device dev) {
+  std::string component_name;
+  switch (dev.device_type) {
+    case kDLCPU:
+    case kDLCPUPinned:
+      component_name = "perf_event";
+      break;
+    case kDLGPU:
+      component_name = "cuda";
+      break;
+    case kDLROCM:
+      component_name = "rocm";
+      break;
+    default:
+      LOG(WARNING) << "PAPI does not support device " << DeviceName(dev.device_type);
+      return -1;
+  }
+  int cidx = PAPI_get_component_index(component_name.c_str());
+  if (cidx < 0) {
+    LOG(FATAL) << "Cannot find PAPI component \"" << component_name
+               << "\". Maybe you need to build PAPI with support for this component (use "
+                  "`./configure --components="
+               << component_name << "`).";
+  }
+  return cidx;
+}
+
+/*! \brief MetricCollectorNode for PAPI metrics.
+ *
+ * PAPI (Performance Application Programming Interface) collects metrics on a
+ * variety of platforms including cpu, cuda and rocm.
+ *
+ * PAPI is available at https://bitbucket.org/icl/papi/src/master/.
+ *
+ * Users can change the metrics collected by setting the environment
+ * variable `TVM_PAPI_${device_name}_METRICS` to a semicolon-separated list
+ * of metrics. Use the `papi_native_avail` tool to find the name of all
+ * available metrics.
+ */
+struct PAPIMetricCollectorNode final : public MetricCollectorNode {
+  explicit PAPIMetricCollectorNode(Array<DeviceWrapper> devices) {
+    if (!PAPI_is_initialized()) {
+      PAPI_CALL(PAPI_library_init(PAPI_VER_CURRENT));
+    }
+
+    // create event sets for each device
+    for (auto wrapped_device : devices) {
+      Device device = wrapped_device->device;
+      int cidx = component_for_device(device);
+      // unknown device, skipping
+      if (cidx < 0) {
+        continue;
+      }
+
+      const PAPI_component_info_t* component;
+      component = PAPI_get_component_info(cidx);
+      if (component->disabled) {
+        std::string help_message = "";
+        switch (device.device_type) {
+          case kDLCPU:
+          case kDLCPUPinned:
+            help_message =
+                "Try setting `sudo sh -c 'echo 1 >/proc/sys/kernel/perf_event_paranoid'`";
+            break;
+          case kDLGPU:
+            help_message =
+                "Try enabling gpu profiling with `modprobe nvidia "
+                "NVreg_RestrictProfilingToAdminUsers=0`. If that does not work, try adding  "
+                "`options nvidia \"NVreg_RestrictProfilingToAdminUsers=0\"` to "
+                "`/etc/modprobe.d/nvidia-kernel-common.conf`.";
+            break;
+          default:
+            break;
+        }
+        LOG(WARNING) << "PAPI could not initialize counters for " << DeviceName(device.device_type)
+                     << ": " << component->disabled_reason << "\n"
+                     << help_message;
+        continue;
+      }
+
+      int event_set = PAPI_NULL;
+      PAPI_CALL(PAPI_create_eventset(&event_set));
+      PAPI_CALL(PAPI_assign_eventset_component(event_set, cidx));
+      if (device.device_type == kDLCPU) {
+        // we set PAPI_INHERIT to make it so threads created after this inherit the event_set.
+        PAPI_option_t opt;
+        memset(&opt, 0x0, sizeof(PAPI_option_t));
+        opt.inherit.inherit = PAPI_INHERIT_ALL;
+        opt.inherit.eventset = event_set;
+        PAPI_CALL(PAPI_set_opt(PAPI_INHERIT, &opt));
+      }
+
+      // load default metrics for device or read them from an environment variable
+      std::vector<std::string> metrics;
+      std::string dev_name = DeviceName(device.device_type);
+      std::transform(dev_name.begin(), dev_name.end(), dev_name.begin(),
+                     [](unsigned char c) { return std::toupper(c); });
+      const char* env_p =
+          std::getenv((std::string("TVM_PAPI_") + dev_name + std::string("_METRICS")).c_str());
+      if (env_p != nullptr) {
+        std::string metric_string = env_p;
+        size_t loc = 0;
+        while (loc < metric_string.size()) {
+          size_t next = metric_string.find(';', loc);
+          if (next == metric_string.npos) {
+            next = metric_string.size();
+          }
+          metrics.push_back(metric_string.substr(loc, next - loc));
+          loc = next + 1;
+        }
+      } else {
+        auto it = default_metrics.find(device.device_type);
+        if (it != default_metrics.end()) {
+          metrics = it->second;
+        } else {
+          LOG(WARNING) << "No default metrics set for " << dev_name
+                       << ". You can specify metrics with the environment variable TVM_PAPI_"
+                       << dev_name << "_METRICS.";
+        }
+      }
+      // skip if no metrics exist
+      if (metrics.size() == 0) {
+        continue;
+      }
+      papi_metrics[device] = metrics;
+
+      if (static_cast<int>(metrics.size()) > PAPI_num_cmp_hwctrs(cidx)) {
+        PAPI_CALL(PAPI_set_multiplex(event_set));
+      }
+
+      // add all the metrics
+      for (auto metric : metrics) {
+        int e = PAPI_add_named_event(event_set, metric.c_str());
+        if (e != PAPI_OK) {
+          LOG(FATAL) << "PAPIError: " << e << " " << std::string(PAPI_strerror(e)) << ": " << metric
+                     << ".";
+        }
+      }
+      // Because we may have multiple calls in flight at the same time, we
+      // start all the timers when we initialize. Then we calculate the metrics
+      // counts for a call by comparing counter values at the start vs end of
+      // the call.
+      PAPI_CALL(PAPI_start(event_set));
+      event_sets[device] = event_set;
+    }
+  }
+
+  /*! \brief Called right before a function call.
+   * \param dev The device the function will be run on.
+   * \returns A `PAPIEventSetNode` containing values for the counters at the
+   * start of the call. Passed to a corresponding `Stop` call.
+   */
+  ObjectRef Start(Device dev) final {
+    // Record counter values at the start of the call, so we can calculate the
+    // metrics for the call by comparing the values at the end of the call.
+    auto it = event_sets.find(dev);
+    if (it != event_sets.end()) {
+      int event_set = it->second;
+      std::vector<long_long> values(papi_metrics[dev].size());
+      PAPI_CALL(PAPI_read(event_set, values.data()));
+      return ObjectRef(make_object<PAPIEventSetNode>(values, dev));

Review comment:
       is there any way to avoid allocating here?

##########
File path: src/runtime/contrib/papi/papi.cc
##########
@@ -0,0 +1,275 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*!
+ * \brief Performance counters for profiling via the PAPI library.
+ */
+#ifndef TVM_RUNTIME_CONTRIB_PAPI_PAPI_H_
+#define TVM_RUNTIME_CONTRIB_PAPI_PAPI_H_
+
+#include <papi.h>
+#include <tvm/runtime/profiling.h>
+
+#include <string>
+#include <unordered_map>
+#include <vector>
+
+namespace tvm {
+namespace runtime {
+namespace profiling {
+
+#define PAPI_CALL(func)                                                         \
+  {                                                                             \
+    int e = (func);                                                             \
+    if (e < 0) {                                                                \
+      LOG(FATAL) << "PAPIError: " << e << " " << std::string(PAPI_strerror(e)); \
+    }                                                                           \
+  }
+
+static const std::unordered_map<DLDeviceType, std::vector<std::string>> default_metrics = {
+    {kDLCPU,
+     {"perf::CYCLES", "perf::STALLED-CYCLES-FRONTEND", "perf::STALLED-CYCLES-BACKEND",
+      "perf::INSTRUCTIONS", "perf::CACHE-MISSES"}},
+    {kDLGPU, {"cuda:::event:elapsed_cycles_sm:device=0"}}};
+
+/*! \brief Object that holds the values of counters at the start of a function call. */
+struct PAPIEventSetNode : public Object {
+  /*! \brief The starting values of counters for all metrics of a specific device. */
+  std::vector<long_long> start_values;
+  /*! \brief The device these counters are for. */
+  Device dev;
+
+  explicit PAPIEventSetNode(std::vector<long_long> start_values, Device dev)
+      : start_values(start_values), dev(dev) {}
+
+  static constexpr const char* _type_key = "PAPIEventSetNode";
+  TVM_DECLARE_FINAL_OBJECT_INFO(PAPIEventSetNode, Object);
+};
+
+int component_for_device(Device dev) {

Review comment:
       Document the return type, and consider separating the retrieved data (component_index) from the status code by placing it in an out parameter or raising an exception in the `default:` case.
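
       A sketch of the out-parameter variant, assuming a simplified stand-in enum `DeviceKind` in place of `DLDeviceType` so that it compiles on its own. `ComponentNameForDevice` is a hypothetical name, and the sketch covers only the device-to-component mapping, not the `PAPI_get_component_index` lookup.

```cpp
#include <string>

// Hypothetical stand-in for DLDeviceType, to keep the sketch self-contained.
enum class DeviceKind { kCPU, kCPUPinned, kGPU, kROCM, kOther };

/*! \brief Find the name of the PAPI component that serves a device kind.
 *  \param kind The kind of device to look up.
 *  \param component_name Out parameter; written only when the lookup succeeds.
 *  \return true if PAPI has a component for this device kind, false otherwise.
 */
inline bool ComponentNameForDevice(DeviceKind kind, std::string* component_name) {
  switch (kind) {
    case DeviceKind::kCPU:
    case DeviceKind::kCPUPinned:
      *component_name = "perf_event";
      return true;
    case DeviceKind::kGPU:
      *component_name = "cuda";
      return true;
    case DeviceKind::kROCM:
      *component_name = "rocm";
      return true;
    default:
      // Unsupported device: report failure without touching the out parameter.
      return false;
  }
}
```

       With the status in the return value, the caller no longer has to treat a negative index as "skip this device".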

##########
File path: src/runtime/contrib/papi/papi.cc
##########
@@ -0,0 +1,275 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*!
+ * \brief Performance counters for profiling via the PAPI library.
+ */
+#ifndef TVM_RUNTIME_CONTRIB_PAPI_PAPI_H_
+#define TVM_RUNTIME_CONTRIB_PAPI_PAPI_H_
+
+#include <papi.h>
+#include <tvm/runtime/profiling.h>
+
+#include <string>
+#include <unordered_map>
+#include <vector>
+
+namespace tvm {
+namespace runtime {
+namespace profiling {
+
+#define PAPI_CALL(func)                                                         \
+  {                                                                             \
+    int e = (func);                                                             \
+    if (e < 0) {                                                                \
+      LOG(FATAL) << "PAPIError: " << e << " " << std::string(PAPI_strerror(e)); \
+    }                                                                           \
+  }
+
+static const std::unordered_map<DLDeviceType, std::vector<std::string>> default_metrics = {
+    {kDLCPU,
+     {"perf::CYCLES", "perf::STALLED-CYCLES-FRONTEND", "perf::STALLED-CYCLES-BACKEND",
+      "perf::INSTRUCTIONS", "perf::CACHE-MISSES"}},
+    {kDLGPU, {"cuda:::event:elapsed_cycles_sm:device=0"}}};
+
+/*! \brief Object that holds the values of counters at the start of a function call. */
+struct PAPIEventSetNode : public Object {
+  /*! \brief The starting values of counters for all metrics of a specific device. */
+  std::vector<long_long> start_values;
+  /*! \brief The device these counters are for. */
+  Device dev;
+
+  explicit PAPIEventSetNode(std::vector<long_long> start_values, Device dev)
+      : start_values(start_values), dev(dev) {}
+
+  static constexpr const char* _type_key = "PAPIEventSetNode";
+  TVM_DECLARE_FINAL_OBJECT_INFO(PAPIEventSetNode, Object);
+};
+
+int component_for_device(Device dev) {
+  std::string component_name;
+  switch (dev.device_type) {
+    case kDLCPU:
+    case kDLCPUPinned:
+      component_name = "perf_event";
+      break;
+    case kDLGPU:
+      component_name = "cuda";
+      break;
+    case kDLROCM:
+      component_name = "rocm";
+      break;
+    default:
+      LOG(WARNING) << "PAPI does not support device " << DeviceName(dev.device_type);
+      return -1;
+  }
+  int cidx = PAPI_get_component_index(component_name.c_str());
+  if (cidx < 0) {
+    LOG(FATAL) << "Cannot find PAPI component \"" << component_name
+               << "\". Maybe you need to build PAPI with support for this component (use "
+                  "`./configure --components="
+               << component_name << "`).";
+  }
+  return cidx;
+}
+
+/*! \brief MetricCollectorNode for PAPI metrics.
+ *
+ * PAPI (Performance Application Programming Interface) collects metrics on a
+ * variety of platforms including cpu, cuda and rocm.
+ *
+ * PAPI is available at https://bitbucket.org/icl/papi/src/master/.
+ *
+ * Users can change the metrics collected by setting the environment
+ * variable `TVM_PAPI_${device_name}_METRICS` to a semicolon-separated list
+ * of metrics. Use the `papi_native_avail` tool to find the name of all
+ * available metrics.
+ */
+struct PAPIMetricCollectorNode final : public MetricCollectorNode {
+  explicit PAPIMetricCollectorNode(Array<DeviceWrapper> devices) {
+    if (!PAPI_is_initialized()) {
+      PAPI_CALL(PAPI_library_init(PAPI_VER_CURRENT));
+    }
+
+    // create event sets for each device
+    for (auto wrapped_device : devices) {
+      Device device = wrapped_device->device;
+      int cidx = component_for_device(device);
+      // unknown device, skipping
+      if (cidx < 0) {
+        continue;
+      }
+
+      const PAPI_component_info_t* component;
+      component = PAPI_get_component_info(cidx);

Review comment:
       combine with previous line?

##########
File path: src/runtime/contrib/papi/papi.cc
##########
@@ -0,0 +1,275 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*!
+ * \brief Performance counters for profiling via the PAPI library.
+ */
+#ifndef TVM_RUNTIME_CONTRIB_PAPI_PAPI_H_
+#define TVM_RUNTIME_CONTRIB_PAPI_PAPI_H_
+
+#include <papi.h>
+#include <tvm/runtime/profiling.h>
+
+#include <string>
+#include <unordered_map>
+#include <vector>
+
+namespace tvm {
+namespace runtime {
+namespace profiling {
+
+#define PAPI_CALL(func)                                                         \
+  {                                                                             \
+    int e = (func);                                                             \
+    if (e < 0) {                                                                \
+      LOG(FATAL) << "PAPIError: " << e << " " << std::string(PAPI_strerror(e)); \
+    }                                                                           \
+  }
+
+static const std::unordered_map<DLDeviceType, std::vector<std::string>> default_metrics = {
+    {kDLCPU,
+     {"perf::CYCLES", "perf::STALLED-CYCLES-FRONTEND", "perf::STALLED-CYCLES-BACKEND",
+      "perf::INSTRUCTIONS", "perf::CACHE-MISSES"}},
+    {kDLGPU, {"cuda:::event:elapsed_cycles_sm:device=0"}}};
+
+/*! \brief Object that holds the values of counters at the start of a function call. */
+struct PAPIEventSetNode : public Object {
+  /*! \brief The starting values of counters for all metrics of a specific device. */
+  std::vector<long_long> start_values;
+  /*! \brief The device these counters are for. */
+  Device dev;
+
+  explicit PAPIEventSetNode(std::vector<long_long> start_values, Device dev)
+      : start_values(start_values), dev(dev) {}
+
+  static constexpr const char* _type_key = "PAPIEventSetNode";
+  TVM_DECLARE_FINAL_OBJECT_INFO(PAPIEventSetNode, Object);
+};
+
+int component_for_device(Device dev) {
+  std::string component_name;
+  switch (dev.device_type) {
+    case kDLCPU:
+    case kDLCPUPinned:
+      component_name = "perf_event";
+      break;
+    case kDLGPU:
+      component_name = "cuda";
+      break;
+    case kDLROCM:
+      component_name = "rocm";
+      break;
+    default:
+      LOG(WARNING) << "PAPI does not support device " << DeviceName(dev.device_type);
+      return -1;
+  }
+  int cidx = PAPI_get_component_index(component_name.c_str());
+  if (cidx < 0) {
+    LOG(FATAL) << "Cannot find PAPI component \"" << component_name
+               << "\". Maybe you need to build PAPI with support for this component (use "
+                  "`./configure --components="
+               << component_name << "`).";
+  }
+  return cidx;
+}
+
+/*! \brief MetricCollectorNode for PAPI metrics.
+ *
+ * PAPI (Performance Application Programming Interface) collects metrics on a
+ * variety of platforms including cpu, cuda and rocm.
+ *
+ * PAPI is available at https://bitbucket.org/icl/papi/src/master/.
+ *
+ * Users can change the metrics collected by setting the environment
+ * variable `TVM_PAPI_${device_name}_METRICS` to a semicolon-separated list
+ * of metrics. Use the `papi_native_avail` tool to find the name of all
+ * available metrics.
+ */
+struct PAPIMetricCollectorNode final : public MetricCollectorNode {
+  explicit PAPIMetricCollectorNode(Array<DeviceWrapper> devices) {
+    if (!PAPI_is_initialized()) {
+      PAPI_CALL(PAPI_library_init(PAPI_VER_CURRENT));
+    }
+
+    // create event sets for each device
+    for (auto wrapped_device : devices) {
+      Device device = wrapped_device->device;
+      int cidx = component_for_device(device);
+      // unknown device, skipping
+      if (cidx < 0) {
+        continue;
+      }
+
+      const PAPI_component_info_t* component;
+      component = PAPI_get_component_info(cidx);
+      if (component->disabled) {
+        std::string help_message = "";
+        switch (device.device_type) {
+          case kDLCPU:
+          case kDLCPUPinned:
+            help_message =
+                "Try setting `sudo sh -c 'echo 1 >/proc/sys/kernel/perf_event_paranoid'`";
+            break;
+          case kDLGPU:
+            help_message =
+                "Try enabling gpu profiling with `modprobe nvidia "
+                "NVreg_RestrictProfilingToAdminUsers=0`. If that does not work, try adding  "
+                "`options nvidia \"NVreg_RestrictProfilingToAdminUsers=0\"` to "
+                "`/etc/modprobe.d/nvidia-kernel-common.conf`.";
+            break;
+          default:
+            break;
+        }
+        LOG(WARNING) << "PAPI could not initialize counters for " << DeviceName(device.device_type)
+                     << ": " << component->disabled_reason << "\n"
+                     << help_message;
+        continue;
+      }
+
+      int event_set = PAPI_NULL;
+      PAPI_CALL(PAPI_create_eventset(&event_set));
+      PAPI_CALL(PAPI_assign_eventset_component(event_set, cidx));
+      if (device.device_type == kDLCPU) {
+        // we set PAPI_INHERIT to make it so threads created after this inherit the event_set.
+        PAPI_option_t opt;
+        memset(&opt, 0x0, sizeof(PAPI_option_t));
+        opt.inherit.inherit = PAPI_INHERIT_ALL;
+        opt.inherit.eventset = event_set;
+        PAPI_CALL(PAPI_set_opt(PAPI_INHERIT, &opt));
+      }
+
+      // load default metrics for device or read them from an environment variable
+      std::vector<std::string> metrics;
+      std::string dev_name = DeviceName(device.device_type);
+      std::transform(dev_name.begin(), dev_name.end(), dev_name.begin(),
+                     [](unsigned char c) { return std::toupper(c); });
+      const char* env_p =
+          std::getenv((std::string("TVM_PAPI_") + dev_name + std::string("_METRICS")).c_str());
+      if (env_p != nullptr) {
+        std::string metric_string = env_p;
+        size_t loc = 0;
+        while (loc < metric_string.size()) {
+          size_t next = metric_string.find(';', loc);
+          if (next == metric_string.npos) {
+            next = metric_string.size();
+          }
+          metrics.push_back(metric_string.substr(loc, next - loc));
+          loc = next + 1;
+        }
+      } else {
+        auto it = default_metrics.find(device.device_type);
+        if (it != default_metrics.end()) {
+          metrics = it->second;
+        } else {
+          LOG(WARNING) << "No default metrics set for " << dev_name
+                       << ". You can specify metrics with the environment variable TVM_PAPI_"
+                       << dev_name << "_METRICS.";
+        }
+      }
+      // skip if no metrics exist
+      if (metrics.size() == 0) {
+        continue;
+      }
+      papi_metrics[device] = metrics;
+
+      if (static_cast<int>(metrics.size()) > PAPI_num_cmp_hwctrs(cidx)) {
+        PAPI_CALL(PAPI_set_multiplex(event_set));

Review comment:
       is this case worth a warning message?
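
       A rough sketch of what such a warning could say, kept independent of PAPI so it compiles on its own. `MultiplexWarning` is a hypothetical helper; in the PR the two counts would come from `metrics.size()` and `PAPI_num_cmp_hwctrs(cidx)`, and the returned text would go to `LOG(WARNING)`.

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical helper: returns an empty string when every requested metric
// fits in a hardware counter, otherwise the text of a warning to log.
// A negative counter count (an error value) is treated as "does not fit",
// which mirrors the comparison in the PR.
inline std::string MultiplexWarning(std::size_t num_metrics, int num_hw_counters,
                                    const std::string& dev_name) {
  if (num_hw_counters >= 0 &&
      num_metrics <= static_cast<std::size_t>(num_hw_counters)) {
    return "";
  }
  std::ostringstream os;
  os << "Requested " << num_metrics << " metrics on " << dev_name << " but only "
     << num_hw_counters << " hardware counters are available; counters will be "
     << "multiplexed, so reported values are estimates.";
  return os.str();
}
```

       The message matters because multiplexed counters are time-sliced and extrapolated, so users may otherwise read the numbers as exact counts.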

##########
File path: src/runtime/contrib/papi/papi.cc
##########
@@ -0,0 +1,275 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*!
+ * \brief Performance counters for profiling via the PAPI library.
+ */
+#ifndef TVM_RUNTIME_CONTRIB_PAPI_PAPI_H_
+#define TVM_RUNTIME_CONTRIB_PAPI_PAPI_H_
+
+#include <papi.h>
+#include <tvm/runtime/profiling.h>
+
+#include <string>
+#include <unordered_map>
+#include <vector>
+
+namespace tvm {
+namespace runtime {
+namespace profiling {
+
+#define PAPI_CALL(func)                                                         \
+  {                                                                             \
+    int e = (func);                                                             \
+    if (e < 0) {                                                                \
+      LOG(FATAL) << "PAPIError: " << e << " " << std::string(PAPI_strerror(e)); \

Review comment:
       Do you want to stringify func (`#func`) and include it in the log message?
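
       A sketch of the stringified variant, using the preprocessor's `#` operator to turn the macro argument into a string literal. So that it builds without PAPI or TVM, it throws `std::runtime_error` instead of calling `LOG(FATAL)`, and `CHECKED_CALL`, `FakeStrError` and `AlwaysFails` are hypothetical stand-ins for `PAPI_CALL`, `PAPI_strerror` and a real PAPI function.

```cpp
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for PAPI_strerror.
inline const char* FakeStrError(int code) { return code < 0 ? "failure" : "ok"; }

// Evaluates `func` once; on a negative status, reports the failing expression
// text (via #func), the status code and its description.
// do { ... } while (0) makes the macro behave like a single statement.
#define CHECKED_CALL(func)                                   \
  do {                                                       \
    int e = (func);                                          \
    if (e < 0) {                                             \
      std::ostringstream os;                                 \
      os << "PAPIError: " << #func << " returned " << e      \
         << " " << FakeStrError(e);                          \
      throw std::runtime_error(os.str());                    \
    }                                                        \
  } while (0)

// Hypothetical function that fails for any positive input.
inline int AlwaysFails(int x) { return -x; }
```

       With this, `CHECKED_CALL(AlwaysFails(3))` produces a message that contains the text `AlwaysFails(3)`, so the log shows which call failed.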

##########
File path: src/runtime/contrib/papi/papi.cc
##########
@@ -0,0 +1,275 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*!
+ * \brief Performance counters for profiling via the PAPI library.
+ */
+#ifndef TVM_RUNTIME_CONTRIB_PAPI_PAPI_H_
+#define TVM_RUNTIME_CONTRIB_PAPI_PAPI_H_
+
+#include <papi.h>
+#include <tvm/runtime/profiling.h>
+
+#include <string>
+#include <unordered_map>
+#include <vector>
+
+namespace tvm {
+namespace runtime {
+namespace profiling {
+
+#define PAPI_CALL(func)                                                         \
+  {                                                                             \
+    int e = (func);                                                             \
+    if (e < 0) {                                                                \
+      LOG(FATAL) << "PAPIError: " << e << " " << std::string(PAPI_strerror(e)); \
+    }                                                                           \
+  }
+
+static const std::unordered_map<DLDeviceType, std::vector<std::string>> default_metrics = {
+    {kDLCPU,
+     {"perf::CYCLES", "perf::STALLED-CYCLES-FRONTEND", "perf::STALLED-CYCLES-BACKEND",
+      "perf::INSTRUCTIONS", "perf::CACHE-MISSES"}},
+    {kDLGPU, {"cuda:::event:elapsed_cycles_sm:device=0"}}};
+
+/*! \brief Object that holds the values of counters at the start of a function call. */
+struct PAPIEventSetNode : public Object {
+  /*! \brief The starting values of counters for all metrics of a specific device. */
+  std::vector<long_long> start_values;
+  /*! \brief The device these counters are for. */
+  Device dev;
+
+  explicit PAPIEventSetNode(std::vector<long_long> start_values, Device dev)
+      : start_values(start_values), dev(dev) {}
+
+  static constexpr const char* _type_key = "PAPIEventSetNode";
+  TVM_DECLARE_FINAL_OBJECT_INFO(PAPIEventSetNode, Object);
+};
+
+int component_for_device(Device dev) {
+  std::string component_name;
+  switch (dev.device_type) {
+    case kDLCPU:
+    case kDLCPUPinned:
+      component_name = "perf_event";
+      break;
+    case kDLGPU:
+      component_name = "cuda";
+      break;
+    case kDLROCM:
+      component_name = "rocm";
+      break;
+    default:
+      LOG(WARNING) << "PAPI does not support device " << DeviceName(dev.device_type);
+      return -1;
+  }
+  int cidx = PAPI_get_component_index(component_name.c_str());
+  if (cidx < 0) {
+    LOG(FATAL) << "Cannot find PAPI component \"" << component_name
+               << "\". Maybe you need to build PAPI with support for this component (use "
+                  "`./configure --components="
+               << component_name << "`).";
+  }
+  return cidx;
+}
+
+/*! \brief MetricCollectorNode for PAPI metrics.
+ *
+ * PAPI (Performance Application Programming Interface) collects metrics on a
+ * variety of platforms including cpu, cuda and rocm.
+ *
+ * PAPI is available at https://bitbucket.org/icl/papi/src/master/.
+ *
+ * Users can change the metrics collected by setting the environment
+ * variable `TVM_PAPI_${device_name}_METRICS` to a semicolon-separated list
+ * of metrics. Use the `papi_native_avail` tool to find the name of all
+ * available metrics.
+ */
+struct PAPIMetricCollectorNode final : public MetricCollectorNode {
+  explicit PAPIMetricCollectorNode(Array<DeviceWrapper> devices) {
+    if (!PAPI_is_initialized()) {
+      PAPI_CALL(PAPI_library_init(PAPI_VER_CURRENT));
+    }
+
+    // create event sets for each device
+    for (auto wrapped_device : devices) {
+      Device device = wrapped_device->device;
+      int cidx = component_for_device(device);
+      // unknown device, skipping
+      if (cidx < 0) {
+        continue;
+      }
+
+      const PAPI_component_info_t* component;
+      component = PAPI_get_component_info(cidx);
+      if (component->disabled) {
+        std::string help_message = "";
+        switch (device.device_type) {
+          case kDLCPU:
+          case kDLCPUPinned:
+            help_message =
+                "Try setting `sudo sh -c 'echo 1 >/proc/sys/kernel/perf_event_paranoid'`";
+            break;
+          case kDLGPU:
+            help_message =
+                "Try enabling gpu profiling with `modprobe nvidia "
+                "NVreg_RestrictProfilingToAdminUsers=0`. If that does not work, try adding  "
+                "`options nvidia \"NVreg_RestrictProfilingToAdminUsers=0\"` to "
+                "`/etc/modprobe.d/nvidia-kernel-common.conf`.";
+            break;
+          default:
+            break;
+        }
+        LOG(WARNING) << "PAPI could not initialize counters for " << DeviceName(device.device_type)
+                     << ": " << component->disabled_reason << "\n"
+                     << help_message;
+        continue;
+      }
+
+      int event_set = PAPI_NULL;
+      PAPI_CALL(PAPI_create_eventset(&event_set));
+      PAPI_CALL(PAPI_assign_eventset_component(event_set, cidx));
+      if (device.device_type == kDLCPU) {
+        // we set PAPI_INHERIT to make it so threads created after this inherit the event_set.
+        PAPI_option_t opt;
+        memset(&opt, 0x0, sizeof(PAPI_option_t));
+        opt.inherit.inherit = PAPI_INHERIT_ALL;
+        opt.inherit.eventset = event_set;
+        PAPI_CALL(PAPI_set_opt(PAPI_INHERIT, &opt));
+      }
+
+      // load default metrics for device or read them from an environment variable
+      std::vector<std::string> metrics;
+      std::string dev_name = DeviceName(device.device_type);
+      std::transform(dev_name.begin(), dev_name.end(), dev_name.begin(),
+                     [](unsigned char c) { return std::toupper(c); });
+      const char* env_p =
+          std::getenv((std::string("TVM_PAPI_") + dev_name + std::string("_METRICS")).c_str());
+      if (env_p != nullptr) {
+        std::string metric_string = env_p;
+        size_t loc = 0;
+        while (loc < metric_string.size()) {
+          size_t next = metric_string.find(';', loc);
+          if (next == metric_string.npos) {
+            next = metric_string.size();
+          }
+          metrics.push_back(metric_string.substr(loc, next - loc));
+          loc = next + 1;
+        }
+      } else {
+        auto it = default_metrics.find(device.device_type);
+        if (it != default_metrics.end()) {
+          metrics = it->second;
+        } else {
+          LOG(WARNING) << "No default metrics set for " << dev_name
+                       << ". You can specify metrics with the environment variable TVM_PAPI_"
+                       << dev_name << "_METRICS.";
+        }
+      }
+      // skip if no metrics exist
+      if (metrics.size() == 0) {
+        continue;
+      }
+      papi_metrics[device] = metrics;
+
+      if (static_cast<int>(metrics.size()) > PAPI_num_cmp_hwctrs(cidx)) {
+        PAPI_CALL(PAPI_set_multiplex(event_set));
+      }
+
+      // add all the metrics
+      for (auto metric : metrics) {
+        int e = PAPI_add_named_event(event_set, metric.c_str());
+        if (e != PAPI_OK) {
+          LOG(FATAL) << "PAPIError: " << e << " " << std::string(PAPI_strerror(e)) << ": " << metric
+                     << ".";
+        }
+      }
+      // Because we may have multiple calls in flight at the same time, we
+      // start all the timers when we initialize. Then we calculate the metrics
+      // counts for a call by comparing counter values at the start vs end of
+      // the call.
+      PAPI_CALL(PAPI_start(event_set));
+      event_sets[device] = event_set;
+    }
+  }
+
+  /*! \brief Called right before a function call.
+   * \param dev The device the function will be run on.
+   * \returns A `PAPIEventSetNode` containing values for the counters at the
+   * start of the call. Passed to a corresponding `Stop` call.
+   */
+  ObjectRef Start(Device dev) final {
+    // Record counter values at the start of the call, so we can calculate the
+    // metrics for the call by comparing the values at the end of the call.
+    auto it = event_sets.find(dev);
+    if (it != event_sets.end()) {
+      int event_set = it->second;
+      std::vector<long_long> values(papi_metrics[dev].size());
+      PAPI_CALL(PAPI_read(event_set, values.data()));
+      return ObjectRef(make_object<PAPIEventSetNode>(values, dev));
+    } else {
+      return ObjectRef(nullptr);
+    }
+  }
+
+  /*! \brief Called right after a function call.
+   * \param obj `PAPIEventSetNode` created by a call to `Start`.
+   * \returns A mapping from metric name to value.
+   */
+  Map<String, ObjectRef> Stop(ObjectRef obj) final {
+    const PAPIEventSetNode* event_set_node = obj.as<PAPIEventSetNode>();
+    std::vector<long_long> end_values(papi_metrics[event_set_node->dev].size());
+    PAPI_CALL(PAPI_read(event_sets[event_set_node->dev], end_values.data()));
+    std::unordered_map<String, ObjectRef> reported_metrics;
+    for (size_t i = 0; i < end_values.size(); i++) {
+      reported_metrics[papi_metrics[event_set_node->dev][i]] =

Review comment:
       What happens if a counter wraps around between `Start` and `Stop`?
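
One way to make the start/end subtraction robust, sketched under the assumption that the counters behave as 64-bit values that wrap modulo 2^64 (the PR does not state this, and `CounterDelta` is a hypothetical helper, not part of the patch): do the subtraction in unsigned arithmetic, where overflow is well defined.

```cpp
#include <cstdint>

// Hypothetical helper (not in the PR): difference between two readings of a
// free-running counter, assuming the counter wraps modulo 2^64.
// Unsigned subtraction is well defined on overflow; signed subtraction is not.
inline std::uint64_t CounterDelta(long long start, long long end) {
  return static_cast<std::uint64_t>(end) - static_cast<std::uint64_t>(start);
}
```

With this, a start reading just below the wrap point and an end reading just above it still yield the small positive difference. It cannot detect a counter that wrapped more than once during a single call.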

##########
File path: src/runtime/contrib/papi/papi.cc
##########
@@ -0,0 +1,275 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*!
+ * \brief Performance counters for profiling via the PAPI library.
+ */
+#ifndef TVM_RUNTIME_CONTRIB_PAPI_PAPI_H_
+#define TVM_RUNTIME_CONTRIB_PAPI_PAPI_H_
+
+#include <papi.h>
+#include <tvm/runtime/profiling.h>
+
+#include <string>
+#include <unordered_map>
+#include <vector>
+
+namespace tvm {
+namespace runtime {
+namespace profiling {
+
+#define PAPI_CALL(func)                                                         \
+  {                                                                             \
+    int e = (func);                                                             \
+    if (e < 0) {                                                                \

Review comment:
       Should this compare against `PAPI_OK` instead of checking `e < 0`?
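
A sketch of what a stricter check could look like. One caveat, as far as I know from the PAPI documentation: `PAPI_library_init` returns the version number (`PAPI_VER_CURRENT`) on success rather than `PAPI_OK`, so a blanket `e != PAPI_OK` test would reject the init call at the top of the constructor. The code below uses stand-in constants (`kOk`, `kVersion`) and hypothetical helpers (`CheckStatus`, `CheckInit`) so that it builds without `papi.h`; it is not the PR's macro.

```cpp
#include <stdexcept>
#include <string>

// Stand-ins so the sketch builds without papi.h; in real code these would be
// PAPI_OK and PAPI_VER_CURRENT.
constexpr int kOk = 0;
constexpr int kVersion = 7;  // arbitrary placeholder value

// Hypothetical helper: for calls that report success as exactly kOk.
inline void CheckStatus(int status, const std::string& what) {
  if (status != kOk) {
    throw std::runtime_error(what + " failed with status " + std::to_string(status));
  }
}

// Hypothetical helper: for an init call that returns the version on success.
inline void CheckInit(int returned) {
  if (returned != kVersion) {
    throw std::runtime_error("library init failed with status " + std::to_string(returned));
  }
}
```

Splitting the check in two keeps the strict comparison for ordinary calls while leaving the init call's different success value explicit.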

##########
File path: src/runtime/contrib/papi/papi.cc
##########
@@ -0,0 +1,275 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*!
+ * \brief Performance counters for profiling via the PAPI library.
+ */
+#ifndef TVM_RUNTIME_CONTRIB_PAPI_PAPI_H_
+#define TVM_RUNTIME_CONTRIB_PAPI_PAPI_H_
+
+#include <papi.h>
+#include <tvm/runtime/profiling.h>
+
+#include <string>
+#include <unordered_map>
+#include <vector>
+
+namespace tvm {
+namespace runtime {
+namespace profiling {
+
+#define PAPI_CALL(func)                                                         \
+  {                                                                             \
+    int e = (func);                                                             \
+    if (e < 0) {                                                                \
+      LOG(FATAL) << "PAPIError: " << e << " " << std::string(PAPI_strerror(e)); \
+    }                                                                           \
+  }
+
+static const std::unordered_map<DLDeviceType, std::vector<std::string>> default_metrics = {
+    {kDLCPU,
+     {"perf::CYCLES", "perf::STALLED-CYCLES-FRONTEND", "perf::STALLED-CYCLES-BACKEND",
+      "perf::INSTRUCTIONS", "perf::CACHE-MISSES"}},
+    {kDLGPU, {"cuda:::event:elapsed_cycles_sm:device=0"}}};
+
+/*! \brief Object that holds the values of counters at the start of a function call. */
+struct PAPIEventSetNode : public Object {
+  /*! \brief The starting values of counters for all metrics of a specific device. */
+  std::vector<long_long> start_values;
+  /*! \brief The device these counters are for. */
+  Device dev;
+
+  explicit PAPIEventSetNode(std::vector<long_long> start_values, Device dev)
+      : start_values(start_values), dev(dev) {}
+
+  static constexpr const char* _type_key = "PAPIEventSetNode";
+  TVM_DECLARE_FINAL_OBJECT_INFO(PAPIEventSetNode, Object);
+};
+
+int component_for_device(Device dev) {
+  std::string component_name;
+  switch (dev.device_type) {
+    case kDLCPU:
+    case kDLCPUPinned:
+      component_name = "perf_event";
+      break;
+    case kDLGPU:
+      component_name = "cuda";
+      break;
+    case kDLROCM:
+      component_name = "rocm";
+      break;
+    default:
+      LOG(WARNING) << "PAPI does not support device " << DeviceName(dev.device_type);
+      return -1;
+  }
+  int cidx = PAPI_get_component_index(component_name.c_str());
+  if (cidx < 0) {
+    LOG(FATAL) << "Cannot find PAPI component \"" << component_name
+               << "\". Maybe you need to build PAPI with support for this component (use "
+                  "`./configure --components="
+               << component_name << "`).";
+  }
+  return cidx;
+}
+
+/*! \brief MetricCollectorNode for PAPI metrics.
+ *
+ * PAPI (Performance Application Programming Interface) collects metrics on a
+ * variety of platforms including cpu, cuda and rocm.
+ *
+ * PAPI is available at https://bitbucket.org/icl/papi/src/master/.
+ *
+ * Users can change the metrics collected by setting the environment
+ * variable `TVM_PAPI_${device_name}_METRICS` to a semicolon-separated list
+ * of metrics. Use the `papi_native_avail` tool to find the names of all
+ * available metrics.
+ */
+struct PAPIMetricCollectorNode final : public MetricCollectorNode {
+  explicit PAPIMetricCollectorNode(Array<DeviceWrapper> devices) {
+    if (!PAPI_is_initialized()) {
+      PAPI_CALL(PAPI_library_init(PAPI_VER_CURRENT));
+    }
+
+    // create event sets for each device
+    for (auto wrapped_device : devices) {
+      Device device = wrapped_device->device;
+      int cidx = component_for_device(device);
+      // unknown device, skipping
+      if (cidx < 0) {
+        continue;
+      }
+
+      const PAPI_component_info_t* component;
+      component = PAPI_get_component_info(cidx);
+      if (component->disabled) {
+        std::string help_message = "";
+        switch (device.device_type) {
+          case kDLCPU:
+          case kDLCPUPinned:
+            help_message =
+                "Try setting `sudo sh -c 'echo 1 >/proc/sys/kernel/perf_event_paranoid'`";
+            break;
+          case kDLGPU:
+            help_message =
+                "Try enabling gpu profiling with `modprobe nvidia "
+                "NVreg_RestrictProfilingToAdminUsers=0`. If that does not work, try adding  "
+                "`options nvidia \"NVreg_RestrictProfilingToAdminUsers=0\"` to "
+                "`/etc/modprobe.d/nvidia-kernel-common.conf`.";
+            break;
+          default:
+            break;
+        }
+        LOG(WARNING) << "PAPI could not initialize counters for " << DeviceName(device.device_type)
+                     << ": " << component->disabled_reason << "\n"
+                     << help_message;
+        continue;
+      }
+
+      int event_set = PAPI_NULL;
+      PAPI_CALL(PAPI_create_eventset(&event_set));
+      PAPI_CALL(PAPI_assign_eventset_component(event_set, cidx));
+      if (device.device_type == kDLCPU) {
+        // we set PAPI_INHERIT to make it so threads created after this inherit the event_set.
+        PAPI_option_t opt;
+        memset(&opt, 0x0, sizeof(PAPI_option_t));
+        opt.inherit.inherit = PAPI_INHERIT_ALL;
+        opt.inherit.eventset = event_set;
+        PAPI_CALL(PAPI_set_opt(PAPI_INHERIT, &opt));
+      }
+
+      // load default metrics for device or read them from an environment variable
+      std::vector<std::string> metrics;
+      std::string dev_name = DeviceName(device.device_type);
+      std::transform(dev_name.begin(), dev_name.end(), dev_name.begin(),
+                     [](unsigned char c) { return std::toupper(c); });
+      const char* env_p =
+          std::getenv((std::string("TVM_PAPI_") + dev_name + std::string("_METRICS")).c_str());
+      if (env_p != nullptr) {
+        std::string metric_string = env_p;
+        size_t loc = 0;
+        while (loc < metric_string.size()) {
+          size_t next = metric_string.find(';', loc);
+          if (next == metric_string.npos) {
+            next = metric_string.size();
+          }
+          metrics.push_back(metric_string.substr(loc, next - loc));
+          loc = next + 1;
+        }
+      } else {
+        auto it = default_metrics.find(device.device_type);
+        if (it != default_metrics.end()) {
+          metrics = it->second;
+        } else {
+          LOG(WARNING) << "No default metrics set for " << dev_name
+                       << ". You can specify metrics with the environment variable TVM_PAPI_"
+                       << dev_name << "_METRICS.";
+        }
+      }
+      // skip if no metrics exist
+      if (metrics.size() == 0) {
+        continue;
+      }
+      papi_metrics[device] = metrics;
+
+      if (static_cast<int>(metrics.size()) > PAPI_num_cmp_hwctrs(cidx)) {
+        PAPI_CALL(PAPI_set_multiplex(event_set));
+      }
+
+      // add all the metrics
+      for (auto metric : metrics) {
+        int e = PAPI_add_named_event(event_set, metric.c_str());
+        if (e != PAPI_OK) {
+          LOG(FATAL) << "PAPIError: " << e << " " << std::string(PAPI_strerror(e)) << ": " << metric
+                     << ".";
+        }
+      }
+      // Because we may have multiple calls in flight at the same time, we
+      // start all the timers when we initialize. Then we calculate the metrics
+      // counts for a call by comparing counter values at the start vs end of
+      // the call.
+      PAPI_CALL(PAPI_start(event_set));
+      event_sets[device] = event_set;
+    }
+  }
+
+  /*! \brief Called right before a function call.
+   * \param dev The device the function will be run on.
+   * \returns A `PAPIEventSetNode` containing values for the counters at the
+   * start of the call. Passed to a corresponding `Stop` call.
+   */
+  ObjectRef Start(Device dev) final {
+    // Record counter values at the start of the call, so we can calculate the
+    // metrics for the call by comparing the values at the end of the call.
+    auto it = event_sets.find(dev);
+    if (it != event_sets.end()) {
+      int event_set = it->second;
+      std::vector<long_long> values(papi_metrics[dev].size());
+      PAPI_CALL(PAPI_read(event_set, values.data()));
+      return ObjectRef(make_object<PAPIEventSetNode>(values, dev));
+    } else {
+      return ObjectRef(nullptr);
+    }
+  }
+
+  /*! \brief Called right after a function call.

Review comment:
       same question

##########
File path: src/runtime/contrib/papi/papi.cc
##########
@@ -0,0 +1,275 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*!
+ * \brief Performance counters for profiling via the PAPI library.
+ */
+#ifndef TVM_RUNTIME_CONTRIB_PAPI_PAPI_H_
+#define TVM_RUNTIME_CONTRIB_PAPI_PAPI_H_
+
+#include <papi.h>
+#include <tvm/runtime/profiling.h>
+
+#include <string>
+#include <unordered_map>
+#include <vector>
+
+namespace tvm {
+namespace runtime {
+namespace profiling {
+
+#define PAPI_CALL(func)                                                         \
+  {                                                                             \
+    int e = (func);                                                             \
+    if (e < 0) {                                                                \
+      LOG(FATAL) << "PAPIError: " << e << " " << std::string(PAPI_strerror(e)); \
+    }                                                                           \
+  }
+
+static const std::unordered_map<DLDeviceType, std::vector<std::string>> default_metrics = {
+    {kDLCPU,
+     {"perf::CYCLES", "perf::STALLED-CYCLES-FRONTEND", "perf::STALLED-CYCLES-BACKEND",
+      "perf::INSTRUCTIONS", "perf::CACHE-MISSES"}},
+    {kDLGPU, {"cuda:::event:elapsed_cycles_sm:device=0"}}};
+
+/*! \brief Object that holds the values of counters at the start of a function call. */
+struct PAPIEventSetNode : public Object {
+  /*! \brief The starting values of counters for all metrics of a specific device. */
+  std::vector<long_long> start_values;
+  /*! \brief The device these counters are for. */
+  Device dev;
+
+  explicit PAPIEventSetNode(std::vector<long_long> start_values, Device dev)
+      : start_values(start_values), dev(dev) {}
+
+  static constexpr const char* _type_key = "PAPIEventSetNode";
+  TVM_DECLARE_FINAL_OBJECT_INFO(PAPIEventSetNode, Object);
+};
+
+int component_for_device(Device dev) {
+  std::string component_name;
+  switch (dev.device_type) {
+    case kDLCPU:
+    case kDLCPUPinned:
+      component_name = "perf_event";
+      break;
+    case kDLGPU:
+      component_name = "cuda";
+      break;
+    case kDLROCM:
+      component_name = "rocm";
+      break;
+    default:
+      LOG(WARNING) << "PAPI does not support device " << DeviceName(dev.device_type);
+      return -1;
+  }
+  int cidx = PAPI_get_component_index(component_name.c_str());
+  if (cidx < 0) {
+    LOG(FATAL) << "Cannot find PAPI component \"" << component_name
+               << "\". Maybe you need to build PAPI with support for this component (use "
+                  "`./configure --components="
+               << component_name << "`).";
+  }
+  return cidx;
+}
+
+/*! \brief MetricCollectorNode for PAPI metrics.
+ *
+ * PAPI (Performance Application Programming Interface) collects metrics on a
+ * variety of platforms including cpu, cuda and rocm.
+ *
+ * PAPI is available at https://bitbucket.org/icl/papi/src/master/.
+ *
+ * Users can change the metrics collected by setting the environment
+ * variable `TVM_PAPI_${device_name}_METRICS` to a semicolon-separated list
+ * of metrics. Use the `papi_native_avail` tool to find the names of all
+ * available metrics.
+ */
+struct PAPIMetricCollectorNode final : public MetricCollectorNode {
+  explicit PAPIMetricCollectorNode(Array<DeviceWrapper> devices) {
+    if (!PAPI_is_initialized()) {
+      PAPI_CALL(PAPI_library_init(PAPI_VER_CURRENT));
+    }
+
+    // create event sets for each device
+    for (auto wrapped_device : devices) {
+      Device device = wrapped_device->device;
+      int cidx = component_for_device(device);
+      // unknown device, skipping
+      if (cidx < 0) {
+        continue;
+      }
+
+      const PAPI_component_info_t* component;
+      component = PAPI_get_component_info(cidx);
+      if (component->disabled) {
+        std::string help_message = "";
+        switch (device.device_type) {
+          case kDLCPU:
+          case kDLCPUPinned:
+            help_message =
+                "Try setting `sudo sh -c 'echo 1 >/proc/sys/kernel/perf_event_paranoid'`";
+            break;
+          case kDLGPU:
+            help_message =
+                "Try enabling gpu profiling with `modprobe nvidia "
+                "NVreg_RestrictProfilingToAdminUsers=0`. If that does not work, try adding  "
+                "`options nvidia \"NVreg_RestrictProfilingToAdminUsers=0\"` to "
+                "`/etc/modprobe.d/nvidia-kernel-common.conf`.";
+            break;
+          default:
+            break;
+        }
+        LOG(WARNING) << "PAPI could not initialize counters for " << DeviceName(device.device_type)
+                     << ": " << component->disabled_reason << "\n"
+                     << help_message;
+        continue;
+      }
+
+      int event_set = PAPI_NULL;
+      PAPI_CALL(PAPI_create_eventset(&event_set));
+      PAPI_CALL(PAPI_assign_eventset_component(event_set, cidx));
+      if (device.device_type == kDLCPU) {
+        // we set PAPI_INHERIT to make it so threads created after this inherit the event_set.
+        PAPI_option_t opt;
+        memset(&opt, 0x0, sizeof(PAPI_option_t));
+        opt.inherit.inherit = PAPI_INHERIT_ALL;
+        opt.inherit.eventset = event_set;
+        PAPI_CALL(PAPI_set_opt(PAPI_INHERIT, &opt));
+      }
+
+      // load default metrics for device or read them from an environment variable
+      std::vector<std::string> metrics;
+      std::string dev_name = DeviceName(device.device_type);
+      std::transform(dev_name.begin(), dev_name.end(), dev_name.begin(),
+                     [](unsigned char c) { return std::toupper(c); });
+      const char* env_p =
+          std::getenv((std::string("TVM_PAPI_") + dev_name + std::string("_METRICS")).c_str());
+      if (env_p != nullptr) {
+        std::string metric_string = env_p;
+        size_t loc = 0;
+        while (loc < metric_string.size()) {
+          size_t next = metric_string.find(';', loc);
+          if (next == metric_string.npos) {
+            next = metric_string.size();
+          }
+          metrics.push_back(metric_string.substr(loc, next - loc));
+          loc = next + 1;
+        }
+      } else {
+        auto it = default_metrics.find(device.device_type);
+        if (it != default_metrics.end()) {
+          metrics = it->second;
+        } else {
+          LOG(WARNING) << "No default metrics set for " << dev_name
+                       << ". You can specify metrics with the environment variable TVM_PAPI_"
+                       << dev_name << "_METRICS.";
+        }
+      }
+      // skip if no metrics exist
+      if (metrics.size() == 0) {
+        continue;
+      }
+      papi_metrics[device] = metrics;
+
+      if (static_cast<int>(metrics.size()) > PAPI_num_cmp_hwctrs(cidx)) {
+        PAPI_CALL(PAPI_set_multiplex(event_set));
+      }
+
+      // add all the metrics
+      for (auto metric : metrics) {
+        int e = PAPI_add_named_event(event_set, metric.c_str());
+        if (e != PAPI_OK) {
+          LOG(FATAL) << "PAPIError: " << e << " " << std::string(PAPI_strerror(e)) << ": " << metric
+                     << ".";
+        }
+      }
+      // Because we may have multiple calls in flight at the same time, we
+      // start all the timers when we initialize. Then we calculate the metrics
+      // counts for a call by comparing counter values at the start vs end of
+      // the call.
+      PAPI_CALL(PAPI_start(event_set));
+      event_sets[device] = event_set;
+    }
+  }
+
+  /*! \brief Called right before a function call.

Review comment:
       This says when it is called, but what does it actually do?
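
For readers following along, the pattern in the quoted diff appears to be: the counters are started once and keep running, `Start` takes a snapshot of their current values, and `Stop` reads them again and reports the per-metric difference. A minimal self-contained sketch of that pattern follows; `CounterReader`, `Snapshot`, `StartCall` and `StopCall` are hypothetical names standing in for `PAPI_read` and the PR's classes.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical stand-in for reading a set of free-running counters.
using CounterReader = std::function<std::vector<long long>()>;

// Counter values captured when a call begins.
struct Snapshot {
  std::vector<long long> start_values;
};

// Before the call: record where every counter currently stands.
inline Snapshot StartCall(const CounterReader& read) { return Snapshot{read()}; }

// After the call: read again and report how far each counter advanced.
// Assumes both reads return the same number of counters, in the same order.
inline std::vector<long long> StopCall(const CounterReader& read, const Snapshot& snap) {
  std::vector<long long> end_values = read();
  std::vector<long long> deltas(end_values.size());
  for (std::size_t i = 0; i < end_values.size(); ++i) {
    deltas[i] = end_values[i] - snap.start_values[i];
  }
  return deltas;
}
```

Because nothing is reset or stopped per call, several calls can be measured at overlapping times, each holding its own snapshot.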

##########
File path: src/runtime/contrib/papi/papi.cc
##########
@@ -0,0 +1,275 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*!
+ * \brief Performance counters for profiling via the PAPI library.
+ */
+#ifndef TVM_RUNTIME_CONTRIB_PAPI_PAPI_H_
+#define TVM_RUNTIME_CONTRIB_PAPI_PAPI_H_
+
+#include <papi.h>
+#include <tvm/runtime/profiling.h>
+
+#include <string>
+#include <unordered_map>
+#include <vector>
+
+namespace tvm {
+namespace runtime {
+namespace profiling {
+
+#define PAPI_CALL(func)                                                         \
+  {                                                                             \
+    int e = (func);                                                             \
+    if (e < 0) {                                                                \
+      LOG(FATAL) << "PAPIError: " << e << " " << std::string(PAPI_strerror(e)); \
+    }                                                                           \
+  }
+
+static const std::unordered_map<DLDeviceType, std::vector<std::string>> default_metrics = {
+    {kDLCPU,
+     {"perf::CYCLES", "perf::STALLED-CYCLES-FRONTEND", "perf::STALLED-CYCLES-BACKEND",
+      "perf::INSTRUCTIONS", "perf::CACHE-MISSES"}},
+    {kDLGPU, {"cuda:::event:elapsed_cycles_sm:device=0"}}};
+
+/*! \brief Object that holds the values of counters at the start of a function call. */
+struct PAPIEventSetNode : public Object {
+  /*! \brief The starting values of counters for all metrics of a specific device. */
+  std::vector<long_long> start_values;
+  /*! \brief The device these counters are for. */
+  Device dev;
+
+  explicit PAPIEventSetNode(std::vector<long_long> start_values, Device dev)
+      : start_values(start_values), dev(dev) {}
+
+  static constexpr const char* _type_key = "PAPIEventSetNode";
+  TVM_DECLARE_FINAL_OBJECT_INFO(PAPIEventSetNode, Object);
+};
+
+int component_for_device(Device dev) {
+  std::string component_name;
+  switch (dev.device_type) {
+    case kDLCPU:
+    case kDLCPUPinned:
+      component_name = "perf_event";
+      break;
+    case kDLGPU:
+      component_name = "cuda";
+      break;
+    case kDLROCM:
+      component_name = "rocm";
+      break;
+    default:
+      LOG(WARNING) << "PAPI does not support device " << DeviceName(dev.device_type);
+      return -1;
+  }
+  int cidx = PAPI_get_component_index(component_name.c_str());
+  if (cidx < 0) {
+    LOG(FATAL) << "Cannot find PAPI component \"" << component_name
+               << "\". Maybe you need to build PAPI with support for this component (use "
+                  "`./configure --components="
+               << component_name << "`).";
+  }
+  return cidx;
+}
+
+/*! \brief MetricCollectorNode for PAPI metrics.
+ *
+ * PAPI (Performance Application Programming Interface) collects metrics on a
+ * variety of platforms including cpu, cuda and rocm.
+ *
+ * PAPI is available at https://bitbucket.org/icl/papi/src/master/.
+ *
+ * Users can change the metrics collected by setting the environment
+ * variable `TVM_PAPI_${device_name}_METRICS` to a semicolon-separated list
+ * of metrics. Use the `papi_native_avail` tool to find the names of all
+ * available metrics.
+ */
+struct PAPIMetricCollectorNode final : public MetricCollectorNode {
+  explicit PAPIMetricCollectorNode(Array<DeviceWrapper> devices) {
+    if (!PAPI_is_initialized()) {
+      PAPI_CALL(PAPI_library_init(PAPI_VER_CURRENT));
+    }
+
+    // create event sets for each device
+    for (auto wrapped_device : devices) {
+      Device device = wrapped_device->device;
+      int cidx = component_for_device(device);
+      // unknown device, skipping
+      if (cidx < 0) {
+        continue;
+      }
+
+      const PAPI_component_info_t* component;
+      component = PAPI_get_component_info(cidx);
+      if (component->disabled) {
+        std::string help_message = "";
+        switch (device.device_type) {
+          case kDLCPU:
+          case kDLCPUPinned:
+            help_message =
+                "Try setting `sudo sh -c 'echo 1 >/proc/sys/kernel/perf_event_paranoid'`";
+            break;
+          case kDLGPU:
+            help_message =
+                "Try enabling gpu profiling with `modprobe nvidia "
+                "NVreg_RestrictProfilingToAdminUsers=0`. If that does not work, try adding  "
+                "`options nvidia \"NVreg_RestrictProfilingToAdminUsers=0\"` to "
+                "`/etc/modprobe.d/nvidia-kernel-common.conf`.";
+            break;
+          default:
+            break;
+        }
+        LOG(WARNING) << "PAPI could not initialize counters for " << DeviceName(device.device_type)
+                     << ": " << component->disabled_reason << "\n"
+                     << help_message;
+        continue;
+      }
+
+      int event_set = PAPI_NULL;
+      PAPI_CALL(PAPI_create_eventset(&event_set));
+      PAPI_CALL(PAPI_assign_eventset_component(event_set, cidx));
+      if (device.device_type == kDLCPU) {
+        // we set PAPI_INHERIT to make it so threads created after this inherit the event_set.
+        PAPI_option_t opt;
+        memset(&opt, 0x0, sizeof(PAPI_option_t));
+        opt.inherit.inherit = PAPI_INHERIT_ALL;
+        opt.inherit.eventset = event_set;
+        PAPI_CALL(PAPI_set_opt(PAPI_INHERIT, &opt));
+      }
+
+      // load default metrics for device or read them from an environment variable
+      std::vector<std::string> metrics;

Review comment:
       Does this hold metrics or metric names? If the latter, consider calling it `metric_names`.
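
Judging by the quoted diff, the vector holds names: it is filled either from the defaults table or by splitting the `TVM_PAPI_${DEVICE}_METRICS` value on semicolons. A standalone sketch of that split is below. `SplitMetricNames` is a hypothetical helper, and unlike the loop in the diff it drops empty entries (for example from `a;;b`), which is an assumption about the desired behavior.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not in the PR): split a semicolon-separated list of
// metric names. Semicolons are the delimiter because the names themselves
// contain colons, e.g. "perf::CYCLES".
inline std::vector<std::string> SplitMetricNames(const std::string& list) {
  std::vector<std::string> names;
  std::size_t loc = 0;
  while (loc < list.size()) {
    std::size_t next = list.find(';', loc);
    if (next == std::string::npos) {
      next = list.size();
    }
    if (next > loc) {  // skip empty entries; the diff's loop keeps them
      names.push_back(list.substr(loc, next - loc));
    }
    loc = next + 1;
  }
  return names;
}
```

A typical value would be `TVM_PAPI_CPU_METRICS="perf::CYCLES;perf::INSTRUCTIONS"`, which this splits into two names.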

##########
File path: src/runtime/contrib/papi/papi.cc
##########
@@ -0,0 +1,275 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*!
+ * \brief Performance counters for profiling via the PAPI library.
+ */
+#ifndef TVM_RUNTIME_CONTRIB_PAPI_PAPI_H_
+#define TVM_RUNTIME_CONTRIB_PAPI_PAPI_H_
+
+#include <papi.h>
+#include <tvm/runtime/profiling.h>
+
+#include <string>
+#include <unordered_map>
+#include <vector>
+
+namespace tvm {
+namespace runtime {
+namespace profiling {
+
+#define PAPI_CALL(func)                                                         \
+  {                                                                             \
+    int e = (func);                                                             \
+    if (e < 0) {                                                                \
+      LOG(FATAL) << "PAPIError: " << e << " " << std::string(PAPI_strerror(e)); \
+    }                                                                           \
+  }
+
+static const std::unordered_map<DLDeviceType, std::vector<std::string>> default_metrics = {
+    {kDLCPU,
+     {"perf::CYCLES", "perf::STALLED-CYCLES-FRONTEND", "perf::STALLED-CYCLES-BACKEND",
+      "perf::INSTRUCTIONS", "perf::CACHE-MISSES"}},
+    {kDLGPU, {"cuda:::event:elapsed_cycles_sm:device=0"}}};
+
+/*! \brief Object that holds the values of counters at the start of a function call. */
+struct PAPIEventSetNode : public Object {
+  /*! \brief The starting values of counters for all metrics of a specific device. */
+  std::vector<long_long> start_values;
+  /*! \brief The device these counters are for. */
+  Device dev;
+
+  explicit PAPIEventSetNode(std::vector<long_long> start_values, Device dev)
+      : start_values(start_values), dev(dev) {}
+
+  static constexpr const char* _type_key = "PAPIEventSetNode";
+  TVM_DECLARE_FINAL_OBJECT_INFO(PAPIEventSetNode, Object);
+};
+
+int component_for_device(Device dev) {
+  std::string component_name;
+  switch (dev.device_type) {
+    case kDLCPU:
+    case kDLCPUPinned:
+      component_name = "perf_event";
+      break;
+    case kDLGPU:
+      component_name = "cuda";
+      break;
+    case kDLROCM:
+      component_name = "rocm";
+      break;
+    default:
+      LOG(WARNING) << "PAPI does not support device " << DeviceName(dev.device_type);
+      return -1;
+  }
+  int cidx = PAPI_get_component_index(component_name.c_str());
+  if (cidx < 0) {
+    LOG(FATAL) << "Cannot find PAPI component \"" << component_name
+               << "\". Maybe you need to build PAPI with support for this component (use "
+                  "`./configure --components="
+               << component_name << "`).";
+  }
+  return cidx;
+}
+
+/*! \brief MetricCollectorNode for PAPI metrics.
+ *
+ * PAPI (Performance Application Programming Interface) collects metrics on a
+ * variety of platforms including cpu, cuda and rocm.
+ *
+ * PAPI is available at https://bitbucket.org/icl/papi/src/master/.
+ *
+ * Users can change the metrics collected by setting the environment
+ * variable `TVM_PAPI_${device_name}_METRICS` to a semicolon separated list
+ * of metrics. Use the `papi_native_avail` tool to find the names of all
+ * available metrics.
+ */
+struct PAPIMetricCollectorNode final : public MetricCollectorNode {
+  explicit PAPIMetricCollectorNode(Array<DeviceWrapper> devices) {
+    if (!PAPI_is_initialized()) {
+      PAPI_CALL(PAPI_library_init(PAPI_VER_CURRENT));
+    }
+
+    // create event sets for each device
+    for (auto wrapped_device : devices) {
+      Device device = wrapped_device->device;
+      int cidx = component_for_device(device);
+      // unknown device, skipping
+      if (cidx < 0) {
+        continue;
+      }
+
+      const PAPI_component_info_t* component;
+      component = PAPI_get_component_info(cidx);
+      if (component->disabled) {
+        std::string help_message = "";
+        switch (device.device_type) {
+          case kDLCPU:
+          case kDLCPUPinned:
+            help_message =
+                "Try setting `sudo sh -c 'echo 1 >/proc/sys/kernel/perf_event_paranoid'`";
+            break;
+          case kDLGPU:
+            help_message =
+                "Try enabling gpu profiling with `modprobe nvidia "
+                "NVreg_RestrictProfilingToAdminUsers=0`. If that does not work, try adding  "
+                "`options nvidia \"NVreg_RestrictProfilingToAdminUsers=0\"` to "
+                "`/etc/modprobe.d/nvidia-kernel-common.conf`.";
+            break;
+          default:
+            break;
+        }
+        LOG(WARNING) << "PAPI could not initialize counters for " << DeviceName(device.device_type)
+                     << ": " << component->disabled_reason << "\n"
+                     << help_message;
+        continue;
+      }
+
+      int event_set = PAPI_NULL;
+      PAPI_CALL(PAPI_create_eventset(&event_set));
+      PAPI_CALL(PAPI_assign_eventset_component(event_set, cidx));
+      if (device.device_type == kDLCPU) {
+        // we set PAPI_INHERIT to make it so threads created after this inherit the event_set.
+        PAPI_option_t opt;
+        memset(&opt, 0x0, sizeof(PAPI_option_t));
+        opt.inherit.inherit = PAPI_INHERIT_ALL;
+        opt.inherit.eventset = event_set;
+        PAPI_CALL(PAPI_set_opt(PAPI_INHERIT, &opt));
+      }
+
+      // load default metrics for device or read them from an environment variable
+      std::vector<std::string> metrics;
+      std::string dev_name = DeviceName(device.device_type);
+      std::transform(dev_name.begin(), dev_name.end(), dev_name.begin(),
+                     [](unsigned char c) { return std::toupper(c); });
+      const char* env_p =
+          std::getenv((std::string("TVM_PAPI_") + dev_name + std::string("_METRICS")).c_str());
+      if (env_p != nullptr) {
+        std::string metric_string = env_p;
+        size_t loc = 0;
+        while (loc < metric_string.size()) {
+          size_t next = metric_string.find(';', loc);
+          if (next == metric_string.npos) {
+            next = metric_string.size();
+          }
+          metrics.push_back(metric_string.substr(loc, next - loc));
+          loc = next + 1;
+        }
+      } else {
+        auto it = default_metrics.find(device.device_type);
+        if (it != default_metrics.end()) {
+          metrics = it->second;
+        } else {
+          LOG(WARNING) << "No default metrics set for " << dev_name
+                       << ". You can specify metrics with the environment variable TVM_PAPI_"
+                       << dev_name << "_METRICS.";
+        }
+      }
+      // skip if no metrics exist
+      if (metrics.size() == 0) {
+        continue;
+      }
+      papi_metrics[device] = metrics;
+
+      if (static_cast<int>(metrics.size()) > PAPI_num_cmp_hwctrs(cidx)) {
+        PAPI_CALL(PAPI_set_multiplex(event_set));
+      }
+
+      // add all the metrics
+      for (auto metric : metrics) {
+        int e = PAPI_add_named_event(event_set, metric.c_str());
+        if (e != PAPI_OK) {
+          LOG(FATAL) << "PAPIError: " << e << " " << std::string(PAPI_strerror(e)) << ": " << metric
+                     << ".";
+        }
+      }
+      // Because we may have multiple calls in flight at the same time, we
+      // start all the timers when we initialize. Then we calculate the metrics
+      // counts for a call by comparing counter values at the start vs end of
+      // the call.
+      PAPI_CALL(PAPI_start(event_set));
+      event_sets[device] = event_set;
+    }
+  }
+
+  /*! \brief Called right before a function call.
+   * \param dev The device the function will be run on.
+   * \returns A `PAPIEventSetNode` containing values for the counters at the
+   * start of the call. Passed to a corresponding `Stop` call.
+   */
+  ObjectRef Start(Device dev) final {
+    // Record counter values at the start of the call, so we can calculate the
+    // metrics for the call by comparing the values at the end of the call.
+    auto it = event_sets.find(dev);
+    if (it != event_sets.end()) {
+      int event_set = it->second;
+      std::vector<long_long> values(papi_metrics[dev].size());
+      PAPI_CALL(PAPI_read(event_set, values.data()));
+      return ObjectRef(make_object<PAPIEventSetNode>(values, dev));
+    } else {
+      return ObjectRef(nullptr);
+    }
+  }
+
+  /*! \brief Called right after a function call.
+   * \param obj `PAPIEventSetNode` created by a call to `Start`.
+   * \returns A mapping from metric name to value.
+   */
+  Map<String, ObjectRef> Stop(ObjectRef obj) final {
+    const PAPIEventSetNode* event_set_node = obj.as<PAPIEventSetNode>();
+    std::vector<long_long> end_values(papi_metrics[event_set_node->dev].size());
+    PAPI_CALL(PAPI_read(event_sets[event_set_node->dev], end_values.data()));
+    std::unordered_map<String, ObjectRef> reported_metrics;
+    for (size_t i = 0; i < end_values.size(); i++) {
+      reported_metrics[papi_metrics[event_set_node->dev][i]] =
+          ObjectRef(make_object<CountNode>(end_values[i] - event_set_node->start_values[i]));
+    }
+    return reported_metrics;
+  }
+
+  ~PAPIMetricCollectorNode() final {
+    for (auto p : event_sets) {

Review comment:
       do we need to do complex logic in destructor? would be great if we can avoid this.

##########
File path: src/runtime/vm/executable.cc
##########
@@ -273,6 +273,13 @@ void Executable::SavePrimitiveOpNames(dmlc::Stream* strm) {
     primitive_names[packed_index] = it.first;
   }
   strm->Write(primitive_names);
+  // TODO(tkonolige): cannot serialize ObjectRefs

Review comment:
       what's TODO and what's the fix? worth linking to GH issue?

##########
File path: src/runtime/contrib/papi/papi.cc
##########
@@ -0,0 +1,275 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*!
+ * \brief Performance counters for profiling via the PAPI library.
+ */
+#ifndef TVM_RUNTIME_CONTRIB_PAPI_PAPI_H_
+#define TVM_RUNTIME_CONTRIB_PAPI_PAPI_H_
+
+#include <papi.h>
+#include <tvm/runtime/profiling.h>
+
+#include <string>
+#include <unordered_map>
+#include <vector>
+
+namespace tvm {
+namespace runtime {
+namespace profiling {
+
+#define PAPI_CALL(func)                                                         \
+  {                                                                             \
+    int e = (func);                                                             \
+    if (e < 0) {                                                                \
+      LOG(FATAL) << "PAPIError: " << e << " " << std::string(PAPI_strerror(e)); \
+    }                                                                           \
+  }
+
+static const std::unordered_map<DLDeviceType, std::vector<std::string>> default_metrics = {
+    {kDLCPU,
+     {"perf::CYCLES", "perf::STALLED-CYCLES-FRONTEND", "perf::STALLED-CYCLES-BACKEND",
+      "perf::INSTRUCTIONS", "perf::CACHE-MISSES"}},
+    {kDLGPU, {"cuda:::event:elapsed_cycles_sm:device=0"}}};
+
+/*! \brief Object that holds the values of counters at the start of a function call. */
+struct PAPIEventSetNode : public Object {
+  /*! \brief The starting values of counters for all metrics of a specific device. */
+  std::vector<long_long> start_values;
+  /*! \brief The device these counters are for. */
+  Device dev;
+
+  explicit PAPIEventSetNode(std::vector<long_long> start_values, Device dev)
+      : start_values(start_values), dev(dev) {}
+
+  static constexpr const char* _type_key = "PAPIEventSetNode";
+  TVM_DECLARE_FINAL_OBJECT_INFO(PAPIEventSetNode, Object);
+};
+
+int component_for_device(Device dev) {
+  std::string component_name;
+  switch (dev.device_type) {
+    case kDLCPU:
+    case kDLCPUPinned:
+      component_name = "perf_event";
+      break;
+    case kDLGPU:
+      component_name = "cuda";
+      break;
+    case kDLROCM:
+      component_name = "rocm";
+      break;
+    default:
+      LOG(WARNING) << "PAPI does not support device " << DeviceName(dev.device_type);
+      return -1;
+  }
+  int cidx = PAPI_get_component_index(component_name.c_str());
+  if (cidx < 0) {
+    LOG(FATAL) << "Cannot find PAPI component \"" << component_name
+               << "\". Maybe you need to build PAPI with support for this component (use "
+                  "`./configure --components="
+               << component_name << "`).";
+  }
+  return cidx;
+}
+
+/*! \brief MetricCollectorNode for PAPI metrics.
+ *
+ * PAPI (Performance Application Programming Interface) collects metrics on a
+ * variety of platforms including cpu, cuda and rocm.
+ *
+ * PAPI is available at https://bitbucket.org/icl/papi/src/master/.
+ *
+ * Users can change the metrics collected by setting the environment
+ * variable `TVM_PAPI_${device_name}_METRICS` to a semicolon separated list
+ * of metrics. Use the `papi_native_avail` tool to find the names of all
+ * available metrics.
+ */
+struct PAPIMetricCollectorNode final : public MetricCollectorNode {
+  explicit PAPIMetricCollectorNode(Array<DeviceWrapper> devices) {
+    if (!PAPI_is_initialized()) {
+      PAPI_CALL(PAPI_library_init(PAPI_VER_CURRENT));
+    }
+
+    // create event sets for each device
+    for (auto wrapped_device : devices) {
+      Device device = wrapped_device->device;
+      int cidx = component_for_device(device);
+      // unknown device, skipping
+      if (cidx < 0) {
+        continue;
+      }
+
+      const PAPI_component_info_t* component;
+      component = PAPI_get_component_info(cidx);
+      if (component->disabled) {
+        std::string help_message = "";
+        switch (device.device_type) {
+          case kDLCPU:
+          case kDLCPUPinned:
+            help_message =
+                "Try setting `sudo sh -c 'echo 1 >/proc/sys/kernel/perf_event_paranoid'`";
+            break;
+          case kDLGPU:
+            help_message =
+                "Try enabling gpu profiling with `modprobe nvidia "
+                "NVreg_RestrictProfilingToAdminUsers=0`. If that does not work, try adding  "
+                "`options nvidia \"NVreg_RestrictProfilingToAdminUsers=0\"` to "
+                "`/etc/modprobe.d/nvidia-kernel-common.conf`.";
+            break;
+          default:
+            break;
+        }
+        LOG(WARNING) << "PAPI could not initialize counters for " << DeviceName(device.device_type)
+                     << ": " << component->disabled_reason << "\n"
+                     << help_message;
+        continue;
+      }
+
+      int event_set = PAPI_NULL;
+      PAPI_CALL(PAPI_create_eventset(&event_set));
+      PAPI_CALL(PAPI_assign_eventset_component(event_set, cidx));
+      if (device.device_type == kDLCPU) {
+        // we set PAPI_INHERIT to make it so threads created after this inherit the event_set.
+        PAPI_option_t opt;
+        memset(&opt, 0x0, sizeof(PAPI_option_t));
+        opt.inherit.inherit = PAPI_INHERIT_ALL;
+        opt.inherit.eventset = event_set;
+        PAPI_CALL(PAPI_set_opt(PAPI_INHERIT, &opt));
+      }
+
+      // load default metrics for device or read them from an environment variable
+      std::vector<std::string> metrics;
+      std::string dev_name = DeviceName(device.device_type);
+      std::transform(dev_name.begin(), dev_name.end(), dev_name.begin(),
+                     [](unsigned char c) { return std::toupper(c); });
+      const char* env_p =
+          std::getenv((std::string("TVM_PAPI_") + dev_name + std::string("_METRICS")).c_str());
+      if (env_p != nullptr) {
+        std::string metric_string = env_p;
+        size_t loc = 0;
+        while (loc < metric_string.size()) {
+          size_t next = metric_string.find(';', loc);
+          if (next == metric_string.npos) {
+            next = metric_string.size();
+          }
+          metrics.push_back(metric_string.substr(loc, next - loc));
+          loc = next + 1;
+        }
+      } else {
+        auto it = default_metrics.find(device.device_type);
+        if (it != default_metrics.end()) {
+          metrics = it->second;
+        } else {
+          LOG(WARNING) << "No default metrics set for " << dev_name
+                       << ". You can specify metrics with the environment variable TVM_PAPI_"
+                       << dev_name << "_METRICS.";
+        }
+      }
+      // skip if no metrics exist
+      if (metrics.size() == 0) {
+        continue;
+      }
+      papi_metrics[device] = metrics;
+
+      if (static_cast<int>(metrics.size()) > PAPI_num_cmp_hwctrs(cidx)) {
+        PAPI_CALL(PAPI_set_multiplex(event_set));
+      }
+
+      // add all the metrics
+      for (auto metric : metrics) {
+        int e = PAPI_add_named_event(event_set, metric.c_str());
+        if (e != PAPI_OK) {
+          LOG(FATAL) << "PAPIError: " << e << " " << std::string(PAPI_strerror(e)) << ": " << metric
+                     << ".";
+        }
+      }
+      // Because we may have multiple calls in flight at the same time, we
+      // start all the timers when we initialize. Then we calculate the metrics
+      // counts for a call by comparing counter values at the start vs end of
+      // the call.
+      PAPI_CALL(PAPI_start(event_set));
+      event_sets[device] = event_set;
+    }
+  }
+
+  /*! \brief Called right before a function call.
+   * \param dev The device the function will be run on.
+   * \returns A `PAPIEventSetNode` containing values for the counters at the
+   * start of the call. Passed to a corresponding `Stop` call.
+   */
+  ObjectRef Start(Device dev) final {
+    // Record counter values at the start of the call, so we can calculate the
+    // metrics for the call by comparing the values at the end of the call.
+    auto it = event_sets.find(dev);
+    if (it != event_sets.end()) {
+      int event_set = it->second;
+      std::vector<long_long> values(papi_metrics[dev].size());
+      PAPI_CALL(PAPI_read(event_set, values.data()));
+      return ObjectRef(make_object<PAPIEventSetNode>(values, dev));
+    } else {
+      return ObjectRef(nullptr);
+    }
+  }
+
+  /*! \brief Called right after a function call.
+   * \param obj `PAPIEventSetNode` created by a call to `Start`.
+   * \returns A mapping from metric name to value.
+   */
+  Map<String, ObjectRef> Stop(ObjectRef obj) final {
+    const PAPIEventSetNode* event_set_node = obj.as<PAPIEventSetNode>();
+    std::vector<long_long> end_values(papi_metrics[event_set_node->dev].size());
+    PAPI_CALL(PAPI_read(event_sets[event_set_node->dev], end_values.data()));
+    std::unordered_map<String, ObjectRef> reported_metrics;
+    for (size_t i = 0; i < end_values.size(); i++) {
+      reported_metrics[papi_metrics[event_set_node->dev][i]] =
+          ObjectRef(make_object<CountNode>(end_values[i] - event_set_node->start_values[i]));
+    }
+    return reported_metrics;
+  }
+
+  ~PAPIMetricCollectorNode() final {
+    for (auto p : event_sets) {
+      PAPI_CALL(PAPI_stop(p.second, NULL));
+      PAPI_CALL(PAPI_cleanup_eventset(p.second));
+      PAPI_CALL(PAPI_destroy_eventset(&p.second));
+    }
+  }
+
+  /*! \brief Device-specific event sets. Contains the running counters for that device. */

Review comment:
       document the value

##########
File path: src/runtime/contrib/papi/papi.cc
##########
@@ -0,0 +1,275 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*!
+ * \brief Performance counters for profiling via the PAPI library.
+ */
+#ifndef TVM_RUNTIME_CONTRIB_PAPI_PAPI_H_
+#define TVM_RUNTIME_CONTRIB_PAPI_PAPI_H_
+
+#include <papi.h>
+#include <tvm/runtime/profiling.h>
+
+#include <string>
+#include <unordered_map>
+#include <vector>
+
+namespace tvm {
+namespace runtime {
+namespace profiling {
+
+#define PAPI_CALL(func)                                                         \
+  {                                                                             \
+    int e = (func);                                                             \
+    if (e < 0) {                                                                \
+      LOG(FATAL) << "PAPIError: " << e << " " << std::string(PAPI_strerror(e)); \
+    }                                                                           \
+  }
+
+static const std::unordered_map<DLDeviceType, std::vector<std::string>> default_metrics = {
+    {kDLCPU,
+     {"perf::CYCLES", "perf::STALLED-CYCLES-FRONTEND", "perf::STALLED-CYCLES-BACKEND",
+      "perf::INSTRUCTIONS", "perf::CACHE-MISSES"}},
+    {kDLGPU, {"cuda:::event:elapsed_cycles_sm:device=0"}}};
+
+/*! \brief Object that holds the values of counters at the start of a function call. */
+struct PAPIEventSetNode : public Object {
+  /*! \brief The starting values of counters for all metrics of a specific device. */
+  std::vector<long_long> start_values;
+  /*! \brief The device these counters are for. */
+  Device dev;
+
+  explicit PAPIEventSetNode(std::vector<long_long> start_values, Device dev)
+      : start_values(start_values), dev(dev) {}
+
+  static constexpr const char* _type_key = "PAPIEventSetNode";
+  TVM_DECLARE_FINAL_OBJECT_INFO(PAPIEventSetNode, Object);
+};
+
+int component_for_device(Device dev) {
+  std::string component_name;
+  switch (dev.device_type) {
+    case kDLCPU:
+    case kDLCPUPinned:
+      component_name = "perf_event";
+      break;
+    case kDLGPU:
+      component_name = "cuda";
+      break;
+    case kDLROCM:
+      component_name = "rocm";
+      break;
+    default:
+      LOG(WARNING) << "PAPI does not support device " << DeviceName(dev.device_type);
+      return -1;
+  }
+  int cidx = PAPI_get_component_index(component_name.c_str());
+  if (cidx < 0) {
+    LOG(FATAL) << "Cannot find PAPI component \"" << component_name
+               << "\". Maybe you need to build PAPI with support for this component (use "
+                  "`./configure --components="
+               << component_name << "`).";
+  }
+  return cidx;
+}
+
+/*! \brief MetricCollectorNode for PAPI metrics.
+ *
+ * PAPI (Performance Application Programming Interface) collects metrics on a
+ * variety of platforms including cpu, cuda and rocm.
+ *
+ * PAPI is available at https://bitbucket.org/icl/papi/src/master/.
+ *
+ * Users can change the metrics collected by setting the environment
+ * variable `TVM_PAPI_${device_name}_METRICS` to a semicolon separated list
+ * of metrics. Use the `papi_native_avail` tool to find the names of all
+ * available metrics.
+ */
+struct PAPIMetricCollectorNode final : public MetricCollectorNode {
+  explicit PAPIMetricCollectorNode(Array<DeviceWrapper> devices) {
+    if (!PAPI_is_initialized()) {
+      PAPI_CALL(PAPI_library_init(PAPI_VER_CURRENT));
+    }
+
+    // create event sets for each device
+    for (auto wrapped_device : devices) {
+      Device device = wrapped_device->device;
+      int cidx = component_for_device(device);
+      // unknown device, skipping
+      if (cidx < 0) {
+        continue;
+      }
+
+      const PAPI_component_info_t* component;
+      component = PAPI_get_component_info(cidx);
+      if (component->disabled) {
+        std::string help_message = "";
+        switch (device.device_type) {
+          case kDLCPU:
+          case kDLCPUPinned:
+            help_message =
+                "Try setting `sudo sh -c 'echo 1 >/proc/sys/kernel/perf_event_paranoid'`";
+            break;
+          case kDLGPU:
+            help_message =
+                "Try enabling gpu profiling with `modprobe nvidia "
+                "NVreg_RestrictProfilingToAdminUsers=0`. If that does not work, try adding  "
+                "`options nvidia \"NVreg_RestrictProfilingToAdminUsers=0\"` to "
+                "`/etc/modprobe.d/nvidia-kernel-common.conf`.";
+            break;
+          default:
+            break;
+        }
+        LOG(WARNING) << "PAPI could not initialize counters for " << DeviceName(device.device_type)
+                     << ": " << component->disabled_reason << "\n"
+                     << help_message;
+        continue;
+      }
+
+      int event_set = PAPI_NULL;
+      PAPI_CALL(PAPI_create_eventset(&event_set));
+      PAPI_CALL(PAPI_assign_eventset_component(event_set, cidx));
+      if (device.device_type == kDLCPU) {
+        // we set PAPI_INHERIT to make it so threads created after this inherit the event_set.
+        PAPI_option_t opt;
+        memset(&opt, 0x0, sizeof(PAPI_option_t));
+        opt.inherit.inherit = PAPI_INHERIT_ALL;
+        opt.inherit.eventset = event_set;
+        PAPI_CALL(PAPI_set_opt(PAPI_INHERIT, &opt));
+      }
+
+      // load default metrics for device or read them from an environment variable
+      std::vector<std::string> metrics;
+      std::string dev_name = DeviceName(device.device_type);
+      std::transform(dev_name.begin(), dev_name.end(), dev_name.begin(),
+                     [](unsigned char c) { return std::toupper(c); });
+      const char* env_p =
+          std::getenv((std::string("TVM_PAPI_") + dev_name + std::string("_METRICS")).c_str());
+      if (env_p != nullptr) {
+        std::string metric_string = env_p;
+        size_t loc = 0;
+        while (loc < metric_string.size()) {
+          size_t next = metric_string.find(';', loc);
+          if (next == metric_string.npos) {
+            next = metric_string.size();
+          }
+          metrics.push_back(metric_string.substr(loc, next - loc));
+          loc = next + 1;
+        }
+      } else {
+        auto it = default_metrics.find(device.device_type);
+        if (it != default_metrics.end()) {
+          metrics = it->second;
+        } else {
+          LOG(WARNING) << "No default metrics set for " << dev_name
+                       << ". You can specify metrics with the environment variable TVM_PAPI_"
+                       << dev_name << "_METRICS.";
+        }
+      }
+      // skip if no metrics exist
+      if (metrics.size() == 0) {
+        continue;
+      }
+      papi_metrics[device] = metrics;
+
+      if (static_cast<int>(metrics.size()) > PAPI_num_cmp_hwctrs(cidx)) {
+        PAPI_CALL(PAPI_set_multiplex(event_set));
+      }
+
+      // add all the metrics
+      for (auto metric : metrics) {
+        int e = PAPI_add_named_event(event_set, metric.c_str());
+        if (e != PAPI_OK) {
+          LOG(FATAL) << "PAPIError: " << e << " " << std::string(PAPI_strerror(e)) << ": " << metric
+                     << ".";
+        }
+      }
+      // Because we may have multiple calls in flight at the same time, we
+      // start all the timers when we initialize. Then we calculate the metrics
+      // counts for a call by comparing counter values at the start vs end of
+      // the call.
+      PAPI_CALL(PAPI_start(event_set));
+      event_sets[device] = event_set;
+    }
+  }
+
+  /*! \brief Called right before a function call.
+   * \param dev The device the function will be run on.
+   * \returns A `PAPIEventSetNode` containing values for the counters at the
+   * start of the call. Passed to a corresponding `Stop` call.
+   */
+  ObjectRef Start(Device dev) final {
+    // Record counter values at the start of the call, so we can calculate the
+    // metrics for the call by comparing the values at the end of the call.
+    auto it = event_sets.find(dev);
+    if (it != event_sets.end()) {
+      int event_set = it->second;
+      std::vector<long_long> values(papi_metrics[dev].size());
+      PAPI_CALL(PAPI_read(event_set, values.data()));
+      return ObjectRef(make_object<PAPIEventSetNode>(values, dev));
+    } else {
+      return ObjectRef(nullptr);
+    }
+  }
+
+  /*! \brief Called right after a function call.
+   * \param obj `PAPIEventSetNode` created by a call to `Start`.
+   * \returns A mapping from metric name to value.
+   */
+  Map<String, ObjectRef> Stop(ObjectRef obj) final {
+    const PAPIEventSetNode* event_set_node = obj.as<PAPIEventSetNode>();
+    std::vector<long_long> end_values(papi_metrics[event_set_node->dev].size());
+    PAPI_CALL(PAPI_read(event_sets[event_set_node->dev], end_values.data()));
+    std::unordered_map<String, ObjectRef> reported_metrics;
+    for (size_t i = 0; i < end_values.size(); i++) {
+      reported_metrics[papi_metrics[event_set_node->dev][i]] =
+          ObjectRef(make_object<CountNode>(end_values[i] - event_set_node->start_values[i]));
+    }
+    return reported_metrics;
+  }
+
+  ~PAPIMetricCollectorNode() final {
+    for (auto p : event_sets) {
+      PAPI_CALL(PAPI_stop(p.second, NULL));
+      PAPI_CALL(PAPI_cleanup_eventset(p.second));
+      PAPI_CALL(PAPI_destroy_eventset(&p.second));
+    }
+  }
+
+  /*! \brief Device-specific event sets. Contains the running counters for that device. */
+  std::unordered_map<Device, int> event_sets;
+  /*! \brief Device-specific metric names. Order of names matches the order in the corresponding
+   * `event_set`. */
+  std::unordered_map<Device, std::vector<std::string>> papi_metrics;

Review comment:
       papi_metric_names_by_device?
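
   For readers following the Start/Stop logic quoted above: the counters are started once and left running, and each call's metrics are the difference between two reads, paired with metric names by index. The following is a minimal sketch of that idea only, not the PR's code. `FakeCounters`, `Snapshot`, `StartCall`, and `StopCall` are hypothetical names, and the counter values are plain integers standing in for what `PAPI_read` would return.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for free-running hardware counters.
// In the PR, these values would come from PAPI_read on a started event set.
struct FakeCounters {
  std::vector<long long> values;
  std::vector<long long> Read() const { return values; }
};

// Counter values captured right before a call (the role PAPIEventSetNode plays).
struct Snapshot {
  std::vector<long long> start_values;
};

// Record the current counter values; the counters themselves keep running.
Snapshot StartCall(const FakeCounters& counters) { return Snapshot{counters.Read()}; }

// Per-call metric = end value - start value. names[i] must describe values[i],
// which is why the order of names has to match the order of the event set.
std::map<std::string, long long> StopCall(const FakeCounters& counters, const Snapshot& snap,
                                          const std::vector<std::string>& names) {
  std::vector<long long> end_values = counters.Read();
  std::map<std::string, long long> result;
  for (std::size_t i = 0;
       i < end_values.size() && i < names.size() && i < snap.start_values.size(); ++i) {
    result[names[i]] = end_values[i] - snap.start_values[i];
  }
  return result;
}
```

   Because nothing is reset between calls, several calls can hold snapshots at the same time, which appears to be the motivation given in the code comment before `PAPI_start`.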





[GitHub] [tvm] tkonolige commented on a change in pull request #7983: [PROFILING] Use PAPI to collect hardware performance counters on CPU and CUDA

Posted by GitBox <gi...@apache.org>.

tkonolige commented on a change in pull request #7983:
URL: https://github.com/apache/tvm/pull/7983#discussion_r629533488



##########
File path: include/tvm/runtime/profiling.h
##########
@@ -210,16 +256,19 @@ struct CallFrame {
   Timer timer;
   /*! Extra performance metrics */
   std::unordered_map<std::string, ObjectRef> extra_metrics;
+  /*! User defined metric collectors */

Review comment:
       I never do a lookup by key; I just iterate over all the pairs. I could switch it to a map if you like.





[GitHub] [tvm] hogepodge commented on pull request #7983: [PROFILING] Use PAPI to collect hardware performance counters on CPU and CUDA

Posted by GitBox <gi...@apache.org>.

hogepodge commented on pull request #7983:
URL: https://github.com/apache/tvm/pull/7983#issuecomment-833669656


   @leandron @tkonolige I would suggest two pieces of documentation. The first would be a description of all of the variables available to configure the PAPI integration. The second would be a "how-to" guide that walks the user through installing with PAPI and running a basic instrumentation. Together, these should help interested users be successful with PAPI.
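
   As a starting point for the first piece, the one configuration variable visible in this PR is `TVM_PAPI_${DEVICE}_METRICS`, whose value is a semicolon separated list of metric names. The sketch below restates, as standalone functions, how the quoted constructor appears to build the variable name and split its value. `MetricsEnvVarName` and `SplitMetrics` are hypothetical helper names introduced here for illustration; they are not part of TVM.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Build the environment variable name from a device name, e.g. "cpu".
// Mirrors the PR: upper-case the device name and wrap it in TVM_PAPI_..._METRICS.
std::string MetricsEnvVarName(std::string device_name) {
  std::transform(device_name.begin(), device_name.end(), device_name.begin(),
                 [](unsigned char c) { return static_cast<char>(std::toupper(c)); });
  return "TVM_PAPI_" + device_name + "_METRICS";
}

// Split a semicolon separated metric list. Like the loop in the PR, a trailing
// semicolon is ignored, but an empty entry between two semicolons is kept.
std::vector<std::string> SplitMetrics(const std::string& metric_string) {
  std::vector<std::string> metrics;
  std::size_t loc = 0;
  while (loc < metric_string.size()) {
    std::size_t next = metric_string.find(';', loc);
    if (next == std::string::npos) {
      next = metric_string.size();
    }
    metrics.push_back(metric_string.substr(loc, next - loc));
    loc = next + 1;
  }
  return metrics;
}
```

   Under these assumptions, a user would export something like `TVM_PAPI_CPU_METRICS="perf::CYCLES;perf::INSTRUCTIONS"` before running, using metric names reported by `papi_native_avail`.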



[GitHub] [tvm] tkonolige commented on a change in pull request #7983: [PROFILING] Use PAPI to collect hardware performance counters on CPU and CUDA

Posted by GitBox <gi...@apache.org>.

tkonolige commented on a change in pull request #7983:
URL: https://github.com/apache/tvm/pull/7983#discussion_r629532510



##########
File path: include/tvm/runtime/c_backend_api.h
##########
@@ -152,6 +152,16 @@ TVM_DLL int TVMBackendParallelBarrier(int task_id, TVMParallelGroupEnv* penv);
  */
 TVM_DLL int TVMBackendRunOnce(void** handle, int (*f)(void*), void* cdata, int nbytes);
 
+/*!
+ * \brief Reset the threads in the pool. All current threads are destroyed and
+ * new ones are created.
+ *
+ * Note that this does nothing when openmp is used.
+ *
+ * \return 0 when no error is thrown, -1 when failure happens
+ */

Review comment:
       done





[GitHub] [tvm] tkonolige commented on a change in pull request #7983: [PROFILING] Use PAPI to collect hardware performance counters on CPU and CUDA

Posted by GitBox <gi...@apache.org>.

tkonolige commented on a change in pull request #7983:
URL: https://github.com/apache/tvm/pull/7983#discussion_r660803915



##########
File path: src/runtime/vm/executable.cc
##########
@@ -273,6 +273,13 @@ void Executable::SavePrimitiveOpNames(dmlc::Stream* strm) {
     primitive_names[packed_index] = it.first;
   }
   strm->Write(primitive_names);
+  // TODO(tkonolige): cannot serialize ObjectRefs with dmlc's serializer.

Review comment:
       It is needed: saving and loading an executable will not preserve the extra op attributes.





[GitHub] [tvm] tkonolige commented on pull request #7983: [PROFILING] Use PAPI to collect hardware performance counters on CPU and CUDA

Posted by GitBox <gi...@apache.org>.

tkonolige commented on pull request #7983:
URL: https://github.com/apache/tvm/pull/7983#issuecomment-837409214


   @areusch and I chatted a little offline about how to handle mallocs in profiling code. Right now there are a couple of places that do allocation, and they would be pretty difficult to remove. Also, as long as there are no nested `Start` calls (besides the top-level nesting), the overhead of malloc is counted in the overhead section of the full model execution and has no effect on the measured performance of each op invocation. We should move forward with this PR as it stands, and I will think about ways of reducing the amount of allocation that we do.



[GitHub] [tvm] tkonolige commented on pull request #7983: [PROFILING] Use PAPI to collect hardware performance counters on CPU and CUDA

Posted by GitBox <gi...@apache.org>.

tkonolige commented on pull request #7983:
URL: https://github.com/apache/tvm/pull/7983#issuecomment-869949125


   @leandron @tqchen @areusch Can you review?



[GitHub] [tvm] tqchen commented on a change in pull request #7983: [PROFILING] Use PAPI to collect hardware performance counters on CPU and CUDA

Posted by GitBox <gi...@apache.org>.

tqchen commented on a change in pull request #7983:
URL: https://github.com/apache/tvm/pull/7983#discussion_r668303499



##########
File path: include/tvm/runtime/contrib/papi.h
##########
@@ -0,0 +1,99 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*!
+ * \brief Performance counters for profiling via the PAPI library.
+ */
+#ifndef TVM_RUNTIME_CONTRIB_PAPI_H_
+#define TVM_RUNTIME_CONTRIB_PAPI_H_
+
+#include <tvm/runtime/container/array.h>
+#include <tvm/runtime/container/map.h>
+#include <tvm/runtime/profiling.h>
+
+#include <string>
+#include <unordered_map>
+#include <vector>
+
+namespace tvm {
+namespace runtime {
+namespace profiling {
+
+/*! \brief MetricCollectorNode for PAPI metrics.
+ *
+ * PAPI (Performance Application Programming Interface) collects metrics on a
+ * variety of platforms including cpu, cuda and rocm.
+ *
+ * PAPI is avaliable at https://bitbucket.org/icl/papi/src/master/.
+ */
+struct PAPIMetricCollectorNode final : public MetricCollectorNode {

Review comment:
       Because the profiler itself is a public interface, what we could do instead is to only expose a factory function that returns the instance, but not the actual data structure itself
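
   A minimal sketch of the factory-function approach suggested here. All names (`MetricCollector` as a plain abstract class, `CreateCounterCollector`, `CounterCollectorImpl`) are hypothetical and simplified; TVM's real collectors are `Object`-based and are not reproduced here. The public header would declare only the interface and the factory, while the concrete type stays in the .cc file:

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// ---- What a public header might expose (hypothetical names) ----
class MetricCollector {
 public:
  virtual ~MetricCollector() = default;
  virtual std::string Name() const = 0;
};

// The factory is the only way for users to obtain the concrete collector.
std::unique_ptr<MetricCollector> CreateCounterCollector(std::vector<std::string> metrics);

// ---- What would stay in the .cc file ----
namespace {
class CounterCollectorImpl final : public MetricCollector {
 public:
  explicit CounterCollectorImpl(std::vector<std::string> metrics)
      : metrics_(std::move(metrics)) {}
  std::string Name() const override {
    return "counter(" + std::to_string(metrics_.size()) + ")";
  }

 private:
  std::vector<std::string> metrics_;
};
}  // namespace

std::unique_ptr<MetricCollector> CreateCounterCollector(std::vector<std::string> metrics) {
  // Assumption of this sketch: at least one metric name is requested.
  assert(!metrics.empty() && "at least one metric expected");
  return std::make_unique<CounterCollectorImpl>(std::move(metrics));
}
```

   Callers depend only on the abstract interface and the factory signature, so the implementation can change without touching the public header.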





[GitHub] [tvm] tkonolige commented on pull request #7983: [PROFILING] Use PAPI to collect hardware performance counters on CPU and CUDA

Posted by GitBox <gi...@apache.org>.

tkonolige commented on pull request #7983:
URL: https://github.com/apache/tvm/pull/7983#issuecomment-872428677


   @areusch ping



[GitHub] [tvm] tqchen edited a comment on pull request #7983: [PROFILING] Use PAPI to collect hardware performance counters on CPU and CUDA

Posted by GitBox <gi...@apache.org>.

tqchen edited a comment on pull request #7983:
URL: https://github.com/apache/tvm/pull/7983#issuecomment-833474904


   Licenses will be needed if we bundle PAPI's source code in our source distribution (for example, when it is included as a submodule). If it is a system-level dependency, we do not need to add licenses to the licenses folder.



[GitHub] [tvm] tqchen commented on pull request #7983: [PROFILING] Use PAPI to collect hardware performance counters on CPU and CUDA

Posted by GitBox <gi...@apache.org>.

tqchen commented on pull request #7983:
URL: https://github.com/apache/tvm/pull/7983#issuecomment-833474904


   Licenses will be needed if we bundle PAPI's source code in our source distribution.



[GitHub] [tvm] tkonolige commented on a change in pull request #7983: [PROFILING] Use PAPI to collect hardware performance counters on CPU and CUDA

Posted by GitBox <gi...@apache.org>.

tkonolige commented on a change in pull request #7983:
URL: https://github.com/apache/tvm/pull/7983#discussion_r668327260



##########
File path: include/tvm/runtime/contrib/papi.h
##########
@@ -0,0 +1,99 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*!
+ * \brief Performance counters for profiling via the PAPI library.
+ */
+#ifndef TVM_RUNTIME_CONTRIB_PAPI_H_
+#define TVM_RUNTIME_CONTRIB_PAPI_H_
+
+#include <tvm/runtime/container/array.h>
+#include <tvm/runtime/container/map.h>
+#include <tvm/runtime/profiling.h>
+
+#include <string>
+#include <unordered_map>
+#include <vector>
+
+namespace tvm {
+namespace runtime {
+namespace profiling {
+
+/*! \brief MetricCollectorNode for PAPI metrics.
+ *
+ * PAPI (Performance Application Programming Interface) collects metrics on a
+ * variety of platforms including cpu, cuda and rocm.
+ *
+ * PAPI is avaliable at https://bitbucket.org/icl/papi/src/master/.
+ */
+struct PAPIMetricCollectorNode final : public MetricCollectorNode {

Review comment:
       I've moved all definitions except a factory function into the .cc file.





[GitHub] [tvm] tqchen commented on a change in pull request #7983: [PROFILING] Use PAPI to collect hardware performance counters on CPU and CUDA

Posted by GitBox <gi...@apache.org>.

tqchen commented on a change in pull request #7983:
URL: https://github.com/apache/tvm/pull/7983#discussion_r668121949



##########
File path: include/tvm/runtime/profiling.h
##########
@@ -150,6 +151,26 @@ class Timer : public ObjectRef {
 Timer DefaultTimer(Device dev);
 
 namespace profiling {
+/*! \brief Wrapper for `Device` because `Device` is not passable across the

Review comment:
       Device cannot be put as part of object container. How about DeviceObject?

##########
File path: include/tvm/runtime/profiling.h
##########
@@ -210,16 +282,21 @@ struct CallFrame {
   Timer timer;
   /*! Extra performance metrics */
   std::unordered_map<std::string, ObjectRef> extra_metrics;
+  /*! User defined metric collectors. Each pair is the MetricCollector and its
+   * associated data (returned from MetricCollector.Start).
+   */

Review comment:
       Not related to this PR, but a general note on hiding implementation details: in this case we could use the pimpl idiom (https://en.cppreference.com/w/cpp/language/pimpl) to hide the CallFrame definition. I recommend doing this in a follow-up refactor.
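
   A minimal sketch of the pimpl idiom mentioned here, using a hypothetical `CallFrameHandle` class with a simplified `double`-valued metrics map. It is not TVM's `CallFrame`, only an illustration of how the data members can be kept out of the header:

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>

// ---- Header side: only a forward declaration of Impl is visible ----
class CallFrameHandle {
 public:
  CallFrameHandle();
  ~CallFrameHandle();
  CallFrameHandle(CallFrameHandle&&) noexcept;
  CallFrameHandle& operator=(CallFrameHandle&&) noexcept;

  void SetMetric(const std::string& key, double value);
  double GetMetric(const std::string& key) const;

 private:
  struct Impl;                  // defined only in the source file
  std::unique_ptr<Impl> impl_;  // opaque pointer to the hidden state
};

// ---- Source side: the full definition lives here ----
struct CallFrameHandle::Impl {
  std::unordered_map<std::string, double> extra_metrics;
};

CallFrameHandle::CallFrameHandle() : impl_(std::make_unique<Impl>()) {}
// Special members are defaulted here, where Impl is a complete type.
CallFrameHandle::~CallFrameHandle() = default;
CallFrameHandle::CallFrameHandle(CallFrameHandle&&) noexcept = default;
CallFrameHandle& CallFrameHandle::operator=(CallFrameHandle&&) noexcept = default;

void CallFrameHandle::SetMetric(const std::string& key, double value) {
  impl_->extra_metrics[key] = value;
}

double CallFrameHandle::GetMetric(const std::string& key) const {
  auto it = impl_->extra_metrics.find(key);
  assert(it != impl_->extra_metrics.end() && "metric must be set before it is read");
  return it->second;
}
```

   With this layout, adding or removing a field in `Impl` does not change the header, so code that includes it does not need to be recompiled.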

##########
File path: include/tvm/runtime/contrib/papi.h
##########
@@ -0,0 +1,99 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*!
+ * \brief Performance counters for profiling via the PAPI library.
+ */
+#ifndef TVM_RUNTIME_CONTRIB_PAPI_H_
+#define TVM_RUNTIME_CONTRIB_PAPI_H_
+
+#include <tvm/runtime/container/array.h>
+#include <tvm/runtime/container/map.h>
+#include <tvm/runtime/profiling.h>
+
+#include <string>
+#include <unordered_map>
+#include <vector>
+
+namespace tvm {
+namespace runtime {
+namespace profiling {
+
+/*! \brief MetricCollectorNode for PAPI metrics.
+ *
+ * PAPI (Performance Application Programming Interface) collects metrics on a
+ * variety of platforms including cpu, cuda and rocm.
+ *
+ * PAPI is avaliable at https://bitbucket.org/icl/papi/src/master/.
+ */
+struct PAPIMetricCollectorNode final : public MetricCollectorNode {

Review comment:
       Given that this is an implementation of an existing interface, we usually prefer to hide as much as possible. Let us start with a private header (move the definition to the .cc file, or move the header file to src).





[GitHub] [tvm] leandron commented on a change in pull request #7983: [PROFILING] Use PAPI to collect hardware performance counters on CPU and CUDA

Posted by GitBox <gi...@apache.org>.

leandron commented on a change in pull request #7983:
URL: https://github.com/apache/tvm/pull/7983#discussion_r627224468



##########
File path: cmake/modules/contrib/PAPI.cmake
##########
@@ -0,0 +1,25 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+if(USE_PAPI)
+  find_package(PkgConfig REQUIRED)
+
+  set(ENV{PKG_CONFIG_PATH} "${USE_PAPI}:$ENV{PKG_CONFIG_PATH}")
+  pkg_check_modules(PAPI REQUIRED IMPORTED_TARGET papi>=6.0)
+  list(APPEND TVM_RUNTIME_LINKER_LIBS PkgConfig::PAPI)
+  list(APPEND RUNTIME_SRCS src/runtime/contrib/papi/papi.cc)

Review comment:
       Do you know whether this works as expected for cross-compilation?

##########
File path: CMakeLists.txt
##########
@@ -49,6 +49,7 @@ tvm_option(USE_FALLBACK_STL_MAP "Use TVM's POD compatible Map" OFF)
 tvm_option(USE_ETHOSN "Build with Arm Ethos-N" OFF)
 tvm_option(INDEX_DEFAULT_I64 "Defaults the index datatype to int64" ON)
 tvm_option(USE_LIBBACKTRACE "Build libbacktrace to supply linenumbers on stack traces" AUTO)
+tvm_option(USE_PAPI "Use PAPI (The Performance Application Programming Interface) to read performance counters" OFF)

Review comment:
       ```suggestion
   tvm_option(USE_PAPI "Use Performance Application Programming Interface (PAPI) to read performance counters" OFF)
   ```
   Minor suggestion: a rewording of this description. Feel free to keep it your way as well.





[GitHub] [tvm] leandron commented on pull request #7983: [PROFILING] Use PAPI to collect hardware performance counters on CPU and CUDA

Posted by GitBox <gi...@apache.org>.

leandron commented on pull request #7983:
URL: https://github.com/apache/tvm/pull/7983#issuecomment-833365958


   > These performance counters include data like total cycles, instructions executed, and cache misses. Users can control which performance counters are collected by setting the TVM_PAPI_${DEVICE}_METRICS environment variable to a semicolon separated list of metrics.
   
   Can you also document the use of `$TVM_PAPI_${DEVICE}_METRICS` somewhere?
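
   As a starting point for such documentation, a minimal sketch of how a semicolon-separated metric list could be read from the environment. `SplitMetrics` and `MetricsFromEnv` are hypothetical helpers, not the PR's code, and the device names accepted in place of `${DEVICE}` are an assumption here:

```cpp
#include <cassert>
#include <cstdlib>
#include <sstream>
#include <string>
#include <vector>

// Split a semicolon-separated list, skipping empty entries.
std::vector<std::string> SplitMetrics(const std::string& value) {
  std::vector<std::string> out;
  std::stringstream ss(value);
  std::string item;
  while (std::getline(ss, item, ';')) {
    if (!item.empty()) out.push_back(item);
  }
  return out;
}

// Read TVM_PAPI_<device>_METRICS; returns an empty list when it is unset.
// The exact spelling of <device> (for example "CPU") is an assumption.
std::vector<std::string> MetricsFromEnv(const std::string& device) {
  assert(!device.empty() && "a device name is required");
  const std::string var = "TVM_PAPI_" + device + "_METRICS";
  const char* raw = std::getenv(var.c_str());
  return raw ? SplitMetrics(raw) : std::vector<std::string>{};
}
```

   Under these assumptions, a setting such as `PAPI_TOT_CYC;PAPI_TOT_INS` (two PAPI preset events) would yield a two-element list.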



[GitHub] [tvm] tqchen commented on a change in pull request #7983: [PROFILING] Use PAPI to collect hardware performance counters on CPU and CUDA

Posted by GitBox <gi...@apache.org>.

tqchen commented on a change in pull request #7983:
URL: https://github.com/apache/tvm/pull/7983#discussion_r668303761



##########
File path: include/tvm/runtime/profiling.h
##########
@@ -150,6 +151,26 @@ class Timer : public ObjectRef {
 Timer DefaultTimer(Device dev);
 
 namespace profiling {
+/*! \brief Wrapper for `Device` because `Device` is not passable across the

Review comment:
       I am not strongly attached to the choice. Given that it is in the profiling namespace, I will let you decide.





[GitHub] [tvm] tkonolige commented on a change in pull request #7983: [PROFILING] Use PAPI to collect hardware performance counters on CPU and CUDA

Posted by GitBox <gi...@apache.org>.

tkonolige commented on a change in pull request #7983:
URL: https://github.com/apache/tvm/pull/7983#discussion_r627568915



##########
File path: cmake/modules/contrib/PAPI.cmake
##########
@@ -0,0 +1,25 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+if(USE_PAPI)
+  find_package(PkgConfig REQUIRED)
+
+  set(ENV{PKG_CONFIG_PATH} "${USE_PAPI}:$ENV{PKG_CONFIG_PATH}")
+  pkg_check_modules(PAPI REQUIRED IMPORTED_TARGET papi>=6.0)
+  list(APPEND TVM_RUNTIME_LINKER_LIBS PkgConfig::PAPI)
+  list(APPEND RUNTIME_SRCS src/runtime/contrib/papi/papi.cc)

Review comment:
       I'm not sure if this works with cross compiling. PAPI does support cross compilation though. I think you'd have to set `PKG_CONFIG_PATH` or `USE_PAPI` to point to the cross compiled library.





[GitHub] [tvm] areusch commented on a change in pull request #7983: [PROFILING] Use PAPI to collect hardware performance counters on CPU and CUDA

Posted by GitBox <gi...@apache.org>.

areusch commented on a change in pull request #7983:
URL: https://github.com/apache/tvm/pull/7983#discussion_r663178912



##########
File path: src/runtime/thread_pool.cc
##########
@@ -258,32 +258,21 @@ class SpscTaskQueue {
 class ThreadPool {
  public:
   ThreadPool() : num_workers_(tvm::runtime::threading::MaxConcurrency()) {
-    for (int i = 0; i < num_workers_; ++i) {
-      // The SpscTaskQueue only hosts ONE item at a time
-      queues_.emplace_back(std::unique_ptr<SpscTaskQueue>(new SpscTaskQueue()));
-    }
     const char* exclude_worker0 = getenv("TVM_EXCLUDE_WORKER0");
     if (exclude_worker0 && atoi(exclude_worker0) == 0) {
       exclude_worker0_ = false;
     }
-    threads_ = std::unique_ptr<tvm::runtime::threading::ThreadGroup>(
-        new tvm::runtime::threading::ThreadGroup(
-            num_workers_, [this](int worker_id) { this->RunWorker(worker_id); },
-            exclude_worker0_ /* include_main_thread */));
-    num_workers_used_ = threads_->Configure(threading::ThreadGroup::kBig, 0, exclude_worker0_);
+    Init();
   }
+
   ~ThreadPool() {
     for (std::unique_ptr<SpscTaskQueue>& q : queues_) {
       q->SignalForKill();
     }
     threads_.reset();
   }
-  void Reset() {
-    for (std::unique_ptr<SpscTaskQueue>& q : queues_) {
-      q->SignalForKill();
-    }
-    queues_.clear();
-    threads_.reset();
+
+  void Init() {

Review comment:
       wait but this should be private, no?
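
   A minimal sketch of the layout this comment points toward, using a hypothetical `MiniPool` class rather than TVM's `ThreadPool`: `Init()` is private and is reached only through the constructor and a public `Reset()`.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical miniature of a resettable pool; not TVM's ThreadPool.
class MiniPool {
 public:
  explicit MiniPool(int num_workers) : num_workers_(num_workers) {
    assert(num_workers > 0 && "pool needs at least one worker");
    Init();
  }

  // Public entry point: drop the current queues and rebuild them.
  void Reset() {
    queues_.clear();
    Init();
  }

  int num_queues() const { return static_cast<int>(queues_.size()); }
  int init_count() const { return init_count_; }

 private:
  // Private: only the constructor and Reset() may (re)build internal state.
  void Init() {
    for (int i = 0; i < num_workers_; ++i) {
      queues_.emplace_back(std::make_unique<std::vector<int>>());
    }
    ++init_count_;
  }

  int num_workers_;
  int init_count_ = 0;
  std::vector<std::unique_ptr<std::vector<int>>> queues_;
};
```

   Keeping `Init()` private prevents callers from invoking it twice in a row and ending up with duplicated queues.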





[GitHub] [tvm] tkonolige commented on a change in pull request #7983: [PROFILING] Use PAPI to collect hardware performance counters on CPU and CUDA

Posted by GitBox <gi...@apache.org>.

tkonolige commented on a change in pull request #7983:
URL: https://github.com/apache/tvm/pull/7983#discussion_r668177949



##########
File path: include/tvm/runtime/contrib/papi.h
##########
@@ -0,0 +1,99 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*!
+ * \brief Performance counters for profiling via the PAPI library.
+ */
+#ifndef TVM_RUNTIME_CONTRIB_PAPI_H_
+#define TVM_RUNTIME_CONTRIB_PAPI_H_
+
+#include <tvm/runtime/container/array.h>
+#include <tvm/runtime/container/map.h>
+#include <tvm/runtime/profiling.h>
+
+#include <string>
+#include <unordered_map>
+#include <vector>
+
+namespace tvm {
+namespace runtime {
+namespace profiling {
+
+/*! \brief MetricCollectorNode for PAPI metrics.
+ *
+ * PAPI (Performance Application Programming Interface) collects metrics on a
+ * variety of platforms including cpu, cuda and rocm.
+ *
+ * PAPI is avaliable at https://bitbucket.org/icl/papi/src/master/.
+ */
+struct PAPIMetricCollectorNode final : public MetricCollectorNode {

Review comment:
       Users need to construct this to pass it to the profiler, so it needs to be in a public header.

##########
File path: include/tvm/runtime/profiling.h
##########
@@ -150,6 +151,26 @@ class Timer : public ObjectRef {
 Timer DefaultTimer(Device dev);
 
 namespace profiling {
+/*! \brief Wrapper for `Device` because `Device` is not passable across the

Review comment:
       DeviceObjectNode sounds a little weird. I think wrapper clearly explains what this is doing.



