You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@orc.apache.org by GitBox <gi...@apache.org> on 2022/06/12 15:36:01 UTC

[GitHub] [orc] coderex2522 opened a new pull request, #1158: ORC-961: [C++] expose related metrics of the reader

coderex2522 opened a new pull request, #1158:
URL: https://github.com/apache/orc/pull/1158

   ### What changes were proposed in this pull request?
   This patch keeps track of the time spent and the number of calls to each module. The current metrics mainly include the decompression time and number of calls, decoding time and number of calls and total elapsed time.
   
   ### Why are the changes needed?
   It exposes the relevant metrics of the reader, so that the user can visually see the metrics of each module which includes decompression, decoding.
   
   ### How was this patch tested?
   The orc-scan tool with the -m parameter can output relevant metrics.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] dongjoon-hyun commented on pull request #1158: ORC-961: [C++] expose related metrics of the reader

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on PR #1158:
URL: https://github.com/apache/orc/pull/1158#issuecomment-1180088190

   Thank you,  @wgtmac 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] coderex2522 commented on a diff in pull request #1158: ORC-961: [C++] expose related metrics of the reader

Posted by GitBox <gi...@apache.org>.
coderex2522 commented on code in PR #1158:
URL: https://github.com/apache/orc/pull/1158#discussion_r896883021


##########
c++/src/Reader.cc:
##########
@@ -1154,6 +1165,8 @@ namespace orc {
   }
 
   bool RowReaderImpl::next(ColumnVectorBatch& data) {
+    AutoStopwatch measure(&contents->readerMetrics->ReaderInclusiveLatencyUs,
+                  &contents->readerMetrics->ReaderCount);

Review Comment:
   Done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] coderex2522 commented on a diff in pull request #1158: ORC-961: [C++] expose related metrics of the reader

Posted by GitBox <gi...@apache.org>.
coderex2522 commented on code in PR #1158:
URL: https://github.com/apache/orc/pull/1158#discussion_r905759781


##########
c++/include/orc/Reader.hh:
##########
@@ -32,13 +32,34 @@
 #include <set>
 #include <string>
 #include <vector>
+#include <atomic>
 
 namespace orc {
 
   // classes that hold data members so we can maintain binary compatibility
   struct ReaderOptionsPrivate;
   struct RowReaderOptionsPrivate;
 
+  /**
+   * Expose the reader metrics including the latency and
+   * number of calls of the decompression/decoding/IO modules.

Review Comment:
   Done. I added some comments to the metrics latency fields.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] coderex2522 commented on a diff in pull request #1158: ORC-961: [C++] expose related metrics of the reader

Posted by GitBox <gi...@apache.org>.
coderex2522 commented on code in PR #1158:
URL: https://github.com/apache/orc/pull/1158#discussion_r911716400


##########
c++/src/Compression.cc:
##########
@@ -463,6 +469,7 @@ DIAGNOSTIC_PUSH
   }
 
   bool DecompressionStream::Next(const void** data, int*size) {
+    DEFINE_AUTO_STOPWATCH(metrics, DecompLatencyUs, DecompCount);

Review Comment:
   The NextDecompress function will call the readBuffer function which includes the io operation. Currently the pure decompression time is got by subtracting the time taken by the readBuffer function.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] coderex2522 commented on a diff in pull request #1158: ORC-961: [C++] expose related metrics of the reader

Posted by GitBox <gi...@apache.org>.
coderex2522 commented on code in PR #1158:
URL: https://github.com/apache/orc/pull/1158#discussion_r896877595


##########
c++/include/orc/Reader.hh:
##########
@@ -39,6 +39,29 @@ namespace orc {
   struct ReaderOptionsPrivate;
   struct RowReaderOptionsPrivate;
 
+  struct ReaderMetrics {
+    volatile uint64_t ReaderCount;

Review Comment:
   Done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] dongjoon-hyun commented on a diff in pull request #1158: ORC-961: [C++] expose related metrics of the reader

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on code in PR #1158:
URL: https://github.com/apache/orc/pull/1158#discussion_r916027886


##########
c++/src/Compression.hh:
##########
@@ -30,12 +30,14 @@ namespace orc {
    * @param input the input stream that is the underlying source
    * @param bufferSize the maximum size of the buffer
    * @param pool the memory pool
+   * @param metrics the reader metrics
    */
   std::unique_ptr<SeekableInputStream>
      createDecompressor(CompressionKind kind,
                         std::unique_ptr<SeekableInputStream> input,
                         uint64_t bufferSize,
-                        MemoryPool& pool);
+                        MemoryPool& pool,
+                        ReaderMetrics* metrics);

Review Comment:
   If you don't mind, shall we have the default value, `ReaderMetrics* metrics = nullptr`, to be more backward-compatible?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] coderex2522 commented on a diff in pull request #1158: ORC-961: [C++] expose related metrics of the reader

Posted by GitBox <gi...@apache.org>.
coderex2522 commented on code in PR #1158:
URL: https://github.com/apache/orc/pull/1158#discussion_r913687696


##########
c++/src/Reader.cc:
##########
@@ -286,6 +287,10 @@ namespace orc {
           lastStripe = i + 1;
         }
       }
+      if (footer->rowindexstride() > 0) {
+        fileRowGroupCount +=

Review Comment:
   Done. The fileRowGroupCount is replaced with numRowGroupsInStripeRange.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] wgtmac commented on a diff in pull request #1158: ORC-961: [C++] expose related metrics of the reader

Posted by GitBox <gi...@apache.org>.
wgtmac commented on code in PR #1158:
URL: https://github.com/apache/orc/pull/1158#discussion_r906918128


##########
c++/src/Utils.hh:
##########
@@ -0,0 +1,63 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef ORC_UTILS_HH
+#define ORC_UTILS_HH
+
+#include <atomic>
+#include <chrono>
+
+namespace orc {
+
+class AutoStopwatch {
+  std::chrono::high_resolution_clock::time_point start;
+  std::atomic<uint64_t>* latencyUs;
+  std::atomic<uint64_t>* count;
+
+public:
+  AutoStopwatch(std::atomic<uint64_t>* _latencyUs,
+                std::atomic<uint64_t>* _count)
+                : latencyUs(_latencyUs),
+                  count(_count) {
+    if (latencyUs) {
+      start = std::chrono::high_resolution_clock::now();
+    }
+  }
+
+  ~AutoStopwatch() {
+    if (latencyUs) {
+      std::chrono::microseconds elapsedTime =
+        std::chrono::duration_cast<std::chrono::microseconds>(
+          std::chrono::high_resolution_clock::now() - start);
+      latencyUs->fetch_add(static_cast<uint64_t>(elapsedTime.count()));
+    }
+
+    if (count) {
+      count->fetch_add(1);
+    }
+  }
+};
+
+#define DEFINE_AUTO_STOPWATCH(METRICS_PTR, LATENCY_VAR, COUNT_VAR)    \

Review Comment:
   I'd suggest to rename DEFINE_AUTO_STOPWATCH to SCOPED_STOPWATCH because usually DEFINE_ prefix is used to define a class or something via marco expansion.



##########
c++/src/Compression.cc:
##########
@@ -463,6 +469,7 @@ DIAGNOSTIC_PUSH
   }
 
   bool DecompressionStream::Next(const void** data, int*size) {
+    DEFINE_AUTO_STOPWATCH(metrics, DecompLatencyUs, DecompCount);

Review Comment:
   The stop watch should not be put here since it includes the io latency. If you move this line to line 498 right before the definition of **size_t availableSize**, what we got is the pure decompression time.



##########
c++/src/sargs/SargsApplier.cc:
##########
@@ -126,10 +127,14 @@ namespace orc {
     } while (rowGroup != 0);
 
     // update stats
-    mStats.first = std::accumulate(
-      mNextSkippedRows.cbegin(), mNextSkippedRows.cend(), mStats.first,
-      [](bool rg, uint64_t s) { return rg ? 1 : 0 + s; });
-    mStats.second += groupsInStripe;
+    uint64_t selectedRGs = std::accumulate(
+      mNextSkippedRows.cbegin(), mNextSkippedRows.cend(), 0UL,
+      [](uint64_t initVal, uint64_t rg) {
+        return rg > 0 ? initVal + 1 : initVal; });
+    if (mMetrics != nullptr) {

Review Comment:
   What if the file or stripe has been skipped and no row group evaluation happens here? We may have missed them.



##########
c++/include/orc/Reader.hh:
##########
@@ -32,13 +32,40 @@
 #include <set>
 #include <string>
 #include <vector>
+#include <atomic>
 
 namespace orc {
 
   // classes that hold data members so we can maintain binary compatibility
   struct ReaderOptionsPrivate;
   struct RowReaderOptionsPrivate;
 
+  /**
+   * Expose the reader metrics including the latency and
+   * number of calls of the decompression/decoding/IO modules.
+   */
+  struct ReaderMetrics {
+    std::atomic<uint64_t> ReaderCount{0};
+    // ReaderInclusiveLatencyUs contains the latency of
+    // the decompression/decoding/IO modules.
+    std::atomic<uint64_t> ReaderInclusiveLatencyUs{0};
+    std::atomic<uint64_t> DecompCount{0};
+    // DecompLatencyUs contains the IO blocking latency.
+    std::atomic<uint64_t> DecompLatencyUs{0};
+    std::atomic<uint64_t> DecodingCount{0};
+    // DecodingLatencyUs contains the latency of the decompression/IO modules.

Review Comment:
   I think we can do better here to get the pure decoding time. Please see the inline comments below.



##########
c++/include/orc/Reader.hh:
##########
@@ -32,13 +32,40 @@
 #include <set>
 #include <string>
 #include <vector>
+#include <atomic>
 
 namespace orc {
 
   // classes that hold data members so we can maintain binary compatibility
   struct ReaderOptionsPrivate;
   struct RowReaderOptionsPrivate;
 
+  /**
+   * Expose the reader metrics including the latency and
+   * number of calls of the decompression/decoding/IO modules.
+   */
+  struct ReaderMetrics {
+    std::atomic<uint64_t> ReaderCount{0};
+    // ReaderInclusiveLatencyUs contains the latency of
+    // the decompression/decoding/IO modules.
+    std::atomic<uint64_t> ReaderInclusiveLatencyUs{0};
+    std::atomic<uint64_t> DecompCount{0};

Review Comment:
   I'd prefer to use the full name here. And maybe replace **Count** with **Call** ?



##########
c++/src/RleDecoderV2.cc:
##########
@@ -432,6 +435,7 @@ void RleDecoderV2::skip(uint64_t numValues) {
 void RleDecoderV2::next(int64_t* const data,
                         const uint64_t numValues,
                         const char* const notNull) {
+  DEFINE_AUTO_STOPWATCH(metrics, DecodingLatencyUs, DecodingCount);

Review Comment:
   Where is the stop watch call in the RleDecoderV1 ?



##########
c++/src/RleDecoderV2.cc:
##########
@@ -432,6 +435,7 @@ void RleDecoderV2::skip(uint64_t numValues) {
 void RleDecoderV2::next(int64_t* const data,
                         const uint64_t numValues,
                         const char* const notNull) {
+  DEFINE_AUTO_STOPWATCH(metrics, DecodingLatencyUs, DecodingCount);

Review Comment:
   In addition, all IOs happen in the call of readByte(). We probably should make metrics a member varialbe of the RleDecoder and counts its io time in readByte(). Finally, we can exlucde the io time from decoding latancy here to get the pure RLE decoding time.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] dongjoon-hyun commented on pull request #1158: ORC-961: [C++] expose related metrics of the reader

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on PR #1158:
URL: https://github.com/apache/orc/pull/1158#issuecomment-1179799902

   Gentle ping, @coderex2522 .


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] stiga-huang commented on a diff in pull request #1158: ORC-961: [C++] expose related metrics of the reader

Posted by GitBox <gi...@apache.org>.
stiga-huang commented on code in PR #1158:
URL: https://github.com/apache/orc/pull/1158#discussion_r903327104


##########
c++/src/ByteRLE.cc:
##########
@@ -433,6 +441,13 @@ namespace orc {
 
   void ByteRleDecoderImpl::next(char* data, uint64_t numValues,
                                 char* notNull) {
+    AutoStopwatch meassure(&metrics.ByteDecodingLatencyUs,

Review Comment:
   This is not the pure decoding time since it contains reading buffers from the input stream. So it's actually ioTime + decompressionTime + decodingTime. I think we should either fix it or document it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] XinyuZeng commented on pull request #1158: ORC-961: [C++] expose related metrics of the reader

Posted by GitBox <gi...@apache.org>.
XinyuZeng commented on PR #1158:
URL: https://github.com/apache/orc/pull/1158#issuecomment-1200964837

   @dongjoon-hyun I created jira-issue 1232 for this problem. Also, even the metrics is disabled by setting nullptr, there is still some cost of initing the AutoStopWatch object and destructing it. The following perf flamegraph shows the cost.
   ![image](https://user-images.githubusercontent.com/30414224/182120492-94f85aa2-90b2-4995-b9f5-b1e452684d40.png)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] dongjoon-hyun commented on pull request #1158: ORC-961: [C++] expose related metrics of the reader

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on PR #1158:
URL: https://github.com/apache/orc/pull/1158#issuecomment-1203175504

   Since Apache ORC 1.9.0 is planned on September 2023, I believe we have enough time to remove the regression before release. Please feel free to open a PR for the improvement. Everyone can make a PR and in the worst case (if there is no volunteer to fix the regression), we can revert ORC-961 from `branch-1.9` before creating 1.9.0 RC1 vote.
   - https://github.com/apache/orc/milestone/10
   
   BTW, please let us know your opinions about the above approach, @wgtmac , @williamhyun , @guiyanakuang , @stiga-huang .


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] coderex2522 commented on a diff in pull request #1158: ORC-961: [C++] expose related metrics of the reader

Posted by GitBox <gi...@apache.org>.
coderex2522 commented on code in PR #1158:
URL: https://github.com/apache/orc/pull/1158#discussion_r903653302


##########
c++/include/orc/OrcFile.hh:
##########
@@ -105,20 +105,26 @@ namespace orc {
   /**
    * Create a stream to a local file or HDFS file if path begins with "hdfs://"
    * @param path the name of the file in the local file system or HDFS
+   * @param metrics the metrics of the reader
    */
-  ORC_UNIQUE_PTR<InputStream> readFile(const std::string& path);
+  ORC_UNIQUE_PTR<InputStream> readFile(const std::string& path,
+                                       ReaderMetrics& metrics);

Review Comment:
   Parameter references for the metrics do cause compatibility issues. I replaced the all parameter reference with metrics pointer. And use macro DEFINE_AUTO_STOPWATCH  to do the metrics pointer null checks.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] dongjoon-hyun commented on a diff in pull request #1158: ORC-961: [C++] expose related metrics of the reader

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on code in PR #1158:
URL: https://github.com/apache/orc/pull/1158#discussion_r916029792


##########
c++/include/orc/Reader.hh:
##########
@@ -76,6 +99,11 @@ namespace orc {
      */
     ReaderOptions& setMemoryPool(MemoryPool& pool);
 
+    /**
+     * Set the reader metrics.

Review Comment:
   Could you add a description where `setReaderMetrics(nullptr)`? I guess, it could be used to disable ReaderMetrics?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] coderex2522 commented on a diff in pull request #1158: ORC-961: [C++] expose related metrics of the reader

Posted by GitBox <gi...@apache.org>.
coderex2522 commented on code in PR #1158:
URL: https://github.com/apache/orc/pull/1158#discussion_r896881255


##########
c++/include/orc/Reader.hh:
##########
@@ -39,6 +39,29 @@ namespace orc {
   struct ReaderOptionsPrivate;
   struct RowReaderOptionsPrivate;
 
+  struct ReaderMetrics {
+    volatile uint64_t ReaderCount;
+    volatile uint64_t ReaderInclusiveLatencyUs;
+    volatile uint64_t DecompCount;
+    volatile uint64_t DecompLatencyUs;
+    volatile uint64_t DecodingCount;
+    volatile uint64_t DecodingLatencyUs;
+    volatile uint64_t ByteDecodingCount;
+    volatile uint64_t ByteDecodingLatencyUs;
+
+    ReaderMetrics() {
+      ReaderCount = 0;
+      ReaderInclusiveLatencyUs = 0;
+      DecompCount = 0;
+      DecompLatencyUs = 0;
+      DecodingCount = 0;
+      DecodingLatencyUs = 0;
+      ByteDecodingCount = 0;
+      ByteDecodingLatencyUs = 0;

Review Comment:
   I add the IO latency and count in the OrcFile.cc and OrcHdfsFile.cc



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] wgtmac commented on pull request #1158: ORC-961: [C++] expose related metrics of the reader

Posted by GitBox <gi...@apache.org>.
wgtmac commented on PR #1158:
URL: https://github.com/apache/orc/pull/1158#issuecomment-1153460185

   I have requested @stiga-huang for review because he has opened the JIRA.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] stiga-huang commented on a diff in pull request #1158: ORC-961: [C++] expose related metrics of the reader

Posted by GitBox <gi...@apache.org>.
stiga-huang commented on code in PR #1158:
URL: https://github.com/apache/orc/pull/1158#discussion_r897760686


##########
c++/src/Utils.hh:
##########
@@ -0,0 +1,58 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef ORC_UTILS_HH
+#define ORC_UTILS_HH
+
+#include <atomic>
+#include <chrono>
+
+namespace orc {
+
+class AutoStopwatch {
+    std::chrono::high_resolution_clock::time_point start;

Review Comment:
   This file uses 4 spaces indention. I think our code style uses 2 spaces.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] wgtmac commented on a diff in pull request #1158: ORC-961: [C++] expose related metrics of the reader

Posted by GitBox <gi...@apache.org>.
wgtmac commented on code in PR #1158:
URL: https://github.com/apache/orc/pull/1158#discussion_r897457855


##########
c++/src/Compression.cc:
##########
@@ -633,9 +642,10 @@ DIAGNOSTIC_PUSH
   ZlibDecompressionStream::ZlibDecompressionStream
                    (std::unique_ptr<SeekableInputStream> inStream,
                     size_t bufferSize,
-                    MemoryPool& _pool
+                    MemoryPool& _pool,
+                    ReaderMetrics& metrics

Review Comment:
   Rename the argument to _metrics, this is why the MasOS build fails. Same for constructors below.
   
   > [ 48%] Building CXX object c++/src/CMakeFiles/orc.dir/Compression.cc.o
   /Users/runner/work/orc/orc/c++/src/Compression.cc:646:36: error: parameter 'metrics' shadows member inherited from type 'DecompressionStream' [-Werror,-Wshadow-field]
                       ReaderMetrics& metrics
                                      ^
   /Users/runner/work/orc/orc/c++/src/Compression.cc:396:20: note: declared here
       ReaderMetrics& metrics;
                      ^
   /Users/runner/work/orc/orc/c++/src/Compression.cc:767:36: error: parameter 'metrics' shadows member inherited from type 'DecompressionStream' [-Werror,-Wshadow-field]
                       ReaderMetrics& metrics
                                      ^
   /Users/runner/work/orc/orc/c++/src/Compression.cc:396:20: note: declared here
       ReaderMetrics& metrics;
                      ^
   /Users/runner/work/orc/orc/c++/src/Compression.cc:816:46: error: parameter 'metrics' shadows member inherited from type 'DecompressionStream' [-Werror,-Wshadow-field]
                                 ReaderMetrics& metrics
                                                ^
   /Users/runner/work/orc/orc/c++/src/Compression.cc:396:20: note: declared here
       ReaderMetrics& metrics;
                      ^
   /Users/runner/work/orc/orc/c++/src/Compression.cc:861:43: error: parameter 'metrics' shadows member inherited from type 'DecompressionStream' [-Werror,-Wshadow-field]
                              ReaderMetrics& metrics
                                             ^
   /Users/runner/work/orc/orc/c++/src/Compression.cc:396:20: note: declared here
       ReaderMetrics& metrics;
                      ^
   /Users/runner/work/orc/orc/c++/src/Compression.cc:895:43: error: parameter 'metrics' shadows member inherited from type 'DecompressionStream' [-Werror,-Wshadow-field]
                              ReaderMetrics& metrics
                                             ^
   /Users/runner/work/orc/orc/c++/src/Compression.cc:396:20: note: declared here
       ReaderMetrics& metrics;
                      ^
   /Users/runner/work/orc/orc/c++/src/Compression.cc:[1201](https://github.com/apache/orc/runs/6884140918?check_suite_focus=true#step:5:1202):44: error: parameter 'metrics' shadows member inherited from type 'DecompressionStream' [-Werror,-Wshadow-field]
                               ReaderMetrics& metrics)
                                              ^
   /Users/runner/work/orc/orc/c++/src/Compression.cc:396:20: note: declared here
       ReaderMetrics& metrics;
                      ^
   6 errors generated.
   make[2]: *** [c++/src/CMakeFiles/orc.dir/Compression.cc.o] Error 1
   make[1]: *** [c++/src/CMakeFiles/orc.dir/all] Error 2
   make: *** [all] Error 2



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] dongjoon-hyun commented on pull request #1158: ORC-961: [C++] expose related metrics of the reader

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on PR #1158:
URL: https://github.com/apache/orc/pull/1158#issuecomment-1198457211

   BTW, please include the reproducible step for the following observation, @XinyuZeng .
   > reading time increased by more than 2x with orc-scan on my machine.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] coderex2522 commented on pull request #1158: ORC-961: [C++] expose related metrics of the reader

Posted by GitBox <gi...@apache.org>.
coderex2522 commented on PR #1158:
URL: https://github.com/apache/orc/pull/1158#issuecomment-1202069403

   > @dongjoon-hyun I created jira-issue 1232 for this problem. Also, even the metrics is disabled by setting nullptr, there is still some cost of initing the AutoStopWatch object and destructing it. The following perf flamegraph shows the cost. Although the extra time may be trivial, it may be better to be disabled at compile time.
   
   
   I am confused about why does the chrono time function call still appear in the flamegraph after disabling the metrics.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] coderex2522 commented on a diff in pull request #1158: ORC-961: [C++] expose related metrics of the reader

Posted by GitBox <gi...@apache.org>.
coderex2522 commented on code in PR #1158:
URL: https://github.com/apache/orc/pull/1158#discussion_r911682751


##########
c++/include/orc/Reader.hh:
##########
@@ -32,13 +32,40 @@
 #include <set>
 #include <string>
 #include <vector>
+#include <atomic>
 
 namespace orc {
 
   // classes that hold data members so we can maintain binary compatibility
   struct ReaderOptionsPrivate;
   struct RowReaderOptionsPrivate;
 
+  /**
+   * Expose the reader metrics including the latency and
+   * number of calls of the decompression/decoding/IO modules.
+   */
+  struct ReaderMetrics {
+    std::atomic<uint64_t> ReaderCount{0};
+    // ReaderInclusiveLatencyUs contains the latency of
+    // the decompression/decoding/IO modules.
+    std::atomic<uint64_t> ReaderInclusiveLatencyUs{0};
+    std::atomic<uint64_t> DecompCount{0};

Review Comment:
   The DecompCount and DecompLatencyUs have been replaced with DecompressionCall and DecompressionLatencyUs.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] coderex2522 commented on pull request #1158: ORC-961: [C++] expose related metrics of the reader

Posted by GitBox <gi...@apache.org>.
coderex2522 commented on PR #1158:
URL: https://github.com/apache/orc/pull/1158#issuecomment-1203459943

   > 
   
   I submitted [ORC-1232 PR](https://github.com/apache/orc/pull/1206) in order to diable the metrics collector at compile phase. If there is no requirement for metrics collection, users do not have to turn on BUILD_CPP_ENABLE_METRICS.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] wgtmac commented on pull request #1158: ORC-961: [C++] expose related metrics of the reader

Posted by GitBox <gi...@apache.org>.
wgtmac commented on PR #1158:
URL: https://github.com/apache/orc/pull/1158#issuecomment-1179883140

   > Thank you, @coderex2522 and @wgtmac .
   > 
   > I have three questions as the final pieces:
   > 
   > * Backward-compatibility: Apache ORC C++ API is used in the downstream. So, although this PR looks good, it would be great to make it sure that we don't break the backward compatibility once more. I believe the default value of new parameters could reduce the pain.
   > * The ability to disable this metrics. According to the code, we can disable this feature by hand over NULL pointer? Could you double-confirm it? Could you add a test coverage to confirm that this case is possible? It will be helpful to prevent accidentally segfault in the future.
   > * Last question is about the target version. For this PR, if you want, you can backport this to branch-1.8 for Apache ORC 1.8.0. Which one do you prefer v1.9.0 (Next year) or v1.8.0 (This year).
   
   Good suggestion @dongjoon-hyun.
   
   We should be able to set nullptr to ReaderMetrics to disable it at run time. BTW, we can totally disable it at compile time if we have something like this: https://github.com/apache/impala/blob/master/be/src/util/runtime-profile-counters.h#L53


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] coderex2522 commented on a diff in pull request #1158: ORC-961: [C++] expose related metrics of the reader

Posted by GitBox <gi...@apache.org>.
coderex2522 commented on code in PR #1158:
URL: https://github.com/apache/orc/pull/1158#discussion_r911683770


##########
c++/src/Utils.hh:
##########
@@ -0,0 +1,63 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef ORC_UTILS_HH
+#define ORC_UTILS_HH
+
+#include <atomic>
+#include <chrono>
+
+namespace orc {
+
+class AutoStopwatch {
+  std::chrono::high_resolution_clock::time_point start;
+  std::atomic<uint64_t>* latencyUs;
+  std::atomic<uint64_t>* count;
+
+public:
+  AutoStopwatch(std::atomic<uint64_t>* _latencyUs,
+                std::atomic<uint64_t>* _count)
+                : latencyUs(_latencyUs),
+                  count(_count) {
+    if (latencyUs) {
+      start = std::chrono::high_resolution_clock::now();
+    }
+  }
+
+  ~AutoStopwatch() {
+    if (latencyUs) {
+      std::chrono::microseconds elapsedTime =
+        std::chrono::duration_cast<std::chrono::microseconds>(
+          std::chrono::high_resolution_clock::now() - start);
+      latencyUs->fetch_add(static_cast<uint64_t>(elapsedTime.count()));
+    }
+
+    if (count) {
+      count->fetch_add(1);
+    }
+  }
+};
+
+#define DEFINE_AUTO_STOPWATCH(METRICS_PTR, LATENCY_VAR, COUNT_VAR)    \

Review Comment:
   Ok, the macro DEFINE_AUTO_STOPWATCH was replaced with SCOPED_STOPWATCH.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] coderex2522 commented on a diff in pull request #1158: ORC-961: [C++] expose related metrics of the reader

Posted by GitBox <gi...@apache.org>.
coderex2522 commented on code in PR #1158:
URL: https://github.com/apache/orc/pull/1158#discussion_r911718586


##########
c++/src/RleDecoderV2.cc:
##########
@@ -432,6 +435,7 @@ void RleDecoderV2::skip(uint64_t numValues) {
 void RleDecoderV2::next(int64_t* const data,
                         const uint64_t numValues,
                         const char* const notNull) {
+  DEFINE_AUTO_STOPWATCH(metrics, DecodingLatencyUs, DecodingCount);

Review Comment:
   Currently the pure decoding time is got by subtracting the time taken by the readByte() function. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] XinyuZeng commented on pull request #1158: ORC-961: [C++] expose related metrics of the reader

Posted by GitBox <gi...@apache.org>.
XinyuZeng commented on PR #1158:
URL: https://github.com/apache/orc/pull/1158#issuecomment-1197606237

   It seems like this feature is enabled by default and will bring extra overhead to the decoding performance, as I tested via orc-scan. Could this feature be disabled by default?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] dongjoon-hyun closed pull request #1158: ORC-961: [C++] expose related metrics of the reader

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun closed pull request #1158: ORC-961: [C++] expose related metrics of the reader
URL: https://github.com/apache/orc/pull/1158


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] coderex2522 commented on a diff in pull request #1158: ORC-961: [C++] expose related metrics of the reader

Posted by GitBox <gi...@apache.org>.
coderex2522 commented on code in PR #1158:
URL: https://github.com/apache/orc/pull/1158#discussion_r898626261


##########
c++/src/Utils.hh:
##########
@@ -0,0 +1,58 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef ORC_UTILS_HH
+#define ORC_UTILS_HH
+
+#include <atomic>
+#include <chrono>
+
+namespace orc {
+
+class AutoStopwatch {
+    std::chrono::high_resolution_clock::time_point start;

Review Comment:
   Yes, I fixed the Utils.hh file format.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] stiga-huang commented on a diff in pull request #1158: ORC-961: [C++] expose related metrics of the reader

Posted by GitBox <gi...@apache.org>.
stiga-huang commented on code in PR #1158:
URL: https://github.com/apache/orc/pull/1158#discussion_r903329410


##########
c++/include/orc/Reader.hh:
##########
@@ -32,13 +32,34 @@
 #include <set>
 #include <string>
 #include <vector>
+#include <atomic>
 
 namespace orc {
 
   // classes that hold data members so we can maintain binary compatibility
   struct ReaderOptionsPrivate;
   struct RowReaderOptionsPrivate;
 
+  /**
+   * Expose the reader metrics including the latency and
+   * number of calls of the decompression/decoding/IO modules.

Review Comment:
   Please document the relation of the latency counters. IIUC, ReaderInclusiveLatencyUs contains the time of decoding latency. Decoding latency contains the time of decompression latency which finally contains the time of IO blocking.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] wgtmac commented on a diff in pull request #1158: ORC-961: [C++] expose related metrics of the reader

Posted by GitBox <gi...@apache.org>.
wgtmac commented on code in PR #1158:
URL: https://github.com/apache/orc/pull/1158#discussion_r895301590


##########
c++/include/orc/Reader.hh:
##########
@@ -39,6 +39,29 @@ namespace orc {
   struct ReaderOptionsPrivate;
   struct RowReaderOptionsPrivate;
 
+  struct ReaderMetrics {
+    volatile uint64_t ReaderCount;

Review Comment:
   Use std::atomic_uint64_t to replace volatile keyword as it does not guarantee atomicity.



##########
c++/include/orc/Reader.hh:
##########
@@ -455,6 +478,12 @@ namespace orc {
      */
     virtual bool hasCorrectStatistics() const = 0;
 
+    /**
+     * Get metrics of the reader
+     * @return the accumulated reader metrics to current state.
+     */
+    virtual ReaderMetrics getReaderMetrics() const = 0;

Review Comment:
   const ReaderMetrics& ?



##########
tools/src/ToolsHelper.cc:
##########
@@ -80,10 +83,25 @@ bool parseOptions(int* argc, char** argv[], uint64_t* batchSize, orc::RowReaderO
         }
         break;
       }
+      case 'm' : {
+        *showMetrics = true;
+        break;
+      }
       default: break;
     }
   } while (opt != -1);
   *argc -= optind;
   *argv += optind;
   return true;
 }
+
+void printReaderMetrics(std::ostream& out, const orc::ReaderMetrics& metrics) {
+  out << "ReaderCount: " << metrics.ReaderCount << std::endl;
+  out << "ElapsedTimeUs: " << metrics.ReaderInclusiveLatencyUs << std::endl;

Review Comment:
   nit: user may expect to read time in seconds which is much easier to interpret.



##########
c++/include/orc/Reader.hh:
##########
@@ -39,6 +39,29 @@ namespace orc {
   struct ReaderOptionsPrivate;
   struct RowReaderOptionsPrivate;
 
+  struct ReaderMetrics {

Review Comment:
   Please add comment to help understanding.



##########
c++/include/orc/Reader.hh:
##########
@@ -39,6 +39,29 @@ namespace orc {
   struct ReaderOptionsPrivate;
   struct RowReaderOptionsPrivate;
 
+  struct ReaderMetrics {
+    volatile uint64_t ReaderCount;
+    volatile uint64_t ReaderInclusiveLatencyUs;
+    volatile uint64_t DecompCount;
+    volatile uint64_t DecompLatencyUs;
+    volatile uint64_t DecodingCount;
+    volatile uint64_t DecodingLatencyUs;
+    volatile uint64_t ByteDecodingCount;
+    volatile uint64_t ByteDecodingLatencyUs;
+
+    ReaderMetrics() {
+      ReaderCount = 0;
+      ReaderInclusiveLatencyUs = 0;
+      DecompCount = 0;
+      DecompLatencyUs = 0;
+      DecodingCount = 0;
+      DecodingLatencyUs = 0;
+      ByteDecodingCount = 0;
+      ByteDecodingLatencyUs = 0;

Review Comment:
   Can we support blocking io latency as well?



##########
c++/src/Reader.cc:
##########
@@ -1154,6 +1165,8 @@ namespace orc {
   }
 
   bool RowReaderImpl::next(ColumnVectorBatch& data) {
+    AutoStopwatch measure(&contents->readerMetrics->ReaderInclusiveLatencyUs,
+                  &contents->readerMetrics->ReaderCount);

Review Comment:
   Fix alignment



##########
c++/include/orc/Reader.hh:
##########
@@ -39,6 +39,29 @@ namespace orc {
   struct ReaderOptionsPrivate;
   struct RowReaderOptionsPrivate;
 
+  struct ReaderMetrics {
+    volatile uint64_t ReaderCount;
+    volatile uint64_t ReaderInclusiveLatencyUs;
+    volatile uint64_t DecompCount;
+    volatile uint64_t DecompLatencyUs;
+    volatile uint64_t DecodingCount;
+    volatile uint64_t DecodingLatencyUs;
+    volatile uint64_t ByteDecodingCount;
+    volatile uint64_t ByteDecodingLatencyUs;
+
+    ReaderMetrics() {
+      ReaderCount = 0;

Review Comment:
   nit: provide the default value in the declaration so we can avoid explicit constructor here. 



##########
c++/src/Utils.hh:
##########
@@ -0,0 +1,61 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef ORC_UTILS_HH
+#define ORC_UTILS_HH
+
+#include <stdint.h>
+#include <time.h>
+#include <sys/time.h>
+
+namespace orc {
+
+static inline uint64_t getCurrentTimeUs() {
+  timespec ts;
+  clock_gettime(CLOCK_MONOTONIC, &ts);
+  return static_cast<uint64_t>(ts.tv_sec) * 1000000ULL +
+         static_cast<uint64_t>(ts.tv_nsec) / 1000;
+}
+
+class AutoStopwatch {
+    uint64_t start;
+    volatile uint64_t* latencyUs;
+    volatile uint64_t* count;
+
+public:
+    AutoStopwatch(volatile uint64_t* latency,
+                  volatile uint64_t* cnt) {
+        start = getCurrentTimeUs();
+        latencyUs = latency;
+        count = cnt;
+    }
+
+    ~AutoStopwatch() {
+        if (latencyUs) {
+            uint64_t elapsedTime = getCurrentTimeUs() - start;
+             __sync_fetch_and_add(latencyUs, elapsedTime);

Review Comment:
   Windows does not provide this API, please use std::atomic instead.



##########
c++/src/Utils.hh:
##########
@@ -0,0 +1,61 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef ORC_UTILS_HH
+#define ORC_UTILS_HH
+
+#include <stdint.h>
+#include <time.h>
+#include <sys/time.h>

Review Comment:
   Windows may fail with this inclusion. You may want to use std chrono library. See : https://en.cppreference.com/w/cpp/chrono/high_resolution_clock/now



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] coderex2522 commented on pull request #1158: ORC-961: [C++] expose related metrics of the reader

Posted by GitBox <gi...@apache.org>.
coderex2522 commented on PR #1158:
URL: https://github.com/apache/orc/pull/1158#issuecomment-1155949580

   > FYI, this PR still fails on all MacOSes, @coderex2522 .
   
   fixed.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] stiga-huang commented on pull request #1158: ORC-961: [C++] expose related metrics of the reader

Posted by GitBox <gi...@apache.org>.
stiga-huang commented on PR #1158:
URL: https://github.com/apache/orc/pull/1158#issuecomment-1156241369

   Thanks for adding these metrics! Can we also expose the stats of `SargsApplier`?
   https://github.com/apache/orc/blob/afc671d214020116cb5d6cde8c3b2dfc945295c5/c%2B%2B/src/sargs/SargsApplier.hh#L122-L123


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] coderex2522 commented on pull request #1158: ORC-961: [C++] expose related metrics of the reader

Posted by GitBox <gi...@apache.org>.
coderex2522 commented on PR #1158:
URL: https://github.com/apache/orc/pull/1158#issuecomment-1181256803

   > Thank you, @coderex2522 and @wgtmac .
   > 
   > I have three questions as the final pieces:
   > 
   > * Backward-compatibility: Apache ORC C++ API is used in the downstream. So, although this PR looks good, it would be great to make it sure that we don't break the backward compatibility once more. I believe the default value of new parameters could reduce the pain.
   > * The ability to disable this metrics. According to the code, we can disable this feature by hand over NULL pointer? Could you double-confirm it? Could you add a test coverage to confirm that this case is possible? It will be helpful to prevent accidentally segfault in the future.
   > * Last question is about the target version. For this PR, if you want, you can backport this to branch-1.8 for Apache ORC 1.8.0. Which one do you prefer v1.9.0 (Next year) or v1.8.0 (This year).
   
   * External projects will depend on the api interface in the include directory and need to maintain compatibility, while the header files in the src directory will not be dependent on external projects, so I prefer not to add the default value.
   * I will add the new test to coverage the disable metrics case.
   * The target version v1.9.0 is fine, I have no requirement to support in branch-1.8.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] coderex2522 commented on a diff in pull request #1158: ORC-961: [C++] expose related metrics of the reader

Posted by GitBox <gi...@apache.org>.
coderex2522 commented on code in PR #1158:
URL: https://github.com/apache/orc/pull/1158#discussion_r898629880


##########
tools/src/ToolsHelper.cc:
##########
@@ -80,10 +83,25 @@ bool parseOptions(int* argc, char** argv[], uint64_t* batchSize, orc::RowReaderO
         }
         break;
       }
+      case 'm' : {
+        *showMetrics = true;
+        break;
+      }
       default: break;
     }
   } while (opt != -1);
   *argc -= optind;
   *argv += optind;
   return true;
 }
+
+void printReaderMetrics(std::ostream& out, const orc::ReaderMetrics& metrics) {
+  out << "ReaderCount: " << metrics.ReaderCount << std::endl;
+  out << "ElapsedTimeUs: " << metrics.ReaderInclusiveLatencyUs << std::endl;

Review Comment:
   The output of time metrics is uniformly changed to be in seconds.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] coderex2522 commented on a diff in pull request #1158: ORC-961: [C++] expose related metrics of the reader

Posted by GitBox <gi...@apache.org>.
coderex2522 commented on code in PR #1158:
URL: https://github.com/apache/orc/pull/1158#discussion_r896878729


##########
c++/include/orc/Reader.hh:
##########
@@ -39,6 +39,29 @@ namespace orc {
   struct ReaderOptionsPrivate;
   struct RowReaderOptionsPrivate;
 
+  struct ReaderMetrics {
+    volatile uint64_t ReaderCount;
+    volatile uint64_t ReaderInclusiveLatencyUs;
+    volatile uint64_t DecompCount;
+    volatile uint64_t DecompLatencyUs;
+    volatile uint64_t DecodingCount;
+    volatile uint64_t DecodingLatencyUs;
+    volatile uint64_t ByteDecodingCount;
+    volatile uint64_t ByteDecodingLatencyUs;
+
+    ReaderMetrics() {
+      ReaderCount = 0;

Review Comment:
   has removed the explicit constructor.



##########
c++/include/orc/Reader.hh:
##########
@@ -455,6 +478,12 @@ namespace orc {
      */
     virtual bool hasCorrectStatistics() const = 0;
 
+    /**
+     * Get metrics of the reader
+     * @return the accumulated reader metrics to current state.
+     */
+    virtual ReaderMetrics getReaderMetrics() const = 0;

Review Comment:
   Done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] coderex2522 commented on a diff in pull request #1158: ORC-961: [C++] expose related metrics of the reader

Posted by GitBox <gi...@apache.org>.
coderex2522 commented on code in PR #1158:
URL: https://github.com/apache/orc/pull/1158#discussion_r896879193


##########
c++/include/orc/Reader.hh:
##########
@@ -39,6 +39,29 @@ namespace orc {
   struct ReaderOptionsPrivate;
   struct RowReaderOptionsPrivate;
 
+  struct ReaderMetrics {

Review Comment:
   Done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] coderex2522 commented on a diff in pull request #1158: ORC-961: [C++] expose related metrics of the reader

Posted by GitBox <gi...@apache.org>.
coderex2522 commented on code in PR #1158:
URL: https://github.com/apache/orc/pull/1158#discussion_r896881833


##########
c++/src/Utils.hh:
##########
@@ -0,0 +1,61 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef ORC_UTILS_HH
+#define ORC_UTILS_HH
+
+#include <stdint.h>
+#include <time.h>
+#include <sys/time.h>
+
+namespace orc {
+
+static inline uint64_t getCurrentTimeUs() {
+  timespec ts;
+  clock_gettime(CLOCK_MONOTONIC, &ts);
+  return static_cast<uint64_t>(ts.tv_sec) * 1000000ULL +
+         static_cast<uint64_t>(ts.tv_nsec) / 1000;
+}
+
+class AutoStopwatch {
+    uint64_t start;
+    volatile uint64_t* latencyUs;
+    volatile uint64_t* count;
+
+public:
+    AutoStopwatch(volatile uint64_t* latency,
+                  volatile uint64_t* cnt) {
+        start = getCurrentTimeUs();
+        latencyUs = latency;
+        count = cnt;
+    }
+
+    ~AutoStopwatch() {
+        if (latencyUs) {
+            uint64_t elapsedTime = getCurrentTimeUs() - start;
+             __sync_fetch_and_add(latencyUs, elapsedTime);

Review Comment:
   Done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] coderex2522 commented on pull request #1158: ORC-961: [C++] expose related metrics of the reader

Posted by GitBox <gi...@apache.org>.
coderex2522 commented on PR #1158:
URL: https://github.com/apache/orc/pull/1158#issuecomment-1157149827

   
   
   
   > Thanks for adding these metrics! Can we also expose the stats of `SargsApplier`?
   > 
   > https://github.com/apache/orc/blob/afc671d214020116cb5d6cde8c3b2dfc945295c5/c%2B%2B/src/sargs/SargsApplier.hh#L122-L123
   
   Sure, I add the SelectedRowGroupCount and EvaluatedRowGroupCount to expose the PPD metrics. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] wgtmac commented on a diff in pull request #1158: ORC-961: [C++] expose related metrics of the reader

Posted by GitBox <gi...@apache.org>.
wgtmac commented on code in PR #1158:
URL: https://github.com/apache/orc/pull/1158#discussion_r912578816


##########
c++/src/sargs/SargsApplier.cc:
##########
@@ -173,16 +174,23 @@ namespace orc {
     if (!ret) {
       // reset mNextSkippedRows when the current stripe does not satisfy the PPD
       mNextSkippedRows.clear();
+      if (mMetrics != nullptr) {
+        mMetrics->EvaluatedRowGroupCount.fetch_add(stripeRowGroupCount);
+      }
     }
     return ret;
   }
 
-  bool SargsApplier::evaluateFileStatistics(const proto::Footer& footer) {
+  bool SargsApplier::evaluateFileStatistics(const proto::Footer& footer,
+                                            uint64_t fileRowGroupCount) {
     if (!mHasEvaluatedFileStats) {
       if (footer.statistics_size() == 0) {
         mFileStatsEvalResult = true;
       } else {
         mFileStatsEvalResult = evaluateColumnStatistics(footer.statistics());
+        if (!mFileStatsEvalResult && mMetrics != nullptr) {
+          mMetrics->EvaluatedRowGroupCount.fetch_add(fileRowGroupCount);

Review Comment:
   The unit of input split is stripe. If the current input split does not contain all stripes in the file, it is incorrect to count row groups of the whole file.



##########
c++/src/Reader.cc:
##########
@@ -286,6 +287,10 @@ namespace orc {
           lastStripe = i + 1;
         }
       }
+      if (footer->rowindexstride() > 0) {
+        fileRowGroupCount +=

Review Comment:
   I'd suggest to rename fileRowGroupCount to something like numSplitRowGroups,  numRequestedRowGroups or numRowGroupsInRange. Then refine it to count stripe in range only.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] stiga-huang commented on a diff in pull request #1158: ORC-961: [C++] expose related metrics of the reader

Posted by GitBox <gi...@apache.org>.
stiga-huang commented on code in PR #1158:
URL: https://github.com/apache/orc/pull/1158#discussion_r901538588


##########
c++/include/orc/OrcFile.hh:
##########
@@ -105,20 +105,26 @@ namespace orc {
   /**
    * Create a stream to a local file or HDFS file if path begins with "hdfs://"
    * @param path the name of the file in the local file system or HDFS
+   * @param metrics the metrics of the reader
    */
-  ORC_UNIQUE_PTR<InputStream> readFile(const std::string& path);
+  ORC_UNIQUE_PTR<InputStream> readFile(const std::string& path,
+                                       ReaderMetrics& metrics);

Review Comment:
   I think this breaks the compactability of older versions. Should we use pointer type and default to be nullptr?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] dongjoon-hyun commented on pull request #1158: ORC-961: [C++] expose related metrics of the reader

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on PR #1158:
URL: https://github.com/apache/orc/pull/1158#issuecomment-1198455653

   Thank you for suggestion, @XinyuZeng . Yes, it was also discussed before https://github.com/apache/orc/pull/1158#pullrequestreview-1031811661 .
   
   Could you open a new official JIRA for that? We can add that configuration for that at Apache ORC 1.9.0 too.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] dongjoon-hyun commented on pull request #1158: ORC-961: [C++] expose related metrics of the reader

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on PR #1158:
URL: https://github.com/apache/orc/pull/1158#issuecomment-1201507294

   Thank you for the detail, @coderex2522 . Could you make a PR to provide an opt-out way to achieve your goal?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] XinyuZeng commented on pull request #1158: ORC-961: [C++] expose related metrics of the reader

Posted by GitBox <gi...@apache.org>.
XinyuZeng commented on PR #1158:
URL: https://github.com/apache/orc/pull/1158#issuecomment-1202090174

   
   
   
   > > @dongjoon-hyun I created jira-issue 1232 for this problem. Also, even the metrics is disabled by setting nullptr, there is still some cost of initing the AutoStopWatch object and destructing it. The following perf flamegraph shows the cost. Although the extra time may be trivial, it may be better to be disabled at compile time.
   > 
   > I am confused about why does the chrono time function call still appear in the flamegraph after disabling the metrics.
   
   The SCOPED_STOPWATCH and SCOPED_MINUS_STOPWATCH macros still create the AutoStopwatch object which has chrono members, even after we pass in null METRICS_PTR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org