You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@orc.apache.org by GitBox <gi...@apache.org> on 2022/10/13 06:05:19 UTC
[GitHub] [orc] wgtmac commented on a diff in pull request #1275: ORC-1286: [C++] replace DataBuffer with BlockBuffer in class BufferedOutputStream

wgtmac commented on code in PR #1275:
URL: https://github.com/apache/orc/pull/1275#discussion_r994171033


##########
c++/src/Compression.cc:
##########
@@ -112,7 +118,10 @@ namespace orc {
                                                 bufferSize(0),
                                                 outputPosition(0),
                                                 outputSize(0) {
-    // PASS
+    // init header pointer array
+    for (int i = 0; i < HEADER_SIZE; ++i) {

Review Comment:
   Use single memset?



##########
c++/src/Compression.cc:
##########
@@ -138,19 +147,44 @@ namespace orc {
            static_cast<uint64_t>(outputSize - outputPosition);
   }
 
+  void CompressionStreamBase::writeData(const unsigned char* data, int size) {
+    int offset = 0;
+    while (offset < size) {
+      if (outputPosition == outputSize) {
+        if (!BufferedOutputStream::Next(
+          reinterpret_cast<void **>(&outputBuffer),
+          &outputSize)) {
+            throw std::runtime_error(
+                "Failed to get next output buffer from output stream.");
+        }
+        outputPosition = 0;
+      } else  if (outputPosition > outputSize) {
+        // this will unlikely happen, but we have seen a few on zstd v1.1.0

Review Comment:
   I'd suggest not to explicitly mention that name but simply explain this is a safety check. BTW, the exception should mention it fails in the compressor.



##########
c++/src/io/OutputStream.cc:
##########
@@ -37,7 +37,7 @@ namespace orc {
                                     : outputStream(outStream),
                                       blockSize(blockSize_),
                                       metrics(metrics_) {
-    dataBuffer.reset(new DataBuffer<char>(pool));
+    dataBuffer.reset(new BlockBuffer(pool, blockSize));

Review Comment:
   Does the **blockSize_** here have the same meaning of BlockBuffer?



##########
c++/src/io/OutputStream.cc:
##########
@@ -95,9 +91,11 @@ namespace orc {
 
   uint64_t BufferedOutputStream::flush() {
     uint64_t dataSize = dataBuffer->size();
+    for (uint64_t i = 0; i < dataBuffer->getBlockNumber(); ++i)
     {
       SCOPED_STOPWATCH(metrics, IOBlockingLatencyUs, IOCount);
-      outputStream->write(dataBuffer->data(), dataSize);
+      auto block = dataBuffer->getBlock(i);
+      outputStream->write(block.data, block.size);

Review Comment:
   Here may introduce a performance regression by issuing more write I/Os. If writing to cloud object store such as S3, concatenating buffer and making single write is probably a better choice. A possible solution is to add void BlockBuffer::writeTo(OutputStream*) and handle the logic internally.



##########
c++/src/Compression.cc:
##########
@@ -138,19 +147,44 @@ namespace orc {
            static_cast<uint64_t>(outputSize - outputPosition);
   }
 
+  void CompressionStreamBase::writeData(const unsigned char* data, int size) {

Review Comment:
   Please add a comment



##########
c++/src/Compression.cc:
##########
@@ -93,6 +95,10 @@ namespace orc {
 
     // Compress output buffer size
     int outputSize;
+
+    // Compression block header pointer array
+    static const int HEADER_SIZE = 3;
+    char* header[HEADER_SIZE];

Review Comment:
   Can we use std::array?



##########
c++/src/io/OutputStream.cc:
##########
@@ -95,9 +91,11 @@ namespace orc {
 
   uint64_t BufferedOutputStream::flush() {
     uint64_t dataSize = dataBuffer->size();
+    for (uint64_t i = 0; i < dataBuffer->getBlockNumber(); ++i)
     {
       SCOPED_STOPWATCH(metrics, IOBlockingLatencyUs, IOCount);

Review Comment:
   Measure it outside the loop.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org