You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@orc.apache.org by "coderex2522 (via GitHub)" <gi...@apache.org> on 2023/02/01 04:16:08 UTC

[GitHub] [orc] coderex2522 opened a new pull request, #1394: ORC-1312: [C++] Support setting the block buffer capacity of BufferedOutputStream

coderex2522 opened a new pull request, #1394:
URL: https://github.com/apache/orc/pull/1394

   ### What changes were proposed in this pull request?
   This pr provides the API to set the initial buffer capacity of  BufferedOutputStream.
   
   ### Why are the changes needed?
   It's convenient for users to adjust the buffer size of the BufferedOutputStream according to the written data.
   
   ### How was this patch tested?
   Add the WriterTest.setOutputBufferCapacity to test different buffer capacity of BufferedOutputStream.
   In addition, I did a simple test similar to this [issue](https://github.com/apache/orc/issues/1240)
   
   |    PeakMem     | BufferCapacity = 1MB |  BufferCapacity = 64KB |
   | ------------- | --- | ------------------------------- |
   |10000 Fields  | 38.96GB  | 11.48GB |
   |1000 Fields    | 3.92GB | 1.17GB |
   |100 Fields    | 0.41GB | 0.13GB |
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] coderex2522 commented on a diff in pull request #1394: ORC-1312: [C++] Support setting the block buffer capacity of BufferedOutputStream

Posted by "coderex2522 (via GitHub)" <gi...@apache.org>.
coderex2522 commented on code in PR #1394:
URL: https://github.com/apache/orc/pull/1394#discussion_r1095326993


##########
c++/include/orc/Writer.hh:
##########
@@ -262,6 +262,19 @@ namespace orc {
      * @return if not set, the default is false
      */
     bool getUseTightNumericVector() const;
+
+    /**
+     * Set the initial buffer capacity of output stream.
+     * Each column contains multiple output streams with buffers,

Review Comment:
   Class CompressionStream is a subclass of class BufferedOutputStream which contains the buffer. And BufferedOutputStream has no logic associated with the compression algorithm, so it may not be appropriate to use compressed stream to describe it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] dongjoon-hyun commented on a diff in pull request #1394: ORC-1312: [C++] Support setting the block buffer capacity of BufferedOutputStream

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on code in PR #1394:
URL: https://github.com/apache/orc/pull/1394#discussion_r1092748437


##########
c++/include/orc/Writer.hh:
##########
@@ -262,6 +262,17 @@ namespace orc {
      * @return if not set, the default is false
      */
     bool getUseTightNumericVector() const;
+
+     /**
+     * Set the buffer capacity of output stream.
+     */
+    WriterOptions& setOutputBufferCapacity(uint64_t capacity);

Review Comment:
   Could you describe the characteristic of this capacity, immutable or updatabe?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] coderex2522 merged pull request #1394: ORC-1312: [C++] Support setting the block buffer capacity of BufferedOutputStream

Posted by "coderex2522 (via GitHub)" <gi...@apache.org>.
coderex2522 merged PR #1394:
URL: https://github.com/apache/orc/pull/1394


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] wgtmac commented on a diff in pull request #1394: ORC-1312: [C++] Support setting the block buffer capacity of BufferedOutputStream

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #1394:
URL: https://github.com/apache/orc/pull/1394#discussion_r1094734698


##########
c++/include/orc/Writer.hh:
##########
@@ -262,6 +262,19 @@ namespace orc {
      * @return if not set, the default is false
      */
     bool getUseTightNumericVector() const;
+
+    /**
+     * Set the initial buffer capacity of output stream.

Review Comment:
   ```suggestion
        * Set the initial capacity to the output buffer of the compressed stream.
   ```



##########
c++/include/orc/Writer.hh:
##########
@@ -262,6 +262,19 @@ namespace orc {
      * @return if not set, the default is false
      */
     bool getUseTightNumericVector() const;
+
+    /**
+     * Set the initial buffer capacity of output stream.
+     * Each column contains multiple output streams with buffers,

Review Comment:
   ```suggestion
        * Each column contains one or more compressed streams depending on its type,
   ```



##########
c++/include/orc/Writer.hh:
##########
@@ -262,6 +262,19 @@ namespace orc {
      * @return if not set, the default is false
      */
     bool getUseTightNumericVector() const;
+
+    /**
+     * Set the initial buffer capacity of output stream.
+     * Each column contains multiple output streams with buffers,
+     * and these buffers will automatically expand when memory is exhausted.

Review Comment:
   Correct me if I am wrong here.



##########
c++/include/orc/Writer.hh:
##########
@@ -262,6 +262,19 @@ namespace orc {
      * @return if not set, the default is false
      */
     bool getUseTightNumericVector() const;
+
+    /**
+     * Set the initial buffer capacity of output stream.
+     * Each column contains multiple output streams with buffers,
+     * and these buffers will automatically expand when memory is exhausted.

Review Comment:
   ```suggestion
        * and these buffers will automatically expand when necessary but up to the compression block size.
   ```



##########
c++/include/orc/Writer.hh:
##########
@@ -262,6 +262,19 @@ namespace orc {
      * @return if not set, the default is false
      */
     bool getUseTightNumericVector() const;
+
+    /**
+     * Set the initial buffer capacity of output stream.
+     * Each column contains multiple output streams with buffers,
+     * and these buffers will automatically expand when memory is exhausted.
+     */
+    WriterOptions& setOutputBufferCapacity(uint64_t capacity);
+
+    /**
+     * Get the buffer capacity of output stream.

Review Comment:
   ```suggestion
        * Get the initial capacity of output buffer in the compressed stream.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] dongjoon-hyun commented on pull request #1394: ORC-1312: [C++] Support setting the block buffer capacity of BufferedOutputStream

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #1394:
URL: https://github.com/apache/orc/pull/1394#issuecomment-1411445656

   cc @wgtmac , @stiga-huang , @williamhyun , @guiyanakuang 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] coderex2522 commented on a diff in pull request #1394: ORC-1312: [C++] Support setting the block buffer capacity of BufferedOutputStream

Posted by "coderex2522 (via GitHub)" <gi...@apache.org>.
coderex2522 commented on code in PR #1394:
URL: https://github.com/apache/orc/pull/1394#discussion_r1094597849


##########
c++/include/orc/Writer.hh:
##########
@@ -262,6 +262,17 @@ namespace orc {
      * @return if not set, the default is false
      */
     bool getUseTightNumericVector() const;
+
+     /**
+     * Set the buffer capacity of output stream.
+     */
+    WriterOptions& setOutputBufferCapacity(uint64_t capacity);

Review Comment:
   Add some descriptions to the API.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] coderex2522 commented on a diff in pull request #1394: ORC-1312: [C++] Support setting the block buffer capacity of BufferedOutputStream

Posted by "coderex2522 (via GitHub)" <gi...@apache.org>.
coderex2522 commented on code in PR #1394:
URL: https://github.com/apache/orc/pull/1394#discussion_r1094594909


##########
c++/include/orc/Writer.hh:
##########
@@ -262,6 +262,17 @@ namespace orc {
      * @return if not set, the default is false
      */
     bool getUseTightNumericVector() const;
+
+     /**

Review Comment:
   Fixed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] wgtmac commented on a diff in pull request #1394: ORC-1312: [C++] Support setting the block buffer capacity of BufferedOutputStream

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #1394:
URL: https://github.com/apache/orc/pull/1394#discussion_r1094738767


##########
c++/include/orc/Writer.hh:
##########
@@ -262,6 +262,19 @@ namespace orc {
      * @return if not set, the default is false
      */
     bool getUseTightNumericVector() const;
+
+    /**
+     * Set the initial buffer capacity of output stream.
+     * Each column contains multiple output streams with buffers,
+     * and these buffers will automatically expand when memory is exhausted.

Review Comment:
   ```suggestion
        * and these buffers will automatically expand when more memory is necessary.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] coderex2522 commented on a diff in pull request #1394: ORC-1312: [C++] Support setting the block buffer capacity of BufferedOutputStream

Posted by "coderex2522 (via GitHub)" <gi...@apache.org>.
coderex2522 commented on code in PR #1394:
URL: https://github.com/apache/orc/pull/1394#discussion_r1095327973


##########
c++/include/orc/Writer.hh:
##########
@@ -262,6 +262,19 @@ namespace orc {
      * @return if not set, the default is false
      */
     bool getUseTightNumericVector() const;
+
+    /**
+     * Set the initial buffer capacity of output stream.
+     * Each column contains multiple output streams with buffers,
+     * and these buffers will automatically expand when memory is exhausted.

Review Comment:
   When the used size of the buffer reaches the initial capacity, the buffer will continue to expand in compression block size units.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] wgtmac commented on a diff in pull request #1394: ORC-1312: [C++] Support setting the block buffer capacity of BufferedOutputStream

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #1394:
URL: https://github.com/apache/orc/pull/1394#discussion_r1094739476


##########
c++/include/orc/Writer.hh:
##########
@@ -262,6 +262,19 @@ namespace orc {
      * @return if not set, the default is false
      */
     bool getUseTightNumericVector() const;
+
+    /**
+     * Set the initial buffer capacity of output stream.
+     * Each column contains multiple output streams with buffers,
+     * and these buffers will automatically expand when memory is exhausted.
+     */
+    WriterOptions& setOutputBufferCapacity(uint64_t capacity);
+
+    /**
+     * Get the buffer capacity of output stream.

Review Comment:
   ```suggestion
        * Get the initial capacity of output buffer in the class BufferedOutputStream.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] wgtmac commented on a diff in pull request #1394: ORC-1312: [C++] Support setting the block buffer capacity of BufferedOutputStream

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #1394:
URL: https://github.com/apache/orc/pull/1394#discussion_r1094738767


##########
c++/include/orc/Writer.hh:
##########
@@ -262,6 +262,19 @@ namespace orc {
      * @return if not set, the default is false
      */
     bool getUseTightNumericVector() const;
+
+    /**
+     * Set the initial buffer capacity of output stream.
+     * Each column contains multiple output streams with buffers,
+     * and these buffers will automatically expand when memory is exhausted.

Review Comment:
   ```suggestion
        * and these buffers will automatically expand when more memory is required.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] dongjoon-hyun commented on pull request #1394: ORC-1312: [C++] Support setting the block buffer capacity of BufferedOutputStream

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #1394:
URL: https://github.com/apache/orc/pull/1394#issuecomment-1421011575

   FYI, we provide merge script like Apache Spark community which preserves the committer identities when you merge someone else's PR.
   - https://github.com/apache/orc/blob/main/dev/merge_orc_pr.py


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] williamhyun commented on pull request #1394: ORC-1312: [C++] Support setting the block buffer capacity of BufferedOutputStream

Posted by "williamhyun (via GitHub)" <gi...@apache.org>.
williamhyun commented on PR #1394:
URL: https://github.com/apache/orc/pull/1394#issuecomment-1426619382

   @coderex2522 
   
   Hey Xin, I've made an official announcement in the ORC dev and user mailing list welcoming you as our new committer. 
   If you have not yet subscribed to these mailing lists, please do so ASAP.
   
   In addition, Could you make a PR to update the ORC website to add you as a committer?
   - https://orc.apache.org/develop/committers/
   This is the page that should be updated. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] wgtmac commented on a diff in pull request #1394: ORC-1312: [C++] Support setting the block buffer capacity of BufferedOutputStream

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #1394:
URL: https://github.com/apache/orc/pull/1394#discussion_r1095366122


##########
c++/include/orc/Writer.hh:
##########
@@ -262,6 +262,19 @@ namespace orc {
      * @return if not set, the default is false
      */
     bool getUseTightNumericVector() const;
+
+    /**
+     * Set the initial buffer capacity of output stream.
+     * Each column contains multiple output streams with buffers,

Review Comment:
   Then it makes sense to me to use `BufferedOutputStream` explicitly. We have defined several output streams (e.g. one from `OrcFile.hh`). It should be clear to user which one we are referring to.



##########
c++/include/orc/Writer.hh:
##########
@@ -262,6 +262,19 @@ namespace orc {
      * @return if not set, the default is false
      */
     bool getUseTightNumericVector() const;
+
+    /**
+     * Set the initial buffer capacity of output stream.
+     * Each column contains multiple output streams with buffers,
+     * and these buffers will automatically expand when memory is exhausted.

Review Comment:
   In the `CompressionStream`, will the output buffer exceed the compression block size?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] dongjoon-hyun commented on a diff in pull request #1394: ORC-1312: [C++] Support setting the block buffer capacity of BufferedOutputStream

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on code in PR #1394:
URL: https://github.com/apache/orc/pull/1394#discussion_r1092749682


##########
c++/test/TestWriter.cc:
##########
@@ -1835,6 +1835,63 @@ namespace orc {
     testSuppressPresentStream(CompressionKind_SNAPPY);
   }
 
+  void testSetOutputBufferCapacity(uint64_t capacity) {
+    MemoryOutputStream memStream(DEFAULT_MEM_STREAM_SIZE);
+    MemoryPool* pool = getDefaultPool();
+    size_t rowCount = 1000;
+    {
+      auto type = std::unique_ptr<Type>(
+        Type::buildTypeFromString("struct<col1:int,col2:int>"));
+      WriterOptions options;
+      options.setStripeSize(1024 * 1024)
+          .setCompressionBlockSize(1024)
+          .setCompression(CompressionKind_NONE)
+          .setMemoryPool(pool)
+          .setRowIndexStride(1000)
+          .setOutputBufferCapacity(capacity);
+
+      auto writer = createWriter(*type, &memStream, options);
+      auto batch = writer->createRowBatch(rowCount);
+      auto& structBatch = dynamic_cast<StructVectorBatch&>(*batch);
+      auto& longBatch1 = dynamic_cast<LongVectorBatch&>(*structBatch.fields[0]);
+      auto& longBatch2 = dynamic_cast<LongVectorBatch&>(*structBatch.fields[1]);
+      structBatch.numElements = rowCount;
+      longBatch1.numElements = rowCount;
+      longBatch2.numElements = rowCount;
+      for (size_t i = 0; i < rowCount; ++i) {
+        longBatch1.data[i] = static_cast<int64_t>(i * 100);
+        longBatch2.data[i] = static_cast<int64_t>(i * 300);
+      }
+      writer->add(*batch);
+      writer->close();
+    }
+    // read orc file & check the data
+    {
+      std::unique_ptr<InputStream> inStream(
+          new MemoryInputStream(memStream.getData(), memStream.getLength()));
+      ReaderOptions readerOptions;
+      readerOptions.setMemoryPool(*pool);
+      std::unique_ptr<Reader> reader = createReader(std::move(inStream), readerOptions);
+      std::unique_ptr<RowReader> rowReader = createRowReader(reader.get());
+      auto batch = rowReader->createRowBatch(rowCount);
+      EXPECT_TRUE(rowReader->next(*batch));
+      EXPECT_EQ(rowCount, batch->numElements);
+      auto& structBatch = dynamic_cast<StructVectorBatch&>(*batch);
+      auto& longBatch1 = dynamic_cast<LongVectorBatch&>(*structBatch.fields[0]);
+      auto& longBatch2 = dynamic_cast<LongVectorBatch&>(*structBatch.fields[1]);
+      for (size_t i = 0; i < rowCount; ++i) {
+        EXPECT_EQ(longBatch1.data[i], static_cast<int64_t>(i * 100));
+        EXPECT_EQ(longBatch2.data[i], static_cast<int64_t>(i * 300));
+      }
+    }
+  }
+
+  TEST(WriterTest, setOutputBufferCapacity) {

Review Comment:
   Thank you for adding this.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] dongjoon-hyun commented on pull request #1394: ORC-1312: [C++] Support setting the block buffer capacity of BufferedOutputStream

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #1394:
URL: https://github.com/apache/orc/pull/1394#issuecomment-1421009183

   Congratulation, @coderex2522 . :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] coderex2522 commented on pull request #1394: ORC-1312: [C++] Support setting the block buffer capacity of BufferedOutputStream

Posted by "coderex2522 (via GitHub)" <gi...@apache.org>.
coderex2522 commented on PR #1394:
URL: https://github.com/apache/orc/pull/1394#issuecomment-1426808160

   > In addition, Could you make a PR to update the ORC website to add you as a committer?
   
   Thanks, I've subscribed to these mailing lists before. And I will submit a PR to update the ORC website as soon as possible.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] dongjoon-hyun commented on a diff in pull request #1394: ORC-1312: [C++] Support setting the block buffer capacity of BufferedOutputStream

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on code in PR #1394:
URL: https://github.com/apache/orc/pull/1394#discussion_r1092749947


##########
c++/test/TestWriter.cc:
##########
@@ -1835,6 +1835,63 @@ namespace orc {
     testSuppressPresentStream(CompressionKind_SNAPPY);
   }
 
+  void testSetOutputBufferCapacity(uint64_t capacity) {
+    MemoryOutputStream memStream(DEFAULT_MEM_STREAM_SIZE);
+    MemoryPool* pool = getDefaultPool();
+    size_t rowCount = 1000;
+    {
+      auto type = std::unique_ptr<Type>(
+        Type::buildTypeFromString("struct<col1:int,col2:int>"));
+      WriterOptions options;
+      options.setStripeSize(1024 * 1024)
+          .setCompressionBlockSize(1024)
+          .setCompression(CompressionKind_NONE)
+          .setMemoryPool(pool)
+          .setRowIndexStride(1000)
+          .setOutputBufferCapacity(capacity);
+
+      auto writer = createWriter(*type, &memStream, options);
+      auto batch = writer->createRowBatch(rowCount);
+      auto& structBatch = dynamic_cast<StructVectorBatch&>(*batch);
+      auto& longBatch1 = dynamic_cast<LongVectorBatch&>(*structBatch.fields[0]);
+      auto& longBatch2 = dynamic_cast<LongVectorBatch&>(*structBatch.fields[1]);
+      structBatch.numElements = rowCount;
+      longBatch1.numElements = rowCount;
+      longBatch2.numElements = rowCount;
+      for (size_t i = 0; i < rowCount; ++i) {
+        longBatch1.data[i] = static_cast<int64_t>(i * 100);
+        longBatch2.data[i] = static_cast<int64_t>(i * 300);
+      }
+      writer->add(*batch);
+      writer->close();
+    }
+    // read orc file & check the data
+    {
+      std::unique_ptr<InputStream> inStream(
+          new MemoryInputStream(memStream.getData(), memStream.getLength()));
+      ReaderOptions readerOptions;
+      readerOptions.setMemoryPool(*pool);
+      std::unique_ptr<Reader> reader = createReader(std::move(inStream), readerOptions);
+      std::unique_ptr<RowReader> rowReader = createRowReader(reader.get());
+      auto batch = rowReader->createRowBatch(rowCount);
+      EXPECT_TRUE(rowReader->next(*batch));
+      EXPECT_EQ(rowCount, batch->numElements);
+      auto& structBatch = dynamic_cast<StructVectorBatch&>(*batch);
+      auto& longBatch1 = dynamic_cast<LongVectorBatch&>(*structBatch.fields[0]);
+      auto& longBatch2 = dynamic_cast<LongVectorBatch&>(*structBatch.fields[1]);
+      for (size_t i = 0; i < rowCount; ++i) {
+        EXPECT_EQ(longBatch1.data[i], static_cast<int64_t>(i * 100));
+        EXPECT_EQ(longBatch2.data[i], static_cast<int64_t>(i * 300));
+      }
+    }
+  }
+
+  TEST(WriterTest, setOutputBufferCapacity) {
+    testSetOutputBufferCapacity(1024);
+    testSetOutputBufferCapacity(1024 * 1024);
+    testSetOutputBufferCapacity(1024 * 1024 * 1024);

Review Comment:
   If this is `1GB`, I'd remove this from unit test.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] coderex2522 commented on pull request #1394: ORC-1312: [C++] Support setting the block buffer capacity of BufferedOutputStream

Posted by "coderex2522 (via GitHub)" <gi...@apache.org>.
coderex2522 commented on PR #1394:
URL: https://github.com/apache/orc/pull/1394#issuecomment-1415947612

   It seems to be caused by googletest package download failure. Compile successfully after retry.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] wgtmac commented on a diff in pull request #1394: ORC-1312: [C++] Support setting the block buffer capacity of BufferedOutputStream

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #1394:
URL: https://github.com/apache/orc/pull/1394#discussion_r1092771395


##########
c++/include/orc/Writer.hh:
##########
@@ -262,6 +262,17 @@ namespace orc {
      * @return if not set, the default is false
      */
     bool getUseTightNumericVector() const;
+
+     /**
+     * Set the buffer capacity of output stream.
+     */
+    WriterOptions& setOutputBufferCapacity(uint64_t capacity);

Review Comment:
   IMO, the user may not know what is the `OutputBuffer` or how many buffers will be allocated. We need to add comment to cover some basics.



##########
c++/test/TestWriter.cc:
##########
@@ -1835,6 +1835,63 @@ namespace orc {
     testSuppressPresentStream(CompressionKind_SNAPPY);
   }
 
+  void testSetOutputBufferCapacity(uint64_t capacity) {
+    MemoryOutputStream memStream(DEFAULT_MEM_STREAM_SIZE);
+    MemoryPool* pool = getDefaultPool();
+    size_t rowCount = 1000;
+    {
+      auto type = std::unique_ptr<Type>(
+        Type::buildTypeFromString("struct<col1:int,col2:int>"));
+      WriterOptions options;
+      options.setStripeSize(1024 * 1024)
+          .setCompressionBlockSize(1024)

Review Comment:
   Should we change `setCompressionBlockSize` accordingly? It seems that even you have changed the output buffer size but the page is very small and no resize will happen actually. 



##########
c++/test/TestWriter.cc:
##########
@@ -1835,6 +1835,63 @@ namespace orc {
     testSuppressPresentStream(CompressionKind_SNAPPY);
   }
 
+  void testSetOutputBufferCapacity(uint64_t capacity) {
+    MemoryOutputStream memStream(DEFAULT_MEM_STREAM_SIZE);
+    MemoryPool* pool = getDefaultPool();
+    size_t rowCount = 1000;
+    {
+      auto type = std::unique_ptr<Type>(
+        Type::buildTypeFromString("struct<col1:int,col2:int>"));
+      WriterOptions options;
+      options.setStripeSize(1024 * 1024)
+          .setCompressionBlockSize(1024)
+          .setCompression(CompressionKind_NONE)
+          .setMemoryPool(pool)
+          .setRowIndexStride(1000)
+          .setOutputBufferCapacity(capacity);
+
+      auto writer = createWriter(*type, &memStream, options);
+      auto batch = writer->createRowBatch(rowCount);
+      auto& structBatch = dynamic_cast<StructVectorBatch&>(*batch);
+      auto& longBatch1 = dynamic_cast<LongVectorBatch&>(*structBatch.fields[0]);
+      auto& longBatch2 = dynamic_cast<LongVectorBatch&>(*structBatch.fields[1]);
+      structBatch.numElements = rowCount;
+      longBatch1.numElements = rowCount;
+      longBatch2.numElements = rowCount;
+      for (size_t i = 0; i < rowCount; ++i) {
+        longBatch1.data[i] = static_cast<int64_t>(i * 100);
+        longBatch2.data[i] = static_cast<int64_t>(i * 300);
+      }
+      writer->add(*batch);
+      writer->close();
+    }
+    // read orc file & check the data
+    {
+      std::unique_ptr<InputStream> inStream(
+          new MemoryInputStream(memStream.getData(), memStream.getLength()));
+      ReaderOptions readerOptions;
+      readerOptions.setMemoryPool(*pool);
+      std::unique_ptr<Reader> reader = createReader(std::move(inStream), readerOptions);
+      std::unique_ptr<RowReader> rowReader = createRowReader(reader.get());
+      auto batch = rowReader->createRowBatch(rowCount);
+      EXPECT_TRUE(rowReader->next(*batch));
+      EXPECT_EQ(rowCount, batch->numElements);
+      auto& structBatch = dynamic_cast<StructVectorBatch&>(*batch);
+      auto& longBatch1 = dynamic_cast<LongVectorBatch&>(*structBatch.fields[0]);
+      auto& longBatch2 = dynamic_cast<LongVectorBatch&>(*structBatch.fields[1]);
+      for (size_t i = 0; i < rowCount; ++i) {
+        EXPECT_EQ(longBatch1.data[i], static_cast<int64_t>(i * 100));
+        EXPECT_EQ(longBatch2.data[i], static_cast<int64_t>(i * 300));
+      }
+    }
+  }
+
+  TEST(WriterTest, setOutputBufferCapacity) {
+    testSetOutputBufferCapacity(1024);
+    testSetOutputBufferCapacity(1024 * 1024);
+    testSetOutputBufferCapacity(1024 * 1024 * 1024);

Review Comment:
   +1



##########
c++/include/orc/Writer.hh:
##########
@@ -262,6 +262,17 @@ namespace orc {
      * @return if not set, the default is false
      */
     bool getUseTightNumericVector() const;
+
+     /**

Review Comment:
   This is detected by the CI format check.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] dongjoon-hyun commented on a diff in pull request #1394: ORC-1312: [C++] Support setting the block buffer capacity of BufferedOutputStream

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on code in PR #1394:
URL: https://github.com/apache/orc/pull/1394#discussion_r1092749211


##########
c++/include/orc/Writer.hh:
##########
@@ -262,6 +262,17 @@ namespace orc {
      * @return if not set, the default is false
      */
     bool getUseTightNumericVector() const;
+
+     /**
+     * Set the buffer capacity of output stream.
+     */
+    WriterOptions& setOutputBufferCapacity(uint64_t capacity);

Review Comment:
   To be clear, I guess you can say like `initial capacity` and mention when and how this will be changed from the initial value.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] dongjoon-hyun commented on a diff in pull request #1394: ORC-1312: [C++] Support setting the block buffer capacity of BufferedOutputStream

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on code in PR #1394:
URL: https://github.com/apache/orc/pull/1394#discussion_r1092748015


##########
c++/include/orc/Writer.hh:
##########
@@ -262,6 +262,17 @@ namespace orc {
      * @return if not set, the default is false
      */
     bool getUseTightNumericVector() const;
+
+     /**

Review Comment:
   nit. Indentation?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] coderex2522 commented on a diff in pull request #1394: ORC-1312: [C++] Support setting the block buffer capacity of BufferedOutputStream

Posted by "coderex2522 (via GitHub)" <gi...@apache.org>.
coderex2522 commented on code in PR #1394:
URL: https://github.com/apache/orc/pull/1394#discussion_r1094599133


##########
c++/test/TestWriter.cc:
##########
@@ -1835,6 +1835,63 @@ namespace orc {
     testSuppressPresentStream(CompressionKind_SNAPPY);
   }
 
+  void testSetOutputBufferCapacity(uint64_t capacity) {
+    MemoryOutputStream memStream(DEFAULT_MEM_STREAM_SIZE);
+    MemoryPool* pool = getDefaultPool();
+    size_t rowCount = 1000;
+    {
+      auto type = std::unique_ptr<Type>(
+        Type::buildTypeFromString("struct<col1:int,col2:int>"));
+      WriterOptions options;
+      options.setStripeSize(1024 * 1024)
+          .setCompressionBlockSize(1024)
+          .setCompression(CompressionKind_NONE)
+          .setMemoryPool(pool)
+          .setRowIndexStride(1000)
+          .setOutputBufferCapacity(capacity);
+
+      auto writer = createWriter(*type, &memStream, options);
+      auto batch = writer->createRowBatch(rowCount);
+      auto& structBatch = dynamic_cast<StructVectorBatch&>(*batch);
+      auto& longBatch1 = dynamic_cast<LongVectorBatch&>(*structBatch.fields[0]);
+      auto& longBatch2 = dynamic_cast<LongVectorBatch&>(*structBatch.fields[1]);
+      structBatch.numElements = rowCount;
+      longBatch1.numElements = rowCount;
+      longBatch2.numElements = rowCount;
+      for (size_t i = 0; i < rowCount; ++i) {
+        longBatch1.data[i] = static_cast<int64_t>(i * 100);
+        longBatch2.data[i] = static_cast<int64_t>(i * 300);
+      }
+      writer->add(*batch);
+      writer->close();
+    }
+    // read orc file & check the data
+    {
+      std::unique_ptr<InputStream> inStream(
+          new MemoryInputStream(memStream.getData(), memStream.getLength()));
+      ReaderOptions readerOptions;
+      readerOptions.setMemoryPool(*pool);
+      std::unique_ptr<Reader> reader = createReader(std::move(inStream), readerOptions);
+      std::unique_ptr<RowReader> rowReader = createRowReader(reader.get());
+      auto batch = rowReader->createRowBatch(rowCount);
+      EXPECT_TRUE(rowReader->next(*batch));
+      EXPECT_EQ(rowCount, batch->numElements);
+      auto& structBatch = dynamic_cast<StructVectorBatch&>(*batch);
+      auto& longBatch1 = dynamic_cast<LongVectorBatch&>(*structBatch.fields[0]);
+      auto& longBatch2 = dynamic_cast<LongVectorBatch&>(*structBatch.fields[1]);
+      for (size_t i = 0; i < rowCount; ++i) {
+        EXPECT_EQ(longBatch1.data[i], static_cast<int64_t>(i * 100));
+        EXPECT_EQ(longBatch2.data[i], static_cast<int64_t>(i * 300));
+      }
+    }
+  }
+
+  TEST(WriterTest, setOutputBufferCapacity) {
+    testSetOutputBufferCapacity(1024);
+    testSetOutputBufferCapacity(1024 * 1024);
+    testSetOutputBufferCapacity(1024 * 1024 * 1024);

Review Comment:
   The 1GB case has been deleted.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] coderex2522 commented on a diff in pull request #1394: ORC-1312: [C++] Support setting the block buffer capacity of BufferedOutputStream

Posted by "coderex2522 (via GitHub)" <gi...@apache.org>.
coderex2522 commented on code in PR #1394:
URL: https://github.com/apache/orc/pull/1394#discussion_r1095868937


##########
c++/include/orc/Writer.hh:
##########
@@ -262,6 +262,19 @@ namespace orc {
      * @return if not set, the default is false
      */
     bool getUseTightNumericVector() const;
+
+    /**
+     * Set the initial buffer capacity of output stream.
+     * Each column contains multiple output streams with buffers,
+     * and these buffers will automatically expand when memory is exhausted.

Review Comment:
   The outputBuffer(char* variable) of the class CompressionStreamBase doesn't exceed the compression block size,  and the dataBuffer(BlockBuffer variable) of the class BufferedOutputStream will exceed the compression block size.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] dongjoon-hyun commented on pull request #1394: ORC-1312: [C++] Support setting the block buffer capacity of BufferedOutputStream

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #1394:
URL: https://github.com/apache/orc/pull/1394#issuecomment-1415252734

   +1 for @wgtmac 's advice.
   
   BTW, `googletest` failures in the CI is related to this PR?
   ```
   CMake Error at googletest_ep-stamp/googletest_ep-download-RELWITHDEBINFO.cmake:49 (message):
     Command failed: 1
      '/usr/local/bin/cmake' '-Dmake=' '-Dconfig=' '-P' '/home/runner/work/orc/orc/build/googletest_ep-prefix/src/googletest_ep-stamp/googletest_ep-download-RELWITHDEBINFO-impl.cmake'
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] wgtmac commented on pull request #1394: ORC-1312: [C++] Support setting the block buffer capacity of BufferedOutputStream

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on PR #1394:
URL: https://github.com/apache/orc/pull/1394#issuecomment-1416653826

   @coderex2522 I will leave it to you to merge it to test your permission.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org