Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/12/06 15:33:16 UTC

[GitHub] [arrow] ahmet-uyar opened a new pull request #11872: Arrow-8823 Calculating compression ratio

ahmet-uyar opened a new pull request #11872:
URL: https://github.com/apache/arrow/pull/11872


   The compression ratio is calculated for serialized record batches.
   It is defined as: compressed size / uncompressed size.
   The result is stored in the ipc::WriteStats class.
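   
   As an aside, a minimal sketch (not the PR's exact code) of how a caller could
   derive the ratio from the two totals this PR adds to the write statistics,
   using the field names the review below converges on:
   
   ```c++
   #include <arrow/ipc/writer.h>
   
   // Serialized / raw: a value below 1.0 means compression helped.
   // Guards against the zero-row case where both totals are zero.
   double CompressionRatio(const arrow::ipc::WriteStats& stats) {
     if (stats.total_raw_body_size == 0) return 1.0;
     return static_cast<double>(stats.total_serialized_body_size) /
            static_cast<double>(stats.total_raw_body_size);
   }
   ```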
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #11872: ARROW-8823: [C++] Compute aggregate compression ratio when producing compressed IPC body messages

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #11872:
URL: https://github.com/apache/arrow/pull/11872#issuecomment-986944065


   https://issues.apache.org/jira/browse/ARROW-8823





[GitHub] [arrow] github-actions[bot] commented on pull request #11872: Arrow-8823 Calculating compression ratio

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #11872:
URL: https://github.com/apache/arrow/pull/11872#issuecomment-986888047


   
   Thanks for opening a pull request!
   
   If this is not a [minor PR](https://github.com/apache/arrow/blob/master/CONTRIBUTING.md#Minor-Fixes), could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW
   
   Opening JIRAs ahead of time contributes to the [Openness](http://theapacheway.com/open/#:~:text=Openness%20allows%20new%20users%20the,must%20happen%20in%20the%20open.) of the Apache Arrow project.
   
   Then could you also rename the pull request title in the following format?
   
       ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}
   
   or
   
       MINOR: [${COMPONENT}] ${SUMMARY}
   
   See also:
   
     * [Other pull requests](https://github.com/apache/arrow/pulls/)
     * [Contribution Guidelines - How to contribute patches](https://arrow.apache.org/docs/developers/contributing.html#how-to-contribute-patches)
   





[GitHub] [arrow] ahmet-uyar commented on a change in pull request #11872: ARROW-8823: [C++] Compute aggregate compression ratio when producing compressed IPC body messages

Posted by GitBox <gi...@apache.org>.
ahmet-uyar commented on a change in pull request #11872:
URL: https://github.com/apache/arrow/pull/11872#discussion_r763176858



##########
File path: cpp/src/arrow/ipc/writer.h
##########
@@ -75,6 +76,19 @@ struct WriteStats {
   /// Number of replaced dictionaries (i.e. where a dictionary batch replaces
   /// an existing dictionary with an unrelated new dictionary).
   int64_t num_replaced_dictionaries = 0;
+
+  /// initial and serialized (may be compressed) body lengths for record batches
+  /// these values show the total sizes of all record batch body lengths
+  int64_t raw_body_length = 0;
+  int64_t serialized_body_length = 0;

Review comment:
       Sometimes it is not compressed, only serialized. Should we name it total_serialized_body_size?







[GitHub] [arrow] ahmet-uyar commented on a change in pull request #11872: ARROW-8823: [C++] Compute aggregate compression ratio when producing compressed IPC body messages

Posted by GitBox <gi...@apache.org>.
ahmet-uyar commented on a change in pull request #11872:
URL: https://github.com/apache/arrow/pull/11872#discussion_r763222328



##########
File path: cpp/src/arrow/ipc/read_write_test.cc
##########
@@ -1727,6 +1727,52 @@ TEST(TestIpcFileFormat, FooterMetaData) {
   ASSERT_TRUE(out_metadata->Equals(*metadata));
 }
 
+TEST_F(TestWriteRecordBatch, CompressionRatio) {
+  // ARROW-8823: Calculating the compression ratio
+  FileWriterHelper helper;
+  IpcWriteOptions write_options1 = IpcWriteOptions::Defaults();
+  IpcWriteOptions write_options2 = IpcWriteOptions::Defaults();
+  ASSERT_OK_AND_ASSIGN(write_options2.codec, util::Codec::Create(Compression::LZ4_FRAME));
+
+  // pre-computed compression ratios for record batches with Compression::LZ4_FRAME
+  std::vector<float> comp_ratios{1.0f, 0.64f, 0.79924363f};

Review comment:
       Added raw record batch sizes instead.

##########
File path: cpp/src/arrow/ipc/read_write_test.cc
##########
@@ -1727,6 +1727,52 @@ TEST(TestIpcFileFormat, FooterMetaData) {
   ASSERT_TRUE(out_metadata->Equals(*metadata));
 }
 
+TEST_F(TestWriteRecordBatch, CompressionRatio) {
+  // ARROW-8823: Calculating the compression ratio
+  FileWriterHelper helper;
+  IpcWriteOptions write_options1 = IpcWriteOptions::Defaults();
+  IpcWriteOptions write_options2 = IpcWriteOptions::Defaults();

Review comment:
       Done.







[GitHub] [arrow] ursabot commented on pull request #11872: ARROW-8823: [C++] Add total size of batch buffers to IPC write statistics

Posted by GitBox <gi...@apache.org>.
ursabot commented on pull request #11872:
URL: https://github.com/apache/arrow/pull/11872#issuecomment-988479588


   Benchmark runs are scheduled for baseline = 9cf4275a19c994879172e5d3b03ade9a96a10721 and contender = 77722d990c28e22a56847b0dc49a79b83a6c1a45. 77722d990c28e22a56847b0dc49a79b83a6c1a45 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Scheduled] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/26a4f85ef5a547079d922da56e969418...91874ad70cba40c88dd7f4b28d3655da/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/f4dce272e63a4badb48789db2008ef61...f1c8031c505b4f9ab42760f956e7c437/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/f6f8d19d14aa4337a2095a1988acf496...1a5fd3d0d015446796c3b9cd6fa1d86b/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   





[GitHub] [arrow] ahmet-uyar commented on a change in pull request #11872: ARROW-8823: [C++] Compute aggregate compression ratio when producing compressed IPC body messages

Posted by GitBox <gi...@apache.org>.
ahmet-uyar commented on a change in pull request #11872:
URL: https://github.com/apache/arrow/pull/11872#discussion_r763209776



##########
File path: cpp/src/arrow/ipc/writer.h
##########
@@ -75,6 +76,19 @@ struct WriteStats {
   /// Number of replaced dictionaries (i.e. where a dictionary batch replaces
   /// an existing dictionary with an unrelated new dictionary).
   int64_t num_replaced_dictionaries = 0;
+
+  /// initial and serialized (may be compressed) body lengths for record batches
+  /// these values show the total sizes of all record batch body lengths

Review comment:
       Yes.







[GitHub] [arrow] ursabot edited a comment on pull request #11872: ARROW-8823: [C++] Add total size of batch buffers to IPC write statistics

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #11872:
URL: https://github.com/apache/arrow/pull/11872#issuecomment-988479588


   Benchmark runs are scheduled for baseline = 9cf4275a19c994879172e5d3b03ade9a96a10721 and contender = 77722d990c28e22a56847b0dc49a79b83a6c1a45. 77722d990c28e22a56847b0dc49a79b83a6c1a45 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/26a4f85ef5a547079d922da56e969418...91874ad70cba40c88dd7f4b28d3655da/)
   [Failed :arrow_down:2.22% :arrow_up:0.74%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/f4dce272e63a4badb48789db2008ef61...f1c8031c505b4f9ab42760f956e7c437/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/f6f8d19d14aa4337a2095a1988acf496...1a5fd3d0d015446796c3b9cd6fa1d86b/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   





[GitHub] [arrow] pitrou commented on a change in pull request #11872: ARROW-8823: [C++] Compute aggregate compression ratio when producing compressed IPC body messages

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #11872:
URL: https://github.com/apache/arrow/pull/11872#discussion_r763180315



##########
File path: cpp/src/arrow/ipc/writer.h
##########
@@ -75,6 +76,19 @@ struct WriteStats {
   /// Number of replaced dictionaries (i.e. where a dictionary batch replaces
   /// an existing dictionary with an unrelated new dictionary).
   int64_t num_replaced_dictionaries = 0;
+
+  /// initial and serialized (may be compressed) body lengths for record batches
+  /// these values show the total sizes of all record batch body lengths
+  int64_t raw_body_length = 0;
+  int64_t serialized_body_length = 0;

Review comment:
       Yes, why not.







[GitHub] [arrow] ahmet-uyar commented on a change in pull request #11872: ARROW-8823: [C++] Compute aggregate compression ratio when producing compressed IPC body messages

Posted by GitBox <gi...@apache.org>.
ahmet-uyar commented on a change in pull request #11872:
URL: https://github.com/apache/arrow/pull/11872#discussion_r764080561



##########
File path: cpp/src/arrow/ipc/read_write_test.cc
##########
@@ -1727,6 +1727,61 @@ TEST(TestIpcFileFormat, FooterMetaData) {
   ASSERT_TRUE(out_metadata->Equals(*metadata));
 }
 
+TEST_F(TestWriteRecordBatch, CompressionRatio) {
+  // ARROW-8823: Calculating the compression ratio
+  FileWriterHelper helper;
+  IpcWriteOptions options_uncompressed = IpcWriteOptions::Defaults();
+
+  std::vector<std::shared_ptr<RecordBatch>> batches(3);
+  // empty record batch
+  ASSERT_OK(MakeIntBatchSized(0, &batches[0]));
+  // record batch with int values
+  ASSERT_OK(MakeIntBatchSized(2000, &batches[1], 100));
+
+  // record batch with DictionaryArray
+  random::RandomArrayGenerator rg(/*seed=*/0);
+  int64_t length = 500;
+  int dict_size = 50;
+  std::shared_ptr<Array> dict =
+      rg.String(dict_size, /*min_length=*/5, /*max_length=*/5, /*null_probability=*/0);
+  std::shared_ptr<Array> indices =
+      rg.Int32(length, /*min=*/0, /*max=*/dict_size - 1, /*null_probability=*/0.1);
+  auto dict_type = dictionary(int32(), utf8());
+  auto dict_field = field("f1", dict_type);
+  ASSERT_OK_AND_ASSIGN(auto dict_array,
+                       DictionaryArray::FromArrays(dict_type, indices, dict));
+
+  auto schema = ::arrow::schema({field("f0", utf8()), dict_field});
+  batches[2] =
+      RecordBatch::Make(schema, length, {rg.String(500, 0, 10, 0.1), dict_array});
+
+  for (size_t i = 0; i < batches.size(); ++i) {
+    // without compression
+    ASSERT_OK(helper.Init(batches[i]->schema(), options_uncompressed));
+    ASSERT_OK(helper.WriteBatch(batches[i]));
+    ASSERT_OK(helper.Finish());
+    // padding can make serialized data slightly larger than the raw data size
+    // when no compression is used
+    ASSERT_LE(helper.writer_->stats().total_raw_body_size,
+              helper.writer_->stats().total_serialized_body_size);
+
+    if (!util::Codec::IsAvailable(Compression::LZ4_FRAME)) {
+      continue;
+    }
+
+    IpcWriteOptions options_compressed = IpcWriteOptions::Defaults();
+    ASSERT_OK_AND_ASSIGN(options_compressed.codec,
+                         util::Codec::Create(Compression::LZ4_FRAME));
+
+    // with compression
+    ASSERT_OK(helper.Init(batches[i]->schema(), options_compressed));
+    ASSERT_OK(helper.WriteBatch(batches[i]));
+    ASSERT_OK(helper.Finish());
+    ASSERT_GE(helper.writer_->stats().total_raw_body_size,
+              helper.writer_->stats().total_serialized_body_size);

Review comment:
       Equality is needed because one of the record batches has zero rows, so both total_raw_body_size and total_serialized_body_size are zero.







[GitHub] [arrow] github-actions[bot] commented on pull request #11872: ARROW-8823: [C++] Compute aggregate compression ratio when producing compressed IPC body messages

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #11872:
URL: https://github.com/apache/arrow/pull/11872#issuecomment-986944092


   :warning: Ticket **has not been started in JIRA**; please click 'Start Progress'.





[GitHub] [arrow] ursabot commented on pull request #11872: ARROW-8823: [C++] Add total size of batch buffers to IPC write statistics

Posted by GitBox <gi...@apache.org>.
ursabot commented on pull request #11872:
URL: https://github.com/apache/arrow/pull/11872#issuecomment-988219779


   Benchmark runs are scheduled for baseline = 9cf4275a19c994879172e5d3b03ade9a96a10721 and contender = 77722d990c28e22a56847b0dc49a79b83a6c1a45. 77722d990c28e22a56847b0dc49a79b83a6c1a45 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Scheduled] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/9b40d0890b27427383b416d8e62d5dbe...9f33767f1d224b4f9e4fa66f997217e2/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/ec45e9e0b35445539b2cfc59d56ce805...9bc1f30b5c56435e8bcae394642af765/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/84909d0fddc64ab7b28a6f223c05c6c4...c57b1b3dc71d4109903ce7a87c31ae3f/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   





[GitHub] [arrow] pitrou commented on a change in pull request #11872: Arrow-8823 Calculating compression ratio

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #11872:
URL: https://github.com/apache/arrow/pull/11872#discussion_r763142640



##########
File path: cpp/src/arrow/ipc/writer.h
##########
@@ -75,6 +76,19 @@ struct WriteStats {
   /// Number of replaced dictionaries (i.e. where a dictionary batch replaces
   /// an existing dictionary with an unrelated new dictionary).
   int64_t num_replaced_dictionaries = 0;
+
+  /// initial and serialized (may be compressed) body lengths for record batches
+  /// these values show the total sizes of all record batch body lengths
+  int64_t raw_body_length = 0;
+  int64_t serialized_body_length = 0;

Review comment:
       In the Arrow C++ APIs, "length" generally points to the logical number of elements (see e.g. `Array::length()`), while "size" points to the physical size in bytes (as in `Buffer::size()`). So I think this should be:
   ```c++
   int64_t total_raw_body_size = 0;
   int64_t total_compressed_body_size = 0;
   ```
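   
   As a hedged illustration of that convention (standard Arrow C++ calls; the
   helper name is made up for this example):
   
   ```c++
   #include <iostream>
   #include <memory>
   
   #include <arrow/api.h>
   
   // "length" is the logical element count (Array::length()), while "size"
   // is the physical byte count (Buffer::size()).
   void PrintLengthVsSize(const std::shared_ptr<arrow::Int32Array>& array) {
     std::cout << "length (elements): " << array->length() << "\n";
     // values() is the data buffer of a primitive array; its size is in bytes
     // and may exceed 4 * length() because of padding.
     std::cout << "size (bytes):      " << array->values()->size() << "\n";
   }
   ```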

##########
File path: cpp/src/arrow/ipc/read_write_test.cc
##########
@@ -1727,6 +1727,52 @@ TEST(TestIpcFileFormat, FooterMetaData) {
   ASSERT_TRUE(out_metadata->Equals(*metadata));
 }
 
+TEST_F(TestWriteRecordBatch, CompressionRatio) {
+  // ARROW-8823: Calculating the compression ratio
+  FileWriterHelper helper;
+  IpcWriteOptions write_options1 = IpcWriteOptions::Defaults();
+  IpcWriteOptions write_options2 = IpcWriteOptions::Defaults();

Review comment:
       Can you give these clearer names, e.g. `options_uncompressed` and `options_compressed`?

##########
File path: cpp/src/arrow/ipc/writer.h
##########
@@ -75,6 +76,19 @@ struct WriteStats {
   /// Number of replaced dictionaries (i.e. where a dictionary batch replaces
   /// an existing dictionary with an unrelated new dictionary).
   int64_t num_replaced_dictionaries = 0;
+
+  /// initial and serialized (may be compressed) body lengths for record batches
+  /// these values show the total sizes of all record batch body lengths
+  int64_t raw_body_length = 0;
+  int64_t serialized_body_length = 0;
+
+  /// compression ratio for the body of all record batches serialized
+  /// this is equivalent to:
+  ///    serialized_body_length / raw_body_length

Review comment:
       Since this is equivalent, I don't think there is any value in exposing it. People can trivially calculate it themselves.
   

##########
File path: cpp/src/arrow/ipc/read_write_test.cc
##########
@@ -1727,6 +1727,52 @@ TEST(TestIpcFileFormat, FooterMetaData) {
   ASSERT_TRUE(out_metadata->Equals(*metadata));
 }
 
+TEST_F(TestWriteRecordBatch, CompressionRatio) {
+  // ARROW-8823: Calculating the compression ratio
+  FileWriterHelper helper;
+  IpcWriteOptions write_options1 = IpcWriteOptions::Defaults();
+  IpcWriteOptions write_options2 = IpcWriteOptions::Defaults();
+  ASSERT_OK_AND_ASSIGN(write_options2.codec, util::Codec::Create(Compression::LZ4_FRAME));
+
+  // pre-computed compression ratios for record batches with Compression::LZ4_FRAME
+  std::vector<float> comp_ratios{1.0f, 0.64f, 0.79924363f};
+
+  std::vector<std::shared_ptr<RecordBatch>> batches(3);
+  // empty record batch
+  ASSERT_OK(MakeIntBatchSized(0, &batches[0]));
+  // record batch with int values
+  ASSERT_OK(MakeIntBatchSized(2000, &batches[1], 100));
+
+  // record batch with DictionaryArray
+  random::RandomArrayGenerator rg(/*seed=*/0);
+  int64_t length = 500;
+  int dict_size = 50;
+  std::shared_ptr<Array> dict = rg.String(dict_size, /*min_length=*/5, /*max_length=*/5, /*null_probability=*/0);
+  std::shared_ptr<Array> indices = rg.Int32(length, /*min=*/0, /*max=*/dict_size - 1, /*null_probability=*/0.1);
+  auto dict_type = dictionary(int32(), utf8());
+  auto dict_field = field("f1", dict_type);
+  ASSERT_OK_AND_ASSIGN(auto dict_array,
+                       DictionaryArray::FromArrays(dict_type, indices, dict));
+
+  auto schema = ::arrow::schema({field("f0", utf8()), dict_field});
+  batches[2] =
+    RecordBatch::Make(schema, length, {rg.String(500, 0, 10, 0.1), dict_array});
+
+  for(size_t i = 0; i < batches.size(); ++i) {

Review comment:
       As a suggestion for these tests, it would be nice to check that:
   * the raw body size is accurate (it can be hard-coded, since it should be stable and deterministic)
   * the compressed body size is equal to or smaller than the raw body size, depending on the compression parameter
   

##########
File path: cpp/src/arrow/ipc/writer.h
##########
@@ -75,6 +76,19 @@ struct WriteStats {
   /// Number of replaced dictionaries (i.e. where a dictionary batch replaces
   /// an existing dictionary with an unrelated new dictionary).
   int64_t num_replaced_dictionaries = 0;
+
+  /// initial and serialized (may be compressed) body lengths for record batches
+  /// these values show the total sizes of all record batch body lengths

Review comment:
       Will this also include the dictionary batches?

##########
File path: cpp/src/arrow/ipc/read_write_test.cc
##########
@@ -1727,6 +1727,52 @@ TEST(TestIpcFileFormat, FooterMetaData) {
   ASSERT_TRUE(out_metadata->Equals(*metadata));
 }
 
+TEST_F(TestWriteRecordBatch, CompressionRatio) {
+  // ARROW-8823: Calculating the compression ratio
+  FileWriterHelper helper;
+  IpcWriteOptions write_options1 = IpcWriteOptions::Defaults();
+  IpcWriteOptions write_options2 = IpcWriteOptions::Defaults();
+  ASSERT_OK_AND_ASSIGN(write_options2.codec, util::Codec::Create(Compression::LZ4_FRAME));
+
+  // pre-computed compression ratios for record batches with Compression::LZ4_FRAME
+  std::vector<float> comp_ratios{1.0f, 0.64f, 0.79924363f};

Review comment:
       I don't think we want to hard-code this. The values can vary depending on the version of the lz4 library, or internal details of how we initialize the compressor. Just testing that _some_ compression happens should be sufficient.







[GitHub] [arrow] ahmet-uyar commented on a change in pull request #11872: ARROW-8823: [C++] Compute aggregate compression ratio when producing compressed IPC body messages

Posted by GitBox <gi...@apache.org>.
ahmet-uyar commented on a change in pull request #11872:
URL: https://github.com/apache/arrow/pull/11872#discussion_r763221776



##########
File path: cpp/src/arrow/ipc/writer.h
##########
@@ -75,6 +76,19 @@ struct WriteStats {
   /// Number of replaced dictionaries (i.e. where a dictionary batch replaces
   /// an existing dictionary with an unrelated new dictionary).
   int64_t num_replaced_dictionaries = 0;
+
+  /// initial and serialized (may be compressed) body lengths for record batches
+  /// these values show the total sizes of all record batch body lengths
+  int64_t raw_body_length = 0;
+  int64_t serialized_body_length = 0;
+
+  /// compression ratio for the body of all record batches serialized
+  /// this is equivalent to:
+  ///    serialized_body_length / raw_body_length

Review comment:
       Removed the compression ratio from WriteStats.







[GitHub] [arrow] ursabot edited a comment on pull request #11872: ARROW-8823: [C++] Add total size of batch buffers to IPC write statistics

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #11872:
URL: https://github.com/apache/arrow/pull/11872#issuecomment-988479588


   Benchmark runs are scheduled for baseline = 9cf4275a19c994879172e5d3b03ade9a96a10721 and contender = 77722d990c28e22a56847b0dc49a79b83a6c1a45. 77722d990c28e22a56847b0dc49a79b83a6c1a45 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/26a4f85ef5a547079d922da56e969418...91874ad70cba40c88dd7f4b28d3655da/)
   [Failed :arrow_down:2.22% :arrow_up:0.74%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/f4dce272e63a4badb48789db2008ef61...f1c8031c505b4f9ab42760f956e7c437/)
   [Finished :arrow_down:0.22% :arrow_up:0.18%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/f6f8d19d14aa4337a2095a1988acf496...1a5fd3d0d015446796c3b9cd6fa1d86b/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   





[GitHub] [arrow] ahmet-uyar commented on a change in pull request #11872: ARROW-8823: [C++] Compute aggregate compression ratio when producing compressed IPC body messages

Posted by GitBox <gi...@apache.org>.
ahmet-uyar commented on a change in pull request #11872:
URL: https://github.com/apache/arrow/pull/11872#discussion_r763271175



##########
File path: cpp/src/arrow/ipc/read_write_test.cc
##########
@@ -1727,6 +1727,52 @@ TEST(TestIpcFileFormat, FooterMetaData) {
   ASSERT_TRUE(out_metadata->Equals(*metadata));
 }
 
+TEST_F(TestWriteRecordBatch, CompressionRatio) {
+  // ARROW-8823: Calculating the compression ratio
+  FileWriterHelper helper;
+  IpcWriteOptions write_options1 = IpcWriteOptions::Defaults();
+  IpcWriteOptions write_options2 = IpcWriteOptions::Defaults();
+  ASSERT_OK_AND_ASSIGN(write_options2.codec, util::Codec::Create(Compression::LZ4_FRAME));
+
+  // pre-computed compression ratios for record batches with Compression::LZ4_FRAME
+  std::vector<float> comp_ratios{1.0f, 0.64f, 0.79924363f};
+
+  std::vector<std::shared_ptr<RecordBatch>> batches(3);
+  // empty record batch
+  ASSERT_OK(MakeIntBatchSized(0, &batches[0]));
+  // record batch with int values
+  ASSERT_OK(MakeIntBatchSized(2000, &batches[1], 100));
+
+  // record batch with DictionaryArray
+  random::RandomArrayGenerator rg(/*seed=*/0);
+  int64_t length = 500;
+  int dict_size = 50;
+  std::shared_ptr<Array> dict = rg.String(dict_size, /*min_length=*/5, /*max_length=*/5, /*null_probability=*/0);
+  std::shared_ptr<Array> indices = rg.Int32(length, /*min=*/0, /*max=*/dict_size - 1, /*null_probability=*/0.1);
+  auto dict_type = dictionary(int32(), utf8());
+  auto dict_field = field("f1", dict_type);
+  ASSERT_OK_AND_ASSIGN(auto dict_array,
+                       DictionaryArray::FromArrays(dict_type, indices, dict));
+
+  auto schema = ::arrow::schema({field("f0", utf8()), dict_field});
+  batches[2] =
+    RecordBatch::Make(schema, length, {rg.String(500, 0, 10, 0.1), dict_array});
+
+  for(size_t i = 0; i < batches.size(); ++i) {

Review comment:
       Done.







[GitHub] [arrow] ahmet-uyar commented on a change in pull request #11872: ARROW-8823: [C++] Compute aggregate compression ratio when producing compressed IPC body messages

Posted by GitBox <gi...@apache.org>.
ahmet-uyar commented on a change in pull request #11872:
URL: https://github.com/apache/arrow/pull/11872#discussion_r764186800



##########
File path: cpp/src/arrow/ipc/read_write_test.cc
##########
@@ -1727,6 +1727,61 @@ TEST(TestIpcFileFormat, FooterMetaData) {
   ASSERT_TRUE(out_metadata->Equals(*metadata));
 }
 
+TEST_F(TestWriteRecordBatch, CompressionRatio) {
+  // ARROW-8823: Calculating the compression ratio

Review comment:
       Done.







[GitHub] [arrow] ursabot edited a comment on pull request #11872: ARROW-8823: [C++] Add total size of batch buffers to IPC write statistics

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #11872:
URL: https://github.com/apache/arrow/pull/11872#issuecomment-988219779


   Benchmark runs are scheduled for baseline = 9cf4275a19c994879172e5d3b03ade9a96a10721 and contender = 77722d990c28e22a56847b0dc49a79b83a6c1a45. 77722d990c28e22a56847b0dc49a79b83a6c1a45 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/9b40d0890b27427383b416d8e62d5dbe...9f33767f1d224b4f9e4fa66f997217e2/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/ec45e9e0b35445539b2cfc59d56ce805...9bc1f30b5c56435e8bcae394642af765/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/84909d0fddc64ab7b28a6f223c05c6c4...c57b1b3dc71d4109903ce7a87c31ae3f/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   





[GitHub] [arrow] ahmet-uyar commented on a change in pull request #11872: ARROW-8823: [C++] Compute aggregate compression ratio when producing compressed IPC body messages

Posted by GitBox <gi...@apache.org>.
ahmet-uyar commented on a change in pull request #11872:
URL: https://github.com/apache/arrow/pull/11872#discussion_r764079908



##########
File path: cpp/src/arrow/ipc/read_write_test.cc
##########
@@ -1727,6 +1727,61 @@ TEST(TestIpcFileFormat, FooterMetaData) {
   ASSERT_TRUE(out_metadata->Equals(*metadata));
 }
 
+TEST_F(TestWriteRecordBatch, CompressionRatio) {
+  // ARROW-8823: Calculating the compression ratio
+  FileWriterHelper helper;
+  IpcWriteOptions options_uncompressed = IpcWriteOptions::Defaults();
+
+  std::vector<std::shared_ptr<RecordBatch>> batches(3);
+  // empty record batch
+  ASSERT_OK(MakeIntBatchSized(0, &batches[0]));
+  // record batch with int values
+  ASSERT_OK(MakeIntBatchSized(2000, &batches[1], 100));
+
+  // record batch with DictionaryArray
+  random::RandomArrayGenerator rg(/*seed=*/0);
+  int64_t length = 500;
+  int dict_size = 50;
+  std::shared_ptr<Array> dict =
+      rg.String(dict_size, /*min_length=*/5, /*max_length=*/5, /*null_probability=*/0);
+  std::shared_ptr<Array> indices =
+      rg.Int32(length, /*min=*/0, /*max=*/dict_size - 1, /*null_probability=*/0.1);
+  auto dict_type = dictionary(int32(), utf8());
+  auto dict_field = field("f1", dict_type);
+  ASSERT_OK_AND_ASSIGN(auto dict_array,
+                       DictionaryArray::FromArrays(dict_type, indices, dict));
+
+  auto schema = ::arrow::schema({field("f0", utf8()), dict_field});
+  batches[2] =
+      RecordBatch::Make(schema, length, {rg.String(500, 0, 10, 0.1), dict_array});
+
+  for (size_t i = 0; i < batches.size(); ++i) {
+    // without compression
+    ASSERT_OK(helper.Init(batches[i]->schema(), options_uncompressed));
+    ASSERT_OK(helper.WriteBatch(batches[i]));
+    ASSERT_OK(helper.Finish());
+    // padding can make serialized data slightly larger than the raw data size
+    // when no compression is used
+    ASSERT_LE(helper.writer_->stats().total_raw_body_size,
+              helper.writer_->stats().total_serialized_body_size);
+
+    if (!util::Codec::IsAvailable(Compression::LZ4_FRAME)) {
+      continue;
+    }
+
+    IpcWriteOptions options_compressed = IpcWriteOptions::Defaults();
+    ASSERT_OK_AND_ASSIGN(options_compressed.codec,
+                         util::Codec::Create(Compression::LZ4_FRAME));
+
+    // with compression
+    ASSERT_OK(helper.Init(batches[i]->schema(), options_compressed));
+    ASSERT_OK(helper.WriteBatch(batches[i]));
+    ASSERT_OK(helper.Finish());
+    ASSERT_GE(helper.writer_->stats().total_raw_body_size,
+              helper.writer_->stats().total_serialized_body_size);

Review comment:
       Equality is needed because one of the record batches has zero rows, so both total_raw_body_size and total_serialized_body_size are zero.







[GitHub] [arrow] pitrou closed pull request #11872: ARROW-8823: [C++] Add total size of batch buffers to IPC write statistics

Posted by GitBox <gi...@apache.org>.
pitrou closed pull request #11872:
URL: https://github.com/apache/arrow/pull/11872


   





[GitHub] [arrow] ursabot edited a comment on pull request #11872: ARROW-8823: [C++] Add total size of batch buffers to IPC write statistics

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #11872:
URL: https://github.com/apache/arrow/pull/11872#issuecomment-988479588


   Benchmark runs are scheduled for baseline = 9cf4275a19c994879172e5d3b03ade9a96a10721 and contender = 77722d990c28e22a56847b0dc49a79b83a6c1a45. 77722d990c28e22a56847b0dc49a79b83a6c1a45 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/26a4f85ef5a547079d922da56e969418...91874ad70cba40c88dd7f4b28d3655da/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/f4dce272e63a4badb48789db2008ef61...f1c8031c505b4f9ab42760f956e7c437/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/f6f8d19d14aa4337a2095a1988acf496...1a5fd3d0d015446796c3b9cd6fa1d86b/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   





[GitHub] [arrow] pitrou commented on a change in pull request #11872: ARROW-8823: [C++] Compute aggregate compression ratio when producing compressed IPC body messages

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #11872:
URL: https://github.com/apache/arrow/pull/11872#discussion_r763975968



##########
File path: cpp/src/arrow/ipc/read_write_test.cc
##########
@@ -1727,6 +1727,61 @@ TEST(TestIpcFileFormat, FooterMetaData) {
   ASSERT_TRUE(out_metadata->Equals(*metadata));
 }
 
+TEST_F(TestWriteRecordBatch, CompressionRatio) {
+  // ARROW-8823: Calculating the compression ratio
+  FileWriterHelper helper;
+  IpcWriteOptions options_uncompressed = IpcWriteOptions::Defaults();
+
+  std::vector<std::shared_ptr<RecordBatch>> batches(3);
+  // empty record batch
+  ASSERT_OK(MakeIntBatchSized(0, &batches[0]));
+  // record batch with int values
+  ASSERT_OK(MakeIntBatchSized(2000, &batches[1], 100));
+
+  // record batch with DictionaryArray
+  random::RandomArrayGenerator rg(/*seed=*/0);
+  int64_t length = 500;
+  int dict_size = 50;
+  std::shared_ptr<Array> dict =
+      rg.String(dict_size, /*min_length=*/5, /*max_length=*/5, /*null_probability=*/0);
+  std::shared_ptr<Array> indices =
+      rg.Int32(length, /*min=*/0, /*max=*/dict_size - 1, /*null_probability=*/0.1);
+  auto dict_type = dictionary(int32(), utf8());
+  auto dict_field = field("f1", dict_type);
+  ASSERT_OK_AND_ASSIGN(auto dict_array,
+                       DictionaryArray::FromArrays(dict_type, indices, dict));
+
+  auto schema = ::arrow::schema({field("f0", utf8()), dict_field});
+  batches[2] =
+      RecordBatch::Make(schema, length, {rg.String(500, 0, 10, 0.1), dict_array});
+
+  for (size_t i = 0; i < batches.size(); ++i) {
+    // without compression
+    ASSERT_OK(helper.Init(batches[i]->schema(), options_uncompressed));
+    ASSERT_OK(helper.WriteBatch(batches[i]));
+    ASSERT_OK(helper.Finish());
+    // padding can make serialized data slightly larger than the raw data size
+    // when no compression is used
+    ASSERT_LE(helper.writer_->stats().total_raw_body_size,
+              helper.writer_->stats().total_serialized_body_size);
+
+    if (!util::Codec::IsAvailable(Compression::LZ4_FRAME)) {
+      continue;
+    }
+
+    IpcWriteOptions options_compressed = IpcWriteOptions::Defaults();
+    ASSERT_OK_AND_ASSIGN(options_compressed.codec,
+                         util::Codec::Create(Compression::LZ4_FRAME));
+
+    // with compression
+    ASSERT_OK(helper.Init(batches[i]->schema(), options_compressed));
+    ASSERT_OK(helper.WriteBatch(batches[i]));
+    ASSERT_OK(helper.Finish());
+    ASSERT_GE(helper.writer_->stats().total_raw_body_size,
+              helper.writer_->stats().total_serialized_body_size);

Review comment:
       Can this be `ASSERT_GT` instead or will it fail the test?

##########
File path: cpp/src/arrow/ipc/read_write_test.cc
##########
@@ -1727,6 +1727,61 @@ TEST(TestIpcFileFormat, FooterMetaData) {
   ASSERT_TRUE(out_metadata->Equals(*metadata));
 }
 
+TEST_F(TestWriteRecordBatch, CompressionRatio) {
+  // ARROW-8823: Calculating the compression ratio

Review comment:
       Nit: rename this test and update the comment now that we don't compute a ratio anymore?

##########
File path: cpp/src/arrow/ipc/writer.h
##########
@@ -57,7 +57,8 @@ struct IpcPayload {
   MessageType type = MessageType::NONE;
   std::shared_ptr<Buffer> metadata;
   std::vector<std::shared_ptr<Buffer>> body_buffers;
-  int64_t body_length = 0;
+  int64_t body_length = 0;      // serialized body length (maybe compressed)
+  int64_t raw_body_length = 0;  // initial uncompressed body_length

Review comment:
       ```suggestion
     int64_t body_length = 0;      // serialized body length (padded, maybe compressed)
     int64_t raw_body_length = 0;  // initial uncompressed body length
   ```

##########
File path: cpp/src/arrow/ipc/writer.h
##########
@@ -75,6 +76,11 @@ struct WriteStats {
   /// Number of replaced dictionaries (i.e. where a dictionary batch replaces
   /// an existing dictionary with an unrelated new dictionary).
   int64_t num_replaced_dictionaries = 0;
+
+  /// initial and serialized (may be compressed) body sizes in bytes for record batches
+  /// these values show the total sizes of all record batch body lengths

Review comment:
       ```suggestion
     /// Total size in bytes of record batches emitted.
     /// The "raw" size counts the original buffer sizes, while the "serialized" size
     /// includes padding and (optionally) compression.
   ```
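   
   For context, a hedged end-to-end sketch (not code from this PR) of how these
   two statistics would be read back after writing; the function name and the
   in-memory sink are placeholders:
   
   ```c++
   #include <iostream>
   #include <memory>
   
   #include <arrow/io/memory.h>
   #include <arrow/ipc/writer.h>
   #include <arrow/record_batch.h>
   #include <arrow/result.h>
   #include <arrow/util/compression.h>
   
   arrow::Status WriteAndReportSizes(const std::shared_ptr<arrow::RecordBatch>& batch) {
     ARROW_ASSIGN_OR_RAISE(auto sink, arrow::io::BufferOutputStream::Create());
     auto options = arrow::ipc::IpcWriteOptions::Defaults();
     ARROW_ASSIGN_OR_RAISE(options.codec,
                           arrow::util::Codec::Create(arrow::Compression::LZ4_FRAME));
     ARROW_ASSIGN_OR_RAISE(auto writer,
                           arrow::ipc::MakeFileWriter(sink, batch->schema(), options));
     ARROW_RETURN_NOT_OK(writer->WriteRecordBatch(*batch));
     ARROW_RETURN_NOT_OK(writer->Close());
     // "raw" counts the original buffer bytes; "serialized" includes padding
     // and (here) LZ4 compression, so the totals can differ either way.
     const auto stats = writer->stats();
     std::cout << "total raw body size:        " << stats.total_raw_body_size << "\n"
               << "total serialized body size: " << stats.total_serialized_body_size
               << "\n";
     return arrow::Status::OK();
   }
   ```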







[GitHub] [arrow] ahmet-uyar commented on a change in pull request #11872: ARROW-8823: [C++] Compute aggregate compression ratio when producing compressed IPC body messages

Posted by GitBox <gi...@apache.org>.
ahmet-uyar commented on a change in pull request #11872:
URL: https://github.com/apache/arrow/pull/11872#discussion_r763225907



##########
File path: cpp/src/arrow/ipc/read_write_test.cc
##########
@@ -1727,6 +1727,52 @@ TEST(TestIpcFileFormat, FooterMetaData) {
   ASSERT_TRUE(out_metadata->Equals(*metadata));
 }
 
+TEST_F(TestWriteRecordBatch, CompressionRatio) {
+  // ARROW-8823: Calculating the compression ratio
+  FileWriterHelper helper;
+  IpcWriteOptions write_options1 = IpcWriteOptions::Defaults();
+  IpcWriteOptions write_options2 = IpcWriteOptions::Defaults();
+  ASSERT_OK_AND_ASSIGN(write_options2.codec, util::Codec::Create(Compression::LZ4_FRAME));
+
+  // pre-computed compression ratios for record batches with Compression::LZ4_FRAME
+  std::vector<float> comp_ratios{1.0f, 0.64f, 0.79924363f};
+
+  std::vector<std::shared_ptr<RecordBatch>> batches(3);
+  // empty record batch
+  ASSERT_OK(MakeIntBatchSized(0, &batches[0]));
+  // record batch with int values
+  ASSERT_OK(MakeIntBatchSized(2000, &batches[1], 100));
+
+  // record batch with DictionaryArray
+  random::RandomArrayGenerator rg(/*seed=*/0);
+  int64_t length = 500;
+  int dict_size = 50;
+  std::shared_ptr<Array> dict = rg.String(dict_size, /*min_length=*/5, /*max_length=*/5, /*null_probability=*/0);
+  std::shared_ptr<Array> indices = rg.Int32(length, /*min=*/0, /*max=*/dict_size - 1, /*null_probability=*/0.1);
+  auto dict_type = dictionary(int32(), utf8());
+  auto dict_field = field("f1", dict_type);
+  ASSERT_OK_AND_ASSIGN(auto dict_array,
+                       DictionaryArray::FromArrays(dict_type, indices, dict));
+
+  auto schema = ::arrow::schema({field("f0", utf8()), dict_field});
+  batches[2] =
+    RecordBatch::Make(schema, length, {rg.String(500, 0, 10, 0.1), dict_array});
+
+  for(size_t i = 0; i < batches.size(); ++i) {

Review comment:
       Done as suggested.
   But there is a slight difference: when a record batch is serialized, buffer sizes are padded up to a multiple of 8 bytes. So when no compression is used, the serialized record batch size can be slightly larger than the raw size; in that case the raw size is less than or equal to the serialized size.
   
   In addition, when compression is applied to very little data (a few hundred bytes, say), the compressed size can actually be larger than the raw size. But I have not included that case in the test, so it is not a problem.
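   
   A small sketch of the padding arithmetic described above (a stand-in helper,
   not Arrow's internal one):
   
   ```c++
   #include <cstdint>
   
   // Round nbytes up to the next multiple of 8, mirroring the IPC writer's
   // buffer padding.
   int64_t RoundUpToMultipleOf8(int64_t nbytes) {
     return (nbytes + 7) & ~int64_t{7};
   }
   // e.g. RoundUpToMultipleOf8(1001) == 1008: an uncompressed 1001-byte buffer
   // adds 1008 bytes to total_serialized_body_size but 1001 to total_raw_body_size.
   ```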
   







[GitHub] [arrow] pitrou commented on pull request #11872: Arrow-8823 Calculating compression ratio

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #11872:
URL: https://github.com/apache/arrow/pull/11872#issuecomment-986908812


   Hi @ahmet-uyar, can you ensure the PR title is properly formatted? See the automated comment above for guidelines. Thank you!





[GitHub] [arrow] pitrou commented on a change in pull request #11872: ARROW-8823: [C++] Compute aggregate compression ratio when producing compressed IPC body messages

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #11872:
URL: https://github.com/apache/arrow/pull/11872#discussion_r763242124



##########
File path: cpp/src/arrow/ipc/read_write_test.cc
##########
@@ -1727,6 +1727,52 @@ TEST(TestIpcFileFormat, FooterMetaData) {
   ASSERT_TRUE(out_metadata->Equals(*metadata));
 }
 
+TEST_F(TestWriteRecordBatch, CompressionRatio) {
+  // ARROW-8823: Calculating the compression ratio
+  FileWriterHelper helper;
+  IpcWriteOptions write_options1 = IpcWriteOptions::Defaults();
+  IpcWriteOptions write_options2 = IpcWriteOptions::Defaults();
+  ASSERT_OK_AND_ASSIGN(write_options2.codec, util::Codec::Create(Compression::LZ4_FRAME));
+
+  // pre-computed compression ratios for record batches with Compression::LZ4_FRAME
+  std::vector<float> comp_ratios{1.0f, 0.64f, 0.79924363f};
+
+  std::vector<std::shared_ptr<RecordBatch>> batches(3);
+  // empty record batch
+  ASSERT_OK(MakeIntBatchSized(0, &batches[0]));
+  // record batch with int values
+  ASSERT_OK(MakeIntBatchSized(2000, &batches[1], 100));
+
+  // record batch with DictionaryArray
+  random::RandomArrayGenerator rg(/*seed=*/0);
+  int64_t length = 500;
+  int dict_size = 50;
+  std::shared_ptr<Array> dict = rg.String(dict_size, /*min_length=*/5, /*max_length=*/5, /*null_probability=*/0);
+  std::shared_ptr<Array> indices = rg.Int32(length, /*min=*/0, /*max=*/dict_size - 1, /*null_probability=*/0.1);
+  auto dict_type = dictionary(int32(), utf8());
+  auto dict_field = field("f1", dict_type);
+  ASSERT_OK_AND_ASSIGN(auto dict_array,
+                       DictionaryArray::FromArrays(dict_type, indices, dict));
+
+  auto schema = ::arrow::schema({field("f0", utf8()), dict_field});
+  batches[2] =
+    RecordBatch::Make(schema, length, {rg.String(500, 0, 10, 0.1), dict_array});
+
+  for(size_t i = 0; i < batches.size(); ++i) {

Review comment:
       Ah, you're right, the padding can make the size slightly larger. Can you add a comment explaining this?






