You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "yyang52 (via GitHub)" <gi...@apache.org> on 2023/06/02 08:58:41 UTC

[GitHub] [arrow] yyang52 opened a new pull request, #35886: GH-35287: [C++][Parquet] Add CodecOptions to customize the compression parameter

yyang52 opened a new pull request, #35886:
URL: https://github.com/apache/arrow/pull/35886

   <!--
   Thanks for opening a pull request!
   If this is your first pull request you can find detailed information on how 
   to contribute here:
     * [New Contributor's Guide](https://arrow.apache.org/docs/dev/developers/guide/step_by_step/pr_lifecycle.html#reviews-and-merge-of-the-pull-request)
     * [Contributing Overview](https://arrow.apache.org/docs/dev/developers/overview.html)
   
   
   If this is not a [minor PR](https://github.com/apache/arrow/blob/main/CONTRIBUTING.md#Minor-Fixes). Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose
   
   Opening GitHub issues ahead of time contributes to the [Openness](http://theapacheway.com/open/#:~:text=Openness%20allows%20new%20users%20the,must%20happen%20in%20the%20open.) of the Apache Arrow project.
   
   Then could you also rename the pull request title in the following format?
   
       GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}
   
   or
   
       MINOR: [${COMPONENT}] ${SUMMARY}
   
   In the case of PARQUET issues on JIRA the title also supports:
   
       PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}
   
   -->
   
   ### Rationale for this change
   
   <!--
    Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
    Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.  
   -->
   Based on https://github.com/apache/arrow/issues/35287, we'd like to add a CodecOptions to make more compression parameters (such as window_bits) customizable when creating the Codec for parquet writer.
   
   Authored-by: Yang Yang [yang10.yang@intel.com](mailto:yang10.yang@intel.com)
   Co-authored-by: Rambacher, Mark [mark.rambacher@intel.com](mailto:mark.rambacher@intel.com)
   
   ### What changes are included in this PR?
   
   <!--
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   Add CodecOptions and replace `compression_level` when creating the Codec. The design is basically based on previous discussions. 
   ### Are these changes tested?
   
   <!--
   We typically require tests for all PRs in order to:
   1. Prevent the code from being accidentally broken by subsequent changes
   2. Serve as another way to document the expected behavior of the code
   
   If tests are not included in your PR, please explain why (for example, are they covered by existing tests)?
   -->
   Yes
   ### Are there any user-facing changes?
   
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   Yes, when user creates the `WriterProperties`
   <!--
   If there are any breaking changes to public APIs, please uncomment the line below and explain which changes are breaking.
   -->
   <!-- **This PR includes breaking changes to public APIs.** -->
   
   <!--
   Please uncomment the line below (and provide explanation) if the changes fix either (a) a security vulnerability, (b) a bug that caused incorrect or invalid data to be produced, or (c) a bug that causes a crash (even when the API contract is upheld). We use this to highlight fixes to issues that may affect users without their knowledge. For this reason, fixing bugs that cause errors don't count, since those are usually obvious.
   -->
   <!-- **This PR contains a "Critical Fix".** -->


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on a diff in pull request #35886: GH-35287: [C++][Parquet] Add CodecOptions to customize the compression parameter

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #35886:
URL: https://github.com/apache/arrow/pull/35886#discussion_r1236236957


##########
cpp/src/parquet/properties.h:
##########
@@ -190,7 +193,14 @@ class PARQUET_EXPORT ColumnProperties {
 
   size_t max_statistics_size() const { return max_stats_size_; }
 
-  int compression_level() const { return compression_level_; }
+  int compression_level() const { return codec_options_->compression_level_; }
+
+  const std::shared_ptr<CodecOptions>& codec_options() const {
+    if (!codec_options_) {
+      return std::make_shared<CodecOptions>();

Review Comment:
   This is a problem:
   ```
   C:\projects\arrow\cpp\src\parquet\properties.h(200) : error C2220: the following warning is treated as an error
   C:\projects\arrow\cpp\src\parquet\properties.h(200) : warning C4172: returning address of local variable or temporary
   ninja: build stopped: subcommand failed.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] yyang52 commented on pull request #35886: GH-35287: [C++][Parquet] Add CodecOptions to customize the compression parameter

Posted by "yyang52 (via GitHub)" <gi...@apache.org>.
yyang52 commented on PR #35886:
URL: https://github.com/apache/arrow/pull/35886#issuecomment-1623106604

   > May you help to review the code change as well as trigger the remaining CI checks?
   
   May you help to trigger the remaining CI checks again? Thanks! @pitrou  


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] conbench-apache-arrow[bot] commented on pull request #35886: GH-35287: [C++][Parquet] Add CodecOptions to customize the compression parameter

Posted by "conbench-apache-arrow[bot] (via GitHub)" <gi...@apache.org>.
conbench-apache-arrow[bot] commented on PR #35886:
URL: https://github.com/apache/arrow/pull/35886#issuecomment-1646108511

   After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit a9035cbf778dbe762de1f6f53f00de8a424bfdad.
   
   There were no benchmark performance regressions. 🎉
   
   The [full Conbench report](https://github.com/apache/arrow/runs/15244735270) has more details. It also includes information about possible false positives for unstable benchmarks that are known to sometimes produce them.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on pull request #35886: GH-35287: [C++][Parquet] Add CodecOptions to customize the compression parameter

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on PR #35886:
URL: https://github.com/apache/arrow/pull/35886#issuecomment-1576446660

   cc @pitrou @wgtmac 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on a diff in pull request #35886: GH-35287: [C++][Parquet] Add CodecOptions to customize the compression parameter

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #35886:
URL: https://github.com/apache/arrow/pull/35886#discussion_r1223745846


##########
cpp/src/arrow/util/compression.h:
##########
@@ -107,6 +107,46 @@ class ARROW_EXPORT Decompressor {
   // XXX add methods for buffer size heuristics?
 };
 
+/// \brief Compression codec options
+class ARROW_EXPORT CodecOptions {
+ public:
+  CodecOptions(int compression_level = kUseDefaultCompressionLevel) {
+    compression_level_ = compression_level;
+  }
+  virtual ~CodecOptions() = default;
+
+  int compression_level_;
+};
+
+// ----------------------------------------------------------------------
+// gzip codec options implementation
+
+struct GZipFormat {

Review Comment:
   Why not use enum class?



##########
cpp/src/arrow/util/compression.h:
##########
@@ -107,6 +107,50 @@ class ARROW_EXPORT Decompressor {
   // XXX add methods for buffer size heuristics?
 };
 
+/// \brief Compression codec options
+class ARROW_EXPORT CodecOptions {
+ public:
+  CodecOptions(int compression_level = kUseDefaultCompressionLevel) {
+    compression_level_ = compression_level;
+  }
+  virtual ~CodecOptions() = default;
+
+  int compression_level_;
+};

Review Comment:
   +1. Not writing explicit constructor can enable struct aggregate initialization in C++20.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] yyang52 commented on a diff in pull request #35886: GH-35287: [C++][Parquet] Add CodecOptions to customize the compression parameter

Posted by "yyang52 (via GitHub)" <gi...@apache.org>.
yyang52 commented on code in PR #35886:
URL: https://github.com/apache/arrow/pull/35886#discussion_r1223916388


##########
cpp/src/arrow/util/compression.h:
##########
@@ -107,6 +107,50 @@ class ARROW_EXPORT Decompressor {
   // XXX add methods for buffer size heuristics?
 };
 
+/// \brief Compression codec options
+class ARROW_EXPORT CodecOptions {
+ public:
+  CodecOptions(int compression_level = kUseDefaultCompressionLevel) {
+    compression_level_ = compression_level;
+  }
+  virtual ~CodecOptions() = default;
+
+  int compression_level_;
+};

Review Comment:
   yes, then we could do one-line built for CodecOptions with `compression_level`, which has been used in several places



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] yyang52 commented on pull request #35886: GH-35287: [C++][Parquet] Add CodecOptions to customize the compression parameter

Posted by "yyang52 (via GitHub)" <gi...@apache.org>.
yyang52 commented on PR #35886:
URL: https://github.com/apache/arrow/pull/35886#issuecomment-1586721098

   > @yyang52 There are Python test failures on CI.
   > 
   > Also, is it possible to pass a base `CodecOptions` when a derived class exists? (e.g. for gzip). Ideally it should be supported.
   
   Seems like the CI has passed :) Please check if more refactors are needed, thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #35886: GH-35287: [C++][Parquet] Add CodecOptions to customize the compression parameter

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #35886:
URL: https://github.com/apache/arrow/pull/35886#issuecomment-1573395340

   :warning: GitHub issue #35287 **has been automatically assigned in GitHub** to PR creator.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #35886: GH-35287: [C++][Parquet] Add CodecOptions to customize the compression parameter

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #35886:
URL: https://github.com/apache/arrow/pull/35886#discussion_r1217788338


##########
cpp/src/parquet/properties.h:
##########
@@ -192,6 +197,8 @@ class PARQUET_EXPORT ColumnProperties {
 
   int compression_level() const { return compression_level_; }

Review Comment:
   Since `Codec` contains `compression_level_`, would this be duplicated or useless?



##########
cpp/src/parquet/column_writer.h:
##########
@@ -87,8 +88,9 @@ class PARQUET_EXPORT PageWriter {
 
   static std::unique_ptr<PageWriter> Open(
       std::shared_ptr<ArrowOutputStream> sink, Compression::type codec,
-      int compression_level, ColumnChunkMetaDataBuilder* metadata,
-      int16_t row_group_ordinal = -1, int16_t column_chunk_ordinal = -1,
+      const std::shared_ptr<CodecOptions>& codec_options,

Review Comment:
   This should be put to the end for capability



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] yyang52 commented on pull request #35886: GH-35287: [C++][Parquet] Add CodecOptions to customize the compression parameter

Posted by "yyang52 (via GitHub)" <gi...@apache.org>.
yyang52 commented on PR #35886:
URL: https://github.com/apache/arrow/pull/35886#issuecomment-1631722144

   > 
   
   Thanks for your kind support! It's a pleasure to make contribution to Arrow community! @pitrou @mapleFU @wgtmac 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on pull request #35886: GH-35287: [C++][Parquet] Add CodecOptions to customize the compression parameter

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on PR #35886:
URL: https://github.com/apache/arrow/pull/35886#issuecomment-1630892068

   @github-actions crossbow submit -g cpp


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on pull request #35886: GH-35287: [C++][Parquet] Add CodecOptions to customize the compression parameter

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on PR #35886:
URL: https://github.com/apache/arrow/pull/35886#issuecomment-1606608889

   https://github.com/apache/arrow/actions/runs/5367297586/jobs/9737382909?pr=35886
   
   There seems to be lint error.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] yyang52 commented on pull request #35886: GH-35287: [C++][Parquet] Add CodecOptions to customize the compression parameter

Posted by "yyang52 (via GitHub)" <gi...@apache.org>.
yyang52 commented on PR #35886:
URL: https://github.com/apache/arrow/pull/35886#issuecomment-1622882937

   > I'm a little busy these days, I'll take a look tonight
   
   Sure, take your time. Just let me know if more actions needed, thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35886: GH-35287: [C++][Parquet] Add CodecOptions to customize the compression parameter

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35886:
URL: https://github.com/apache/arrow/pull/35886#discussion_r1241881608


##########
cpp/src/parquet/column_writer.h:
##########
@@ -87,8 +88,9 @@ class PARQUET_EXPORT PageWriter {
 
   static std::unique_ptr<PageWriter> Open(
       std::shared_ptr<ArrowOutputStream> sink, Compression::type codec,
-      int compression_level, ColumnChunkMetaDataBuilder* metadata,

Review Comment:
   We can indeed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #35886: GH-35287: [C++][Parquet] Add CodecOptions to customize the compression parameter

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #35886:
URL: https://github.com/apache/arrow/pull/35886#discussion_r1214128697


##########
cpp/src/parquet/properties.h:
##########
@@ -172,8 +172,8 @@ class PARQUET_EXPORT ColumnProperties {
     max_stats_size_ = max_stats_size;
   }
 
-  void set_compression_level(int compression_level) {

Review Comment:
   should we keep `set_compression_level` for api compability?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35886: GH-35287: [C++][Parquet] Add CodecOptions to customize the compression parameter

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35886:
URL: https://github.com/apache/arrow/pull/35886#discussion_r1254522764


##########
cpp/src/arrow/util/compression.h:
##########
@@ -107,6 +108,40 @@ class ARROW_EXPORT Decompressor {
   // XXX add methods for buffer size heuristics?
 };
 
+/// \brief Compression codec options
+class ARROW_EXPORT CodecOptions {
+ public:
+  explicit CodecOptions(int compression_level = kUseDefaultCompressionLevel) {
+    compression_level_ = compression_level;
+  }
+  virtual ~CodecOptions() = default;
+
+  int compression_level_ = kUseDefaultCompressionLevel;

Review Comment:
   Forgot this, but style nit (this is a public member, it's weird to add a trailing underscore):
   ```suggestion
     int compression_level = kUseDefaultCompressionLevel;
   ```
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on pull request #35886: GH-35287: [C++][Parquet] Add CodecOptions to customize the compression parameter

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on PR #35886:
URL: https://github.com/apache/arrow/pull/35886#issuecomment-1622877575

   I'm a little busy these days, I'll take a look tonight


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] yyang52 commented on a diff in pull request #35886: GH-35287: [C++][Parquet] Add CodecOptions to customize the compression parameter

Posted by "yyang52 (via GitHub)" <gi...@apache.org>.
yyang52 commented on code in PR #35886:
URL: https://github.com/apache/arrow/pull/35886#discussion_r1247484665


##########
cpp/src/arrow/util/compression.h:
##########
@@ -107,6 +107,44 @@ class ARROW_EXPORT Decompressor {
   // XXX add methods for buffer size heuristics?
 };
 
+/// \brief Compression codec options
+class ARROW_EXPORT CodecOptions {
+ public:
+  explicit CodecOptions(int compression_level = kUseDefaultCompressionLevel) {
+    compression_level_ = compression_level;
+  }
+  virtual ~CodecOptions() = default;
+
+  int compression_level_;

Review Comment:
   yeah I made some changes to these part, will push those soon.



##########
cpp/src/parquet/platform.h:
##########
@@ -24,7 +24,8 @@
 #include "arrow/io/interfaces.h"  // IWYU pragma: export
 #include "arrow/status.h"         // IWYU pragma: export
 #include "arrow/type_fwd.h"       // IWYU pragma: export
-#include "arrow/util/macros.h"    // IWYU pragma: export
+#include "arrow/util/compression.h"

Review Comment:
   yes, could I include that directly in `column_writer.h`, as the forward declaration is not sufficient to initialize an object as far as I learned?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] yyang52 commented on a diff in pull request #35886: GH-35287: [C++][Parquet] Add CodecOptions to customize the compression parameter

Posted by "yyang52 (via GitHub)" <gi...@apache.org>.
yyang52 commented on code in PR #35886:
URL: https://github.com/apache/arrow/pull/35886#discussion_r1247376553


##########
cpp/src/parquet/platform.h:
##########
@@ -24,7 +24,8 @@
 #include "arrow/io/interfaces.h"  // IWYU pragma: export
 #include "arrow/status.h"         // IWYU pragma: export
 #include "arrow/type_fwd.h"       // IWYU pragma: export
-#include "arrow/util/macros.h"    // IWYU pragma: export
+#include "arrow/util/compression.h"

Review Comment:
   If not, then it will prevent us to initialize as `CodecOptions{}`? Am I doing this right?
   ```
   column_writer.h:101:43: error: invalid use of incomplete type 'parquet::CodecOptions' (aka 'arrow::util::CodecOptions')
         const CodecOptions& codec_options = CodecOptions{});
                                             ^~~~~~~~~~~~~~
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] yyang52 commented on a diff in pull request #35886: GH-35287: [C++][Parquet] Add CodecOptions to customize the compression parameter

Posted by "yyang52 (via GitHub)" <gi...@apache.org>.
yyang52 commented on code in PR #35886:
URL: https://github.com/apache/arrow/pull/35886#discussion_r1218969075


##########
cpp/src/arrow/util/compression.h:
##########
@@ -124,7 +168,9 @@ class ARROW_EXPORT Codec {
 
   /// \brief Create a codec for the given compression algorithm
   static Result<std::unique_ptr<Codec>> Create(
-      Compression::type codec, int compression_level = kUseDefaultCompressionLevel);
+      Compression::type codec,

Review Comment:
   Great, that makes sense :)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35886: GH-35287: [C++][Parquet] Add CodecOptions to customize the compression parameter

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35886:
URL: https://github.com/apache/arrow/pull/35886#discussion_r1219302368


##########
cpp/src/arrow/util/compression_brotli.cc:
##########
@@ -221,6 +226,13 @@ class BrotliCodec : public Codec {
     return ptr;
   }
 
+  Status Init() override {
+    if (window_bits_ < BROTLI_MIN_WINDOW_BITS || window_bits_ > BROTLI_MAX_WINDOW_BITS) {
+      return Status::Invalid("window_bits should be within 10 ~ 24");

Review Comment:
   Let's make the error message respect the macro values:
   ```suggestion
         return Status::Invalid("window_bits should be between ", BROTLI_MIN_WINDOW_BITS,
             " and ", BROTLI_MAX_WINDOW_BITS);
   ```



##########
cpp/src/arrow/util/compression_lz4.cc:
##########
@@ -17,13 +17,13 @@
 
 #include "arrow/util/compression_internal.h"
 
-#include <cstdint>
-#include <cstring>
-#include <memory>
-
 #include <lz4.h>
 #include <lz4frame.h>
 #include <lz4hc.h>
+#include <cstdint>
+#include <cstring>
+#include <iostream>
+#include <memory>

Review Comment:
   Those changes shouldn't be necessary? (also, why `iostream`?)



##########
cpp/src/arrow/util/compression_zlib.cc:
##########
@@ -46,6 +45,9 @@ namespace {
 // Maximum window size
 constexpr int WINDOW_BITS = 15;
 
+// Minimum window size
+constexpr int kGZipMinWindowBits = 9;

Review Comment:
   Can you also rename the above to `kGZipMaxWindowBits`?



##########
cpp/src/arrow/util/compression.h:
##########
@@ -124,7 +168,15 @@ class ARROW_EXPORT Codec {
 
   /// \brief Create a codec for the given compression algorithm
   static Result<std::unique_ptr<Codec>> Create(
-      Compression::type codec, int compression_level = kUseDefaultCompressionLevel);
+      Compression::type codec,
+      const std::shared_ptr<CodecOptions>& codec_options =
+          std::make_shared<CodecOptions>(kUseDefaultCompressionLevel));
+
+  /// \brief Create a codec for the given compression algorithm
+  /// \deprecated and left for backwards compatibility.
+  ARROW_DEPRECATED("Use CodecOptions to create Codec")

Review Comment:
   I don't think this should be deprecated, since it's a perfectly good API for common usage (most people will never change the windows bits or any other advanced compression parameters).



##########
cpp/src/arrow/flight/flight_benchmark.cc:
##########
@@ -432,7 +432,9 @@ int main(int argc, char** argv) {
     const int level = level_str.empty() ? arrow::util::kUseDefaultCompressionLevel
                                         : std::stoi(level_str);
     const auto type = arrow::util::Codec::GetCompressionType(name).ValueOrDie();
-    auto codec = arrow::util::Codec::Create(type, level).ValueOrDie();
+    auto codec = arrow::util::Codec::Create(
+                     type, std::make_shared<arrow::util::CodecOptions>(level))
+                     .ValueOrDie();

Review Comment:
   These changes (and other similar ones) can be reverted if we remove the deprecation.



##########
cpp/src/arrow/util/compression.h:
##########
@@ -107,6 +107,50 @@ class ARROW_EXPORT Decompressor {
   // XXX add methods for buffer size heuristics?
 };
 
+/// \brief Compression codec options
+class ARROW_EXPORT CodecOptions {
+ public:
+  CodecOptions(int compression_level = kUseDefaultCompressionLevel) {
+    compression_level_ = compression_level;
+  }
+  virtual ~CodecOptions() = default;
+
+  int compression_level_;
+};
+
+// ----------------------------------------------------------------------
+// gzip codec options implementation
+
+struct GZipFormat {
+  enum type {
+    ZLIB,
+    DEFLATE,
+    GZIP,
+  };
+};
+
+constexpr int kGZipDefaultWindowBits = 15;
+
+class GZipCodecOptions : public CodecOptions {
+ public:
+  ~GZipCodecOptions() = default;

Review Comment:
   I don't think you need to explicitly define the destructor?



##########
cpp/src/arrow/util/compression.h:
##########
@@ -107,6 +107,50 @@ class ARROW_EXPORT Decompressor {
   // XXX add methods for buffer size heuristics?
 };
 
+/// \brief Compression codec options
+class ARROW_EXPORT CodecOptions {
+ public:
+  CodecOptions(int compression_level = kUseDefaultCompressionLevel) {
+    compression_level_ = compression_level;
+  }
+  virtual ~CodecOptions() = default;
+
+  int compression_level_;
+};
+
+// ----------------------------------------------------------------------
+// gzip codec options implementation
+
+struct GZipFormat {
+  enum type {
+    ZLIB,
+    DEFLATE,
+    GZIP,
+  };
+};
+
+constexpr int kGZipDefaultWindowBits = 15;
+
+class GZipCodecOptions : public CodecOptions {
+ public:
+  ~GZipCodecOptions() = default;
+
+  GZipFormat::type gzip_format = GZipFormat::GZIP;
+  int window_bits = kGZipDefaultWindowBits;
+};
+
+// ----------------------------------------------------------------------
+// brotli codec options implementation
+
+constexpr int kBrotliDefaultWindowBits = 22;
+
+class BrotliCodecOptions : public CodecOptions {

Review Comment:
   Same here?
   ```suggestion
   class ARROW_EXPORT BrotliCodecOptions : public CodecOptions {
   ```
   



##########
cpp/src/parquet/column_writer.h:
##########
@@ -21,6 +21,7 @@
 #include <cstring>
 #include <memory>
 
+#include "arrow/util/compression.h"

Review Comment:
   ```suggestion
   #include "arrow/type_fwd.h"
   ```



##########
cpp/src/arrow/util/compression_zlib.cc:
##########
@@ -55,8 +57,8 @@ constexpr int DETECT_CODEC = 32;
 constexpr int kGZipMinCompressionLevel = 1;
 constexpr int kGZipMaxCompressionLevel = 9;
 
-int CompressionWindowBitsForFormat(GZipFormat::type format) {
-  int window_bits = WINDOW_BITS;
+int CompressionWindowBitsForFormat(GZipFormat::type format, int input_window_bits) {
+  int window_bits = input_window_bits;

Review Comment:
   This is a bit convoluted
   ```suggestion
   int CompressionWindowBitsForFormat(GZipFormat::type format, int window_bits) {
   ```



##########
cpp/src/arrow/util/compression.cc:
##########
@@ -135,6 +135,92 @@ Result<int> Codec::DefaultCompressionLevel(Compression::type codec_type) {
   return codec->default_compression_level();
 }
 
+Result<std::unique_ptr<Codec>> Codec::Create(
+    Compression::type codec_type, const std::shared_ptr<CodecOptions>& codec_options) {
+  if (!IsAvailable(codec_type)) {
+    if (codec_type == Compression::LZO) {
+      return Status::NotImplemented("LZO codec not implemented");
+    }
+
+    auto name = GetCodecAsString(codec_type);
+    if (name == "unknown") {
+      return Status::Invalid("Unrecognized codec");
+    }
+
+    return Status::NotImplemented("Support for codec '", GetCodecAsString(codec_type),
+                                  "' not built");
+  }
+
+  auto compression_level = codec_options->compression_level_;
+  if (compression_level != kUseDefaultCompressionLevel &&
+      !SupportsCompressionLevel(codec_type)) {
+    return Status::Invalid("Codec '", GetCodecAsString(codec_type),
+                           "' doesn't support setting a compression level.");
+  }
+
+  std::unique_ptr<Codec> codec;
+  switch (codec_type) {
+    case Compression::UNCOMPRESSED:
+      return nullptr;
+    case Compression::SNAPPY:
+#ifdef ARROW_WITH_SNAPPY
+      codec = internal::MakeSnappyCodec();
+#endif
+      break;
+    case Compression::GZIP: {
+#ifdef ARROW_WITH_ZLIB
+      std::shared_ptr<GZipCodecOptions> opt =
+          std::dynamic_pointer_cast<GZipCodecOptions>(codec_options);
+      codec = internal::MakeGZipCodec(compression_level,
+                                      opt ? opt->gzip_format : GZipFormat::GZIP,
+                                      opt ? opt->window_bits : kGZipDefaultWindowBits);
+#endif
+      break;
+    }
+    case Compression::BROTLI: {
+#ifdef ARROW_WITH_BROTLI
+      std::shared_ptr<BrotliCodecOptions> opt =
+          std::dynamic_pointer_cast<BrotliCodecOptions>(codec_options);
+      codec = internal::MakeBrotliCodec(
+          compression_level, opt ? opt->window_bits : kBrotliDefaultWindowBits);
+#endif
+      break;
+    }
+    case Compression::LZ4:
+#ifdef ARROW_WITH_LZ4
+      codec = internal::MakeLz4RawCodec(compression_level);
+#endif
+      break;
+    case Compression::LZ4_FRAME:
+#ifdef ARROW_WITH_LZ4
+      codec = internal::MakeLz4FrameCodec(compression_level);
+#endif
+      break;
+    case Compression::LZ4_HADOOP:
+#ifdef ARROW_WITH_LZ4
+      codec = internal::MakeLz4HadoopRawCodec();
+#endif
+      break;
+    case Compression::ZSTD:
+#ifdef ARROW_WITH_ZSTD
+      codec = internal::MakeZSTDCodec(compression_level);
+#endif
+      break;
+    case Compression::BZ2:
+#ifdef ARROW_WITH_BZ2
+      codec = internal::MakeBZ2Codec(compression_level);
+#endif
+      break;
+    default:
+      break;
+  }
+
+  DCHECK_NE(codec, nullptr);
+  RETURN_NOT_OK(codec->Init());
+  return std::move(codec);
+}
+
+// Deprecated and use CodecOptions to create Codec instead
 Result<std::unique_ptr<Codec>> Codec::Create(Compression::type codec_type,

Review Comment:
   There's a lot of repetition below. Can you please implement this overload in terms of the one above?



##########
cpp/src/parquet/types.h:
##########
@@ -487,6 +487,12 @@ bool IsCodecSupported(Compression::type codec);
 PARQUET_EXPORT
 std::unique_ptr<Codec> GetCodec(Compression::type codec);
 
+PARQUET_EXPORT
+std::unique_ptr<Codec> GetCodec(Compression::type codec,
+                                const std::shared_ptr<CodecOptions>& codec_options);

Review Comment:
   `const CodecOptions&`



##########
cpp/src/parquet/types.h:
##########
@@ -487,6 +487,12 @@ bool IsCodecSupported(Compression::type codec);
 PARQUET_EXPORT
 std::unique_ptr<Codec> GetCodec(Compression::type codec);
 
+PARQUET_EXPORT
+std::unique_ptr<Codec> GetCodec(Compression::type codec,
+                                const std::shared_ptr<CodecOptions>& codec_options);
+
+/// \deprecated and left for backwards compatibility.
+ARROW_DEPRECATED("Use CodecOptions to create Codec")

Review Comment:
   Again, I'm not sure it's useful to deprecate this one.



##########
cpp/src/parquet/column_writer.h:
##########
@@ -87,8 +88,9 @@ class PARQUET_EXPORT PageWriter {
 
   static std::unique_ptr<PageWriter> Open(
       std::shared_ptr<ArrowOutputStream> sink, Compression::type codec,
-      int compression_level, ColumnChunkMetaDataBuilder* metadata,

Review Comment:
   Let's not remove this API but create a new one taking `const CodecOptions&`, to avoid breaking API compatibility.



##########
r/src/compression.cpp:
##########
@@ -23,7 +23,8 @@
 // [[arrow::export]]
 std::shared_ptr<arrow::util::Codec> util___Codec__Create(arrow::Compression::type codec,
                                                          R_xlen_t compression_level) {
-  return ValueOrStop(arrow::util::Codec::Create(codec, compression_level));
+  return ValueOrStop(arrow::util::Codec::Create(
+      codec, std::make_shared<CodecOptions>(CodecOptions(compression_level))));

Review Comment:
   Can probably revert this.



##########
cpp/src/arrow/util/compression_zlib.cc:
##########
@@ -461,6 +468,9 @@ class GZipCodec : public Codec {
   }
 
   Status Init() override {
+    if (window_bits_ < kGZipMinWindowBits || window_bits_ > WINDOW_BITS) {
+      return Status::Invalid("window_bits should be within 9 ~ 15");

Review Comment:
   Here as well, please incorporate the macro values in the error message.



##########
cpp/src/arrow/util/compression_zlib.cc:
##########
@@ -17,15 +17,14 @@
 
 #include "arrow/util/compression_internal.h"
 
+#include <zconf.h>
+#include <zlib.h>

Review Comment:
   Nit, but can you undo the include ordering change? It's a bit gratuitous. Also, it makes things more readable to separate stdlib imports from third-party library imports.



##########
cpp/src/arrow/util/compression.h:
##########
@@ -107,6 +107,50 @@ class ARROW_EXPORT Decompressor {
   // XXX add methods for buffer size heuristics?
 };
 
+/// \brief Compression codec options
+class ARROW_EXPORT CodecOptions {
+ public:
+  CodecOptions(int compression_level = kUseDefaultCompressionLevel) {
+    compression_level_ = compression_level;
+  }
+  virtual ~CodecOptions() = default;
+
+  int compression_level_;
+};
+
+// ----------------------------------------------------------------------
+// gzip codec options implementation
+
+struct GZipFormat {
+  enum type {
+    ZLIB,
+    DEFLATE,
+    GZIP,
+  };
+};
+
+constexpr int kGZipDefaultWindowBits = 15;
+
+class GZipCodecOptions : public CodecOptions {

Review Comment:
   Probably need to ARROW_EXPORT this again for Windows.
   ```suggestion
   class ARROW_EXPORT GZipCodecOptions : public CodecOptions {
   ```



##########
cpp/src/arrow/util/compression.h:
##########
@@ -107,6 +107,50 @@ class ARROW_EXPORT Decompressor {
   // XXX add methods for buffer size heuristics?
 };
 
+/// \brief Compression codec options
+class ARROW_EXPORT CodecOptions {
+ public:
+  CodecOptions(int compression_level = kUseDefaultCompressionLevel) {
+    compression_level_ = compression_level;
+  }
+  virtual ~CodecOptions() = default;
+
+  int compression_level_;
+};

Review Comment:
   Or simply:
   ```c++
   class ARROW_EXPORT CodecOptions {
    public:
     virtual ~CodecOptions() = default;
   
     int compression_level_ = kUseDefaultCompressionLevel;
   };
   ```



##########
cpp/src/parquet/column_writer.cc:
##########
@@ -249,9 +249,9 @@ int LevelEncoder::Encode(int batch_size, const int16_t* levels) {
 class SerializedPageWriter : public PageWriter {
  public:
   SerializedPageWriter(std::shared_ptr<ArrowOutputStream> sink, Compression::type codec,
-                       int compression_level, ColumnChunkMetaDataBuilder* metadata,
-                       int16_t row_group_ordinal, int16_t column_chunk_ordinal,
-                       bool use_page_checksum_verification,
+                       const std::shared_ptr<CodecOptions>& codec_options,

Review Comment:
   No need for `shared_ptr` here.
   ```suggestion
                          const CodecOptions& codec_options,
   ```



##########
cpp/src/arrow/util/compression.h:
##########
@@ -124,7 +168,15 @@ class ARROW_EXPORT Codec {
 
   /// \brief Create a codec for the given compression algorithm
   static Result<std::unique_ptr<Codec>> Create(
-      Compression::type codec, int compression_level = kUseDefaultCompressionLevel);
+      Compression::type codec,
+      const std::shared_ptr<CodecOptions>& codec_options =
+          std::make_shared<CodecOptions>(kUseDefaultCompressionLevel));

Review Comment:
   It's not necessary to mandate  `shared_ptr` here. Codecs won't take ownership of the pointer, so you can pass a const reference instead:
   ```c++
     static Result<std::unique_ptr<Codec>> Create(
         Compression::type codec, const CodecOptions& codec_options = {});
   ```
   



##########
cpp/src/parquet/types.cc:
##########
@@ -51,9 +51,30 @@ bool IsCodecSupported(Compression::type codec) {
 }
 
 std::unique_ptr<Codec> GetCodec(Compression::type codec) {
-  return GetCodec(codec, Codec::UseDefaultCompressionLevel());
+  return GetCodec(codec, std::make_shared<CodecOptions>());
 }
 
+std::unique_ptr<Codec> GetCodec(Compression::type codec,
+                                const std::shared_ptr<CodecOptions>& codec_options) {
+  std::unique_ptr<Codec> result;
+  if (codec == Compression::LZO) {
+    throw ParquetException(
+        "While LZO compression is supported by the Parquet format in "
+        "general, it is currently not supported by the C++ implementation.");
+  }
+
+  if (!IsCodecSupported(codec)) {
+    std::stringstream ss;
+    ss << "Codec type " << Codec::GetCodecAsString(codec)
+       << " not supported in Parquet format";
+    throw ParquetException(ss.str());
+  }
+
+  PARQUET_ASSIGN_OR_THROW(result, Codec::Create(codec, codec_options));
+  return result;
+}
+
+// Deprecated and use CodecOptions to create Codec instead
 std::unique_ptr<Codec> GetCodec(Compression::type codec, int compression_level) {

Review Comment:
   Here as well, can you implement this overload in terms of the one above?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] yyang52 commented on a diff in pull request #35886: GH-35287: [C++][Parquet] Add CodecOptions to customize the compression parameter

Posted by "yyang52 (via GitHub)" <gi...@apache.org>.
yyang52 commented on code in PR #35886:
URL: https://github.com/apache/arrow/pull/35886#discussion_r1223920461


##########
cpp/src/arrow/util/compression.h:
##########
@@ -107,6 +107,46 @@ class ARROW_EXPORT Decompressor {
   // XXX add methods for buffer size heuristics?
 };
 
+/// \brief Compression codec options
+class ARROW_EXPORT CodecOptions {
+ public:
+  CodecOptions(int compression_level = kUseDefaultCompressionLevel) {
+    compression_level_ = compression_level;
+  }
+  virtual ~CodecOptions() = default;
+
+  int compression_level_;
+};
+
+// ----------------------------------------------------------------------
+// gzip codec options implementation
+
+struct GZipFormat {
+  enum type {
+    ZLIB,
+    DEFLATE,
+    GZIP,
+  };
+};
+
+constexpr int kGZipDefaultWindowBits = 15;
+
+class ARROW_EXPORT GZipCodecOptions : public CodecOptions {
+ public:
+  GZipFormat::type gzip_format = GZipFormat::GZIP;
+  int window_bits = kGZipDefaultWindowBits;
+};
+
+// ----------------------------------------------------------------------
+// brotli codec options implementation
+
+constexpr int kBrotliDefaultWindowBits = 22;

Review Comment:
   yes, but then we need to include the `brotli` header here as well?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou merged pull request #35886: GH-35287: [C++][Parquet] Add CodecOptions to customize the compression parameter

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou merged PR #35886:
URL: https://github.com/apache/arrow/pull/35886


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35886: GH-35287: [C++][Parquet] Add CodecOptions to customize the compression parameter

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35886:
URL: https://github.com/apache/arrow/pull/35886#discussion_r1247481017


##########
cpp/src/parquet/platform.h:
##########
@@ -24,7 +24,8 @@
 #include "arrow/io/interfaces.h"  // IWYU pragma: export
 #include "arrow/status.h"         // IWYU pragma: export
 #include "arrow/type_fwd.h"       // IWYU pragma: export
-#include "arrow/util/macros.h"    // IWYU pragma: export
+#include "arrow/util/compression.h"

Review Comment:
   @yyang52 You don't need it in `platform.h`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #35886: GH-35287: [C++][Parquet] Add CodecOptions to customize the compression parameter

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #35886:
URL: https://github.com/apache/arrow/pull/35886#discussion_r1247482352


##########
cpp/src/arrow/util/compression.h:
##########
@@ -107,6 +107,44 @@ class ARROW_EXPORT Decompressor {
   // XXX add methods for buffer size heuristics?
 };
 
+/// \brief Compression codec options
+class ARROW_EXPORT CodecOptions {
+ public:
+  explicit CodecOptions(int compression_level = kUseDefaultCompressionLevel) {
+    compression_level_ = compression_level;
+  }
+  virtual ~CodecOptions() = default;
+
+  int compression_level_;

Review Comment:
   So would this be changed to a `std::nullopt`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] yyang52 commented on a diff in pull request #35886: GH-35287: [C++][Parquet] Add CodecOptions to customize the compression parameter

Posted by "yyang52 (via GitHub)" <gi...@apache.org>.
yyang52 commented on code in PR #35886:
URL: https://github.com/apache/arrow/pull/35886#discussion_r1246170803


##########
cpp/src/arrow/util/compression.h:
##########
@@ -107,6 +107,50 @@ class ARROW_EXPORT Decompressor {
   // XXX add methods for buffer size heuristics?
 };
 
+/// \brief Compression codec options
+class ARROW_EXPORT CodecOptions {
+ public:
+  CodecOptions(int compression_level = kUseDefaultCompressionLevel) {
+    compression_level_ = compression_level;
+  }
+  virtual ~CodecOptions() = default;
+
+  int compression_level_;
+};

Review Comment:
   Per tests, we may need to do this? Am I missing something?
   ```  
   CodecOptions codec_options;
   codec_options.compression_level_ = compression_level;
   return Codec::Create(codec_type, codec_options);
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] yyang52 commented on a diff in pull request #35886: GH-35287: [C++][Parquet] Add CodecOptions to customize the compression parameter

Posted by "yyang52 (via GitHub)" <gi...@apache.org>.
yyang52 commented on code in PR #35886:
URL: https://github.com/apache/arrow/pull/35886#discussion_r1223918093


##########
cpp/src/parquet/properties.h:
##########
@@ -192,6 +197,8 @@ class PARQUET_EXPORT ColumnProperties {
 
   int compression_level() const { return compression_level_; }

Review Comment:
   that makes sense :)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] yyang52 commented on a diff in pull request #35886: GH-35287: [C++][Parquet] Add CodecOptions to customize the compression parameter

Posted by "yyang52 (via GitHub)" <gi...@apache.org>.
yyang52 commented on code in PR #35886:
URL: https://github.com/apache/arrow/pull/35886#discussion_r1220691885


##########
cpp/src/arrow/util/compression_zlib.cc:
##########
@@ -17,15 +17,14 @@
 
 #include "arrow/util/compression_internal.h"
 
+#include <zconf.h>
+#include <zlib.h>

Review Comment:
   Sure, that might be due to some local format auto-change, will revert those.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35886: GH-35287: [C++][Parquet] Add CodecOptions to customize the compression parameter

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35886:
URL: https://github.com/apache/arrow/pull/35886#discussion_r1241873189


##########
cpp/src/arrow/util/compression.h:
##########
@@ -107,6 +107,46 @@ class ARROW_EXPORT Decompressor {
   // XXX add methods for buffer size heuristics?
 };
 
+/// \brief Compression codec options
+class ARROW_EXPORT CodecOptions {
+ public:
+  CodecOptions(int compression_level = kUseDefaultCompressionLevel) {
+    compression_level_ = compression_level;
+  }
+  virtual ~CodecOptions() = default;
+
+  int compression_level_;
+};
+
+// ----------------------------------------------------------------------
+// gzip codec options implementation
+
+struct GZipFormat {
+  enum type {
+    ZLIB,
+    DEFLATE,
+    GZIP,
+  };
+};
+
+constexpr int kGZipDefaultWindowBits = 15;
+
+class ARROW_EXPORT GZipCodecOptions : public CodecOptions {
+ public:
+  GZipFormat::type gzip_format = GZipFormat::GZIP;
+  int window_bits = kGZipDefaultWindowBits;

Review Comment:
   That sounds like a good idea.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #35886: GH-35287: [C++][Parquet] Add CodecOptions to customize the compression parameter

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #35886:
URL: https://github.com/apache/arrow/pull/35886#discussion_r1241876835


##########
cpp/src/arrow/util/compression.h:
##########
@@ -107,6 +107,46 @@ class ARROW_EXPORT Decompressor {
   // XXX add methods for buffer size heuristics?
 };
 
+/// \brief Compression codec options
+class ARROW_EXPORT CodecOptions {
+ public:
+  CodecOptions(int compression_level = kUseDefaultCompressionLevel) {
+    compression_level_ = compression_level;
+  }
+  virtual ~CodecOptions() = default;
+
+  int compression_level_;
+};
+
+// ----------------------------------------------------------------------
+// gzip codec options implementation
+
+struct GZipFormat {
+  enum type {
+    ZLIB,
+    DEFLATE,
+    GZIP,
+  };
+};
+
+constexpr int kGZipDefaultWindowBits = 15;
+
+class ARROW_EXPORT GZipCodecOptions : public CodecOptions {
+ public:
+  GZipFormat::type gzip_format = GZipFormat::GZIP;
+  int window_bits = kGZipDefaultWindowBits;

Review Comment:
   I also vote for this



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #35886: GH-35287: [C++][Parquet] Add CodecOptions to customize the compression parameter

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #35886:
URL: https://github.com/apache/arrow/pull/35886#discussion_r1247405427


##########
cpp/src/parquet/properties.h:
##########
@@ -172,8 +172,8 @@ class PARQUET_EXPORT ColumnProperties {
     max_stats_size_ = max_stats_size;
   }
 
-  void set_compression_level(int compression_level) {

Review Comment:
   I guess both can be used



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #35886: GH-35287: [C++][Parquet] Add CodecOptions to customize the compression parameter

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #35886:
URL: https://github.com/apache/arrow/pull/35886#discussion_r1271424177


##########
cpp/src/parquet/file_writer.cc:
##########
@@ -291,12 +304,24 @@ class RowGroupSerializer : public RowGroupWriter::Contents {
       auto oi_builder = page_index_builder_ && properties_->page_index_enabled(path)
                             ? page_index_builder_->GetOffsetIndexBuilder(column_ordinal)
                             : nullptr;
-      std::unique_ptr<PageWriter> pager = PageWriter::Open(
-          sink_, properties_->compression(path), properties_->compression_level(path),
-          col_meta, static_cast<int16_t>(row_group_ordinal_),
-          static_cast<int16_t>(column_ordinal), properties_->memory_pool(),
-          buffered_row_group_, meta_encryptor, data_encryptor,
-          properties_->page_checksum_enabled(), ci_builder, oi_builder);
+      auto codec_options = properties_->codec_options(path)
+                               ? (properties_->codec_options(path)).get()
+                               : nullptr;
+
+      std::unique_ptr<PageWriter> pager;
+      if (!codec_options) {
+        pager = PageWriter::Open(sink_, properties_->compression(path), col_meta,
+                                 row_group_ordinal_, static_cast<int16_t>(column_ordinal),
+                                 properties_->memory_pool(), false, meta_encryptor,

Review Comment:
   This part lost `buffered_row_group_`, and set it to false. Sorry that I didn't found it at first time



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] yyang52 commented on a diff in pull request #35886: GH-35287: [C++][Parquet] Add CodecOptions to customize the compression parameter

Posted by "yyang52 (via GitHub)" <gi...@apache.org>.
yyang52 commented on code in PR #35886:
URL: https://github.com/apache/arrow/pull/35886#discussion_r1236679927


##########
cpp/src/parquet/properties.h:
##########
@@ -190,7 +193,14 @@ class PARQUET_EXPORT ColumnProperties {
 
   size_t max_statistics_size() const { return max_stats_size_; }
 
-  int compression_level() const { return compression_level_; }
+  int compression_level() const { return codec_options_->compression_level_; }
+
+  const std::shared_ptr<CodecOptions>& codec_options() const {
+    if (!codec_options_) {
+      return std::make_shared<CodecOptions>();

Review Comment:
   this seems to be fixed, but there are Python errors that I have no ideas on the hints



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35886: GH-35287: [C++][Parquet] Add CodecOptions to customize the compression parameter

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35886:
URL: https://github.com/apache/arrow/pull/35886#discussion_r1241875496


##########
cpp/src/arrow/util/compression.h:
##########
@@ -107,6 +107,50 @@ class ARROW_EXPORT Decompressor {
   // XXX add methods for buffer size heuristics?
 };
 
+/// \brief Compression codec options
+class ARROW_EXPORT CodecOptions {
+ public:
+  CodecOptions(int compression_level = kUseDefaultCompressionLevel) {
+    compression_level_ = compression_level;
+  }
+  virtual ~CodecOptions() = default;
+
+  int compression_level_;
+};

Review Comment:
   @yyang52 I'm not sure I understand what you're saying. Both definitions allow writing `CodecOptions{compression_level}`, right?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] yyang52 commented on a diff in pull request #35886: GH-35287: [C++][Parquet] Add CodecOptions to customize the compression parameter

Posted by "yyang52 (via GitHub)" <gi...@apache.org>.
yyang52 commented on code in PR #35886:
URL: https://github.com/apache/arrow/pull/35886#discussion_r1246206155


##########
cpp/src/arrow/util/compression.h:
##########
@@ -107,6 +107,50 @@ class ARROW_EXPORT Decompressor {
   // XXX add methods for buffer size heuristics?
 };
 
+/// \brief Compression codec options
+class ARROW_EXPORT CodecOptions {
+ public:
+  CodecOptions(int compression_level = kUseDefaultCompressionLevel) {
+    compression_level_ = compression_level;
+  }
+  virtual ~CodecOptions() = default;
+
+  int compression_level_;
+};

Review Comment:
   It gives me the compilation error:
   ```
   error: no matching constructor for initialization of 'arrow::util::CodecOptions'
     return Codec::Create(codec_type, CodecOptions{compression_level});
                                      ^           ~~~~~~~~~~~~~~~~~~~
   /home/yangyang/arrow/cpp/src/arrow/util/compression.h:111:20: note: candidate constructor (the implicit copy constructor) not viable: no known conversion from 'int' to 'const arrow::util::CodecOptions' for 1st argument
   class ARROW_EXPORT CodecOptions {
                      ^
   /home/yangyang/arrow/cpp/src/arrow/util/compression.h:111:20: note: candidate constructor (the implicit default constructor) not viable: requires 0 arguments, but 1 was provided
   1 error generated.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] yyang52 commented on pull request #35886: GH-35287: [C++][Parquet] Add CodecOptions to customize the compression parameter

Posted by "yyang52 (via GitHub)" <gi...@apache.org>.
yyang52 commented on PR #35886:
URL: https://github.com/apache/arrow/pull/35886#issuecomment-1614307121

   May you help to review the code change as well as trigger the remaining CI checks? Thanks! @mapleFU @pitrou @wgtmac 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] yyang52 commented on pull request #35886: GH-35287: [C++][Parquet] Add CodecOptions to customize the compression parameter

Posted by "yyang52 (via GitHub)" <gi...@apache.org>.
yyang52 commented on PR #35886:
URL: https://github.com/apache/arrow/pull/35886#issuecomment-1623038523

   > @yyang52 We're busy preparing a release right now. You might start by merging/rebasing from main to get a more passing CI.
   
   Sure, no rush! I will rebase from main and re-run CI checks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] yyang52 commented on pull request #35886: GH-35287: [C++][Parquet] Add CodecOptions to customize the compression parameter

Posted by "yyang52 (via GitHub)" <gi...@apache.org>.
yyang52 commented on PR #35886:
URL: https://github.com/apache/arrow/pull/35886#issuecomment-1622779185

   Hi, may I know if there is any change needed or concerns regarding this PR? Thanks! 
   @mapleFU @pitrou @wgtmac 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on a diff in pull request #35886: GH-35287: [C++][Parquet] Add CodecOptions to customize the compression parameter

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #35886:
URL: https://github.com/apache/arrow/pull/35886#discussion_r1231802272


##########
cpp/src/parquet/column_writer.h:
##########
@@ -87,8 +88,9 @@ class PARQUET_EXPORT PageWriter {
 
   static std::unique_ptr<PageWriter> Open(
       std::shared_ptr<ArrowOutputStream> sink, Compression::type codec,
-      int compression_level, ColumnChunkMetaDataBuilder* metadata,
-      int16_t row_group_ordinal = -1, int16_t column_chunk_ordinal = -1,
+      const std::shared_ptr<CodecOptions>& codec_options,

Review Comment:
   In addition to the comments above, should be consolidate `Compression::type codec` and `const std::shared_ptr<CodecOptions>& codec_options`? codec_options should know codec.



##########
cpp/src/parquet/column_writer.h:
##########
@@ -21,6 +21,7 @@
 #include <cstring>
 #include <memory>
 
+#include "arrow/type_fwd.h"

Review Comment:
   Is this required? It should be included by `parquet/platform.h`.



##########
cpp/src/arrow/util/compression.h:
##########
@@ -107,6 +107,50 @@ class ARROW_EXPORT Decompressor {
   // XXX add methods for buffer size heuristics?
 };
 
+/// \brief Compression codec options
+class ARROW_EXPORT CodecOptions {
+ public:
+  CodecOptions(int compression_level = kUseDefaultCompressionLevel) {
+    compression_level_ = compression_level;
+  }
+  virtual ~CodecOptions() = default;
+
+  int compression_level_;
+};

Review Comment:
   That sounds reasonable for now. But it would be harder to use once we add more parameters.



##########
cpp/src/arrow/util/compression.h:
##########
@@ -107,6 +107,46 @@ class ARROW_EXPORT Decompressor {
   // XXX add methods for buffer size heuristics?
 };
 
+/// \brief Compression codec options
+class ARROW_EXPORT CodecOptions {
+ public:
+  CodecOptions(int compression_level = kUseDefaultCompressionLevel) {
+    compression_level_ = compression_level;
+  }
+  virtual ~CodecOptions() = default;
+
+  int compression_level_;
+};
+
+// ----------------------------------------------------------------------
+// gzip codec options implementation
+
+struct GZipFormat {
+  enum type {
+    ZLIB,
+    DEFLATE,
+    GZIP,
+  };
+};
+
+constexpr int kGZipDefaultWindowBits = 15;
+
+class ARROW_EXPORT GZipCodecOptions : public CodecOptions {
+ public:
+  GZipFormat::type gzip_format = GZipFormat::GZIP;
+  int window_bits = kGZipDefaultWindowBits;
+};
+
+// ----------------------------------------------------------------------
+// brotli codec options implementation
+
+constexpr int kBrotliDefaultWindowBits = 22;

Review Comment:
   Same with my previous comment. We can use `BROTLI_DEFAULT_WINDOW` in the `.cc` file



##########
cpp/src/parquet/column_writer.h:
##########
@@ -87,8 +88,9 @@ class PARQUET_EXPORT PageWriter {
 
   static std::unique_ptr<PageWriter> Open(
       std::shared_ptr<ArrowOutputStream> sink, Compression::type codec,
-      int compression_level, ColumnChunkMetaDataBuilder* metadata,

Review Comment:
   Should we mark the old one as deprecated? The new one should always be preferable and we don't want to maintain different overloads in the long term.



##########
cpp/src/arrow/util/compression.h:
##########
@@ -107,6 +107,46 @@ class ARROW_EXPORT Decompressor {
   // XXX add methods for buffer size heuristics?
 };
 
+/// \brief Compression codec options
+class ARROW_EXPORT CodecOptions {
+ public:
+  CodecOptions(int compression_level = kUseDefaultCompressionLevel) {
+    compression_level_ = compression_level;
+  }
+  virtual ~CodecOptions() = default;
+
+  int compression_level_;
+};
+
+// ----------------------------------------------------------------------
+// gzip codec options implementation
+
+struct GZipFormat {
+  enum type {
+    ZLIB,
+    DEFLATE,
+    GZIP,
+  };
+};
+
+constexpr int kGZipDefaultWindowBits = 15;
+
+class ARROW_EXPORT GZipCodecOptions : public CodecOptions {
+ public:
+  GZipFormat::type gzip_format = GZipFormat::GZIP;
+  int window_bits = kGZipDefaultWindowBits;

Review Comment:
   This could be easily out of sync with the upstream. What about using `std::optional<int>` and always use the library default value if it is unset?



##########
cpp/src/arrow/util/compression.h:
##########
@@ -107,6 +107,46 @@ class ARROW_EXPORT Decompressor {
   // XXX add methods for buffer size heuristics?
 };
 
+/// \brief Compression codec options
+class ARROW_EXPORT CodecOptions {
+ public:
+  CodecOptions(int compression_level = kUseDefaultCompressionLevel) {
+    compression_level_ = compression_level;
+  }
+  virtual ~CodecOptions() = default;
+
+  int compression_level_;
+};
+
+// ----------------------------------------------------------------------
+// gzip codec options implementation

Review Comment:
   ```suggestion
   // GZip codec options implementation
   ```



##########
cpp/src/parquet/properties.h:
##########
@@ -192,6 +197,8 @@ class PARQUET_EXPORT ColumnProperties {
 
   int compression_level() const { return compression_level_; }

Review Comment:
   +1 on this. `compression_level_` is duplicated.



##########
cpp/src/arrow/util/compression.h:
##########
@@ -107,6 +107,46 @@ class ARROW_EXPORT Decompressor {
   // XXX add methods for buffer size heuristics?
 };
 
+/// \brief Compression codec options
+class ARROW_EXPORT CodecOptions {
+ public:
+  CodecOptions(int compression_level = kUseDefaultCompressionLevel) {
+    compression_level_ = compression_level;
+  }
+  virtual ~CodecOptions() = default;
+
+  int compression_level_;
+};
+
+// ----------------------------------------------------------------------
+// gzip codec options implementation
+
+struct GZipFormat {

Review Comment:
   It was an internal class in the past. As it is moved to a public header now, changing it to enum class would be preferable.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #35886: GH-35287: [C++][Parquet] Add CodecOptions to customize the compression parameter

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #35886:
URL: https://github.com/apache/arrow/pull/35886#discussion_r1223961635


##########
cpp/src/arrow/util/compression.h:
##########
@@ -107,6 +107,46 @@ class ARROW_EXPORT Decompressor {
   // XXX add methods for buffer size heuristics?
 };
 
+/// \brief Compression codec options
+class ARROW_EXPORT CodecOptions {
+ public:
+  CodecOptions(int compression_level = kUseDefaultCompressionLevel) {
+    compression_level_ = compression_level;
+  }
+  virtual ~CodecOptions() = default;
+
+  int compression_level_;
+};
+
+// ----------------------------------------------------------------------
+// gzip codec options implementation
+
+struct GZipFormat {
+  enum type {
+    ZLIB,
+    DEFLATE,
+    GZIP,
+  };
+};
+
+constexpr int kGZipDefaultWindowBits = 15;
+
+class ARROW_EXPORT GZipCodecOptions : public CodecOptions {
+ public:
+  GZipFormat::type gzip_format = GZipFormat::GZIP;
+  int window_bits = kGZipDefaultWindowBits;
+};
+
+// ----------------------------------------------------------------------
+// brotli codec options implementation
+
+constexpr int kBrotliDefaultWindowBits = 22;

Review Comment:
   Yes you're right. I'm not familiar with compression algorithms, so maybe I'm asking a stupid question. but I'm afraid that brorli might change it's default bits.
   
   Would it be ok for a optional `window_bits`, if it's set, use it, otherwise using default value?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] yyang52 commented on a diff in pull request #35886: GH-35287: [C++][Parquet] Add CodecOptions to customize the compression parameter

Posted by "yyang52 (via GitHub)" <gi...@apache.org>.
yyang52 commented on code in PR #35886:
URL: https://github.com/apache/arrow/pull/35886#discussion_r1218964447


##########
cpp/src/arrow/util/compression.h:
##########
@@ -124,7 +168,9 @@ class ARROW_EXPORT Codec {
 
   /// \brief Create a codec for the given compression algorithm
   static Result<std::unique_ptr<Codec>> Create(
-      Compression::type codec, int compression_level = kUseDefaultCompressionLevel);
+      Compression::type codec,

Review Comment:
   Yes, that could solve the compatibility issue of different APIs. Just to make it clear, you suggest marking the newly added API with CodecOptions as deprecated right? Or the previous one?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] yyang52 commented on a diff in pull request #35886: GH-35287: [C++][Parquet] Add CodecOptions to customize the compression parameter

Posted by "yyang52 (via GitHub)" <gi...@apache.org>.
yyang52 commented on code in PR #35886:
URL: https://github.com/apache/arrow/pull/35886#discussion_r1226143592


##########
cpp/src/arrow/util/compression.h:
##########
@@ -107,6 +107,46 @@ class ARROW_EXPORT Decompressor {
   // XXX add methods for buffer size heuristics?
 };
 
+/// \brief Compression codec options
+class ARROW_EXPORT CodecOptions {
+ public:
+  CodecOptions(int compression_level = kUseDefaultCompressionLevel) {
+    compression_level_ = compression_level;
+  }
+  virtual ~CodecOptions() = default;
+
+  int compression_level_;
+};
+
+// ----------------------------------------------------------------------
+// gzip codec options implementation
+
+struct GZipFormat {

Review Comment:
   I just move `GZipFormat` from `compression_internal.h` to `compression.h` to be able to use it in `CodecOptions` without changing the type.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] yyang52 commented on pull request #35886: GH-35287: [C++][Parquet] Add CodecOptions to customize the compression parameter

Posted by "yyang52 (via GitHub)" <gi...@apache.org>.
yyang52 commented on PR #35886:
URL: https://github.com/apache/arrow/pull/35886#issuecomment-1606832209

   > https://github.com/apache/arrow/actions/runs/5367297586/jobs/9737382909?pr=35886
   > 
   > There seems to be lint error.
   
   Doing some fixes, could you help to re-trigger other checks? Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] yyang52 commented on pull request #35886: GH-35287: [C++][Parquet] Add CodecOptions to customize the compression parameter

Posted by "yyang52 (via GitHub)" <gi...@apache.org>.
yyang52 commented on PR #35886:
URL: https://github.com/apache/arrow/pull/35886#issuecomment-1605883210

   Addressed most comments with most CI checks passed (2 fails seem also happening on other PRs). One left thing is hardcoding `window_bits` in `CodecOptions` may easily be out-of-sync with third-party dependency, as mentioned by @wgtmac. Do you have any suggestions/thoughts on that? Please also have a look at other changes I made and see if any other improvements needed. Thanks! @pitrou @mapleFU @wgtmac 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on a diff in pull request #35886: GH-35287: [C++][Parquet] Add CodecOptions to customize the compression parameter

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #35886:
URL: https://github.com/apache/arrow/pull/35886#discussion_r1217818957


##########
cpp/src/arrow/util/compression.h:
##########
@@ -124,7 +168,9 @@ class ARROW_EXPORT Codec {
 
   /// \brief Create a codec for the given compression algorithm
   static Result<std::unique_ptr<Codec>> Create(
-      Compression::type codec, int compression_level = kUseDefaultCompressionLevel);
+      Compression::type codec,

Review Comment:
   This is a breaking change. It would be better to add a new overload. Then implement this one based on the new overload and label it as deprecated.



##########
cpp/src/parquet/types.h:
##########
@@ -488,7 +488,8 @@ PARQUET_EXPORT
 std::unique_ptr<Codec> GetCodec(Compression::type codec);
 
 PARQUET_EXPORT
-std::unique_ptr<Codec> GetCodec(Compression::type codec, int compression_level);
+std::unique_ptr<Codec> GetCodec(Compression::type codec,

Review Comment:
   Another breaking change here.



##########
cpp/src/parquet/column_writer.cc:
##########
@@ -683,20 +683,21 @@ class BufferedPageWriter : public PageWriter {
 
 std::unique_ptr<PageWriter> PageWriter::Open(
     std::shared_ptr<ArrowOutputStream> sink, Compression::type codec,
-    int compression_level, ColumnChunkMetaDataBuilder* metadata,
-    int16_t row_group_ordinal, int16_t column_chunk_ordinal, MemoryPool* pool,
-    bool buffered_row_group, std::shared_ptr<Encryptor> meta_encryptor,
-    std::shared_ptr<Encryptor> data_encryptor, bool page_write_checksum_enabled,
-    ColumnIndexBuilder* column_index_builder, OffsetIndexBuilder* offset_index_builder) {
+    const std::shared_ptr<CodecOptions>& codec_options,
+    ColumnChunkMetaDataBuilder* metadata, int16_t row_group_ordinal,
+    int16_t column_chunk_ordinal, MemoryPool* pool, bool buffered_row_group,
+    std::shared_ptr<Encryptor> meta_encryptor, std::shared_ptr<Encryptor> data_encryptor,
+    bool page_write_checksum_enabled, ColumnIndexBuilder* column_index_builder,
+    OffsetIndexBuilder* offset_index_builder) {
   if (buffered_row_group) {
     return std::unique_ptr<PageWriter>(new BufferedPageWriter(
-        std::move(sink), codec, compression_level, metadata, row_group_ordinal,
+        std::move(sink), codec, codec_options, metadata, row_group_ordinal,

Review Comment:
   I am not sure if we are able to move `codec` into `codec_options` as well. I don't have a preference here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] yyang52 commented on a diff in pull request #35886: GH-35287: [C++][Parquet] Add CodecOptions to customize the compression parameter

Posted by "yyang52 (via GitHub)" <gi...@apache.org>.
yyang52 commented on code in PR #35886:
URL: https://github.com/apache/arrow/pull/35886#discussion_r1220768693


##########
cpp/src/arrow/util/compression.h:
##########
@@ -124,7 +168,15 @@ class ARROW_EXPORT Codec {
 
   /// \brief Create a codec for the given compression algorithm
   static Result<std::unique_ptr<Codec>> Create(
-      Compression::type codec, int compression_level = kUseDefaultCompressionLevel);
+      Compression::type codec,
+      const std::shared_ptr<CodecOptions>& codec_options =
+          std::make_shared<CodecOptions>(kUseDefaultCompressionLevel));
+
+  /// \brief Create a codec for the given compression algorithm
+  /// \deprecated and left for backwards compatibility.
+  ARROW_DEPRECATED("Use CodecOptions to create Codec")

Review Comment:
   That makes more sense, will keep the existing one as it was



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #35886: GH-35287: [C++][Parquet] Add CodecOptions to customize the compression parameter

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #35886:
URL: https://github.com/apache/arrow/pull/35886#discussion_r1223885218


##########
cpp/src/parquet/column_io_benchmark.cc:
##########
@@ -40,8 +40,8 @@ std::shared_ptr<Int64Writer> BuildWriter(int64_t output_size,
                                          ColumnDescriptor* schema,
                                          const WriterProperties* properties,
                                          Compression::type codec) {
-  std::unique_ptr<PageWriter> pager =
-      PageWriter::Open(dst, codec, Codec::UseDefaultCompressionLevel(), metadata);
+    std::unique_ptr<PageWriter> pager =

Review Comment:
   fmt?



##########
cpp/src/arrow/util/compression.h:
##########
@@ -107,6 +107,46 @@ class ARROW_EXPORT Decompressor {
   // XXX add methods for buffer size heuristics?
 };
 
+/// \brief Compression codec options
+class ARROW_EXPORT CodecOptions {
+ public:
+  CodecOptions(int compression_level = kUseDefaultCompressionLevel) {
+    compression_level_ = compression_level;
+  }
+  virtual ~CodecOptions() = default;
+
+  int compression_level_;
+};
+
+// ----------------------------------------------------------------------
+// gzip codec options implementation
+
+struct GZipFormat {
+  enum type {
+    ZLIB,
+    DEFLATE,
+    GZIP,
+  };
+};
+
+constexpr int kGZipDefaultWindowBits = 15;
+
+class ARROW_EXPORT GZipCodecOptions : public CodecOptions {
+ public:
+  GZipFormat::type gzip_format = GZipFormat::GZIP;
+  int window_bits = kGZipDefaultWindowBits;
+};
+
+// ----------------------------------------------------------------------
+// brotli codec options implementation
+
+constexpr int kBrotliDefaultWindowBits = 22;

Review Comment:
   Can it use `BROTLI_DEFAULT_WINDOW`?



##########
cpp/src/parquet/properties.h:
##########
@@ -192,6 +197,8 @@ class PARQUET_EXPORT ColumnProperties {
 
   int compression_level() const { return compression_level_; }

Review Comment:
   Now we have a `compression_level` and a `CodecOptions`. Previously, user can set `compression_level`, and the compression level would be changed.
   
   After this patch, if user change the compression level, the codec will not be changed.
   
   Can we make `CodecOptions` mutate it's `compression_level`, and make `set_compression_level` set the level in `CodecOptions` directly?



##########
cpp/src/parquet/column_writer.cc:
##########
@@ -690,26 +690,40 @@ class BufferedPageWriter : public PageWriter {
 
 std::unique_ptr<PageWriter> PageWriter::Open(
     std::shared_ptr<ArrowOutputStream> sink, Compression::type codec,
-    int compression_level, ColumnChunkMetaDataBuilder* metadata,
-    int16_t row_group_ordinal, int16_t column_chunk_ordinal, MemoryPool* pool,
-    bool buffered_row_group, std::shared_ptr<Encryptor> meta_encryptor,
-    std::shared_ptr<Encryptor> data_encryptor, bool page_write_checksum_enabled,
-    ColumnIndexBuilder* column_index_builder, OffsetIndexBuilder* offset_index_builder) {
+    const std::shared_ptr<CodecOptions>& codec_options,
+    ColumnChunkMetaDataBuilder* metadata, int16_t row_group_ordinal,
+    int16_t column_chunk_ordinal, MemoryPool* pool, bool buffered_row_group,
+    std::shared_ptr<Encryptor> meta_encryptor, std::shared_ptr<Encryptor> data_encryptor,
+    bool page_write_checksum_enabled, ColumnIndexBuilder* column_index_builder,
+    OffsetIndexBuilder* offset_index_builder) {
   if (buffered_row_group) {
     return std::unique_ptr<PageWriter>(new BufferedPageWriter(
-        std::move(sink), codec, compression_level, metadata, row_group_ordinal,
+        std::move(sink), codec, *codec_options.get(), metadata, row_group_ordinal,

Review Comment:
   I think just `*codec_options` is ok?



##########
cpp/src/arrow/util/compression.h:
##########
@@ -107,6 +107,50 @@ class ARROW_EXPORT Decompressor {
   // XXX add methods for buffer size heuristics?
 };
 
+/// \brief Compression codec options
+class ARROW_EXPORT CodecOptions {
+ public:
+  CodecOptions(int compression_level = kUseDefaultCompressionLevel) {
+    compression_level_ = compression_level;
+  }
+  virtual ~CodecOptions() = default;
+
+  int compression_level_;
+};

Review Comment:
   I guess @yyang52 's idea is that `CodecOptions` can explicit built from `compression_level`, am I right?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] yyang52 commented on pull request #35886: GH-35287: [C++][Parquet] Add CodecOptions to customize the compression parameter

Posted by "yyang52 (via GitHub)" <gi...@apache.org>.
yyang52 commented on PR #35886:
URL: https://github.com/apache/arrow/pull/35886#issuecomment-1615314470

   There're 6 failing checks, but seems like not introduced by this PR?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35886: GH-35287: [C++][Parquet] Add CodecOptions to customize the compression parameter

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35886:
URL: https://github.com/apache/arrow/pull/35886#discussion_r1229753260


##########
cpp/examples/arrow/parquet_read_write.cc:
##########
@@ -165,11 +165,44 @@ arrow::Status WriteInBatches(std::string path_to_file) {
   return arrow::Status::OK();
 }
 
+arrow::Status WriteWithCodecOptions(std::string path_to_file) {
+  using parquet::ArrowWriterProperties;
+  using parquet::WriterProperties;
+
+  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::Table> table, GetTable());
+
+  // Customize codec options with compression level and window bits
+  // the default window bits is 15, here we set a small number of 12 for some use scenario
+  // with limited hisotry buffer size
+  auto codec_options = std::make_shared<arrow::util::GZipCodecOptions>();
+  codec_options->compression_level_ = 9;
+  codec_options->window_bits = 12;
+
+  // Choose compression
+  std::shared_ptr<WriterProperties> props = WriterProperties::Builder()
+                                                .compression(arrow::Compression::GZIP)
+                                                ->codec_options(codec_options)
+                                                ->build();
+
+  // Opt to store Arrow schema for easier reads back into Arrow
+  std::shared_ptr<ArrowWriterProperties> arrow_props =
+      ArrowWriterProperties::Builder().store_schema()->build();
+
+  std::shared_ptr<arrow::io::FileOutputStream> outfile;
+  ARROW_ASSIGN_OR_RAISE(outfile, arrow::io::FileOutputStream::Open(path_to_file));
+
+  ARROW_RETURN_NOT_OK(parquet::arrow::WriteTable(*table.get(),
+                                                 arrow::default_memory_pool(), outfile,
+                                                 /*chunk_size=*/3, props, arrow_props));
+  return arrow::Status::OK();
+}
+
 arrow::Status RunExamples(std::string path_to_file) {
   ARROW_RETURN_NOT_OK(WriteFullFile(path_to_file));
   ARROW_RETURN_NOT_OK(ReadFullFile(path_to_file));
   ARROW_RETURN_NOT_OK(WriteInBatches(path_to_file));
   ARROW_RETURN_NOT_OK(ReadInBatches(path_to_file));
+  ARROW_RETURN_NOT_OK(WriteWithCodecOptions(path_to_file));

Review Comment:
   I don't think it is useful to add this example. We're not showcasing other options in this file, either.



##########
cpp/src/arrow/util/compression_brotli.cc:
##########
@@ -221,6 +226,14 @@ class BrotliCodec : public Codec {
     return ptr;
   }
 
+  Status Init() override {
+    if (window_bits_ < BROTLI_MIN_WINDOW_BITS || window_bits_ > BROTLI_MAX_WINDOW_BITS) {
+      return Status::Invalid("window_bits should be between ", BROTLI_MIN_WINDOW_BITS,

Review Comment:
   Nit: make the context more explicit in the error message?
   ```suggestion
         return Status::Invalid("Brotli window_bits should be between ", BROTLI_MIN_WINDOW_BITS,
   ```



##########
cpp/src/parquet/column_writer_test.cc:
##########
@@ -118,8 +118,9 @@ class TestPrimitiveWriter : public PrimitiveTypedTest<TestType> {
 
     metadata_ = ColumnChunkMetaDataBuilder::Make(writer_properties_, this->descr_);
     std::unique_ptr<PageWriter> pager = PageWriter::Open(
-        sink_, column_properties.compression(), Codec::UseDefaultCompressionLevel(),
-        metadata_.get(), /* row_group_ordinal */ -1, /* column_chunk_ordinal*/ -1,
+        sink_, column_properties.compression(), std::make_shared<CodecOptions>(),

Review Comment:
   This change isn't necessary AFAICT.



##########
cpp/src/parquet/column_writer_test.cc:
##########
@@ -162,6 +163,25 @@ class TestPrimitiveWriter : public PrimitiveTypedTest<TestType> {
     ASSERT_NO_FATAL_FAILURE(this->ReadAndCompare(compression, num_rows, enable_checksum));
   }
 
+  void TestRequiredWithCodecOptions(
+      Encoding::type encoding, Compression::type compression, bool enable_dictionary,
+      bool enable_statistics, int64_t num_rows = SMALL_SIZE,
+      int compression_level = Codec::UseDefaultCompressionLevel(),
+      ::arrow::util::GZipFormat::type format = ::arrow::util::GZipFormat::GZIP,
+      int window_bits = ::arrow::util::kGZipDefaultWindowBits,
+      bool enable_checksum = false) {

Review Comment:
   Why not simply take a `const CodecOptions&` here?



##########
cpp/src/parquet/properties.h:
##########
@@ -154,6 +154,7 @@ class PARQUET_EXPORT ColumnProperties {
         statistics_enabled_(statistics_enabled),
         max_stats_size_(max_stats_size),
         compression_level_(Codec::UseDefaultCompressionLevel()),
+        codec_options_(std::make_shared<CodecOptions>()),

Review Comment:
   Can we just leave this null-initialized by default?



##########
cpp/src/parquet/column_writer_test.cc:
##########
@@ -522,6 +560,11 @@ TYPED_TEST(TestPrimitiveWriter, RequiredPlainWithStatsAndGzipCompression) {
   this->TestRequiredWithSettings(Encoding::PLAIN, Compression::GZIP, false, true,
                                  LARGE_SIZE);
 }
+
+TYPED_TEST(TestPrimitiveWriter, RequiredPlainWithGzipCodecOptions) {
+  this->TestRequiredWithCodecOptions(Encoding::PLAIN, Compression::GZIP, false, false,
+                                     LARGE_SIZE, 10, ::arrow::util::GZipFormat::GZIP, 12);

Review Comment:
   For readability, can you please pass parameter names explicitly?
   ```suggestion
     this->TestRequiredWithCodecOptions(Encoding::PLAIN, Compression::GZIP, false, false,
                                        LARGE_SIZE, /*xxx=*/ 10, /*xxx=*/ ::arrow::util::GZipFormat::GZIP, /*xxx=*/ 12);
   ```



##########
cpp/src/parquet/properties.h:
##########
@@ -431,6 +445,49 @@ class PARQUET_EXPORT WriterProperties {
       return this->compression_level(path->ToDotString(), compression_level);
     }
 
+    /// \brief Specify the default codec options for the compressor in
+    /// every column. Previously only compression level is supported to be customized,
+    /// with CodecOptions, more specific properties could be set by users.
+    ///
+    /// For compression level, it could be set by codec_options.compression_level. In
+    /// case a column does not have an explicitly specified compression level, the default
+    /// one would be used.
+    ///
+    /// The provided compression level is compressor specific. The user would
+    /// have to familiarize oneself with the available levels for the selected
+    /// compressor.  If the compressor does not allow for selecting different
+    /// compression levels, calling this function would not have any effect.
+    /// Parquet and Arrow do not validate the passed compression level.  If no
+    /// level is selected by the user or if the special
+    /// std::numeric_limits<int>::min() value is passed, then Arrow selects the
+    /// compression level.
+    ///
+    /// For GZip Codec, users could set window_bits and format with GZipCodecOptions.
+    /// For other Codecs, CodecOptions could be used to set compression_level. More
+    /// specific codec options to be added.

Review Comment:
   This section does not seem particularly informative either.



##########
cpp/src/parquet/properties.h:
##########
@@ -618,6 +677,7 @@ class PARQUET_EXPORT WriterProperties {
     std::unordered_map<std::string, Encoding::type> encodings_;
     std::unordered_map<std::string, Compression::type> codecs_;
     std::unordered_map<std::string, int32_t> codecs_compression_level_;

Review Comment:
   Let's remove this and only keep `codec_options_`?



##########
cpp/src/parquet/properties.h:
##########
@@ -431,6 +445,49 @@ class PARQUET_EXPORT WriterProperties {
       return this->compression_level(path->ToDotString(), compression_level);
     }

Review Comment:
   Can you change the docstrings for `compression_level()` to also mention the existence of `codec_options()` as an alternative?
   For example:
   > If other compressor-specific options need to be set in addition to the compression level, use the `codec_options` method.



##########
cpp/src/parquet/properties_test.cc:
##########
@@ -70,6 +70,37 @@ TEST(TestWriterProperties, AdvancedHandling) {
   ASSERT_EQ(ParquetDataPageVersion::V2, props->data_page_version());
 }
 
+TEST(TestWriterProperties, SetCodecOptions) {
+  WriterProperties::Builder builder;
+  builder.compression("gzip", Compression::GZIP);
+  builder.compression("zstd", Compression::ZSTD);
+  builder.compression("brotli", Compression::BROTLI);
+  auto gzip_codec_options = std::make_shared<::arrow::util::GZipCodecOptions>();
+  gzip_codec_options->compression_level_ = 9;
+  gzip_codec_options->window_bits = 12;
+  builder.codec_options("gzip", gzip_codec_options);
+  auto codec_options = std::make_shared<CodecOptions>();
+  builder.codec_options(codec_options);
+  auto brotli_codec_options = std::make_shared<::arrow::util::BrotliCodecOptions>();
+  brotli_codec_options->compression_level_ = 11;
+  brotli_codec_options->window_bits = 20;
+  builder.codec_options("brotli", brotli_codec_options);
+  std::shared_ptr<WriterProperties> props = builder.build();
+
+  ASSERT_EQ(9,
+            props->codec_options(ColumnPath::FromDotString("gzip"))->compression_level_);
+  ASSERT_EQ(12, std::dynamic_pointer_cast<::arrow::util::GZipCodecOptions>(
+                    props->codec_options(ColumnPath::FromDotString("gzip")))
+                    ->window_bits);
+  ASSERT_EQ(Codec::UseDefaultCompressionLevel(),
+            props->codec_options(ColumnPath::FromDotString("zstd"))->compression_level_);

Review Comment:
   This is not very interesting as the default compression level would be used in any case. Can you pass a non-default value above?



##########
cpp/src/parquet/column_writer.cc:
##########
@@ -36,7 +36,7 @@
 #include "arrow/util/bit_util.h"
 #include "arrow/util/bitmap_ops.h"
 #include "arrow/util/checked_cast.h"
-#include "arrow/util/compression.h"
+

Review Comment:
   Can you remove the spurious newline?



##########
cpp/src/arrow/util/compression_zlib.cc:
##########
@@ -461,6 +468,10 @@ class GZipCodec : public Codec {
   }
 
   Status Init() override {
+    if (window_bits_ < kGZipMinWindowBits || window_bits_ > kGZipMaxWindowBits) {
+      return Status::Invalid("window_bits should be between ", kGZipMinWindowBits,

Review Comment:
   Nit
   ```suggestion
         return Status::Invalid("GZip window_bits should be between ", kGZipMinWindowBits,
   ```



##########
cpp/src/parquet/properties.h:
##########
@@ -174,6 +175,16 @@ class PARQUET_EXPORT ColumnProperties {
 
   void set_compression_level(int compression_level) {
     compression_level_ = compression_level;

Review Comment:
   +1, I think we can just keep the codec options.



##########
cpp/src/parquet/properties.h:
##########
@@ -431,6 +445,49 @@ class PARQUET_EXPORT WriterProperties {
       return this->compression_level(path->ToDotString(), compression_level);
     }
 
+    /// \brief Specify the default codec options for the compressor in
+    /// every column. Previously only compression level is supported to be customized,
+    /// with CodecOptions, more specific properties could be set by users.
+    ///
+    /// For compression level, it could be set by codec_options.compression_level. In
+    /// case a column does not have an explicitly specified compression level, the default
+    /// one would be used.

Review Comment:
   This seems needlessly wordy.
   ```suggestion
       /// \brief Specify the default codec options for the compressor in
       /// every column.
       ///
       /// The codec options allow configuring the compression level as well
       /// as other codec-specific options.
   ```



##########
cpp/src/parquet/properties.h:
##########
@@ -431,6 +445,49 @@ class PARQUET_EXPORT WriterProperties {
       return this->compression_level(path->ToDotString(), compression_level);
     }
 
+    /// \brief Specify the default codec options for the compressor in
+    /// every column. Previously only compression level is supported to be customized,
+    /// with CodecOptions, more specific properties could be set by users.
+    ///
+    /// For compression level, it could be set by codec_options.compression_level. In
+    /// case a column does not have an explicitly specified compression level, the default
+    /// one would be used.
+    ///
+    /// The provided compression level is compressor specific. The user would
+    /// have to familiarize oneself with the available levels for the selected
+    /// compressor.  If the compressor does not allow for selecting different
+    /// compression levels, calling this function would not have any effect.
+    /// Parquet and Arrow do not validate the passed compression level.  If no
+    /// level is selected by the user or if the special
+    /// std::numeric_limits<int>::min() value is passed, then Arrow selects the
+    /// compression level.

Review Comment:
   Remove this section?



##########
cpp/src/parquet/column_writer.h:
##########
@@ -87,8 +88,9 @@ class PARQUET_EXPORT PageWriter {
 
   static std::unique_ptr<PageWriter> Open(
       std::shared_ptr<ArrowOutputStream> sink, Compression::type codec,
-      int compression_level, ColumnChunkMetaDataBuilder* metadata,
-      int16_t row_group_ordinal = -1, int16_t column_chunk_ordinal = -1,
+      const std::shared_ptr<CodecOptions>& codec_options,

Review Comment:
   Also, unless the `PageWriter` needs to keep ownership of the codec options, this should simply be `const CodecOptions& = {}`.



##########
cpp/src/parquet/column_writer.h:
##########
@@ -87,8 +88,9 @@ class PARQUET_EXPORT PageWriter {
 
   static std::unique_ptr<PageWriter> Open(
       std::shared_ptr<ArrowOutputStream> sink, Compression::type codec,
-      int compression_level, ColumnChunkMetaDataBuilder* metadata,
-      int16_t row_group_ordinal = -1, int16_t column_chunk_ordinal = -1,
+      const std::shared_ptr<CodecOptions>& codec_options,

Review Comment:
   +1



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35886: GH-35287: [C++][Parquet] Add CodecOptions to customize the compression parameter

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35886:
URL: https://github.com/apache/arrow/pull/35886#discussion_r1219333466


##########
cpp/src/parquet/column_writer.h:
##########
@@ -21,6 +21,7 @@
 #include <cstring>
 #include <memory>
 
+#include "arrow/util/compression.h"

Review Comment:
   ```suggestion
   #include "arrow/type_fwd.h"
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] yyang52 commented on a diff in pull request #35886: GH-35287: [C++][Parquet] Add CodecOptions to customize the compression parameter

Posted by "yyang52 (via GitHub)" <gi...@apache.org>.
yyang52 commented on code in PR #35886:
URL: https://github.com/apache/arrow/pull/35886#discussion_r1236679927


##########
cpp/src/parquet/properties.h:
##########
@@ -190,7 +193,14 @@ class PARQUET_EXPORT ColumnProperties {
 
   size_t max_statistics_size() const { return max_stats_size_; }
 
-  int compression_level() const { return compression_level_; }
+  int compression_level() const { return codec_options_->compression_level_; }
+
+  const std::shared_ptr<CodecOptions>& codec_options() const {
+    if (!codec_options_) {
+      return std::make_shared<CodecOptions>();

Review Comment:
   this seems to be fixed, but there are Python errors that I have no ideas on the hints



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35886: GH-35287: [C++][Parquet] Add CodecOptions to customize the compression parameter

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35886:
URL: https://github.com/apache/arrow/pull/35886#discussion_r1246201321


##########
cpp/src/arrow/util/compression.h:
##########
@@ -107,6 +107,50 @@ class ARROW_EXPORT Decompressor {
   // XXX add methods for buffer size heuristics?
 };
 
+/// \brief Compression codec options
+class ARROW_EXPORT CodecOptions {
+ public:
+  CodecOptions(int compression_level = kUseDefaultCompressionLevel) {
+    compression_level_ = compression_level;
+  }
+  virtual ~CodecOptions() = default;
+
+  int compression_level_;
+};

Review Comment:
   Doesn't `CodecOptions{compression_level}` work?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #35886: GH-35287: [C++][Parquet] Add CodecOptions to customize the compression parameter

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #35886:
URL: https://github.com/apache/arrow/pull/35886#discussion_r1247440029


##########
cpp/src/parquet/platform.h:
##########
@@ -24,7 +24,8 @@
 #include "arrow/io/interfaces.h"  // IWYU pragma: export
 #include "arrow/status.h"         // IWYU pragma: export
 #include "arrow/type_fwd.h"       // IWYU pragma: export
-#include "arrow/util/macros.h"    // IWYU pragma: export
+#include "arrow/util/compression.h"

Review Comment:
   Maybe forward declr like `default_memory_pool()` can be used?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] yyang52 commented on a diff in pull request #35886: GH-35287: [C++][Parquet] Add CodecOptions to customize the compression parameter

Posted by "yyang52 (via GitHub)" <gi...@apache.org>.
yyang52 commented on code in PR #35886:
URL: https://github.com/apache/arrow/pull/35886#discussion_r1223975847


##########
cpp/src/arrow/util/compression.h:
##########
@@ -107,6 +107,46 @@ class ARROW_EXPORT Decompressor {
   // XXX add methods for buffer size heuristics?
 };
 
+/// \brief Compression codec options
+class ARROW_EXPORT CodecOptions {
+ public:
+  CodecOptions(int compression_level = kUseDefaultCompressionLevel) {
+    compression_level_ = compression_level;
+  }
+  virtual ~CodecOptions() = default;
+
+  int compression_level_;
+};
+
+// ----------------------------------------------------------------------
+// gzip codec options implementation
+
+struct GZipFormat {
+  enum type {
+    ZLIB,
+    DEFLATE,
+    GZIP,
+  };
+};
+
+constexpr int kGZipDefaultWindowBits = 15;
+
+class ARROW_EXPORT GZipCodecOptions : public CodecOptions {
+ public:
+  GZipFormat::type gzip_format = GZipFormat::GZIP;
+  int window_bits = kGZipDefaultWindowBits;
+};
+
+// ----------------------------------------------------------------------
+// brotli codec options implementation
+
+constexpr int kBrotliDefaultWindowBits = 22;

Review Comment:
   yes, I think that's just how it is implemented now in `GZip/BrotliCodecOptions`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] yyang52 commented on a diff in pull request #35886: GH-35287: [C++][Parquet] Add CodecOptions to customize the compression parameter

Posted by "yyang52 (via GitHub)" <gi...@apache.org>.
yyang52 commented on code in PR #35886:
URL: https://github.com/apache/arrow/pull/35886#discussion_r1223990519


##########
cpp/src/arrow/util/compression.h:
##########
@@ -107,6 +107,46 @@ class ARROW_EXPORT Decompressor {
   // XXX add methods for buffer size heuristics?
 };
 
+/// \brief Compression codec options
+class ARROW_EXPORT CodecOptions {
+ public:
+  CodecOptions(int compression_level = kUseDefaultCompressionLevel) {
+    compression_level_ = compression_level;
+  }
+  virtual ~CodecOptions() = default;
+
+  int compression_level_;
+};
+
+// ----------------------------------------------------------------------
+// gzip codec options implementation
+
+struct GZipFormat {
+  enum type {
+    ZLIB,
+    DEFLATE,
+    GZIP,
+  };
+};
+
+constexpr int kGZipDefaultWindowBits = 15;
+
+class ARROW_EXPORT GZipCodecOptions : public CodecOptions {
+ public:
+  GZipFormat::type gzip_format = GZipFormat::GZIP;
+  int window_bits = kGZipDefaultWindowBits;
+};
+
+// ----------------------------------------------------------------------
+// brotli codec options implementation
+
+constexpr int kBrotliDefaultWindowBits = 22;

Review Comment:
   yeah it seems better to use their `BROTLI_DEFAULT_WINDOW`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on a diff in pull request #35886: GH-35287: [C++][Parquet] Add CodecOptions to customize the compression parameter

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #35886:
URL: https://github.com/apache/arrow/pull/35886#discussion_r1219789164


##########
cpp/src/arrow/util/compression.h:
##########
@@ -124,7 +168,15 @@ class ARROW_EXPORT Codec {
 
   /// \brief Create a codec for the given compression algorithm
   static Result<std::unique_ptr<Codec>> Create(
-      Compression::type codec, int compression_level = kUseDefaultCompressionLevel);
+      Compression::type codec,
+      const std::shared_ptr<CodecOptions>& codec_options =
+          std::make_shared<CodecOptions>(kUseDefaultCompressionLevel));
+
+  /// \brief Create a codec for the given compression algorithm
+  /// \deprecated and left for backwards compatibility.
+  ARROW_DEPRECATED("Use CodecOptions to create Codec")

Review Comment:
   Sounds reasonable. I was thinking it is always preferable to use the new `CodecOptions` to set the level. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] yyang52 commented on pull request #35886: GH-35287: [C++][Parquet] Add CodecOptions to customize the compression parameter

Posted by "yyang52 (via GitHub)" <gi...@apache.org>.
yyang52 commented on PR #35886:
URL: https://github.com/apache/arrow/pull/35886#issuecomment-1582147882

   > @yyang52 There are Python test failures on CI.
   > 
   > Also, is it possible to pass a base `CodecOptions` when a derived class exists? (e.g. for gzip). Ideally it should be supported.
   
   Yes, I think it supports that (as checked in TEST(TestCodecMisc, SpecifyCompressionLevel) to pass base CodecOptions to different Codecs)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on pull request #35886: GH-35287: [C++][Parquet] Add CodecOptions to customize the compression parameter

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on PR #35886:
URL: https://github.com/apache/arrow/pull/35886#issuecomment-1631023917

   Thanks a lot for your contribution @yyang52 ! I hope you didn't mind the delays.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35886: GH-35287: [C++][Parquet] Add CodecOptions to customize the compression parameter

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35886:
URL: https://github.com/apache/arrow/pull/35886#discussion_r1229796770


##########
cpp/src/parquet/properties_test.cc:
##########
@@ -70,6 +70,37 @@ TEST(TestWriterProperties, AdvancedHandling) {
   ASSERT_EQ(ParquetDataPageVersion::V2, props->data_page_version());
 }
 
+TEST(TestWriterProperties, SetCodecOptions) {
+  WriterProperties::Builder builder;
+  builder.compression("gzip", Compression::GZIP);
+  builder.compression("zstd", Compression::ZSTD);
+  builder.compression("brotli", Compression::BROTLI);
+  auto gzip_codec_options = std::make_shared<::arrow::util::GZipCodecOptions>();
+  gzip_codec_options->compression_level_ = 9;
+  gzip_codec_options->window_bits = 12;
+  builder.codec_options("gzip", gzip_codec_options);
+  auto codec_options = std::make_shared<CodecOptions>();
+  builder.codec_options(codec_options);
+  auto brotli_codec_options = std::make_shared<::arrow::util::BrotliCodecOptions>();
+  brotli_codec_options->compression_level_ = 11;
+  brotli_codec_options->window_bits = 20;
+  builder.codec_options("brotli", brotli_codec_options);
+  std::shared_ptr<WriterProperties> props = builder.build();
+
+  ASSERT_EQ(9,
+            props->codec_options(ColumnPath::FromDotString("gzip"))->compression_level_);
+  ASSERT_EQ(12, std::dynamic_pointer_cast<::arrow::util::GZipCodecOptions>(
+                    props->codec_options(ColumnPath::FromDotString("gzip")))
+                    ->window_bits);
+  ASSERT_EQ(Codec::UseDefaultCompressionLevel(),
+            props->codec_options(ColumnPath::FromDotString("zstd"))->compression_level_);

Review Comment:
   This is not very interesting as the default compression level would be used in any case. Can you pass a non-default value for `codec_options` above?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35886: GH-35287: [C++][Parquet] Add CodecOptions to customize the compression parameter

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35886:
URL: https://github.com/apache/arrow/pull/35886#discussion_r1254509177


##########
cpp/src/parquet/column_writer.h:
##########
@@ -85,6 +87,22 @@ class PARQUET_EXPORT PageWriter {
  public:
   virtual ~PageWriter() {}
 
+  static std::unique_ptr<PageWriter> Open(
+      std::shared_ptr<ArrowOutputStream> sink, Compression::type codec,
+      ColumnChunkMetaDataBuilder* metadata, int16_t row_group_ordinal = -1,
+      int16_t column_chunk_ordinal = -1,
+      ::arrow::MemoryPool* pool = ::arrow::default_memory_pool(),
+      bool buffered_row_group = false,
+      std::shared_ptr<Encryptor> header_encryptor = NULLPTR,
+      std::shared_ptr<Encryptor> data_encryptor = NULLPTR,
+      bool page_write_checksum_enabled = false,
+      // column_index_builder MUST outlive the PageWriter
+      ColumnIndexBuilder* column_index_builder = NULLPTR,
+      // offset_index_builder MUST outlive the PageWriter
+      OffsetIndexBuilder* offset_index_builder = NULLPTR,
+      const CodecOptions& codec_options = CodecOptions{});
+
+  ARROW_DEPRECATED("Use the one with CodecOptions instead")

Review Comment:
   ```suggestion
     ARROW_DEPRECATED("Deprecated in 13.0.0. Use CodecOptions-taking overload instead.")
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #35886: GH-35287: [C++][Parquet] Add CodecOptions to customize the compression parameter

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #35886:
URL: https://github.com/apache/arrow/pull/35886#discussion_r1229501565


##########
cpp/src/parquet/properties.h:
##########
@@ -174,6 +175,16 @@ class PARQUET_EXPORT ColumnProperties {
 
   void set_compression_level(int compression_level) {
     compression_level_ = compression_level;

Review Comment:
   Can we just remove `compression_level_`?



##########
cpp/src/parquet/properties.h:
##########
@@ -192,6 +197,8 @@ class PARQUET_EXPORT ColumnProperties {
 
   int compression_level() const { return compression_level_; }

Review Comment:
   Can we just get it from `codec_options_`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] yyang52 commented on a diff in pull request #35886: GH-35287: [C++][Parquet] Add CodecOptions to customize the compression parameter

Posted by "yyang52 (via GitHub)" <gi...@apache.org>.
yyang52 commented on code in PR #35886:
URL: https://github.com/apache/arrow/pull/35886#discussion_r1271627915


##########
cpp/src/parquet/file_writer.cc:
##########
@@ -291,12 +304,24 @@ class RowGroupSerializer : public RowGroupWriter::Contents {
       auto oi_builder = page_index_builder_ && properties_->page_index_enabled(path)
                             ? page_index_builder_->GetOffsetIndexBuilder(column_ordinal)
                             : nullptr;
-      std::unique_ptr<PageWriter> pager = PageWriter::Open(
-          sink_, properties_->compression(path), properties_->compression_level(path),
-          col_meta, static_cast<int16_t>(row_group_ordinal_),
-          static_cast<int16_t>(column_ordinal), properties_->memory_pool(),
-          buffered_row_group_, meta_encryptor, data_encryptor,
-          properties_->page_checksum_enabled(), ci_builder, oi_builder);
+      auto codec_options = properties_->codec_options(path)
+                               ? (properties_->codec_options(path)).get()
+                               : nullptr;
+
+      std::unique_ptr<PageWriter> pager;
+      if (!codec_options) {
+        pager = PageWriter::Open(sink_, properties_->compression(path), col_meta,
+                                 row_group_ordinal_, static_cast<int16_t>(column_ordinal),
+                                 properties_->memory_pool(), false, meta_encryptor,

Review Comment:
   Sorry for this missing and thanks for the fix!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #35886: GH-35287: [C++][Parquet] Add CodecOptions to customize the compression parameter

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #35886:
URL: https://github.com/apache/arrow/pull/35886#discussion_r1223985498


##########
cpp/src/arrow/util/compression.h:
##########
@@ -107,6 +107,46 @@ class ARROW_EXPORT Decompressor {
   // XXX add methods for buffer size heuristics?
 };
 
+/// \brief Compression codec options
+class ARROW_EXPORT CodecOptions {
+ public:
+  CodecOptions(int compression_level = kUseDefaultCompressionLevel) {
+    compression_level_ = compression_level;
+  }
+  virtual ~CodecOptions() = default;
+
+  int compression_level_;
+};
+
+// ----------------------------------------------------------------------
+// gzip codec options implementation
+
+struct GZipFormat {
+  enum type {
+    ZLIB,
+    DEFLATE,
+    GZIP,
+  };
+};
+
+constexpr int kGZipDefaultWindowBits = 15;
+
+class ARROW_EXPORT GZipCodecOptions : public CodecOptions {
+ public:
+  GZipFormat::type gzip_format = GZipFormat::GZIP;
+  int window_bits = kGZipDefaultWindowBits;
+};
+
+// ----------------------------------------------------------------------
+// brotli codec options implementation
+
+constexpr int kBrotliDefaultWindowBits = 22;

Review Comment:
   Sure, I notice that this patch hard code the encoder's internal into `compression.h`, so I'm afraid if they're changed in the 3rd libraries, we'll get confused here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #35886: GH-35287: [C++][Parquet] Add CodecOptions to customize the compression parameter

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #35886:
URL: https://github.com/apache/arrow/pull/35886#issuecomment-1630896663

   Revision: 73b401800f968fa0ed9d80caee41b8f8b1face85
   
   Submitted crossbow builds: [ursacomputing/crossbow @ actions-e0a8b0150b](https://github.com/ursacomputing/crossbow/branches/all?query=actions-e0a8b0150b)
   
   |Task|Status|
   |----|------|
   |test-alpine-linux-cpp|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-e0a8b0150b-github-test-alpine-linux-cpp)](https://github.com/ursacomputing/crossbow/actions/runs/5520955653/jobs/10068347443)|
   |test-build-cpp-fuzz|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-e0a8b0150b-github-test-build-cpp-fuzz)](https://github.com/ursacomputing/crossbow/actions/runs/5520955175/jobs/10068346360)|
   |test-conda-cpp|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-e0a8b0150b-github-test-conda-cpp)](https://github.com/ursacomputing/crossbow/actions/runs/5520958349/jobs/10068354978)|
   |test-conda-cpp-valgrind|[![Azure](https://dev.azure.com/ursacomputing/crossbow/_apis/build/status/ursacomputing.crossbow?branchName=actions-e0a8b0150b-azure-test-conda-cpp-valgrind)](https://github.com/ursacomputing/crossbow/runs/14948001659)|
   |test-cuda-cpp|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-e0a8b0150b-github-test-cuda-cpp)](https://github.com/ursacomputing/crossbow/actions/runs/5520957384/jobs/10068352489)|
   |test-debian-11-cpp-amd64|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-e0a8b0150b-github-test-debian-11-cpp-amd64)](https://github.com/ursacomputing/crossbow/actions/runs/5520956475/jobs/10068349739)|
   |test-debian-11-cpp-i386|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-e0a8b0150b-github-test-debian-11-cpp-i386)](https://github.com/ursacomputing/crossbow/actions/runs/5520959463/jobs/10068357755)|
   |test-fedora-35-cpp|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-e0a8b0150b-github-test-fedora-35-cpp)](https://github.com/ursacomputing/crossbow/actions/runs/5520958602/jobs/10068355649)|
   |test-ubuntu-20.04-cpp|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-e0a8b0150b-github-test-ubuntu-20.04-cpp)](https://github.com/ursacomputing/crossbow/actions/runs/5520957967/jobs/10068354058)|
   |test-ubuntu-20.04-cpp-bundled|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-e0a8b0150b-github-test-ubuntu-20.04-cpp-bundled)](https://github.com/ursacomputing/crossbow/actions/runs/5520957625/jobs/10068353218)|
   |test-ubuntu-20.04-cpp-minimal-with-formats|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-e0a8b0150b-github-test-ubuntu-20.04-cpp-minimal-with-formats)](https://github.com/ursacomputing/crossbow/actions/runs/5520956159/jobs/10068348793)|
   |test-ubuntu-20.04-cpp-thread-sanitizer|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-e0a8b0150b-github-test-ubuntu-20.04-cpp-thread-sanitizer)](https://github.com/ursacomputing/crossbow/actions/runs/5520955939/jobs/10068348251)|
   |test-ubuntu-22.04-cpp|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-e0a8b0150b-github-test-ubuntu-22.04-cpp)](https://github.com/ursacomputing/crossbow/actions/runs/5520959868/jobs/10068358688)|
   |test-ubuntu-22.04-cpp-20|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-e0a8b0150b-github-test-ubuntu-22.04-cpp-20)](https://github.com/ursacomputing/crossbow/actions/runs/5520957134/jobs/10068351656)|


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] yyang52 commented on pull request #35886: GH-35287: [C++][Parquet] Add CodecOptions to customize the compression parameter

Posted by "yyang52 (via GitHub)" <gi...@apache.org>.
yyang52 commented on PR #35886:
URL: https://github.com/apache/arrow/pull/35886#issuecomment-1624505149

   > @yyang52 There are a couple of comments left to address, can you take a look?
   
   Doing some fix, may you help to take a look and trigger remaining CI checks? Thanks! @pitrou 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on pull request #35886: GH-35287: [C++][Parquet] Add CodecOptions to customize the compression parameter

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on PR #35886:
URL: https://github.com/apache/arrow/pull/35886#issuecomment-1623033713

   @yyang52 We're busy preparing a release right now. You might start by merging/rebasing from main to get a more passing CI.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on a diff in pull request #35886: GH-35287: [C++][Parquet] Add CodecOptions to customize the compression parameter

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #35886:
URL: https://github.com/apache/arrow/pull/35886#discussion_r1218967222


##########
cpp/src/arrow/util/compression.h:
##########
@@ -124,7 +168,9 @@ class ARROW_EXPORT Codec {
 
   /// \brief Create a codec for the given compression algorithm
   static Result<std::unique_ptr<Codec>> Create(
-      Compression::type codec, int compression_level = kUseDefaultCompressionLevel);
+      Compression::type codec,

Review Comment:
   I mean deprecating the existing one.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on pull request #35886: GH-35287: [C++][Parquet] Add CodecOptions to customize the compression parameter

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on PR #35886:
URL: https://github.com/apache/arrow/pull/35886#issuecomment-1578349795

   @yyang52 There are Python test failures on CI.
   
   Also, is it possible to pass a base `CodecOptions` when a derived class exists? (e.g. for gzip). Ideally it should be supported.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35886: GH-35287: [C++][Parquet] Add CodecOptions to customize the compression parameter

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35886:
URL: https://github.com/apache/arrow/pull/35886#discussion_r1241880050


##########
cpp/src/parquet/platform.h:
##########
@@ -24,7 +24,8 @@
 #include "arrow/io/interfaces.h"  // IWYU pragma: export
 #include "arrow/status.h"         // IWYU pragma: export
 #include "arrow/type_fwd.h"       // IWYU pragma: export
-#include "arrow/util/macros.h"    // IWYU pragma: export
+#include "arrow/util/compression.h"

Review Comment:
   I don't think adding this inclusion is necessary, right?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35886: GH-35287: [C++][Parquet] Add CodecOptions to customize the compression parameter

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35886:
URL: https://github.com/apache/arrow/pull/35886#discussion_r1246213884


##########
cpp/src/arrow/util/compression.h:
##########
@@ -107,6 +107,50 @@ class ARROW_EXPORT Decompressor {
   // XXX add methods for buffer size heuristics?
 };
 
+/// \brief Compression codec options
+class ARROW_EXPORT CodecOptions {
+ public:
+  CodecOptions(int compression_level = kUseDefaultCompressionLevel) {
+    compression_level_ = compression_level;
+  }
+  virtual ~CodecOptions() = default;
+
+  int compression_level_;
+};

Review Comment:
   Ah, sorry! Let's forget it then.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] yyang52 commented on a diff in pull request #35886: GH-35287: [C++][Parquet] Add CodecOptions to customize the compression parameter

Posted by "yyang52 (via GitHub)" <gi...@apache.org>.
yyang52 commented on code in PR #35886:
URL: https://github.com/apache/arrow/pull/35886#discussion_r1214139719


##########
cpp/src/parquet/properties.h:
##########
@@ -172,8 +172,8 @@ class PARQUET_EXPORT ColumnProperties {
     max_stats_size_ = max_stats_size;
   }
 
-  void set_compression_level(int compression_level) {

Review Comment:
   For now I replace the compression_level with CodecOptions, could also keep that for compatibility :)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] yyang52 commented on a diff in pull request #35886: GH-35287: [C++][Parquet] Add CodecOptions to customize the compression parameter

Posted by "yyang52 (via GitHub)" <gi...@apache.org>.
yyang52 commented on code in PR #35886:
URL: https://github.com/apache/arrow/pull/35886#discussion_r1217775208


##########
cpp/src/parquet/properties.h:
##########
@@ -172,8 +172,8 @@ class PARQUET_EXPORT ColumnProperties {
     max_stats_size_ = max_stats_size;
   }
 
-  void set_compression_level(int compression_level) {

Review Comment:
   seems like python codec api also takes compression_level as input, which also rely on cpp API (where we changed to Codec::Create(Compression::type, CodecOptions)). should we also modify python API or keep Codec::Create(Compression::type, compression_level)?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org