You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "davisusanibar (via GitHub)" <gi...@apache.org> on 2023/04/04 00:13:34 UTC

[GitHub] [arrow] davisusanibar opened a new pull request, #34873: GH-34749 : [Java] Make Zstd compression level configurable

davisusanibar opened a new pull request, #34873:
URL: https://github.com/apache/arrow/pull/34873

   <!--
   Thanks for opening a pull request!
   If this is your first pull request you can find detailed information on how 
   to contribute here:
     * [New Contributor's Guide](https://arrow.apache.org/docs/dev/developers/guide/step_by_step/pr_lifecycle.html#reviews-and-merge-of-the-pull-request)
     * [Contributing Overview](https://arrow.apache.org/docs/dev/developers/overview.html)
   
   
   If this is not a [minor PR](https://github.com/apache/arrow/blob/main/CONTRIBUTING.md#Minor-Fixes). Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose
   
   Opening GitHub issues ahead of time contributes to the [Openness](http://theapacheway.com/open/#:~:text=Openness%20allows%20new%20users%20the,must%20happen%20in%20the%20open.) of the Apache Arrow project.
   
   Then could you also rename the pull request title in the following format?
   
       GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}
   
   or
   
       MINOR: [${COMPONENT}] ${SUMMARY}
   
   In the case of PARQUET issues on JIRA the title also supports:
   
       PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}
   
   -->
   
   ### Rationale for this change
   
   <!--
    Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
    Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.  
   -->
   
   Closes: https://github.com/apache/arrow/issues/34749
   
   ### What changes are included in this PR?
   
   <!--
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   Make compression level configurable for Zstd
   
   ### Are these changes tested?
   
   <!--
   We typically require tests for all PRs in order to:
   1. Prevent the code from being accidentally broken by subsequent changes
   2. Serve as another way to document the expected behavior of the code
   
   If tests are not included in your PR, please explain why (for example, are they covered by existing tests)?
   -->
   
   Yes
   
   ### Are there any user-facing changes?
   
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!--
   If there are any breaking changes to public APIs, please uncomment the line below and explain which changes are breaking.
   -->
   <!-- **This PR includes breaking changes to public APIs.** -->
   
   <!--
   Please uncomment the line below (and provide explanation) if the changes fix either (a) a security vulnerability, (b) a bug that caused incorrect or invalid data to be produced, or (c) a bug that causes a crash (even when the API contract is upheld). We use this to highlight fixes to issues that may affect users without their knowledge. For this reason, fixing bugs that cause errors don't count, since those are usually obvious.
   -->
   <!-- **This PR contains a "Critical Fix".** -->
   
   No


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] davisusanibar commented on pull request #34873: GH-34749 : [Java] Make Zstd compression level configurable

Posted by "davisusanibar (via GitHub)" <gi...@apache.org>.
davisusanibar commented on PR #34873:
URL: https://github.com/apache/arrow/pull/34873#issuecomment-1516611325

   Hi @lidavidm do you have any comments for this PR?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] davisusanibar commented on pull request #34873: GH-34749 : [Java] Make Zstd compression level configurable

Posted by "davisusanibar (via GitHub)" <gi...@apache.org>.
davisusanibar commented on PR #34873:
URL: https://github.com/apache/arrow/pull/34873#issuecomment-1576953934

   @lidavidm could you help me to retry job failed? or Is it needed to review that?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] lidavidm commented on a diff in pull request #34873: GH-34749 : [Java] Make Zstd compression level configurable

Posted by "lidavidm (via GitHub)" <gi...@apache.org>.
lidavidm commented on code in PR #34873:
URL: https://github.com/apache/arrow/pull/34873#discussion_r1174222296


##########
java/vector/src/main/java/org/apache/arrow/vector/ipc/ArrowStreamWriter.java:
##########
@@ -76,12 +76,13 @@ public ArrowStreamWriter(VectorSchemaRoot root, DictionaryProvider provider, Wri
    * @param option IPC write options
    * @param compressionFactory Compression codec factory
    * @param codecType Codec type
+   * @param compressionLevel Compression level
    * @param out WritableByteChannel for writing.
    */
   public ArrowStreamWriter(VectorSchemaRoot root, DictionaryProvider provider, WritableByteChannel out,
                            IpcOption option, CompressionCodec.Factory compressionFactory,
-                           CompressionUtil.CodecType codecType) {

Review Comment:
   Ditto here.



##########
java/vector/src/main/java/org/apache/arrow/vector/ipc/ArrowWriter.java:
##########
@@ -72,7 +72,7 @@ protected ArrowWriter(VectorSchemaRoot root, DictionaryProvider provider, Writab
   }
 
   protected ArrowWriter(VectorSchemaRoot root, DictionaryProvider provider, WritableByteChannel out, IpcOption option) {
-    this(root, provider, out, option, NoCompressionCodec.Factory.INSTANCE, CompressionUtil.CodecType.NO_COMPRESSION);
+    this(root, provider, out, option, NoCompressionCodec.Factory.INSTANCE, CompressionUtil.CodecType.NO_COMPRESSION, 3);

Review Comment:
   Use a constant for the integer?



##########
java/vector/src/main/java/org/apache/arrow/vector/ipc/ArrowFileWriter.java:
##########
@@ -73,8 +73,8 @@ public ArrowFileWriter(VectorSchemaRoot root, DictionaryProvider provider, Writa
 
   public ArrowFileWriter(VectorSchemaRoot root, DictionaryProvider provider, WritableByteChannel out,
                          Map<String, String> metaData, IpcOption option, CompressionCodec.Factory compressionFactory,
-                         CompressionUtil.CodecType codecType) {
-    super(root, provider, out, option, compressionFactory, codecType);
+                         CompressionUtil.CodecType codecType, int compressionLevel) {

Review Comment:
   Well, now this is a breaking change, since the old constructor isn't available anymore.



##########
java/vector/src/main/java/org/apache/arrow/vector/ipc/ArrowWriter.java:
##########
@@ -119,6 +119,53 @@ protected ArrowWriter(VectorSchemaRoot root, DictionaryProvider provider, Writab
     this.schema = new Schema(fields, root.getSchema().getCustomMetadata());
   }
 
+  /**
+   * Note: fields are not closed when the writer is closed.
+   *
+   * @param root               the vectors to write to the output
+   * @param provider           where to find the dictionaries
+   * @param out                the output where to write
+   * @param option             IPC write options
+   * @param compressionFactory Compression codec factory
+   * @param codecType          Compression codec
+   * @param compressionLevel   Compression level
+   */
+  protected ArrowWriter(VectorSchemaRoot root, DictionaryProvider provider, WritableByteChannel out, IpcOption option,

Review Comment:
   This now breaks compatibility. I don't think that's what Gang was suggesting with a delegate constructor.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] lidavidm commented on pull request #34873: GH-34749 : [Java] Make Zstd compression level configurable

Posted by "lidavidm (via GitHub)" <gi...@apache.org>.
lidavidm commented on PR #34873:
URL: https://github.com/apache/arrow/pull/34873#issuecomment-1569009709

   @davisusanibar can you rebase and/or take a look at the build error?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on a diff in pull request #34873: GH-34749 : [Java] Make Zstd compression level configurable

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #34873:
URL: https://github.com/apache/arrow/pull/34873#discussion_r1175926144


##########
java/vector/src/main/java/org/apache/arrow/vector/ipc/ArrowWriter.java:
##########
@@ -53,6 +53,8 @@ public abstract class ArrowWriter implements AutoCloseable {
 
   protected static final Logger LOGGER = LoggerFactory.getLogger(ArrowWriter.class);
 
+  protected static final int compressionLevel = 3;

Review Comment:
   The default level is tricky here:
   - LZ4 default level is 1 with range from 1 to 9
   - ZSTD default level is 3 with range from 1 to 22 (also negative levels -1 to -5 are supported)
   
   There is a [discussion](https://github.com/apache/arrow/issues/35287) on the C++ side to add support of `Codec::Options`. Would it be good to do similar things on the Java side so we can accommodate different default levels per codec?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] davisusanibar commented on a diff in pull request #34873: GH-34749 : [Java] Make Zstd compression level configurable

Posted by "davisusanibar (via GitHub)" <gi...@apache.org>.
davisusanibar commented on code in PR #34873:
URL: https://github.com/apache/arrow/pull/34873#discussion_r1177305878


##########
java/vector/src/main/java/org/apache/arrow/vector/ipc/ArrowWriter.java:
##########
@@ -53,6 +53,8 @@ public abstract class ArrowWriter implements AutoCloseable {
 
   protected static final Logger LOGGER = LoggerFactory.getLogger(ArrowWriter.class);
 
+  protected static final int compressionLevel = 3;

Review Comment:
   Updated



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on pull request #34873: GH-34749 : [Java] Make Zstd compression level configurable

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on PR #34873:
URL: https://github.com/apache/arrow/pull/34873#issuecomment-1576956310

   > @lidavidm could you help me to retry job failed? or Is it needed to review that?
   
   I just re-triggered the failed one.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] davisusanibar commented on a diff in pull request #34873: GH-34749 : [Java] Make Zstd compression level configurable

Posted by "davisusanibar (via GitHub)" <gi...@apache.org>.
davisusanibar commented on code in PR #34873:
URL: https://github.com/apache/arrow/pull/34873#discussion_r1174294276


##########
java/vector/src/main/java/org/apache/arrow/vector/ipc/ArrowWriter.java:
##########
@@ -119,6 +119,53 @@ protected ArrowWriter(VectorSchemaRoot root, DictionaryProvider provider, Writab
     this.schema = new Schema(fields, root.getSchema().getCustomMetadata());
   }
 
+  /**
+   * Note: fields are not closed when the writer is closed.
+   *
+   * @param root               the vectors to write to the output
+   * @param provider           where to find the dictionaries
+   * @param out                the output where to write
+   * @param option             IPC write options
+   * @param compressionFactory Compression codec factory
+   * @param codecType          Compression codec
+   * @param compressionLevel   Compression level
+   */
+  protected ArrowWriter(VectorSchemaRoot root, DictionaryProvider provider, WritableByteChannel out, IpcOption option,

Review Comment:
   Oh, I've decided to refactor constructor to add new parameter needed, .... otherwise will be to maintain the constructor as it is now and add a new constructor for new compression_level args needed. Please let me know how to proceed?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] lidavidm commented on a diff in pull request #34873: GH-34749 : [Java] Make Zstd compression level configurable

Posted by "lidavidm (via GitHub)" <gi...@apache.org>.
lidavidm commented on code in PR #34873:
URL: https://github.com/apache/arrow/pull/34873#discussion_r1175852852


##########
java/vector/src/main/java/org/apache/arrow/vector/ipc/ArrowWriter.java:
##########
@@ -53,6 +53,8 @@ public abstract class ArrowWriter implements AutoCloseable {
 
   protected static final Logger LOGGER = LoggerFactory.getLogger(ArrowWriter.class);
 
+  protected static final int compressionLevel = 3;

Review Comment:
   And this should be clearly named DEFAULT_COMPRESSION_LEVEL



##########
java/vector/src/main/java/org/apache/arrow/vector/ipc/ArrowWriter.java:
##########
@@ -53,6 +53,8 @@ public abstract class ArrowWriter implements AutoCloseable {
 
   protected static final Logger LOGGER = LoggerFactory.getLogger(ArrowWriter.class);
 
+  protected static final int compressionLevel = 3;

Review Comment:
   Constants should use UPPER_SNAKE_CASE.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on a diff in pull request #34873: GH-34749 : [Java] Make Zstd compression level configurable

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #34873:
URL: https://github.com/apache/arrow/pull/34873#discussion_r1177332188


##########
java/vector/src/main/java/org/apache/arrow/vector/ipc/ArrowStreamWriter.java:
##########
@@ -81,7 +82,25 @@ public ArrowStreamWriter(VectorSchemaRoot root, DictionaryProvider provider, Wri
   public ArrowStreamWriter(VectorSchemaRoot root, DictionaryProvider provider, WritableByteChannel out,
                            IpcOption option, CompressionCodec.Factory compressionFactory,
                            CompressionUtil.CodecType codecType) {
-    super(root, provider, out, option, compressionFactory, codecType);
+    this(root, provider, out, option, compressionFactory, codecType, Optional.ofNullable(null));

Review Comment:
   ditto



##########
java/vector/src/main/java/org/apache/arrow/vector/ipc/ArrowFileWriter.java:
##########
@@ -74,7 +75,13 @@ public ArrowFileWriter(VectorSchemaRoot root, DictionaryProvider provider, Writa
   public ArrowFileWriter(VectorSchemaRoot root, DictionaryProvider provider, WritableByteChannel out,
                          Map<String, String> metaData, IpcOption option, CompressionCodec.Factory compressionFactory,
                          CompressionUtil.CodecType codecType) {
-    super(root, provider, out, option, compressionFactory, codecType);
+    this(root, provider, out, metaData, option, compressionFactory, codecType, Optional.ofNullable(null));

Review Comment:
   ```suggestion
       this(root, provider, out, metaData, option, compressionFactory, codecType, Optional.empty());
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ursabot commented on pull request #34873: GH-34749 : [Java] Make Zstd compression level configurable

Posted by "ursabot (via GitHub)" <gi...@apache.org>.
ursabot commented on PR #34873:
URL: https://github.com/apache/arrow/pull/34873#issuecomment-1577571815

   Benchmark runs are scheduled for baseline = f5a57292e5187b1a1640b04303e9f6b63135892a and contender = 87d082477762fa9b720f5772c5331bd629ed90ba. 87d082477762fa9b720f5772c5331bd629ed90ba is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Failed] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/4ece6145ff5644119e5555943531e2fe...0b5a207d4c1a4efcbbba25bcf5a10079/)
   [Failed :arrow_down:1.65% :arrow_up:0.15%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/eda9686ef21f44b0be3fc8b355572192...8bced070c2254025bb285fc8b792c598/)
   [Failed] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/35af33bb753f4e939b9b4402006c8e2e...059a570021bf4fe7afbaff0fb83818e2/)
   [Failed :arrow_down:0.48% :arrow_up:0.27%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/f802168894604b648f002a49a0bef8cb...b02bad2427124ade9cebe813bcc05abd/)
   Buildkite builds:
   [Failed] [`87d08247` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2980)
   [Failed] [`87d08247` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/3016)
   [Failed] [`87d08247` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2981)
   [Failed] [`87d08247` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/3006)
   [Finished] [`f5a57292` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2979)
   [Finished] [`f5a57292` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/3015)
   [Finished] [`f5a57292` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2980)
   [Finished] [`f5a57292` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/3005)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on a diff in pull request #34873: GH-34749 : [Java] Make Zstd compression level configurable

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #34873:
URL: https://github.com/apache/arrow/pull/34873#discussion_r1175991625


##########
java/vector/src/main/java/org/apache/arrow/vector/ipc/ArrowWriter.java:
##########
@@ -53,6 +53,8 @@ public abstract class ArrowWriter implements AutoCloseable {
 
   protected static final Logger LOGGER = LoggerFactory.getLogger(ArrowWriter.class);
 
+  protected static final int compressionLevel = 3;

Review Comment:
   +1 for Optional



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] lidavidm commented on a diff in pull request #34873: GH-34749 : [Java] Make Zstd compression level configurable

Posted by "lidavidm (via GitHub)" <gi...@apache.org>.
lidavidm commented on code in PR #34873:
URL: https://github.com/apache/arrow/pull/34873#discussion_r1174311081


##########
java/vector/src/main/java/org/apache/arrow/vector/ipc/ArrowWriter.java:
##########
@@ -119,6 +119,53 @@ protected ArrowWriter(VectorSchemaRoot root, DictionaryProvider provider, Writab
     this.schema = new Schema(fields, root.getSchema().getCustomMetadata());
   }
 
+  /**
+   * Note: fields are not closed when the writer is closed.
+   *
+   * @param root               the vectors to write to the output
+   * @param provider           where to find the dictionaries
+   * @param out                the output where to write
+   * @param option             IPC write options
+   * @param compressionFactory Compression codec factory
+   * @param codecType          Compression codec
+   * @param compressionLevel   Compression level
+   */
+  protected ArrowWriter(VectorSchemaRoot root, DictionaryProvider provider, WritableByteChannel out, IpcOption option,

Review Comment:
   As already suggested: why not add a new constructor, and have the old ones delegate to the new one? That way we can avoid code duplication.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] lidavidm commented on a diff in pull request #34873: GH-34749 : [Java] Make Zstd compression level configurable

Posted by "lidavidm (via GitHub)" <gi...@apache.org>.
lidavidm commented on code in PR #34873:
URL: https://github.com/apache/arrow/pull/34873#discussion_r1175983092


##########
java/vector/src/main/java/org/apache/arrow/vector/ipc/ArrowWriter.java:
##########
@@ -53,6 +53,8 @@ public abstract class ArrowWriter implements AutoCloseable {
 
   protected static final Logger LOGGER = LoggerFactory.getLogger(ArrowWriter.class);
 
+  protected static final int compressionLevel = 3;

Review Comment:
   Hmm, we could leave that for a different PR?
   
   Or we could use Optional<Integer> here and switch between constructors that way?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on a diff in pull request #34873: GH-34749 : [Java] Make Zstd compression level configurable

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #34873:
URL: https://github.com/apache/arrow/pull/34873#discussion_r1177224767


##########
java/vector/src/main/java/org/apache/arrow/vector/ipc/ArrowWriter.java:
##########
@@ -53,6 +53,8 @@ public abstract class ArrowWriter implements AutoCloseable {
 
   protected static final Logger LOGGER = LoggerFactory.getLogger(ArrowWriter.class);
 
+  protected static final int compressionLevel = 3;

Review Comment:
   We can use an `Optional<int> level` parameter and pass it to specific codecs via `createCodec`. `Lz4CompressionCodec` and `ZstdCompressionCodec` can respect the value when it is set or use specific default level if value is unset.
   
   BTW, I just checked that `FramedLZ4CompressorOutputStream` does not have the concept of level. It requires user to fine-tune the parameters: https://commons.apache.org/proper/commons-compress/apidocs/org/apache/commons/compress/compressors/lz4/FramedLZ4CompressorOutputStream.Parameters.html



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on a diff in pull request #34873: GH-34749 : [Java] Make Zstd compression level configurable

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #34873:
URL: https://github.com/apache/arrow/pull/34873#discussion_r1156616671


##########
java/compression/src/main/java/org/apache/arrow/compression/CommonsCompressionFactory.java:
##########
@@ -27,15 +27,22 @@
  */
 public class CommonsCompressionFactory implements CompressionCodec.Factory {
 
-  public static final CommonsCompressionFactory INSTANCE = new CommonsCompressionFactory();
+  private int compressionLevel;
+
+  public CommonsCompressionFactory(int compressionLevel) {
+    this.compressionLevel = compressionLevel;
+  }
+
+  @Deprecated
+  public static final CommonsCompressionFactory INSTANCE = new CommonsCompressionFactory(3);
 
   @Override
   public CompressionCodec createCodec(CompressionUtil.CodecType codecType) {

Review Comment:
   Is it better to extend `createCodec` to add an `Optional<int> level`? The downside of the current implementation is that we need to create separate instances for different levels.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] davisusanibar commented on a diff in pull request #34873: GH-34749 : [Java] Make Zstd compression level configurable

Posted by "davisusanibar (via GitHub)" <gi...@apache.org>.
davisusanibar commented on code in PR #34873:
URL: https://github.com/apache/arrow/pull/34873#discussion_r1157445328


##########
java/compression/src/main/java/org/apache/arrow/compression/CommonsCompressionFactory.java:
##########
@@ -27,15 +27,22 @@
  */
 public class CommonsCompressionFactory implements CompressionCodec.Factory {
 
-  public static final CommonsCompressionFactory INSTANCE = new CommonsCompressionFactory();
+  private int compressionLevel;
+
+  public CommonsCompressionFactory(int compressionLevel) {
+    this.compressionLevel = compressionLevel;
+  }
+
+  @Deprecated
+  public static final CommonsCompressionFactory INSTANCE = new CommonsCompressionFactory(3);
 
   @Override
   public CompressionCodec createCodec(CompressionUtil.CodecType codecType) {

Review Comment:
   Just updated.
   
   ```
   ArrowFileWriter writer = new ArrowFileWriter(vectorSchemaRoot, null,
     fileOutputStream.getChannel(), null, IpcOption.DEFAULT,
     CommonsCompressionFactory.INSTANCE, zstdCodec.getCodecType(), /*compressionLevel*/ 7)
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #34873: GH-34749 : [Java] Make Zstd compression level configurable

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #34873:
URL: https://github.com/apache/arrow/pull/34873#issuecomment-1495158876

   * Closes: #34749


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] lidavidm merged pull request #34873: GH-34749 : [Java] Make Zstd compression level configurable

Posted by "lidavidm (via GitHub)" <gi...@apache.org>.
lidavidm merged PR #34873:
URL: https://github.com/apache/arrow/pull/34873


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] lidavidm commented on pull request #34873: GH-34749 : [Java] Make Zstd compression level configurable

Posted by "lidavidm (via GitHub)" <gi...@apache.org>.
lidavidm commented on PR #34873:
URL: https://github.com/apache/arrow/pull/34873#issuecomment-1524407486

   There seems to be a build error? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on a diff in pull request #34873: GH-34749 : [Java] Make Zstd compression level configurable

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #34873:
URL: https://github.com/apache/arrow/pull/34873#discussion_r1160368543


##########
java/vector/src/main/java/org/apache/arrow/vector/ipc/ArrowWriter.java:
##########
@@ -119,6 +119,53 @@ protected ArrowWriter(VectorSchemaRoot root, DictionaryProvider provider, Writab
     this.schema = new Schema(fields, root.getSchema().getCustomMetadata());
   }
 
+  /**
+   * Note: fields are not closed when the writer is closed.
+   *
+   * @param root               the vectors to write to the output
+   * @param provider           where to find the dictionaries
+   * @param out                the output where to write
+   * @param option             IPC write options
+   * @param compressionFactory Compression codec factory
+   * @param codecType          Compression codec
+   * @param compressionLevel   Compression level
+   */
+  protected ArrowWriter(VectorSchemaRoot root, DictionaryProvider provider, WritableByteChannel out, IpcOption option,

Review Comment:
   This duplicates a lot with the previous ctor. Probably we can use one as a delegate or refactor one to accept a created codec?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] davisusanibar commented on a diff in pull request #34873: GH-34749 : [Java] Make Zstd compression level configurable

Posted by "davisusanibar (via GitHub)" <gi...@apache.org>.
davisusanibar commented on code in PR #34873:
URL: https://github.com/apache/arrow/pull/34873#discussion_r1157446078


##########
java/compression/src/main/java/org/apache/arrow/compression/CommonsCompressionFactory.java:
##########
@@ -27,15 +27,22 @@
  */
 public class CommonsCompressionFactory implements CompressionCodec.Factory {
 
-  public static final CommonsCompressionFactory INSTANCE = new CommonsCompressionFactory();
+  private int compressionLevel;
+
+  public CommonsCompressionFactory(int compressionLevel) {
+    this.compressionLevel = compressionLevel;
+  }
+
+  @Deprecated
+  public static final CommonsCompressionFactory INSTANCE = new CommonsCompressionFactory(3);
 
   @Override
   public CompressionCodec createCodec(CompressionUtil.CodecType codecType) {

Review Comment:
   I'm going to fill a ticket to implement Builder patter for Reader/Writes class.
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] lidavidm commented on a diff in pull request #34873: GH-34749 : [Java] Make Zstd compression level configurable

Posted by "lidavidm (via GitHub)" <gi...@apache.org>.
lidavidm commented on code in PR #34873:
URL: https://github.com/apache/arrow/pull/34873#discussion_r1173148616


##########
java/compression/src/test/java/org/apache/arrow/compression/TestArrowReaderWriterWithCompression.java:
##########
@@ -85,4 +85,46 @@ public void testArrowFileZstdRoundTrip() throws Exception {
     }
   }
 
+  @Test
+  public void testArrowFileZstdWithCompressionLevelRoundTrip() throws Exception {
+    // Prepare sample data
+    final BufferAllocator allocator = new RootAllocator(Integer.MAX_VALUE);
+    List<Field> fields = new ArrayList<>();
+    fields.add(new Field("col", FieldType.notNullable(new ArrowType.Utf8()), new ArrayList<>()));
+    VectorSchemaRoot root = VectorSchemaRoot.create(new Schema(fields), allocator);
+    final int rowCount = 10;
+    GenerateSampleData.generateTestData(root.getVector(0), rowCount);
+    root.setRowCount(rowCount);
+
+    // Write an in-memory compressed arrow file
+    ByteArrayOutputStream out = new ByteArrayOutputStream();
+    try (final ArrowFileWriter writer =
+             new ArrowFileWriter(root, null, Channels.newChannel(out), new HashMap<>(),
+                 IpcOption.DEFAULT, CommonsCompressionFactory.INSTANCE, CompressionUtil.CodecType.ZSTD, 7)) {
+      writer.start();
+      writer.writeBatch();
+      writer.end();
+    }
+
+    // Read the in-memory compressed arrow file with CommonsCompressionFactory provided
+    try (ArrowFileReader reader =
+             new ArrowFileReader(new ByteArrayReadableSeekableByteChannel(out.toByteArray()),
+                 allocator, CommonsCompressionFactory.INSTANCE)) {
+      Assert.assertEquals(1, reader.getRecordBlocks().size());
+      Assert.assertTrue(reader.loadNextBatch());
+      Assert.assertTrue(root.equals(reader.getVectorSchemaRoot()));
+      Assert.assertFalse(reader.loadNextBatch());
+    }
+
+    // Read the in-memory compressed arrow file without CompressionFactory provided

Review Comment:
   Isn't this tested elsewhere already?



##########
java/vector/src/main/java/org/apache/arrow/vector/ipc/ArrowWriter.java:
##########
@@ -119,6 +119,53 @@ protected ArrowWriter(VectorSchemaRoot root, DictionaryProvider provider, Writab
     this.schema = new Schema(fields, root.getSchema().getCustomMetadata());
   }
 
+  /**
+   * Note: fields are not closed when the writer is closed.
+   *
+   * @param root               the vectors to write to the output
+   * @param provider           where to find the dictionaries
+   * @param out                the output where to write
+   * @param option             IPC write options
+   * @param compressionFactory Compression codec factory
+   * @param codecType          Compression codec
+   * @param compressionLevel   Compression level
+   */
+  protected ArrowWriter(VectorSchemaRoot root, DictionaryProvider provider, WritableByteChannel out, IpcOption option,

Review Comment:
   Agreed, can we consolidate?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] davisusanibar commented on a diff in pull request #34873: GH-34749 : [Java] Make Zstd compression level configurable

Posted by "davisusanibar (via GitHub)" <gi...@apache.org>.
davisusanibar commented on code in PR #34873:
URL: https://github.com/apache/arrow/pull/34873#discussion_r1176767869


##########
java/vector/src/main/java/org/apache/arrow/vector/ipc/ArrowWriter.java:
##########
@@ -53,6 +53,8 @@ public abstract class ArrowWriter implements AutoCloseable {
 
   protected static final Logger LOGGER = LoggerFactory.getLogger(ArrowWriter.class);
 
+  protected static final int compressionLevel = 3;

Review Comment:
   > The default level is tricky here:
   > 
   > * LZ4 default level is 1 with range from 1 to 9
   > * ZSTD default level is 3 with range from 1 to 22 (also negative levels -1 to -5 are supported)
   > 
   > There is a [discussion](https://github.com/apache/arrow/issues/35287) on the C++ side to add support of `Codec::Options`. Would it be good to do similar things on the Java side so we can accommodate different default levels per codec?
   
   Uhmm, our current compression version only support compression level for Zstd not for LZ4. I was reviewing and is close the same as [Spark Compression Level Zstd](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/io/CompressionCodec.scala).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] davisusanibar commented on a diff in pull request #34873: GH-34749 : [Java] Make Zstd compression level configurable

Posted by "davisusanibar (via GitHub)" <gi...@apache.org>.
davisusanibar commented on code in PR #34873:
URL: https://github.com/apache/arrow/pull/34873#discussion_r1176768542


##########
java/vector/src/main/java/org/apache/arrow/vector/ipc/ArrowWriter.java:
##########
@@ -53,6 +53,8 @@ public abstract class ArrowWriter implements AutoCloseable {
 
   protected static final Logger LOGGER = LoggerFactory.getLogger(ArrowWriter.class);
 
+  protected static final int compressionLevel = 3;

Review Comment:
   > +1 for Optional
   
   Could you help me on this about where o how?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] davisusanibar commented on a diff in pull request #34873: GH-34749 : [Java] Make Zstd compression level configurable

Posted by "davisusanibar (via GitHub)" <gi...@apache.org>.
davisusanibar commented on code in PR #34873:
URL: https://github.com/apache/arrow/pull/34873#discussion_r1157446078


##########
java/compression/src/main/java/org/apache/arrow/compression/CommonsCompressionFactory.java:
##########
@@ -27,15 +27,22 @@
  */
 public class CommonsCompressionFactory implements CompressionCodec.Factory {
 
-  public static final CommonsCompressionFactory INSTANCE = new CommonsCompressionFactory();
+  private int compressionLevel;
+
+  public CommonsCompressionFactory(int compressionLevel) {
+    this.compressionLevel = compressionLevel;
+  }
+
+  @Deprecated
+  public static final CommonsCompressionFactory INSTANCE = new CommonsCompressionFactory(3);
 
   @Override
   public CompressionCodec createCodec(CompressionUtil.CodecType codecType) {

Review Comment:
   I'm going to fill a new ticket to implement Builder pattern for Reader/Writes class.
   
   ie:
   ```
   ArrowStreamWriter writer = new ArrowStreamWriter.Builder(schemaRootWrite, fileOutputStream).setCodec(codec).build();
   ```
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] davisusanibar commented on a diff in pull request #34873: GH-34749 : [Java] Make Zstd compression level configurable

Posted by "davisusanibar (via GitHub)" <gi...@apache.org>.
davisusanibar commented on code in PR #34873:
URL: https://github.com/apache/arrow/pull/34873#discussion_r1175516546


##########
java/vector/src/main/java/org/apache/arrow/vector/ipc/ArrowFileWriter.java:
##########
@@ -73,8 +73,8 @@ public ArrowFileWriter(VectorSchemaRoot root, DictionaryProvider provider, Writa
 
   public ArrowFileWriter(VectorSchemaRoot root, DictionaryProvider provider, WritableByteChannel out,
                          Map<String, String> metaData, IpcOption option, CompressionCodec.Factory compressionFactory,
-                         CompressionUtil.CodecType codecType) {
-    super(root, provider, out, option, compressionFactory, codecType);
+                         CompressionUtil.CodecType codecType, int compressionLevel) {

Review Comment:
   Updated



##########
java/vector/src/main/java/org/apache/arrow/vector/ipc/ArrowWriter.java:
##########
@@ -119,6 +119,53 @@ protected ArrowWriter(VectorSchemaRoot root, DictionaryProvider provider, Writab
     this.schema = new Schema(fields, root.getSchema().getCustomMetadata());
   }
 
+  /**
+   * Note: fields are not closed when the writer is closed.
+   *
+   * @param root               the vectors to write to the output
+   * @param provider           where to find the dictionaries
+   * @param out                the output where to write
+   * @param option             IPC write options
+   * @param compressionFactory Compression codec factory
+   * @param codecType          Compression codec
+   * @param compressionLevel   Compression level
+   */
+  protected ArrowWriter(VectorSchemaRoot root, DictionaryProvider provider, WritableByteChannel out, IpcOption option,

Review Comment:
   Updated



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #34873: GH-34749 : [Java] Make Zstd compression level configurable

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #34873:
URL: https://github.com/apache/arrow/pull/34873#issuecomment-1495158932

   :warning: GitHub issue #34749 **has been automatically assigned in GitHub** to PR creator.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] davisusanibar commented on a diff in pull request #34873: GH-34749 : [Java] Make Zstd compression level configurable

Posted by "davisusanibar (via GitHub)" <gi...@apache.org>.
davisusanibar commented on code in PR #34873:
URL: https://github.com/apache/arrow/pull/34873#discussion_r1174177749


##########
java/compression/src/test/java/org/apache/arrow/compression/TestArrowReaderWriterWithCompression.java:
##########
@@ -85,4 +85,46 @@ public void testArrowFileZstdRoundTrip() throws Exception {
     }
   }
 
+  @Test
+  public void testArrowFileZstdWithCompressionLevelRoundTrip() throws Exception {
+    // Prepare sample data
+    final BufferAllocator allocator = new RootAllocator(Integer.MAX_VALUE);
+    List<Field> fields = new ArrayList<>();
+    fields.add(new Field("col", FieldType.notNullable(new ArrowType.Utf8()), new ArrayList<>()));
+    VectorSchemaRoot root = VectorSchemaRoot.create(new Schema(fields), allocator);
+    final int rowCount = 10;
+    GenerateSampleData.generateTestData(root.getVector(0), rowCount);
+    root.setRowCount(rowCount);
+
+    // Write an in-memory compressed arrow file
+    ByteArrayOutputStream out = new ByteArrayOutputStream();
+    try (final ArrowFileWriter writer =
+             new ArrowFileWriter(root, null, Channels.newChannel(out), new HashMap<>(),
+                 IpcOption.DEFAULT, CommonsCompressionFactory.INSTANCE, CompressionUtil.CodecType.ZSTD, 7)) {
+      writer.start();
+      writer.writeBatch();
+      writer.end();
+    }
+
+    // Read the in-memory compressed arrow file with CommonsCompressionFactory provided
+    try (ArrowFileReader reader =
+             new ArrowFileReader(new ByteArrayReadableSeekableByteChannel(out.toByteArray()),
+                 allocator, CommonsCompressionFactory.INSTANCE)) {
+      Assert.assertEquals(1, reader.getRecordBlocks().size());
+      Assert.assertTrue(reader.loadNextBatch());
+      Assert.assertTrue(root.equals(reader.getVectorSchemaRoot()));
+      Assert.assertFalse(reader.loadNextBatch());
+    }
+
+    // Read the in-memory compressed arrow file without CompressionFactory provided

Review Comment:
   Deleted



##########
java/vector/src/main/java/org/apache/arrow/vector/ipc/ArrowWriter.java:
##########
@@ -119,6 +119,53 @@ protected ArrowWriter(VectorSchemaRoot root, DictionaryProvider provider, Writab
     this.schema = new Schema(fields, root.getSchema().getCustomMetadata());
   }
 
+  /**
+   * Note: fields are not closed when the writer is closed.
+   *
+   * @param root               the vectors to write to the output
+   * @param provider           where to find the dictionaries
+   * @param out                the output where to write
+   * @param option             IPC write options
+   * @param compressionFactory Compression codec factory
+   * @param codecType          Compression codec
+   * @param compressionLevel   Compression level
+   */
+  protected ArrowWriter(VectorSchemaRoot root, DictionaryProvider provider, WritableByteChannel out, IpcOption option,

Review Comment:
   Updated



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org