You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "dennybritz (via GitHub)" <gi...@apache.org> on 2023/02/25 11:16:19 UTC

[GitHub] [arrow-datafusion] dennybritz opened a new pull request, #5397: Zstd compression

dennybritz opened a new pull request, #5397:
URL: https://github.com/apache/arrow-datafusion/pull/5397

   # Which issue does this PR close?
   
   <!--
   We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123.
   -->
   
   Closes #5327
   
   # What changes are included in this PR?
   
   - Support zstd compression
   - Refactor tests to remove boilerplate code
   - New dependencies when `compression` feature is enabled: `zstd` and `async-compression/zstd`
   
   # Are these changes tested?
   
   - I adjusted the tests to include the `zstd` variant and also tested manually
    


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] viirya commented on a diff in pull request #5397: Zstd compression

Posted by "viirya (via GitHub)" <gi...@apache.org>.
viirya commented on code in PR #5397:
URL: https://github.com/apache/arrow-datafusion/pull/5397#discussion_r1117951498


##########
datafusion/core/src/datasource/file_format/file_type.rs:
##########
@@ -95,6 +98,9 @@ impl FileCompressionType {
     /// Xz-ed file (liblzma)
     pub const XZ: Self = Self { variant: XZ };
 
+    /// Xz-ed file (liblzma)
+    pub const ZSTD: Self = Self { variant: ZSTD };
+

Review Comment:
   ```suggestion
       /// Zstd-ed file
       pub const ZSTD: Self = Self { variant: ZSTD };
   
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb merged pull request #5397: Support Zstd compressed files

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb merged PR #5397:
URL: https://github.com/apache/arrow-datafusion/pull/5397


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] viirya commented on a diff in pull request #5397: Zstd compression

Posted by "viirya (via GitHub)" <gi...@apache.org>.
viirya commented on code in PR #5397:
URL: https://github.com/apache/arrow-datafusion/pull/5397#discussion_r1117951734


##########
datafusion/core/src/datasource/file_format/file_type.rs:
##########
@@ -140,6 +146,11 @@ impl FileCompressionType {
                 ReaderStream::new(AsyncXzDecoder::new(StreamReader::new(s)))
                     .map_err(err_converter),
             ),
+            #[cfg(feature = "compression")]
+            ZSTD => Box::new(
+                ReaderStream::new(AsyncZstdDecoer::new(StreamReader::new(s)))
+                    .map_err(err_converter),
+            ),
             #[cfg(not(feature = "compression"))]
             GZIP | BZIP2 | XZ => {
                 return Err(DataFusionError::NotImplemented(

Review Comment:
   ```suggestion
               #[cfg(not(feature = "compression"))]
               GZIP | BZIP2 | XZ | ZSTD => {
                   return Err(DataFusionError::NotImplemented(
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] ursabot commented on pull request #5397: Support Zstd compressed files

Posted by "ursabot (via GitHub)" <gi...@apache.org>.
ursabot commented on PR #5397:
URL: https://github.com/apache/arrow-datafusion/pull/5397#issuecomment-1446234414

   Benchmark runs are scheduled for baseline = 5ffa8d78f7e5796fa7c2f98b19346576ab2a4bb8 and contender = 8202a395a2bf9e8df55b31ecb99398204bbc3ac6. 8202a395a2bf9e8df55b31ecb99398204bbc3ac6 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/acc5b411ae9c44c08eb090472f836736...e48fb809904f4020bf29cb047b53154b/)
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] [test-mac-arm](https://conbench.ursa.dev/compare/runs/35abde722e454ac2818c2fe3b0198116...be432331f2f44e73856b0cd51e63acbb/)
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/1ac0b9c9dd304b4d8f742e88b719b0d1...c8d7edcb300f455daad548e87bb16b3e/)
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/db87ab560fd549b3bb2ccd1b110ce22a...9c347a1e94e9452cbcc9409a3f15823b/)
   Buildkite builds:
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] viirya commented on a diff in pull request #5397: Zstd compression

Posted by "viirya (via GitHub)" <gi...@apache.org>.
viirya commented on code in PR #5397:
URL: https://github.com/apache/arrow-datafusion/pull/5397#discussion_r1117951717


##########
datafusion/core/src/datasource/file_format/file_type.rs:
##########
@@ -162,6 +173,11 @@ impl FileCompressionType {
             BZIP2 => Box::new(BzDecoder::new(r)),
             #[cfg(feature = "compression")]
             XZ => Box::new(XzDecoder::new(r)),
+            #[cfg(feature = "compression")]
+            ZSTD => match ZstdDecoder::new(r) {
+                Ok(decoder) => Box::new(decoder),
+                Err(e) => return Err(DataFusionError::External(Box::new(e))),
+            },
             #[cfg(not(feature = "compression"))]
             GZIP | BZIP2 | XZ => {
                 return Err(DataFusionError::NotImplemented(

Review Comment:
   ```suggestion
               #[cfg(not(feature = "compression"))]
               GZIP | BZIP2 | XZ | ZSTD => {
                   return Err(DataFusionError::NotImplemented(
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jackwener commented on a diff in pull request #5397: Zstd compression

Posted by "jackwener (via GitHub)" <gi...@apache.org>.
jackwener commented on code in PR #5397:
URL: https://github.com/apache/arrow-datafusion/pull/5397#discussion_r1118066236


##########
datafusion/core/src/datasource/file_format/file_type.rs:
##########
@@ -162,8 +173,13 @@ impl FileCompressionType {
             BZIP2 => Box::new(BzDecoder::new(r)),
             #[cfg(feature = "compression")]
             XZ => Box::new(XzDecoder::new(r)),
+            #[cfg(feature = "compression")]
+            ZSTD => match ZstdDecoder::new(r) {
+                Ok(decoder) => Box::new(decoder),
+                Err(e) => return Err(DataFusionError::External(Box::new(e))),

Review Comment:
   I'm a little confused about why we use `DataFusionError::External` warp the error instead of return directly.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org