You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2022/07/22 21:10:25 UTC

[GitHub] [beam] grufino opened a new pull request, #22419: add zstd compression support according to issue 22393

grufino opened a new pull request, #22419:
URL: https://github.com/apache/beam/pull/22419

   **Please** add a meaningful description for your change here
   
   ------------------------
   
   Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
   
    - [ ] [**Choose reviewer(s)**](https://beam.apache.org/contribute/#make-your-change) and mention them in a comment (`R: @username`).
    - [ ] Mention the appropriate issue in your description (for example: `addresses #123`), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment `fixes #<ISSUE NUMBER>` instead.
    - [ ] Update `CHANGES.md` with noteworthy changes.
    - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).
   
   See the [Contributor Guide](https://beam.apache.org/contribute) for more tips on [how to make review process smoother](https://beam.apache.org/contribute/#make-reviewers-job-easier).
   
   To check the build health, please visit [https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md](https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md)
   
   GitHub Actions Tests Status (on master branch)
   ------------------------------------------------------------------------------------------------
   [![Build python source distribution and wheels](https://github.com/apache/beam/workflows/Build%20python%20source%20distribution%20and%20wheels/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Build+python+source+distribution+and+wheels%22+branch%3Amaster+event%3Aschedule)
   [![Python tests](https://github.com/apache/beam/workflows/Python%20tests/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Python+Tests%22+branch%3Amaster+event%3Aschedule)
   [![Java tests](https://github.com/apache/beam/workflows/Java%20Tests/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Java+Tests%22+branch%3Amaster+event%3Aschedule)
   
   See [CI.md](https://github.com/apache/beam/blob/master/CI.md) for more information about GitHub Actions CI.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] tvalentyn commented on pull request #22419: Add zstd compression/decompression support

Posted by GitBox <gi...@apache.org>.
tvalentyn commented on PR #22419:
URL: https://github.com/apache/beam/pull/22419#issuecomment-1201807001

   There is an error:
   
   ```
   Container does not include required Beam dependencies or has conflicting dependencies. If Beam dependencies have changed, you need to regenerate base_image_requirements.txt files. See: https://s.apache.org/beam-python-requirements-generate
   ```
   
   I'll do that and push a commit here.
   
   For future changes: please don't squash reviewed and unreviewed commits via force-push, until review has finalized.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] github-actions[bot] commented on pull request #22419: DRAFT: add zstd compression support according to issue 22393

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #22419:
URL: https://github.com/apache/beam/pull/22419#issuecomment-1192950383

   Assigning reviewers. If you would like to opt out of this review, comment `assign to next reviewer`:
   
   R: @tvalentyn for label python.
   R: @pabloem for label io.
   
   Available commands:
   - `stop reviewer notifications` - opt out of the automated review tooling
   - `remind me after tests pass` - tag the comment author after tests pass
   - `waiting on author` - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)
   
   The PR bot will only process comments in the main thread (not review comments).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] tvalentyn merged pull request #22419: Add zstd compression/decompression support

Posted by GitBox <gi...@apache.org>.
tvalentyn merged PR #22419:
URL: https://github.com/apache/beam/pull/22419


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] grufino commented on a diff in pull request #22419: Add zstd compression/decompression support

Posted by GitBox <gi...@apache.org>.
grufino commented on code in PR #22419:
URL: https://github.com/apache/beam/pull/22419#discussion_r933878334


##########
sdks/python/apache_beam/io/filesystem.py:
##########
@@ -166,6 +177,9 @@ def _initialize_decompressor(self):
       self._decompressor = bz2.BZ2Decompressor()
     elif self._compression_type == CompressionTypes.DEFLATE:
       self._decompressor = zlib.decompressobj()
+    elif self._compression_type == CompressionTypes.ZSTD:
+      self._decompressor = zstandard.ZstdDecompressor(
+          max_window_size=2147483648).decompressobj()

Review Comment:
   @tvalentyn done, for the tests I reused the filesystem_test test cases for zst, hopefully that is enough, but let me know



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] grufino-blackbird commented on a diff in pull request #22419: Add zstd compression/decompression support

Posted by GitBox <gi...@apache.org>.
grufino-blackbird commented on code in PR #22419:
URL: https://github.com/apache/beam/pull/22419#discussion_r933550128


##########
sdks/python/apache_beam/io/filesystem.py:
##########
@@ -166,6 +177,9 @@ def _initialize_decompressor(self):
       self._decompressor = bz2.BZ2Decompressor()
     elif self._compression_type == CompressionTypes.DEFLATE:
       self._decompressor = zlib.decompressobj()
+    elif self._compression_type == CompressionTypes.ZSTD:
+      self._decompressor = zstandard.ZstdDecompressor(
+          max_window_size=2147483648).decompressobj()

Review Comment:
   thank you for the suggestion, I agree and found this: https://github.com/indygreg/python-zstandard/issues/157
   apparently it is related to the compression level used. What do you think about referencing this issue as a comment? It doesn't seem like the library intends to fix this in an automated way, as the issue was closed, but I still think it's better to leave the value as it seems to work as a general recommendation (for my use case I tested in 10GB+ compressed or 100GB+ decompressed files and it is working fine)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] grufino commented on pull request #22419: Add zstd compression/decompression support

Posted by GitBox <gi...@apache.org>.
grufino commented on PR #22419:
URL: https://github.com/apache/beam/pull/22419#issuecomment-1193251115

   R: @tvalentyn, @pabloem


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] tvalentyn commented on a diff in pull request #22419: Add zstd compression/decompression support

Posted by GitBox <gi...@apache.org>.
tvalentyn commented on code in PR #22419:
URL: https://github.com/apache/beam/pull/22419#discussion_r933532914


##########
sdks/python/apache_beam/io/filesystem.py:
##########
@@ -96,14 +102,19 @@ def mime_type(cls, compression_type, default='application/octet-stream'):
         cls.BZIP2: 'application/x-bz2',
         cls.DEFLATE: 'application/x-deflate',
         cls.GZIP: 'application/x-gzip',
+        cls.ZSTD: 'application/x-zstd',

Review Comment:
   nit: I think it's `application/zstd`
   https://datatracker.ietf.org/doc/html/rfc8478 
   https://datatracker.ietf.org/doc/html/rfc6648#appendix-B



##########
sdks/python/setup.py:
##########
@@ -226,6 +226,7 @@ def get_portability_package_data():
         'pytz>=2018.3',
         'requests>=2.24.0,<3.0.0',
         'typing-extensions>=3.7.0',
+        'zstandard>=0.18.0',

Review Comment:
   ```suggestion
           'zstandard>=0.18.0,<1',
   ```



##########
sdks/python/apache_beam/io/filesystem.py:
##########
@@ -166,6 +177,9 @@ def _initialize_decompressor(self):
       self._decompressor = bz2.BZ2Decompressor()
     elif self._compression_type == CompressionTypes.DEFLATE:
       self._decompressor = zlib.decompressobj()
+    elif self._compression_type == CompressionTypes.ZSTD:
+      self._decompressor = zstandard.ZstdDecompressor(
+          max_window_size=2147483648).decompressobj()

Review Comment:
   We should have some explanation here in a comment.
   Is there a related bug/discussion with zstd  maintainers that gives this recommendation from an authoritative source?
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] grufino commented on a diff in pull request #22419: Add zstd compression/decompression support

Posted by GitBox <gi...@apache.org>.
grufino commented on code in PR #22419:
URL: https://github.com/apache/beam/pull/22419#discussion_r930043048


##########
sdks/python/apache_beam/io/filesystem.py:
##########
@@ -166,6 +177,9 @@ def _initialize_decompressor(self):
       self._decompressor = bz2.BZ2Decompressor()
     elif self._compression_type == CompressionTypes.DEFLATE:
       self._decompressor = zlib.decompressobj()
+    elif self._compression_type == CompressionTypes.ZSTD:
+      self._decompressor = zstandard.ZstdDecompressor(
+          max_window_size=2147483648).decompressobj()

Review Comment:
   I know this seems arbitrary (and probably is), but without this param the decompressor fails with error: 
   ```
   zstd.ZstdError: zstd decompress error: Frame requires too much memory for decoding
   ```
   I thought about using `DEFAULT_READ_BUFFER_SIZE` but it seems too low for this param. I can either create a constant or a new parameter if necessary, let me know.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] grufino-blackbird commented on a diff in pull request #22419: Add zstd compression/decompression support

Posted by GitBox <gi...@apache.org>.
grufino-blackbird commented on code in PR #22419:
URL: https://github.com/apache/beam/pull/22419#discussion_r933550128


##########
sdks/python/apache_beam/io/filesystem.py:
##########
@@ -166,6 +177,9 @@ def _initialize_decompressor(self):
       self._decompressor = bz2.BZ2Decompressor()
     elif self._compression_type == CompressionTypes.DEFLATE:
       self._decompressor = zlib.decompressobj()
+    elif self._compression_type == CompressionTypes.ZSTD:
+      self._decompressor = zstandard.ZstdDecompressor(
+          max_window_size=2147483648).decompressobj()

Review Comment:
   thank you for the suggestion, I agree and found this: https://github.com/indygreg/python-zstandard/issues/157
   apparently it is related to the compression level used. What do you think about referencing this issue as a comment? It doesn't seem like the library intends to fix this in an automated way, as the issue was closed, but I still think it's better to leave the value as it seems to work as a general recommendation (for my use case I tested in 10GB+ compressed or 100GB+ decompressed files in several difference sizes of workers from 5GB to 64GB RAM and it is working fine)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] grufino-blackbird commented on a diff in pull request #22419: Add zstd compression/decompression support

Posted by GitBox <gi...@apache.org>.
grufino-blackbird commented on code in PR #22419:
URL: https://github.com/apache/beam/pull/22419#discussion_r933550128


##########
sdks/python/apache_beam/io/filesystem.py:
##########
@@ -166,6 +177,9 @@ def _initialize_decompressor(self):
       self._decompressor = bz2.BZ2Decompressor()
     elif self._compression_type == CompressionTypes.DEFLATE:
       self._decompressor = zlib.decompressobj()
+    elif self._compression_type == CompressionTypes.ZSTD:
+      self._decompressor = zstandard.ZstdDecompressor(
+          max_window_size=2147483648).decompressobj()

Review Comment:
   thank you for the suggestion, I agree and found this: https://github.com/indygreg/python-zstandard/issues/157
   apparently it is related to the compression level used. What do you think about referencing this issue as a comment? It doesn't seem like the library intends to fix this in an automated way, as the issue was closed, but I still think it's better to leave the value as it seems to work as a general recommendation (for my use case I tested in 10GB+ compressed or 100GB+ decompressed files in several difference sizes and workers from 5GB to 64GB RAM and it is working fine)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] grufino-blackbird commented on a diff in pull request #22419: Add zstd compression/decompression support

Posted by GitBox <gi...@apache.org>.
grufino-blackbird commented on code in PR #22419:
URL: https://github.com/apache/beam/pull/22419#discussion_r933550128


##########
sdks/python/apache_beam/io/filesystem.py:
##########
@@ -166,6 +177,9 @@ def _initialize_decompressor(self):
       self._decompressor = bz2.BZ2Decompressor()
     elif self._compression_type == CompressionTypes.DEFLATE:
       self._decompressor = zlib.decompressobj()
+    elif self._compression_type == CompressionTypes.ZSTD:
+      self._decompressor = zstandard.ZstdDecompressor(
+          max_window_size=2147483648).decompressobj()

Review Comment:
   thank you for the suggestion, I agree and found this: https://github.com/indygreg/python-zstandard/issues/157
   apparently it is related to the compression level used. What do you think about adding this as a comment? It doesn't seem like the library intends to fix this, as the issue was closed, but I still think it's better to leave the value as it seems to work as a general recommendation (for my use case I tested in 10GB+ compressed or 100GB+ decompressed files and it is working fine)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] grufino commented on a diff in pull request #22419: Add zstd compression/decompression support

Posted by GitBox <gi...@apache.org>.
grufino commented on code in PR #22419:
URL: https://github.com/apache/beam/pull/22419#discussion_r933553906


##########
sdks/python/apache_beam/io/filesystem.py:
##########
@@ -166,6 +177,9 @@ def _initialize_decompressor(self):
       self._decompressor = bz2.BZ2Decompressor()
     elif self._compression_type == CompressionTypes.DEFLATE:
       self._decompressor = zlib.decompressobj()
+    elif self._compression_type == CompressionTypes.ZSTD:
+      self._decompressor = zstandard.ZstdDecompressor(
+          max_window_size=2147483648).decompressobj()

Review Comment:
   thank you for the suggestion, I agree and found this: https://github.com/indygreg/python-zstandard/issues/157
   apparently it is related to the compression level used. What do you think about referencing this issue as a comment? It doesn't seem like the library intends to fix this in an automated way, as the issue was closed, but I still think it's better to leave the value as it seems to work as a general recommendation (for my use case I tested in 10GB+ compressed or 100GB+ decompressed files in several difference sizes and workers from 5GB to 64GB RAM and it is working fine)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] codecov[bot] commented on pull request #22419: DRAFT: add zstd compression support according to issue 22393

Posted by GitBox <gi...@apache.org>.
codecov[bot] commented on PR #22419:
URL: https://github.com/apache/beam/pull/22419#issuecomment-1192986574

   # [Codecov](https://codecov.io/gh/apache/beam/pull/22419?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#22419](https://codecov.io/gh/apache/beam/pull/22419?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (485503d) into [master](https://codecov.io/gh/apache/beam/commit/df99d6a2eb9beff87b8df97a51c6dac71d7e9cc3?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (df99d6a) will **increase** coverage by `0.00%`.
   > The diff coverage is `66.66%`.
   
   > :exclamation: Current head 485503d differs from pull request most recent head 5f295e3. Consider uploading reports for the commit 5f295e3 to get more accurate results
   
   ```diff
   @@           Coverage Diff           @@
   ##           master   #22419   +/-   ##
   =======================================
     Coverage   74.16%   74.17%           
   =======================================
     Files         706      706           
     Lines       93190    93222   +32     
   =======================================
   + Hits        69116    69147   +31     
   - Misses      22806    22807    +1     
     Partials     1268     1268           
   ```
   
   | Flag | Coverage Δ | |
   |---|---|---|
   | python | `83.54% <66.66%> (+<0.01%)` | :arrow_up: |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   | [Impacted Files](https://codecov.io/gh/apache/beam/pull/22419?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [sdks/python/apache\_beam/io/filesystem.py](https://codecov.io/gh/apache/beam/pull/22419/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c2Rrcy9weXRob24vYXBhY2hlX2JlYW0vaW8vZmlsZXN5c3RlbS5weQ==) | `88.40% <66.66%> (-0.36%)` | :arrow_down: |
   | [sdks/python/apache\_beam/utils/interactive\_utils.py](https://codecov.io/gh/apache/beam/pull/22419/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c2Rrcy9weXRob24vYXBhY2hlX2JlYW0vdXRpbHMvaW50ZXJhY3RpdmVfdXRpbHMucHk=) | `95.12% <0.00%> (-2.44%)` | :arrow_down: |
   | [sdks/python/apache\_beam/io/source\_test\_utils.py](https://codecov.io/gh/apache/beam/pull/22419/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c2Rrcy9weXRob24vYXBhY2hlX2JlYW0vaW8vc291cmNlX3Rlc3RfdXRpbHMucHk=) | `88.01% <0.00%> (-1.39%)` | :arrow_down: |
   | [sdks/python/apache\_beam/transforms/combiners.py](https://codecov.io/gh/apache/beam/pull/22419/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c2Rrcy9weXRob24vYXBhY2hlX2JlYW0vdHJhbnNmb3Jtcy9jb21iaW5lcnMucHk=) | `93.05% <0.00%> (-0.39%)` | :arrow_down: |
   | [.../runners/portability/fn\_api\_runner/translations.py](https://codecov.io/gh/apache/beam/pull/22419/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c2Rrcy9weXRob24vYXBhY2hlX2JlYW0vcnVubmVycy9wb3J0YWJpbGl0eS9mbl9hcGlfcnVubmVyL3RyYW5zbGF0aW9ucy5weQ==) | `92.65% <0.00%> (-0.34%)` | :arrow_down: |
   | [...eam/runners/interactive/interactive\_environment.py](https://codecov.io/gh/apache/beam/pull/22419/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c2Rrcy9weXRob24vYXBhY2hlX2JlYW0vcnVubmVycy9pbnRlcmFjdGl2ZS9pbnRlcmFjdGl2ZV9lbnZpcm9ubWVudC5weQ==) | `91.71% <0.00%> (-0.31%)` | :arrow_down: |
   | [...ache\_beam/runners/portability/local\_job\_service.py](https://codecov.io/gh/apache/beam/pull/22419/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c2Rrcy9weXRob24vYXBhY2hlX2JlYW0vcnVubmVycy9wb3J0YWJpbGl0eS9sb2NhbF9qb2Jfc2VydmljZS5weQ==) | `82.42% <0.00%> (+0.14%)` | :arrow_up: |
   | [...ks/python/apache\_beam/runners/worker/sdk\_worker.py](https://codecov.io/gh/apache/beam/pull/22419/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c2Rrcy9weXRob24vYXBhY2hlX2JlYW0vcnVubmVycy93b3JrZXIvc2RrX3dvcmtlci5weQ==) | `89.09% <0.00%> (+0.15%)` | :arrow_up: |
   | [...hon/apache\_beam/runners/worker/bundle\_processor.py](https://codecov.io/gh/apache/beam/pull/22419/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c2Rrcy9weXRob24vYXBhY2hlX2JlYW0vcnVubmVycy93b3JrZXIvYnVuZGxlX3Byb2Nlc3Nvci5weQ==) | `93.30% <0.00%> (+0.37%)` | :arrow_up: |
   | [sdks/python/apache\_beam/pipeline.py](https://codecov.io/gh/apache/beam/pull/22419/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c2Rrcy9weXRob24vYXBhY2hlX2JlYW0vcGlwZWxpbmUucHk=) | `92.84% <0.00%> (+0.41%)` | :arrow_up: |
   | ... and [3 more](https://codecov.io/gh/apache/beam/pull/22419/diff?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | |
   
   Help us with your feedback. Take ten seconds to tell us [how you rate us](https://about.codecov.io/nps?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] grufino-blackbird commented on pull request #22419: Add zstd compression/decompression support

Posted by GitBox <gi...@apache.org>.
grufino-blackbird commented on PR #22419:
URL: https://github.com/apache/beam/pull/22419#issuecomment-1197534167

   R: @ryanthompson591 , @ahmedabu98
   can you please have a look?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] grufino-blackbird commented on a diff in pull request #22419: Add zstd compression/decompression support

Posted by GitBox <gi...@apache.org>.
grufino-blackbird commented on code in PR #22419:
URL: https://github.com/apache/beam/pull/22419#discussion_r933550128


##########
sdks/python/apache_beam/io/filesystem.py:
##########
@@ -166,6 +177,9 @@ def _initialize_decompressor(self):
       self._decompressor = bz2.BZ2Decompressor()
     elif self._compression_type == CompressionTypes.DEFLATE:
       self._decompressor = zlib.decompressobj()
+    elif self._compression_type == CompressionTypes.ZSTD:
+      self._decompressor = zstandard.ZstdDecompressor(
+          max_window_size=2147483648).decompressobj()

Review Comment:
   thank you for the suggestion, I agree and found this: https://github.com/indygreg/python-zstandard/issues/157
   apparently it is related to the compression level used. What do you think about referencing this issue as a comment? It doesn't seem like the library intends to fix this, as the issue was closed, but I still think it's better to leave the value as it seems to work as a general recommendation (for my use case I tested in 10GB+ compressed or 100GB+ decompressed files and it is working fine)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] grufino closed pull request #22419: Add zstd compression/decompression support

Posted by GitBox <gi...@apache.org>.
grufino closed pull request #22419: Add zstd compression/decompression support
URL: https://github.com/apache/beam/pull/22419


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] grufino commented on pull request #22419: Add zstd compression/decompression support

Posted by GitBox <gi...@apache.org>.
grufino commented on PR #22419:
URL: https://github.com/apache/beam/pull/22419#issuecomment-1197534777

   R: @ryanthompson591 , @ahmedabu98
   can you please have a look?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] grufino commented on pull request #22419: Add zstd compression/decompression support

Posted by GitBox <gi...@apache.org>.
grufino commented on PR #22419:
URL: https://github.com/apache/beam/pull/22419#issuecomment-1201912964

   thank you @tvalentyn, I wasn't sure the root cause of the error, lesson learned for next contributions


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] tvalentyn commented on a diff in pull request #22419: Add zstd compression/decompression support

Posted by GitBox <gi...@apache.org>.
tvalentyn commented on code in PR #22419:
URL: https://github.com/apache/beam/pull/22419#discussion_r933588821


##########
sdks/python/apache_beam/io/filesystem.py:
##########
@@ -166,6 +177,9 @@ def _initialize_decompressor(self):
       self._decompressor = bz2.BZ2Decompressor()
     elif self._compression_type == CompressionTypes.DEFLATE:
       self._decompressor = zlib.decompressobj()
+    elif self._compression_type == CompressionTypes.ZSTD:
+      self._decompressor = zstandard.ZstdDecompressor(
+          max_window_size=2147483648).decompressobj()

Review Comment:
   can we also add a unit test exercising this compression on a small payload?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] tvalentyn commented on a diff in pull request #22419: Add zstd compression/decompression support

Posted by GitBox <gi...@apache.org>.
tvalentyn commented on code in PR #22419:
URL: https://github.com/apache/beam/pull/22419#discussion_r933588473


##########
sdks/python/apache_beam/io/filesystem.py:
##########
@@ -166,6 +177,9 @@ def _initialize_decompressor(self):
       self._decompressor = bz2.BZ2Decompressor()
     elif self._compression_type == CompressionTypes.DEFLATE:
       self._decompressor = zlib.decompressobj()
+    elif self._compression_type == CompressionTypes.ZSTD:
+      self._decompressor = zstandard.ZstdDecompressor(
+          max_window_size=2147483648).decompressobj()

Review Comment:
   sure. let's do that. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] github-actions[bot] commented on pull request #22419: Add zstd compression/decompression support

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #22419:
URL: https://github.com/apache/beam/pull/22419#issuecomment-1193251225

   Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] grufino commented on a diff in pull request #22419: Add zstd compression/decompression support

Posted by GitBox <gi...@apache.org>.
grufino commented on code in PR #22419:
URL: https://github.com/apache/beam/pull/22419#discussion_r930043048


##########
sdks/python/apache_beam/io/filesystem.py:
##########
@@ -166,6 +177,9 @@ def _initialize_decompressor(self):
       self._decompressor = bz2.BZ2Decompressor()
     elif self._compression_type == CompressionTypes.DEFLATE:
       self._decompressor = zlib.decompressobj()
+    elif self._compression_type == CompressionTypes.ZSTD:
+      self._decompressor = zstandard.ZstdDecompressor(
+          max_window_size=2147483648).decompressobj()

Review Comment:
   I know this seems arbitrary (and probably is), but without this param the decompressor fails with the following error when reading big files: 
   ```
   zstd.ZstdError: zstd decompress error: Frame requires too much memory for decoding
   ```
   I thought about using `DEFAULT_READ_BUFFER_SIZE` but it seems too low for this param. I can either create a constant or a new parameter if necessary, let me know what you find best.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] grufino commented on a diff in pull request #22419: Add zstd compression/decompression support

Posted by GitBox <gi...@apache.org>.
grufino commented on code in PR #22419:
URL: https://github.com/apache/beam/pull/22419#discussion_r930043048


##########
sdks/python/apache_beam/io/filesystem.py:
##########
@@ -166,6 +177,9 @@ def _initialize_decompressor(self):
       self._decompressor = bz2.BZ2Decompressor()
     elif self._compression_type == CompressionTypes.DEFLATE:
       self._decompressor = zlib.decompressobj()
+    elif self._compression_type == CompressionTypes.ZSTD:
+      self._decompressor = zstandard.ZstdDecompressor(
+          max_window_size=2147483648).decompressobj()

Review Comment:
   I know this seems arbitrary (and probably is), but without this param the decompressor fails with error: 
   ```
   zstd.ZstdError: zstd decompress error: Frame requires too much memory for decoding
   ```
   I thought about using `DEFAULT_READ_BUFFER_SIZE` but it seems too low for this param. I can either create a constant or a new parameter if necessary, let me know what you find best.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org