Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2022/06/04 18:11:00 UTC

[GitHub] [beam] damccorm opened a new issue, #20560: Cannot set compression level when writing compressed files

damccorm opened a new issue, #20560:
URL: https://github.com/apache/beam/issues/20560

   CompressedFile._initialize_compressor hardcodes the compression level used when writing:
   
    
   self._compressor = zlib.compressobj(
             zlib.Z_DEFAULT_COMPRESSION, zlib.DEFLATED, self._gzip_mask)
    
   It would be good to be able to control this: I have a large set of GZIP-compressed files, and writing the same data back produces output 10x larger than the input.
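   To illustrate how the level argument affects output size, here is a minimal standalone sketch using only the standard library. The helper name gzip_bytes, the GZIP_WBITS constant (assumed to match what self._gzip_mask holds in Beam) and the sample data are made up for illustration; the actual ratio will depend on the data.

```python
import zlib

# Assumption: this mirrors Beam's _gzip_mask; adding 16 selects the gzip container.
GZIP_WBITS = zlib.MAX_WBITS | 16


def gzip_bytes(data: bytes, level: int) -> bytes:
    """Compress data into gzip format at the given zlib level."""
    compressor = zlib.compressobj(level, zlib.DEFLATED, GZIP_WBITS)
    return compressor.compress(data) + compressor.flush()


# Hypothetical TSV-like sample; real data will compress differently.
sample = b"".join(b"%d\tvalue-%d\n" % (i, i % 97) for i in range(20000))

# Compare the fastest level, the library default, and the best level.
levels = (zlib.Z_BEST_SPEED, zlib.Z_DEFAULT_COMPRESSION, zlib.Z_BEST_COMPRESSION)
sizes = {level: len(gzip_bytes(sample, level)) for level in levels}

for level, size in sizes.items():
    print(f"level={level:>2} compressed_bytes={size}")
```

   Running this on a representative slice of the real input would show whether the level alone plausibly accounts for the size difference.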
    
   I've tried various monkeypatching approaches. These seem to work with the local runner but fail when using DataflowRunner. For example:
    
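   # Imports needed by the snippet below.
   import apache_beam as beam
   from apache_beam.io import WriteToText
   from apache_beam.io.filesystem import CompressedFile
   
   class WriteData(beam.PTransform):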
       def __init__(self, dst):
           import zlib
   
           self._dst = dst
   
           def _initialize_compressor(self):
               self._compressor = zlib.compressobj(
                   zlib.Z_BEST_COMPRESSION, zlib.DEFLATED, self._gzip_mask
               )
   
           CompressedFile._initialize_compressor = _initialize_compressor
   
       def expand(self, p):
           return p | WriteToText(
               file_path_prefix=self._dst,
               file_name_suffix=".tsv.gz",
               compression_type="gzip",
           )
   
   Imported from Jira [BEAM-11282](https://issues.apache.org/jira/browse/BEAM-11282). Original Jira may contain additional context.
   Reported by: JackWhelpton.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] tvalentyn commented on issue #20560: Provide a way to set non-default compression level when writing compressed files

Posted by GitBox <gi...@apache.org>.
tvalentyn commented on issue #20560:
URL: https://github.com/apache/beam/issues/20560#issuecomment-1164249914

   @johnjcasey @chamikaramj, who could help triage this further?




[GitHub] [beam] tvalentyn commented on issue #20560: Provide a way to set non-default compression level when writing compressed files

Posted by GitBox <gi...@apache.org>.
tvalentyn commented on issue #20560:
URL: https://github.com/apache/beam/issues/20560#issuecomment-1164249050

   Possible options:
   - change default compression level for gzip
   - introduce a new compression type
   - add a knob/hint to control compression (may not be supported by all compression types)
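
   As a rough sketch of the third option, a writer could accept the level as a constructor argument and pass it through to zlib. The class GzipLevelWriter and its compression_level parameter are hypothetical and are not part of Beam's API; this only shows the shape such a knob might take for gzip.

```python
import gzip
import io
import zlib


class GzipLevelWriter:
    """Hypothetical writer (not a Beam class) with a configurable gzip level."""

    def __init__(self, fileobj, compression_level=zlib.Z_DEFAULT_COMPRESSION):
        # zlib accepts -1 (library default) or an explicit level from 0 to 9.
        if compression_level != zlib.Z_DEFAULT_COMPRESSION and not (
            0 <= compression_level <= 9
        ):
            raise ValueError("compression_level must be -1 or in the range 0..9")
        self._file = fileobj
        # Adding 16 to the window bits asks zlib for a gzip header and trailer.
        self._compressor = zlib.compressobj(
            compression_level, zlib.DEFLATED, zlib.MAX_WBITS | 16
        )

    def write(self, data: bytes) -> None:
        chunk = self._compressor.compress(data)
        if chunk:
            self._file.write(chunk)

    def close(self) -> None:
        # Flush remaining compressed bytes; the caller still owns fileobj.
        self._file.write(self._compressor.flush())


# Usage: write at the highest level and check that the data round-trips.
payload = b"col_a\tcol_b\n" * 1000
buffer = io.BytesIO()
writer = GzipLevelWriter(buffer, compression_level=zlib.Z_BEST_COMPRESSION)
writer.write(payload)
writer.close()
restored = gzip.decompress(buffer.getvalue())
```

   Compression types without a comparable level setting would need to ignore or reject the argument, which is the caveat noted in the option above.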

