Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2022/06/01 17:01:59 UTC

[GitHub] [beam] steveniemitz commented on pull request #17783: [BEAM-14534] Allow users to compress values being shuffled in dataflow

steveniemitz commented on PR #17783:
URL: https://github.com/apache/beam/pull/17783#issuecomment-1143881332

   > The other part of the change makes sense to reduce byte[] copies by using ByteString.
   > 
   > CC: @tudorm
   
   Maybe I'll pull the ByteString refactoring stuff out into another review just to make this easier?  Do you have any particular issues with it using ByteString there?
   
   The downsides of using Output/InputStream are really too big to ignore here; the performance differences are orders of magnitude in our tests.  The main problem is that most "stream" compressor implementations are designed to compress a large amount of data, but in this case we're usually only compressing a few hundred bytes to 1 KB.  That makes the overhead of creating/destroying the compressor streams very high (comparatively, at least).  We ran into this problem with both deflate and zstd, and it's one of the reasons we ended up with an interface like this.  If putting this in OSS with a similar interface is really a non-starter, that's fine though; we can continue maintaining this in our own fork for the time being.
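
   To make the stream-vs-block distinction concrete, here is a minimal sketch using only java.util.zip.  It is not the interface from this PR: the class and method names (SmallValueCompression, BlockCompressor, compressWithStream) are hypothetical, and it makes no measurement claims.  It only shows where the per-value setup cost comes from: the stream variant allocates and releases a Deflater for every value, while the block variant resets and reuses one.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.Inflater;

// Hypothetical illustration; none of these names come from Beam or the PR.
public class SmallValueCompression {

  // Stream style: every call allocates a new Deflater (native zlib state)
  // plus wrapper buffers, and releases them again when the stream closes.
  public static byte[] compressWithStream(byte[] value) throws IOException {
    ByteArrayOutputStream sink = new ByteArrayOutputStream();
    try (DeflaterOutputStream out = new DeflaterOutputStream(sink)) {
      out.write(value);
    }
    return sink.toByteArray();
  }

  // Block style: one Deflater/Inflater pair is reset and reused per value.
  // Not thread-safe; assume one instance per worker thread.
  public static final class BlockCompressor {
    private final Deflater deflater = new Deflater();
    private final Inflater inflater = new Inflater();
    private byte[] scratch = new byte[4096];

    public byte[] compress(byte[] value) {
      deflater.reset();
      deflater.setInput(value);
      deflater.finish();
      int n = 0;
      while (!deflater.finished()) {
        if (n == scratch.length) {
          scratch = Arrays.copyOf(scratch, scratch.length * 2);
        }
        n += deflater.deflate(scratch, n, scratch.length - n);
      }
      return Arrays.copyOf(scratch, n);
    }

    public byte[] decompress(byte[] compressed) throws DataFormatException {
      inflater.reset();
      inflater.setInput(compressed);
      int n = 0;
      while (!inflater.finished()) {
        if (n == scratch.length) {
          scratch = Arrays.copyOf(scratch, scratch.length * 2);
        }
        int read = inflater.inflate(scratch, n, scratch.length - n);
        if (read == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
          throw new DataFormatException("truncated or unsupported input");
        }
        n += read;
      }
      return Arrays.copyOf(scratch, n);
    }

    // Releases the native zlib state held by the reused objects.
    public void close() {
      deflater.end();
      inflater.end();
    }
  }

  public static void main(String[] args) throws Exception {
    byte[] value = "user-123:aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa".getBytes(StandardCharsets.UTF_8);
    BlockCompressor block = new BlockCompressor();
    byte[] viaStream = compressWithStream(value);
    byte[] viaBlock = block.compress(value);
    // Both variants emit zlib-format data, so the same Inflater reads either.
    System.out.println(Arrays.equals(value, block.decompress(viaStream)));
    System.out.println(Arrays.equals(value, block.decompress(viaBlock)));
    block.close();
  }
}
```

   Timing the two variants in a loop over values of a few hundred bytes would be one way to check the overhead on a given JVM; the sketch deliberately leaves out any numbers.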
   
   The PipelineVisitor idea is interesting, although I'm skeptical of how well it'd work in practice.  For example, with a Combine, the coder for the data being shuffled is the accumulator coder, not the value coder of the KV.  I bet you'd need a bunch of special cases to pick the "right" coder to wrap for various transforms.

