Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2022/06/04 16:26:09 UTC

[GitHub] [beam] damccorm opened a new issue, #20283: Add alternate constructor to improve byte encoding performance in SortValues

damccorm opened a new issue, #20283:
URL: https://github.com/apache/beam/issues/20283

   The `SortValues` transform operates on key groups of type `KV<PrimaryKeyT, Iterable<KV<SecondaryKeyT, ValueT>>>`. For each key group, it iterates over the elements and calls `CoderUtils.encodeToByteArray` on each SecondaryKeyT-ValueT pair. This operation can be expensive, and its parallelism is limited by the number of key groups.
   
   I'd like to propose adding an alternative to `SortValuesDoFn` that operates on `KV<PrimaryKeyT, Iterable<KV<byte[], byte[]>>>` and can skip the encoding step within the key group. The user's pipeline may be able to encode the data to bytes in a prior step in a much more parallelized and efficient way (e.g. in a `MapElements` transform). I've seen performance gains in every Dataflow metric from patching this into my team's pipeline.
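
As a rough illustration of the idea only, here is a minimal sketch in plain Java that does not use Beam at all. The class and method names (`PreEncodedSortSketch`, `encodeAll`, `sortEncoded`) are hypothetical, UTF-8 strings stand in for whatever coder the pipeline actually uses, and the unsigned lexicographic byte comparison is an assumption about how a sorter over raw bytes would order secondary keys. The point is that encoding happens per element in a parallel step, so the per-group sort works only on bytes that already exist.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical stand-alone sketch; none of these names come from Beam.
class PreEncodedSortSketch {

  // Stand-in for the upstream encoding step (e.g. what a MapElements
  // transform might do). UTF-8 is an assumed encoding for illustration.
  static Map.Entry<byte[], byte[]> encodePair(String secondaryKey, String value) {
    return Map.entry(
        secondaryKey.getBytes(StandardCharsets.UTF_8),
        value.getBytes(StandardCharsets.UTF_8));
  }

  // Encodes every pair in parallel, before any grouping has happened.
  static List<Map.Entry<byte[], byte[]>> encodeAll(List<Map.Entry<String, String>> pairs) {
    return pairs.parallelStream()
        .map(p -> encodePair(p.getKey(), p.getValue()))
        .collect(Collectors.toList());
  }

  // Stand-in for the per-key-group sort: orders the already-encoded
  // secondary keys as unsigned bytes, with no encoding inside the group.
  static List<Map.Entry<byte[], byte[]>> sortEncoded(List<Map.Entry<byte[], byte[]>> group) {
    List<Map.Entry<byte[], byte[]>> sorted = new ArrayList<>(group);
    sorted.sort((a, b) -> Arrays.compareUnsigned(a.getKey(), b.getKey()));
    return sorted;
  }

  public static void main(String[] args) {
    List<Map.Entry<String, String>> pairs =
        List.of(Map.entry("banana", "2"), Map.entry("apple", "1"), Map.entry("cherry", "3"));
    for (Map.Entry<byte[], byte[]> e : sortEncoded(encodeAll(pairs))) {
      System.out.println(new String(e.getKey(), StandardCharsets.UTF_8));
    }
  }
}
```

In a real pipeline the encoding step would presumably use the same coders that `SortValues` uses today, so that the resulting byte order matches the existing behavior.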
   
   
   (I picture the alternative and pre-existing constructors as analogous to the generic vs. specific Avro constructors, where the generic constructor has a static type and the specific Avro constructor has a parameterized T.)
    
   
   What do you think? 
   
   Imported from Jira [BEAM-10042](https://issues.apache.org/jira/browse/BEAM-10042). Original Jira may contain additional context.
   Reported by: clairemcginty.

