Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2022/06/04 20:06:24 UTC

[GitHub] [beam] damccorm opened a new issue, #20900: Add support for out-of-band pickling (pickle 5) in PickleCoder

damccorm opened a new issue, #20900:
URL: https://github.com/apache/beam/issues/20900

   dev@ discussion: https://lists.apache.org/thread.html/r266f37640901544927205b913d4903340d6c59c3d94905e5dda0db42%40%3Cdev.beam.apache.org%3E
   
   PickleCoder should support out-of-band pickling to avoid unnecessary memory copies for types that support it (including numpy and pandas types).
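
   For context, here is a minimal sketch of out-of-band pickling using only the standard library. The `bytearray` payload and the `record` dict are illustrative stand-ins (assumptions, not Beam code) for a NumPy or pandas object, which expose their buffers to pickle through the same protocol 5 mechanism.

```python
import pickle

# Illustrative stand-in for the memory behind a large array.
payload = bytearray(b"x" * 1_000_000)
record = {"name": "example", "values": pickle.PickleBuffer(payload)}

# In-band: the buffer contents are copied into the pickle stream.
in_band = pickle.dumps(record, protocol=5)

# Out-of-band: each buffer is passed to the callback instead of being
# copied into the stream. list.append returns None, which tells pickle
# to keep that buffer out of the stream.
buffers = []
out_of_band = pickle.dumps(record, protocol=5, buffer_callback=buffers.append)

# The receiver supplies the buffers again, in the same order.
restored = pickle.loads(out_of_band, buffers=buffers)

print(len(in_band) > len(payload))   # stream includes the payload bytes
print(len(out_of_band) < 200)        # stream holds metadata only
print(memoryview(restored["values"]).obj is payload)  # no copy was made
```

   With `buffer_callback` set, the pickle stream carries only metadata and the large buffer is handed over separately, so whatever transports it decides whether a copy is needed.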
   
   Imported from Jira [BEAM-12418](https://issues.apache.org/jira/browse/BEAM-12418). Original Jira may contain additional context.
   Reported by: bhulette.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] TheNeuralBit commented on issue #20900: Add support for out-of-band pickling (pickle 5) in PickleCoder

Posted by GitBox <gi...@apache.org>.
TheNeuralBit commented on issue #20900:
URL: https://github.com/apache/beam/issues/20900#issuecomment-1251658001

   I looked into this a little bit. I don't think we'll see huge performance wins here, because we don't have an out-of-band path for transferring the data (e.g. shared memory). No matter what, we'll need to write/read the buffers over the Fn API, in-band with the rest of the encoded object.
   
   _However_, [as noted in PEP 574](https://peps.python.org/pep-0574/#improved-in-band-performance), one can still see an improvement in in-band performance with the pickle 5 protocol. It can eliminate a memcopy on the serialization path, because we no longer have to materialize a full `bytes` object representing the serialized object (copying all the buffers) and then copy that object to the output buffer.
   
   I ran some benchmarks to confirm this. With pickle 5, `pickle.dump` performs better when writing to a file-like object:
   ![image](https://user-images.githubusercontent.com/675055/191133887-f97f35ce-206c-4adc-90c6-d49b30c2401f.png)
   
   To take advantage of this, I think we'd need to expose the OutputStream as a file-like object and pass that to `pickle.dump`. This would allow pickle to write buffers directly to the output stream. Note that the current behavior is to execute `pickle.dumps` (one memcopy) and then write the result to the output stream (a second memcopy).
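
   A rough sketch of that idea, under stated assumptions: `FakeOutputStream` is a hypothetical stand-in for the coder's output stream (not Beam's actual class), and `StreamWriter` is a hypothetical adapter that exposes only the `write()` method `pickle.dump` requires.

```python
import pickle


class FakeOutputStream:
    """Hypothetical stand-in for the coder's output stream."""

    def __init__(self):
        self._data = bytearray()

    def write(self, b):
        # Accepts any bytes-like object and appends it to the stream.
        self._data += b

    def get(self):
        return bytes(self._data)


class StreamWriter:
    """Hypothetical adapter: the minimal file-like object pickle.dump needs."""

    def __init__(self, stream):
        self._stream = stream

    def write(self, data):
        # pickle may pass bytes, bytearray, or another buffer-protocol
        # object, so wrap it in a memoryview rather than assuming bytes.
        view = memoryview(data)
        self._stream.write(view)
        return view.nbytes


def encode_via_dumps(obj, stream):
    # Behavior described above: build the full pickle first, then copy it.
    stream.write(pickle.dumps(obj, protocol=5))


def encode_via_dump(obj, stream):
    # Proposed: let pickle write into the stream through the adapter.
    pickle.dump(obj, StreamWriter(stream), protocol=5)


big = bytearray(b"x" * 1_000_000)
value = {"id": 7, "values": pickle.PickleBuffer(big)}

old_stream, new_stream = FakeOutputStream(), FakeOutputStream()
encode_via_dumps(value, old_stream)
encode_via_dump(value, new_stream)

# Both paths decode to the same object; only the copying differs.
print(pickle.loads(old_stream.get()) == pickle.loads(new_stream.get()))
```

   The adapter does not change what gets encoded. It only gives pickle the chance to hand large buffers straight to `write()` instead of assembling one intermediate `bytes` object first.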
   
   CCing a few folks:
   @robertwb @shoyer @tvalentyn @apilloud 

