Posted to issues@arrow.apache.org by "Neltherion (via GitHub)" <gi...@apache.org> on 2023/01/23 10:58:26 UTC

[GitHub] [arrow] Neltherion opened a new issue, #11239: Any replacements for `pa.serialize()` and `pa.deserialize()` ?

Neltherion opened a new issue, #11239:
URL: https://github.com/apache/arrow/issues/11239

   It seems that PyArrow has deprecated the `pa.serialize()` and `pa.deserialize()` methods and suggests using other options such as Pickle5.
   
   Pickle5 doesn't seem to match the performance of PyArrow's deprecated serialization methods. Are there any proper replacements for `pa.serialize()` and `pa.deserialize()`?
   
   Here's a simplified example that compares PyArrow and Pickle when serializing and deserializing:
   
   ```
   import time
   
   import numpy as np
   import pickle5
   import pyarrow as pa
   
   
   class Person:
       def __init__(self, Thumbnail: np.ndarray = None):
           if Thumbnail is not None:
               self.Thumbnail: np.ndarray = Thumbnail
           else:
               self.Thumbnail: np.ndarray = np.random.rand(256, 256, 3)
   
   
   def serialize_Person(person):
       return {'Thumbnail': person.Thumbnail}
   
   
   def deserialize_Person(person):
       return Person(person['Thumbnail'])
   
   
   context = pa.SerializationContext()
   context.register_type(Person, 'Person', custom_serializer=serialize_Person, custom_deserializer=deserialize_Person)
   
   PERSONS = [Person() for i in range(100)]
   
   """
   PyArrow
   """
   t1 = time.time()
   persons_serialized = pa.serialize(PERSONS, context=context).to_buffer()
   persons_deserialized = pa.deserialize(persons_serialized, context=context)
   t2 = time.time()
   print(f'PyArrow Time => {t2 - t1}')
   
   """
   Pickle
   """
   t1 = time.time()
   persons_pickled = pickle5.dumps(PERSONS, protocol=5)
   persons_depickled = pickle5.loads(persons_pickled)
   t2 = time.time()
   print(f'Pickle Time => {t2 - t1}')
   ```
   
   The outputs on my system are:
   
   ```
   PyArrow Time => 0.04499983787536621
   Pickle Time => 0.2220008373260498
   ```
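
For reference, a minimal sketch of pickle protocol 5 out-of-band buffers, using only the standard library (`pickle` in Python 3.8+ instead of the `pickle5` backport). `ZeroCopyByteArray` is an illustrative stand-in for an array type and is not part of PyArrow or of the benchmark above. The benchmark calls `dumps` without `buffer_callback`, so its buffers are serialized in-band; whether going out-of-band closes the timing gap on a given system is something to measure, not assume.

```python
import pickle


class ZeroCopyByteArray(bytearray):
    """Illustrative bytearray subclass that exposes its memory to pickle."""

    def __reduce_ex__(self, protocol):
        if protocol >= 5:
            # PickleBuffer wraps the existing memory without copying it.
            return type(self)._reconstruct, (pickle.PickleBuffer(self),), None
        # Older protocols fall back to an in-band copy.
        return type(self)._reconstruct, (bytearray(self),)

    @classmethod
    def _reconstruct(cls, obj):
        with memoryview(obj) as m:
            original = m.obj
            if type(original) is cls:
                return original  # the buffer already is an instance of cls
            return cls(original)


data = ZeroCopyByteArray(b"x" * 1_000_000)

# In-band: the whole megabyte is copied into the pickle stream.
in_band = pickle.dumps(data, protocol=5)

# Out-of-band: the stream holds metadata only; buffers are collected separately.
buffers = []
payload = pickle.dumps(data, protocol=5, buffer_callback=buffers.append)

# The receiver must supply the same buffers, in the same order.
restored = pickle.loads(payload, buffers=buffers)

print(len(in_band), len(payload), len(buffers))
```

The small `payload` plus the list of `buffers` is exactly the pair of objects that then has to be moved across the network together.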


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on issue #11239: Any replacements for `pa.serialize()` and `pa.deserialize()` ?

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on issue #11239:
URL: https://github.com/apache/arrow/issues/11239#issuecomment-1400402716

   @Neltherion That question probably belongs on a ZMQ forum or bug tracker. I have no idea which ZMQ APIs would help your use case.




[GitHub] [arrow] pitrou commented on issue #11239: [Python] Any replacements for `pa.serialize()` and `pa.deserialize()` ?

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on issue #11239:
URL: https://github.com/apache/arrow/issues/11239#issuecomment-1402257847

   Well, I already suggested `sendmsg` if you're building your own transmission system. `sendmsg` doesn't make an intermediate copy of the buffers you give it. If you need a similar primitive from ZMQ, it's a ZMQ question.
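
A minimal sketch of the `sendmsg` approach, assuming a Unix-like platform (where `socket.sendmsg` is available) and a blocking stream socket. `sendmsg_all` and `recv_exact` are hypothetical helper names, not part of any library. The point is that the list of buffers is handed to the kernel as-is, without first joining it into one bytes object.

```python
import socket


def sendmsg_all(sock, parts):
    """Send every bytes-like part with sendmsg, without concatenating them."""
    views = (memoryview(p).cast("B") for p in parts)
    pending = [v for v in views if v.nbytes]  # skip empty parts
    total = sum(v.nbytes for v in pending)
    while pending:
        sent = sock.sendmsg(pending)
        # sendmsg may transmit only part of the data: drop what went out
        # and retry with the remainder.
        while sent:
            head = pending[0]
            if sent >= head.nbytes:
                sent -= head.nbytes
                pending.pop(0)
            else:
                pending[0] = head[sent:]  # slicing a memoryview does not copy
                sent = 0
    return total


def recv_exact(sock, n):
    """Read exactly n bytes into one preallocated buffer."""
    buf = bytearray(n)
    view = memoryview(buf)
    got = 0
    while got < n:
        received = sock.recv_into(view[got:])
        if received == 0:
            raise ConnectionError("socket closed before the message was complete")
        got += received
    return buf


# Small demo over a local socket pair; the parts are kept small so that a
# single thread can send and then receive without blocking.
left, right = socket.socketpair()
parts = [b"header", bytearray(b"a" * 1024), memoryview(b"b" * 2048)]
total = sendmsg_all(left, parts)
received = recv_exact(right, total)
left.close()
right.close()
print(total, len(received))
```

In this demo the receiver already knows the total length; a real protocol would also need to send the individual part sizes so the receiver can split the stream again.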




[GitHub] [arrow] Neltherion commented on issue #11239: Any replacements for `pa.serialize()` and `pa.deserialize()` ?

Posted by "Neltherion (via GitHub)" <gi...@apache.org>.
Neltherion commented on issue #11239:
URL: https://github.com/apache/arrow/issues/11239#issuecomment-1400269716

   > @Neltherion You could indeed send the buffers one by one or send them all at once using [sendmsg](https://docs.python.org/3/library/socket.html#socket.socket.sendmsg).
   
   @pitrou Thanks. But is this also doable with ZMQ? It seems ZMQ is only able to send bytearray data types. Since we have a bytes-like object (the serialized object) and a Python list of buffers, how can we combine them without memory copies and still send both in one go using ZMQ?
   
   [The code over at StackOverflow](https://stackoverflow.com/questions/75201514/sending-pickled-objects-using-pickle5s-out-of-band-buffers-over-the-network) has a working example to test ideas out.




[GitHub] [arrow] pitrou commented on issue #11239: Any replacements for `pa.serialize()` and `pa.deserialize()` ?

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on issue #11239:
URL: https://github.com/apache/arrow/issues/11239#issuecomment-1400223697

   But instead of trying to recreate it yourself, I would suggest you take a look at [Dask distributed](https://github.com/dask/distributed/). It already has a lot of optimizations for transferring Numpy- and Arrow-like data.




[GitHub] [arrow] Neltherion commented on issue #11239: [Python] Any replacements for `pa.serialize()` and `pa.deserialize()` ?

Posted by "Neltherion (via GitHub)" <gi...@apache.org>.
Neltherion commented on issue #11239:
URL: https://github.com/apache/arrow/issues/11239#issuecomment-1402251969

   It's not about ZMQ: Pickle5 gives us a list of buffers and a serialized object. If we want to send them together, we have to somehow stick them together, which results in memory copies. Sending them one by one brings its own complexities (packets not arriving, bookkeeping for the received data, reassembling the pieces with `pickle.loads()`, etc.). If there were a way to keep them together in a serialized state, neither ZMQ nor any other legacy code would have difficulties, but so far I haven't found a way to keep these two objects together.
   
   So how can we fully use this new feature without having to handle these two objects separately?
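
One possible way to treat the two objects as a single logical message is to prefix them with a small header that records how many parts follow and how long each one is. The sketch below is only an illustration under that assumption: the header layout and the names `frame_parts` and `load_message` are made up for this example and do not come from pickle, PyArrow or ZMQ. The sender ends up with a list of parts that a scatter/gather or multipart send API could transmit without joining them, and the receiver splits one contiguous message with `memoryview` slices, which do not copy.

```python
import pickle
import struct


def frame_parts(payload, buffers):
    """Return [header, payload, *buffers] describing one logical message."""
    views = [memoryview(payload)] + [b.raw() for b in buffers]
    # Hypothetical header: part count, then one 8-byte length per part.
    header = struct.pack(f"!I{len(views)}Q", len(views), *(v.nbytes for v in views))
    return [header, *views]


def load_message(message):
    """Rebuild the object from one contiguous message without copying buffers."""
    view = memoryview(message)
    (count,) = struct.unpack_from("!I", view, 0)
    lengths = struct.unpack_from(f"!{count}Q", view, 4)
    offset = 4 + 8 * count
    parts = []
    for n in lengths:
        parts.append(view[offset:offset + n])  # a slice is a view, not a copy
        offset += n
    payload, *buffers = parts
    return pickle.loads(payload, buffers=buffers)


pixels = bytearray(b"\x7f" * 4096)
obj = {"name": "thumbnail", "pixels": pickle.PickleBuffer(pixels)}

buffers = []
payload = pickle.dumps(obj, protocol=5, buffer_callback=buffers.append)
parts = frame_parts(payload, buffers)

# Stand-in for the network: joining copies the data, which a real sender
# would avoid by passing `parts` to an API that accepts several buffers.
wire = b"".join(parts)

restored = load_message(wire)
print(restored["name"], len(restored["pixels"]))
```

With this layout the receiver only needs a single contiguous message, so the bookkeeping for separate buffers stays inside `load_message`.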




[GitHub] [arrow] Neltherion commented on issue #11239: Any replacements for `pa.serialize()` and `pa.deserialize()` ?

Posted by "Neltherion (via GitHub)" <gi...@apache.org>.
Neltherion commented on issue #11239:
URL: https://github.com/apache/arrow/issues/11239#issuecomment-1400155385

   @pitrou I have asked a [question](https://stackoverflow.com/questions/75201514/sending-pickled-objects-using-pickle5s-out-of-band-buffers-over-the-network) about how to send objects that have been serialized using Pickle's out-of-band buffers over the network.
   
   I have no idea how to send objects serialized with pickle now that each object is broken down into a serialized object and a list of buffers. Is this even possible, or should we try sending them one by one?




[GitHub] [arrow] pitrou commented on issue #11239: Any replacements for `pa.serialize()` and `pa.deserialize()` ?

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on issue #11239:
URL: https://github.com/apache/arrow/issues/11239#issuecomment-1400219240

   @Neltherion You could indeed send the buffers one by one or send them all at once using [sendmsg](https://docs.python.org/3/library/socket.html#socket.socket.sendmsg).

