Posted to dev@arrow.apache.org by Will Jones <wi...@gmail.com> on 2023/03/15 21:23:37 UTC

Plasma will be removed in Arrow 12.0.0

Hello all,

First, a reminder that Plasma has been deprecated and will be removed in
the 12.0.0 release of the C++, Python, and Java Arrow libraries. [1]

I know some people used Plasma as a convenient way to share Arrow data between
Python processes, so I pulled together a quick performance comparison
against two supported alternatives: Flight over a Unix domain socket and the
Python multiprocessing.shared_memory module. [2] The shared memory example
performs comparably to Plasma, but I don't think it is accessible from other
languages. The Flight test is slower than shared memory, but still fairly
fast, and of course works across languages. I wrote a little more about the
shared memory case in a Stack Overflow answer [3].
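
For readers unfamiliar with the shared memory approach, here is a minimal,
stdlib-only sketch of the general pattern: one process publishes a byte
payload in a named shared memory block and another attaches to it by name.
It is not the code from the benchmark repo. The payload is a stand-in for
serialized Arrow IPC bytes, and the helper names (publish, fetch) and the
8-byte length header are assumptions made for illustration. For brevity the
"consumer" here attaches from the same process.

```python
import struct
from multiprocessing import shared_memory

# Hypothetical framing: an 8-byte little-endian length header, then the payload.
# The header is needed because some platforms round the block size up.
HEADER = struct.Struct("<Q")


def publish(payload: bytes) -> shared_memory.SharedMemory:
    """Create a shared memory block holding the payload and return its handle."""
    shm = shared_memory.SharedMemory(create=True, size=HEADER.size + len(payload))
    HEADER.pack_into(shm.buf, 0, len(payload))
    shm.buf[HEADER.size:HEADER.size + len(payload)] = payload
    return shm


def fetch(name: str) -> bytes:
    """Attach to an existing block by name and copy the payload out."""
    shm = shared_memory.SharedMemory(name=name)
    try:
        (length,) = HEADER.unpack_from(shm.buf, 0)
        # bytes() copies; a zero-copy reader would keep the mapping open instead.
        return bytes(shm.buf[HEADER.size:HEADER.size + length])
    finally:
        shm.close()


payload = b"stand-in for Arrow IPC bytes" * 1000
owner = publish(payload)
try:
    received = fetch(owner.name)  # normally called from a different process
finally:
    owner.close()
    owner.unlink()  # the creator is responsible for removing the block
```

Only the block name has to travel between processes; the data itself stays
in the shared mapping.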

If you have migrated off of Plasma and want to share with other users what
you moved to, please do so in this thread.

Best,

Will Jones

[1] https://github.com/apache/arrow/issues/33243
[2] https://github.com/wjones127/arrow-ipc-bench
[3] https://stackoverflow.com/a/75402621/2048858

Re: Plasma will be removed in Arrow 12.0.0

Posted by David Li <li...@apache.org>.
I'd suggest explicitly chunking the table into batches of maybe ~2 MiB (the table appears to be one contiguous chunk, and I believe Flight will just try to send the entire table as one chunk). IIRC the Flight benchmark over localhost should reach up to a couple of GiB/s. (That said, that still doesn't match the shared memory results.)
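
As a rough illustration of the chunking arithmetic, the sketch below turns a target batch size in bytes into a row count. The function name and its defaults are hypothetical; the assumption is that the result would be passed to something like PyArrow's Table.to_batches(max_chunksize=...), which takes a row count, with the total size taken from Table.nbytes.

```python
def rows_per_batch(total_bytes: int, num_rows: int,
                   target_bytes: int = 2 * 1024 * 1024) -> int:
    """Return how many rows fit in roughly target_bytes, based on average row size."""
    if total_bytes <= 0 or num_rows <= 0:
        raise ValueError("total_bytes and num_rows must be positive")
    # Integer arithmetic avoids float rounding: target / (total / rows).
    rows = target_bytes * num_rows // total_bytes
    # Always send at least one row, and never more rows than the table has.
    return max(1, min(num_rows, rows))


# Example: 10 million rows of a single 8-byte column (80 MB in total).
chunk_rows = rows_per_batch(total_bytes=80_000_000, num_rows=10_000_000)
print(chunk_rows)
```

This uses the average row width, so tables with highly variable row sizes (long strings, nested lists) will only hit the target approximately.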

Flight-UCX with shared memory transport* was more like ~7 GiB/s, as I recall.

But fundamentally, Flight is a client/server RPC framework and not an interprocess shared memory cache, so while you can build a caching service with Flight, it won't be exactly the same as Plasma.

That said, if the goal is just convenience and not necessarily performance, I wonder if providing a reference implementation or recipe based on Flight, or possibly even Redis or memcached, would suffice...

* IIRC, this means that UCX uses shared memory to copy buffers between processes, which is still different than Plasma just mapping the same (immutable) buffer into multiple processes.

On Thu, Mar 16, 2023, at 10:35, Antoine Pitrou wrote:
> 0.5 GB/second for local Flight transfer seems unexpectedly slow (one
> could expect 10x more), but perhaps the tuning of default parameters needs
> to be improved. David Li can probably elaborate on that.
>
> I'll add that Unix sockets might not be the fastest anymore these days. 
> It may be worth testing on TCP.
>
> Regards
>
> Antoine.

Re: Plasma will be removed in Arrow 12.0.0

Posted by Antoine Pitrou <an...@python.org>.
0.5 GB/second for local Flight transfer seems unexpectedly slow (one
could expect 10x more), but perhaps the tuning of default parameters needs
to be improved. David Li can probably elaborate on that.

I'll add that Unix sockets might not be the fastest anymore these days. 
It may be worth testing on TCP.

Regards

Antoine.



Re: Plasma will be removed in Arrow 12.0.0

Posted by Will Jones <wi...@gmail.com>.
Thanks for the feedback on the benchmark. By switching from a Unix domain
socket to TCP and reducing the batch size to under 5 MB, I was able to get
nearly 5 Gbps throughput. I think Unix domain sockets are just slower on
Macs. I have updated the repo. [1]

[1] https://github.com/wjones127/arrow-ipc-bench/tree/main
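
To check whether Unix domain sockets or TCP loopback is faster on a given
machine, independent of Flight, a stdlib-only sketch along these lines may
help. It is not taken from the benchmark repo: the function name, transfer
size, and chunk size are arbitrary assumptions, and raw socket throughput
will not match Flight numbers since there is no gRPC or serialization layer.

```python
import os
import socket
import tempfile
import threading
import time

TOTAL_BYTES = 32 * 1024 * 1024  # arbitrary transfer size for the sketch
CHUNK_BYTES = 1024 * 1024       # arbitrary chunk size per send/recv call


def measure(family, address):
    """Send TOTAL_BYTES over a local stream socket; return (bytes received, seconds)."""
    server = socket.socket(family, socket.SOCK_STREAM)
    server.settimeout(10)  # avoid hanging forever if the sender fails
    server.bind(address)
    server.listen(1)

    def sender():
        with socket.socket(family, socket.SOCK_STREAM) as client:
            client.connect(server.getsockname())
            chunk = bytes(CHUNK_BYTES)
            for _ in range(TOTAL_BYTES // CHUNK_BYTES):
                client.sendall(chunk)

    thread = threading.Thread(target=sender)
    start = time.perf_counter()
    thread.start()
    conn, _ = server.accept()
    received = 0
    with conn:
        while True:
            data = conn.recv(CHUNK_BYTES)
            if not data:  # sender closed the connection
                break
            received += len(data)
    elapsed = time.perf_counter() - start
    thread.join()
    server.close()
    return received, elapsed


with tempfile.TemporaryDirectory() as tmp:
    unix_bytes, unix_secs = measure(socket.AF_UNIX, os.path.join(tmp, "bench.sock"))
tcp_bytes, tcp_secs = measure(socket.AF_INET, ("127.0.0.1", 0))  # port 0: pick a free port

for label, nbytes, secs in [("unix", unix_bytes, unix_secs), ("tcp", tcp_bytes, tcp_secs)]:
    print(f"{label}: {nbytes / secs / 2**30:.2f} GiB/s")
```

A single short run like this is noisy, so repeating it a few times and
varying the chunk size gives a more trustworthy comparison.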


Re: Plasma will be removed in Arrow 12.0.0

Posted by Antoine Pitrou <an...@python.org>.
On 17/03/2023 at 16:34, Alessandro Molina wrote:
> How does PyArrow cope with multiprocessing.Manager?

I'm not sure anyone has tried it. Also, I don't think
multiprocessing.Manager was updated to use pickle protocol 5 out-of-band
buffers (which would help reduce copying), so I wouldn't expect very high
performance.

Generally, I don't think multiprocessing.Manager is used very much these
days. It also doesn't receive a lot of maintenance.
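
For context on what out-of-band buffers buy you, here is a small stdlib-only
sketch comparing in-band and out-of-band pickling of a large buffer. It uses
a plain bytearray wrapped in pickle.PickleBuffer as a stand-in for Arrow
data. How PyArrow objects or multiprocessing.Manager behave is not shown
here.

```python
import pickle

data = bytearray(b"x" * 1_000_000)  # stand-in for a large data buffer

# In-band: the buffer contents are copied into the pickle stream itself.
in_band = pickle.dumps(pickle.PickleBuffer(data), protocol=5)

# Out-of-band: the stream only records a placeholder, and the buffer is
# handed to buffer_callback so a transport can ship it without copying.
buffers = []
out_of_band = pickle.dumps(
    pickle.PickleBuffer(data), protocol=5, buffer_callback=buffers.append
)

# The receiver must supply the same buffers, in the same order.
restored = pickle.loads(out_of_band, buffers=buffers)

print(len(in_band), len(out_of_band), len(buffers))
```

The out-of-band stream stays tiny regardless of the buffer size, which is
why a transport that supports it can avoid one or more copies of the data.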


Re: Plasma will be removed in Arrow 12.0.0

Posted by Alessandro Molina <al...@voltrondata.com.INVALID>.
How does PyArrow cope with multiprocessing.Manager? I remember there were
some inefficiencies when pickle was used (mostly related to slicing), but
in theory it should work.
That is probably an easy enough replacement for Plasma, and it is standard.
