Posted to user@arrow.apache.org by Louis C <lc...@outlook.fr> on 2020/09/21 08:59:46 UTC

[C++][Python] Shared memory with Arrow ?

Hello,
Excuse me if this is a frequent question, but I am trying to find a way to share data (Feather/Parquet tables, for instance) between different processes (IPC). The ideal would be to use shared memory, so that I could write data with one process and read it with another without any copy. The processes could be two separate C++ processes, or one C++ program and one Python program. The platform would be primarily Windows, but it would be better if it were also compatible with Linux.
As I understand it, Arrow should be able to do something like this, but I can’t find the proper way to do it.
I looked into mapped files, but they seem useful only for reading data, since one needs to know the size of the data before writing it to a mapped file. I tried Flight too, but it is not shared-memory IPC. There was also Plasma, but it no longer seems to be fully maintained (and is not available for Windows at the moment).
Is there a way to achieve this with Arrow?
Kind regards
Louis C
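
For readers following the thread, the constraint described above (the size must be known before the region is created) can be seen in a minimal sketch that uses only Python's standard `multiprocessing.shared_memory` module, which works on both Windows and Linux. It is not an Arrow API. The helper names `write_block` and `read_block`, the 8-byte length header, and the segment name are assumptions made for illustration.

```python
import struct
from multiprocessing import shared_memory

# Hypothetical layout: an 8-byte little-endian length, then the payload.
HEADER = struct.Struct("<Q")


def write_block(name, payload, capacity):
    """Create a named segment of fixed capacity and store a length-prefixed payload."""
    if HEADER.size + len(payload) > capacity:
        raise ValueError("payload does not fit in the reserved segment")
    # The capacity has to be chosen here, before any data is written.
    shm = shared_memory.SharedMemory(name=name, create=True, size=capacity)
    HEADER.pack_into(shm.buf, 0, len(payload))
    shm.buf[HEADER.size:HEADER.size + len(payload)] = payload
    return shm  # the writer keeps this alive, then calls close() and unlink()


def read_block(name):
    """Attach to an existing segment by name and return the stored payload."""
    shm = shared_memory.SharedMemory(name=name)
    try:
        (length,) = HEADER.unpack_from(shm.buf, 0)
        # Copied out for simplicity; a zero-copy reader would keep working
        # on shm.buf directly and close the segment only when done.
        return bytes(shm.buf[HEADER.size:HEADER.size + length])
    finally:
        shm.close()
```

In a real setup the writer and the reader would be separate processes that agree on the segment name. The length header is one way of "keeping track of the real size" when the reserved capacity is larger than the data.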

RE: [C++][Python] Shared memory with Arrow ?

Posted by Louis C <lc...@outlook.fr>.
Hello Uwe,

Thanks for your quick reply!
To answer your question, the use case is this: say I have a table A in a particular (non-Arrow) format in the C++ program, potentially very big, that I want to transfer to a Python program (and use as a pandas DataFrame, for instance). I understand it is better to use Feather if we want to avoid copies. As the table can be quite big, I export it progressively, in chunks, to reduce the memory footprint (and potentially speed up computation), for example in Parquet using WriteColumnChunk of the parquet::FileWriter class. I used to do this for Feather too, but since version 0.17 the API for exporting by chunks seems to have disappeared... So I know I will have Arrow objects, but I do not know their precise size before they are written entirely.
But indeed, as you said, I could try to compute their size beforehand, or build the entire thing I want to export before exporting it. Another solution would be to take a rough estimate of the size and reserve enough memory, then shrink it to the correct value afterwards (truncating the file), or keep track of the real size of the file somewhere.
Anyway, thanks for your answer. I was not sure whether mapped files were the way to go in Arrow.

Cheers,
Louis
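
The "rough estimate, then truncate" idea from this message could look roughly like the sketch below. It uses only Python's standard `mmap` module and plain byte chunks rather than Arrow data. The function name `write_chunks_mapped` and its parameters are hypothetical, and the sketch simply fails if the estimate turns out to be too small.

```python
import mmap


def write_chunks_mapped(path, chunks, reserve):
    """Write chunks into a file pre-sized to `reserve` bytes, then trim it."""
    # Reserve the rough estimate up front so the file can be mapped.
    with open(path, "wb") as f:
        f.truncate(reserve)

    used = 0
    with open(path, "r+b") as f:
        mm = mmap.mmap(f.fileno(), reserve)
        try:
            for chunk in chunks:
                end = used + len(chunk)
                if end > reserve:
                    raise ValueError("size estimate was too small")
                mm[used:end] = chunk  # write directly into the mapping
                used = end
            mm.flush()
        finally:
            mm.close()  # unmap before resizing; Windows requires this
        # Shrink the file to the number of bytes actually written.
        f.truncate(used)
    return used
```

A reader in another process can then map the file at its final size. If the file must stay mapped while readers are attached, the alternative mentioned above applies: leave the file at its reserved size and store the real length somewhere both sides can see it.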
________________________________
From: Uwe L. Korn <uw...@xhochy.com>
Sent: Monday, 21 September 2020 22:27
To: user@arrow.apache.org <us...@arrow.apache.org>
Subject: Re: [C++][Python] Shared memory with Arrow ?

Hello Louis,

As you already mentioned, mapped files (the Windows name for shared memory) need the size to be known ahead of time. This is the same on other operating systems, too. Flight will copy the data when transferring it from one process to another, so there you will have the copy again.

So, to better understand your use case: why aren't you able to calculate the size beforehand? To construct an Arrow structure, you also need to know its size. When using the builders for incremental creation, we have tuned everything to minimize the number of copies, but they still copy when the size doesn't match and we cannot extend the existing memory region in place.

Cheers
Uwe


Re: [C++][Python] Shared memory with Arrow ?

Posted by "Uwe L. Korn" <uw...@xhochy.com>.
Hello Louis,

As you already mentioned, mapped files (the Windows name for shared memory) need the size to be known ahead of time. This is the same on other operating systems, too. Flight will copy the data when transferring it from one process to another, so there you will have the copy again.

So, to better understand your use case: why aren't you able to calculate the size beforehand? To construct an Arrow structure, you also need to know its size. When using the builders for incremental creation, we have tuned everything to minimize the number of copies, but they still copy when the size doesn't match and we cannot extend the existing memory region in place.

Cheers 
Uwe
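
As a rough illustration of calculating the size beforehand: for a fixed-width column, the buffer sizes follow from the row count alone. The sketch below assumes the 64-byte padding that the Arrow columnar format recommends (the IPC format requires at least multiples of 8 bytes). The function names are hypothetical. It counts only the validity bitmap and the data buffer, not IPC metadata such as schema or record batch messages, and it does not cover variable-width columns such as strings, whose data size depends on the values.

```python
def padded(nbytes, alignment=64):
    """Round nbytes up to the next multiple of `alignment`."""
    return (nbytes + alignment - 1) // alignment * alignment


def fixed_width_column_bytes(num_rows, byte_width, nullable=True, alignment=64):
    """Estimate the buffer bytes for one fixed-width column.

    Counts the data buffer plus, for nullable columns, a validity
    bitmap with one bit per row. Metadata is not included.
    """
    data = padded(num_rows * byte_width, alignment)
    validity = padded((num_rows + 7) // 8, alignment) if nullable else 0
    return validity + data
```

For example, 1000 rows of a nullable 8-byte type need a 125-byte bitmap (padded to 128) plus 8000 bytes of data, so `fixed_width_column_bytes(1000, 8)` gives 8128. Summing such figures over all columns gives a lower bound on the size to reserve, to which some margin for metadata would have to be added.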
