Posted to user@arrow.apache.org by Hagai Har-Gil <ha...@protonmail.com> on 2021/08/19 14:31:56 UTC

[IPC] Concurrent reads and writes to a Stream

Hello,

My IPC use case involves writing a Windows-based real-time data stream in Python, and reading it back concurrently with Rust. I tried using the IPC protocol to accomplish that, but I'm currently unsuccessful. The issue I'm facing is probably related to concurrent access to a file by the two different processes, where one is attempting to write and the other to read. Code examples can be found in this issue that I filed against the Rust (unofficial) IPC implementation (https://github.com/jorgecarleitao/arrow2/issues/301), but I believe that this problem isn't due to a faulty implementation.

My question is a bit more broad - is my use case a valid one for the IPC protocol? Do I need to (somehow) perform checks and synchronization between the two concurrent processes so that they don't access the file simultaneously? This is obviously easier when the two "actors" live under the same program and mutexes are available, but two different processes in two different languages seem more difficult to sync.

Is there another way to achieve this? My fallback is basically writing lots of tiny independent Parquet files in Python and accessing them (with some delay) in Rust. This might work but it's not a clean solution, and it may raise other filesystem-oriented issues depending on my data rate and file size limit.

Thanks in advance.

Re: [IPC] Concurrent reads and writes to a Stream

Posted by Hagai Har-Gil <ha...@protonmail.com>.
Thanks for the reply, I'll try that socket solution out.

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐

On Thursday, August 19th, 2021 at 6:56 PM, Antoine Pitrou <an...@python.org> wrote:

> On Thu, 19 Aug 2021 15:40:53 +0000
> Hagai Har-Gil hagaihargil@protonmail.com wrote:
>
> > Right - I have a different app that uses sockets in another context for a similar goal.
> >
> > The thing is - the Stream object is "advertised" (so to say) as a suitable holder for such data. E.g., looking at the docs for `pyarrow.ipc.open_stream()` and `pyarrow.ipc.NativeFile`, they specifically mention how this is the right approach when doing streaming, and I assumed that concurrent reading from that stream is a viable use case for such files.
> >
> > Perhaps I'm just completely ignorant of this topic and should've realized that a NativeFile can't support this use case, but I believe that a minimal warning against such "abuse" of the IPC protocol might be helpful in the future.
>
> Well, the IPC protocol does not change the semantics of the underlying
> file. If you're using a regular disk file, then by construction there's
> no guarding against unsynchronised access. If you're using a socket,
> then you get synchronisation by construction.
>
> I notice it is not possible currently to create a pyarrow.OSFile
> from a file descriptor:
> https://issues.apache.org/jira/browse/ARROW-10906
>
> However, you should be able to create a pyarrow.PythonFile from a
> Python socket's file object (obtained using socket.makefile()). It
> will be less performant, but should hopefully work.
>
> Regards
>
> Antoine.

Re: [IPC] Concurrent reads and writes to a Stream

Posted by Antoine Pitrou <an...@python.org>.
On Thu, 19 Aug 2021 15:40:53 +0000
Hagai Har-Gil <ha...@protonmail.com> wrote:
> Right - I have a different app that uses sockets in another context for a similar goal.
> 
> The thing is - the Stream object is "advertised" (so to say) as a suitable holder for such data. E.g., looking at the docs for `pyarrow.ipc.open_stream()` and `pyarrow.ipc.NativeFile`, they specifically mention how this is the right approach when doing streaming, and I assumed that concurrent reading from that stream is a viable use case for such files.
> 
> Perhaps I'm just completely ignorant of this topic and should've realized that a NativeFile can't support this use case, but I believe that a minimal warning against such "abuse" of the IPC protocol might be helpful in the future.

Well, the IPC protocol does not change the semantics of the underlying
file.  If you're using a regular disk file, then by construction there's
no guarding against unsynchronised access.  If you're using a socket,
then you get synchronisation by construction.
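
To illustrate the regular-file case, here is a minimal standard-library sketch. It assumes a hand-rolled length-prefixed frame as a stand-in for an IPC message (this is not Arrow's actual wire format), and the variable names are made up for the example. The reader gets a short read, with no error and no blocking, when it looks at the file while the writer is halfway through a frame:

```python
import os
import struct
import tempfile

payload = b"0123456789"
# Hypothetical framing: 4-byte little-endian length followed by the body.
frame = struct.pack("<I", len(payload)) + payload

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "stream.bin")
    with open(path, "wb") as writer, open(path, "rb") as reader:
        # The writer has flushed only the header and half of the body
        # at the moment the reader looks at the file.
        writer.write(frame[:9])
        writer.flush()

        (length,) = struct.unpack("<I", reader.read(4))
        body = reader.read(length)  # returns at once with what is on disk
        truncated = len(body) < length

        # The rest shows up later, but the reader only sees it if it
        # knows that it has to come back and read again.
        writer.write(frame[9:])
        writer.flush()
        rest = reader.read(length - len(body))

print(truncated, body, rest)
```

A reader that treats the short read as end-of-file, or as a corrupt message, will fail in this situation, which is likely what happens when two processes share an on-disk stream without coordination.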

I notice it is not possible currently to create a pyarrow.OSFile
from a file descriptor:
https://issues.apache.org/jira/browse/ARROW-10906

However, you should be able to create a pyarrow.PythonFile from a
Python socket's file object (obtained using socket.makefile()).  It
will be less performant, but should hopefully work.

Regards

Antoine.



Re: [IPC] Concurrent reads and writes to a Stream

Posted by Hagai Har-Gil <ha...@protonmail.com>.
Right - I have a different app that uses sockets in another context for a similar goal.

The thing is - the Stream object is "advertised" (so to say) as a suitable holder for such data. E.g., looking at the docs for `pyarrow.ipc.open_stream()` and `pyarrow.ipc.NativeFile`, they specifically mention how this is the right approach when doing streaming, and I assumed that concurrent reading from that stream is a viable use case for such files. In addition, I haven't seen in the official Arrow docs anything warning against such concurrent reads.

Perhaps I'm just completely ignorant of this topic and should've realized that a NativeFile can't support this use case, but I believe that a minimal warning against such "abuse" of the IPC protocol might be helpful in the future. Alternatively, advise users that wish to concurrently read and write to use a socket as the "file" connecting the two processes.

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐

On Thursday, August 19th, 2021 at 5:47 PM, Antoine Pitrou <an...@python.org> wrote:

> On Thu, 19 Aug 2021 14:31:56 +0000
> Hagai Har-Gil hagaihargil@protonmail.com wrote:
>
> > Hello,
> >
> > My IPC use case involves writing a Windows-based real-time data stream in Python, and reading it back concurrently with Rust. I tried using the IPC protocol to accomplish that, but I'm currently unsuccessful. The issue I'm facing is probably related to concurrent access to a file by the two different processes, where one is attempting to write and the other to read. Code examples can be found in this issue that I filed against the Rust (unofficial) IPC implementation (https://github.com/jorgecarleitao/arrow2/issues/301), but I believe that this problem isn't due to a faulty implementation.
> >
> > My question is a bit more broad - is my use case a valid one for the IPC protocol?
>
> Well... what you're trying to do is usually achieved by having one
> process writing to e.g. a pipe or socket, and the other process reading
> from the other end of the pipe or socket. Using an actual on-disk file
> for that sounds a bit weird.
>
> Regards
>
> Antoine.

Re: [IPC] Concurrent reads and writes to a Stream

Posted by Antoine Pitrou <an...@python.org>.
On Thu, 19 Aug 2021 14:31:56 +0000
Hagai Har-Gil <ha...@protonmail.com> wrote:

> Hello,
> 
> My IPC use case involves writing a Windows-based real-time data stream in Python, and reading it back concurrently with Rust. I tried using the IPC protocol to accomplish that, but I'm currently unsuccessful. The issue I'm facing is probably related to concurrent access to a file by the two different processes, where one is attempting to write and the other to read. Code examples can be found in this issue that I filed against the Rust (unofficial) IPC implementation (https://github.com/jorgecarleitao/arrow2/issues/301), but I believe that this problem isn't due to a faulty implementation.
> 
> My question is a bit more broad - is my use case a valid one for the IPC protocol?

Well... what you're trying to do is usually achieved by having one
process writing to e.g. a pipe or socket, and the other process reading
from the other end of the pipe or socket.  Using an actual on-disk file
for that sounds a bit weird.
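
A minimal sketch of the pipe variant, using only the Python standard library. A child Python process stands in for the real-time producer and the parent plays the consumer. The length-prefixed framing and the names `PRODUCER`, `read_exact` and `read_frames` are assumptions made for the example and are not part of Arrow. The point is that a read on the pipe blocks until the writer has produced the bytes, so no extra locking between the two processes is needed:

```python
import struct
import subprocess
import sys

# Source of the child process: it writes three length-prefixed frames to
# its stdout, pausing between them like a slow real-time producer.
PRODUCER = r"""
import struct, sys, time
out = sys.stdout.buffer
for i in range(3):
    payload = f"frame-{i}".encode()
    out.write(struct.pack("<I", len(payload)) + payload)
    out.flush()
    time.sleep(0.05)
"""


def read_exact(stream, n):
    """Return exactly n bytes, or None if the stream ended before any arrived."""
    data = stream.read(n)  # blocks until n bytes arrive or the pipe closes
    if not data:
        return None
    if len(data) < n:
        raise EOFError("pipe closed in the middle of a frame")
    return data


def read_frames(stream):
    """Collect frames until the writer closes its end of the pipe."""
    frames = []
    while True:
        header = read_exact(stream, 4)
        if header is None:
            break  # clean end of stream
        (length,) = struct.unpack("<I", header)
        payload = read_exact(stream, length)
        if payload is None:
            raise EOFError("pipe closed after a frame header")
        frames.append(payload)
    return frames


with subprocess.Popen(
    [sys.executable, "-c", PRODUCER], stdout=subprocess.PIPE
) as proc:
    frames = read_frames(proc.stdout)

print(frames)
```

The consumer never sees a half-written frame as end-of-data: it either waits for the remaining bytes or learns that the producer has closed the pipe.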

Regards

Antoine.