You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@arrow.apache.org by "amoeba (via GitHub)" <gi...@apache.org> on 2023/11/06 21:42:17 UTC

[I] [Python][FlightRPC] Feature request: Add some sort of fork protection to PyArrow Flight [arrow]

amoeba opened a new issue, #38617:
URL: https://github.com/apache/arrow/issues/38617

   ### Describe the enhancement requested
   
   GRPC (and by extension FlightRPC) isn't fork-safe:
   
   > gRPC Python wraps gRPC core, which uses multithreading for performance, and hence doesn't support fork() [[Source](https://github.com/grpc/grpc/blob/master/doc/fork_support.md)]
   
   While it may be generally expected for library users to consider thread/fork safety while using such mechanisms, Python's [multiprocessing](https://docs.python.org/3/library/multiprocessing.html) package makes it very easy for users who are unfamiliar with these concepts to run into trouble. In the case of FlightRPC, users can get intro trouble when they fork _after_ instantiating a `FlightClient` in a failed attempt to increase performance when they should fork before importing and using `pyarrow.flight` or probably consider another approach. When operators of FlightRPC servers encounter users making this type of mistake, debugging can be very difficult because the errors seen on the server may make little sense since the clients are essentially broken.
   
   Improved documentation and, better, explicit prevention of fork might make Flight easier to implement for users.
   
   There are a couple of different concerns here:
   
   - **Where to put such a prevention:** I think it makes more sense to put this into PyArrow Flight rather than C++ so that not all implementations need to pay any cost associated with it. Plus, it's a common pattern in the Python ecosystem to protect users from common mistakes.
   - **How it could be done:** Python has a [mechanism to register fork handlers](https://docs.python.org/3/library/os.html#os.register_at_fork) but I don't think fork is the only way to run into trouble so I think a PID-check of sorts similar to how [fsspec does it](https://github.com/fsspec/filesystem_spec/pull/572) could be built into FlightClient and would let us offer the highest level of prevention.
   
   I'm curious if others think this is a good idea or not and if anyone has thoughts on the approach.
   
   ### Component(s)
   
   FlightRPC, Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] [Python][FlightRPC] Feature request: Add some sort of fork protection to PyArrow Flight [arrow]

Posted by "amoeba (via GitHub)" <gi...@apache.org>.
amoeba commented on issue #38617:
URL: https://github.com/apache/arrow/issues/38617#issuecomment-1894783486

   Hi @pitrou, I did some testing so we're clear what actually breaks and how. I've pushed four separate reproducers to a branch on my fork at https://github.com/amoeba/arrow/tree/gh-38617/gh-38617 (see README in that folder).
   
   I tested using `os.fork` directly and also `multiprocessing.Pool` (the latter was how this issue was originally encountered) and the summary of what I found is:
   
   - Forking then using a FlightClient from the parent process seems to work fine
   - Other combinations do not:
     - Forking then using a FlightClient from the child process: The RPC never appears to run, the server never appears to see it.
     - Forking then using the FlightClient from both the parent and the child: Different failures happen including segfault, sigbus, hangs, cryptic GRPC errors
     - Forking using `multiprocesessing.Pool`, giving the workers access to the FlightClient via global (since we can't pickle): No child calls to the RPC ever appear to start, they just hang.
   
   Let me know if other tests would be useful and if the ones I've done make sense.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] [Python][FlightRPC] Feature request: Add some sort of fork protection to PyArrow Flight [arrow]

Posted by "amoeba (via GitHub)" <gi...@apache.org>.
amoeba commented on issue #38617:
URL: https://github.com/apache/arrow/issues/38617#issuecomment-1899434302

   Yeah, I think that's the situation we want to cover here. 
   
   I was originally thinking this could just be built into FlightClient and not the entire DLL/module:  Save the PID when we create a FlightClient, do a PID-check either on every RPC or in a background thread like it appears fsspec does, and raise an Exception when the PID doesn't match. My reasoning for putting the check just in FlightClient was to limit the impact to just where we expect the check to be useful since we the two methods have overhead. Would remembering the PID at DLL load or module import work better?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] [Python][FlightRPC] Feature request: Add some sort of fork protection to PyArrow Flight [arrow]

Posted by "amoeba (via GitHub)" <gi...@apache.org>.
amoeba commented on issue #38617:
URL: https://github.com/apache/arrow/issues/38617#issuecomment-1850777571

   Yes, it should also hold for FlightServer, @nph. Do you have a use case where it'd be useful to have a similar check in FlightServer?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] [Python][FlightRPC] Feature request: Add some sort of fork protection to PyArrow Flight [arrow]

Posted by "nph (via GitHub)" <gi...@apache.org>.
nph commented on issue #38617:
URL: https://github.com/apache/arrow/issues/38617#issuecomment-1850967935

   Thanks @amoeba. I do have a use case for which a similar check in FlightServer would be useful. I have a FlightServer that serves batches of features to the client for ML training/inference tasks. The data sampling (from Arrow datasets) and feature computation on the server can take advantage of multiprocessing. Happy to discuss more.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] [Python][FlightRPC] Feature request: Add some sort of fork protection to PyArrow Flight [arrow]

Posted by "nph (via GitHub)" <gi...@apache.org>.
nph commented on issue #38617:
URL: https://github.com/apache/arrow/issues/38617#issuecomment-1851213170

   That all makes sense - thanks for clarifying @amoeba. I can't think of an obvious parallel on the server side but if the check is relatively lightweight then it seems like it might be worthwhile adding it to FlightServer? Or perhaps just improving the docs as per your original suggestion above. Cheers.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] [Python][FlightRPC] Feature request: Add some sort of fork protection to PyArrow Flight [arrow]

Posted by "amoeba (via GitHub)" <gi...@apache.org>.
amoeba commented on issue #38617:
URL: https://github.com/apache/arrow/issues/38617#issuecomment-1819712891

   Thanks @pitrou, I can work that up and report back.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] [Python][FlightRPC] Feature request: Add some sort of fork protection to PyArrow Flight [arrow]

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on issue #38617:
URL: https://github.com/apache/arrow/issues/38617#issuecomment-1819674428

   It would be good to find out exactly what doesn't work:
   * creating a Flight client, forking, and using the client from the parent process?
   * creating a Flight client, forking, and using the client from the child process?
   * something else
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] [Python][FlightRPC] Feature request: Add some sort of fork protection to PyArrow Flight [arrow]

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on issue #38617:
URL: https://github.com/apache/arrow/issues/38617#issuecomment-1898463937

   Ok, so any situation where the Flight component is used from a process which is not the process used to load it initially should be flagged, right? Something like "remember the PID at DLL load time or module import time, then return an error if trying to do something from another PID"?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] [Python][FlightRPC] Feature request: Add some sort of fork protection to PyArrow Flight [arrow]

Posted by "nph (via GitHub)" <gi...@apache.org>.
nph commented on issue #38617:
URL: https://github.com/apache/arrow/issues/38617#issuecomment-1850757397

   > In the case of FlightRPC, users can get intro trouble when they fork after instantiating a FlightClient...
   
   Does the same logic hold for a `FlightServer` instance too? Seems like it would do.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] [Python][FlightRPC] Feature request: Add some sort of fork protection to PyArrow Flight [arrow]

Posted by "nph (via GitHub)" <gi...@apache.org>.
nph commented on issue #38617:
URL: https://github.com/apache/arrow/issues/38617#issuecomment-1851276217

   Sounds good 👍 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] [Python][FlightRPC] Feature request: Add some sort of fork protection to PyArrow Flight [arrow]

Posted by "lidavidm (via GitHub)" <gi...@apache.org>.
lidavidm commented on issue #38617:
URL: https://github.com/apache/arrow/issues/38617#issuecomment-1827852562

   I think a PID check or similar in the Python bindings would be sufficient.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] [Python][FlightRPC] Feature request: Add some sort of fork protection to PyArrow Flight [arrow]

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on issue #38617:
URL: https://github.com/apache/arrow/issues/38617#issuecomment-1851873450

   A PID check won't tell you anything in the parent, so if that's what you're looking for, you can probably register atfork handlers.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] [Python][FlightRPC] Feature request: Add some sort of fork protection to PyArrow Flight [arrow]

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on issue #38617:
URL: https://github.com/apache/arrow/issues/38617#issuecomment-1916445147

   Right, so it seems we should store a global PID at the DLL level, since even a client created in a forked child isn't safe to use.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] [Python][FlightRPC] Feature request: Add some sort of fork protection to PyArrow Flight [arrow]

Posted by "amoeba (via GitHub)" <gi...@apache.org>.
amoeba commented on issue #38617:
URL: https://github.com/apache/arrow/issues/38617#issuecomment-1819663528

   I'm pinging a few folks I think it would be good to get feedback from: @pitrou, @jorisvandenbossche, @zeroshade, @lidavidm. Anyone else is also encouraged to chime in too.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] [Python][FlightRPC] Feature request: Add some sort of fork protection to PyArrow Flight [arrow]

Posted by "amoeba (via GitHub)" <gi...@apache.org>.
amoeba commented on issue #38617:
URL: https://github.com/apache/arrow/issues/38617#issuecomment-1851006592

   So what would you say the check would help you with in this case?
   
   The idea of adding fork prevention in FlightClient is that the user would get a warning or error and would then they would know to rewrite their code. This is relevant for adopters of Flight because they might not control the code that gets written on the client side and this check would help save them time debugging and educating their users.
   
   Is there a parallel for the server side? If you have some general questions about building scalable services on top of Arrow and Flight, we can discuss that elsewhere.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] [Python][FlightRPC] Feature request: Add some sort of fork protection to PyArrow Flight [arrow]

Posted by "amoeba (via GitHub)" <gi...@apache.org>.
amoeba commented on issue #38617:
URL: https://github.com/apache/arrow/issues/38617#issuecomment-1851223558

   I think maybe for the scope of this ticket, adding a PID-check just to FlightClient and improving the docs would be good start. If we find others are interested in having the same check on FlightServer, it would be pretty easy to re-use the implementation here. I'll let others chime in too.
   
   If you have any questions about scaling Flight or find the project is lacking documentation you wish it had, feel free to open another issue or reach out on the mailing list (See https://arrow.apache.org/community/).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] [Python][FlightRPC] Feature request: Add some sort of fork protection to PyArrow Flight [arrow]

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on issue #38617:
URL: https://github.com/apache/arrow/issues/38617#issuecomment-1906086557

   When you say "Forking then using a FlightClient from the child process" above, was the FlightClient created in the parent _before forking_ or in the child?
   
   Also, what happens if you create a first Flight client in the parent before forking, then create another one from child?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] [Python][FlightRPC] Feature request: Add some sort of fork protection to PyArrow Flight [arrow]

Posted by "amoeba (via GitHub)" <gi...@apache.org>.
amoeba commented on issue #38617:
URL: https://github.com/apache/arrow/issues/38617#issuecomment-1915973767

   > When you say "Forking then using a FlightClient from the child process" above, was the FlightClient created in the parent before forking or in the child?
   
   Yeah, I should've made this clear. In these tests the client was created in the parent prior to forking.
   
   > Also, what happens if you create a first Flight client in the parent before forking, then create another one from child?
   
   I ran this a bunch of times and see different behaviors at random, similar to what I saw above using a single client from both the parent and child:
   
   - Client hangs, server doesn't see any RPCs, eventually fails with `pyarrow._flight.FlightUnavailableError: Flight returned unavailable error, with message: failed to connect to all addresses; last error: UNKNOWN: ipv4:0.0.0.0:8815: tcp handshaker shutdown`
   - Client hangs after server sees one RPC, I eventually interrupt it
   - Client runs and exits cleanly, only one RPC (from the parent) gets called
   - Client process segfaults after the server sees one RPC. I think this only happens if I don't restart the server between test runs.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org