You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@arrow.apache.org by "kou (via GitHub)" <gi...@apache.org> on 2023/06/19 07:28:33 UTC

[GitHub] [arrow] kou opened a new issue, #36155: [C++][Go][FlightRPC] Add support for long-running queries

kou opened a new issue, #36155:
URL: https://github.com/apache/arrow/issues/36155

   ### Describe the enhancement requested
   
   Based on the proposal in https://docs.google.com/document/d/1jhPyPZSOo2iy0LqIJVUs9KWPyFULVFJXTILDfkadx2g/edit# .
   See also the discussion thread: https://lists.apache.org/thread/247z3t06mf132nocngc1jkp3oqglz7jp
   
   In Flight RPC, FlightInfo includes addresses of workers alongside result partition info. This lets clients fetch data directly from workers[^1], in parallel or even distributed across multiple machines. But this also comes with tradeoffs.
   
   [^1]: Of course, servers are free to return the location of a proxy/load balancer/etc., or omit locations and have the client fetch results from the same server that they issued the query to. Flight RPC offers this flexibility to servers; clients don’t have to know or care. 
   
   Queries generally don’t complete instantly (as much as we would like them to). So where can we put the ‘query evaluation time’?
   
     * In GetFlightInfo: block and wait for the query to complete.
       * Con: this is a long-running blocking call, which may fail or time out. Then when the client retries, the server has to redo all the work.
       * Con: parts of the result may be ready before others, but the client can’t do anything until everything is ready.
     * In DoGet: return a fixed number of partitions
       * Con: this makes handling worker failures hard. Systems like Trino support [fault-tolerant execution](https://trino.io/docs/current/admin/fault-tolerant-execution.html) by replacing workers at runtime. But GetFlightInfo has already passed, so we can’t notify the client of new workers[^2].
       * Con: we have to know or fix the partitioning up front.
   
   [^2]: Again, the server could proxy workers, or depend on Kubernetes DNS routing, or configure gRPC XDS. But this somewhat defeats the point of returning worker locations in the first place, and is much more complicated (operationally, implementation-wise).
   
   Neither solution is optimal.
   
   ### Proposal
   
   We can address this by adding a retryable version of GetFlightInfo. First, we add a new RPC call and result message:
   
   ```protobuf
   service FlightService {
     // ...
     rpc PollFlightInfo(FlightDescriptor) returns (RetryInfo);
   }
   
   message RetryInfo {
     // The currently available results so far.
     FlightInfo info = 1;
     // The descriptor the client should use on the next try.
     // If unset, the query is complete.
     FlightDescriptor retry_descriptor = 2;
     // The descriptor the client should use to cancel the query.
     // If the query is complete or the server does not support query
     // cancellation, this is not set. 
     FlightDescriptor cancel_descriptor = 3;
     // Query progress. Must be in [0.0, 1.0] but need not be
     // monotonic or nondecreasing. If unknown, do not set.
     optional double progress = 4;
     // Expiration time for this request. After this passes, the server
     // might not accept the retry_descriptor anymore (and the query may 
     // be cancelled). This may be updated on a call to PollFlightInfo.
     google.protobuf.Timestamp expiration_time = 5;
   }
   ```
   
   A client executes a query and polls for result completion. The server returns a FlightInfo representing the state of the query execution up to that point. 
   
   ```mermaid
   sequenceDiagram
       Client->>Server: PollFlightInfo(FlightDescriptor)
       Server->>Client: RetryInfo(FlightDescriptor', FlightInfo)
       Client->>Server: PollFlightInfo(FlightDescriptor')
       Server->>Client: RetryInfo(FlightDescriptor'', FlightInfo)
       Client->>Server: PollFlightInfo(FlightDescriptor'')
       Server->>Client: RetryInfo(_, FlightInfo)
   ```
   
   The server:
   * Must respond with the complete FlightInfo each time, not just the delta between the previous and current FlightInfo.
   * Should respond as quickly as possible on the first call.
   * Should not respond until the result would be different from last time. (That way, the client can “long poll” for updates without constantly making requests. Clients can set a short timeout to avoid blocking calls if desired.) 
   * May respond by only updating the “progress” value (though it shouldn’t spam the client with updates).
   * Should recognize a retry_descriptor that is not necessarily the latest (in case the client misses an update in between).
   * Should only append to the endpoints in FlightInfo each time. (Otherwise the client has to do extra work to identify what endpoints it has and hasn’t seen.) 
   
     When endpoints_ordered is set, this means the server returns endpoints in order.
   * Should return an error status instead of a response if the query fails. The client should not retry the request (except for TIMED_OUT and UNAVAILABLE, which may not originate from the server). 
   
   ### Prior Art
   
   * [Amazon Redshift](https://docs.aws.amazon.com/redshift/latest/mgmt/data-api-calling.html): executing a query gives an ID that can be used to check the query status and fetch results.
   * [Google BigQuery Storage](https://github.com/googleapis/googleapis/blob/master/google/cloud/bigquery/storage/v1/storage.proto): you explicitly create a “read session”, after which you can read subsets of the response with further requests. There is no “query execution time” since BigQuery Storage only queries tables. Instead, running a query (with the base BigQuery API) will cache the result in a table that can be read via BigQuery Storage.
   * [Snowflake](https://docs.snowflake.com/en/developer-guide/sql-api/handling-responses.html#understanding-the-flow-of-execution): short queries return synchronously. Longer queries require polling for completion of the query. You cannot retrieve any results until the query is complete.
   
   ### Component(s)
   
   C++, FlightRPC, Format, Go


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] kou commented on issue #36155: [C++][Go][FlightRPC] Add support for long-running queries

Posted by "kou (via GitHub)" <gi...@apache.org>.
kou commented on issue #36155:
URL: https://github.com/apache/arrow/issues/36155#issuecomment-1657347686

   Ah, we can't use `RetryInfo::descriptor` because ProtoBuf generated codes use `descriptor`.
   I'll use `RetryInfo::flight_descriptor` like `FlightInfo::flight_descriptor`...


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] lidavidm commented on issue #36155: [C++][Go][FlightRPC] Add support for long-running queries

Posted by "lidavidm (via GitHub)" <gi...@apache.org>.
lidavidm commented on issue #36155:
URL: https://github.com/apache/arrow/issues/36155#issuecomment-1655560458

   I think the only problem there is, what if the query hasn't progressed far enough yet and there is nothing in the `RetryInfo::info` yet (no endpoints, no schema)?
   
   We could try to handle this by adding a `bytes app_metadata` field to `FlightInfo` like we do for other messages. That way the server always has a field where it can safely store some metadata to identify the query.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] kou commented on issue #36155: [C++][Go][FlightRPC] Add support for long-running queries

Posted by "kou (via GitHub)" <gi...@apache.org>.
kou commented on issue #36155:
URL: https://github.com/apache/arrow/issues/36155#issuecomment-1654905888

   We've added `CancelFlightInfo` action by #35500 but it accepts `FlightInfo` not `FlightDescriptor`.
   So we can't use it for cancelling a running query with `cancel_descriptor`.
   
   I think that we need to use `PollFlightInfo(cancel_descriptor)` instead of `CancelFlightInfo` action to cancel a running query but it may confuse users. Or we can remove `RetryInfo::cancel_descriptor` and use `CancelFlightInfo(RetryInfo::info)` instead.
   
   @lidavidm Do you have any opinion for this case? If you don't have any opinion, I'll use the removing `RetryInfo::cancel_descriptor` and using `CancelFlightInfo(RetryInfo::info)` approach.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] kou commented on issue #36155: [C++][Go][Java][FlightRPC] Add support for long-running queries

Posted by "kou (via GitHub)" <gi...@apache.org>.
kou commented on issue #36155:
URL: https://github.com/apache/arrow/issues/36155#issuecomment-1678314628

   The discussion thread: https://lists.apache.org/thread/qcjpcw6m3p15wqxp6n6rqzlx01v1fl3v
   The vote thread: https://lists.apache.org/thread/rwwqws6ghymx07qtz8d1jy5hsgljq96y
   The vote carried: https://lists.apache.org/thread/9ty2vp3xnfnzcpl3wh5bnzbfszogomx2


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] kou closed issue #36155: [C++][Go][Java][FlightRPC] Add support for long-running queries

Posted by "kou (via GitHub)" <gi...@apache.org>.
kou closed issue #36155: [C++][Go][Java][FlightRPC] Add support for long-running queries
URL: https://github.com/apache/arrow/issues/36155


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] lidavidm commented on issue #36155: [C++][Go][FlightRPC] Add support for long-running queries

Posted by "lidavidm (via GitHub)" <gi...@apache.org>.
lidavidm commented on issue #36155:
URL: https://github.com/apache/arrow/issues/36155#issuecomment-1656417554

   > Or can we put `cancel_descriptor` content to `RetryInfo::info::flight_descriptor`?
   > `FlightInfo::flight_descriptor` must be the same `FlightDescriptor` client sent?
   
   Oh, hmm, you're right. I don't think it's a requirement to be the same. (Actually, I never understood why it's there in the first place!) So we can do that :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] kou commented on issue #36155: [C++][Go][FlightRPC] Add support for long-running queries

Posted by "kou (via GitHub)" <gi...@apache.org>.
kou commented on issue #36155:
URL: https://github.com/apache/arrow/issues/36155#issuecomment-1656381528

   > I think the only problem there is, what if the query hasn't progressed far enough yet and there is nothing in the `RetryInfo::info` yet (no endpoints, no schema)?
   
   Ah, you're right.
   
   > We could try to handle this by adding a `bytes app_metadata` field to `FlightInfo` like we do for other messages. That way the server always has a field where it can safely store some metadata to identify the query.
   
   Or can we put `cancel_descriptor` content to `RetryInfo::info::flight_descriptor`?
   `FlightInfo::flight_descriptor` must be the same `FlightDescriptor` client sent?
   
   (I don't object the `bytes app_metadata` approach.)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] kou commented on issue #36155: [C++][Go][FlightRPC] Add support for long-running queries

Posted by "kou (via GitHub)" <gi...@apache.org>.
kou commented on issue #36155:
URL: https://github.com/apache/arrow/issues/36155#issuecomment-1657310324

   OK. I'll remove `cancel_descriptor` and rename `RetryInfo::retry_descriptor` to `RetryInfo::descriptor` because there is only one descriptor.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org