Posted to dev@arrow.apache.org by Nate Bauernfeind <na...@deephaven.io> on 2021/06/02 21:20:43 UTC

[C++] Async Arrow Flight

It seems to me that the C++ Arrow Flight implementation uses only the
synchronous version of the gRPC API. gRPC supports asynchronous message
delivery in C++ via a CompletionQueue that must be polled. Has there been
any desire to standardize on a solution for asynchronous use cases, perhaps
delivered via a provided CompletionQueue?

For a simple async gRPC C++ example, see:
https://github.com/grpc/grpc/blob/master/examples/cpp/helloworld/greeter_async_client.cc
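
To make the "queue that must be polled" idea concrete, here is a minimal
sketch of the pattern using only the standard library. ToyCompletionQueue,
PendingCall and RunToyCalls are hypothetical names invented for this
illustration; this is a toy stand-in that mimics the shape of a
Next(tag, ok) polling loop, not gRPC's actual class or the linked example.

```cpp
#include <condition_variable>
#include <deque>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Toy stand-in for a polled completion queue. NOT the gRPC class.
class ToyCompletionQueue {
 public:
  // Called by the (simulated) transport when an operation finishes.
  void Push(void* tag, bool ok) {
    {
      std::lock_guard<std::mutex> lock(mu_);
      events_.push_back({tag, ok});
    }
    cv_.notify_one();
  }

  // Signals that no further events will be pushed.
  void Shutdown() {
    {
      std::lock_guard<std::mutex> lock(mu_);
      shutdown_ = true;
    }
    cv_.notify_all();
  }

  // Blocks until an event is available. Returns false once the queue has
  // been shut down and fully drained.
  bool Next(void** tag, bool* ok) {
    std::unique_lock<std::mutex> lock(mu_);
    cv_.wait(lock, [this] { return !events_.empty() || shutdown_; });
    if (events_.empty()) return false;
    *tag = events_.front().tag;
    *ok = events_.front().ok;
    events_.pop_front();
    return true;
  }

 private:
  struct Event {
    void* tag;
    bool ok;
  };
  std::mutex mu_;
  std::condition_variable cv_;
  std::deque<Event> events_;
  bool shutdown_ = false;
};

// Per-call state. Its address doubles as the tag handed to the queue.
struct PendingCall {
  std::string name;
  bool done = false;
};

// Starts n simulated calls, polls until the queue is drained, and returns
// how many calls completed successfully.
int RunToyCalls(int n) {
  ToyCompletionQueue cq;
  std::vector<PendingCall> calls(n);
  std::thread transport([&] {
    for (auto& call : calls) cq.Push(&call, true);
    cq.Shutdown();
  });

  int completed = 0;
  void* tag = nullptr;
  bool ok = false;
  while (cq.Next(&tag, &ok)) {
    auto* call = static_cast<PendingCall*>(tag);
    if (ok) {
      call->done = true;
      ++completed;
    }
  }
  transport.join();
  return completed;
}
```

The caller owns the polling loop and maps each opaque tag back to its own
per-call state, which is the part an API would have to either expose or
hide.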

Thanks,
Nate

--

Re: [C++] Async Arrow Flight

Posted by David Li <li...@apache.org>.

Hey Nate,

Thanks for digging this up. This looks reasonable - Arrow already has
a Future type. I think we could do something similar, but encapsulate
the CompletionQueue inside the Flight client so as not to expose gRPC
implementation details. (I'm actually unsure why their SDK exposes
this in the first place - two applications can't reasonably share one
completion queue. I'm likely missing something.)

Would a Future-based API work for what you're trying to do? For
wrapping asynchronous streams like DoGet, there's now the
AsyncGenerator interface[1], which is essentially an iterator of Futures.
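
As a rough illustration of "an iterator of Futures", the sketch below
models a generator as a callable that returns a future for the next item
each time it is invoked. It uses std::future and std::optional as
stand-ins; ToyAsyncGenerator, MakeVectorGenerator and SumAll are
hypothetical names, and Arrow's own Future and AsyncGenerator types differ
in detail (for example, in how the end of the stream is signalled).

```cpp
#include <cstddef>
#include <functional>
#include <future>
#include <memory>
#include <optional>
#include <utility>
#include <vector>

// Each call yields a future for the next item; std::nullopt marks the end
// of the stream. A stand-in for the idea, not Arrow's AsyncGenerator.
template <typename T>
using ToyAsyncGenerator = std::function<std::future<std::optional<T>>()>;

// Builds a generator over a fixed vector. Every future is already
// fulfilled when returned; a real source would fulfil it later.
ToyAsyncGenerator<int> MakeVectorGenerator(std::vector<int> items) {
  auto state = std::make_shared<std::vector<int>>(std::move(items));
  auto index = std::make_shared<std::size_t>(0);
  return [state, index]() {
    std::promise<std::optional<int>> promise;
    if (*index < state->size()) {
      promise.set_value((*state)[(*index)++]);
    } else {
      promise.set_value(std::nullopt);
    }
    return promise.get_future();
  };
}

// Drains the generator, blocking on each future in turn, and sums the items.
int SumAll(ToyAsyncGenerator<int> gen) {
  int total = 0;
  while (true) {
    std::optional<int> item = gen().get();
    if (!item) break;
    total += *item;
  }
  return total;
}
```

For instance, SumAll(MakeVectorGenerator({1, 2, 3})) pulls three items and
then the end marker. A streaming call such as DoGet could plausibly be
wrapped the same way, with each future completing when the next record
batch arrives.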

We'd also need to expose a way to create such a FlightClient from a
gRPC channel, so that you can share the underlying connection(s). This
part is a little iffy - we don't want to expose gRPC types in the API
- so you might end up having to pass around some type-erased pointers
or such. (And of course again, this won't/can't solve the issue for
Python bindings, since gRPC Python is not implemented on top of
gRPC-C++.)
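
One way to picture the "type-erased pointers" idea is sketched below. All
names (ToyFlightClient, FromOpaqueChannel, ToyChannel, the address string)
are hypothetical and stand in for whatever the real API would use. The
public class accepts a std::shared_ptr<void>, and only the implementation
side knows the concrete type behind it.

```cpp
#include <memory>
#include <string>
#include <utility>

// Public-header side: no transport-specific types appear here.
class ToyFlightClient {
 public:
  // `channel` is opaque to users of this API.
  static ToyFlightClient FromOpaqueChannel(std::shared_ptr<void> channel);
  std::string Target() const;

 private:
  explicit ToyFlightClient(std::shared_ptr<void> channel)
      : channel_(std::move(channel)) {}
  std::shared_ptr<void> channel_;
};

// Implementation side: the concrete type is known here.
// ToyChannel is a hypothetical stand-in for a transport channel.
struct ToyChannel {
  std::string target;
};

ToyFlightClient ToyFlightClient::FromOpaqueChannel(
    std::shared_ptr<void> channel) {
  return ToyFlightClient(std::move(channel));
}

std::string ToyFlightClient::Target() const {
  // Only valid if the caller really passed a ToyChannel; the compiler
  // cannot check this, which is the cost of type erasure.
  auto concrete = std::static_pointer_cast<ToyChannel>(channel_);
  return concrete->target;
}

// The application creates one channel and shares it with the client.
std::string DemoSharedChannel() {
  auto channel = std::make_shared<ToyChannel>(ToyChannel{"localhost:8815"});
  ToyFlightClient flight = ToyFlightClient::FromOpaqueChannel(channel);
  // Other services could keep using `channel` here.
  return flight.Target();
}
```

The unchecked cast is why this approach is "a little iffy": a mismatched
pointer type, or a mismatched library version on either side, is not
caught at compile time.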

On that note, does your server-side need similar functionality? That
is, do you need some way to bind your C++ Flight service and your C++
gRPC service to the same server?

Finally, there's also the possibility of encoding everything in the
Flight DoAction endpoint - though that is definitely much less
convenient especially if you have multiple or complex APIs of your own
alongside Flight.

Best,
David

[1]: https://github.com/apache/arrow/blob/c43fab3d621bedef15470a1be43570be2026af20/cpp/src/arrow/util/async_generator.h

On 2021/06/21 21:42:52, Nate Bauernfeind <na...@deephaven.io> wrote: 
> Google Cloud supports asynchronous grpc in C++ for parts of their API.
> 
> One such client-facing API method is this:
> ```
>   future<StatusOr<google::pubsub::v1::PublishResponse>> AsyncPublish(
>       google::cloud::CompletionQueue& cq,
>       std::unique_ptr<grpc::ClientContext> context,
>       google::pubsub::v1::PublishRequest const& request) override {
>     return cq.MakeUnaryRpc(
>         [this](grpc::ClientContext* context,
>                google::pubsub::v1::PublishRequest const& request,
>                grpc::CompletionQueue* cq) {
>           return grpc_stub_->AsyncPublish(context, request, cq);
>         },
>         request, std::move(context));
>   }
> ```
> Source:
> https://github.com/googleapis/google-cloud-cpp/blob/0dfde771a42cff19ca3ef6e192473ca0646df497/google/cloud/pubsub/internal/publisher_stub.cc#L109-L120
> 
> The implementation integrates with gRPC's CompletionQueue interface:
> https://github.com/googleapis/google-cloud-cpp/blob/main/google/cloud/completion_queue.h
> 
> The completion queue returns a future abiding to this API:
> https://github.com/googleapis/google-cloud-cpp/blob/main/google/cloud/internal/future_base.h
> 
> It looks like a really nice solution as far as async gRPC; I wish gRPC had
> such an implementation in the base library. What do you think of this?
> Would this pattern appeal to you, David - or to anyone listening from the
> larger Arrow community?
> 
> Thanks,
> Nate

Re: [C++] Async Arrow Flight

Posted by Nate Bauernfeind <na...@deephaven.io>.

Google Cloud supports asynchronous gRPC in C++ for parts of their API.

One such client-facing API method is this:
```
  future<StatusOr<google::pubsub::v1::PublishResponse>> AsyncPublish(
      google::cloud::CompletionQueue& cq,
      std::unique_ptr<grpc::ClientContext> context,
      google::pubsub::v1::PublishRequest const& request) override {
    return cq.MakeUnaryRpc(
        [this](grpc::ClientContext* context,
               google::pubsub::v1::PublishRequest const& request,
               grpc::CompletionQueue* cq) {
          return grpc_stub_->AsyncPublish(context, request, cq);
        },
        request, std::move(context));
  }
```
Source:
https://github.com/googleapis/google-cloud-cpp/blob/0dfde771a42cff19ca3ef6e192473ca0646df497/google/cloud/pubsub/internal/publisher_stub.cc#L109-L120

The implementation integrates with gRPC's CompletionQueue interface:
https://github.com/googleapis/google-cloud-cpp/blob/main/google/cloud/completion_queue.h

The completion queue returns a future conforming to this API:
https://github.com/googleapis/google-cloud-cpp/blob/main/google/cloud/internal/future_base.h

It looks like a really nice solution for async gRPC; I wish gRPC had
such an implementation in the base library. What do you think of this?
Would this pattern appeal to you, David - or to anyone listening from the
larger Arrow community?

Thanks,
Nate

On Thu, Jun 3, 2021 at 4:50 PM David Li <li...@apache.org> wrote:

> I see, thanks. That sounds reasonable, and indeed if you are subscribing
> to lots of data sources (instead of just trying to maximize throughput as I
> think most people have done before) async may be helpful. There are
> actually two features being described here  though:
>  1. The C++ Flight client needs some way to wrap an existing gRPC client,
> much like it can in Java. This may be tricky in C++ (and will require
> applications to carefully manage gRPC versions) and will probably not be
> possible for Python.
>  2. The C++ Flight client should offer async APIs (perhaps callback-based
> APIs or perhaps completion-queue based or something else).
> https://issues.apache.org/jira/browse/ARROW-1009 may be relevant here.
> I think gRPC C++ has experimental callback-based async APIs now but it may
> be hard for us to actually take advantage of unless we can draw a hard line
> on minimum required gRPC version. The main question will probably be
> finding someone to do the work.
>
> Best,
> David


--

Re: [C++] Async Arrow Flight

Posted by David Li <li...@apache.org>.

I see, thanks. That sounds reasonable, and indeed if you are subscribing to lots of data sources (instead of just trying to maximize throughput as I think most people have done before) async may be helpful. There are actually two features being described here though:
 1. The C++ Flight client needs some way to wrap an existing gRPC client, much like it can in Java. This may be tricky in C++ (and will require applications to carefully manage gRPC versions) and will probably not be possible for Python.
 2. The C++ Flight client should offer async APIs (perhaps callback-based APIs or perhaps completion-queue based or something else). https://issues.apache.org/jira/browse/ARROW-1009 may be relevant here.
I think gRPC C++ has experimental callback-based async APIs now but it may be hard for us to actually take advantage of unless we can draw a hard line on minimum required gRPC version. The main question will probably be finding someone to do the work.

Best,
David

On Thu, Jun 3, 2021, at 12:11, Nate Bauernfeind wrote:
> In addition to Arrow Flight we have other gRPC APIs that work together as a
> whole. For example, the API client establishes a session with the server.
> Then the client tells the server to create a derivative data stream by
> filtering/joining/computing/etc several source streams. The server will
> keep this derivative stream available as long as the session exists. This
> session is kept alive with a heartbeat. So a small part of this system
> requires a regular (but not overwhelming) heartbeat. If this derivative
> stream has any data source that is live (aka regularly updating) then the
> derivative is also live. The API client would likely want to subscribe to
> the derivative stream, but still needs to heartbeat. The subscription
> (communicated via Flight's DoExchange) should be able to last forever.
> Naturally, this API pattern is simply better suited to the async pattern.
> 
> Ideally, we could re-use the same http2 connections, task queue and/or
> thread-pool between the arrow flight api and the other grpc apis (talking
> to the same endpoint).
> 
> The gRPC callback pattern in Java is nice; it's a bit of a shame that gRPC
> hasn't settled on a c++ callback pattern.
> 
> These are the kinds of ideas that it enables:
> - Subscribe to multiple ticking sources simultaneously.
> - Can have multiple gRPC requests outstanding at the same time
> (particularly useful if server is in a remote data center with high RTT).
> - Can communicate with multiple remote hosts simultaneously.
> - In general, enables the client to be event driven.
> 
> Note that our clients tend to be light; they typically listen to derived
> data and then act based on the information without further computation
> locally.
> 
> The gRPC c++ async library does indeed look a bit under-documented. There
> are a few blog posts that highlight some of the surprises, but async
> behavior is a requirement for what we're working on. (for async-cpp-grpc
> surprises see:
> https://www.gresearch.co.uk/article/lessons-learnt-from-writing-asynchronous-streaming-grpc-services-in-c/
> ).
> 
> Nate

Re: [C++] Async Arrow Flight

Posted by Nate Bauernfeind <na...@deephaven.io>.

In addition to Arrow Flight we have other gRPC APIs that work together as a
whole. For example, the API client establishes a session with the server.
Then the client tells the server to create a derivative data stream by
filtering/joining/computing/etc several source streams. The server will
keep this derivative stream available as long as the session exists. This
session is kept alive with a heartbeat. So a small part of this system
requires a regular (but not overwhelming) heartbeat. If this derivative
stream has any data source that is live (aka regularly updating) then the
derivative is also live. The API client would likely want to subscribe to
the derivative stream, but still needs to heartbeat. The subscription
(communicated via Flight's DoExchange) should be able to last forever.
Naturally, this API pattern is simply better suited to the async pattern.

Ideally, we could re-use the same HTTP/2 connections, task queue and/or
thread pool between the Arrow Flight API and the other gRPC APIs (talking
to the same endpoint).

The gRPC callback pattern in Java is nice; it's a bit of a shame that gRPC
hasn't settled on a C++ callback pattern.

These are the kinds of ideas that it enables:
- Subscribe to multiple ticking sources simultaneously.
- Can have multiple gRPC requests outstanding at the same time
(particularly useful if server is in a remote data center with high RTT).
- Can communicate with multiple remote hosts simultaneously.
- In general, enables the client to be event driven.
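
As a small illustration of the "multiple requests outstanding at the same
time" point, the sketch below starts every request before waiting on any
of them, so the simulated round trips overlap instead of adding up.
FakeRequest and FetchFromAll are hypothetical names, and std::async with
one thread per request is used only to keep the example self-contained; a
completion-queue or callback design would not need a thread per call.

```cpp
#include <chrono>
#include <future>
#include <string>
#include <thread>
#include <vector>

// Simulates a remote call with a fixed round-trip time.
std::string FakeRequest(const std::string& host) {
  std::this_thread::sleep_for(std::chrono::milliseconds(20));
  return "reply from " + host;
}

// Issues all requests first, then collects the replies in request order.
std::vector<std::string> FetchFromAll(const std::vector<std::string>& hosts) {
  std::vector<std::future<std::string>> pending;
  for (const auto& host : hosts) {
    pending.push_back(std::async(std::launch::async, FakeRequest, host));
  }
  std::vector<std::string> replies;
  for (auto& reply : pending) {
    replies.push_back(reply.get());
  }
  return replies;
}
```

With a synchronous client, the same loop would pay the round-trip time
once per host, which is what hurts when the server is far away.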

Note that our clients tend to be light; they typically listen to derived
data and then act based on the information without further computation
locally.

The gRPC C++ async library does indeed look a bit under-documented. There
are a few blog posts that highlight some of the surprises, but async
behavior is a requirement for what we're working on. (For async C++ gRPC
surprises, see:
https://www.gresearch.co.uk/article/lessons-learnt-from-writing-asynchronous-streaming-grpc-services-in-c/
).

Nate

On Wed, Jun 2, 2021 at 8:44 PM David Li <li...@apache.org> wrote:

> Hey Nate,
>
> I think there's an open JIRA for something like this. I'd love to have
> something that plays nicely with asyncio/trio in Python and is hopefully
> more efficient. (I think it would also let us finally have per-message
> timeouts instead of only a per-call deadline.) There are some challenges
> though, e.g. we wouldn't expose gRPC's event loop directly so that we could
> support other transports, but then that leaves more things to design. I
> also recall the async C++ APIs being very underdocumented, I get the sense
> that they aren't actually used except to improve some benchmarks. I'll note
> for instance gRPC in Python, which offers async support, uses the "core"
> APIs directly and doesn't use anything C++ offers.
>
> But long story short, if you're interested in this I think it would be a
> useful addition. What sorts of things would it enable for you?
>
> -David

--

Re: [C++] Async Arrow Flight

Posted by David Li <li...@apache.org>.

Hey Nate,

I think there's an open JIRA for something like this. I'd love to have something that plays nicely with asyncio/trio in Python and is hopefully more efficient. (I think it would also let us finally have per-message timeouts instead of only a per-call deadline.) There are some challenges though: for example, we wouldn't expose gRPC's event loop directly, so that we could support other transports, but that leaves more things to design. I also recall the async C++ APIs being very underdocumented; I get the sense that they aren't actually used except to improve some benchmarks. I'll note, for instance, that gRPC in Python, which offers async support, uses the "core" APIs directly and doesn't use anything C++ offers.
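
To illustrate the per-message timeout point: if each message is delivered
as its own future, the caller can bound the wait for every message
separately rather than setting one deadline for the whole call. The sketch
below uses std::future::wait_for; GetWithTimeout and DemoPerMessageTimeout
are hypothetical names, not part of Flight or gRPC.

```cpp
#include <chrono>
#include <future>
#include <optional>

// Waits for one message with its own timeout. Returns std::nullopt if the
// message has not arrived in time; the future stays valid for a retry.
template <typename T>
std::optional<T> GetWithTimeout(std::future<T>& message,
                                std::chrono::milliseconds timeout) {
  if (message.wait_for(timeout) != std::future_status::ready) {
    return std::nullopt;
  }
  return message.get();
}

// One message has already arrived, the other never does.
bool DemoPerMessageTimeout() {
  std::promise<int> arrived;
  std::promise<int> stalled;
  arrived.set_value(42);
  std::future<int> first = arrived.get_future();
  std::future<int> second = stalled.get_future();

  const auto timeout = std::chrono::milliseconds(10);
  bool first_ok = GetWithTimeout(first, timeout) == 42;
  bool second_timed_out = !GetWithTimeout(second, timeout).has_value();
  return first_ok && second_timed_out;
}
```

A stalled message here times out on its own without cancelling the stream,
which a single per-call deadline cannot express.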

But long story short, if you're interested in this I think it would be a useful addition. What sorts of things would it enable for you?

-David

On Wed, Jun 2, 2021, at 16:20, Nate Bauernfeind wrote:
> It seems to me that the c++ arrow flight implementation uses only the
> synchronous version of the gRPC API. gRPC supports asynchronous message
> delivery in C++ via a CompletionQueue that must be polled. Has there been
> any desire to standardize on a solution for asynchronous use cases, perhaps
> delivered via a provided CompletionQueue?
> 
> For a simple async grpc c++ example you can look here:
> https://github.com/grpc/grpc/blob/master/examples/cpp/helloworld/greeter_async_client.cc
> 
> Thanks,
> Nate
> 
> --
>