You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@arrow.apache.org by David Li <li...@apache.org> on 2022/04/27 14:47:28 UTC

Re: Flight/FlightSQL Optimization for Small Results?

Following up here - what are the next steps? The RFC PR looks fairly complete, maybe we can help build out implementations in C++/Java/other languages in preparation for a vote?

On Wed, Mar 9, 2022, at 00:23, Micah Kornfield wrote:
>>
>> The operation flow would be like this, or what would it look like?
>> Client ---> GetFlightInfo (query/update operation in payload) ---> Server
>> ---> Results (non-streamed)
>
>
> This is roughly the flow I was imagining if the server chooses to send back
> inlined data.
>
> -Micah
>
> On Tue, Mar 8, 2022 at 11:27 AM Gavin Ray <ra...@gmail.com> wrote:
>
>> Thank you for doing this, left a few questions on the GH issue
>>
>> I would adopt this proposal as soon as it makes it into nightlies
>> (or possibly earlier if it's just a matter of regenerating the proto
>> definitions)
>>
>> The operation flow would be like this, or what would it look like?
>>
>> Client ---> GetFlightInfo (query/update operation in payload) ---> Server
>> ---> Results (non-streamed)
>>
>>
>>
>>
>> On Tue, Mar 8, 2022 at 2:04 PM Micah Kornfield <em...@gmail.com>
>> wrote:
>>
>>> Some people have already left comments on
>>> https://github.com/apache/arrow/pull/12571  More eyes on it would be
>>> appreciated.  If there aren't more comments, I'll try to start
>>> implementing
>>> this feature in Flight next week, and hopefully have a vote after it is
>>> supported in Java and C++/Python.
>>>
>>>
>>> Thanks,
>>> Micah
>>>
>>> On Fri, Mar 4, 2022 at 10:54 PM Micah Kornfield <em...@gmail.com>
>>> wrote:
>>>
>>> > I put together straw-man proposal in PR [1] for the Flight changes.
>>> > Ultimately, it seemed based on the use-cases discussed inlining the
>>> data on
>>> > the Ticket made the most sense.  This might be overly complex (I'm not
>>> sure
>>> > how I feel about a enum indicating partial vs full results) but welcome
>>> > feedback.  Once we get consensus on this proposal, I can add changes to
>>> > Flight SQL and try to provide reference implementations.
>>> >
>>> > [1] https://github.com/apache/arrow/pull/12571
>>> >
>>> > On Tue, Mar 1, 2022 at 10:51 PM Micah Kornfield <em...@gmail.com>
>>> > wrote:
>>> >
>>> >> Would it make sense to make this part of DoGet since it
>>> >>> still would be returning a record batch
>>> >>
>>> >> I would lean against this. I think in many cases the client doesn't
>>> know
>>> >> the size of the data that it expects.  Leaving the flexibility on the
>>> >> server side to send back inlined data when it thinks it makes sense,
>>> or a
>>> >> bunch of tickets when there is in fact a lot of data seems like the
>>> best
>>> >> option here.
>>> >>
>>> >> For cases like previewing data, you usually just want to get a small
>>> >>> amount
>>> >>> of data quickly.
>>> >>
>>> >> This is interesting and might be an additional use case.  If we did
>>> >> decide to extend FlightInfo we might also want a way of annotating
>>> inlined
>>> >> data with its corresponding ticket.  That way even for large results,
>>> you
>>> >> could still send back a small preview if desired.
>>> >>
>>> >> After considering it a little bit I think I'm sold that inlined data
>>> >> should not replace a ticket.  So in my mind the open question is
>>> whether
>>> >> the client needs to actively opt-in to inlined data.  The scenarios I
>>> could
>>> >> come with where inlined data isn't useful are:
>>> >> 1.  The client is an old client and isn't aware inline data might be
>>> >> returned.  In this case the main cost is of extra data on the wire and
>>> >> storing it as unknown fields [1].
>>> >> 2.  The client is a new client but still doesn't want to get inline
>>> data
>>> >> (it might want to distribute all consumption to other processes).  Same
>>> >> cost is paid as option 1.
>>> >>
>>> >> Are there other scenarios?  If servers choose reasonable limits on what
>>> >> data to inline, the extra complexity of negotiating with the client in
>>> this
>>> >> case might not be worth the benefits.
>>> >>
>>> >> Cheers,
>>> >> Micah
>>> >>
>>> >>
>>> >> [1]
>>> https://developers.google.com/protocol-buffers/docs/proto3#unknowns
>>> >>
>>> >> On Tue, Mar 1, 2022 at 10:01 PM Bryan Cutler <cu...@gmail.com>
>>> wrote:
>>> >>
>>> >>> I think this would be a useful feature and be nice to have in Flight
>>> >>> core.
>>> >>> For cases like previewing data, you usually just want to get a small
>>> >>> amount
>>> >>> of data quickly. Would it make sense to make this part of DoGet since
>>> it
>>> >>> still would be returning a record batch? Perhaps a Ticket could be
>>> made
>>> >>> to
>>> >>> have an optional FlightDescriptor that would serve as an all-in-one
>>> shot?
>>> >>>
>>> >>> On Tue, Mar 1, 2022 at 8:44 AM David Li <li...@apache.org> wrote:
>>> >>>
>>> >>> > I agree with something along Antoine's proposal, though: maybe we
>>> >>> should
>>> >>> > be more structured with the flags (akin to what Micah mentioned with
>>> >>> the
>>> >>> > Feature enum).
>>> >>> >
>>> >>> > Also, the flag could be embedded into the Flight SQL messages
>>> instead.
>>> >>> (So
>>> >>> > in effect, Flight would only add the capability to return data with
>>> >>> > FlightInfo, and it's up to applications, like Flight SQL, to decide
>>> how
>>> >>> > they want to take advantage of that.)
>>> >>> >
>>> >>> > I think having a completely separate method and return type and
>>> having
>>> >>> to
>>> >>> > poll for it beforehand somewhat defeats the purpose of having
>>> it/would
>>> >>> be
>>> >>> > much harder of a transition.
>>> >>> >
>>> >>> > Also: it should be `repeated FlightInfo inline_data` right? In case
>>> we
>>> >>> > also need dictionary batches?
>>> >>> >
>>> >>> > On Tue, Mar 1, 2022, at 11:39, Antoine Pitrou wrote:
>>> >>> > > Can we just add the following field to the FlightDescriptor
>>> message:
>>> >>> > >
>>> >>> > >   bool accept_inline_data = 4;
>>> >>> > >
>>> >>> > > and this one to the FlightInfo message:
>>> >>> > >
>>> >>> > >   FlightData inline_data = 100;
>>> >>> > >
>>> >>> > > Then new clients can `accept_inline_data` to true (the default
>>> being
>>> >>> > > false if omitted) to signal servers that they can put the data if
>>> >>> > > `inline_data` if deemed small enough.
>>> >>> > >
>>> >>> > > (the `accept_inline_data` field could also be used to the Criteria
>>> >>> > > message)
>>> >>> > >
>>> >>> > >
>>> >>> > > Alternatively, if the FlightDescriptor expansion looks a bit dirty
>>> >>> > > (FlightDescriptor being used in other contexts where
>>> >>> > > `accept_inline_data` makes no sense), we can instead define a new
>>> >>> > > method:
>>> >>> > >
>>> >>> > >   rpc GetFlightInfoEx(GetFlightInfoRequest) returns (FlightInfo)
>>> {}
>>> >>> > >
>>> >>> > > with:
>>> >>> > >
>>> >>> > > message GetFlightInfoRequest {
>>> >>> > >   FlightDescriptor flight_descriptor = 1;
>>> >>> > >   bool accept_inline_data = 2;
>>> >>> > > }
>>> >>> > >
>>> >>> > > Regards
>>> >>> > >
>>> >>> > > Antoine.
>>> >>> > >
>>> >>> > >
>>> >>> > > On Mon, 28 Feb 2022 11:29:12 -0800
>>> >>> > > James Duong <ja...@bitquilltech.com.INVALID> wrote:
>>> >>> > >> This seems reasonable, however we need to account for existing
>>> >>> Flight
>>> >>> > >> clients that were written before this.
>>> >>> > >>
>>> >>> > >> It seems like the server will need to still handle the ticket
>>> >>> returned
>>> >>> > for
>>> >>> > >> getStream() for clients that are unaware of the small result
>>> >>> > optimization.
>>> >>> > >>
>>> >>> > >> On Mon, Feb 28, 2022 at 11:26 AM David Li <li...@apache.org>
>>> >>> wrote:
>>> >>> > >>
>>> >>> > >> > Ah, that makes more sense, that would be a reasonable
>>> extension to
>>> >>> > Flight
>>> >>> > >> > overall. (While we're at it, I think it would help to have an
>>> >>> > app_metadata
>>> >>> > >> > field in FlightInfo as well.)
>>> >>> > >> >
>>> >>> > >> > On Mon, Feb 28, 2022, at 14:24, Micah Kornfield wrote:
>>> >>> > >> > >>
>>> >>> > >> > >> But it seems reasonable to add a one-shot query path using
>>> >>> DoGet.
>>> >>> > >> > >
>>> >>> > >> > >
>>> >>> > >> > > I was thinking more of adding a bytes field to FlightInfo
>>> that
>>> >>> > could
>>> >>> > >> > store
>>> >>> > >> > > arrow data.  That way GetFlightInfo would be the only RPC
>>> >>> necessary
>>> >>> > for
>>> >>> > >> > > small results when executing a CMD.  The client doesn't
>>> >>> necessarily
>>> >>> > know
>>> >>> > >> > > whether a query will return large or small results.
>>> >>> > >> > >
>>> >>> > >> > > On Mon, Feb 28, 2022 at 11:04 AM David Li <
>>> lidavidm@apache.org>
>>> >>> > wrote:
>>> >>> > >> > >
>>> >>> > >> > >> I think the focus was on large result sets (though I don't
>>> >>> recall
>>> >>> > this
>>> >>> > >> > >> being discussed before) and supporting multi-node setups
>>> (hence
>>> >>> > >> > >> GetFlightInfo/DoGet are separated). But it seems reasonable
>>> to
>>> >>> add
>>> >>> > a
>>> >>> > >> > >> one-shot query path using DoGet.
>>> >>> > >> > >>
>>> >>> > >> > >> On Mon, Feb 28, 2022, at 13:32, Adam Lippai wrote:
>>> >>> > >> > >> > I saw the same. A small, stateless query ability would be
>>> >>> nice
>>> >>> > >> > >> (connection
>>> >>> > >> > >> > open, initialization, query in one message, the resultset
>>> in
>>> >>> > the
>>> >>> > >> > response
>>> >>> > >> > >> > in one message)
>>> >>> > >> > >> >
>>> >>> > >> > >> > On Mon, Feb 28, 2022, 13:12 Micah Kornfield <
>>> >>> > emkornfield@gmail.com>
>>> >>> > >> > >> wrote:
>>> >>> > >> > >> >
>>> >>> > >> > >> >> I'm rereviewing the Flight SQL interfaces, and I'm not
>>> sure
>>> >>> if
>>> >>> > I'm
>>> >>> > >> > >> missing
>>> >>> > >> > >> >> it but is there any optimization for small results?  My
>>> >>> concern
>>> >>> > is
>>> >>> > >> > that
>>> >>> > >> > >> the
>>> >>> > >> > >> >> overhead of the RPCs for the DoGet after executing the
>>> query
>>> >>> > could
>>> >>> > >> > add
>>> >>> > >> > >> >> non-trivial latency for smaller results.
>>> >>> > >> > >> >>
>>> >>> > >> > >> >> Has anybody else thought about this/investigated it?  Am
>>> I
>>> >>> > >> > understanding
>>> >>> > >> > >> >> this correctly?
>>> >>> > >> > >> >>
>>> >>> > >> > >> >> Thanks,
>>> >>> > >> > >> >> Micah
>>> >>> > >> > >> >>
>>> >>> > >> > >>
>>> >>> > >> >
>>> >>> > >>
>>> >>> > >>
>>> >>> >
>>> >>>
>>> >>
>>>
>>

Re: Flight/FlightSQL Optimization for Small Results?

Posted by Micah Kornfield <em...@gmail.com>.
I'm sorry I haven't had the time I wanted to spend on implementation, I'd
still like to get to it but cannot commit to a timeline for a little bit.
if anybody would like to take on the implementation work, I'm happy to
review.

On Wed, Apr 27, 2022 at 9:06 AM Micah Kornfield <em...@gmail.com>
wrote:

> Yes, next step is implementation which I've been delayed on.  I hope to
> have a little time this week to work on it and will post an update.
>
> On Wed, Apr 27, 2022 at 7:48 AM David Li <li...@apache.org> wrote:
>
>> Following up here - what are the next steps? The RFC PR looks fairly
>> complete, maybe we can help build out implementations in C++/Java/other
>> languages in preparation for a vote?
>>
>> On Wed, Mar 9, 2022, at 00:23, Micah Kornfield wrote:
>> >>
>> >> The operation flow would be like this, or what would it look like?
>> >> Client ---> GetFlightInfo (query/update operation in payload) --->
>> Server
>> >> ---> Results (non-streamed)
>> >
>> >
>> > This is roughly the flow I was imagining if the server chooses to send
>> back
>> > inlined data.
>> >
>> > -Micah
>> >
>> > On Tue, Mar 8, 2022 at 11:27 AM Gavin Ray <ra...@gmail.com>
>> wrote:
>> >
>> >> Thank you for doing this, left a few questions on the GH issue
>> >>
>> >> I would adopt this proposal as soon as it makes it into nightlies
>> >> (or possibly earlier if it's just a matter of regenerating the proto
>> >> definitions)
>> >>
>> >> The operation flow would be like this, or what would it look like?
>> >>
>> >> Client ---> GetFlightInfo (query/update operation in payload) --->
>> Server
>> >> ---> Results (non-streamed)
>> >>
>> >>
>> >>
>> >>
>> >> On Tue, Mar 8, 2022 at 2:04 PM Micah Kornfield <em...@gmail.com>
>> >> wrote:
>> >>
>> >>> Some people have already left comments on
>> >>> https://github.com/apache/arrow/pull/12571  More eyes on it would be
>> >>> appreciated.  If there aren't more comments, I'll try to start
>> >>> implementing
>> >>> this feature in Flight next week, and hopefully have a vote after it
>> is
>> >>> supported in Java and C++/Python.
>> >>>
>> >>>
>> >>> Thanks,
>> >>> Micah
>> >>>
>> >>> On Fri, Mar 4, 2022 at 10:54 PM Micah Kornfield <
>> emkornfield@gmail.com>
>> >>> wrote:
>> >>>
>> >>> > I put together straw-man proposal in PR [1] for the Flight changes.
>> >>> > Ultimately, it seemed based on the use-cases discussed inlining the
>> >>> data on
>> >>> > the Ticket made the most sense.  This might be overly complex (I'm
>> not
>> >>> sure
>> >>> > how I feel about a enum indicating partial vs full results) but
>> welcome
>> >>> > feedback.  Once we get consensus on this proposal, I can add
>> changes to
>> >>> > Flight SQL and try to provide reference implementations.
>> >>> >
>> >>> > [1] https://github.com/apache/arrow/pull/12571
>> >>> >
>> >>> > On Tue, Mar 1, 2022 at 10:51 PM Micah Kornfield <
>> emkornfield@gmail.com>
>> >>> > wrote:
>> >>> >
>> >>> >> Would it make sense to make this part of DoGet since it
>> >>> >>> still would be returning a record batch
>> >>> >>
>> >>> >> I would lean against this. I think in many cases the client doesn't
>> >>> know
>> >>> >> the size of the data that it expects.  Leaving the flexibility on
>> the
>> >>> >> server side to send back inlined data when it thinks it makes
>> sense,
>> >>> or a
>> >>> >> bunch of tickets when there is in fact a lot of data seems like the
>> >>> best
>> >>> >> option here.
>> >>> >>
>> >>> >> For cases like previewing data, you usually just want to get a
>> small
>> >>> >>> amount
>> >>> >>> of data quickly.
>> >>> >>
>> >>> >> This is interesting and might be an additional use case.  If we did
>> >>> >> decide to extend FlightInfo we might also want a way of annotating
>> >>> inlined
>> >>> >> data with its corresponding ticket.  That way even for large
>> results,
>> >>> you
>> >>> >> could still send back a small preview if desired.
>> >>> >>
>> >>> >> After considering it a little bit I think I'm sold that inlined
>> data
>> >>> >> should not replace a ticket.  So in my mind the open question is
>> >>> whether
>> >>> >> the client needs to actively opt-in to inlined data.  The
>> scenarios I
>> >>> could
>> >>> >> come with where inlined data isn't useful are:
>> >>> >> 1.  The client is an old client and isn't aware inline data might
>> be
>> >>> >> returned.  In this case the main cost is of extra data on the wire
>> and
>> >>> >> storing it as unknown fields [1].
>> >>> >> 2.  The client is a new client but still doesn't want to get inline
>> >>> data
>> >>> >> (it might want to distribute all consumption to other processes).
>> Same
>> >>> >> cost is paid as option 1.
>> >>> >>
>> >>> >> Are there other scenarios?  If servers choose reasonable limits on
>> what
>> >>> >> data to inline, the extra complexity of negotiating with the
>> client in
>> >>> this
>> >>> >> case might not be worth the benefits.
>> >>> >>
>> >>> >> Cheers,
>> >>> >> Micah
>> >>> >>
>> >>> >>
>> >>> >> [1]
>> >>> https://developers.google.com/protocol-buffers/docs/proto3#unknowns
>> >>> >>
>> >>> >> On Tue, Mar 1, 2022 at 10:01 PM Bryan Cutler <cu...@gmail.com>
>> >>> wrote:
>> >>> >>
>> >>> >>> I think this would be a useful feature and be nice to have in
>> Flight
>> >>> >>> core.
>> >>> >>> For cases like previewing data, you usually just want to get a
>> small
>> >>> >>> amount
>> >>> >>> of data quickly. Would it make sense to make this part of DoGet
>> since
>> >>> it
>> >>> >>> still would be returning a record batch? Perhaps a Ticket could be
>> >>> made
>> >>> >>> to
>> >>> >>> have an optional FlightDescriptor that would serve as an
>> all-in-one
>> >>> shot?
>> >>> >>>
>> >>> >>> On Tue, Mar 1, 2022 at 8:44 AM David Li <li...@apache.org>
>> wrote:
>> >>> >>>
>> >>> >>> > I agree with something along Antoine's proposal, though: maybe
>> we
>> >>> >>> should
>> >>> >>> > be more structured with the flags (akin to what Micah mentioned
>> with
>> >>> >>> the
>> >>> >>> > Feature enum).
>> >>> >>> >
>> >>> >>> > Also, the flag could be embedded into the Flight SQL messages
>> >>> instead.
>> >>> >>> (So
>> >>> >>> > in effect, Flight would only add the capability to return data
>> with
>> >>> >>> > FlightInfo, and it's up to applications, like Flight SQL, to
>> decide
>> >>> how
>> >>> >>> > they want to take advantage of that.)
>> >>> >>> >
>> >>> >>> > I think having a completely separate method and return type and
>> >>> having
>> >>> >>> to
>> >>> >>> > poll for it beforehand somewhat defeats the purpose of having
>> >>> it/would
>> >>> >>> be
>> >>> >>> > much harder of a transition.
>> >>> >>> >
>> >>> >>> > Also: it should be `repeated FlightInfo inline_data` right? In
>> case
>> >>> we
>> >>> >>> > also need dictionary batches?
>> >>> >>> >
>> >>> >>> > On Tue, Mar 1, 2022, at 11:39, Antoine Pitrou wrote:
>> >>> >>> > > Can we just add the following field to the FlightDescriptor
>> >>> message:
>> >>> >>> > >
>> >>> >>> > >   bool accept_inline_data = 4;
>> >>> >>> > >
>> >>> >>> > > and this one to the FlightInfo message:
>> >>> >>> > >
>> >>> >>> > >   FlightData inline_data = 100;
>> >>> >>> > >
>> >>> >>> > > Then new clients can `accept_inline_data` to true (the default
>> >>> being
>> >>> >>> > > false if omitted) to signal servers that they can put the
>> data if
>> >>> >>> > > `inline_data` if deemed small enough.
>> >>> >>> > >
>> >>> >>> > > (the `accept_inline_data` field could also be used to the
>> Criteria
>> >>> >>> > > message)
>> >>> >>> > >
>> >>> >>> > >
>> >>> >>> > > Alternatively, if the FlightDescriptor expansion looks a bit
>> dirty
>> >>> >>> > > (FlightDescriptor being used in other contexts where
>> >>> >>> > > `accept_inline_data` makes no sense), we can instead define a
>> new
>> >>> >>> > > method:
>> >>> >>> > >
>> >>> >>> > >   rpc GetFlightInfoEx(GetFlightInfoRequest) returns
>> (FlightInfo)
>> >>> {}
>> >>> >>> > >
>> >>> >>> > > with:
>> >>> >>> > >
>> >>> >>> > > message GetFlightInfoRequest {
>> >>> >>> > >   FlightDescriptor flight_descriptor = 1;
>> >>> >>> > >   bool accept_inline_data = 2;
>> >>> >>> > > }
>> >>> >>> > >
>> >>> >>> > > Regards
>> >>> >>> > >
>> >>> >>> > > Antoine.
>> >>> >>> > >
>> >>> >>> > >
>> >>> >>> > > On Mon, 28 Feb 2022 11:29:12 -0800
>> >>> >>> > > James Duong <ja...@bitquilltech.com.INVALID> wrote:
>> >>> >>> > >> This seems reasonable, however we need to account for
>> existing
>> >>> >>> Flight
>> >>> >>> > >> clients that were written before this.
>> >>> >>> > >>
>> >>> >>> > >> It seems like the server will need to still handle the ticket
>> >>> >>> returned
>> >>> >>> > for
>> >>> >>> > >> getStream() for clients that are unaware of the small result
>> >>> >>> > optimization.
>> >>> >>> > >>
>> >>> >>> > >> On Mon, Feb 28, 2022 at 11:26 AM David Li <
>> lidavidm@apache.org>
>> >>> >>> wrote:
>> >>> >>> > >>
>> >>> >>> > >> > Ah, that makes more sense, that would be a reasonable
>> >>> extension to
>> >>> >>> > Flight
>> >>> >>> > >> > overall. (While we're at it, I think it would help to have
>> an
>> >>> >>> > app_metadata
>> >>> >>> > >> > field in FlightInfo as well.)
>> >>> >>> > >> >
>> >>> >>> > >> > On Mon, Feb 28, 2022, at 14:24, Micah Kornfield wrote:
>> >>> >>> > >> > >>
>> >>> >>> > >> > >> But it seems reasonable to add a one-shot query path
>> using
>> >>> >>> DoGet.
>> >>> >>> > >> > >
>> >>> >>> > >> > >
>> >>> >>> > >> > > I was thinking more of adding a bytes field to FlightInfo
>> >>> that
>> >>> >>> > could
>> >>> >>> > >> > store
>> >>> >>> > >> > > arrow data.  That way GetFlightInfo would be the only RPC
>> >>> >>> necessary
>> >>> >>> > for
>> >>> >>> > >> > > small results when executing a CMD.  The client doesn't
>> >>> >>> necessarily
>> >>> >>> > know
>> >>> >>> > >> > > whether a query will return large or small results.
>> >>> >>> > >> > >
>> >>> >>> > >> > > On Mon, Feb 28, 2022 at 11:04 AM David Li <
>> >>> lidavidm@apache.org>
>> >>> >>> > wrote:
>> >>> >>> > >> > >
>> >>> >>> > >> > >> I think the focus was on large result sets (though I
>> don't
>> >>> >>> recall
>> >>> >>> > this
>> >>> >>> > >> > >> being discussed before) and supporting multi-node setups
>> >>> (hence
>> >>> >>> > >> > >> GetFlightInfo/DoGet are separated). But it seems
>> reasonable
>> >>> to
>> >>> >>> add
>> >>> >>> > a
>> >>> >>> > >> > >> one-shot query path using DoGet.
>> >>> >>> > >> > >>
>> >>> >>> > >> > >> On Mon, Feb 28, 2022, at 13:32, Adam Lippai wrote:
>> >>> >>> > >> > >> > I saw the same. A small, stateless query ability
>> would be
>> >>> >>> nice
>> >>> >>> > >> > >> (connection
>> >>> >>> > >> > >> > open, initialization, query in one message, the
>> resultset
>> >>> in
>> >>> >>> > the
>> >>> >>> > >> > response
>> >>> >>> > >> > >> > in one message)
>> >>> >>> > >> > >> >
>> >>> >>> > >> > >> > On Mon, Feb 28, 2022, 13:12 Micah Kornfield <
>> >>> >>> > emkornfield@gmail.com>
>> >>> >>> > >> > >> wrote:
>> >>> >>> > >> > >> >
>> >>> >>> > >> > >> >> I'm rereviewing the Flight SQL interfaces, and I'm
>> not
>> >>> sure
>> >>> >>> if
>> >>> >>> > I'm
>> >>> >>> > >> > >> missing
>> >>> >>> > >> > >> >> it but is there any optimization for small results?
>> My
>> >>> >>> concern
>> >>> >>> > is
>> >>> >>> > >> > that
>> >>> >>> > >> > >> the
>> >>> >>> > >> > >> >> overhead of the RPCs for the DoGet after executing
>> the
>> >>> query
>> >>> >>> > could
>> >>> >>> > >> > add
>> >>> >>> > >> > >> >> non-trivial latency for smaller results.
>> >>> >>> > >> > >> >>
>> >>> >>> > >> > >> >> Has anybody else thought about this/investigated
>> it?  Am
>> >>> I
>> >>> >>> > >> > understanding
>> >>> >>> > >> > >> >> this correctly?
>> >>> >>> > >> > >> >>
>> >>> >>> > >> > >> >> Thanks,
>> >>> >>> > >> > >> >> Micah
>> >>> >>> > >> > >> >>
>> >>> >>> > >> > >>
>> >>> >>> > >> >
>> >>> >>> > >>
>> >>> >>> > >>
>> >>> >>> >
>> >>> >>>
>> >>> >>
>> >>>
>> >>
>>
>

Re: Flight/FlightSQL Optimization for Small Results?

Posted by Micah Kornfield <em...@gmail.com>.
Yes, next step is implementation which I've been delayed on.  I hope to
have a little time this week to work on it and will post an update.

On Wed, Apr 27, 2022 at 7:48 AM David Li <li...@apache.org> wrote:

> Following up here - what are the next steps? The RFC PR looks fairly
> complete, maybe we can help build out implementations in C++/Java/other
> languages in preparation for a vote?
>
> On Wed, Mar 9, 2022, at 00:23, Micah Kornfield wrote:
> >>
> >> The operation flow would be like this, or what would it look like?
> >> Client ---> GetFlightInfo (query/update operation in payload) --->
> Server
> >> ---> Results (non-streamed)
> >
> >
> > This is roughly the flow I was imagining if the server chooses to send
> back
> > inlined data.
> >
> > -Micah
> >
> > On Tue, Mar 8, 2022 at 11:27 AM Gavin Ray <ra...@gmail.com> wrote:
> >
> >> Thank you for doing this, left a few questions on the GH issue
> >>
> >> I would adopt this proposal as soon as it makes it into nightlies
> >> (or possibly earlier if it's just a matter of regenerating the proto
> >> definitions)
> >>
> >> The operation flow would be like this, or what would it look like?
> >>
> >> Client ---> GetFlightInfo (query/update operation in payload) --->
> Server
> >> ---> Results (non-streamed)
> >>
> >>
> >>
> >>
> >> On Tue, Mar 8, 2022 at 2:04 PM Micah Kornfield <em...@gmail.com>
> >> wrote:
> >>
> >>> Some people have already left comments on
> >>> https://github.com/apache/arrow/pull/12571  More eyes on it would be
> >>> appreciated.  If there aren't more comments, I'll try to start
> >>> implementing
> >>> this feature in Flight next week, and hopefully have a vote after it is
> >>> supported in Java and C++/Python.
> >>>
> >>>
> >>> Thanks,
> >>> Micah
> >>>
> >>> On Fri, Mar 4, 2022 at 10:54 PM Micah Kornfield <emkornfield@gmail.com
> >
> >>> wrote:
> >>>
> >>> > I put together straw-man proposal in PR [1] for the Flight changes.
> >>> > Ultimately, it seemed based on the use-cases discussed inlining the
> >>> data on
> >>> > the Ticket made the most sense.  This might be overly complex (I'm
> not
> >>> sure
> >>> > how I feel about a enum indicating partial vs full results) but
> welcome
> >>> > feedback.  Once we get consensus on this proposal, I can add changes
> to
> >>> > Flight SQL and try to provide reference implementations.
> >>> >
> >>> > [1] https://github.com/apache/arrow/pull/12571
> >>> >
> >>> > On Tue, Mar 1, 2022 at 10:51 PM Micah Kornfield <
> emkornfield@gmail.com>
> >>> > wrote:
> >>> >
> >>> >> Would it make sense to make this part of DoGet since it
> >>> >>> still would be returning a record batch
> >>> >>
> >>> >> I would lean against this. I think in many cases the client doesn't
> >>> know
> >>> >> the size of the data that it expects.  Leaving the flexibility on
> the
> >>> >> server side to send back inlined data when it thinks it makes sense,
> >>> or a
> >>> >> bunch of tickets when there is in fact a lot of data seems like the
> >>> best
> >>> >> option here.
> >>> >>
> >>> >> For cases like previewing data, you usually just want to get a small
> >>> >>> amount
> >>> >>> of data quickly.
> >>> >>
> >>> >> This is interesting and might be an additional use case.  If we did
> >>> >> decide to extend FlightInfo we might also want a way of annotating
> >>> inlined
> >>> >> data with its corresponding ticket.  That way even for large
> results,
> >>> you
> >>> >> could still send back a small preview if desired.
> >>> >>
> >>> >> After considering it a little bit I think I'm sold that inlined data
> >>> >> should not replace a ticket.  So in my mind the open question is
> >>> whether
> >>> >> the client needs to actively opt-in to inlined data.  The scenarios
> I
> >>> could
> >>> >> come with where inlined data isn't useful are:
> >>> >> 1.  The client is an old client and isn't aware inline data might be
> >>> >> returned.  In this case the main cost is of extra data on the wire
> and
> >>> >> storing it as unknown fields [1].
> >>> >> 2.  The client is a new client but still doesn't want to get inline
> >>> data
> >>> >> (it might want to distribute all consumption to other processes).
> Same
> >>> >> cost is paid as option 1.
> >>> >>
> >>> >> Are there other scenarios?  If servers choose reasonable limits on
> what
> >>> >> data to inline, the extra complexity of negotiating with the client
> in
> >>> this
> >>> >> case might not be worth the benefits.
> >>> >>
> >>> >> Cheers,
> >>> >> Micah
> >>> >>
> >>> >>
> >>> >> [1]
> >>> https://developers.google.com/protocol-buffers/docs/proto3#unknowns
> >>> >>
> >>> >> On Tue, Mar 1, 2022 at 10:01 PM Bryan Cutler <cu...@gmail.com>
> >>> wrote:
> >>> >>
> >>> >>> I think this would be a useful feature and be nice to have in
> Flight
> >>> >>> core.
> >>> >>> For cases like previewing data, you usually just want to get a
> small
> >>> >>> amount
> >>> >>> of data quickly. Would it make sense to make this part of DoGet
> since
> >>> it
> >>> >>> still would be returning a record batch? Perhaps a Ticket could be
> >>> made
> >>> >>> to
> >>> >>> have an optional FlightDescriptor that would serve as an all-in-one
> >>> shot?
> >>> >>>
> >>> >>> On Tue, Mar 1, 2022 at 8:44 AM David Li <li...@apache.org>
> wrote:
> >>> >>>
> >>> >>> > I agree with something along Antoine's proposal, though: maybe we
> >>> >>> should
> >>> >>> > be more structured with the flags (akin to what Micah mentioned
> with
> >>> >>> the
> >>> >>> > Feature enum).
> >>> >>> >
> >>> >>> > Also, the flag could be embedded into the Flight SQL messages
> >>> instead.
> >>> >>> (So
> >>> >>> > in effect, Flight would only add the capability to return data
> with
> >>> >>> > FlightInfo, and it's up to applications, like Flight SQL, to
> decide
> >>> how
> >>> >>> > they want to take advantage of that.)
> >>> >>> >
> >>> >>> > I think having a completely separate method and return type and
> >>> having
> >>> >>> to
> >>> >>> > poll for it beforehand somewhat defeats the purpose of having
> >>> it/would
> >>> >>> be
> >>> >>> > much harder of a transition.
> >>> >>> >
> >>> >>> > Also: it should be `repeated FlightInfo inline_data` right? In
> case
> >>> we
> >>> >>> > also need dictionary batches?
> >>> >>> >
> >>> >>> > On Tue, Mar 1, 2022, at 11:39, Antoine Pitrou wrote:
> >>> >>> > > Can we just add the following field to the FlightDescriptor
> >>> message:
> >>> >>> > >
> >>> >>> > >   bool accept_inline_data = 4;
> >>> >>> > >
> >>> >>> > > and this one to the FlightInfo message:
> >>> >>> > >
> >>> >>> > >   FlightData inline_data = 100;
> >>> >>> > >
> >>> >>> > > Then new clients can `accept_inline_data` to true (the default
> >>> being
> >>> >>> > > false if omitted) to signal servers that they can put the data
> if
> >>> >>> > > `inline_data` if deemed small enough.
> >>> >>> > >
> >>> >>> > > (the `accept_inline_data` field could also be used to the
> Criteria
> >>> >>> > > message)
> >>> >>> > >
> >>> >>> > >
> >>> >>> > > Alternatively, if the FlightDescriptor expansion looks a bit
> dirty
> >>> >>> > > (FlightDescriptor being used in other contexts where
> >>> >>> > > `accept_inline_data` makes no sense), we can instead define a
> new
> >>> >>> > > method:
> >>> >>> > >
> >>> >>> > >   rpc GetFlightInfoEx(GetFlightInfoRequest) returns
> (FlightInfo)
> >>> {}
> >>> >>> > >
> >>> >>> > > with:
> >>> >>> > >
> >>> >>> > > message GetFlightInfoRequest {
> >>> >>> > >   FlightDescriptor flight_descriptor = 1;
> >>> >>> > >   bool accept_inline_data = 2;
> >>> >>> > > }
> >>> >>> > >
> >>> >>> > > Regards
> >>> >>> > >
> >>> >>> > > Antoine.
> >>> >>> > >
> >>> >>> > >
> >>> >>> > > On Mon, 28 Feb 2022 11:29:12 -0800
> >>> >>> > > James Duong <ja...@bitquilltech.com.INVALID> wrote:
> >>> >>> > >> This seems reasonable, however we need to account for existing
> >>> >>> Flight
> >>> >>> > >> clients that were written before this.
> >>> >>> > >>
> >>> >>> > >> It seems like the server will need to still handle the ticket
> >>> >>> returned
> >>> >>> > for
> >>> >>> > >> getStream() for clients that are unaware of the small result
> >>> >>> > optimization.
> >>> >>> > >>
> >>> >>> > >> On Mon, Feb 28, 2022 at 11:26 AM David Li <
> lidavidm@apache.org>
> >>> >>> wrote:
> >>> >>> > >>
> >>> >>> > >> > Ah, that makes more sense, that would be a reasonable
> >>> extension to
> >>> >>> > Flight
> >>> >>> > >> > overall. (While we're at it, I think it would help to have
> an
> >>> >>> > app_metadata
> >>> >>> > >> > field in FlightInfo as well.)
> >>> >>> > >> >
> >>> >>> > >> > On Mon, Feb 28, 2022, at 14:24, Micah Kornfield wrote:
> >>> >>> > >> > >>
> >>> >>> > >> > >> But it seems reasonable to add a one-shot query path
> using
> >>> >>> DoGet.
> >>> >>> > >> > >
> >>> >>> > >> > >
> >>> >>> > >> > > I was thinking more of adding a bytes field to FlightInfo
> >>> that
> >>> >>> > could
> >>> >>> > >> > store
> >>> >>> > >> > > arrow data.  That way GetFlightInfo would be the only RPC
> >>> >>> necessary
> >>> >>> > for
> >>> >>> > >> > > small results when executing a CMD.  The client doesn't
> >>> >>> necessarily
> >>> >>> > know
> >>> >>> > >> > > whether a query will return large or small results.
> >>> >>> > >> > >
> >>> >>> > >> > > On Mon, Feb 28, 2022 at 11:04 AM David Li <
> >>> lidavidm@apache.org>
> >>> >>> > wrote:
> >>> >>> > >> > >
> >>> >>> > >> > >> I think the focus was on large result sets (though I
> don't
> >>> >>> recall
> >>> >>> > this
> >>> >>> > >> > >> being discussed before) and supporting multi-node setups
> >>> (hence
> >>> >>> > >> > >> GetFlightInfo/DoGet are separated). But it seems
> reasonable
> >>> to
> >>> >>> add
> >>> >>> > a
> >>> >>> > >> > >> one-shot query path using DoGet.
> >>> >>> > >> > >>
> >>> >>> > >> > >> On Mon, Feb 28, 2022, at 13:32, Adam Lippai wrote:
> >>> >>> > >> > >> > I saw the same. A small, stateless query ability would
> be
> >>> >>> nice
> >>> >>> > >> > >> (connection
> >>> >>> > >> > >> > open, initialization, query in one message, the
> resultset
> >>> in
> >>> >>> > the
> >>> >>> > >> > response
> >>> >>> > >> > >> > in one message)
> >>> >>> > >> > >> >
> >>> >>> > >> > >> > On Mon, Feb 28, 2022, 13:12 Micah Kornfield <
> >>> >>> > emkornfield@gmail.com>
> >>> >>> > >> > >> wrote:
> >>> >>> > >> > >> >
> >>> >>> > >> > >> >> I'm rereviewing the Flight SQL interfaces, and I'm not
> >>> sure
> >>> >>> if
> >>> >>> > I'm
> >>> >>> > >> > >> missing
> >>> >>> > >> > >> >> it but is there any optimization for small results?
> My
> >>> >>> concern
> >>> >>> > is
> >>> >>> > >> > that
> >>> >>> > >> > >> the
> >>> >>> > >> > >> >> overhead of the RPCs for the DoGet after executing the
> >>> query
> >>> >>> > could
> >>> >>> > >> > add
> >>> >>> > >> > >> >> non-trivial latency for smaller results.
> >>> >>> > >> > >> >>
> >>> >>> > >> > >> >> Has anybody else thought about this/investigated it?
> Am
> >>> I
> >>> >>> > >> > understanding
> >>> >>> > >> > >> >> this correctly?
> >>> >>> > >> > >> >>
> >>> >>> > >> > >> >> Thanks,
> >>> >>> > >> > >> >> Micah
> >>> >>> > >> > >> >>
> >>> >>> > >> > >>
> >>> >>> > >> >
> >>> >>> > >>
> >>> >>> > >>
> >>> >>> >
> >>> >>>
> >>> >>
> >>>
> >>
>