
Posted to dev@arrow.apache.org by Ryan Murray <ry...@dremio.com> on 2019/06/28 17:57:10 UTC

Spark and Arrow Flight

Hi All,

I have been working on building an Arrow Flight source for Spark. The goal
here is for Spark to be able to use a group of Arrow Flight endpoints to
pull a dataset over to Spark in parallel.

I am unsure of the best model for the Spark <-> Flight conversation and
wanted to get your opinion on the best way to go.

I am breaking up the query to Flight from Spark into 3 parts:
1) get the schema using GetFlightInfo. This is needed to do further lazy
operations in Spark
2) get the endpoints by calling GetFlightInfo a 2nd time with a different
argument. This returns the list of endpoints on the parallel Flight server.
The endpoints are not available until the data is ready to be fetched, which
happens after the schema is retrieved, but they are needed before DoGet is
called.
3) call get stream (DoGet) on all endpoints from step 2
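
The three parts above can be sketched roughly as follows. This is a toy
in-memory model, not the real Flight API: ToyFlightServer, ToyFlightInfo, the
b"schema" / b"endpoints" commands and the endpoint names are all made up for
illustration. It only shows the shape of the conversation: one call for the
schema, a second call for the endpoints, then one get-stream call per endpoint
issued in parallel.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field


@dataclass
class ToyFlightInfo:
    """Stand-in for a FlightInfo: a schema plus a list of endpoint names."""
    schema: list
    endpoints: list = field(default_factory=list)


class ToyFlightServer:
    """Hypothetical in-memory server; it is not the real Flight API."""

    def __init__(self, partitions):
        # partitions maps an endpoint name to the rows that endpoint serves
        self._partitions = partitions

    def get_flight_info(self, cmd):
        schema = ["id", "value"]
        if cmd == b"schema":
            # Step 1: schema only, the endpoints are not ready yet
            return ToyFlightInfo(schema=schema)
        # Step 2: the same call with a different command returns the endpoints
        return ToyFlightInfo(schema=schema, endpoints=sorted(self._partitions))

    def do_get(self, endpoint):
        # Step 3: return the rows held by one endpoint
        return list(self._partitions[endpoint])


def read_in_parallel(server):
    schema = server.get_flight_info(b"schema").schema
    endpoints = server.get_flight_info(b"endpoints").endpoints
    # One worker per endpoint; pool.map keeps results in endpoint order
    with ThreadPoolExecutor(max_workers=max(1, len(endpoints))) as pool:
        chunks = list(pool.map(server.do_get, endpoints))
    rows = [row for chunk in chunks for row in chunk]
    return schema, rows


server = ToyFlightServer({"node-a": [(1, "x"), (2, "y")], "node-b": [(3, "z")]})
schema, rows = read_in_parallel(server)
print(schema, rows)
```

The awkward part is visible in read_in_parallel: the same call is made twice
and only the command bytes say which answer is wanted.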

I think I have to do each step; however, I don't like having to call
GetFlightInfo twice, as it doesn't seem very elegant. I see a few options:
1) live with calling GetFlightInfo twice, with a custom bytes cmd to
differentiate the purpose of each call
2) add an argument to GetFlightInfo to tell it it's being called only for
the schema
3) add another RPC endpoint, i.e. GetSchema(FlightDescriptor), to return just
the Schema in question
4) use DoAction and wrap the expected FlightInfo in a Result
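
As a rough illustration of what option 3 could look like from the service
side, here is a small sketch. Everything in it is hypothetical: the class
names, the dict used in place of a FlightInfo, and in particular the choice to
have the default get_schema fall back on get_flight_info are assumptions made
for the example, not a description of the Flight API.

```python
class SchemaUnavailable(Exception):
    """Raised when a service cannot report a schema without fetching data."""


class ToyFlightService:
    """Hypothetical base class; names mirror the RPCs discussed, not a real API."""

    def get_flight_info(self, descriptor):
        raise NotImplementedError

    def get_schema(self, descriptor):
        # Assumed default: reuse GetFlightInfo and keep only the schema
        info = self.get_flight_info(descriptor)
        if info.get("schema") is None:
            # A dedicated RPC can report this case as an explicit error
            raise SchemaUnavailable(descriptor)
        return info["schema"]


class TableService(ToyFlightService):
    def get_flight_info(self, descriptor):
        return {"schema": ["id", "value"], "endpoints": ["node-a", "node-b"]}


class LazyService(ToyFlightService):
    def get_flight_info(self, descriptor):
        # The schema is only known once the data has been computed
        return {"schema": None, "endpoints": ["node-a"]}


table_schema = TableService().get_schema("my_table")
try:
    LazyService().get_schema("computed_on_demand")
    lazy_error = None
except SchemaUnavailable as exc:
    lazy_error = exc
```

With a separate call, a client that only needs the schema never has to encode
that intent in the command bytes.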

I am aware that 4 is probably the least disruptive, but I'm also not a fan
as (to me) it implies performing an action on the server side. Suggestions
2 & 3 are larger changes and I am reluctant to make them unless there is a
consensus here. None of them are great options and I am wondering what
everyone thinks the best approach might be, particularly as I think this is
likely to come up in more applications than just Spark.

Best,
Ryan

Re: Spark and Arrow Flight

Posted by Wes McKinney <we...@gmail.com>.

On Mon, Jul 1, 2019 at 3:50 PM David Li <li...@gmail.com> wrote:
>
> I think I'd prefer #3 over overloading an existing call (#2).
>
> We've been thinking about a similar issue, where sometimes we want
> just the schema, but the service can't necessarily return the schema
> without fetching data - right now we return a sentinel value in
> GetFlightInfo, but a separate RPC would let us explicitly indicate an
> error.
>
> I might be missing something though - what happens between step 1 and
> 2 that makes the endpoints available? Would it make sense to use
> DoAction to cause the backend to "prepare" the endpoints, and have the
> result of that be an encoded schema? So then the flow would be
> DoAction -> GetFlightInfo -> DoGet.

I think it depends on the particular server/planner implementation. If
preparing a dataset is expensive (imagine loading a large dataset into
a distributed cache, then dropping it later), then it might be that
you have:

DoAction: Load/Prepare $DATASET

... clients access the dataset using GetFlightInfo with path $DATASET

DoAction: Drop $DATASET

In other cases GetFlightInfo might contain a SQL query, and so a
separate DoAction workflow is not needed.
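
A minimal sketch of the load / access / drop lifecycle described above,
assuming a toy in-memory server. ToyDatasetServer, the "load" and "drop"
action names, the endpoint names and the dict returned in place of a
FlightInfo are all invented for illustration and are not part of Flight.

```python
class ToyDatasetServer:
    """Hypothetical server with explicit load/drop actions; not the real Flight API."""

    def __init__(self):
        # Maps a prepared dataset name to the endpoints that serve it
        self._loaded = {}

    def do_action(self, action_type, dataset):
        if action_type == "load":
            # Stand-in for an expensive load into a distributed cache
            self._loaded[dataset] = ["node-a", "node-b"]
            return b"loaded"
        if action_type == "drop":
            self._loaded.pop(dataset, None)
            return b"dropped"
        raise ValueError(f"unknown action: {action_type}")

    def get_flight_info(self, path):
        # Only prepared datasets can be described
        if path not in self._loaded:
            raise KeyError(f"dataset not prepared: {path}")
        return {"path": path, "endpoints": list(self._loaded[path])}


server = ToyDatasetServer()
server.do_action("load", "sales")        # DoAction: Load/Prepare
info = server.get_flight_info("sales")   # clients look the dataset up by path
server.do_action("drop", "sales")        # DoAction: Drop
try:
    server.get_flight_info("sales")
    still_available = True
except KeyError:
    still_available = False
```
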

>
> Best,
> David
>
> On 7/1/19, Wes McKinney <we...@gmail.com> wrote:
> > My inclination is either #2 or #3. #4 is an option of course, but I
> > like the more structured solution of explicitly requesting the schema
> > given a descriptor.
> >
> > In both cases, it's possible that schemas are sent twice, e.g. if you
> > call GetSchema and then later call GetFlightInfo and so you receive
> > the schema again. The schema is optional, so if it became a
> > performance problem then a particular server might return the schema
> > as null from GetFlightInfo.
> >
> > I think it's valid to want to make a single GetFlightInfo RPC request
> > that returns _both_ the schema and the query plan.
> >
> > Thoughts from others?
> >
> > On Fri, Jun 28, 2019 at 8:52 PM Jacques Nadeau <ja...@apache.org> wrote:
> >>
> >> My initial inclination is towards #3 but I'd be curious what others
> >> think.
> >> In the case of #3, I wonder if it makes sense to then pull the Schema off
> >> the GetFlightInfo response...
> >>
> >

Re: Spark and Arrow Flight

Posted by Wes McKinney <we...@gmail.com>.

Of course, it might make just as much sense in Apache Spark. It is probably
worth bringing up with that community, too.

On Wed, Jul 10, 2019 at 12:37 PM Wes McKinney <we...@gmail.com> wrote:
>
> hi Ryan -- I was thinking that this might be built separately from the
> main Java project. We don't have a model in the codebase yet for
> libraries that depend on the core libraries (this could be in an apps/
> directory at the top level, so apps/spark-flight-source or something).
> So the development procedure would be to build and install the Arrow
> libraries first and then build the Spark-Flight source as a follow up.
>
> I think there would be a lot of benefit to maintaining common
> development infrastructure -- for example, we could set up
> docker-compose tasks to spin up nodes to simulate a distributed system
> for testing and benchmarking purposes, and utilize common CI systems.
>
> - Wes
>
> On Wed, Jul 10, 2019 at 12:28 PM Ryan Murray <ry...@dremio.com> wrote:
> >
> > Hey Wes,
> >
> > Would be happy to! Jacques and I had originally thought to try and get it
> > into Spark but perhaps Arrow might be a better home. I think the only issue
> > is whether we want to bring Spark jars and their dependencies into Arrow.
> > One challenge I have had so far with the connector is managing the
> > transitive arrow dependencies from Spark, the connector only works on
> > relatively recent versions of Spark and potentially can create circular
> > arrow dependencies. I think this issue will be better once 1.0.0 is done
> > and we can rely on a stable format/api.
> >
> > Best,
> > Ryan
> >
> > On Tue, Jul 9, 2019 at 5:08 PM Wes McKinney <we...@gmail.com> wrote:
> >
> > > Hi Ryan, have you thought about developing this inside Apache Arrow?
> > >
> > > On Tue, Jul 9, 2019, 5:42 PM Bryan Cutler <cu...@gmail.com> wrote:
> > >
> > > > Great, thanks Ryan! I'll take a look
> > > >
> > > > On Tue, Jul 9, 2019 at 3:31 PM Ryan Murray <ry...@dremio.com> wrote:
> > > >
> > > > > Hi Bryan,
> > > > >
> > > > > I have an implementation of option #3 nearly ready for a PR. I will
> > > > > mention you when I publish it.
> > > > >
> > > > > The working prototype for the Spark connector is here:
> > > > > https://github.com/rymurr/flight-spark-source. It technically works
> > > > > (and is very fast!) however the implementation is pretty dodgy and
> > > > > needs to be cleaned up before ready for prime time. I plan to have it
> > > > > ready to go for the Arrow 1.0.0 release as an apache 2.0 licensed
> > > > > project. Please shout if you have any comments or are interested in
> > > > > contributing!
> > > > >
> > > > > Best,
> > > > > Ryan
> > > > >
> > > > > On Tue, Jul 9, 2019 at 3:21 PM Bryan Cutler <cu...@gmail.com> wrote:
> > > > >
> > > > > > I'm in favor of option #3 also, but not sure what the best thing to
> > > > > > do with the existing FlightInfo response is. I'm definitely
> > > > > > interested in connecting Spark with Flight, can you share more
> > > > > > details of your work or is it planned to be open sourced?
> > > > > >
> > > > > > Thanks,
> > > > > > Bryan
> > > > > >
> > > > > > On Tue, Jul 2, 2019 at 3:35 AM Antoine Pitrou <an...@python.org> wrote:
> > > > > >
> > > > > > > Either #3 or #4 for me.  If #3, the default GetSchema implementation
> > > > > > > can rely on calling GetFlightInfo.
> > > > > > >
> > > > > > >
> > > > > > > Le 01/07/2019 à 22:50, David Li a écrit :
> > > > > > > > I think I'd prefer #3 over overloading an existing call (#2).
> > > > > > > >
> > > > > > > > We've been thinking about a similar issue, where sometimes we
> > > want
> > > > > > > > just the schema, but the service can't necessarily return the
> > > > schema
> > > > > > > > without fetching data - right now we return a sentinel value in
> > > > > > > > GetFlightInfo, but a separate RPC would let us explicitly
> > > indicate
> > > > an
> > > > > > > > error.
> > > > > > > >
> > > > > > > > I might be missing something though - what happens between step 1
> > > > and
> > > > > > > > 2 that makes the endpoints available? Would it make sense to use
> > > > > > > > DoAction to cause the backend to "prepare" the endpoints, and
> > > have
> > > > > the
> > > > > > > > result of that be an encoded schema? So then the flow would be
> > > > > > > > DoAction -> GetFlightInfo -> DoGet.
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > David
> > > > > > > >
> > > > > > > > On 7/1/19, Wes McKinney <we...@gmail.com> wrote:
> > > > > > > >> My inclination is either #2 or #3. #4 is an option of course,
> > > but
> > > > I
> > > > > > > >> like the more structured solution of explicitly requesting the
> > > > > schema
> > > > > > > >> given a descriptor.
> > > > > > > >>
> > > > > > > >> In both cases, it's possible that schemas are sent twice, e.g.
> > > if
> > > > > you
> > > > > > > >> call GetSchema and then later call GetFlightInfo and so you
> > > > receive
> > > > > > > >> the schema again. The schema is optional, so if it became a
> > > > > > > >> performance problem then a particular server might return the
> > > > schema
> > > > > > > >> as null from GetFlightInfo.
> > > > > > > >>
> > > > > > > >> I think it's valid to want to make a single GetFlightInfo RPC
> > > > > request
> > > > > > > >> that returns _both_ the schema and the query plan.
> > > > > > > >>
> > > > > > > >> Thoughts from others?
> > > > > > > >>
> > > > > > > >> On Fri, Jun 28, 2019 at 8:52 PM Jacques Nadeau <
> > > > jacques@apache.org>
> > > > > > > wrote:
> > > > > > > >>>
> > > > > > > >>> My initial inclination is towards #3 but I'd be curious what
> > > > others
> > > > > > > >>> think.
> > > > > > > >>> In the case of #3, I wonder if it makes sense to then pull the
> > > > > Schema
> > > > > > > off
> > > > > > > >>> the GetFlightInfo response...
> > > > > > > >>>
> > > > > > > >>> On Fri, Jun 28, 2019 at 10:57 AM Ryan Murray <
> > > rymurr@dremio.com>
> > > > > > > wrote:
> > > > > > > >>>
> > > > > > > >>>> Hi All,
> > > > > > > >>>>
> > > > > > > >>>> I have been working on building an arrow flight source for
> > > > spark.
> > > > > > The
> > > > > > > >>>> goal
> > > > > > > >>>> here is for Spark to be able to use a group of arrow flight
> > > > > > endpoints
> > > > > > > >>>> to
> > > > > > > >>>> get a dataset pulled over to spark in parallel.
> > > > > > > >>>>
> > > > > > > >>>> I am unsure of the best model for the spark <-> flight
> > > > > conversation
> > > > > > > and
> > > > > > > >>>> wanted to get your opinion on the best way to go.
> > > > > > > >>>>
> > > > > > > >>>> I am breaking up the query to flight from spark into 3 parts:
> > > > > > > >>>> 1) get the schema using GetFlightInfo. This is needed to do
> > > > > further
> > > > > > > >>>> lazy
> > > > > > > >>>> operations in Spark
> > > > > > > >>>> 2) get the endpoints by calling GetFlightInfo a 2nd time with
> > > a
> > > > > > > >>>> different
> > > > > > > >>>> argument. This returns the list endpoints on the parallel
> > > flight
> > > > > > > >>>> server.
> > > > > > > >>>> The endpoints are not available till data is ready to be
> > > > fetched,
> > > > > > > which
> > > > > > > >>>> is
> > > > > > > >>>> done after the schema but is needed before DoGet is called.
> > > > > > > >>>> 3) call get stream on all endpoints from 2
> > > > > > > >>>>
> > > > > > > >>>> I think I have to do each step however I don't like having to
> > > > call
> > > > > > > >>>> getInfo
> > > > > > > >>>> twice, it doesn't seem very elegant. I see a few options:
> > > > > > > >>>> 1) live with calling GetFlightInfo twice and with a custom
> > > bytes
> > > > > cmd
> > > > > > > to
> > > > > > > >>>> differentiate the purpose of each call
> > > > > > > >>>> 2) add an argument to GetFlightInfo to tell it its being
> > > called
> > > > > only
> > > > > > > >>>> for
> > > > > > > >>>> the schema
> > > > > > > >>>> 3) add another rpc endpoint: ie GetSchema(FlightDescriptor) to
> > > > > > return
> > > > > > > >>>> just
> > > > > > > >>>> the Schema in question
> > > > > > > >>>> 4) use DoAction and wrap the expected FlightInfo in a Result
> > > > > > > >>>>
> > > > > > > >>>> I am aware that 4 is probably the least disruptive but I'm
> > > also
> > > > > not
> > > > > > a
> > > > > > > >>>> fan
> > > > > > > >>>> as (to me) it implies performing an action on the server side.
> > > > > > > >>>> Suggestions
> > > > > > > >>>> 2 & 3 are larger changes and I am reluctant to do that unless
> > > > > there
> > > > > > is
> > > > > > > >>>> a
> > > > > > > >>>> consensus here. None of them are great options and I am
> > > > wondering
> > > > > > what
> > > > > > > >>>> everyone thinks the best approach might be? Particularly as I
> > > > > think
> > > > > > > this
> > > > > > > >>>> is
> > > > > > > >>>> likely to come up in more applications than just spark.
> > > > > > > >>>>
> > > > > > > >>>> Best,
> > > > > > > >>>> Ryan
> > > > > > > >>>>
> > > > > > > >>
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > >
> > > > > Ryan Murray  | Principal Consulting Engineer
> > > > >
> > > > > +447540852009 | rymurr@dremio.com
> > > > >
> > > > > <https://www.dremio.com/>
> > > > > Check out our GitHub <https://www.github.com/dremio>, join our
> > > community
> > > > > site <https://community.dremio.com/> & Download Dremio
> > > > > <https://www.dremio.com/download>
> > > > >
> > > >
> > >
> >
> >
> > --
> >
> > Ryan Murray  | Principal Consulting Engineer
> >
> > +447540852009 | rymurr@dremio.com
> >
> > <https://www.dremio.com/>
> > Check out our GitHub <https://www.github.com/dremio>, join our community
> > site <https://community.dremio.com/> & Download Dremio
> > <https://www.dremio.com/download>

Re: Spark and Arrow Flight

Posted by Wes McKinney <we...@gmail.com>.

hi Ryan -- I was thinking that this might be built separately from the
main Java project. We don't have a model in the codebase yet for
libraries that depend on the core libraries (this could be in an apps/
directory at the top level, so apps/spark-flight-source or something).
So the development procedure would be to build and install the Arrow
libraries first and then build the Spark-Flight source as a follow up.

I think there would be a lot of benefit to maintaining common
development infrastructure -- for example, we could set up
docker-compose tasks to spin up nodes to simulate a distributed system
for testing and benchmarking purposes, and utilize common CI systems.

- Wes

On Wed, Jul 10, 2019 at 12:28 PM Ryan Murray <ry...@dremio.com> wrote:
>
> Hey Wes,
>
> Would be happy to! Jacques and I had originally thought to try and get it
> into Spark but perhaps Arrow might be a better home. I think the only issue
> is whether we want to bring Spark jars and their dependencies into Arrow.
> One challenge I have had so far with the connector is managing the
> transitive arrow dependencies from Spark, the connector only works on
> relatively recent versions of Spark and potentially can create circular
> arrow dependencies. I think this issue will be better once 1.0.0 is done
> and we can rely on a stable format/api.
>
> Best,
> Ryan

Re: Spark and Arrow Flight

Posted by David Li <li...@gmail.com>.

Ah, I was just wondering what the status was, thanks for the info!

I think that is a format change, so it would need to go through a vote
here, though.

Best,
David

On 7/25/19, Ryan Murray <ry...@dremio.com> wrote:
> Hey David,
>
> Yes I am. I have a 3/4 done patch ready to go, just got busy with a few
> other things. Are you hoping to use it soon? I would like to get to it this
> week but it's looking increasingly unlikely.
>
> Best,
> Ryan
>
> On Thu, Jul 25, 2019 at 7:37 PM David Li <li...@gmail.com> wrote:
>
>> Hey Ryan,
>>
>> To follow up on this, are you planning on formally proposing the
>> GetSchema() call in Flight? I think it'd be interesting to have beyond
>> the Spark usecase as finding the schema may or may not be expensive
>> depending on the data stream (i.e. something computed on demand might
>> require data to be computed in order to get the schema), and
>> separating it from GetFlightInfo means that services that "don't know"
>> the schema ahead of time can still respond to that endpoint quickly.
>> (We could make the change minimal by leaving the schema in FlightInfo
>> and simply specifying it as best-effort.)
>>
>> Best,
>> David
>>
>> On 7/10/19, Ryan Murray <ry...@dremio.com> wrote:
>> > Hey Wes,
>> >
>> > Would be happy to! Jacques and I had originally thought to try and get
>> > it
>> > into Spark but perhaps Arrow might be a better home. I think the only
>> issue
>> > is whether we want to bring Spark jars and their dependencies into
>> > Arrow.
>> > One challenge I have had so far with the connector is managing the
>> > transitive arrow dependencies from Spark, the connector only works on
>> > relatively recent versions of Spark and potentially can create circular
>> > arrow dependencies. I think this issue will be better once 1.0.0 is
>> > done
>> > and we can rely on a stable format/api.
>> >
>> > Best,
>> > Ryan
>> >
>> > On Tue, Jul 9, 2019 at 5:08 PM Wes McKinney <we...@gmail.com>
>> > wrote:
>> >
>> >> Hi Ryan, have you thought about developing this inside Apache Arrow?
>> >>
>> >> On Tue, Jul 9, 2019, 5:42 PM Bryan Cutler <cu...@gmail.com> wrote:
>> >>
>> >> > Great, thanks Ryan! I'll take a look
>> >> >
>> >> > On Tue, Jul 9, 2019 at 3:31 PM Ryan Murray <ry...@dremio.com>
>> >> > wrote:
>> >> >
>> >> > > Hi Bryan,
>> >> > >
>> >> > > I have an implementation of option #3 nearly ready for a PR. I
>> >> > > will
>> >> > mention
>> >> > > you when I publish it.
>> >> > >
>> >> > > The working prototype for the Spark connector is here:
>> >> > > https://github.com/rymurr/flight-spark-source. It technically
>> >> > > works
>> >> (and
>> >> > > is
>> >> > > very fast!) however the implementation is pretty dodgy and needs
>> >> > > to
>> >> > > be
>> >> > > cleaned up before ready for prime time. I plan to have it ready to
>> go
>> >> for
>> >> > > the Arrow 1.0.0 release as an apache 2.0 licensed project. Please
>> >> > > shout
>> >> > if
>> >> > > you have any comments or are interested in contributing!
>> >> > >
>> >> > > Best,
>> >> > > Ryan
>> >> > >
>> >> > > On Tue, Jul 9, 2019 at 3:21 PM Bryan Cutler <cu...@gmail.com>
>> >> > > wrote:
>> >> > >
>> >> > > > I'm in favor of option #3 also, but not sure what the best thing
>> to
>> >> do
>> >> > > with
>> >> > > > the existing FlightInfo response is. I'm definitely interested
>> >> > > > in
>> >> > > > connecting Spark with Flight, can you share more details of your
>> >> > > > work
>> >> > or
>> >> > > is
>> >> > > > it planned to be open sourced?
>> >> > > >
>> >> > > > Thanks,
>> >> > > > Bryan
>> >> > > >
>> >> > > > On Tue, Jul 2, 2019 at 3:35 AM Antoine Pitrou
>> >> > > > <antoine@python.org
>> >
>> >> > > wrote:
>> >> > > >
>> >> > > > >
>> >> > > > > Either #3 or #4 for me.  If #3, the default GetSchema
>> >> implementation
>> >> > > can
>> >> > > > > rely on calling GetFlightInfo.
>> >> > > > >
>> >> > > > >
>> >> > > > > Le 01/07/2019 à 22:50, David Li a écrit :
>> >> > > > > > I think I'd prefer #3 over overloading an existing call
>> >> > > > > > (#2).
>> >> > > > > >
>> >> > > > > > We've been thinking about a similar issue, where sometimes
>> >> > > > > > we
>> >> want
>> >> > > > > > just the schema, but the service can't necessarily return
>> >> > > > > > the
>> >> > schema
>> >> > > > > > without fetching data - right now we return a sentinel value
>> in
>> >> > > > > > GetFlightInfo, but a separate RPC would let us explicitly
>> >> indicate
>> >> > an
>> >> > > > > > error.
>> >> > > > > >
>> >> > > > > > I might be missing something though - what happens between
>> step
>> >> > > > > > 1
>> >> > and
>> >> > > > > > 2 that makes the endpoints available? Would it make sense to
>> >> > > > > > use
>> >> > > > > > DoAction to cause the backend to "prepare" the endpoints,
>> >> > > > > > and
>> >> have
>> >> > > the
>> >> > > > > > result of that be an encoded schema? So then the flow would
>> >> > > > > > be
>> >> > > > > > DoAction -> GetFlightInfo -> DoGet.
>> >> > > > > >
>> >> > > > > > Best,
>> >> > > > > > David
>> >> > > > > >
>> >> > > > > > On 7/1/19, Wes McKinney <we...@gmail.com> wrote:
>> >> > > > > >> My inclination is either #2 or #3. #4 is an option of
>> >> > > > > >> course,
>> >> but
>> >> > I
>> >> > > > > >> like the more structured solution of explicitly requesting
>> the
>> >> > > schema
>> >> > > > > >> given a descriptor.
>> >> > > > > >>
>> >> > > > > >> In both cases, it's possible that schemas are sent twice,
>> e.g.
>> >> if
>> >> > > you
>> >> > > > > >> call GetSchema and then later call GetFlightInfo and so you
>> >> > receive
>> >> > > > > >> the schema again. The schema is optional, so if it became a
>> >> > > > > >> performance problem then a particular server might return
>> >> > > > > >> the
>> >> > schema
>> >> > > > > >> as null from GetFlightInfo.
>> >> > > > > >>
>> >> > > > > >> I think it's valid to want to make a single GetFlightInfo
>> >> > > > > >> RPC
>> >> > > request
>> >> > > > > >> that returns _both_ the schema and the query plan.
>> >> > > > > >>
>> >> > > > > >> Thoughts from others?
>> >> > > > > >>
>> >> > > > > >> On Fri, Jun 28, 2019 at 8:52 PM Jacques Nadeau <jacques@apache.org> wrote:
>> >> > > > > >>>
>> >> > > > > >>> My initial inclination is towards #3 but I'd be curious what
>> >> > > > > >>> others think.
>> >> > > > > >>> In the case of #3, I wonder if it makes sense to then pull the
>> >> > > > > >>> Schema off the GetFlightInfo response...
>> >> > > > > >>>
>> >> > > > > >>> On Fri, Jun 28, 2019 at 10:57 AM Ryan Murray <rymurr@dremio.com> wrote:
>> >> > > > > >>>
>> >> > > > > >>>> Hi All,
>> >> > > > > >>>>
>> >> > > > > >>>> I have been working on building an arrow flight source for spark. The goal
>> >> > > > > >>>> here is for Spark to be able to use a group of arrow flight endpoints to
>> >> > > > > >>>> get a dataset pulled over to spark in parallel.
>> >> > > > > >>>>
>> >> > > > > >>>> I am unsure of the best model for the spark <-> flight conversation and
>> >> > > > > >>>> wanted to get your opinion on the best way to go.
>> >> > > > > >>>>
>> >> > > > > >>>> I am breaking up the query to flight from spark into 3 parts:
>> >> > > > > >>>> 1) get the schema using GetFlightInfo. This is needed to do further lazy
>> >> > > > > >>>> operations in Spark
>> >> > > > > >>>> 2) get the endpoints by calling GetFlightInfo a 2nd time with a different
>> >> > > > > >>>> argument. This returns the list endpoints on the parallel flight server.
>> >> > > > > >>>> The endpoints are not available till data is ready to be fetched, which is
>> >> > > > > >>>> done after the schema but is needed before DoGet is called.
>> >> > > > > >>>> 3) call get stream on all endpoints from 2
>> >> > > > > >>>>
>> >> > > > > >>>> I think I have to do each step however I don't like having to call getInfo
>> >> > > > > >>>> twice, it doesn't seem very elegant. I see a few options:
>> >> > > > > >>>> 1) live with calling GetFlightInfo twice and with a custom bytes cmd to
>> >> > > > > >>>> differentiate the purpose of each call
>> >> > > > > >>>> 2) add an argument to GetFlightInfo to tell it its being called only for
>> >> > > > > >>>> the schema
>> >> > > > > >>>> 3) add another rpc endpoint: ie GetSchema(FlightDescriptor) to return just
>> >> > > > > >>>> the Schema in question
>> >> > > > > >>>> 4) use DoAction and wrap the expected FlightInfo in a Result
>> >> > > > > >>>>
>> >> > > > > >>>> I am aware that 4 is probably the least disruptive but I'm also not a fan
>> >> > > > > >>>> as (to me) it implies performing an action on the server side. Suggestions
>> >> > > > > >>>> 2 & 3 are larger changes and I am reluctant to do that unless there is a
>> >> > > > > >>>> consensus here. None of them are great options and I am wondering what
>> >> > > > > >>>> everyone thinks the best approach might be? Particularly as I think this is
>> >> > > > > >>>> likely to come up in more applications than just spark.
>> >> > > > > >>>>
>> >> > > > > >>>> Best,
>> >> > > > > >>>> Ryan

Re: Spark and Arrow Flight

Posted by Ryan Murray <ry...@dremio.com>.

Hey David,

Yes, I am. I have a patch that is about 3/4 done; I just got busy with a few
other things. Are you hoping to use it soon? I would like to get to it this
week, but it's looking increasingly unlikely.

Best,
Ryan

On Thu, Jul 25, 2019 at 7:37 PM David Li <li...@gmail.com> wrote:

> Hey Ryan,
>
> To follow up on this, are you planning on formally proposing the
> GetSchema() call in Flight? I think it'd be interesting to have beyond
> the Spark usecase as finding the schema may or may not be expensive
> depending on the data stream (i.e. something computed on demand might
> require data to be computed in order to get the schema), and
> separating it from GetFlightInfo means that services that "don't know"
> the schema ahead of time can still respond to that endpoint quickly.
> (We could make the change minimal by leaving the schema in FlightInfo
> and simply specifying it as best-effort.)
>
> Best,
> David


-- 

Ryan Murray  | Principal Consulting Engineer

+447540852009 | rymurr@dremio.com

<https://www.dremio.com/>
Check out our GitHub <https://www.github.com/dremio>, join our community
site <https://community.dremio.com/> & Download Dremio
<https://www.dremio.com/download>

Re: Spark and Arrow Flight

Posted by David Li <li...@gmail.com>.

Hey Ryan,

To follow up on this, are you planning on formally proposing the
GetSchema() call in Flight? I think it'd be interesting to have beyond
the Spark use case, as finding the schema may or may not be expensive
depending on the data stream (e.g. something computed on demand might
require data to be computed in order to get the schema), and
separating it from GetFlightInfo means that services that "don't know"
the schema ahead of time can still respond to GetFlightInfo quickly.
(We could make the change minimal by leaving the schema in FlightInfo
and simply specifying it as best-effort.)

Best,
David
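
To make the flow discussed above concrete, here is a minimal sketch in
plain Python: a cheap GetSchema call, then GetFlightInfo to obtain the
endpoints, then one DoGet per endpoint in parallel. It does not use the
real Arrow Flight API. ToyFlightService, FlightInfo, Endpoint and their
methods are hypothetical stand-ins, and the schema and rows are made-up
data.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Hypothetical schema representation: (column name, type name) pairs.
Schema = Tuple[Tuple[str, str], ...]


@dataclass(frozen=True)
class Endpoint:
    ticket: str  # opaque token handed back to do_get


@dataclass(frozen=True)
class FlightInfo:
    schema: Optional[Schema]  # best-effort: None if not known cheaply
    endpoints: List[Endpoint]


class ToyFlightService:
    """Hypothetical in-memory stand-in for a Flight server (not the real API)."""

    def __init__(self, schema: Schema, partitions: Dict[str, List[tuple]]):
        self._schema = schema
        self._partitions = partitions
        self.prepared = False  # records whether the expensive step has run

    def get_schema(self, descriptor: str) -> Schema:
        # Cheap call: answers without preparing any data.
        return self._schema

    def get_flight_info(self, descriptor: str) -> FlightInfo:
        # Expensive call: "prepares" the data so that endpoints exist.
        self.prepared = True
        endpoints = [Endpoint(ticket) for ticket in sorted(self._partitions)]
        return FlightInfo(schema=None, endpoints=endpoints)

    def do_get(self, endpoint: Endpoint) -> List[tuple]:
        if not self.prepared:
            raise RuntimeError("endpoints are not available yet")
        return self._partitions[endpoint.ticket]


def read_all(service: ToyFlightService, descriptor: str):
    schema = service.get_schema(descriptor)          # step 1: schema only
    info = service.get_flight_info(descriptor)       # step 2: endpoints
    with ThreadPoolExecutor(max_workers=4) as pool:  # step 3: parallel DoGet
        chunks = list(pool.map(service.do_get, info.endpoints))
    rows = [row for chunk in chunks for row in chunk]
    return schema, rows


service = ToyFlightService(
    schema=(("id", "int64"), ("name", "utf8")),
    partitions={"part-0": [(1, "a"), (2, "b")], "part-1": [(3, "c")]},
)
schema, rows = read_all(service, "example-query")
```

In this sketch the schema is available before anything has been
prepared, and FlightInfo carries schema=None, which is one way to read
the "best-effort" idea in the parenthetical above.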

On 7/10/19, Ryan Murray <ry...@dremio.com> wrote:
> Hey Wes,
>
> Would be happy to! Jacques and I had originally thought to try and get it
> into Spark but perhaps Arrow might be a better home. I think the only issue
> is whether we want to bring Spark jars and their dependencies into Arrow.
> One challenge I have had so far with the connector is managing the
> transitive arrow dependencies from Spark, the connector only works on
> relatively recent versions of Spark and potentially can create circular
> arrow dependencies. I think this issue will be better once 1.0.0 is done
> and we can rely on a stable format/api.
>
> Best,
> Ryan

Re: Spark and Arrow Flight

Posted by Ryan Murray <ry...@dremio.com>.

Hey Wes,

Would be happy to! Jacques and I had originally thought to try and get it
into Spark, but perhaps Arrow might be a better home. I think the only issue
is whether we want to bring Spark jars and their dependencies into Arrow.
One challenge I have had so far with the connector is managing the
transitive Arrow dependencies from Spark: the connector only works on
relatively recent versions of Spark and can potentially create circular
Arrow dependencies. I think this issue will be better once 1.0.0 is done
and we can rely on a stable format/API.

Best,
Ryan

On Tue, Jul 9, 2019 at 5:08 PM Wes McKinney <we...@gmail.com> wrote:

> Hi Ryan, have you thought about developing this inside Apache Arrow?


-- 

Ryan Murray  | Principal Consulting Engineer

+447540852009 | rymurr@dremio.com

<https://www.dremio.com/>
Check out our GitHub <https://www.github.com/dremio>, join our community
site <https://community.dremio.com/> & Download Dremio
<https://www.dremio.com/download>

Re: Spark and Arrow Flight

Posted by Wes McKinney <we...@gmail.com>.

Hi Ryan, have you thought about developing this inside Apache Arrow?

On Tue, Jul 9, 2019, 5:42 PM Bryan Cutler <cu...@gmail.com> wrote:

> Great, thanks Ryan! I'll take a look
>
> On Tue, Jul 9, 2019 at 3:31 PM Ryan Murray <ry...@dremio.com> wrote:
>
> > Hi Bryan,
> >
> > I have an implementation of option #3 nearly ready for a PR. I will
> > mention you when I publish it.
> >
> > The working prototype for the Spark connector is here:
> > https://github.com/rymurr/flight-spark-source. It technically works (and
> > is very fast!) however the implementation is pretty dodgy and needs to be
> > cleaned up before ready for prime time. I plan to have it ready to go for
> > the Arrow 1.0.0 release as an apache 2.0 licensed project. Please shout
> > if you have any comments or are interested in contributing!
> >
> > Best,
> > Ryan
> >
> > On Tue, Jul 9, 2019 at 3:21 PM Bryan Cutler <cu...@gmail.com> wrote:
> >
> > > I'm in favor of option #3 also, but not sure what the best thing to do
> > with
> > > the existing FlightInfo response is. I'm definitely interested in
> > > connecting Spark with Flight, can you share more details of your work
> or
> > is
> > > it planned to be open sourced?
> > >
> > > Thanks,
> > > Bryan
> > >
> > > On Tue, Jul 2, 2019 at 3:35 AM Antoine Pitrou <an...@python.org>
> > wrote:
> > >
> > > >
> > > > Either #3 or #4 for me.  If #3, the default GetSchema implementation
> > can
> > > > rely on calling GetFlightInfo.
> > > >
> > > >
> > > > Le 01/07/2019 à 22:50, David Li a écrit :
> > > > > I think I'd prefer #3 over overloading an existing call (#2).
> > > > >
> > > > > We've been thinking about a similar issue, where sometimes we want
> > > > > just the schema, but the service can't necessarily return the
> schema
> > > > > without fetching data - right now we return a sentinel value in
> > > > > GetFlightInfo, but a separate RPC would let us explicitly indicate
> an
> > > > > error.
> > > > >
> > > > > I might be missing something though - what happens between step 1
> and
> > > > > 2 that makes the endpoints available? Would it make sense to use
> > > > > DoAction to cause the backend to "prepare" the endpoints, and have
> > the
> > > > > result of that be an encoded schema? So then the flow would be
> > > > > DoAction -> GetFlightInfo -> DoGet.
> > > > >
> > > > > Best,
> > > > > David
> > > > >
> > > > > On 7/1/19, Wes McKinney <we...@gmail.com> wrote:
> > > > >> My inclination is either #2 or #3. #4 is an option of course, but
> I
> > > > >> like the more structured solution of explicitly requesting the
> > schema
> > > > >> given a descriptor.
> > > > >>
> > > > >> In both cases, it's possible that schemas are sent twice, e.g. if
> > you
> > > > >> call GetSchema and then later call GetFlightInfo and so you
> receive
> > > > >> the schema again. The schema is optional, so if it became a
> > > > >> performance problem then a particular server might return the
> schema
> > > > >> as null from GetFlightInfo.
> > > > >>
> > > > >> I think it's valid to want to make a single GetFlightInfo RPC
> > request
> > > > >> that returns _both_ the schema and the query plan.
> > > > >>
> > > > >> Thoughts from others?
> > > > >>
> > > > >> On Fri, Jun 28, 2019 at 8:52 PM Jacques Nadeau <
> jacques@apache.org>
> > > > wrote:
> > > > >>>
> > > > >>> My initial inclination is towards #3 but I'd be curious what
> others
> > > > >>> think.
> > > > >>> In the case of #3, I wonder if it makes sense to then pull the
> > Schema
> > > > off
> > > > >>> the GetFlightInfo response...
> > > > >>>
> > > > >>> On Fri, Jun 28, 2019 at 10:57 AM Ryan Murray <ry...@dremio.com>
> > > > wrote:
> > > > >>>
> > > > >>>> Hi All,
> > > > >>>>
> > > > >>>> I have been working on building an arrow flight source for
> spark.
> > > The
> > > > >>>> goal
> > > > >>>> here is for Spark to be able to use a group of arrow flight
> > > endpoints
> > > > >>>> to
> > > > >>>> get a dataset pulled over to spark in parallel.
> > > > >>>>
> > > > >>>> I am unsure of the best model for the spark <-> flight
> > conversation
> > > > and
> > > > >>>> wanted to get your opinion on the best way to go.
> > > > >>>>
> > > > >>>> I am breaking up the query to flight from spark into 3 parts:
> > > > >>>> 1) get the schema using GetFlightInfo. This is needed to do
> > further
> > > > >>>> lazy
> > > > >>>> operations in Spark
> > > > >>>> 2) get the endpoints by calling GetFlightInfo a 2nd time with a
> > > > >>>> different
> > > > >>>> argument. This returns the list endpoints on the parallel flight
> > > > >>>> server.
> > > > >>>> The endpoints are not available till data is ready to be
> fetched,
> > > > which
> > > > >>>> is
> > > > >>>> done after the schema but is needed before DoGet is called.
> > > > >>>> 3) call get stream on all endpoints from 2
> > > > >>>>
> > > > >>>> I think I have to do each step however I don't like having to
> call
> > > > >>>> getInfo
> > > > >>>> twice, it doesn't seem very elegant. I see a few options:
> > > > >>>> 1) live with calling GetFlightInfo twice and with a custom bytes
> > cmd
> > > > to
> > > > >>>> differentiate the purpose of each call
> > > > >>>> 2) add an argument to GetFlightInfo to tell it its being called
> > only
> > > > >>>> for
> > > > >>>> the schema
> > > > >>>> 3) add another rpc endpoint: ie GetSchema(FlightDescriptor) to
> > > return
> > > > >>>> just
> > > > >>>> the Schema in question
> > > > >>>> 4) use DoAction and wrap the expected FlightInfo in a Result
> > > > >>>>
> > > > >>>> I am aware that 4 is probably the least disruptive but I'm also
> > not
> > > a
> > > > >>>> fan
> > > > >>>> as (to me) it implies performing an action on the server side.
> > > > >>>> Suggestions
> > > > >>>> 2 & 3 are larger changes and I am reluctant to do that unless
> > there
> > > is
> > > > >>>> a
> > > > >>>> consensus here. None of them are great options and I am
> wondering
> > > what
> > > > >>>> everyone thinks the best approach might be? Particularly as I
> > think
> > > > this
> > > > >>>> is
> > > > >>>> likely to come up in more applications than just spark.
> > > > >>>>
> > > > >>>> Best,
> > > > >>>> Ryan
> > > > >>>>
> > > > >>
> > > >
> > >
> >
> >
> > --
> >
> > Ryan Murray  | Principal Consulting Engineer
> >
> > +447540852009 | rymurr@dremio.com
> >
> > <https://www.dremio.com/>
> > Check out our GitHub <https://www.github.com/dremio>, join our community
> > site <https://community.dremio.com/> & Download Dremio
> > <https://www.dremio.com/download>
> >
>

Re: Spark and Arrow Flight

Posted by Bryan Cutler <cu...@gmail.com>.

Great, thanks Ryan! I'll take a look

On Tue, Jul 9, 2019 at 3:31 PM Ryan Murray <ry...@dremio.com> wrote:

> Hi Bryan,
>
> I have an implementation of option #3 nearly ready for a PR. I will mention
> you when I publish it.
>
> The working prototype for the Spark connector is here:
> https://github.com/rymurr/flight-spark-source. It technically works (and is
> very fast!) however the implementation is pretty dodgy and needs to be
> cleaned up before ready for prime time. I plan to have it ready to go for
> the Arrow 1.0.0 release as an apache 2.0 licensed project. Please shout if
> you have any comments or are interested in contributing!
>
> Best,
> Ryan

Re: Spark and Arrow Flight

Posted by Ryan Murray <ry...@dremio.com>.

Hi Bryan,

I have an implementation of option #3 nearly ready for a PR. I will mention
you when I publish it.

The working prototype for the Spark connector is here:
https://github.com/rymurr/flight-spark-source. It technically works (and is
very fast!); however, the implementation is pretty dodgy and needs to be
cleaned up before it is ready for prime time. I plan to have it ready to go for
the Arrow 1.0.0 release as an Apache 2.0 licensed project. Please shout if
you have any comments or are interested in contributing!

Best,
Ryan

On Tue, Jul 9, 2019 at 3:21 PM Bryan Cutler <cu...@gmail.com> wrote:

> I'm in favor of option #3 also, but not sure what the best thing to do with
> the existing FlightInfo response is. I'm definitely interested in
> connecting Spark with Flight, can you share more details of your work or is
> it planned to be open sourced?
>
> Thanks,
> Bryan

-- 

Ryan Murray  | Principal Consulting Engineer

+447540852009 | rymurr@dremio.com

<https://www.dremio.com/>
Check out our GitHub <https://www.github.com/dremio>, join our community
site <https://community.dremio.com/> & Download Dremio
<https://www.dremio.com/download>

Re: Spark and Arrow Flight

Posted by Bryan Cutler <cu...@gmail.com>.

I'm also in favor of option #3, but I'm not sure what the best thing to do with
the existing FlightInfo response is. I'm definitely interested in
connecting Spark with Flight. Can you share more details of your work, or is
it planned to be open sourced?

Thanks,
Bryan

On Tue, Jul 2, 2019 at 3:35 AM Antoine Pitrou <an...@python.org> wrote:

>
> Either #3 or #4 for me.  If #3, the default GetSchema implementation can
> rely on calling GetFlightInfo.

Re: Spark and Arrow Flight

Posted by Antoine Pitrou <an...@python.org>.

Either #3 or #4 for me.  If #3, the default GetSchema implementation can
rely on calling GetFlightInfo.
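
A minimal sketch of that default, assuming a toy in-memory model rather than
the real Flight server API. The names FlightInfo, FlightServerBase,
StaticServer and the placeholder schema and endpoint strings are all
hypothetical and exist only for illustration:

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FlightInfo:
    """Toy stand-in for a GetFlightInfo response: a schema plus endpoints."""
    schema: Optional[List[str]]
    endpoints: List[str] = field(default_factory=list)


class FlightServerBase:
    """Hypothetical server base class; not the real Flight server API."""

    def get_flight_info(self, descriptor: str) -> FlightInfo:
        raise NotImplementedError

    def get_schema(self, descriptor: str) -> Optional[List[str]]:
        # Default behaviour: reuse GetFlightInfo and keep only the schema.
        # Servers that can answer more cheaply override this method.
        return self.get_flight_info(descriptor).schema


class StaticServer(FlightServerBase):
    """Implements only get_flight_info and inherits get_schema."""

    def get_flight_info(self, descriptor: str) -> FlightInfo:
        return FlightInfo(schema=["id: int64", "name: utf8"],
                          endpoints=["endpoint-a", "endpoint-b"])


server = StaticServer()
print(server.get_schema("my-dataset"))
```

With a default like this, existing servers would keep working unchanged, and
only servers for which GetFlightInfo is expensive would need to override
get_schema.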


On 01/07/2019 at 22:50, David Li wrote:
> I think I'd prefer #3 over overloading an existing call (#2).
> 
> We've been thinking about a similar issue, where sometimes we want
> just the schema, but the service can't necessarily return the schema
> without fetching data - right now we return a sentinel value in
> GetFlightInfo, but a separate RPC would let us explicitly indicate an
> error.
> 
> I might be missing something though - what happens between step 1 and
> 2 that makes the endpoints available? Would it make sense to use
> DoAction to cause the backend to "prepare" the endpoints, and have the
> result of that be an encoded schema? So then the flow would be
> DoAction -> GetFlightInfo -> DoGet.
> 
> Best,
> David

Re: Spark and Arrow Flight

Posted by David Li <li...@gmail.com>.

I think I'd prefer #3 over overloading an existing call (#2).

We've been thinking about a similar issue, where sometimes we want
just the schema, but the service can't necessarily return the schema
without fetching data - right now we return a sentinel value in
GetFlightInfo, but a separate RPC would let us explicitly indicate an
error.

I might be missing something though - what happens between step 1 and
2 that makes the endpoints available? Would it make sense to use
DoAction to cause the backend to "prepare" the endpoints, and have the
result of that be an encoded schema? So then the flow would be
DoAction -> GetFlightInfo -> DoGet.
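
As a rough sketch of that flow, assuming an in-memory toy backend instead of
a real Flight service (ToyFlightBackend, its method names, the ticket strings
and the schema strings are all hypothetical):

```python
from typing import Dict, Iterator, List


class ToyFlightBackend:
    """In-memory model of the three-step flow; every name is hypothetical."""

    def __init__(self) -> None:
        self._prepared: Dict[str, List[str]] = {}

    def do_action_prepare(self, descriptor: str) -> List[str]:
        # Step 1 (DoAction): prepare the endpoints and return the schema.
        self._prepared[descriptor] = [f"{descriptor}/part-{i}" for i in range(3)]
        return ["id: int64", "value: float64"]

    def get_flight_info(self, descriptor: str) -> List[str]:
        # Step 2 (GetFlightInfo): endpoints exist only after the prepare step.
        if descriptor not in self._prepared:
            raise RuntimeError("endpoints not prepared; call do_action_prepare first")
        return list(self._prepared[descriptor])

    def do_get(self, ticket: str) -> Iterator[str]:
        # Step 3 (DoGet): stream the batches for a single endpoint.
        yield f"batch from {ticket}"


backend = ToyFlightBackend()
schema = backend.do_action_prepare("query-1")
tickets = backend.get_flight_info("query-1")
batches = [batch for ticket in tickets for batch in backend.do_get(ticket)]
print(schema, batches)
```

In a real client the per-ticket do_get calls in the last step are what would
be spread across workers to read the endpoints in parallel.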

Best,
David

On 7/1/19, Wes McKinney <we...@gmail.com> wrote:
> My inclination is either #2 or #3. #4 is an option of course, but I
> like the more structured solution of explicitly requesting the schema
> given a descriptor.
>
> In both cases, it's possible that schemas are sent twice, e.g. if you
> call GetSchema and then later call GetFlightInfo and so you receive
> the schema again. The schema is optional, so if it became a
> performance problem then a particular server might return the schema
> as null from GetFlightInfo.
>
> I think it's valid to want to make a single GetFlightInfo RPC request
> that returns _both_ the schema and the query plan.
>
> Thoughts from others?

Re: Spark and Arrow Flight

Posted by Wes McKinney <we...@gmail.com>.

My inclination is either #2 or #3. #4 is an option of course, but I
like the more structured solution of explicitly requesting the schema
given a descriptor.

In both cases, it's possible that schemas are sent twice, e.g. if you
call GetSchema and then later call GetFlightInfo and so you receive
the schema again. The schema is optional, so if it became a
performance problem then a particular server might return the schema
as null from GetFlightInfo.
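
A small sketch of how a client might cope with that, again using a toy model
rather than the real Flight classes (FlightInfo here is a hypothetical
dataclass, and resolve_schema is an invented helper):

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FlightInfo:
    """Toy stand-in for a GetFlightInfo response with an optional schema."""
    schema: Optional[List[str]]
    endpoints: List[str] = field(default_factory=list)


def resolve_schema(cached: Optional[List[str]], info: FlightInfo) -> List[str]:
    """Prefer the schema carried by FlightInfo, else the earlier GetSchema result."""
    if info.schema is not None:
        return info.schema
    if cached is None:
        raise ValueError("no schema available from either call")
    return cached


cached_schema = ["id: int64"]                         # from an earlier GetSchema call
info = FlightInfo(schema=None, endpoints=["part-0"])  # server omitted the schema
print(resolve_schema(cached_schema, info))
```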

I think it's valid to want to make a single GetFlightInfo RPC request
that returns _both_ the schema and the query plan.

Thoughts from others?

On Fri, Jun 28, 2019 at 8:52 PM Jacques Nadeau <ja...@apache.org> wrote:
>
> My initial inclination is towards #3 but I'd be curious what others think.
> In the case of #3, I wonder if it makes sense to then pull the Schema off
> the GetFlightInfo response...

Re: Spark and Arrow Flight

Posted by Jacques Nadeau <ja...@apache.org>.

My initial inclination is towards #3 but I'd be curious what others think.
In the case of #3, I wonder if it makes sense to then pull the Schema off
the GetFlightInfo response...
