Posted to dev@arrow.apache.org by Joris Peeters <jo...@gmail.com> on 2020/06/23 17:50:38 UTC

Flight: beginner questions for usage

Hello,

I'm interested in using Flight for serving large amounts of data in a
parallelised manner, and am currently building some Python prototypes based on
https://github.com/apache/arrow/blob/apache-arrow-0.17.1/python/examples/flight

In my use case, we'd have a number of worker servers serving several
different datasets (here called "datasetA" and "datasetB"). A single query
can also carry additional parameters to customise it (e.g. a date range if
the dataset is a time series, but it can be other things too, depending on
the dataset).

The idea is for clients to hit a single coordinator with their entire query
(e.g. datasetA + [1970, 2020]) and then be instructed to hit a variety of
workers, each with a slice of it, e.g. {worker1: (datasetA, [1970, 1990)),
worker2: (datasetA, [1990, 2020])}. In other words, I want to chunk up the
original request into a few smaller ones, to be handled by different
workers, which then retrieve the data from a DB and send it back to the
client, which aggregates.

Although I'm prototyping from Python, this should work from a variety of
platforms.
Does that sound like something Flight should be able to do well?

If so, what are the intended semantics for the descriptor, ticket etc.,
based on my example above? I see idioms for "path" and "cmd", but neither
really seems to fit. My query is more like an opaque JSON document, e.g.
something you'd submit to an HTTP server. Is the idea to send a string
serialisation of, e.g.:

{
  "dataset": "datasetA",
  "dateFrom": "1970-01-01",
  "dateTo": "2020-06-23"
}?

In that case, what should listFlights return, given that the queries are
dynamic? Something like,
["datasetA", "datasetB", ...] ?

I guess I'm mainly struggling to understand what a descriptor, a ticket and
a flight really are within my context, and I can't really find this in the
docs.
Just a link to some good docs would obviously be great as well! I'm hitting
https://arrow.apache.org/docs/python/api/flight.html which is largely
empty. It does say "Flight is currently not distributed as part of wheels
or in Conda - it is only available when built from source appropriately."
which seems a bit pessimistic, as it appears to be present in both the PyPI
and conda 0.17.1 packages I checked.

Cheers,
-Joris.

Re: Flight: beginner questions for usage

Posted by Joris Peeters <jo...@gmail.com>.
OK, great, that clarifies a lot - I didn't appreciate how flexible the
interpretation can be. Thanks for the doc pointer as well.
I'll power onwards with the experiments. If I encounter areas of Flight
that are still a bit rough around the edges - being in beta - then we
might be able to contribute as well.

-J

On Tue, Jun 23, 2020 at 8:34 PM David Li <li...@gmail.com> wrote:

> Hey Joris,
>
> Your plan sounds right for Flight. As for semantics:
>
> The descriptor and ticket format are mostly application defined. For
> instance, I think some places (Dremio?) just put a raw SQL query as
> the "cmd" of a descriptor; putting serialized JSON or Protobuf is also
> certainly fine.
>
> I'd say implementing _every_ endpoint isn't required - we don't use
> ListFlights for instance.
>
> In terms of what you described, I'd map a descriptor to a query, and a
> Flight to its execution; calling GetFlightInfo would return each
> worker in its own FlightEndpoint, and the Ticket would be something
> agreed upon by your coordinator and worker (e.g. the request and time
> range).
>
> For docs, have you seen this?
> https://arrow.apache.org/docs/format/Flight.html While it's labeled
> "Format", it contains an example of a Flight request flow.
>
> Best,
> David
>

Re: Flight: beginner questions for usage

Posted by David Li <li...@gmail.com>.
Hey Joris,

Your plan sounds right for Flight. As for semantics:

The descriptor and ticket format are mostly application defined. For
instance, I think some places (Dremio?) just put a raw SQL query as
the "cmd" of a descriptor; putting serialized JSON or Protobuf is also
certainly fine.
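
For illustration, both of these would be legitimate descriptors (the SQL
string and the JSON payload are just examples of application-defined
commands, not anything Flight itself prescribes):

import json
import pyarrow.flight as flight

# A raw SQL query as the command...
desc_sql = flight.FlightDescriptor.for_command(
    b"SELECT * FROM datasetA WHERE ts >= '1970-01-01'")

# ...or an application-defined serialised payload (JSON here; Protobuf works too).
desc_json = flight.FlightDescriptor.for_command(
    json.dumps({"dataset": "datasetA", "dateFrom": "1970-01-01",
                "dateTo": "2020-06-23"}).encode("utf-8"))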

I'd say implementing _every_ endpoint isn't required - we don't use
ListFlights for instance.

In terms of what you described, I'd map a descriptor to a query, and a
Flight to its execution; calling GetFlightInfo would return each
worker in its own FlightEndpoint, and the Ticket would be something
agreed upon by your coordinator and worker (e.g. the request and time
range).
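
To make that concrete, here's a rough server-side sketch. The worker
addresses, the date-splitting logic, the schema and the fetch_from_db
helper are all made up for illustration - the only fixed parts are the
Flight classes themselves:

import json
import pyarrow as pa
import pyarrow.flight as flight

class Coordinator(flight.FlightServerBase):
    # The descriptor is the query; GetFlightInfo returns one endpoint per
    # worker, each carrying a ticket the coordinator and workers agree on.
    def get_flight_info(self, context, descriptor):
        query = json.loads(descriptor.command)
        endpoints = [
            flight.FlightEndpoint(
                flight.Ticket(json.dumps(dict(query, dateFrom="1970-01-01",
                                              dateTo="1990-01-01")).encode()),
                ["grpc://worker1:8815"]),
            flight.FlightEndpoint(
                flight.Ticket(json.dumps(dict(query, dateFrom="1990-01-01",
                                              dateTo="2020-06-23")).encode()),
                ["grpc://worker2:8815"]),
        ]
        schema = pa.schema([("ts", pa.date32()), ("value", pa.float64())])
        return flight.FlightInfo(schema, descriptor, endpoints, -1, -1)

class Worker(flight.FlightServerBase):
    # The ticket is opaque to clients; the worker decodes it and streams the slice.
    def do_get(self, context, ticket):
        request = json.loads(ticket.ticket)
        table = fetch_from_db(request)  # hypothetical helper returning a pyarrow.Table
        return flight.RecordBatchStream(table)

The client then does exactly what you described: one GetFlightInfo against
the coordinator, one DoGet per returned endpoint, and the aggregation
happens client-side.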

For docs, have you seen this?
https://arrow.apache.org/docs/format/Flight.html While it's labeled
"Format", it contains an example of a Flight request flow.

Best,
David
