Posted to dev@arrow.apache.org by ming zhang <mi...@gmail.com> on 2019/04/03 00:53:56 UTC

How to understand this comment

Hi All

I wonder how to understand this comment.

It looks like it assumes we have only one "itinerary", and that finishing it
requires consuming all of the flight's endpoints.

  /*
   * A list of endpoints associated with the flight. To consume the whole
   * flight, all endpoints must be consumed.
   */

What if we want to provide more than one execution plan (itinerary), each
with multiple steps?

Thanks
Ming

Re: How to understand this comment

Posted by Wes McKinney <we...@gmail.com>.
On Wed, Apr 3, 2019 at 9:56 AM Jacques Nadeau <ja...@apache.org> wrote:
>
> >
> > "To consume the whole flight, generally all endpoints must be consumed"
> >
>
> I actually think the problem is we changed some names at one point and this
> comment is a bit behind. This should probably read that "To consume the
> whole flight all Flight Itineraries must be consumed. Any endpoint listed
> can be used to consume a particular flight itinerary".

Ah, you are right, apologies. Since FlightEndpoint has a list of
locations, the alternative transports could be found among those
locations:

https://github.com/apache/arrow/blob/master/format/Flight.proto#L249
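
A minimal sketch of this reading, in plain Python rather than the Flight
API: each endpoint covers one part of the dataset and lists several
locations that serve that same part, so a client picks one location per
endpoint and still visits every endpoint. The Endpoint class, the "tcp" and
"ib" URI schemes, the host names and the pick_location helper are all
hypothetical, made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence
from urllib.parse import urlparse


@dataclass
class Endpoint:
    """Hypothetical stand-in for FlightEndpoint: one ticket, many locations."""
    ticket: bytes
    locations: List[str]  # URIs that all serve the same part of the dataset


def pick_location(endpoint: Endpoint, preferred_schemes: Sequence[str]) -> str:
    """Return the first location whose URI scheme the client prefers."""
    for scheme in preferred_schemes:
        for uri in endpoint.locations:
            if urlparse(uri).scheme == scheme:
                return uri
    # No preferred transport is offered; fall back to the first location.
    return endpoint.locations[0]


endpoints = [
    Endpoint(b"part-1", ["tcp://host1:9000", "ib://host3:9100"]),
    Endpoint(b"part-2", ["tcp://host2:9000"]),
]

# One location is chosen per endpoint, but every endpoint is still consumed.
plan = [(ep.ticket, pick_location(ep, ["ib", "tcp"])) for ep in endpoints]
```

Under this sketch the choice of transport is made per endpoint, which is why
the number of endpoints stays fixed no matter which transport is used.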

Re: How to understand this comment

Posted by Jacques Nadeau <ja...@apache.org>.
>
> "To consume the whole flight, generally all endpoints must be consumed"
>

I actually think the problem is we changed some names at one point and this
comment is a bit behind. This should probably read that "To consume the
whole flight all Flight Itineraries must be consumed. Any endpoint listed
can be used to consume a particular flight itinerary".

Re: How to understand this comment

Posted by Wes McKinney <we...@gmail.com>.
hi,

On Wed, Apr 3, 2019 at 8:19 AM ming zhang <mi...@gmail.com> wrote:
>
> but then the comment is not right? since not all flights will be consumed?
>
> "To consume the whole flight, all endpoints must be consumed."
>

I think we might be splitting hairs a little bit. Would it help if we
added "generally" to the comment?

"To consume the whole flight, generally all endpoints must be consumed"

> if we introduce a itinerary concept, we have a complete story and mental
> model. something like
>
> message FlightGetInfo {
>   // schema of the dataset as described in Schema.fbs::Schema.
>   bytes schema = 1;
>
>   /*
>    * The descriptor associated with this info.
>    */
>   FlightDescriptor flight_descriptor = 2;
>
>   /*
>    * A list of endpoints associated with the flight. To consume the whole
>    * flight, all endpoints must be consumed.
>    */
>   repeated FlightItinerary itinerary = 3;
>
>   // Set these to -1 if unknown.
>   int64 total_records = 4;
>   int64 total_bytes = 5;
> }
>
> message FlightItinerary {
>   /*
>    * A list of endpoints associated with the itinerary. To consume the whole
>    * itinerary, all endpoints must be consumed.
>    */
>   repeated FlightEndpoint endpoint = 1;
> }
>

I think this adds extra complexity that many (most?) Flight users
won't need. If you truly have this need, you could add an itinerary
number to the data structure that's serialized in Ticket.
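
A rough sketch of that suggestion, in plain Python and without the Flight
API: the server encodes an itinerary number in each ticket, and a client
that knows the encoding groups the tickets by itinerary, then consumes every
endpoint of the itinerary it picks. JSON is used here only for brevity; the
field names "itinerary" and "part" and both helper functions are
assumptions, not part of Flight.

```python
import json
from collections import defaultdict
from typing import Dict, List


def make_ticket(itinerary: int, part: int) -> bytes:
    """Serialize hypothetical routing metadata into an opaque ticket."""
    return json.dumps({"itinerary": itinerary, "part": part}).encode("utf-8")


def group_by_itinerary(tickets: List[bytes]) -> Dict[int, List[bytes]]:
    """Group tickets by the itinerary number stored inside them."""
    groups: Dict[int, List[bytes]] = defaultdict(list)
    for ticket in tickets:
        meta = json.loads(ticket.decode("utf-8"))
        groups[meta["itinerary"]].append(ticket)
    return dict(groups)


# Itinerary 0 needs two endpoints; itinerary 1 serves everything from one.
tickets = [make_ticket(0, 1), make_ticket(0, 2), make_ticket(1, 0)]
itineraries = group_by_itinerary(tickets)
```

This only works when client and server agree on the ticket encoding, since
a generic Flight client treats the ticket as opaque bytes.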

Re: How to understand this comment

Posted by Jacques Nadeau <ja...@apache.org>.
The current model is that there is a fixed number of itineraries. The
available endpoints could, in theory, include multiple transports.

Your example, where there is a variable number of itineraries depending
on protocol, is not currently supported. In that case, I would suggest that
the list includes those as distinct datasets, or that you provide some kind
of criteria stating that you want a preferred endpoint type, if available,
when listing the flights.

Re: How to understand this comment

Posted by ming zhang <mi...@gmail.com>.
But then the comment is not right, since not all flights will be consumed?

"To consume the whole flight, all endpoints must be consumed."

If we introduce an itinerary concept, we have a complete story and mental
model. Something like:

message FlightGetInfo {
  // schema of the dataset as described in Schema.fbs::Schema.
  bytes schema = 1;

  /*
   * The descriptor associated with this info.
   */
  FlightDescriptor flight_descriptor = 2;

  /*
   * A list of endpoints associated with the flight. To consume the whole
   * flight, all endpoints must be consumed.
   */
  repeated FlightItinerary itinerary = 3;

  // Set these to -1 if unknown.
  int64 total_records = 4;
  int64 total_bytes = 5;
}

message FlightItinerary {
  /*
   * A list of endpoints associated with the itinerary. To consume the whole
   * itinerary, all endpoints must be consumed.
   */
  repeated FlightEndpoint endpoint = 1;
}



Re: How to understand this comment

Posted by Wes McKinney <we...@gmail.com>.
On Tue, Apr 2, 2019 at 10:08 PM ming zhang <mi...@gmail.com> wrote:
>
> in a case where there are multiple ways to retrieve this logical data set,
> how to represent this in the response?
>
> for example, assume there is a data set that has
> part 1 in endpoint 1 and part 2 in endpoint 2 with tcp as transport
> both part 1 and part 2 in endpoint 3, with infiniband as transport
>
> now how we return this back to client so client decide which one to consume?
>

One way to do this would be to have the Ticket payload contain a
serialized object (e.g. a protocol buffer) which contains this
additional metadata for the client to decide which endpoints to
access.

The Ticket returned in FlightEndpoint is opaque data so can contain
anything. In our performance test server the Ticket is a protobuf:

https://github.com/apache/arrow/blob/master/cpp/src/arrow/flight/perf-server.cc#L169
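
As a sketch of the idea only (plain Python, not the Flight API, and not how
the performance test server encodes its tickets): each ticket carries the
parts it covers and the transport it uses, and the client selects the
tickets for one transport after checking that they cover the whole dataset.
The JSON layout, the "parts" and "transport" fields and both function names
are hypothetical.

```python
import json
from typing import List, Set


def encode_ticket(parts: List[int], transport: str) -> bytes:
    """Pack hypothetical metadata into the opaque ticket bytes."""
    return json.dumps({"parts": parts, "transport": transport}).encode("utf-8")


def choose_tickets(tickets: List[bytes], transport: str,
                   all_parts: Set[int]) -> List[bytes]:
    """Pick the tickets for one transport; fail if they miss any part."""
    chosen: List[bytes] = []
    covered: Set[int] = set()
    for ticket in tickets:
        meta = json.loads(ticket.decode("utf-8"))
        if meta["transport"] == transport:
            chosen.append(ticket)
            covered.update(meta["parts"])
    if covered != all_parts:
        raise ValueError(f"{transport} endpoints do not cover the dataset")
    return chosen


# The example from this thread: two TCP endpoints, one InfiniBand endpoint.
tickets = [
    encode_ticket([1], "tcp"),
    encode_ticket([2], "tcp"),
    encode_ticket([1, 2], "infiniband"),
]
fast = choose_tickets(tickets, "infiniband", {1, 2})
slow = choose_tickets(tickets, "tcp", {1, 2})
```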

Re: How to understand this comment

Posted by ming zhang <mi...@gmail.com>.
In a case where there are multiple ways to retrieve this logical dataset,
how do we represent this in the response?

For example, assume there is a dataset that has:

- part 1 in endpoint 1 and part 2 in endpoint 2, with TCP as transport
- both part 1 and part 2 in endpoint 3, with InfiniBand as transport

Now, how do we return this to the client so that the client can decide which
one to consume?

The mental model I have of FlightService is similar to travelling by air.

- A travel agency lists many travel choices.
- For a trip from NYC to SFO, there are different itineraries, like NYC->SFO
directly, NYC->Chicago->SFO, and so on.
- For each itinerary, there could be one or more logical parts (hops).
- For each hop, there are different ways to travel, like business class,
economy class, etc.
- They all need a ticket to finish the trip.

Is this right, or way off?

Thanks
Ming


Re: How to understand this comment

Posted by Wes McKinney <we...@gmail.com>.
Hi,

A FlightGetInfo plan corresponds to a single logical dataset. The
dataset may be spread across multiple endpoints, so if you want the
whole dataset you have to execute DoGet against them all.

I'm not sure what you mean by "provide more than one execution plan".

- Wes
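
A minimal sketch of "execute DoGet against them all", written in plain
Python so it runs without a Flight server. The Endpoint class, the
read_whole_dataset helper and the do_get callable are stand-ins invented for
this example; a real client would call its Flight library's DoGet with each
endpoint's ticket instead.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Endpoint:
    """Hypothetical stand-in for FlightEndpoint."""
    ticket: bytes


def read_whole_dataset(endpoints: Iterable[Endpoint],
                       do_get: Callable[[bytes], List[int]]) -> List[int]:
    """Call do_get once per endpoint and concatenate the results."""
    rows: List[int] = []
    for endpoint in endpoints:
        rows.extend(do_get(endpoint.ticket))
    return rows


# Fake server state: each ticket maps to one slice of the dataset.
slices = {b"a": [1, 2], b"b": [3], b"c": [4, 5]}
endpoints = [Endpoint(ticket) for ticket in slices]
rows = read_whole_dataset(endpoints, slices.__getitem__)
```

Skipping any endpoint in this loop would leave a slice of the dataset
unread, which is the point the comment in Flight.proto is making.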
