You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@arrow.apache.org by Wes McKinney <we...@gmail.com> on 2021/08/03 00:16:50 UTC

[DISCUSS] Developing an "Arrow Compute IR [Intermediate Representation]" to decouple language front ends from Arrow-native compute engines

hi folks,

This idea came up in passing in the past -- given that there are
multiple independent efforts to develop Arrow-native query engines
(and surely many more to come), it seems like it would be valuable to
have a way to enable user languages (like Java, Python, R, or Rust,
for example) to communicate with backend computing engines (like
DataFusion, or new computing capabilities being built in the Arrow C++
library) in a fashion that is "lower-level" than SQL and specialized
to Arrow's type system. So rather than leaving it to a SQL parser /
analyzer framework to generate an expression tree of relational
operators and then translate that to an Arrow-native compute-engine's
internal grammer, a user framework could provide the desired
Arrow-native expression tree / data manipulations directly and skip
the SQL altogether.

The idea of creating a "serialized intermediate representation (IR)"
for Arrow compute operations would be to serve use cases large and
small -- from the most complex TPC-* or time series database query to
the most simple array predicate/filter sent with an RPC request using
Arrow Flight. It is deliberately language- and engine-agnostic, with
the only commonality across usages being the Arrow columnar format
(schemas and array types). This would be better than leaving it to
each application to develop its own bespoke expression representations
for its needs.

I spent a while thinking about this and wrote up a brain dump RFC
document [1] and accompanying pull request [2] that makes the possibly
controversial choice of using Flatbuffers to represent the serialized
IR. I discuss the rationale for the choice of Flatbuffers in the RFC
document. This PR is obviously deficient in many regards (incomplete,
hacky, or unclear in places), and will need some help from others to
flesh out. I suspect that actually implementing the IR will be
necessary to work out many of the low-level details.

Note that this IR is intended to be more of a "superset" project than
a "lowest common denominator". So there may be things that it includes
which are only available in some engines (e.g. engines that have
specialized handling of time series data).

As some of my personal motivation for the project, concurrent with the
genesis of Apache Arrow, I started a Python project called Ibis [3]
(which is similar to R's dplyr project) which serves as a "Python
analytical query IR builder" that is capable of generating most of the
SQL standard, targeting many different SQL dialects and other backends
(like pandas). Microsoft ([4]) and Google ([5]) have used this library
as a "many-SQL" middleware. As such, I would like to be able to
translate between the in-memory "logical query" data structures in a
library like Ibis to a serialized format that can be executed by many
different Arrow-native query engines. The expression primitives
available in Ibis should serve as helpful test cases, too.

I look forward to the community's comments on the RFC document [1] and
pull request [2] -- I realize that arriving at consensus on a complex
and ambitious project like this can be challenging so I recommend
spending time on the "non-goals" section in the RFC and ask questions
if you are unclear about the scope of what problems this is trying to
solve. I would be happy to give Edit access on the RFC document to
others and would consider ideas about how to move forward with
something that is able to be implemented by different Arrow libraries
in the reasonably near future.

Thanks,
Wes

[1]: https://docs.google.com/document/d/1C_XVOG7iFkl6cgWWMyzUoIjfKt-X2UxqagPJrla0bAE/edit#
[2]: https://github.com/apache/arrow/pull/10856
[3]: https://ibis-project.org/
[4]: http://cidrdb.org/cidr2021/papers/cidr2021_paper08.pdf
[5]: https://cloud.google.com/blog/products/databases/automate-data-validation-with-dvt

Re: [DISCUSS] Developing an "Arrow Compute IR [Intermediate Representation]" to decouple language front ends from Arrow-native compute engines

Posted by Arun Sharma <ar...@sharma-home.net>.

Thank you for putting together this proposal. Very exciting development. I
left some comments in the RFC doc, summarized here as:

* Flatbuffer is usable as a serialization agnostic IDL (
https://adsharma.github.io/flattools/)
* serde library + msgpack is a worthy candidate to consider for
serialization in conjunction with flatbuffers for IDL and for large arrays
involving zero-copy.

I'm not super familiar with the type of problems this RFC or Ibis are
trying to solve, but wanted to draw attention to queries involving deeply
nested results and user defined functions in high level languages.

* I'm also the author of fquery (https://github.com/adsharma/fquery), which
has some similarities to ibis, but differing internals.
** Supports nested/graph queries. If a flat table is a basic abstraction in
Ibis, a nested dict with async iterable values is the fundamental
abstraction in fquery.
** This abstraction can support multi-model databases/engines (graph,
database and relational)
** Much smaller core (~2k lines of python), but a small number of
operators. Adding new operators is simple.

On Fri, Aug 13, 2021 at 2:26 PM Phillip Cloud <cp...@gmail.com> wrote:

> Hey all,
>
> Just wanted to give an update on the effort here.
>
> Ben Kietzman has created an alternative proposal to the initial design [1].
> It largely overlaps with the original, but differs in a few important ways:
>
> * A big focus of the design is on flexibility, allowing producers,
> consumers and ultimately end users of those systems the ability to define
> custom operations in the graph.
> * There are very few predefined relational operations (project, filter,
> join and a handful of others).
> * There are only 3 types of value expressions: literals, field references,
> and function calls.
> * The model of evaluation is one that requires a final sink node, to
> indicate where the record batches from child relations end up (a file, a
> socket, an in-memory buffer, etc).
>
> I've added notes [2] to the original Google doc (under the Alternative
> Design Notes subheading), and a few pseudocode examples.
> Unfortunately, these went out of date as soon as Ben pushed the PR [3], so
> I need to update those to reflect his changes. Regardless,
> the design is broadly the same, so it should still give a good indication
> of the details of the design.
>
> There are a decent number of review comments on the original PR that I plan
> to port over where they are still relevant.
> I also plan on adding support for window functions either tonight or on
> Monday.
>
> Please review this design at your earliest convenience. Since there's a
> fairly concrete set of types in flatbuffers that
> we can look at, ideally we can center discussion around the details in the
> PR.
>
> Thanks!
>
> [1]: https://github.com/apache/arrow/pull/10856
> [2]:
>
> https://docs.google.com/document/d/1C_XVOG7iFkl6cgWWMyzUoIjfKt-X2UxqagPJrla0bAE/edit#heading=h.4tfbbtaqzu13
> [3]: https://github.com/apache/arrow/pull/10934
>
> On Thu, Aug 12, 2021 at 3:55 PM Julian Hyde <jh...@gmail.com>
> wrote:
>
> > > Wes wrote:
> > >
> > > Supporting this kind of intra-application engine
> > > heterogeneity is one of the motivations for the project.
> >
> > +1
> >
> > The data format is the natural interface between tasks. (Defining “task”
> > here as “something that is programmed using the IR”.) That is Arrow’s
> > strength.
> >
> > So I think the IR should describe what each task should do, and tasks
> > should be fairly small. Not whole relational operators, operating on
> whole
> > tables, but pieces of relational operators, operating on batches or
> > sequences of batches.
> >
> > Elsethread, someone mentioned the LoLePop concept and the
> Kohn/Leis/Neuman
> > paper [1]. The LoLePop concept sounds good for our purposes.
> >
> > Julian
> >
> > [1] https://db.in.tum.de/~kohn/papers/lolepops-sigmod21.pdf
> >
> >
> > > On Aug 12, 2021, at 5:19 AM, Wes McKinney <we...@gmail.com> wrote:
> > >
> > > On Wed, Aug 11, 2021 at 11:22 PM Phillip Cloud <cp...@gmail.com>
> > wrote:
> > >>
> > >> On Wed, Aug 11, 2021 at 4:48 PM Jorge Cardoso Leitão <
> > >> jorgecarleitao@gmail.com> wrote:
> > >>
> > >>> Couple of questions
> > >>>
> > >>> 1. Is the goal that IRs have equal semantics, i.e. given (IR,data),
> the
> > >>> operation "(IR,data) - engine -> result" MUST be the same for all
> > "engine"?
> > >>>
> > >>
> > >> I think that might be a non-starter for mundane reasons: there's
> > probably
> > >> at least two engines
> > >> that disagree on the result of sum(x) where x is a floating point
> > column.
> > >>
> > >>
> > >>> 2. if yes, imo we may need to worry about:
> > >>> * a definition of equality that implementations agree on.
> > >>> * agreement over what the semantics look like. For example, do we use
> > >>> kleene logic for AND and OR?
> > >>>
> > >>
> > >> WRT Kleene logic my thoughts are that the IR should support both
> Kleene
> > and
> > >> non-Kleene,
> > >> and producers can choose their desired semantics.
> > >>
> > >> Ibis for example, would override the `&` operator in `a & b` to
> produce
> > >> `KleeneAnd(Column(a), Column(b))`.
> > >>
> > >>
> > >>>
> > >>> To try to understand the gist, let's pick an aggregated count over a
> > >>> column: engines often do partial counts over partitions followed by a
> > final
> > >>> "sum" over the partial counts. Is the idea that the query engine
> would
> > >>> communicate with the compute engine via 2 IRs where one is "count me
> > these"
> > >>> the other is "sum me these"?
> > >>>
> > >>> Best,
> > >>> Jorge
> > >>>
> > >>
> > >> Not in its current incarnation.
> > >>
> > >> The idea is that the IR producer communicates a desire to count(x) to
> a
> > >> consumer, and  it's up to the consumer to figure out how to turn that
> > count
> > >> into something that makes sense for itself. In your example that's a
> > series
> > >> of partial counts followed by a sum.
> > >>
> > >
> > > That said, I think there is a valid use case here where a system might
> > > make use of different engines to execute different (composable) layers
> > > of a complex query.
> > >
> > > For example:
> > >
> > > * suppose you want to scan and do predicate pushdown on an unusual
> > > data source that is only accessible from one particular engine but
> > > * you need to do some analytical operation with the scan results that
> > > is only supported by another engine
> > >
> > > You could decompose the query into two stages with an IR relational
> > > expression for each stage and use then the engines together to
> > > accomplish what you need (of course, you would need an orchestration
> > > layer to handle plumbing the query engine inputs and outputs together
> > > as Arrow streams). Supporting this kind of intra-application engine
> > > heterogeneity is one of the motivations for the project.
> > >
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>> On Wed, Aug 11, 2021 at 6:10 PM Phillip Cloud <cp...@gmail.com>
> > wrote:
> > >>>
> > >>>> Thanks Wes,
> > >>>>
> > >>>> Great to be back working on Arrow again and engaging with the
> > community.
> > >>> I
> > >>>> am really excited about this effort.
> > >>>>
> > >>>> I think there are a number of concerns I see as important to address
> > in
> > >>> the
> > >>>> compute IR proposal:
> > >>>>
> > >>>> 1. Requirement for output types.
> > >>>>
> > >>>> I think that so far there's been many reasons for requiring
> > conforming IR
> > >>>> producers and consumers to adhere to output types, but I haven't
> seen
> > a
> > >>>> strong rationale for keeping them optional (in the semantic sense,
> not
> > >>> WRT
> > >>>> any particular serialization format's representation of
> optionality).
> > >>>>
> > >>>> I think a design that includes unambiguous semantics for output
> types
> > (a
> > >>>> consumer must produce a value of the requested type or it's an
> > >>>> error/non-conforming) is simpler to reason about for producers, and
> > >>>> provides a strong guarantee for end users (humans or machines
> > >>> constructing
> > >>>> IR from an API and expecting the thing they ask for back from the IR
> > >>>> consumer).
> > >>>>
> > >>>> 2. Flexibility
> > >>>>
> > >>>> The current PR is currently unable to support what I think are
> killer
> > >>>> features of the IR: custom operators (relational or column) and
> UDFs.
> > In
> > >>> my
> > >>>> mind, on top of the generalized compute description that the IR
> > offers,
> > >>> the
> > >>>> ability for producers and consumers of IR to extend the IR without
> > >>> needing
> > >>>> to modify Arrow or depend on anything except the format is itself
> > >>> something
> > >>>> that is necessary to gain adoption.
> > >>>>
> > >>>> Developers will need to build custom relational operators (e.g.,
> > scans of
> > >>>> backends that don't exist anywhere for which a user has code to
> > >>> implement)
> > >>>> and custom functions (anything operating on a column that doesn't
> > already
> > >>>> exist, really). Furthermore, I think we can actually drive building
> an
> > >>>> Arrow consumer using the same API that an end user would use to
> extend
> > >>> the
> > >>>> IR.
> > >>>>
> > >>>> 3. Window Functions
> > >>>>
> > >>>> Window functions are, I think, an important part of the IR value
> > >>>> proposition, as they are one of the more complex operators in
> > databases.
> > >>> I
> > >>>> think we need to have something in the initial IR proposal to
> support
> > >>> these
> > >>>> operations.
> > >>>>
> > >>>> 4. Non relational Joins
> > >>>>
> > >>>> Things like as-of join and window join operators aren't yet fleshed
> > out
> > >>> in
> > >>>> the IR, and I'm not sure they should be in scope for the initial
> > >>> prototype.
> > >>>> I think once we settle on a design, we can work the design of these
> > >>>> particular operators out during the initial prototype. I think the
> > >>>> specification of these operators should basically be PR #2 after the
> > >>>> initial design lands.
> > >>>>
> > >>>> # Order of Work
> > >>>>
> > >>>> 1. Nail down the design. Anything else is a non-starter.
> > >>>>
> > >>>> 2. Prototype an IR producer using Ibis
> > >>>>
> > >>>> Ibis is IMO a good candidate for a first IR producer as it has a
> > number
> > >>> of
> > >>>> desirable properties that make prototyping faster and allow for us
> to
> > >>>> refine the design of the IR as needed based on how the
> implementation
> > >>> goes:
> > >>>> * It's written in Python so it has native support for nearly all of
> > >>>> flatbuffers' features without having to creating bindings to C++.
> > >>>> * There's already a set of rules for type checking, as well as APIs
> > for
> > >>>> constructing expression trees, which means we don't need to worry
> > about
> > >>>> building a type checker for the prototype.
> > >>>>
> > >>>> 3. Prototype an IR consumer in C++
> > >>>>
> > >>>> I think in parallel to the producer prototype we can further inform
> > the
> > >>>> design from the consumer side by prototyping an IR consumer in C++
> . I
> > >>> know
> > >>>> Ben Kietzman has expressed interest in working on this.
> > >>>>
> > >>>> Very interested to hear others' thoughts.
> > >>>>
> > >>>> -Phillip
> > >>>>
> > >>>> On Tue, Aug 10, 2021 at 10:56 AM Wes McKinney <we...@gmail.com>
> > >>> wrote:
> > >>>>
> > >>>>> Thank you for all the feedback and comments on the document. I'm on
> > >>>>> vacation this week, so I'm delayed responding to everything, but I
> > >>>>> will get to it as quickly as I can. I will be at VLDB in Copenhagen
> > >>>>> next week if anyone would like to chat in person about it, and we
> can
> > >>>>> relay the content of any discussions back to the document/PR/e-mail
> > >>>>> thread.
> > >>>>>
> > >>>>> I know that Phillip Cloud expressed interest in working on the PR
> and
> > >>>>> helping work through many of the details, so I'm glad to have the
> > >>>>> help. If there are others who would like to work on the PR or dig
> > into
> > >>>>> the details, please let me know. We might need to figure out how to
> > >>>>> accommodate "many cooks" by setting up the ComputeIR project
> > somewhere
> > >>>>> separate from the format/ directory to permit it to exist in a
> > >>>>> Work-In-Progress status for a period of time until we work through
> > the
> > >>>>> various details and design concerns.
> > >>>>>
> > >>>>> Re Julian's comment
> > >>>>>
> > >>>>>> The biggest surprise is that this language does full relational
> > >>>>> operations. I was expecting that it would do fragments of the
> > >>> operations.
> > >>>>>
> > >>>>> There's a related but different (yet still interesting and worthy
> of
> > >>>>> analysis) problem of creating an "engine language" that describes
> > more
> > >>>>> mechanically the constituent parts of implementing the relational
> > >>>>> operators. To create a functional computation language with
> concrete
> > >>>>> Arrow data structures as a top-level primitive sounds like an
> > >>>>> interesting research area where I could see something developing
> > >>>>> eventually.
> > >>>>>
> > >>>>> The main problem I'm interested in solving right now is enabling
> > front
> > >>>>> ends that have sufficient understanding of relational algebra and
> > data
> > >>>>> frame operations to talk to engines without having to go backwards
> > >>>>> from their logical query plans to SQL. So as mentioned in the
> > >>>>> document, being able to faithfully carry the relational operator
> node
> > >>>>> information generated by Calcite or Ibis or another system would be
> > >>>>> super useful. Defining the semantics of various kinds of
> user-defined
> > >>>>> functions would also be helpful to standardize the
> > >>>>> engine-to-user-language UDF/extension interface.
> > >>>>>
> > >>>>> On Tue, Aug 10, 2021 at 2:36 PM Dimitri Vorona <al...@gmail.com>
> > >>>> wrote:
> > >>>>>>
> > >>>>>> Hi Wes,
> > >>>>>>
> > >>>>>> cool initiative! Reminded me of "Building Advanced SQL Analytics
> > From
> > >>>>>> Low-Level Plan Operators" from SIGMOD 2021 (
> > >>>>>> http://db.in.tum.de/~kohn/papers/lolepops-sigmod21.pdf) which
> > >>>> proposes a
> > >>>>>> set of building block for advanced aggregation.
> > >>>>>>
> > >>>>>> Cheers,
> > >>>>>> Dimitri.
> > >>>>>>
> > >>>>>> On Thu, Aug 5, 2021 at 7:59 PM Julian Hyde <
> jhyde.apache@gmail.com>
> > >>>>> wrote:
> > >>>>>>
> > >>>>>>> Wes,
> > >>>>>>>
> > >>>>>>> Thanks for this. I’ve added comments to the doc and to the PR.
> > >>>>>>>
> > >>>>>>> The biggest surprise is that this language does full relational
> > >>>>>>> operations. I was expecting that it would do fragments of the
> > >>>>> operations.
> > >>>>>>> Consider join. A distributed hybrid hash join needs to partition
> > >>> rows
> > >>>>> into
> > >>>>>>> output buffers based on a hash key, build hash tables, probe into
> > >>>> hash
> > >>>>>>> tables, scan hash tables for untouched “outer”rows, and so forth.
> > >>>>>>>
> > >>>>>>> I see Arrow’s compute as delivering each of those operations,
> > >>> working
> > >>>>> on
> > >>>>>>> perhaps a batch at a time, or a sequence of batches, with some
> > >>> other
> > >>>>> system
> > >>>>>>> coordinating those tasks. So I would expect to see Arrow’s
> compute
> > >>>>> language
> > >>>>>>> mainly operating on batches rather than a table abstraction.
> > >>>>>>>
> > >>>>>>> Julian
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>> On Aug 2, 2021, at 5:16 PM, Wes McKinney <we...@gmail.com>
> > >>>>> wrote:
> > >>>>>>>>
> > >>>>>>>> hi folks,
> > >>>>>>>>
> > >>>>>>>> This idea came up in passing in the past -- given that there are
> > >>>>>>>> multiple independent efforts to develop Arrow-native query
> > >>> engines
> > >>>>>>>> (and surely many more to come), it seems like it would be
> > >>> valuable
> > >>>> to
> > >>>>>>>> have a way to enable user languages (like Java, Python, R, or
> > >>> Rust,
> > >>>>>>>> for example) to communicate with backend computing engines (like
> > >>>>>>>> DataFusion, or new computing capabilities being built in the
> > >>> Arrow
> > >>>>> C++
> > >>>>>>>> library) in a fashion that is "lower-level" than SQL and
> > >>>> specialized
> > >>>>>>>> to Arrow's type system. So rather than leaving it to a SQL
> > >>> parser /
> > >>>>>>>> analyzer framework to generate an expression tree of relational
> > >>>>>>>> operators and then translate that to an Arrow-native
> > >>>> compute-engine's
> > >>>>>>>> internal grammer, a user framework could provide the desired
> > >>>>>>>> Arrow-native expression tree / data manipulations directly and
> > >>> skip
> > >>>>>>>> the SQL altogether.
> > >>>>>>>>
> > >>>>>>>> The idea of creating a "serialized intermediate representation
> > >>>> (IR)"
> > >>>>>>>> for Arrow compute operations would be to serve use cases large
> > >>> and
> > >>>>>>>> small -- from the most complex TPC-* or time series database
> > >>> query
> > >>>> to
> > >>>>>>>> the most simple array predicate/filter sent with an RPC request
> > >>>> using
> > >>>>>>>> Arrow Flight. It is deliberately language- and engine-agnostic,
> > >>>> with
> > >>>>>>>> the only commonality across usages being the Arrow columnar
> > >>> format
> > >>>>>>>> (schemas and array types). This would be better than leaving it
> > >>> to
> > >>>>>>>> each application to develop its own bespoke expression
> > >>>>> representations
> > >>>>>>>> for its needs.
> > >>>>>>>>
> > >>>>>>>> I spent a while thinking about this and wrote up a brain dump
> RFC
> > >>>>>>>> document [1] and accompanying pull request [2] that makes the
> > >>>>> possibly
> > >>>>>>>> controversial choice of using Flatbuffers to represent the
> > >>>> serialized
> > >>>>>>>> IR. I discuss the rationale for the choice of Flatbuffers in the
> > >>>> RFC
> > >>>>>>>> document. This PR is obviously deficient in many regards
> > >>>> (incomplete,
> > >>>>>>>> hacky, or unclear in places), and will need some help from
> others
> > >>>> to
> > >>>>>>>> flesh out. I suspect that actually implementing the IR will be
> > >>>>>>>> necessary to work out many of the low-level details.
> > >>>>>>>>
> > >>>>>>>> Note that this IR is intended to be more of a "superset" project
> > >>>> than
> > >>>>>>>> a "lowest common denominator". So there may be things that it
> > >>>>> includes
> > >>>>>>>> which are only available in some engines (e.g. engines that have
> > >>>>>>>> specialized handling of time series data).
> > >>>>>>>>
> > >>>>>>>> As some of my personal motivation for the project, concurrent
> > >>> with
> > >>>>> the
> > >>>>>>>> genesis of Apache Arrow, I started a Python project called Ibis
> > >>> [3]
> > >>>>>>>> (which is similar to R's dplyr project) which serves as a
> "Python
> > >>>>>>>> analytical query IR builder" that is capable of generating most
> > >>> of
> > >>>>> the
> > >>>>>>>> SQL standard, targeting many different SQL dialects and other
> > >>>>> backends
> > >>>>>>>> (like pandas). Microsoft ([4]) and Google ([5]) have used this
> > >>>>> library
> > >>>>>>>> as a "many-SQL" middleware. As such, I would like to be able to
> > >>>>>>>> translate between the in-memory "logical query" data structures
> > >>> in
> > >>>> a
> > >>>>>>>> library like Ibis to a serialized format that can be executed by
> > >>>> many
> > >>>>>>>> different Arrow-native query engines. The expression primitives
> > >>>>>>>> available in Ibis should serve as helpful test cases, too.
> > >>>>>>>>
> > >>>>>>>> I look forward to the community's comments on the RFC document
> > >>> [1]
> > >>>>> and
> > >>>>>>>> pull request [2] -- I realize that arriving at consensus on a
> > >>>> complex
> > >>>>>>>> and ambitious project like this can be challenging so I
> recommend
> > >>>>>>>> spending time on the "non-goals" section in the RFC and ask
> > >>>> questions
> > >>>>>>>> if you are unclear about the scope of what problems this is
> > >>> trying
> > >>>> to
> > >>>>>>>> solve. I would be happy to give Edit access on the RFC document
> > >>> to
> > >>>>>>>> others and would consider ideas about how to move forward with
> > >>>>>>>> something that is able to be implemented by different Arrow
> > >>>> libraries
> > >>>>>>>> in the reasonably near future.
> > >>>>>>>>
> > >>>>>>>> Thanks,
> > >>>>>>>> Wes
> > >>>>>>>>
> > >>>>>>>> [1]:
> > >>>>>>>
> > >>>>>
> > >>>>
> > >>>
> >
> https://docs.google.com/document/d/1C_XVOG7iFkl6cgWWMyzUoIjfKt-X2UxqagPJrla0bAE/edit#
> > >>>>>>>> [2]: https://github.com/apache/arrow/pull/10856
> > >>>>>>>> [3]: https://ibis-project.org/
> > >>>>>>>> [4]: http://cidrdb.org/cidr2021/papers/cidr2021_paper08.pdf
> > >>>>>>>> [5]:
> > >>>>>>>
> > >>>>>
> > >>>>
> > >>>
> >
> https://cloud.google.com/blog/products/databases/automate-data-validation-with-dvt
> > >>>>>>>
> > >>>>>>>
> > >>>>>
> > >>>>
> > >>>
> >
> >
>

Re: [DISCUSS] Developing an "Arrow Compute IR [Intermediate Representation]" to decouple language front ends from Arrow-native compute engines

Posted by Benjamin Kietzman <be...@gmail.com>.

WRT out-of-band data: if encapsulation is the priority over reuse of
Buffer etc that's straightforward to accommodate by replacement
with an alternative to Buffer. I have made that change to my PR in
https://github.com/apache/arrow/pull/10934/commits/ebd4fc665579dd6bba29c5c4731c2350ea0fa70a

> as much built-in support for "canonical" operators/functions (...) as
possible

IMHO as much as possible is not very much; every consumer is going to
have to write extensive validation passes anyway to assert contracts
inexpressible in a flatbuffer schema, for example whether a FieldRef
points to a field which is not present in a Schema. Special casing
for joins can at most provide validation of their arity and
slightly more idiomatic generated accessors for that relation's inputs.

To cite MLIR yet again, it is easier to write the sophisticated pattern
matching required for optimization and lowering passes when the objects
being manipulated are as consistently structured as possible.

In light of these points I'd disagree that special-casing Relations and
Expressions is preferable.

On Tue, Aug 17, 2021 at 10:56 AM Wes McKinney <we...@gmail.com> wrote:

> Looking at Ben's alternate PR [1], having an IR that leans heavily on
> memory references to an out-of-band data sidecar seems like an
> approach that would substantially ratchet up the implementation
> complexity as producing the IR would then have the level of complexity
> of producing the Arrow IPC format — when producing the "root" Plan
> message, you must accumulate a list of the dependent serialized
> submessages, placing the appropriate Buffer memory offset in the Plan
> message, like we do when producing the RecordBatch.buffers field. This
> seems complicated to me as you must devise a custom binary protocol to
> concatenate the serialized Plan and the messages it depends on into a
> single binary payload
>
> <ROOT PLAN>
> <padding>
> <Buffer 0>
> <padding>
> <Buffer 1>
> <padding>
> ...
> <Buffer N - 1>
>
> (one purpose of FlatBufferBuilder is to spare you having to do this
> yourself — some reasons we do it for the Arrow IPC format is because
> appending Arrow memory buffers directly to a FlatBufferBuilder would
> be inefficient — internal realloc calls — and Flatbuffers are limited
> to 2GB. Neither of these things are problems here)
>
> In general, I believe the output created by an IR producer should be a
> single serialized object without any out-of-band data sidecar — this
> is much simpler for implementers and we can still provide an "escape
> hatch" for user-defined operators and functions where the required
> function/operator is passed opaquely as an embedded binary data.
>
> The serialization format (whether it is Flatbuffers or JSON, or
> something else) should allow for data memoization, so if there is a
> user-defined operator/function, or a relation that is used multiple
> times throughout the query (potentially with a large schema), then we
> should ensure that the data need not be duplicated in the
> serialization format unnecessarily — in Flatbuffers, you can achieve
> this by reusing offsets, but we could devise the data structures to
> make the memoization of "expensive" objects more explicit.
>
> I additionally think that it is important to provide as much built-in
> support for "canonical" operators/functions (such as the ones
> implemented commonly by SQL engines) as possible, and to liberally
> expand the catalogue of "built-in" capabilities. I would still prefer
> to have large unions/enums of built-in operators/functions and to
> expand those unions/enums to accommodate new things when it is
> demonstrated that there is a need to standardize things between
> producers/consumers.
>
> One of the beneficial properties of the Union/Enum approach for the
> operator/function catalogues, is that when there are additions to
> those enums, the generated Flatbuffers files will cause many language
> compilers to warn or error on unhandled enum cases. If all
> function/operator names are strings, then you are essentially
> reimplementing the functionality provided by enums by hand. I
> initially used strings for all function references in my original
> prototype, but I now think that using an enum for "built-ins" would be
> superior (because of the code-generated enum interface) and not a
> premature optimization.
>
> [1]: https://github.com/apache/arrow/pull/10934
>
> On Fri, Aug 13, 2021 at 11:26 PM Phillip Cloud <cp...@gmail.com> wrote:
> >
> > Hey all,
> >
> > Just wanted to give an update on the effort here.
> >
> > Ben Kietzman has created an alternative proposal to the initial design
> [1].
> > It largely overlaps with the original, but differs in a few important
> ways:
> >
> > * A big focus of the design is on flexibility, allowing producers,
> > consumers and ultimately end users of those systems the ability to define
> > custom operations in the graph.
> > * There are very few predefined relational operations (project, filter,
> > join and a handful of others).
> > * There are only 3 types of value expressions: literals, field
> references,
> > and function calls.
> > * The model of evaluation is one that requires a final sink node, to
> > indicate where the record batches from child relations end up (a file, a
> > socket, an in-memory buffer, etc).
> >
> > I've added notes [2] to the original Google doc (under the Alternative
> > Design Notes subheading), and a few pseudocode examples.
> > Unfortunately, these went out of date as soon as Ben pushed the PR [3],
> so
> > I need to update those to reflect his changes. Regardless,
> > the design is broadly the same, so it should still give a good indication
> > of the details of the design.
> >
> > There are a decent number of review comments on the original PR that I
> plan
> > to port over where they are still relevant.
> > I also plan on adding support for window functions either tonight or on
> > Monday.
> >
> > Please review this design at your earliest convenience. Since there's a
> > fairly concrete set of types in flatbuffers that
> > we can look at, ideally we can center discussion around the details in
> the
> > PR.
> >
> > Thanks!
> >
> > [1]: https://github.com/apache/arrow/pull/10856
> > [2]:
> >
> https://docs.google.com/document/d/1C_XVOG7iFkl6cgWWMyzUoIjfKt-X2UxqagPJrla0bAE/edit#heading=h.4tfbbtaqzu13
> > [3]: https://github.com/apache/arrow/pull/10934
> >
> > On Thu, Aug 12, 2021 at 3:55 PM Julian Hyde <jh...@gmail.com>
> wrote:
> >
> > > > Wes wrote:
> > > >
> > > > Supporting this kind of intra-application engine
> > > > heterogeneity is one of the motivations for the project.
> > >
> > > +1
> > >
> > > The data format is the natural interface between tasks. (Defining
> “task”
> > > here as “something that is programmed using the IR”.) That is Arrow’s
> > > strength.
> > >
> > > So I think the IR should describe what each task should do, and tasks
> > > should be fairly small. Not whole relational operators, operating on
> whole
> > > tables, but pieces of relational operators, operating on batches or
> > > sequences of batches.
> > >
> > > Elsethread, someone mentioned the LoLePop concept and the
> Kohn/Leis/Neuman
> > > paper [1]. The LoLePop concept sounds good for our purposes.
> > >
> > > Julian
> > >
> > > [1] https://db.in.tum.de/~kohn/papers/lolepops-sigmod21.pdf
> > >
> > >
> > > > On Aug 12, 2021, at 5:19 AM, Wes McKinney <we...@gmail.com>
> wrote:
> > > >
> > > > On Wed, Aug 11, 2021 at 11:22 PM Phillip Cloud <cp...@gmail.com>
> > > wrote:
> > > >>
> > > >> On Wed, Aug 11, 2021 at 4:48 PM Jorge Cardoso Leitão <
> > > >> jorgecarleitao@gmail.com> wrote:
> > > >>
> > > >>> Couple of questions
> > > >>>
> > > >>> 1. Is the goal that IRs have equal semantics, i.e. given
> (IR,data), the
> > > >>> operation "(IR,data) - engine -> result" MUST be the same for all
> > > "engine"?
> > > >>>
> > > >>
> > > >> I think that might be a non-starter for mundane reasons: there's
> > > probably
> > > >> at least two engines
> > > >> that disagree on the result of sum(x) where x is a floating point
> > > column.
> > > >>
> > > >>
> > > >>> 2. if yes, imo we may need to worry about:
> > > >>> * a definition of equality that implementations agree on.
> > > >>> * agreement over what the semantics look like. For example, do we
> use
> > > >>> kleene logic for AND and OR?
> > > >>>
> > > >>
> > > >> WRT Kleene logic my thoughts are that the IR should support both
> Kleene
> > > and
> > > >> non-Kleene,
> > > >> and producers can choose their desired semantics.
> > > >>
> > > >> Ibis for example, would override the `&` operator in `a & b` to
> produce
> > > >> `KleeneAnd(Column(a), Column(b))`.
> > > >>
> > > >>
> > > >>>
> > > >>> To try to understand the gist, let's pick an aggregated count over
> a
> > > >>> column: engines often do partial counts over partitions followed
> by a
> > > final
> > > >>> "sum" over the partial counts. Is the idea that the query engine
> would
> > > >>> communicate with the compute engine via 2 IRs where one is "count
> me
> > > these"
> > > >>> the other is "sum me these"?
> > > >>>
> > > >>> Best,
> > > >>> Jorge
> > > >>>
> > > >>
> > > >> Not in its current incarnation.
> > > >>
> > > >> The idea is that the IR producer communicates a desire to count(x)
> to a
> > > >> consumer, and  it's up to the consumer to figure out how to turn
> that
> > > count
> > > >> into something that makes sense for itself. In your example that's a
> > > series
> > > >> of partial counts followed by a sum.
> > > >>
> > > >
> > > > That said, I think there is a valid use case here where a system
> might
> > > > make use of different engines to execute different (composable)
> layers
> > > > of a complex query.
> > > >
> > > > For example:
> > > >
> > > > * suppose you want to scan and do predicate pushdown on an unusual
> > > > data source that is only accessible from one particular engine but
> > > > * you need to do some analytical operation with the scan results that
> > > > is only supported by another engine
> > > >
> > > > You could decompose the query into two stages with an IR relational
> > > > expression for each stage and use then the engines together to
> > > > accomplish what you need (of course, you would need an orchestration
> > > > layer to handle plumbing the query engine inputs and outputs together
> > > > as Arrow streams). Supporting this kind of intra-application engine
> > > > heterogeneity is one of the motivations for the project.
> > > >
> > > >>>
> > > >>>
> > > >>>
> > > >>>
> > > >>>
> > > >>> On Wed, Aug 11, 2021 at 6:10 PM Phillip Cloud <cp...@gmail.com>
> > > wrote:
> > > >>>
> > > >>>> Thanks Wes,
> > > >>>>
> > > >>>> Great to be back working on Arrow again and engaging with the
> > > community.
> > > >>> I
> > > >>>> am really excited about this effort.
> > > >>>>
> > > >>>> I think there are a number of concerns I see as important to
> address
> > > in
> > > >>> the
> > > >>>> compute IR proposal:
> > > >>>>
> > > >>>> 1. Requirement for output types.
> > > >>>>
> > > >>>> I think that so far there's been many reasons for requiring
> > > conforming IR
> > > >>>> producers and consumers to adhere to output types, but I haven't
> seen
> > > a
> > > >>>> strong rationale for keeping them optional (in the semantic
> sense, not
> > > >>> WRT
> > > >>>> any particular serialization format's representation of
> optionality).
> > > >>>>
> > > >>>> I think a design that includes unambiguous semantics for output
> types
> > > (a
> > > >>>> consumer must produce a value of the requested type or it's an
> > > >>>> error/non-conforming) is simpler to reason about for producers,
> and
> > > >>>> provides a strong guarantee for end users (humans or machines
> > > >>> constructing
> > > >>>> IR from an API and expecting the thing they ask for back from the
> IR
> > > >>>> consumer).
> > > >>>>
> > > >>>> 2. Flexibility
> > > >>>>
> > > >>>> The current PR is currently unable to support what I think are
> killer
> > > >>>> features of the IR: custom operators (relational or column) and
> UDFs.
> > > In
> > > >>> my
> > > >>>> mind, on top of the generalized compute description that the IR
> > > offers,
> > > >>> the
> > > >>>> ability for producers and consumers of IR to extend the IR without
> > > >>> needing
> > > >>>> to modify Arrow or depend on anything except the format is itself
> > > >>> something
> > > >>>> that is necessary to gain adoption.
> > > >>>>
> > > >>>> Developers will need to build custom relational operators (e.g.,
> > > scans of
> > > >>>> backends that don't exist anywhere for which a user has code to
> > > >>> implement)
> > > >>>> and custom functions (anything operating on a column that doesn't
> > > already
> > > >>>> exist, really). Furthermore, I think we can actually drive
> building an
> > > >>>> Arrow consumer using the same API that an end user would use to
> extend
> > > >>> the
> > > >>>> IR.
> > > >>>>
> > > >>>> 3. Window Functions
> > > >>>>
> > > >>>> Window functions are, I think, an important part of the IR value
> > > >>>> proposition, as they are one of the more complex operators in
> > > databases.
> > > >>> I
> > > >>>> think we need to have something in the initial IR proposal to
> support
> > > >>> these
> > > >>>> operations.
> > > >>>>
> > > >>>> 4. Non relational Joins
> > > >>>>
> > > >>>> Things like as-of join and window join operators aren't yet
> fleshed
> > > out
> > > >>> in
> > > >>>> the IR, and I'm not sure they should be in scope for the initial
> > > >>> prototype.
> > > >>>> I think once we settle on a design, we can work the design of
> these
> > > >>>> particular operators out during the initial prototype. I think the
> > > >>>> specification of these operators should basically be PR #2 after
> the
> > > >>>> initial design lands.
> > > >>>>
> > > >>>> # Order of Work
> > > >>>>
> > > >>>> 1. Nail down the design. Anything else is a non-starter.
> > > >>>>
> > > >>>> 2. Prototype an IR producer using Ibis
> > > >>>>
> > > >>>> Ibis is IMO a good candidate for a first IR producer as it has a
> > > number
> > > >>> of
> > > >>>> desirable properties that make prototyping faster and allow for
> us to
> > > >>>> refine the design of the IR as needed based on how the
> implementation
> > > >>> goes:
> > > >>>> * It's written in Python so it has native support for nearly all
> of
> > > >>>> flatbuffers' features without having to creating bindings to C++.
> > > >>>> * There's already a set of rules for type checking, as well as
> APIs
> > > for
> > > >>>> constructing expression trees, which means we don't need to worry
> > > about
> > > >>>> building a type checker for the prototype.
> > > >>>>
> > > >>>> 3. Prototype an IR consumer in C++
> > > >>>>
> > > >>>> I think in parallel to the producer prototype we can further
> inform
> > > the
> > > >>>> design from the consumer side by prototyping an IR consumer in
> C++ . I
> > > >>> know
> > > >>>> Ben Kietzman has expressed interest in working on this.
> > > >>>>
> > > >>>> Very interested to hear others' thoughts.
> > > >>>>
> > > >>>> -Phillip
> > > >>>>
> > > >>>> On Tue, Aug 10, 2021 at 10:56 AM Wes McKinney <
> wesmckinn@gmail.com>
> > > >>> wrote:
> > > >>>>
> > > >>>>> Thank you for all the feedback and comments on the document. I'm
> on
> > > >>>>> vacation this week, so I'm delayed responding to everything, but
> I
> > > >>>>> will get to it as quickly as I can. I will be at VLDB in
> Copenhagen
> > > >>>>> next week if anyone would like to chat in person about it, and
> we can
> > > >>>>> relay the content of any discussions back to the
> document/PR/e-mail
> > > >>>>> thread.
> > > >>>>>
> > > >>>>> I know that Phillip Cloud expressed interest in working on the
> PR and
> > > >>>>> helping work through many of the details, so I'm glad to have the
> > > >>>>> help. If there are others who would like to work on the PR or dig
> > > into
> > > >>>>> the details, please let me know. We might need to figure out how
> to
> > > >>>>> accommodate "many cooks" by setting up the ComputeIR project
> > > somewhere
> > > >>>>> separate from the format/ directory to permit it to exist in a
> > > >>>>> Work-In-Progress status for a period of time until we work
> through
> > > the
> > > >>>>> various details and design concerns.
> > > >>>>>
> > > >>>>> Re Julian's comment
> > > >>>>>
> > > >>>>>> The biggest surprise is that this language does full relational
> > > >>>>> operations. I was expecting that it would do fragments of the
> > > >>> operations.
> > > >>>>>
> > > >>>>> There's a related but different (yet still interesting and
> worthy of
> > > >>>>> analysis) problem of creating an "engine language" that describes
> > > more
> > > >>>>> mechanically the constituent parts of implementing the relational
> > > >>>>> operators. To create a functional computation language with
> concrete
> > > >>>>> Arrow data structures as a top-level primitive sounds like an
> > > >>>>> interesting research area where I could see something developing
> > > >>>>> eventually.
> > > >>>>>
> > > >>>>> The main problem I'm interested in solving right now is enabling
> > > front
> > > >>>>> ends that have sufficient understanding of relational algebra and
> > > data
> > > >>>>> frame operations to talk to engines without having to go
> backwards
> > > >>>>> from their logical query plans to SQL. So as mentioned in the
> > > >>>>> document, being able to faithfully carry the relational operator
> node
> > > >>>>> information generated by Calcite or Ibis or another system would
> be
> > > >>>>> super useful. Defining the semantics of various kinds of
> user-defined
> > > >>>>> functions would also be helpful to standardize the
> > > >>>>> engine-to-user-language UDF/extension interface.
> > > >>>>>
> > > >>>>> On Tue, Aug 10, 2021 at 2:36 PM Dimitri Vorona <
> alendit@gmail.com>
> > > >>>> wrote:
> > > >>>>>>
> > > >>>>>> Hi Wes,
> > > >>>>>>
> > > >>>>>> cool initiative! Reminded me of "Building Advanced SQL Analytics
> > > From
> > > >>>>>> Low-Level Plan Operators" from SIGMOD 2021 (
> > > >>>>>> http://db.in.tum.de/~kohn/papers/lolepops-sigmod21.pdf) which
> > > >>>> proposes a
> > > >>>>>> set of building block for advanced aggregation.
> > > >>>>>>
> > > >>>>>> Cheers,
> > > >>>>>> Dimitri.
> > > >>>>>>
> > > >>>>>> On Thu, Aug 5, 2021 at 7:59 PM Julian Hyde <
> jhyde.apache@gmail.com>
> > > >>>>> wrote:
> > > >>>>>>
> > > >>>>>>> Wes,
> > > >>>>>>>
> > > >>>>>>> Thanks for this. I’ve added comments to the doc and to the PR.
> > > >>>>>>>
> > > >>>>>>> The biggest surprise is that this language does full relational
> > > >>>>>>> operations. I was expecting that it would do fragments of the
> > > >>>>> operations.
> > > >>>>>>> Consider join. A distributed hybrid hash join needs to
> partition
> > > >>> rows
> > > >>>>> into
> > > >>>>>>> output buffers based on a hash key, build hash tables, probe
> into
> > > >>>> hash
> > > >>>>>>> tables, scan hash tables for untouched “outer”rows, and so
> forth.
> > > >>>>>>>
> > > >>>>>>> I see Arrow’s compute as delivering each of those operations,
> > > >>> working
> > > >>>>> on
> > > >>>>>>> perhaps a batch at a time, or a sequence of batches, with some
> > > >>> other
> > > >>>>> system
> > > >>>>>>> coordinating those tasks. So I would expect to see Arrow’s
> compute
> > > >>>>> language
> > > >>>>>>> mainly operating on batches rather than a table abstraction.
> > > >>>>>>>
> > > >>>>>>> Julian
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>>> On Aug 2, 2021, at 5:16 PM, Wes McKinney <wesmckinn@gmail.com
> >
> > > >>>>> wrote:
> > > >>>>>>>>
> > > >>>>>>>> hi folks,
> > > >>>>>>>>
> > > >>>>>>>> This idea came up in passing in the past -- given that there
> are
> > > >>>>>>>> multiple independent efforts to develop Arrow-native query
> > > >>> engines
> > > >>>>>>>> (and surely many more to come), it seems like it would be
> > > >>> valuable
> > > >>>> to
> > > >>>>>>>> have a way to enable user languages (like Java, Python, R, or
> > > >>> Rust,
> > > >>>>>>>> for example) to communicate with backend computing engines
> (like
> > > >>>>>>>> DataFusion, or new computing capabilities being built in the
> > > >>> Arrow
> > > >>>>> C++
> > > >>>>>>>> library) in a fashion that is "lower-level" than SQL and
> > > >>>> specialized
> > > >>>>>>>> to Arrow's type system. So rather than leaving it to a SQL
> > > >>> parser /
> > > >>>>>>>> analyzer framework to generate an expression tree of
> relational
> > > >>>>>>>> operators and then translate that to an Arrow-native
> > > >>>> compute-engine's
> > > >>>>>>>> internal grammer, a user framework could provide the desired
> > > >>>>>>>> Arrow-native expression tree / data manipulations directly and
> > > >>> skip
> > > >>>>>>>> the SQL altogether.
> > > >>>>>>>>
> > > >>>>>>>> The idea of creating a "serialized intermediate representation
> > > >>>> (IR)"
> > > >>>>>>>> for Arrow compute operations would be to serve use cases large
> > > >>> and
> > > >>>>>>>> small -- from the most complex TPC-* or time series database
> > > >>> query
> > > >>>> to
> > > >>>>>>>> the most simple array predicate/filter sent with an RPC
> request
> > > >>>> using
> > > >>>>>>>> Arrow Flight. It is deliberately language- and
> engine-agnostic,
> > > >>>> with
> > > >>>>>>>> the only commonality across usages being the Arrow columnar
> > > >>> format
> > > >>>>>>>> (schemas and array types). This would be better than leaving
> it
> > > >>> to
> > > >>>>>>>> each application to develop its own bespoke expression
> > > >>>>> representations
> > > >>>>>>>> for its needs.
> > > >>>>>>>>
> > > >>>>>>>> I spent a while thinking about this and wrote up a brain dump
> RFC
> > > >>>>>>>> document [1] and accompanying pull request [2] that makes the
> > > >>>>> possibly
> > > >>>>>>>> controversial choice of using Flatbuffers to represent the
> > > >>>> serialized
> > > >>>>>>>> IR. I discuss the rationale for the choice of Flatbuffers in
> the
> > > >>>> RFC
> > > >>>>>>>> document. This PR is obviously deficient in many regards
> > > >>>> (incomplete,
> > > >>>>>>>> hacky, or unclear in places), and will need some help from
> others
> > > >>>> to
> > > >>>>>>>> flesh out. I suspect that actually implementing the IR will be
> > > >>>>>>>> necessary to work out many of the low-level details.
> > > >>>>>>>>
> > > >>>>>>>> Note that this IR is intended to be more of a "superset"
> project
> > > >>>> than
> > > >>>>>>>> a "lowest common denominator". So there may be things that it
> > > >>>>> includes
> > > >>>>>>>> which are only available in some engines (e.g. engines that
> have
> > > >>>>>>>> specialized handling of time series data).
> > > >>>>>>>>
> > > >>>>>>>> As some of my personal motivation for the project, concurrent
> > > >>> with
> > > >>>>> the
> > > >>>>>>>> genesis of Apache Arrow, I started a Python project called
> Ibis
> > > >>> [3]
> > > >>>>>>>> (which is similar to R's dplyr project) which serves as a
> "Python
> > > >>>>>>>> analytical query IR builder" that is capable of generating
> most
> > > >>> of
> > > >>>>> the
> > > >>>>>>>> SQL standard, targeting many different SQL dialects and other
> > > >>>>> backends
> > > >>>>>>>> (like pandas). Microsoft ([4]) and Google ([5]) have used this
> > > >>>>> library
> > > >>>>>>>> as a "many-SQL" middleware. As such, I would like to be able
> to
> > > >>>>>>>> translate between the in-memory "logical query" data
> structures
> > > >>> in
> > > >>>> a
> > > >>>>>>>> library like Ibis to a serialized format that can be executed
> by
> > > >>>> many
> > > >>>>>>>> different Arrow-native query engines. The expression
> primitives
> > > >>>>>>>> available in Ibis should serve as helpful test cases, too.
> > > >>>>>>>>
> > > >>>>>>>> I look forward to the community's comments on the RFC document
> > > >>> [1]
> > > >>>>> and
> > > >>>>>>>> pull request [2] -- I realize that arriving at consensus on a
> > > >>>> complex
> > > >>>>>>>> and ambitious project like this can be challenging so I
> recommend
> > > >>>>>>>> spending time on the "non-goals" section in the RFC and ask
> > > >>>> questions
> > > >>>>>>>> if you are unclear about the scope of what problems this is
> > > >>> trying
> > > >>>> to
> > > >>>>>>>> solve. I would be happy to give Edit access on the RFC
> document
> > > >>> to
> > > >>>>>>>> others and would consider ideas about how to move forward with
> > > >>>>>>>> something that is able to be implemented by different Arrow
> > > >>>> libraries
> > > >>>>>>>> in the reasonably near future.
> > > >>>>>>>>
> > > >>>>>>>> Thanks,
> > > >>>>>>>> Wes
> > > >>>>>>>>
> > > >>>>>>>> [1]:
> > > >>>>>>>
> > > >>>>>
> > > >>>>
> > > >>>
> > >
> https://docs.google.com/document/d/1C_XVOG7iFkl6cgWWMyzUoIjfKt-X2UxqagPJrla0bAE/edit#
> > > >>>>>>>> [2]: https://github.com/apache/arrow/pull/10856
> > > >>>>>>>> [3]: https://ibis-project.org/
> > > >>>>>>>> [4]: http://cidrdb.org/cidr2021/papers/cidr2021_paper08.pdf
> > > >>>>>>>> [5]:
> > > >>>>>>>
> > > >>>>>
> > > >>>>
> > > >>>
> > >
> https://cloud.google.com/blog/products/databases/automate-data-validation-with-dvt
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>
> > > >>>>
> > > >>>
> > >
> > >
>

RE: [DISCUSS] Developing an "Arrow Compute IR [Intermediate Representation]" to decouple language front ends from Arrow-native compute engines

Posted by "Yang, Binwei" <bi...@intel.com>.

This is excellent proposal. We worked on apache arrow based native Spark SQL engine offloading for 1.5 years. Currently we extended Gandiva's node and expression definition to pass query plan from Spark to native code. We will try start to try substrait. These features can definitely help:

(O) The plan tree should be human readable, so spark can dump and we can easily check if it's correct
(O) Easily restore the buffer in python. Our optimization flow is: run Spark get hot stage, easily reproduce the whole stage in python/cython, analysis and sample the single stage to optimize it. The inputs to python is 1) data source or shuffled data, 2) the query plan.


Thanks
Binwei

-----Original Message-----
From: Jacques Nadeau <ja...@apache.org> 
Sent: Wednesday, September 8, 2021 07:06
To: dev <de...@arrow.apache.org>
Subject: Re: [DISCUSS] Developing an "Arrow Compute IR [Intermediate Representation]" to decouple language front ends from Arrow-native compute engines

As Phillip mentioned, I think there is something powerful in producing a
standard serialized representation of compute operations beyond just Arrow
and I'd really like to create a broader community around it. This has been
something I had been independently thinking about for the last several
months. The discussion here has inspired me to start making real progress
on this work. As such, I created a new repository and site where I've
started to put together work around a new specification for compute. I
would love for the people here to help define this and will be looking to a
number of other communities to also contribute. One of my goals has been to
break the specification into a number of much smaller pieces [1] so that we
can make progress on each subsection without being overwhelmed by the
amount of content that must be reviewed.

Would love to hear people's ideas on this initiative.

The site is here: https://substrait.io/
The repo is here: https://github.com/substrait-io/substrait

[1] https://substrait.io/spec/specification/#components



On Wed, Sep 1, 2021 at 3:26 PM Phillip Cloud <cp...@gmail.com> wrote:

> Hey everyone,
>
> As many of you know, the compute IR project has a lot of interested parties
> and has generated a lot of feedback. In light of some of the feedback we’ve
> received, we want to stress that the specification is intended to have
> input from many diverse points of view and that we welcome folks outside of
> the Arrow community. We think there’s immense potential for a compute IR
> that multiple projects--including those outside of the Arrow umbrella--can
> leverage.
>
> With that in mind, Jacques has been working on something outside of the
> Arrow repo that’ll be shared in a few days, that is designed to bring those
> viewpoints to bear on the problem of generic relational computation that
> lives outside of the Arrow project.
>
> Inside Arrow, we think that a version of the in-development IR
> specifications from the last several weeks will add a ton of value by
> informing this new effort and would like to continue to move forward with a
> work-in-progress IR inside of Arrow for the time being to enable some work
> on API development (independent of exactly how things are serialized) to
> take place. It is very likely that we will adopt this broader specification
> once the dust has settled, so the format inside of Arrow will be relatively
> unstable for a while and not have backwards compatibility guarantees for
> now.
>
> The primary focus of the Arrow IR will be on shoring up APIs (producers and
> consumers), and we will also be moving the compute IR flatbuffers files out
> the format directory into another top-level directory in the repo.
>
> Thanks,
> Phillip
>
> On Mon, Aug 30, 2021 at 7:30 PM Weston Pace <we...@gmail.com> wrote:
>
> > My (incredibly naive) interpretation is that there are three problems to
> > tackle.
> >
> > 1) How do you represent a graph and relational operators (join, union,
> > groupby, etc.)
> >  - The PR appears to be addressing this question fairly well
> > 2) How does a frontend query a backend to know what UDFs are supported.
> >  - I don't see anything in the spec for this (some comments touch on
> > it) but it seems like it would be necessary to build any kind of
> > system.
> > 3) Is there some well defined set of canonical UDFs that we can all
> > agree on the semantics for (e.g. addition, subtraction, etc.)
> >  - I thought, from earlier comments in this email thread, that the
> > goal was to avoid addressing this.  Although I think there is strong
> > value here as well.
> >
> > So what is the scope of this initiative?  If it is just #1 for example
> > then I don't see any need to put types in the IR (and I've commented
> > as such in the PR).  From a relational perspective isn't a UDF just a
> > black box Table -> UDF -> Table?
> >
> > On Mon, Aug 30, 2021 at 11:10 AM Phillip Cloud <cp...@gmail.com>
> wrote:
> > >
> > > Hey everyone,
> > >
> > > There's some interesting discussion around types and where their
> location
> > > is in the current PR [1] (and in fact whether to store them at all).
> > >
> > > It would be great to get some community feedback on this [2] part of
> the
> > PR
> > > in particular, because the choice of whether to store types at all has
> > > important design consequences.
> > >
> > > [1]: https://github.com/apache/arrow/pull/10934
> > > [2]: https://github.com/apache/arrow/pull/10934/files#r697025313
> > >
> > > On Fri, Aug 27, 2021 at 2:11 AM Micah Kornfield <emkornfield@gmail.com
> >
> > > wrote:
> > >
> > > > As an FYI, Iceberg is also considering an IR in relation to view
> > support
> > > > [1].  I chimed in and pointed them to this thread and Wes's doc.
> > Phillip
> > > > and Jacques chimed in there as well.
> > > >
> > > > [1]
> > > >
> > > >
> >
> https://mail-archives.apache.org/mod_mbox/iceberg-dev/202108.mbox/%3CCAKRVfm6h6WxQtp5fj8Yj8XWR1wFe8VohOkPuoZZGK-UHPhtwjQ%40mail.gmail.com%3E
> > > >
> > > > On Thu, Aug 26, 2021 at 12:40 PM Phillip Cloud <cp...@gmail.com>
> > wrote:
> > > >
> > > > > Thanks for the feedback Jacques, very helpful. In the latest
> version
> > of
> > > > the
> > > > > PR, I've tried to incorporate nearly all of these points.
> > > > >
> > > > > - I've incorporated most of what you had for dereferencing
> operations
> > > > into
> > > > > the PR, and gotten rid of schemas except on Read/Write relations.
> > > > > - With respect to properties, I've made a bunch more specific
> > operators,
> > > > > and kept user-defined things special case-y.
> > > > > - I haven't incorporated anything close to physical-plan things,
> but
> > I
> > > > > think that's a good follow up PR. Having separate representations
> for
> > > > > logical/physical plans seems like a waste of effort ultimately. I
> > think
> > > > we
> > > > > can find a good balance.
> > > > > - Agree on UDF support, I think that will have to evolve as the
> rest
> > of
> > > > the
> > > > > spec evolves. UDFs will need language-dedicated effort given the
> > large
> > > > > variety of languages that folks will want to use to define
> functions.
> > > > >
> > > > > On a separate note, in an effort to move this project forward I'd
> > like to
> > > > > do one final round of code review from anyone interested and then
> > merge
> > > > the
> > > > > PR after that.
> > > > > This spec will be unstable for a while, until we can work out all
> the
> > > > > design kinks and edge cases, and I think getting this initial spec
> > in is
> > > > > important to start that process.
> > > > >
> > > > >
> > > > > On Mon, Aug 23, 2021 at 1:53 PM Jacques Nadeau <jacques@apache.org
> >
> > > > wrote:
> > > > >
> > > > > > In a lucky turn of events, Phillip actually turned out to be in
> my
> > neck
> > > > > of
> > > > > > the woods on Friday so we had a chance to sit down and discuss
> > this. To
> > > > > > help, I actually shared something I had been working on a few
> > months
> > > > ago
> > > > > > independently (before this discussion started).
> > > > > >
> > > > > > For reference:
> > > > > > Wes PR: https://github.com/apache/arrow/pull/10856
> > > > > > Ben PR: https://github.com/apache/arrow/pull/10934
> > > > > > Jacques PR: https://github.com/apache/arrow/pull/10979
> > > > > >
> > > > > > The high level points of feedback I have are:
> > > > > >
> > > > > >    - Ben PR feels too deconstructed. While I like the elegance
> and
> > > > > >    symmetry, I believe this will lead to substantially more work
> in
> > > > > moving
> > > > > >    from serialization format to something closer to what a system
> > would
> > > > > > want
> > > > > >    to manipulate/consume. The reality is that there are a lot of
> > really
> > > > > > known
> > > > > >    things and specializing the representation for these things
> will
> > > > > > ultimately
> > > > > >    make things easier to program with without error and easier to
> > > > debug.
> > > > > > (For
> > > > > >    example, imagine trying to inspect a plan in a debugging
> session
> > > > with
> > > > > > the
> > > > > >    Ben representation.) We should embrace the known things in the
> > > > > >    specification.
> > > > > >    - I believe that it is a mistake for the inner workings of the
> > plan
> > > > to
> > > > > >    ever use field names. Only input (e.g. file read) and Output
> > (e.g.
> > > > > > return
> > > > > >    to user or write to file) need to have field names. For the
> > rest of
> > > > > the
> > > > > >    system, using field ordinals (determinant whether nested or
> > flat) is
> > > > > > much
> > > > > >    cleaner and is how most execution systems work. For example,
> in
> > > > > Impala I
> > > > > >    believe it is called a slot. As I noted in the PR, Calcite as
> an
> > > > > > example is
> > > > > >    entirely ordinal based at the algebra level. Rowtypes contain
> > field
> > > > > > names
> > > > > >    but they are actually basically pointless. Field references
> use
> > > > > > RexInputRef
> > > > > >    with ordinal based and rules around column order output (e.g.
> > what
> > > > is
> > > > > > the
> > > > > >    field order of a join output) are determinant and done
> entirely
> > at
> > > > an
> > > > > >    ordinal level. The only place where names should be used
> > (besides
> > > > > >    input/output) is in the case of map keys. In that case, the
> > names
> > > > are
> > > > > >    actually data, as opposed to scheme metadata. This is why I
> > propose
> > > > a
> > > > > >    strongly structured dereference operation [1].
> > > > > >    - Properties should only be included in the serialization if
> > they
> > > > are
> > > > > >    not easily re-derivable at plan consumption time. For example,
> > > > you'll
> > > > > > note
> > > > > >    that I don't store schema information for relational
> operation.
> > Each
> > > > > >    function and relational operation should already know how a
> > given
> > > > > input
> > > > > > is
> > > > > >    transformed to a given output. Capturing this information in
> the
> > > > > > plan/IR is
> > > > > >    excessive. In many ways, I compare it to the early use of
> > > > VectorLayout
> > > > > > [2]
> > > > > >    in Arrow schema. It may have provided some additional checksum
> > of
> > > > the
> > > > > >    message but ultimately it caused more pain than it was worth
> > (and
> > > > thus
> > > > > > we
> > > > > >    removed it before formalizing the specification). For
> > reference, in
> > > > > the
> > > > > >    context of Dremio, we used to actually do this, send schema
> > > > > information
> > > > > >    around for all operations. We removed it because in many cases
> > > > > becoming
> > > > > > the
> > > > > >    majority of our internal plan serialization (imagine simple
> > > > operations
> > > > > > that
> > > > > >    are carrying 1000s of fields).
> > > > > >    - I suggest focusing on support for both logical and physical
> > > > > >    representations. The moment you start talking about
> optimization
> > > > > passes,
> > > > > >    many of those would probably be better being done at the
> logical
> > > > > level.
> > > > > > The
> > > > > >    overlap is really high.
> > > > > >    - I think a lot more work must be done before introducing UDFs
> > and
> > > > > user
> > > > > >    defined relational operations. I see one goal being the
> > possibility
> > > > of
> > > > > >    there being three systems: A -> B -> C. A is a IR producer. C
> > is a
> > > > IR
> > > > > >    consumer and B is a IR filter or translator. In this
> situation,
> > B
> > > > > > should be
> > > > > >    able to operate and do optimizations on a plan even if if
> there
> > are
> > > > > > black
> > > > > >    box user defined operations. Being able to know the
> > > > > > properties-preservation
> > > > > >    or not of these operations is important to making decisions.
> For
> > > > > > example,
> > > > > >    does a user defined relational operation maintain sortedness?
> > > > > > Distribution?
> > > > > >    Is a defined UDF idempotent? As such, I think the definition
> of
> > > > those
> > > > > > black
> > > > > >    boxes should be much more structured. For example: it is a
> > python
> > > > > >    relational operation named X stored in Y that maintains
> > properties
> > > > 1,2
> > > > > > and
> > > > > >    disrupts property 3. Putting just a black box of bytes will
> > > > > > substantially
> > > > > >    reduce the compatibility and extensibility of the ecosystem of
> > tools
> > > > > >    working against IR. I'd note that I wouldn't expect this to
> be a
> > > > > burden
> > > > > > to
> > > > > >    actual end users. By using sensible defaults, I still would
> > expect
> > > > an
> > > > > > end
> > > > > >    user tool to support arbitrary user defined operations.
> > > > > >    - It might make sense to review the XML representation that
> Orca
> > > > uses
> > > > > >    [3]. I haven't looked at it recently but they had a strong
> goal
> > of
> > > > > >    decoupling for most of its life (to use in both Greenplum and
> > Hawq).
> > > > > It
> > > > > >    could be the most mature/formal serialization of query plans
> > > > > publically
> > > > > >    available.
> > > > > >
> > > > > >
> > > > > > [1]
> > > > > >
> > > > > >
> > > > >
> > > >
> >
> https://github.com/apache/arrow/pull/10979/files#diff-e40fbc40cf7a131efd2cb098444931774cfad046b8665b38452258ffaa2e3423R34
> > > > > > [2]
> > > > > >
> > > > > >
> > > > >
> > > >
> >
> https://github.com/apache/arrow/commit/611a4b951e24f4f967c3d382a2027dc035fc37f0
> > > > > > [3] https://github.com/greenplum-db/gporca
> > > > > >
> > > > > >
> > > > > > On Tue, Aug 17, 2021 at 11:14 AM Phillip Cloud <
> cpcloud@gmail.com>
> > > > > wrote:
> > > > > >
> > > > > > > On Tue, Aug 17, 2021 at 10:56 AM Wes McKinney <
> > wesmckinn@gmail.com>
> > > > > > wrote:
> > > > > > >
> > > > > > > > Looking at Ben's alternate PR [1], having an IR that leans
> > heavily
> > > > on
> > > > > > > > memory references to an out-of-band data sidecar seems like
> an
> > > > > > > > approach that would substantially ratchet up the
> implementation
> > > > > > > > complexity as producing the IR would then have the level of
> > > > > complexity
> > > > > > > > of producing the Arrow IPC format — when producing the "root"
> > Plan
> > > > > > > > message, you must accumulate a list of the dependent
> serialized
> > > > > > > > submessages, placing the appropriate Buffer memory offset in
> > the
> > > > Plan
> > > > > > > > message, like we do when producing the RecordBatch.buffers
> > field.
> > > > > This
> > > > > > > > seems complicated to me as you must devise a custom binary
> > protocol
> > > > > to
> > > > > > > > concatenate the serialized Plan and the messages it depends
> on
> > > > into a
> > > > > > > > single binary payload
> > > > > > > >
> > > > > > > > <ROOT PLAN>
> > > > > > > > <padding>
> > > > > > > > <Buffer 0>
> > > > > > > > <padding>
> > > > > > > > <Buffer 1>
> > > > > > > > <padding>
> > > > > > > > ...
> > > > > > > > <Buffer N - 1>
> > > > > > > >
> > > > > > > > (one purpose of FlatBufferBuilder is to spare you having to
> do
> > this
> > > > > > > > yourself — some reasons we do it for the Arrow IPC format is
> > > > because
> > > > > > > > appending Arrow memory buffers directly to a
> FlatBufferBuilder
> > > > would
> > > > > > > > be inefficient — internal realloc calls — and Flatbuffers are
> > > > limited
> > > > > > > > to 2GB. Neither of these things are problems here)
> > > > > > > >
> > > > > > > > In general, I believe the output created by an IR producer
> > should
> > > > be
> > > > > a
> > > > > > > > single serialized object without any out-of-band data
> sidecar —
> > > > this
> > > > > > > > is much simpler for implementers and we can still provide an
> > > > "escape
> > > > > > > > hatch" for user-defined operators and functions where the
> > required
> > > > > > > > function/operator is passed opaquely as an embedded binary
> > data.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > The serialization format (whether it is Flatbuffers or JSON, or
> > > > > > > > something else) should allow for data memoization, so if
> there
> > is a
> > > > > > > > user-defined operator/function, or a relation that is used
> > multiple
> > > > > > > > times throughout the query (potentially with a large schema),
> > then
> > > > we
> > > > > > > > should ensure that the data need not be duplicated in the
> > > > > > > > serialization format unnecessarily — in Flatbuffers, you can
> > > > achieve
> > > > > > > > this by reusing offsets, but we could devise the data
> > structures to
> > > > > > > > make the memoization of "expensive" objects more explicit.
> > > > > > > >
> > > > > > >
> > > > > > > I think this is something that would need to be explicitly
> > encoded in
> > > > > > > the structures themselves if it's a hard requirement. I don't
> > think
> > > > > this
> > > > > > > should block
> > > > > > > a prototype producer/consumer.
> > > > > > >
> > > > > > > Is there something in the second PR/design that precludes the
> > reuse
> > > > of
> > > > > > > offsets?
> > > > > > > To my eye, the flatbuffers offset reuse mechanism works just as
> > well
> > > > > > there.
> > > > > > >
> > > > > > >
> > > > > > > > I additionally think that it is important to provide as much
> > > > built-in
> > > > > > > > support for "canonical" operators/functions (such as the ones
> > > > > > > > implemented commonly by SQL engines) as possible, and to
> > liberally
> > > > > > > > expand the catalogue of "built-in" capabilities. I would
> still
> > > > prefer
> > > > > > > > to have large unions/enums of built-in operators/functions
> and
> > to
> > > > > > > > expand those unions/enums to accommodate new things when it
> is
> > > > > > > > demonstrated that there is a need to standardize things
> between
> > > > > > > > producers/consumers.
> > > > > > > >
> > > > > > >
> > > > > > > I think there's a middle ground where we add a bit of structure
> > > > > > (something
> > > > > > > like
> > > > > > > a descriptor from the first PR) to indicate whether a thing is
> > > > built-in
> > > > > > vs
> > > > > > > user-defined.
> > > > > > > It looks like Ben has pushed something like this to his PR.
> > > > > > >
> > > > > > > With that scheme, we have both flexibility and a small set of
> > special
> > > > > > > builtins that make up
> > > > > > > a statically typed set for expressions and relational
> operators.
> > > > > > >
> > > > > > > I would really like to vet this PR with a prototype this week,
> > > > > > > to see whether we need to revisit any major choices. I don't
> > think
> > > > > we'll
> > > > > > be
> > > > > > > able to
> > > > > > > anticipate all the consequences until we write some code.
> > > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > One of the beneficial properties of the Union/Enum approach
> > for the
> > > > > > > > operator/function catalogues, is that when there are
> additions
> > to
> > > > > > > > those enums, the generated Flatbuffers files will cause many
> > > > language
> > > > > > > > compilers to warn or error on unhandled enum cases. If all
> > > > > > > > function/operator names are strings, then you are essentially
> > > > > > > > reimplementing the functionality provided by enums by hand. I
> > > > > > > > initially used strings for all function references in my
> > original
> > > > > > > > prototype, but I now think that using an enum for "built-ins"
> > would
> > > > > be
> > > > > > > > superior (because of the code-generated enum interface) and
> > not a
> > > > > > > > premature optimization.
> > > > > > > >
> > > > > > > > [1]: https://github.com/apache/arrow/pull/10934
> > > > > > > >
> > > > > > > > On Fri, Aug 13, 2021 at 11:26 PM Phillip Cloud <
> > cpcloud@gmail.com>
> > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > Hey all,
> > > > > > > > >
> > > > > > > > > Just wanted to give an update on the effort here.
> > > > > > > > >
> > > > > > > > > Ben Kietzman has created an alternative proposal to the
> > initial
> > > > > > design
> > > > > > > > [1].
> > > > > > > > > It largely overlaps with the original, but differs in a few
> > > > > important
> > > > > > > > ways:
> > > > > > > > >
> > > > > > > > > * A big focus of the design is on flexibility, allowing
> > > > producers,
> > > > > > > > > consumers and ultimately end users of those systems the
> > ability
> > > > to
> > > > > > > define
> > > > > > > > > custom operations in the graph.
> > > > > > > > > * There are very few predefined relational operations
> > (project,
> > > > > > filter,
> > > > > > > > > join and a handful of others).
> > > > > > > > > * There are only 3 types of value expressions: literals,
> > field
> > > > > > > > references,
> > > > > > > > > and function calls.
> > > > > > > > > * The model of evaluation is one that requires a final sink
> > node,
> > > > > to
> > > > > > > > > indicate where the record batches from child relations end
> > up (a
> > > > > > file,
> > > > > > > a
> > > > > > > > > socket, an in-memory buffer, etc).
> > > > > > > > >
> > > > > > > > > I've added notes [2] to the original Google doc (under the
> > > > > > Alternative
> > > > > > > > > Design Notes subheading), and a few pseudocode examples.
> > > > > > > > > Unfortunately, these went out of date as soon as Ben pushed
> > the
> > > > PR
> > > > > > [3],
> > > > > > > > so
> > > > > > > > > I need to update those to reflect his changes. Regardless,
> > > > > > > > > the design is broadly the same, so it should still give a
> > good
> > > > > > > indication
> > > > > > > > > of the details of the design.
> > > > > > > > >
> > > > > > > > > There are a decent number of review comments on the
> original
> > PR
> > > > > that
> > > > > > I
> > > > > > > > plan
> > > > > > > > > to port over where they are still relevant.
> > > > > > > > > I also plan on adding support for window functions either
> > tonight
> > > > > or
> > > > > > on
> > > > > > > > > Monday.
> > > > > > > > >
> > > > > > > > > Please review this design at your earliest convenience.
> Since
> > > > > > there's a
> > > > > > > > > fairly concrete set of types in flatbuffers that
> > > > > > > > > we can look at, ideally we can center discussion around the
> > > > details
> > > > > > in
> > > > > > > > the
> > > > > > > > > PR.
> > > > > > > > >
> > > > > > > > > Thanks!
> > > > > > > > >
> > > > > > > > > [1]: https://github.com/apache/arrow/pull/10856
> > > > > > > > > [2]:
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> >
> https://docs.google.com/document/d/1C_XVOG7iFkl6cgWWMyzUoIjfKt-X2UxqagPJrla0bAE/edit#heading=h.4tfbbtaqzu13
> > > > > > > > > [3]: https://github.com/apache/arrow/pull/10934
> > > > > > > > >
> > > > > > > > > On Thu, Aug 12, 2021 at 3:55 PM Julian Hyde <
> > > > > jhyde.apache@gmail.com>
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > > Wes wrote:
> > > > > > > > > > >
> > > > > > > > > > > Supporting this kind of intra-application engine
> > > > > > > > > > > heterogeneity is one of the motivations for the
> project.
> > > > > > > > > >
> > > > > > > > > > +1
> > > > > > > > > >
> > > > > > > > > > The data format is the natural interface between tasks.
> > > > (Defining
> > > > > > > > “task”
> > > > > > > > > > here as “something that is programmed using the IR”.)
> That
> > is
> > > > > > Arrow’s
> > > > > > > > > > strength.
> > > > > > > > > >
> > > > > > > > > > So I think the IR should describe what each task should
> > do, and
> > > > > > tasks
> > > > > > > > > > should be fairly small. Not whole relational operators,
> > > > operating
> > > > > > on
> > > > > > > > whole
> > > > > > > > > > tables, but pieces of relational operators, operating on
> > > > batches
> > > > > or
> > > > > > > > > > sequences of batches.
> > > > > > > > > >
> > > > > > > > > > Elsethread, someone mentioned the LoLePop concept and the
> > > > > > > > Kohn/Leis/Neuman
> > > > > > > > > > paper [1]. The LoLePop concept sounds good for our
> > purposes.
> > > > > > > > > >
> > > > > > > > > > Julian
> > > > > > > > > >
> > > > > > > > > > [1]
> > https://db.in.tum.de/~kohn/papers/lolepops-sigmod21.pdf
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > On Aug 12, 2021, at 5:19 AM, Wes McKinney <
> > > > wesmckinn@gmail.com
> > > > > >
> > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > On Wed, Aug 11, 2021 at 11:22 PM Phillip Cloud <
> > > > > > cpcloud@gmail.com>
> > > > > > > > > > wrote:
> > > > > > > > > > >>
> > > > > > > > > > >> On Wed, Aug 11, 2021 at 4:48 PM Jorge Cardoso Leitão <
> > > > > > > > > > >> jorgecarleitao@gmail.com> wrote:
> > > > > > > > > > >>
> > > > > > > > > > >>> Couple of questions
> > > > > > > > > > >>>
> > > > > > > > > > >>> 1. Is the goal that IRs have equal semantics, i.e.
> > given
> > > > > > > > (IR,data), the
> > > > > > > > > > >>> operation "(IR,data) - engine -> result" MUST be the
> > same
> > > > for
> > > > > > all
> > > > > > > > > > "engine"?
> > > > > > > > > > >>>
> > > > > > > > > > >>
> > > > > > > > > > >> I think that might be a non-starter for mundane
> reasons:
> > > > > there's
> > > > > > > > > > probably
> > > > > > > > > > >> at least two engines
> > > > > > > > > > >> that disagree on the result of sum(x) where x is a
> > floating
> > > > > > point
> > > > > > > > > > column.
> > > > > > > > > > >>
> > > > > > > > > > >>
> > > > > > > > > > >>> 2. if yes, imo we may need to worry about:
> > > > > > > > > > >>> * a definition of equality that implementations agree
> > on.
> > > > > > > > > > >>> * agreement over what the semantics look like. For
> > example,
> > > > > do
> > > > > > we
> > > > > > > > use
> > > > > > > > > > >>> kleene logic for AND and OR?
> > > > > > > > > > >>>
> > > > > > > > > > >>
> > > > > > > > > > >> WRT Kleene logic my thoughts are that the IR should
> > support
> > > > > both
> > > > > > > > Kleene
> > > > > > > > > > and
> > > > > > > > > > >> non-Kleene,
> > > > > > > > > > >> and producers can choose their desired semantics.
> > > > > > > > > > >>
> > > > > > > > > > >> Ibis for example, would override the `&` operator in
> `a
> > & b`
> > > > > to
> > > > > > > > produce
> > > > > > > > > > >> `KleeneAnd(Column(a), Column(b))`.
> > > > > > > > > > >>
> > > > > > > > > > >>
> > > > > > > > > > >>>
> > > > > > > > > > >>> To try to understand the gist, let's pick an
> aggregated
> > > > count
> > > > > > > over
> > > > > > > > a
> > > > > > > > > > >>> column: engines often do partial counts over
> partitions
> > > > > > followed
> > > > > > > > by a
> > > > > > > > > > final
> > > > > > > > > > >>> "sum" over the partial counts. Is the idea that the
> > query
> > > > > > engine
> > > > > > > > would
> > > > > > > > > > >>> communicate with the compute engine via 2 IRs where
> > one is
> > > > > > "count
> > > > > > > > me
> > > > > > > > > > these"
> > > > > > > > > > >>> the other is "sum me these"?
> > > > > > > > > > >>>
> > > > > > > > > > >>> Best,
> > > > > > > > > > >>> Jorge
> > > > > > > > > > >>>
> > > > > > > > > > >>
> > > > > > > > > > >> Not in its current incarnation.
> > > > > > > > > > >>
> > > > > > > > > > >> The idea is that the IR producer communicates a desire
> > to
> > > > > > count(x)
> > > > > > > > to a
> > > > > > > > > > >> consumer, and  it's up to the consumer to figure out
> > how to
> > > > > turn
> > > > > > > > that
> > > > > > > > > > count
> > > > > > > > > > >> into something that makes sense for itself. In your
> > example
> > > > > > > that's a
> > > > > > > > > > series
> > > > > > > > > > >> of partial counts followed by a sum.
> > > > > > > > > > >>
> > > > > > > > > > >
> > > > > > > > > > > That said, I think there is a valid use case here
> where a
> > > > > system
> > > > > > > > might
> > > > > > > > > > > make use of different engines to execute different
> > > > (composable)
> > > > > > > > layers
> > > > > > > > > > > of a complex query.
> > > > > > > > > > >
> > > > > > > > > > > For example:
> > > > > > > > > > >
> > > > > > > > > > > * suppose you want to scan and do predicate pushdown on
> > an
> > > > > > unusual
> > > > > > > > > > > data source that is only accessible from one particular
> > > > engine
> > > > > > but
> > > > > > > > > > > * you need to do some analytical operation with the
> scan
> > > > > results
> > > > > > > that
> > > > > > > > > > > is only supported by another engine
> > > > > > > > > > >
> > > > > > > > > > > You could decompose the query into two stages with an
> IR
> > > > > > relational
> > > > > > > > > > > expression for each stage and use then the engines
> > together
> > > > to
> > > > > > > > > > > accomplish what you need (of course, you would need an
> > > > > > > orchestration
> > > > > > > > > > > layer to handle plumbing the query engine inputs and
> > outputs
> > > > > > > together
> > > > > > > > > > > as Arrow streams). Supporting this kind of
> > intra-application
> > > > > > engine
> > > > > > > > > > > heterogeneity is one of the motivations for the
> project.
> > > > > > > > > > >
> > > > > > > > > > >>>
> > > > > > > > > > >>>
> > > > > > > > > > >>>
> > > > > > > > > > >>>
> > > > > > > > > > >>>
> > > > > > > > > > >>> On Wed, Aug 11, 2021 at 6:10 PM Phillip Cloud <
> > > > > > cpcloud@gmail.com
> > > > > > > >
> > > > > > > > > > wrote:
> > > > > > > > > > >>>
> > > > > > > > > > >>>> Thanks Wes,
> > > > > > > > > > >>>>
> > > > > > > > > > >>>> Great to be back working on Arrow again and engaging
> > with
> > > > > the
> > > > > > > > > > community.
> > > > > > > > > > >>> I
> > > > > > > > > > >>>> am really excited about this effort.
> > > > > > > > > > >>>>
> > > > > > > > > > >>>> I think there are a number of concerns I see as
> > important
> > > > to
> > > > > > > > address
> > > > > > > > > > in
> > > > > > > > > > >>> the
> > > > > > > > > > >>>> compute IR proposal:
> > > > > > > > > > >>>>
> > > > > > > > > > >>>> 1. Requirement for output types.
> > > > > > > > > > >>>>
> > > > > > > > > > >>>> I think that so far there's been many reasons for
> > > > requiring
> > > > > > > > > > conforming IR
> > > > > > > > > > >>>> producers and consumers to adhere to output types,
> > but I
> > > > > > haven't
> > > > > > > > seen
> > > > > > > > > > a
> > > > > > > > > > >>>> strong rationale for keeping them optional (in the
> > > > semantic
> > > > > > > > sense, not
> > > > > > > > > > >>> WRT
> > > > > > > > > > >>>> any particular serialization format's representation
> > of
> > > > > > > > optionality).
> > > > > > > > > > >>>>
> > > > > > > > > > >>>> I think a design that includes unambiguous semantics
> > for
> > > > > > output
> > > > > > > > types
> > > > > > > > > > (a
> > > > > > > > > > >>>> consumer must produce a value of the requested type
> or
> > > > it's
> > > > > an
> > > > > > > > > > >>>> error/non-conforming) is simpler to reason about for
> > > > > > producers,
> > > > > > > > and
> > > > > > > > > > >>>> provides a strong guarantee for end users (humans or
> > > > > machines
> > > > > > > > > > >>> constructing
> > > > > > > > > > >>>> IR from an API and expecting the thing they ask for
> > back
> > > > > from
> > > > > > > the
> > > > > > > > IR
> > > > > > > > > > >>>> consumer).
> > > > > > > > > > >>>>
> > > > > > > > > > >>>> 2. Flexibility
> > > > > > > > > > >>>>
> > > > > > > > > > >>>> The current PR is currently unable to support what I
> > think
> > > > > are
> > > > > > > > killer
> > > > > > > > > > >>>> features of the IR: custom operators (relational or
> > > > column)
> > > > > > and
> > > > > > > > UDFs.
> > > > > > > > > > In
> > > > > > > > > > >>> my
> > > > > > > > > > >>>> mind, on top of the generalized compute description
> > that
> > > > the
> > > > > > IR
> > > > > > > > > > offers,
> > > > > > > > > > >>> the
> > > > > > > > > > >>>> ability for producers and consumers of IR to extend
> > the IR
> > > > > > > without
> > > > > > > > > > >>> needing
> > > > > > > > > > >>>> to modify Arrow or depend on anything except the
> > format is
> > > > > > > itself
> > > > > > > > > > >>> something
> > > > > > > > > > >>>> that is necessary to gain adoption.
> > > > > > > > > > >>>>
> > > > > > > > > > >>>> Developers will need to build custom relational
> > operators
> > > > > > (e.g.,
> > > > > > > > > > scans of
> > > > > > > > > > >>>> backends that don't exist anywhere for which a user
> > has
> > > > code
> > > > > > to
> > > > > > > > > > >>> implement)
> > > > > > > > > > >>>> and custom functions (anything operating on a column
> > that
> > > > > > > doesn't
> > > > > > > > > > already
> > > > > > > > > > >>>> exist, really). Furthermore, I think we can actually
> > drive
> > > > > > > > building an
> > > > > > > > > > >>>> Arrow consumer using the same API that an end user
> > would
> > > > use
> > > > > > to
> > > > > > > > extend
> > > > > > > > > > >>> the
> > > > > > > > > > >>>> IR.
> > > > > > > > > > >>>>
> > > > > > > > > > >>>> 3. Window Functions
> > > > > > > > > > >>>>
> > > > > > > > > > >>>> Window functions are, I think, an important part of
> > the IR
> > > > > > value
> > > > > > > > > > >>>> proposition, as they are one of the more complex
> > operators
> > > > > in
> > > > > > > > > > databases.
> > > > > > > > > > >>> I
> > > > > > > > > > >>>> think we need to have something in the initial IR
> > proposal
> > > > > to
> > > > > > > > support
> > > > > > > > > > >>> these
> > > > > > > > > > >>>> operations.
> > > > > > > > > > >>>>
> > > > > > > > > > >>>> 4. Non relational Joins
> > > > > > > > > > >>>>
> > > > > > > > > > >>>> Things like as-of join and window join operators
> > aren't
> > > > yet
> > > > > > > > fleshed
> > > > > > > > > > out
> > > > > > > > > > >>> in
> > > > > > > > > > >>>> the IR, and I'm not sure they should be in scope for
> > the
> > > > > > initial
> > > > > > > > > > >>> prototype.
> > > > > > > > > > >>>> I think once we settle on a design, we can work the
> > design
> > > > > of
> > > > > > > > these
> > > > > > > > > > >>>> particular operators out during the initial
> > prototype. I
> > > > > think
> > > > > > > the
> > > > > > > > > > >>>> specification of these operators should basically be
> > PR #2
> > > > > > after
> > > > > > > > the
> > > > > > > > > > >>>> initial design lands.
> > > > > > > > > > >>>>
> > > > > > > > > > >>>> # Order of Work
> > > > > > > > > > >>>>
> > > > > > > > > > >>>> 1. Nail down the design. Anything else is a
> > non-starter.
> > > > > > > > > > >>>>
> > > > > > > > > > >>>> 2. Prototype an IR producer using Ibis
> > > > > > > > > > >>>>
> > > > > > > > > > >>>> Ibis is IMO a good candidate for a first IR producer
> > as it
> > > > > > has a
> > > > > > > > > > number
> > > > > > > > > > >>> of
> > > > > > > > > > >>>> desirable properties that make prototyping faster
> and
> > > > allow
> > > > > > for
> > > > > > > > us to
> > > > > > > > > > >>>> refine the design of the IR as needed based on how
> the
> > > > > > > > implementation
> > > > > > > > > > >>> goes:
> > > > > > > > > > >>>> * It's written in Python so it has native support
> for
> > > > nearly
> > > > > > all
> > > > > > > > of
> > > > > > > > > > >>>> flatbuffers' features without having to creating
> > bindings
> > > > to
> > > > > > > C++.
> > > > > > > > > > >>>> * There's already a set of rules for type checking,
> as
> > > > well
> > > > > as
> > > > > > > > APIs
> > > > > > > > > > for
> > > > > > > > > > >>>> constructing expression trees, which means we don't
> > need
> > > > to
> > > > > > > worry
> > > > > > > > > > about
> > > > > > > > > > >>>> building a type checker for the prototype.
> > > > > > > > > > >>>>
> > > > > > > > > > >>>> 3. Prototype an IR consumer in C++
> > > > > > > > > > >>>>
> > > > > > > > > > >>>> I think in parallel to the producer prototype we can
> > > > further
> > > > > > > > inform
> > > > > > > > > > the
> > > > > > > > > > >>>> design from the consumer side by prototyping an IR
> > > > consumer
> > > > > in
> > > > > > > > C++ . I
> > > > > > > > > > >>> know
> > > > > > > > > > >>>> Ben Kietzman has expressed interest in working on
> > this.
> > > > > > > > > > >>>>
> > > > > > > > > > >>>> Very interested to hear others' thoughts.
> > > > > > > > > > >>>>
> > > > > > > > > > >>>> -Phillip
> > > > > > > > > > >>>>
> > > > > > > > > > >>>> On Tue, Aug 10, 2021 at 10:56 AM Wes McKinney <
> > > > > > > > wesmckinn@gmail.com>
> > > > > > > > > > >>> wrote:
> > > > > > > > > > >>>>
> > > > > > > > > > >>>>> Thank you for all the feedback and comments on the
> > > > > document.
> > > > > > > I'm
> > > > > > > > on
> > > > > > > > > > >>>>> vacation this week, so I'm delayed responding to
> > > > > everything,
> > > > > > > but
> > > > > > > > I
> > > > > > > > > > >>>>> will get to it as quickly as I can. I will be at
> > VLDB in
> > > > > > > > Copenhagen
> > > > > > > > > > >>>>> next week if anyone would like to chat in person
> > about
> > > > it,
> > > > > > and
> > > > > > > > we can
> > > > > > > > > > >>>>> relay the content of any discussions back to the
> > > > > > > > document/PR/e-mail
> > > > > > > > > > >>>>> thread.
> > > > > > > > > > >>>>>
> > > > > > > > > > >>>>> I know that Phillip Cloud expressed interest in
> > working
> > > > on
> > > > > > the
> > > > > > > > PR and
> > > > > > > > > > >>>>> helping work through many of the details, so I'm
> > glad to
> > > > > have
> > > > > > > the
> > > > > > > > > > >>>>> help. If there are others who would like to work on
> > the
> > > > PR
> > > > > or
> > > > > > > dig
> > > > > > > > > > into
> > > > > > > > > > >>>>> the details, please let me know. We might need to
> > figure
> > > > > out
> > > > > > > how
> > > > > > > > to
> > > > > > > > > > >>>>> accommodate "many cooks" by setting up the
> ComputeIR
> > > > > project
> > > > > > > > > > somewhere
> > > > > > > > > > >>>>> separate from the format/ directory to permit it to
> > exist
> > > > > in
> > > > > > a
> > > > > > > > > > >>>>> Work-In-Progress status for a period of time until
> we
> > > > work
> > > > > > > > through
> > > > > > > > > > the
> > > > > > > > > > >>>>> various details and design concerns.
> > > > > > > > > > >>>>>
> > > > > > > > > > >>>>> Re Julian's comment
> > > > > > > > > > >>>>>
> > > > > > > > > > >>>>>> The biggest surprise is that this language does
> full
> > > > > > > relational
> > > > > > > > > > >>>>> operations. I was expecting that it would do
> > fragments of
> > > > > the
> > > > > > > > > > >>> operations.
> > > > > > > > > > >>>>>
> > > > > > > > > > >>>>> There's a related but different (yet still
> > interesting
> > > > and
> > > > > > > > worthy of
> > > > > > > > > > >>>>> analysis) problem of creating an "engine language"
> > that
> > > > > > > describes
> > > > > > > > > > more
> > > > > > > > > > >>>>> mechanically the constituent parts of implementing
> > the
> > > > > > > relational
> > > > > > > > > > >>>>> operators. To create a functional computation
> > language
> > > > with
> > > > > > > > concrete
> > > > > > > > > > >>>>> Arrow data structures as a top-level primitive
> sounds
> > > > like
> > > > > an
> > > > > > > > > > >>>>> interesting research area where I could see
> something
> > > > > > > developing
> > > > > > > > > > >>>>> eventually.
> > > > > > > > > > >>>>>
> > > > > > > > > > >>>>> The main problem I'm interested in solving right
> now
> > is
> > > > > > > enabling
> > > > > > > > > > front
> > > > > > > > > > >>>>> ends that have sufficient understanding of
> relational
> > > > > algebra
> > > > > > > and
> > > > > > > > > > data
> > > > > > > > > > >>>>> frame operations to talk to engines without having
> > to go
> > > > > > > > backwards
> > > > > > > > > > >>>>> from their logical query plans to SQL. So as
> > mentioned in
> > > > > the
> > > > > > > > > > >>>>> document, being able to faithfully carry the
> > relational
> > > > > > > operator
> > > > > > > > node
> > > > > > > > > > >>>>> information generated by Calcite or Ibis or another
> > > > system
> > > > > > > would
> > > > > > > > be
> > > > > > > > > > >>>>> super useful. Defining the semantics of various
> > kinds of
> > > > > > > > user-defined
> > > > > > > > > > >>>>> functions would also be helpful to standardize the
> > > > > > > > > > >>>>> engine-to-user-language UDF/extension interface.
> > > > > > > > > > >>>>>
> > > > > > > > > > >>>>> On Tue, Aug 10, 2021 at 2:36 PM Dimitri Vorona <
> > > > > > > > alendit@gmail.com>
> > > > > > > > > > >>>> wrote:
> > > > > > > > > > >>>>>>
> > > > > > > > > > >>>>>> Hi Wes,
> > > > > > > > > > >>>>>>
> > > > > > > > > > >>>>>> cool initiative! Reminded me of "Building Advanced
> > SQL
> > > > > > > Analytics
> > > > > > > > > > From
> > > > > > > > > > >>>>>> Low-Level Plan Operators" from SIGMOD 2021 (
> > > > > > > > > > >>>>>>
> > http://db.in.tum.de/~kohn/papers/lolepops-sigmod21.pdf)
> > > > > > which
> > > > > > > > > > >>>> proposes a
> > > > > > > > > > >>>>>> set of building block for advanced aggregation.
> > > > > > > > > > >>>>>>
> > > > > > > > > > >>>>>> Cheers,
> > > > > > > > > > >>>>>> Dimitri.
> > > > > > > > > > >>>>>>
> > > > > > > > > > >>>>>> On Thu, Aug 5, 2021 at 7:59 PM Julian Hyde <
> > > > > > > > jhyde.apache@gmail.com>
> > > > > > > > > > >>>>> wrote:
> > > > > > > > > > >>>>>>
> > > > > > > > > > >>>>>>> Wes,
> > > > > > > > > > >>>>>>>
> > > > > > > > > > >>>>>>> Thanks for this. I’ve added comments to the doc
> > and to
> > > > > the
> > > > > > > PR.
> > > > > > > > > > >>>>>>>
> > > > > > > > > > >>>>>>> The biggest surprise is that this language does
> > full
> > > > > > > relational
> > > > > > > > > > >>>>>>> operations. I was expecting that it would do
> > fragments
> > > > of
> > > > > > the
> > > > > > > > > > >>>>> operations.
> > > > > > > > > > >>>>>>> Consider join. A distributed hybrid hash join
> > needs to
> > > > > > > > partition
> > > > > > > > > > >>> rows
> > > > > > > > > > >>>>> into
> > > > > > > > > > >>>>>>> output buffers based on a hash key, build hash
> > tables,
> > > > > > probe
> > > > > > > > into
> > > > > > > > > > >>>> hash
> > > > > > > > > > >>>>>>> tables, scan hash tables for untouched
> > “outer”rows, and
> > > > > so
> > > > > > > > forth.
> > > > > > > > > > >>>>>>>
> > > > > > > > > > >>>>>>> I see Arrow’s compute as delivering each of those
> > > > > > operations,
> > > > > > > > > > >>> working
> > > > > > > > > > >>>>> on
> > > > > > > > > > >>>>>>> perhaps a batch at a time, or a sequence of
> > batches,
> > > > with
> > > > > > > some
> > > > > > > > > > >>> other
> > > > > > > > > > >>>>> system
> > > > > > > > > > >>>>>>> coordinating those tasks. So I would expect to
> see
> > > > > Arrow’s
> > > > > > > > compute
> > > > > > > > > > >>>>> language
> > > > > > > > > > >>>>>>> mainly operating on batches rather than a table
> > > > > > abstraction.
> > > > > > > > > > >>>>>>>
> > > > > > > > > > >>>>>>> Julian
> > > > > > > > > > >>>>>>>
> > > > > > > > > > >>>>>>>
> > > > > > > > > > >>>>>>>> On Aug 2, 2021, at 5:16 PM, Wes McKinney <
> > > > > > > wesmckinn@gmail.com
> > > > > > > > >
> > > > > > > > > > >>>>> wrote:
> > > > > > > > > > >>>>>>>>
> > > > > > > > > > >>>>>>>> hi folks,
> > > > > > > > > > >>>>>>>>
> > > > > > > > > > >>>>>>>> This idea came up in passing in the past --
> given
> > that
> > > > > > there
> > > > > > > > are
> > > > > > > > > > >>>>>>>> multiple independent efforts to develop
> > Arrow-native
> > > > > query
> > > > > > > > > > >>> engines
> > > > > > > > > > >>>>>>>> (and surely many more to come), it seems like it
> > would
> > > > > be
> > > > > > > > > > >>> valuable
> > > > > > > > > > >>>> to
> > > > > > > > > > >>>>>>>> have a way to enable user languages (like Java,
> > > > Python,
> > > > > R,
> > > > > > > or
> > > > > > > > > > >>> Rust,
> > > > > > > > > > >>>>>>>> for example) to communicate with backend
> computing
> > > > > engines
> > > > > > > > (like
> > > > > > > > > > >>>>>>>> DataFusion, or new computing capabilities being
> > built
> > > > in
> > > > > > the
> > > > > > > > > > >>> Arrow
> > > > > > > > > > >>>>> C++
> > > > > > > > > > >>>>>>>> library) in a fashion that is "lower-level" than
> > SQL
> > > > and
> > > > > > > > > > >>>> specialized
> > > > > > > > > > >>>>>>>> to Arrow's type system. So rather than leaving
> it
> > to a
> > > > > SQL
> > > > > > > > > > >>> parser /
> > > > > > > > > > >>>>>>>> analyzer framework to generate an expression
> tree
> > of
> > > > > > > > relational
> > > > > > > > > > >>>>>>>> operators and then translate that to an
> > Arrow-native
> > > > > > > > > > >>>> compute-engine's
> > > > > > > > > > >>>>>>>> internal grammer, a user framework could provide
> > the
> > > > > > desired
> > > > > > > > > > >>>>>>>> Arrow-native expression tree / data
> manipulations
> > > > > directly
> > > > > > > and
> > > > > > > > > > >>> skip
> > > > > > > > > > >>>>>>>> the SQL altogether.
> > > > > > > > > > >>>>>>>>
> > > > > > > > > > >>>>>>>> The idea of creating a "serialized intermediate
> > > > > > > representation
> > > > > > > > > > >>>> (IR)"
> > > > > > > > > > >>>>>>>> for Arrow compute operations would be to serve
> use
> > > > cases
> > > > > > > large
> > > > > > > > > > >>> and
> > > > > > > > > > >>>>>>>> small -- from the most complex TPC-* or time
> > series
> > > > > > database
> > > > > > > > > > >>> query
> > > > > > > > > > >>>> to
> > > > > > > > > > >>>>>>>> the most simple array predicate/filter sent with
> > an
> > > > RPC
> > > > > > > > request
> > > > > > > > > > >>>> using
> > > > > > > > > > >>>>>>>> Arrow Flight. It is deliberately language- and
> > > > > > > > engine-agnostic,
> > > > > > > > > > >>>> with
> > > > > > > > > > >>>>>>>> the only commonality across usages being the
> Arrow
> > > > > > columnar
> > > > > > > > > > >>> format
> > > > > > > > > > >>>>>>>> (schemas and array types). This would be better
> > than
> > > > > > leaving
> > > > > > > > it
> > > > > > > > > > >>> to
> > > > > > > > > > >>>>>>>> each application to develop its own bespoke
> > expression
> > > > > > > > > > >>>>> representations
> > > > > > > > > > >>>>>>>> for its needs.
> > > > > > > > > > >>>>>>>>
> > > > > > > > > > >>>>>>>> I spent a while thinking about this and wrote
> up a
> > > > brain
> > > > > > > dump
> > > > > > > > RFC
> > > > > > > > > > >>>>>>>> document [1] and accompanying pull request [2]
> > that
> > > > > makes
> > > > > > > the
> > > > > > > > > > >>>>> possibly
> > > > > > > > > > >>>>>>>> controversial choice of using Flatbuffers to
> > represent
> > > > > the
> > > > > > > > > > >>>> serialized
> > > > > > > > > > >>>>>>>> IR. I discuss the rationale for the choice of
> > > > > Flatbuffers
> > > > > > in
> > > > > > > > the
> > > > > > > > > > >>>> RFC
> > > > > > > > > > >>>>>>>> document. This PR is obviously deficient in many
> > > > regards
> > > > > > > > > > >>>> (incomplete,
> > > > > > > > > > >>>>>>>> hacky, or unclear in places), and will need some
> > help
> > > > > from
> > > > > > > > others
> > > > > > > > > > >>>> to
> > > > > > > > > > >>>>>>>> flesh out. I suspect that actually implementing
> > the IR
> > > > > > will
> > > > > > > be
> > > > > > > > > > >>>>>>>> necessary to work out many of the low-level
> > details.
> > > > > > > > > > >>>>>>>>
> > > > > > > > > > >>>>>>>> Note that this IR is intended to be more of a
> > > > "superset"
> > > > > > > > project
> > > > > > > > > > >>>> than
> > > > > > > > > > >>>>>>>> a "lowest common denominator". So there may be
> > things
> > > > > that
> > > > > > > it
> > > > > > > > > > >>>>> includes
> > > > > > > > > > >>>>>>>> which are only available in some engines (e.g.
> > engines
> > > > > > that
> > > > > > > > have
> > > > > > > > > > >>>>>>>> specialized handling of time series data).
> > > > > > > > > > >>>>>>>>
> > > > > > > > > > >>>>>>>> As some of my personal motivation for the
> project,
> > > > > > > concurrent
> > > > > > > > > > >>> with
> > > > > > > > > > >>>>> the
> > > > > > > > > > >>>>>>>> genesis of Apache Arrow, I started a Python
> > project
> > > > > called
> > > > > > > > Ibis
> > > > > > > > > > >>> [3]
> > > > > > > > > > >>>>>>>> (which is similar to R's dplyr project) which
> > serves
> > > > as
> > > > > a
> > > > > > > > "Python
> > > > > > > > > > >>>>>>>> analytical query IR builder" that is capable of
> > > > > generating
> > > > > > > > most
> > > > > > > > > > >>> of
> > > > > > > > > > >>>>> the
> > > > > > > > > > >>>>>>>> SQL standard, targeting many different SQL
> > dialects
> > > > and
> > > > > > > other
> > > > > > > > > > >>>>> backends
> > > > > > > > > > >>>>>>>> (like pandas). Microsoft ([4]) and Google ([5])
> > have
> > > > > used
> > > > > > > this
> > > > > > > > > > >>>>> library
> > > > > > > > > > >>>>>>>> as a "many-SQL" middleware. As such, I would
> like
> > to
> > > > be
> > > > > > able
> > > > > > > > to
> > > > > > > > > > >>>>>>>> translate between the in-memory "logical query"
> > data
> > > > > > > > structures
> > > > > > > > > > >>> in
> > > > > > > > > > >>>> a
> > > > > > > > > > >>>>>>>> library like Ibis to a serialized format that
> can
> > be
> > > > > > > executed
> > > > > > > > by
> > > > > > > > > > >>>> many
> > > > > > > > > > >>>>>>>> different Arrow-native query engines. The
> > expression
> > > > > > > > primitives
> > > > > > > > > > >>>>>>>> available in Ibis should serve as helpful test
> > cases,
> > > > > too.
> > > > > > > > > > >>>>>>>>
> > > > > > > > > > >>>>>>>> I look forward to the community's comments on
> the
> > RFC
> > > > > > > document
> > > > > > > > > > >>> [1]
> > > > > > > > > > >>>>> and
> > > > > > > > > > >>>>>>>> pull request [2] -- I realize that arriving at
> > > > consensus
> > > > > > on
> > > > > > > a
> > > > > > > > > > >>>> complex
> > > > > > > > > > >>>>>>>> and ambitious project like this can be
> > challenging so
> > > > I
> > > > > > > > recommend
> > > > > > > > > > >>>>>>>> spending time on the "non-goals" section in the
> > RFC
> > > > and
> > > > > > ask
> > > > > > > > > > >>>> questions
> > > > > > > > > > >>>>>>>> if you are unclear about the scope of what
> > problems
> > > > this
> > > > > > is
> > > > > > > > > > >>> trying
> > > > > > > > > > >>>> to
> > > > > > > > > > >>>>>>>> solve. I would be happy to give Edit access on
> > the RFC
> > > > > > > > document
> > > > > > > > > > >>> to
> > > > > > > > > > >>>>>>>> others and would consider ideas about how to
> move
> > > > > forward
> > > > > > > with
> > > > > > > > > > >>>>>>>> something that is able to be implemented by
> > different
> > > > > > Arrow
> > > > > > > > > > >>>> libraries
> > > > > > > > > > >>>>>>>> in the reasonably near future.
> > > > > > > > > > >>>>>>>>
> > > > > > > > > > >>>>>>>> Thanks,
> > > > > > > > > > >>>>>>>> Wes
> > > > > > > > > > >>>>>>>>
> > > > > > > > > > >>>>>>>> [1]:
> > > > > > > > > > >>>>>>>
> > > > > > > > > > >>>>>
> > > > > > > > > > >>>>
> > > > > > > > > > >>>
> > > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> >
> https://docs.google.com/document/d/1C_XVOG7iFkl6cgWWMyzUoIjfKt-X2UxqagPJrla0bAE/edit#
> > > > > > > > > > >>>>>>>> [2]: https://github.com/apache/arrow/pull/10856
> > > > > > > > > > >>>>>>>> [3]: https://ibis-project.org/
> > > > > > > > > > >>>>>>>> [4]:
> > > > > > http://cidrdb.org/cidr2021/papers/cidr2021_paper08.pdf
> > > > > > > > > > >>>>>>>> [5]:
> > > > > > > > > > >>>>>>>
> > > > > > > > > > >>>>>
> > > > > > > > > > >>>>
> > > > > > > > > > >>>
> > > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> >
> https://cloud.google.com/blog/products/databases/automate-data-validation-with-dvt
> > > > > > > > > > >>>>>>>
> > > > > > > > > > >>>>>>>
> > > > > > > > > > >>>>>
> > > > > > > > > > >>>>
> > > > > > > > > > >>>
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> >
> >
>

Re: [DISCUSS] Developing an "Arrow Compute IR [Intermediate Representation]" to decouple language front ends from Arrow-native compute engines

Posted by Jacques Nadeau <ja...@apache.org>.

As Phillip mentioned, I think there is something powerful in producing a
standard serialized representation of compute operations beyond just Arrow
and I'd really like to create a broader community around it. This has been
something I had been independently thinking about for the last several
months. The discussion here has inspired me to start making real progress
on this work. As such, I created a new repository and site where I've
started to put together work around a new specification for compute. I
would love for the people here to help define this and will be looking to a
number of other communities to also contribute. One of my goals has been to
break the specification into a number of much smaller pieces [1] so that we
can make progress on each subsection without being overwhelmed by the
amount of content that must be reviewed.

Would love to hear people's ideas on this initiative.

The site is here: https://substrait.io/
The repo is here: https://github.com/substrait-io/substrait

[1] https://substrait.io/spec/specification/#components



On Wed, Sep 1, 2021 at 3:26 PM Phillip Cloud <cp...@gmail.com> wrote:

> Hey everyone,
>
> As many of you know, the compute IR project has a lot of interested parties
> and has generated a lot of feedback. In light of some of the feedback we’ve
> received, we want to stress that the specification is intended to have
> input from many diverse points of view and that we welcome folks outside of
> the Arrow community. We think there’s immense potential for a compute IR
> that multiple projects--including those outside of the Arrow umbrella--can
> leverage.
>
> With that in mind, Jacques has been working on something outside of the
> Arrow repo that’ll be shared in a few days, that is designed to bring those
> viewpoints to bear on the problem of generic relational computation that
> lives outside of the Arrow project.
>
> Inside Arrow, we think that a version of the in-development IR
> specifications from the last several weeks will add a ton of value by
> informing this new effort and would like to continue to move forward with a
> work-in-progress IR inside of Arrow for the time being to enable some work
> on API development (independent of exactly how things are serialized) to
> take place. It is very likely that we will adopt this broader specification
> once the dust has settled, so the format inside of Arrow will be relatively
> unstable for a while and not have backwards compatibility guarantees for
> now.
>
> The primary focus of the Arrow IR will be on shoring up APIs (producers and
> consumers), and we will also be moving the compute IR flatbuffers files out
> the format directory into another top-level directory in the repo.
>
> Thanks,
> Phillip
>
> On Mon, Aug 30, 2021 at 7:30 PM Weston Pace <we...@gmail.com> wrote:
>
> > My (incredibly naive) interpretation is that there are three problems to
> > tackle.
> >
> > 1) How do you represent a graph and relational operators (join, union,
> > groupby, etc.)
> >  - The PR appears to be addressing this question fairly well
> > 2) How does a frontend query a backend to know what UDFs are supported.
> >  - I don't see anything in the spec for this (some comments touch on
> > it) but it seems like it would be necessary to build any kind of
> > system.
> > 3) Is there some well defined set of canonical UDFs that we can all
> > agree on the semantics for (e.g. addition, subtraction, etc.)
> >  - I thought, from earlier comments in this email thread, that the
> > goal was to avoid addressing this.  Although I think there is strong
> > value here as well.
> >
> > So what is the scope of this initiative?  If it is just #1 for example
> > then I don't see any need to put types in the IR (and I've commented
> > as such in the PR).  From a relational perspective isn't a UDF just a
> > black box Table -> UDF -> Table?
> >
> > On Mon, Aug 30, 2021 at 11:10 AM Phillip Cloud <cp...@gmail.com>
> wrote:
> > >
> > > Hey everyone,
> > >
> > > There's some interesting discussion around types and where their
> location
> > > is in the current PR [1] (and in fact whether to store them at all).
> > >
> > > It would be great to get some community feedback on this [2] part of
> the
> > PR
> > > in particular, because the choice of whether to store types at all has
> > > important design consequences.
> > >
> > > [1]: https://github.com/apache/arrow/pull/10934
> > > [2]: https://github.com/apache/arrow/pull/10934/files#r697025313
> > >
> > > On Fri, Aug 27, 2021 at 2:11 AM Micah Kornfield <emkornfield@gmail.com
> >
> > > wrote:
> > >
> > > > As an FYI, Iceberg is also considering an IR in relation to view
> > support
> > > > [1].  I chimed in and pointed them to this thread and Wes's doc.
> > Phillip
> > > > and Jacques chimed in there as well.
> > > >
> > > > [1]
> > > >
> > > >
> >
> https://mail-archives.apache.org/mod_mbox/iceberg-dev/202108.mbox/%3CCAKRVfm6h6WxQtp5fj8Yj8XWR1wFe8VohOkPuoZZGK-UHPhtwjQ%40mail.gmail.com%3E
> > > >
> > > > On Thu, Aug 26, 2021 at 12:40 PM Phillip Cloud <cp...@gmail.com>
> > wrote:
> > > >
> > > > > Thanks for the feedback Jacques, very helpful. In the latest
> version
> > of
> > > > the
> > > > > PR, I've tried to incorporate nearly all of these points.
> > > > >
> > > > > - I've incorporated most of what you had for dereferencing
> operations
> > > > into
> > > > > the PR, and gotten rid of schemas except on Read/Write relations.
> > > > > - With respect to properties, I've made a bunch more specific
> > operators,
> > > > > and kept user-defined things special case-y.
> > > > > - I haven't incorporated anything close to physical-plan things,
> but
> > I
> > > > > think that's a good follow up PR. Having separate representations
> for
> > > > > logical/physical plans seems like a waste of effort ultimately. I
> > think
> > > > we
> > > > > can find a good balance.
> > > > > - Agree on UDF support, I think that will have to evolve as the
> rest
> > of
> > > > the
> > > > > spec evolves. UDFs will need language-dedicated effort given the
> > large
> > > > > variety of languages that folks will want to use to define
> functions.
> > > > >
> > > > > On a separate note, in an effort to move this project forward I'd
> > like to
> > > > > do one final round of code review from anyone interested and then
> > merge
> > > > the
> > > > > PR after that.
> > > > > This spec will be unstable for a while, until we can work out all
> the
> > > > > design kinks and edge cases, and I think getting this initial spec
> > in is
> > > > > important to start that process.
> > > > >
> > > > >
> > > > > On Mon, Aug 23, 2021 at 1:53 PM Jacques Nadeau <jacques@apache.org
> >
> > > > wrote:
> > > > >
> > > > > > In a lucky turn of events, Phillip actually turned out to be in
> my
> > neck
> > > > > of
> > > > > > the woods on Friday so we had a chance to sit down and discuss
> > this. To
> > > > > > help, I actually shared something I had been working on a few
> > months
> > > > ago
> > > > > > independently (before this discussion started).
> > > > > >
> > > > > > For reference:
> > > > > > Wes PR: https://github.com/apache/arrow/pull/10856
> > > > > > Ben PR: https://github.com/apache/arrow/pull/10934
> > > > > > Jacques PR: https://github.com/apache/arrow/pull/10979
> > > > > >
> > > > > > The high level points of feedback I have are:
> > > > > >
> > > > > >    - Ben PR feels too deconstructed. While I like the elegance
> and
> > > > > >    symmetry, I believe this will lead to substantially more work
> in
> > > > > moving
> > > > > >    from serialization format to something closer to what a system
> > would
> > > > > > want
> > > > > >    to manipulate/consume. The reality is that there are a lot of
> > really
> > > > > > known
> > > > > >    things and specializing the representation for these things
> will
> > > > > > ultimately
> > > > > >    make things easier to program with without error and easier to
> > > > debug.
> > > > > > (For
> > > > > >    example, imagine trying to inspect a plan in a debugging
> session
> > > > with
> > > > > > the
> > > > > >    Ben representation.) We should embrace the known things in the
> > > > > >    specification.
> > > > > >    - I believe that it is a mistake for the inner workings of the
> > plan
> > > > to
> > > > > >    ever use field names. Only input (e.g. file read) and Output
> > (e.g.
> > > > > > return
> > > > > >    to user or write to file) need to have field names. For the
> > rest of
> > > > > the
> > > > > >    system, using field ordinals (determinant whether nested or
> > flat) is
> > > > > > much
> > > > > >    cleaner and is how most execution systems work. For example,
> in
> > > > > Impala I
> > > > > >    believe it is called a slot. As I noted in the PR, Calcite as
> an
> > > > > > example is
> > > > > >    entirely ordinal based at the algebra level. Rowtypes contain
> > field
> > > > > > names
> > > > > >    but they are actually basically pointless. Field references
> use
> > > > > > RexInputRef
> > > > > >    with ordinal based and rules around column order output (e.g.
> > what
> > > > is
> > > > > > the
> > > > > >    field order of a join output) are determinant and done
> entirely
> > at
> > > > an
> > > > > >    ordinal level. The only place where names should be used
> > (besides
> > > > > >    input/output) is in the case of map keys. In that case, the
> > names
> > > > are
> > > > > >    actually data, as opposed to scheme metadata. This is why I
> > propose
> > > > a
> > > > > >    strongly structured dereference operation [1].
> > > > > >    - Properties should only be included in the serialization if
> > they
> > > > are
> > > > > >    not easily re-derivable at plan consumption time. For example,
> > > > you'll
> > > > > > note
> > > > > >    that I don't store schema information for relational
> operation.
> > Each
> > > > > >    function and relational operation should already know how a
> > given
> > > > > input
> > > > > > is
> > > > > >    transformed to a given output. Capturing this information in
> the
> > > > > > plan/IR is
> > > > > >    excessive. In many ways, I compare it to the early use of
> > > > VectorLayout
> > > > > > [2]
> > > > > >    in Arrow schema. It may have provided some additional checksum
> > of
> > > > the
> > > > > >    message but ultimately it caused more pain than it was worth
> > (and
> > > > thus
> > > > > > we
> > > > > >    removed it before formalizing the specification). For
> > reference, in
> > > > > the
> > > > > >    context of Dremio, we used to actually do this, send schema
> > > > > information
> > > > > >    around for all operations. We removed it because in many cases
> > > > > becoming
> > > > > > the
> > > > > >    majority of our internal plan serialization (imagine simple
> > > > operations
> > > > > > that
> > > > > >    are carrying 1000s of fields).
> > > > > >    - I suggest focusing on support for both logical and physical
> > > > > >    representations. The moment you start talking about
> optimization
> > > > > passes,
> > > > > >    many of those would probably be better being done at the
> logical
> > > > > level.
> > > > > > The
> > > > > >    overlap is really high.
> > > > > >    - I think a lot more work must be done before introducing UDFs
> > and
> > > > > user
> > > > > >    defined relational operations. I see one goal being the
> > possibility
> > > > of
> > > > > >    there being three systems: A -> B -> C. A is a IR producer. C
> > is a
> > > > IR
> > > > > >    consumer and B is a IR filter or translator. In this
> situation,
> > B
> > > > > > should be
> > > > > >    able to operate and do optimizations on a plan even if if
> there
> > are
> > > > > > black
> > > > > >    box user defined operations. Being able to know the
> > > > > > properties-preservation
> > > > > >    or not of these operations is important to making decisions.
> For
> > > > > > example,
> > > > > >    does a user defined relational operation maintain sortedness?
> > > > > > Distribution?
> > > > > >    Is a defined UDF idempotent? As such, I think the definition
> of
> > > > those
> > > > > > black
> > > > > >    boxes should be much more structured. For example: it is a
> > python
> > > > > >    relational operation named X stored in Y that maintains
> > properties
> > > > 1,2
> > > > > > and
> > > > > >    disrupts property 3. Putting just a black box of bytes will
> > > > > > substantially
> > > > > >    reduce the compatibility and extensibility of the ecosystem of
> > tools
> > > > > >    working against IR. I'd note that I wouldn't expect this to
> be a
> > > > > burden
> > > > > > to
> > > > > >    actual end users. By using sensible defaults, I still would
> > expect
> > > > an
> > > > > > end
> > > > > >    user tool to support arbitrary user defined operations.
> > > > > >    - It might make sense to review the XML representation that
> Orca
> > > > uses
> > > > > >    [3]. I haven't looked at it recently but they had a strong
> goal
> > of
> > > > > >    decoupling for most of its life (to use in both Greenplum and
> > Hawq).
> > > > > It
> > > > > >    could be the most mature/formal serialization of query plans
> > > > > publically
> > > > > >    available.
> > > > > >
> > > > > >
> > > > > > [1]
> > > > > >
> > > > > >
> > > > >
> > > >
> >
> https://github.com/apache/arrow/pull/10979/files#diff-e40fbc40cf7a131efd2cb098444931774cfad046b8665b38452258ffaa2e3423R34
> > > > > > [2]
> > > > > >
> > > > > >
> > > > >
> > > >
> >
> https://github.com/apache/arrow/commit/611a4b951e24f4f967c3d382a2027dc035fc37f0
> > > > > > [3] https://github.com/greenplum-db/gporca
> > > > > >
> > > > > >
> > > > > > On Tue, Aug 17, 2021 at 11:14 AM Phillip Cloud <
> cpcloud@gmail.com>
> > > > > wrote:
> > > > > >
> > > > > > > On Tue, Aug 17, 2021 at 10:56 AM Wes McKinney <
> > wesmckinn@gmail.com>
> > > > > > wrote:
> > > > > > >
> > > > > > > > Looking at Ben's alternate PR [1], having an IR that leans
> > heavily
> > > > on
> > > > > > > > memory references to an out-of-band data sidecar seems like
> an
> > > > > > > > approach that would substantially ratchet up the
> implementation
> > > > > > > > complexity as producing the IR would then have the level of
> > > > > complexity
> > > > > > > > of producing the Arrow IPC format — when producing the "root"
> > Plan
> > > > > > > > message, you must accumulate a list of the dependent
> serialized
> > > > > > > > submessages, placing the appropriate Buffer memory offset in
> > the
> > > > Plan
> > > > > > > > message, like we do when producing the RecordBatch.buffers
> > field.
> > > > > This
> > > > > > > > seems complicated to me as you must devise a custom binary
> > protocol
> > > > > to
> > > > > > > > concatenate the serialized Plan and the messages it depends
> on
> > > > into a
> > > > > > > > single binary payload
> > > > > > > >
> > > > > > > > <ROOT PLAN>
> > > > > > > > <padding>
> > > > > > > > <Buffer 0>
> > > > > > > > <padding>
> > > > > > > > <Buffer 1>
> > > > > > > > <padding>
> > > > > > > > ...
> > > > > > > > <Buffer N - 1>
> > > > > > > >
> > > > > > > > (one purpose of FlatBufferBuilder is to spare you having to
> do
> > this
> > > > > > > > yourself — some reasons we do it for the Arrow IPC format is
> > > > because
> > > > > > > > appending Arrow memory buffers directly to a
> FlatBufferBuilder
> > > > would
> > > > > > > > be inefficient — internal realloc calls — and Flatbuffers are
> > > > limited
> > > > > > > > to 2GB. Neither of these things are problems here)
> > > > > > > >
> > > > > > > > In general, I believe the output created by an IR producer
> > should
> > > > be
> > > > > a
> > > > > > > > single serialized object without any out-of-band data
> sidecar —
> > > > this
> > > > > > > > is much simpler for implementers and we can still provide an
> > > > "escape
> > > > > > > > hatch" for user-defined operators and functions where the
> > required
> > > > > > > > function/operator is passed opaquely as an embedded binary
> > data.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > The serialization format (whether it is Flatbuffers or JSON, or
> > > > > > > > something else) should allow for data memoization, so if
> there
> > is a
> > > > > > > > user-defined operator/function, or a relation that is used
> > multiple
> > > > > > > > times throughout the query (potentially with a large schema),
> > then
> > > > we
> > > > > > > > should ensure that the data need not be duplicated in the
> > > > > > > > serialization format unnecessarily — in Flatbuffers, you can
> > > > achieve
> > > > > > > > this by reusing offsets, but we could devise the data
> > structures to
> > > > > > > > make the memoization of "expensive" objects more explicit.
> > > > > > > >
> > > > > > >
> > > > > > > I think this is something that would need to be explicitly
> > encoded in
> > > > > > > the structures themselves if it's a hard requirement. I don't
> > think
> > > > > this
> > > > > > > should block
> > > > > > > a prototype producer/consumer.
> > > > > > >
> > > > > > > Is there something in the second PR/design that precludes the
> > reuse
> > > > of
> > > > > > > offsets?
> > > > > > > To my eye, the flatbuffers offset reuse mechanism works just as
> > well
> > > > > > there.
> > > > > > >
> > > > > > >
> > > > > > > > I additionally think that it is important to provide as much
> > > > built-in
> > > > > > > > support for "canonical" operators/functions (such as the ones
> > > > > > > > implemented commonly by SQL engines) as possible, and to
> > liberally
> > > > > > > > expand the catalogue of "built-in" capabilities. I would
> still
> > > > prefer
> > > > > > > > to have large unions/enums of built-in operators/functions
> and
> > to
> > > > > > > > expand those unions/enums to accommodate new things when it
> is
> > > > > > > > demonstrated that there is a need to standardize things
> between
> > > > > > > > producers/consumers.
> > > > > > > >
> > > > > > >
> > > > > > > I think there's a middle ground where we add a bit of structure
> > > > > > (something
> > > > > > > like
> > > > > > > a descriptor from the first PR) to indicate whether a thing is
> > > > built-in
> > > > > > vs
> > > > > > > user-defined.
> > > > > > > It looks like Ben has pushed something like this to his PR.
> > > > > > >
> > > > > > > With that scheme, we have both flexibility and a small set of
> > special
> > > > > > > builtins that make up
> > > > > > > a statically typed set for expressions and relational
> operators.
> > > > > > >
> > > > > > > I would really like to vet this PR with a prototype this week,
> > > > > > > to see whether we need to revisit any major choices. I don't
> > think
> > > > > we'll
> > > > > > be
> > > > > > > able to
> > > > > > > anticipate all the consequences until we write some code.
> > > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > One of the beneficial properties of the Union/Enum approach
> > for the
> > > > > > > > operator/function catalogues, is that when there are
> additions
> > to
> > > > > > > > those enums, the generated Flatbuffers files will cause many
> > > > language
> > > > > > > > compilers to warn or error on unhandled enum cases. If all
> > > > > > > > function/operator names are strings, then you are essentially
> > > > > > > > reimplementing the functionality provided by enums by hand. I
> > > > > > > > initially used strings for all function references in my
> > original
> > > > > > > > prototype, but I now think that using an enum for "built-ins"
> > would
> > > > > be
> > > > > > > > superior (because of the code-generated enum interface) and
> > not a
> > > > > > > > premature optimization.
> > > > > > > >
> > > > > > > > [1]: https://github.com/apache/arrow/pull/10934
> > > > > > > >
> > > > > > > > On Fri, Aug 13, 2021 at 11:26 PM Phillip Cloud <
> > cpcloud@gmail.com>
> > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > Hey all,
> > > > > > > > >
> > > > > > > > > Just wanted to give an update on the effort here.
> > > > > > > > >
> > > > > > > > > Ben Kietzman has created an alternative proposal to the
> > initial
> > > > > > design
> > > > > > > > [1].
> > > > > > > > > It largely overlaps with the original, but differs in a few
> > > > > important
> > > > > > > > ways:
> > > > > > > > >
> > > > > > > > > * A big focus of the design is on flexibility, allowing
> > > > producers,
> > > > > > > > > consumers and ultimately end users of those systems the
> > ability
> > > > to
> > > > > > > define
> > > > > > > > > custom operations in the graph.
> > > > > > > > > * There are very few predefined relational operations
> > (project,
> > > > > > filter,
> > > > > > > > > join and a handful of others).
> > > > > > > > > * There are only 3 types of value expressions: literals,
> > field
> > > > > > > > references,
> > > > > > > > > and function calls.
> > > > > > > > > * The model of evaluation is one that requires a final sink
> > node,
> > > > > to
> > > > > > > > > indicate where the record batches from child relations end
> > up (a
> > > > > > file,
> > > > > > > a
> > > > > > > > > socket, an in-memory buffer, etc).
> > > > > > > > >
> > > > > > > > > I've added notes [2] to the original Google doc (under the
> > > > > > Alternative
> > > > > > > > > Design Notes subheading), and a few pseudocode examples.
> > > > > > > > > Unfortunately, these went out of date as soon as Ben pushed
> > the
> > > > PR
> > > > > > [3],
> > > > > > > > so
> > > > > > > > > I need to update those to reflect his changes. Regardless,
> > > > > > > > > the design is broadly the same, so it should still give a
> > good
> > > > > > > indication
> > > > > > > > > of the details of the design.
> > > > > > > > >
> > > > > > > > > There are a decent number of review comments on the
> original
> > PR
> > > > > that
> > > > > > I
> > > > > > > > plan
> > > > > > > > > to port over where they are still relevant.
> > > > > > > > > I also plan on adding support for window functions either
> > tonight
> > > > > or
> > > > > > on
> > > > > > > > > Monday.
> > > > > > > > >
> > > > > > > > > Please review this design at your earliest convenience.
> Since
> > > > > > there's a
> > > > > > > > > fairly concrete set of types in flatbuffers that
> > > > > > > > > we can look at, ideally we can center discussion around the
> > > > details
> > > > > > in
> > > > > > > > the
> > > > > > > > > PR.
> > > > > > > > >
> > > > > > > > > Thanks!
> > > > > > > > >
> > > > > > > > > [1]: https://github.com/apache/arrow/pull/10856
> > > > > > > > > [2]:
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> >
> https://docs.google.com/document/d/1C_XVOG7iFkl6cgWWMyzUoIjfKt-X2UxqagPJrla0bAE/edit#heading=h.4tfbbtaqzu13
> > > > > > > > > [3]: https://github.com/apache/arrow/pull/10934
> > > > > > > > >
> > > > > > > > > On Thu, Aug 12, 2021 at 3:55 PM Julian Hyde <
> > > > > jhyde.apache@gmail.com>
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > > Wes wrote:
> > > > > > > > > > >
> > > > > > > > > > > Supporting this kind of intra-application engine
> > > > > > > > > > > heterogeneity is one of the motivations for the
> project.
> > > > > > > > > >
> > > > > > > > > > +1
> > > > > > > > > >
> > > > > > > > > > The data format is the natural interface between tasks.
> > > > (Defining
> > > > > > > > “task”
> > > > > > > > > > here as “something that is programmed using the IR”.)
> That
> > is
> > > > > > Arrow’s
> > > > > > > > > > strength.
> > > > > > > > > >
> > > > > > > > > > So I think the IR should describe what each task should
> > do, and
> > > > > > tasks
> > > > > > > > > > should be fairly small. Not whole relational operators,
> > > > operating
> > > > > > on
> > > > > > > > whole
> > > > > > > > > > tables, but pieces of relational operators, operating on
> > > > batches
> > > > > or
> > > > > > > > > > sequences of batches.
> > > > > > > > > >
> > > > > > > > > > Elsethread, someone mentioned the LoLePop concept and the
> > > > > > > > Kohn/Leis/Neuman
> > > > > > > > > > paper [1]. The LoLePop concept sounds good for our
> > purposes.
> > > > > > > > > >
> > > > > > > > > > Julian
> > > > > > > > > >
> > > > > > > > > > [1]
> > https://db.in.tum.de/~kohn/papers/lolepops-sigmod21.pdf
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > On Aug 12, 2021, at 5:19 AM, Wes McKinney <
> > > > wesmckinn@gmail.com
> > > > > >
> > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > On Wed, Aug 11, 2021 at 11:22 PM Phillip Cloud <
> > > > > > cpcloud@gmail.com>
> > > > > > > > > > wrote:
> > > > > > > > > > >>
> > > > > > > > > > >> On Wed, Aug 11, 2021 at 4:48 PM Jorge Cardoso Leitão <
> > > > > > > > > > >> jorgecarleitao@gmail.com> wrote:
> > > > > > > > > > >>
> > > > > > > > > > >>> Couple of questions
> > > > > > > > > > >>>
> > > > > > > > > > >>> 1. Is the goal that IRs have equal semantics, i.e.
> > given
> > > > > > > > (IR,data), the
> > > > > > > > > > >>> operation "(IR,data) - engine -> result" MUST be the
> > same
> > > > for
> > > > > > all
> > > > > > > > > > "engine"?
> > > > > > > > > > >>>
> > > > > > > > > > >>
> > > > > > > > > > >> I think that might be a non-starter for mundane
> reasons:
> > > > > there's
> > > > > > > > > > probably
> > > > > > > > > > >> at least two engines
> > > > > > > > > > >> that disagree on the result of sum(x) where x is a
> > floating
> > > > > > point
> > > > > > > > > > column.
> > > > > > > > > > >>
> > > > > > > > > > >>
> > > > > > > > > > >>> 2. if yes, imo we may need to worry about:
> > > > > > > > > > >>> * a definition of equality that implementations agree
> > on.
> > > > > > > > > > >>> * agreement over what the semantics look like. For
> > example,
> > > > > do
> > > > > > we
> > > > > > > > use
> > > > > > > > > > >>> kleene logic for AND and OR?
> > > > > > > > > > >>>
> > > > > > > > > > >>
> > > > > > > > > > >> WRT Kleene logic my thoughts are that the IR should
> > support
> > > > > both
> > > > > > > > Kleene
> > > > > > > > > > and
> > > > > > > > > > >> non-Kleene,
> > > > > > > > > > >> and producers can choose their desired semantics.
> > > > > > > > > > >>
> > > > > > > > > > >> Ibis for example, would override the `&` operator in
> `a
> > & b`
> > > > > to
> > > > > > > > produce
> > > > > > > > > > >> `KleeneAnd(Column(a), Column(b))`.
> > > > > > > > > > >>
> > > > > > > > > > >>
> > > > > > > > > > >>>
> > > > > > > > > > >>> To try to understand the gist, let's pick an
> aggregated
> > > > count
> > > > > > > over
> > > > > > > > a
> > > > > > > > > > >>> column: engines often do partial counts over
> partitions
> > > > > > followed
> > > > > > > > by a
> > > > > > > > > > final
> > > > > > > > > > >>> "sum" over the partial counts. Is the idea that the
> > query
> > > > > > engine
> > > > > > > > would
> > > > > > > > > > >>> communicate with the compute engine via 2 IRs where
> > one is
> > > > > > "count
> > > > > > > > me
> > > > > > > > > > these"
> > > > > > > > > > >>> the other is "sum me these"?
> > > > > > > > > > >>>
> > > > > > > > > > >>> Best,
> > > > > > > > > > >>> Jorge
> > > > > > > > > > >>>
> > > > > > > > > > >>
> > > > > > > > > > >> Not in its current incarnation.
> > > > > > > > > > >>
> > > > > > > > > > >> The idea is that the IR producer communicates a desire
> > to
> > > > > > count(x)
> > > > > > > > to a
> > > > > > > > > > >> consumer, and  it's up to the consumer to figure out
> > how to
> > > > > turn
> > > > > > > > that
> > > > > > > > > > count
> > > > > > > > > > >> into something that makes sense for itself. In your
> > example
> > > > > > > that's a
> > > > > > > > > > series
> > > > > > > > > > >> of partial counts followed by a sum.
> > > > > > > > > > >>
> > > > > > > > > > >
> > > > > > > > > > > That said, I think there is a valid use case here
> where a
> > > > > system
> > > > > > > > might
> > > > > > > > > > > make use of different engines to execute different
> > > > (composable)
> > > > > > > > layers
> > > > > > > > > > > of a complex query.
> > > > > > > > > > >
> > > > > > > > > > > For example:
> > > > > > > > > > >
> > > > > > > > > > > * suppose you want to scan and do predicate pushdown on
> > an
> > > > > > unusual
> > > > > > > > > > > data source that is only accessible from one particular
> > > > engine
> > > > > > but
> > > > > > > > > > > * you need to do some analytical operation with the
> scan
> > > > > results
> > > > > > > that
> > > > > > > > > > > is only supported by another engine
> > > > > > > > > > >
> > > > > > > > > > > You could decompose the query into two stages with an
> IR
> > > > > > relational
> > > > > > > > > > > expression for each stage and use then the engines
> > together
> > > > to
> > > > > > > > > > > accomplish what you need (of course, you would need an
> > > > > > > orchestration
> > > > > > > > > > > layer to handle plumbing the query engine inputs and
> > outputs
> > > > > > > together
> > > > > > > > > > > as Arrow streams). Supporting this kind of
> > intra-application
> > > > > > engine
> > > > > > > > > > > heterogeneity is one of the motivations for the
> project.
> > > > > > > > > > >
> > > > > > > > > > >>>
> > > > > > > > > > >>>
> > > > > > > > > > >>>
> > > > > > > > > > >>>
> > > > > > > > > > >>>
> > > > > > > > > > >>> On Wed, Aug 11, 2021 at 6:10 PM Phillip Cloud <
> > > > > > cpcloud@gmail.com
> > > > > > > >
> > > > > > > > > > wrote:
> > > > > > > > > > >>>
> > > > > > > > > > >>>> Thanks Wes,
> > > > > > > > > > >>>>
> > > > > > > > > > >>>> Great to be back working on Arrow again and engaging
> > with
> > > > > the
> > > > > > > > > > community.
> > > > > > > > > > >>> I
> > > > > > > > > > >>>> am really excited about this effort.
> > > > > > > > > > >>>>
> > > > > > > > > > >>>> I think there are a number of concerns I see as
> > important
> > > > to
> > > > > > > > address
> > > > > > > > > > in
> > > > > > > > > > >>> the
> > > > > > > > > > >>>> compute IR proposal:
> > > > > > > > > > >>>>
> > > > > > > > > > >>>> 1. Requirement for output types.
> > > > > > > > > > >>>>
> > > > > > > > > > >>>> I think that so far there's been many reasons for
> > > > requiring
> > > > > > > > > > conforming IR
> > > > > > > > > > >>>> producers and consumers to adhere to output types,
> > but I
> > > > > > haven't
> > > > > > > > seen
> > > > > > > > > > a
> > > > > > > > > > >>>> strong rationale for keeping them optional (in the
> > > > semantic
> > > > > > > > sense, not
> > > > > > > > > > >>> WRT
> > > > > > > > > > >>>> any particular serialization format's representation
> > of
> > > > > > > > optionality).
> > > > > > > > > > >>>>
> > > > > > > > > > >>>> I think a design that includes unambiguous semantics
> > for
> > > > > > output
> > > > > > > > types
> > > > > > > > > > (a
> > > > > > > > > > >>>> consumer must produce a value of the requested type
> or
> > > > it's
> > > > > an
> > > > > > > > > > >>>> error/non-conforming) is simpler to reason about for
> > > > > > producers,
> > > > > > > > and
> > > > > > > > > > >>>> provides a strong guarantee for end users (humans or
> > > > > machines
> > > > > > > > > > >>> constructing
> > > > > > > > > > >>>> IR from an API and expecting the thing they ask for
> > back
> > > > > from
> > > > > > > the
> > > > > > > > IR
> > > > > > > > > > >>>> consumer).
> > > > > > > > > > >>>>
> > > > > > > > > > >>>> 2. Flexibility
> > > > > > > > > > >>>>
> > > > > > > > > > >>>> The current PR is currently unable to support what I
> > think
> > > > > are
> > > > > > > > killer
> > > > > > > > > > >>>> features of the IR: custom operators (relational or
> > > > column)
> > > > > > and
> > > > > > > > UDFs.
> > > > > > > > > > In
> > > > > > > > > > >>> my
> > > > > > > > > > >>>> mind, on top of the generalized compute description
> > that
> > > > the
> > > > > > IR
> > > > > > > > > > offers,
> > > > > > > > > > >>> the
> > > > > > > > > > >>>> ability for producers and consumers of IR to extend
> > the IR
> > > > > > > without
> > > > > > > > > > >>> needing
> > > > > > > > > > >>>> to modify Arrow or depend on anything except the
> > format is
> > > > > > > itself
> > > > > > > > > > >>> something
> > > > > > > > > > >>>> that is necessary to gain adoption.
> > > > > > > > > > >>>>
> > > > > > > > > > >>>> Developers will need to build custom relational
> > operators
> > > > > > (e.g.,
> > > > > > > > > > scans of
> > > > > > > > > > >>>> backends that don't exist anywhere for which a user
> > has
> > > > code
> > > > > > to
> > > > > > > > > > >>> implement)
> > > > > > > > > > >>>> and custom functions (anything operating on a column
> > that
> > > > > > > doesn't
> > > > > > > > > > already
> > > > > > > > > > >>>> exist, really). Furthermore, I think we can actually
> > drive
> > > > > > > > building an
> > > > > > > > > > >>>> Arrow consumer using the same API that an end user
> > would
> > > > use
> > > > > > to
> > > > > > > > extend
> > > > > > > > > > >>> the
> > > > > > > > > > >>>> IR.
> > > > > > > > > > >>>>
> > > > > > > > > > >>>> 3. Window Functions
> > > > > > > > > > >>>>
> > > > > > > > > > >>>> Window functions are, I think, an important part of
> > the IR
> > > > > > value
> > > > > > > > > > >>>> proposition, as they are one of the more complex
> > operators
> > > > > in
> > > > > > > > > > databases.
> > > > > > > > > > >>> I
> > > > > > > > > > >>>> think we need to have something in the initial IR
> > proposal
> > > > > to
> > > > > > > > support
> > > > > > > > > > >>> these
> > > > > > > > > > >>>> operations.
> > > > > > > > > > >>>>
> > > > > > > > > > >>>> 4. Non relational Joins
> > > > > > > > > > >>>>
> > > > > > > > > > >>>> Things like as-of join and window join operators
> > aren't
> > > > yet
> > > > > > > > fleshed
> > > > > > > > > > out
> > > > > > > > > > >>> in
> > > > > > > > > > >>>> the IR, and I'm not sure they should be in scope for
> > the
> > > > > > initial
> > > > > > > > > > >>> prototype.
> > > > > > > > > > >>>> I think once we settle on a design, we can work the
> > design
> > > > > of
> > > > > > > > these
> > > > > > > > > > >>>> particular operators out during the initial
> > prototype. I
> > > > > think
> > > > > > > the
> > > > > > > > > > >>>> specification of these operators should basically be
> > PR #2
> > > > > > after
> > > > > > > > the
> > > > > > > > > > >>>> initial design lands.
> > > > > > > > > > >>>>
> > > > > > > > > > >>>> # Order of Work
> > > > > > > > > > >>>>
> > > > > > > > > > >>>> 1. Nail down the design. Anything else is a
> > non-starter.
> > > > > > > > > > >>>>
> > > > > > > > > > >>>> 2. Prototype an IR producer using Ibis
> > > > > > > > > > >>>>
> > > > > > > > > > >>>> Ibis is IMO a good candidate for a first IR producer
> > as it
> > > > > > has a
> > > > > > > > > > number
> > > > > > > > > > >>> of
> > > > > > > > > > >>>> desirable properties that make prototyping faster
> and
> > > > allow
> > > > > > for
> > > > > > > > us to
> > > > > > > > > > >>>> refine the design of the IR as needed based on how
> the
> > > > > > > > implementation
> > > > > > > > > > >>> goes:
> > > > > > > > > > >>>> * It's written in Python so it has native support
> for
> > > > nearly
> > > > > > all
> > > > > > > > of
> > > > > > > > > > >>>> flatbuffers' features without having to creating
> > bindings
> > > > to
> > > > > > > C++.
> > > > > > > > > > >>>> * There's already a set of rules for type checking,
> as
> > > > well
> > > > > as
> > > > > > > > APIs
> > > > > > > > > > for
> > > > > > > > > > >>>> constructing expression trees, which means we don't
> > need
> > > > to
> > > > > > > worry
> > > > > > > > > > about
> > > > > > > > > > >>>> building a type checker for the prototype.
> > > > > > > > > > >>>>
> > > > > > > > > > >>>> 3. Prototype an IR consumer in C++
> > > > > > > > > > >>>>
> > > > > > > > > > >>>> I think in parallel to the producer prototype we can
> > > > further
> > > > > > > > inform
> > > > > > > > > > the
> > > > > > > > > > >>>> design from the consumer side by prototyping an IR
> > > > consumer
> > > > > in
> > > > > > > > C++ . I
> > > > > > > > > > >>> know
> > > > > > > > > > >>>> Ben Kietzman has expressed interest in working on
> > this.
> > > > > > > > > > >>>>
> > > > > > > > > > >>>> Very interested to hear others' thoughts.
> > > > > > > > > > >>>>
> > > > > > > > > > >>>> -Phillip
> > > > > > > > > > >>>>
> > > > > > > > > > >>>> On Tue, Aug 10, 2021 at 10:56 AM Wes McKinney <
> > > > > > > > wesmckinn@gmail.com>
> > > > > > > > > > >>> wrote:
> > > > > > > > > > >>>>
> > > > > > > > > > >>>>> Thank you for all the feedback and comments on the
> > > > > document.
> > > > > > > I'm
> > > > > > > > on
> > > > > > > > > > >>>>> vacation this week, so I'm delayed responding to
> > > > > everything,
> > > > > > > but
> > > > > > > > I
> > > > > > > > > > >>>>> will get to it as quickly as I can. I will be at
> > VLDB in
> > > > > > > > Copenhagen
> > > > > > > > > > >>>>> next week if anyone would like to chat in person
> > about
> > > > it,
> > > > > > and
> > > > > > > > we can
> > > > > > > > > > >>>>> relay the content of any discussions back to the
> > > > > > > > document/PR/e-mail
> > > > > > > > > > >>>>> thread.
> > > > > > > > > > >>>>>
> > > > > > > > > > >>>>> I know that Phillip Cloud expressed interest in
> > working
> > > > on
> > > > > > the
> > > > > > > > PR and
> > > > > > > > > > >>>>> helping work through many of the details, so I'm
> > glad to
> > > > > have
> > > > > > > the
> > > > > > > > > > >>>>> help. If there are others who would like to work on
> > the
> > > > PR
> > > > > or
> > > > > > > dig
> > > > > > > > > > into
> > > > > > > > > > >>>>> the details, please let me know. We might need to
> > figure
> > > > > out
> > > > > > > how
> > > > > > > > to
> > > > > > > > > > >>>>> accommodate "many cooks" by setting up the
> ComputeIR
> > > > > project
> > > > > > > > > > somewhere
> > > > > > > > > > >>>>> separate from the format/ directory to permit it to
> > exist
> > > > > in
> > > > > > a
> > > > > > > > > > >>>>> Work-In-Progress status for a period of time until
> we
> > > > work
> > > > > > > > through
> > > > > > > > > > the
> > > > > > > > > > >>>>> various details and design concerns.
> > > > > > > > > > >>>>>
> > > > > > > > > > >>>>> Re Julian's comment
> > > > > > > > > > >>>>>
> > > > > > > > > > >>>>>> The biggest surprise is that this language does
> full
> > > > > > > relational
> > > > > > > > > > >>>>> operations. I was expecting that it would do
> > fragments of
> > > > > the
> > > > > > > > > > >>> operations.
> > > > > > > > > > >>>>>
> > > > > > > > > > >>>>> There's a related but different (yet still
> > interesting
> > > > and
> > > > > > > > worthy of
> > > > > > > > > > >>>>> analysis) problem of creating an "engine language"
> > that
> > > > > > > describes
> > > > > > > > > > more
> > > > > > > > > > >>>>> mechanically the constituent parts of implementing
> > the
> > > > > > > relational
> > > > > > > > > > >>>>> operators. To create a functional computation
> > language
> > > > with
> > > > > > > > concrete
> > > > > > > > > > >>>>> Arrow data structures as a top-level primitive
> sounds
> > > > like
> > > > > an
> > > > > > > > > > >>>>> interesting research area where I could see
> something
> > > > > > > developing
> > > > > > > > > > >>>>> eventually.
> > > > > > > > > > >>>>>
> > > > > > > > > > >>>>> The main problem I'm interested in solving right
> now
> > is
> > > > > > > enabling
> > > > > > > > > > front
> > > > > > > > > > >>>>> ends that have sufficient understanding of
> relational
> > > > > algebra
> > > > > > > and
> > > > > > > > > > data
> > > > > > > > > > >>>>> frame operations to talk to engines without having
> > to go
> > > > > > > > backwards
> > > > > > > > > > >>>>> from their logical query plans to SQL. So as
> > mentioned in
> > > > > the
> > > > > > > > > > >>>>> document, being able to faithfully carry the
> > relational
> > > > > > > operator
> > > > > > > > node
> > > > > > > > > > >>>>> information generated by Calcite or Ibis or another
> > > > system
> > > > > > > would
> > > > > > > > be
> > > > > > > > > > >>>>> super useful. Defining the semantics of various
> > kinds of
> > > > > > > > user-defined
> > > > > > > > > > >>>>> functions would also be helpful to standardize the
> > > > > > > > > > >>>>> engine-to-user-language UDF/extension interface.
> > > > > > > > > > >>>>>
> > > > > > > > > > >>>>> On Tue, Aug 10, 2021 at 2:36 PM Dimitri Vorona <
> > > > > > > > alendit@gmail.com>
> > > > > > > > > > >>>> wrote:
> > > > > > > > > > >>>>>>
> > > > > > > > > > >>>>>> Hi Wes,
> > > > > > > > > > >>>>>>
> > > > > > > > > > >>>>>> cool initiative! Reminded me of "Building Advanced
> > SQL
> > > > > > > Analytics
> > > > > > > > > > From
> > > > > > > > > > >>>>>> Low-Level Plan Operators" from SIGMOD 2021 (
> > > > > > > > > > >>>>>>
> > http://db.in.tum.de/~kohn/papers/lolepops-sigmod21.pdf)
> > > > > > which
> > > > > > > > > > >>>> proposes a
> > > > > > > > > > >>>>>> set of building block for advanced aggregation.
> > > > > > > > > > >>>>>>
> > > > > > > > > > >>>>>> Cheers,
> > > > > > > > > > >>>>>> Dimitri.
> > > > > > > > > > >>>>>>
> > > > > > > > > > >>>>>> On Thu, Aug 5, 2021 at 7:59 PM Julian Hyde <
> > > > > > > > jhyde.apache@gmail.com>
> > > > > > > > > > >>>>> wrote:
> > > > > > > > > > >>>>>>
> > > > > > > > > > >>>>>>> Wes,
> > > > > > > > > > >>>>>>>
> > > > > > > > > > >>>>>>> Thanks for this. I’ve added comments to the doc
> > and to
> > > > > the
> > > > > > > PR.
> > > > > > > > > > >>>>>>>
> > > > > > > > > > >>>>>>> The biggest surprise is that this language does
> > full
> > > > > > > relational
> > > > > > > > > > >>>>>>> operations. I was expecting that it would do
> > fragments
> > > > of
> > > > > > the
> > > > > > > > > > >>>>> operations.
> > > > > > > > > > >>>>>>> Consider join. A distributed hybrid hash join
> > needs to
> > > > > > > > partition
> > > > > > > > > > >>> rows
> > > > > > > > > > >>>>> into
> > > > > > > > > > >>>>>>> output buffers based on a hash key, build hash
> > tables,
> > > > > > probe
> > > > > > > > into
> > > > > > > > > > >>>> hash
> > > > > > > > > > >>>>>>> tables, scan hash tables for untouched
> > “outer”rows, and
> > > > > so
> > > > > > > > forth.
> > > > > > > > > > >>>>>>>
> > > > > > > > > > >>>>>>> I see Arrow’s compute as delivering each of those
> > > > > > operations,
> > > > > > > > > > >>> working
> > > > > > > > > > >>>>> on
> > > > > > > > > > >>>>>>> perhaps a batch at a time, or a sequence of
> > batches,
> > > > with
> > > > > > > some
> > > > > > > > > > >>> other
> > > > > > > > > > >>>>> system
> > > > > > > > > > >>>>>>> coordinating those tasks. So I would expect to
> see
> > > > > Arrow’s
> > > > > > > > compute
> > > > > > > > > > >>>>> language
> > > > > > > > > > >>>>>>> mainly operating on batches rather than a table
> > > > > > abstraction.
> > > > > > > > > > >>>>>>>
> > > > > > > > > > >>>>>>> Julian
> > > > > > > > > > >>>>>>>
> > > > > > > > > > >>>>>>>
> > > > > > > > > > >>>>>>>> On Aug 2, 2021, at 5:16 PM, Wes McKinney <
> > > > > > > wesmckinn@gmail.com
> > > > > > > > >
> > > > > > > > > > >>>>> wrote:
> > > > > > > > > > >>>>>>>>
> > > > > > > > > > >>>>>>>> hi folks,
> > > > > > > > > > >>>>>>>>
> > > > > > > > > > >>>>>>>> This idea came up in passing in the past --
> given
> > that
> > > > > > there
> > > > > > > > are
> > > > > > > > > > >>>>>>>> multiple independent efforts to develop
> > Arrow-native
> > > > > query
> > > > > > > > > > >>> engines
> > > > > > > > > > >>>>>>>> (and surely many more to come), it seems like it
> > would
> > > > > be
> > > > > > > > > > >>> valuable
> > > > > > > > > > >>>> to
> > > > > > > > > > >>>>>>>> have a way to enable user languages (like Java,
> > > > Python,
> > > > > R,
> > > > > > > or
> > > > > > > > > > >>> Rust,
> > > > > > > > > > >>>>>>>> for example) to communicate with backend
> computing
> > > > > engines
> > > > > > > > (like
> > > > > > > > > > >>>>>>>> DataFusion, or new computing capabilities being
> > built
> > > > in
> > > > > > the
> > > > > > > > > > >>> Arrow
> > > > > > > > > > >>>>> C++
> > > > > > > > > > >>>>>>>> library) in a fashion that is "lower-level" than
> > SQL
> > > > and
> > > > > > > > > > >>>> specialized
> > > > > > > > > > >>>>>>>> to Arrow's type system. So rather than leaving
> it
> > to a
> > > > > SQL
> > > > > > > > > > >>> parser /
> > > > > > > > > > >>>>>>>> analyzer framework to generate an expression
> tree
> > of
> > > > > > > > relational
> > > > > > > > > > >>>>>>>> operators and then translate that to an
> > Arrow-native
> > > > > > > > > > >>>> compute-engine's
> > > > > > > > > > >>>>>>>> internal grammer, a user framework could provide
> > the
> > > > > > desired
> > > > > > > > > > >>>>>>>> Arrow-native expression tree / data
> manipulations
> > > > > directly
> > > > > > > and
> > > > > > > > > > >>> skip
> > > > > > > > > > >>>>>>>> the SQL altogether.
> > > > > > > > > > >>>>>>>>
> > > > > > > > > > >>>>>>>> The idea of creating a "serialized intermediate
> > > > > > > representation
> > > > > > > > > > >>>> (IR)"
> > > > > > > > > > >>>>>>>> for Arrow compute operations would be to serve
> use
> > > > cases
> > > > > > > large
> > > > > > > > > > >>> and
> > > > > > > > > > >>>>>>>> small -- from the most complex TPC-* or time
> > series
> > > > > > database
> > > > > > > > > > >>> query
> > > > > > > > > > >>>> to
> > > > > > > > > > >>>>>>>> the most simple array predicate/filter sent with
> > an
> > > > RPC
> > > > > > > > request
> > > > > > > > > > >>>> using
> > > > > > > > > > >>>>>>>> Arrow Flight. It is deliberately language- and
> > > > > > > > engine-agnostic,
> > > > > > > > > > >>>> with
> > > > > > > > > > >>>>>>>> the only commonality across usages being the
> Arrow
> > > > > > columnar
> > > > > > > > > > >>> format
> > > > > > > > > > >>>>>>>> (schemas and array types). This would be better
> > than
> > > > > > leaving
> > > > > > > > it
> > > > > > > > > > >>> to
> > > > > > > > > > >>>>>>>> each application to develop its own bespoke
> > expression
> > > > > > > > > > >>>>> representations
> > > > > > > > > > >>>>>>>> for its needs.
> > > > > > > > > > >>>>>>>>
> > > > > > > > > > >>>>>>>> I spent a while thinking about this and wrote
> up a
> > > > brain
> > > > > > > dump
> > > > > > > > RFC
> > > > > > > > > > >>>>>>>> document [1] and accompanying pull request [2]
> > that
> > > > > makes
> > > > > > > the
> > > > > > > > > > >>>>> possibly
> > > > > > > > > > >>>>>>>> controversial choice of using Flatbuffers to
> > represent
> > > > > the
> > > > > > > > > > >>>> serialized
> > > > > > > > > > >>>>>>>> IR. I discuss the rationale for the choice of
> > > > > Flatbuffers
> > > > > > in
> > > > > > > > the
> > > > > > > > > > >>>> RFC
> > > > > > > > > > >>>>>>>> document. This PR is obviously deficient in many
> > > > regards
> > > > > > > > > > >>>> (incomplete,
> > > > > > > > > > >>>>>>>> hacky, or unclear in places), and will need some
> > help
> > > > > from
> > > > > > > > others
> > > > > > > > > > >>>> to
> > > > > > > > > > >>>>>>>> flesh out. I suspect that actually implementing
> > the IR
> > > > > > will
> > > > > > > be
> > > > > > > > > > >>>>>>>> necessary to work out many of the low-level
> > details.
> > > > > > > > > > >>>>>>>>
> > > > > > > > > > >>>>>>>> Note that this IR is intended to be more of a
> > > > "superset"
> > > > > > > > project
> > > > > > > > > > >>>> than
> > > > > > > > > > >>>>>>>> a "lowest common denominator". So there may be
> > things
> > > > > that
> > > > > > > it
> > > > > > > > > > >>>>> includes
> > > > > > > > > > >>>>>>>> which are only available in some engines (e.g.
> > engines
> > > > > > that
> > > > > > > > have
> > > > > > > > > > >>>>>>>> specialized handling of time series data).
> > > > > > > > > > >>>>>>>>
> > > > > > > > > > >>>>>>>> As some of my personal motivation for the
> project,
> > > > > > > concurrent
> > > > > > > > > > >>> with
> > > > > > > > > > >>>>> the
> > > > > > > > > > >>>>>>>> genesis of Apache Arrow, I started a Python
> > project
> > > > > called
> > > > > > > > Ibis
> > > > > > > > > > >>> [3]
> > > > > > > > > > >>>>>>>> (which is similar to R's dplyr project) which
> > serves
> > > > as
> > > > > a
> > > > > > > > "Python
> > > > > > > > > > >>>>>>>> analytical query IR builder" that is capable of
> > > > > generating
> > > > > > > > most
> > > > > > > > > > >>> of
> > > > > > > > > > >>>>> the
> > > > > > > > > > >>>>>>>> SQL standard, targeting many different SQL
> > dialects
> > > > and
> > > > > > > other
> > > > > > > > > > >>>>> backends
> > > > > > > > > > >>>>>>>> (like pandas). Microsoft ([4]) and Google ([5])
> > have
> > > > > used
> > > > > > > this
> > > > > > > > > > >>>>> library
> > > > > > > > > > >>>>>>>> as a "many-SQL" middleware. As such, I would
> like
> > to
> > > > be
> > > > > > able
> > > > > > > > to
> > > > > > > > > > >>>>>>>> translate between the in-memory "logical query"
> > data
> > > > > > > > structures
> > > > > > > > > > >>> in
> > > > > > > > > > >>>> a
> > > > > > > > > > >>>>>>>> library like Ibis to a serialized format that
> can
> > be
> > > > > > > executed
> > > > > > > > by
> > > > > > > > > > >>>> many
> > > > > > > > > > >>>>>>>> different Arrow-native query engines. The
> > expression
> > > > > > > > primitives
> > > > > > > > > > >>>>>>>> available in Ibis should serve as helpful test
> > cases,
> > > > > too.
> > > > > > > > > > >>>>>>>>
> > > > > > > > > > >>>>>>>> I look forward to the community's comments on
> the
> > RFC
> > > > > > > document
> > > > > > > > > > >>> [1]
> > > > > > > > > > >>>>> and
> > > > > > > > > > >>>>>>>> pull request [2] -- I realize that arriving at
> > > > consensus
> > > > > > on
> > > > > > > a
> > > > > > > > > > >>>> complex
> > > > > > > > > > >>>>>>>> and ambitious project like this can be
> > challenging so
> > > > I
> > > > > > > > recommend
> > > > > > > > > > >>>>>>>> spending time on the "non-goals" section in the
> > RFC
> > > > and
> > > > > > ask
> > > > > > > > > > >>>> questions
> > > > > > > > > > >>>>>>>> if you are unclear about the scope of what
> > problems
> > > > this
> > > > > > is
> > > > > > > > > > >>> trying
> > > > > > > > > > >>>> to
> > > > > > > > > > >>>>>>>> solve. I would be happy to give Edit access on
> > the RFC
> > > > > > > > document
> > > > > > > > > > >>> to
> > > > > > > > > > >>>>>>>> others and would consider ideas about how to
> move
> > > > > forward
> > > > > > > with
> > > > > > > > > > >>>>>>>> something that is able to be implemented by
> > different
> > > > > > Arrow
> > > > > > > > > > >>>> libraries
> > > > > > > > > > >>>>>>>> in the reasonably near future.
> > > > > > > > > > >>>>>>>>
> > > > > > > > > > >>>>>>>> Thanks,
> > > > > > > > > > >>>>>>>> Wes
> > > > > > > > > > >>>>>>>>
> > > > > > > > > > >>>>>>>> [1]:
> > > > > > > > > > >>>>>>>
> > > > > > > > > > >>>>>
> > > > > > > > > > >>>>
> > > > > > > > > > >>>
> > > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> >
> https://docs.google.com/document/d/1C_XVOG7iFkl6cgWWMyzUoIjfKt-X2UxqagPJrla0bAE/edit#
> > > > > > > > > > >>>>>>>> [2]: https://github.com/apache/arrow/pull/10856
> > > > > > > > > > >>>>>>>> [3]: https://ibis-project.org/
> > > > > > > > > > >>>>>>>> [4]:
> > > > > > http://cidrdb.org/cidr2021/papers/cidr2021_paper08.pdf
> > > > > > > > > > >>>>>>>> [5]:
> > > > > > > > > > >>>>>>>
> > > > > > > > > > >>>>>
> > > > > > > > > > >>>>
> > > > > > > > > > >>>
> > > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> >
> https://cloud.google.com/blog/products/databases/automate-data-validation-with-dvt
> > > > > > > > > > >>>>>>>
> > > > > > > > > > >>>>>>>
> > > > > > > > > > >>>>>
> > > > > > > > > > >>>>
> > > > > > > > > > >>>
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> >
> >
>

Re: [DISCUSS] Developing an "Arrow Compute IR [Intermediate Representation]" to decouple language front ends from Arrow-native compute engines

Posted by Phillip Cloud <cp...@gmail.com>.

Hey everyone,

As many of you know, the compute IR project has a lot of interested parties
and has generated a lot of feedback. In light of some of the feedback we’ve
received, we want to stress that the specification is intended to have
input from many diverse points of view and that we welcome folks outside of
the Arrow community. We think there’s immense potential for a compute IR
that multiple projects--including those outside of the Arrow umbrella--can
leverage.

With that in mind, Jacques has been working on something outside of the
Arrow repo that’ll be shared in a few days, that is designed to bring those
viewpoints to bear on the problem of generic relational computation that
lives outside of the Arrow project.

Inside Arrow, we think that a version of the in-development IR
specifications from the last several weeks will add a ton of value by
informing this new effort and would like to continue to move forward with a
work-in-progress IR inside of Arrow for the time being to enable some work
on API development (independent of exactly how things are serialized) to
take place. It is very likely that we will adopt this broader specification
once the dust has settled, so the format inside of Arrow will be relatively
unstable for a while and not have backwards compatibility guarantees for
now.

The primary focus of the Arrow IR will be on shoring up APIs (producers and
consumers), and we will also be moving the compute IR flatbuffers files out
the format directory into another top-level directory in the repo.

Thanks,
Phillip

On Mon, Aug 30, 2021 at 7:30 PM Weston Pace <we...@gmail.com> wrote:

> My (incredibly naive) interpretation is that there are three problems to
> tackle.
>
> 1) How do you represent a graph and relational operators (join, union,
> groupby, etc.)
>  - The PR appears to be addressing this question fairly well
> 2) How does a frontend query a backend to know what UDFs are supported.
>  - I don't see anything in the spec for this (some comments touch on
> it) but it seems like it would be necessary to build any kind of
> system.
> 3) Is there some well defined set of canonical UDFs that we can all
> agree on the semantics for (e.g. addition, subtraction, etc.)
>  - I thought, from earlier comments in this email thread, that the
> goal was to avoid addressing this.  Although I think there is strong
> value here as well.
>
> So what is the scope of this initiative?  If it is just #1 for example
> then I don't see any need to put types in the IR (and I've commented
> as such in the PR).  From a relational perspective isn't a UDF just a
> black box Table -> UDF -> Table?
>
> On Mon, Aug 30, 2021 at 11:10 AM Phillip Cloud <cp...@gmail.com> wrote:
> >
> > Hey everyone,
> >
> > There's some interesting discussion around types and where their location
> > is in the current PR [1] (and in fact whether to store them at all).
> >
> > It would be great to get some community feedback on this [2] part of the
> PR
> > in particular, because the choice of whether to store types at all has
> > important design consequences.
> >
> > [1]: https://github.com/apache/arrow/pull/10934
> > [2]: https://github.com/apache/arrow/pull/10934/files#r697025313
> >
> > On Fri, Aug 27, 2021 at 2:11 AM Micah Kornfield <em...@gmail.com>
> > wrote:
> >
> > > As an FYI, Iceberg is also considering an IR in relation to view
> support
> > > [1].  I chimed in and pointed them to this thread and Wes's doc.
> Phillip
> > > and Jacques chimed in there as well.
> > >
> > > [1]
> > >
> > >
> https://mail-archives.apache.org/mod_mbox/iceberg-dev/202108.mbox/%3CCAKRVfm6h6WxQtp5fj8Yj8XWR1wFe8VohOkPuoZZGK-UHPhtwjQ%40mail.gmail.com%3E
> > >
> > > On Thu, Aug 26, 2021 at 12:40 PM Phillip Cloud <cp...@gmail.com>
> wrote:
> > >
> > > > Thanks for the feedback Jacques, very helpful. In the latest version
> of
> > > the
> > > > PR, I've tried to incorporate nearly all of these points.
> > > >
> > > > - I've incorporated most of what you had for dereferencing operations
> > > into
> > > > the PR, and gotten rid of schemas except on Read/Write relations.
> > > > - With respect to properties, I've made a bunch more specific
> operators,
> > > > and kept user-defined things special case-y.
> > > > - I haven't incorporated anything close to physical-plan things, but
> I
> > > > think that's a good follow up PR. Having separate representations for
> > > > logical/physical plans seems like a waste of effort ultimately. I
> think
> > > we
> > > > can find a good balance.
> > > > - Agree on UDF support, I think that will have to evolve as the rest
> of
> > > the
> > > > spec evolves. UDFs will need language-dedicated effort given the
> large
> > > > variety of languages that folks will want to use to define functions.
> > > >
> > > > On a separate note, in an effort to move this project forward I'd
> like to
> > > > do one final round of code review from anyone interested and then
> merge
> > > the
> > > > PR after that.
> > > > This spec will be unstable for a while, until we can work out all the
> > > > design kinks and edge cases, and I think getting this initial spec
> in is
> > > > important to start that process.
> > > >
> > > >
> > > > On Mon, Aug 23, 2021 at 1:53 PM Jacques Nadeau <ja...@apache.org>
> > > wrote:
> > > >
> > > > > In a lucky turn of events, Phillip actually turned out to be in my
> neck
> > > > of
> > > > > the woods on Friday so we had a chance to sit down and discuss
> this. To
> > > > > help, I actually shared something I had been working on a few
> months
> > > ago
> > > > > independently (before this discussion started).
> > > > >
> > > > > For reference:
> > > > > Wes PR: https://github.com/apache/arrow/pull/10856
> > > > > Ben PR: https://github.com/apache/arrow/pull/10934
> > > > > Jacques PR: https://github.com/apache/arrow/pull/10979
> > > > >
> > > > > The high level points of feedback I have are:
> > > > >
> > > > >    - Ben PR feels too deconstructed. While I like the elegance and
> > > > >    symmetry, I believe this will lead to substantially more work in
> > > > moving
> > > > >    from serialization format to something closer to what a system
> would
> > > > > want
> > > > >    to manipulate/consume. The reality is that there are a lot of
> really
> > > > > known
> > > > >    things and specializing the representation for these things will
> > > > > ultimately
> > > > >    make things easier to program with without error and easier to
> > > debug.
> > > > > (For
> > > > >    example, imagine trying to inspect a plan in a debugging session
> > > with
> > > > > the
> > > > >    Ben representation.) We should embrace the known things in the
> > > > >    specification.
> > > > >    - I believe that it is a mistake for the inner workings of the
> plan
> > > to
> > > > >    ever use field names. Only input (e.g. file read) and Output
> (e.g.
> > > > > return
> > > > >    to user or write to file) need to have field names. For the
> rest of
> > > > the
> > > > >    system, using field ordinals (determinant whether nested or
> flat) is
> > > > > much
> > > > >    cleaner and is how most execution systems work. For example, in
> > > > Impala I
> > > > >    believe it is called a slot. As I noted in the PR, Calcite as an
> > > > > example is
> > > > >    entirely ordinal based at the algebra level. Rowtypes contain
> field
> > > > > names
> > > > >    but they are actually basically pointless. Field references use
> > > > > RexInputRef
> > > > >    with ordinal based and rules around column order output (e.g.
> what
> > > is
> > > > > the
> > > > >    field order of a join output) are determinant and done entirely
> at
> > > an
> > > > >    ordinal level. The only place where names should be used
> (besides
> > > > >    input/output) is in the case of map keys. In that case, the
> names
> > > are
> > > > >    actually data, as opposed to scheme metadata. This is why I
> propose
> > > a
> > > > >    strongly structured dereference operation [1].
> > > > >    - Properties should only be included in the serialization if
> they
> > > are
> > > > >    not easily re-derivable at plan consumption time. For example,
> > > you'll
> > > > > note
> > > > >    that I don't store schema information for relational operation.
> Each
> > > > >    function and relational operation should already know how a
> given
> > > > input
> > > > > is
> > > > >    transformed to a given output. Capturing this information in the
> > > > > plan/IR is
> > > > >    excessive. In many ways, I compare it to the early use of
> > > VectorLayout
> > > > > [2]
> > > > >    in Arrow schema. It may have provided some additional checksum
> of
> > > the
> > > > >    message but ultimately it caused more pain than it was worth
> (and
> > > thus
> > > > > we
> > > > >    removed it before formalizing the specification). For
> reference, in
> > > > the
> > > > >    context of Dremio, we used to actually do this, send schema
> > > > information
> > > > >    around for all operations. We removed it because in many cases
> > > > becoming
> > > > > the
> > > > >    majority of our internal plan serialization (imagine simple
> > > operations
> > > > > that
> > > > >    are carrying 1000s of fields).
> > > > >    - I suggest focusing on support for both logical and physical
> > > > >    representations. The moment you start talking about optimization
> > > > passes,
> > > > >    many of those would probably be better being done at the logical
> > > > level.
> > > > > The
> > > > >    overlap is really high.
> > > > >    - I think a lot more work must be done before introducing UDFs
> and
> > > > user
> > > > >    defined relational operations. I see one goal being the
> possibility
> > > of
> > > > >    there being three systems: A -> B -> C. A is a IR producer. C
> is a
> > > IR
> > > > >    consumer and B is a IR filter or translator. In this situation,
> B
> > > > > should be
> > > > >    able to operate and do optimizations on a plan even if if there
> are
> > > > > black
> > > > >    box user defined operations. Being able to know the
> > > > > properties-preservation
> > > > >    or not of these operations is important to making decisions. For
> > > > > example,
> > > > >    does a user defined relational operation maintain sortedness?
> > > > > Distribution?
> > > > >    Is a defined UDF idempotent? As such, I think the definition of
> > > those
> > > > > black
> > > > >    boxes should be much more structured. For example: it is a
> python
> > > > >    relational operation named X stored in Y that maintains
> properties
> > > 1,2
> > > > > and
> > > > >    disrupts property 3. Putting just a black box of bytes will
> > > > > substantially
> > > > >    reduce the compatibility and extensibility of the ecosystem of
> tools
> > > > >    working against IR. I'd note that I wouldn't expect this to be a
> > > > burden
> > > > > to
> > > > >    actual end users. By using sensible defaults, I still would
> expect
> > > an
> > > > > end
> > > > >    user tool to support arbitrary user defined operations.
> > > > >    - It might make sense to review the XML representation that Orca
> > > uses
> > > > >    [3]. I haven't looked at it recently but they had a strong goal
> of
> > > > >    decoupling for most of its life (to use in both Greenplum and
> Hawq).
> > > > It
> > > > >    could be the most mature/formal serialization of query plans
> > > > publically
> > > > >    available.
> > > > >
> > > > >
> > > > > [1]
> > > > >
> > > > >
> > > >
> > >
> https://github.com/apache/arrow/pull/10979/files#diff-e40fbc40cf7a131efd2cb098444931774cfad046b8665b38452258ffaa2e3423R34
> > > > > [2]
> > > > >
> > > > >
> > > >
> > >
> https://github.com/apache/arrow/commit/611a4b951e24f4f967c3d382a2027dc035fc37f0
> > > > > [3] https://github.com/greenplum-db/gporca
> > > > >
> > > > >
> > > > > On Tue, Aug 17, 2021 at 11:14 AM Phillip Cloud <cp...@gmail.com>
> > > > wrote:
> > > > >
> > > > > > On Tue, Aug 17, 2021 at 10:56 AM Wes McKinney <
> wesmckinn@gmail.com>
> > > > > wrote:
> > > > > >
> > > > > > > Looking at Ben's alternate PR [1], having an IR that leans
> heavily
> > > on
> > > > > > > memory references to an out-of-band data sidecar seems like an
> > > > > > > approach that would substantially ratchet up the implementation
> > > > > > > complexity as producing the IR would then have the level of
> > > > complexity
> > > > > > > of producing the Arrow IPC format — when producing the "root"
> Plan
> > > > > > > message, you must accumulate a list of the dependent serialized
> > > > > > > submessages, placing the appropriate Buffer memory offset in
> the
> > > Plan
> > > > > > > message, like we do when producing the RecordBatch.buffers
> field.
> > > > This
> > > > > > > seems complicated to me as you must devise a custom binary
> protocol
> > > > to
> > > > > > > concatenate the serialized Plan and the messages it depends on
> > > into a
> > > > > > > single binary payload
> > > > > > >
> > > > > > > <ROOT PLAN>
> > > > > > > <padding>
> > > > > > > <Buffer 0>
> > > > > > > <padding>
> > > > > > > <Buffer 1>
> > > > > > > <padding>
> > > > > > > ...
> > > > > > > <Buffer N - 1>
> > > > > > >
> > > > > > > (one purpose of FlatBufferBuilder is to spare you having to do
> this
> > > > > > > yourself — some reasons we do it for the Arrow IPC format is
> > > because
> > > > > > > appending Arrow memory buffers directly to a FlatBufferBuilder
> > > would
> > > > > > > be inefficient — internal realloc calls — and Flatbuffers are
> > > limited
> > > > > > > to 2GB. Neither of these things are problems here)
> > > > > > >
> > > > > > > In general, I believe the output created by an IR producer
> should
> > > be
> > > > a
> > > > > > > single serialized object without any out-of-band data sidecar —
> > > this
> > > > > > > is much simpler for implementers and we can still provide an
> > > "escape
> > > > > > > hatch" for user-defined operators and functions where the
> required
> > > > > > > function/operator is passed opaquely as an embedded binary
> data.
> > > > > >
> > > > > >
> > > > > >
> > > > > > The serialization format (whether it is Flatbuffers or JSON, or
> > > > > > > something else) should allow for data memoization, so if there
> is a
> > > > > > > user-defined operator/function, or a relation that is used
> multiple
> > > > > > > times throughout the query (potentially with a large schema),
> then
> > > we
> > > > > > > should ensure that the data need not be duplicated in the
> > > > > > > serialization format unnecessarily — in Flatbuffers, you can
> > > achieve
> > > > > > > this by reusing offsets, but we could devise the data
> structures to
> > > > > > > make the memoization of "expensive" objects more explicit.
> > > > > > >
> > > > > >
> > > > > > I think this is something that would need to be explicitly
> encoded in
> > > > > > the structures themselves if it's a hard requirement. I don't
> think
> > > > this
> > > > > > should block
> > > > > > a prototype producer/consumer.
> > > > > >
> > > > > > Is there something in the second PR/design that precludes the
> reuse
> > > of
> > > > > > offsets?
> > > > > > To my eye, the flatbuffers offset reuse mechanism works just as
> well
> > > > > there.
> > > > > >
> > > > > >
> > > > > > > I additionally think that it is important to provide as much
> > > built-in
> > > > > > > support for "canonical" operators/functions (such as the ones
> > > > > > > implemented commonly by SQL engines) as possible, and to
> liberally
> > > > > > > expand the catalogue of "built-in" capabilities. I would still
> > > prefer
> > > > > > > to have large unions/enums of built-in operators/functions and
> to
> > > > > > > expand those unions/enums to accommodate new things when it is
> > > > > > > demonstrated that there is a need to standardize things between
> > > > > > > producers/consumers.
> > > > > > >
> > > > > >
> > > > > > I think there's a middle ground where we add a bit of structure
> > > > > (something
> > > > > > like
> > > > > > a descriptor from the first PR) to indicate whether a thing is
> > > built-in
> > > > > vs
> > > > > > user-defined.
> > > > > > It looks like Ben has pushed something like this to his PR.
> > > > > >
> > > > > > With that scheme, we have both flexibility and a small set of
> special
> > > > > > builtins that make up
> > > > > > a statically typed set for expressions and relational operators.
> > > > > >
> > > > > > I would really like to vet this PR with a prototype this week,
> > > > > > to see whether we need to revisit any major choices. I don't
> think
> > > > we'll
> > > > > be
> > > > > > able to
> > > > > > anticipate all the consequences until we write some code.
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > One of the beneficial properties of the Union/Enum approach
> for the
> > > > > > > operator/function catalogues, is that when there are additions
> to
> > > > > > > those enums, the generated Flatbuffers files will cause many
> > > language
> > > > > > > compilers to warn or error on unhandled enum cases. If all
> > > > > > > function/operator names are strings, then you are essentially
> > > > > > > reimplementing the functionality provided by enums by hand. I
> > > > > > > initially used strings for all function references in my
> original
> > > > > > > prototype, but I now think that using an enum for "built-ins"
> would
> > > > be
> > > > > > > superior (because of the code-generated enum interface) and
> not a
> > > > > > > premature optimization.
> > > > > > >
> > > > > > > [1]: https://github.com/apache/arrow/pull/10934
> > > > > > >
> > > > > > > On Fri, Aug 13, 2021 at 11:26 PM Phillip Cloud <
> cpcloud@gmail.com>
> > > > > > wrote:
> > > > > > > >
> > > > > > > > Hey all,
> > > > > > > >
> > > > > > > > Just wanted to give an update on the effort here.
> > > > > > > >
> > > > > > > > Ben Kietzman has created an alternative proposal to the
> initial
> > > > > design
> > > > > > > [1].
> > > > > > > > It largely overlaps with the original, but differs in a few
> > > > important
> > > > > > > ways:
> > > > > > > >
> > > > > > > > * A big focus of the design is on flexibility, allowing
> > > producers,
> > > > > > > > consumers and ultimately end users of those systems the
> ability
> > > to
> > > > > > define
> > > > > > > > custom operations in the graph.
> > > > > > > > * There are very few predefined relational operations
> (project,
> > > > > filter,
> > > > > > > > join and a handful of others).
> > > > > > > > * There are only 3 types of value expressions: literals,
> field
> > > > > > > references,
> > > > > > > > and function calls.
> > > > > > > > * The model of evaluation is one that requires a final sink
> node,
> > > > to
> > > > > > > > indicate where the record batches from child relations end
> up (a
> > > > > file,
> > > > > > a
> > > > > > > > socket, an in-memory buffer, etc).
> > > > > > > >
> > > > > > > > I've added notes [2] to the original Google doc (under the
> > > > > Alternative
> > > > > > > > Design Notes subheading), and a few pseudocode examples.
> > > > > > > > Unfortunately, these went out of date as soon as Ben pushed
> the
> > > PR
> > > > > [3],
> > > > > > > so
> > > > > > > > I need to update those to reflect his changes. Regardless,
> > > > > > > > the design is broadly the same, so it should still give a
> good
> > > > > > indication
> > > > > > > > of the details of the design.
> > > > > > > >
> > > > > > > > There are a decent number of review comments on the original
> PR
> > > > that
> > > > > I
> > > > > > > plan
> > > > > > > > to port over where they are still relevant.
> > > > > > > > I also plan on adding support for window functions either
> tonight
> > > > or
> > > > > on
> > > > > > > > Monday.
> > > > > > > >
> > > > > > > > Please review this design at your earliest convenience. Since
> > > > > there's a
> > > > > > > > fairly concrete set of types in flatbuffers that
> > > > > > > > we can look at, ideally we can center discussion around the
> > > details
> > > > > in
> > > > > > > the
> > > > > > > > PR.
> > > > > > > >
> > > > > > > > Thanks!
> > > > > > > >
> > > > > > > > [1]: https://github.com/apache/arrow/pull/10856
> > > > > > > > [2]:
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> https://docs.google.com/document/d/1C_XVOG7iFkl6cgWWMyzUoIjfKt-X2UxqagPJrla0bAE/edit#heading=h.4tfbbtaqzu13
> > > > > > > > [3]: https://github.com/apache/arrow/pull/10934
> > > > > > > >
> > > > > > > > On Thu, Aug 12, 2021 at 3:55 PM Julian Hyde <
> > > > jhyde.apache@gmail.com>
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > > > Wes wrote:
> > > > > > > > > >
> > > > > > > > > > Supporting this kind of intra-application engine
> > > > > > > > > > heterogeneity is one of the motivations for the project.
> > > > > > > > >
> > > > > > > > > +1
> > > > > > > > >
> > > > > > > > > The data format is the natural interface between tasks.
> > > (Defining
> > > > > > > “task”
> > > > > > > > > here as “something that is programmed using the IR”.) That
> is
> > > > > Arrow’s
> > > > > > > > > strength.
> > > > > > > > >
> > > > > > > > > So I think the IR should describe what each task should
> do, and
> > > > > tasks
> > > > > > > > > should be fairly small. Not whole relational operators,
> > > operating
> > > > > on
> > > > > > > whole
> > > > > > > > > tables, but pieces of relational operators, operating on
> > > batches
> > > > or
> > > > > > > > > sequences of batches.
> > > > > > > > >
> > > > > > > > > Elsethread, someone mentioned the LoLePop concept and the
> > > > > > > Kohn/Leis/Neuman
> > > > > > > > > paper [1]. The LoLePop concept sounds good for our
> purposes.
> > > > > > > > >
> > > > > > > > > Julian
> > > > > > > > >
> > > > > > > > > [1]
> https://db.in.tum.de/~kohn/papers/lolepops-sigmod21.pdf
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > On Aug 12, 2021, at 5:19 AM, Wes McKinney <
> > > wesmckinn@gmail.com
> > > > >
> > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > On Wed, Aug 11, 2021 at 11:22 PM Phillip Cloud <
> > > > > cpcloud@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > > >>
> > > > > > > > > >> On Wed, Aug 11, 2021 at 4:48 PM Jorge Cardoso Leitão <
> > > > > > > > > >> jorgecarleitao@gmail.com> wrote:
> > > > > > > > > >>
> > > > > > > > > >>> Couple of questions
> > > > > > > > > >>>
> > > > > > > > > >>> 1. Is the goal that IRs have equal semantics, i.e.
> given
> > > > > > > (IR,data), the
> > > > > > > > > >>> operation "(IR,data) - engine -> result" MUST be the
> same
> > > for
> > > > > all
> > > > > > > > > "engine"?
> > > > > > > > > >>>
> > > > > > > > > >>
> > > > > > > > > >> I think that might be a non-starter for mundane reasons:
> > > > there's
> > > > > > > > > probably
> > > > > > > > > >> at least two engines
> > > > > > > > > >> that disagree on the result of sum(x) where x is a
> floating
> > > > > point
> > > > > > > > > column.
> > > > > > > > > >>
> > > > > > > > > >>
> > > > > > > > > >>> 2. if yes, imo we may need to worry about:
> > > > > > > > > >>> * a definition of equality that implementations agree
> on.
> > > > > > > > > >>> * agreement over what the semantics look like. For
> example,
> > > > do
> > > > > we
> > > > > > > use
> > > > > > > > > >>> kleene logic for AND and OR?
> > > > > > > > > >>>
> > > > > > > > > >>
> > > > > > > > > >> WRT Kleene logic my thoughts are that the IR should
> support
> > > > both
> > > > > > > Kleene
> > > > > > > > > and
> > > > > > > > > >> non-Kleene,
> > > > > > > > > >> and producers can choose their desired semantics.
> > > > > > > > > >>
> > > > > > > > > >> Ibis for example, would override the `&` operator in `a
> & b`
> > > > to
> > > > > > > produce
> > > > > > > > > >> `KleeneAnd(Column(a), Column(b))`.
> > > > > > > > > >>
> > > > > > > > > >>
> > > > > > > > > >>>
> > > > > > > > > >>> To try to understand the gist, let's pick an aggregated
> > > count
> > > > > > over
> > > > > > > a
> > > > > > > > > >>> column: engines often do partial counts over partitions
> > > > > followed
> > > > > > > by a
> > > > > > > > > final
> > > > > > > > > >>> "sum" over the partial counts. Is the idea that the
> query
> > > > > engine
> > > > > > > would
> > > > > > > > > >>> communicate with the compute engine via 2 IRs where
> one is
> > > > > "count
> > > > > > > me
> > > > > > > > > these"
> > > > > > > > > >>> the other is "sum me these"?
> > > > > > > > > >>>
> > > > > > > > > >>> Best,
> > > > > > > > > >>> Jorge
> > > > > > > > > >>>
> > > > > > > > > >>
> > > > > > > > > >> Not in its current incarnation.
> > > > > > > > > >>
> > > > > > > > > >> The idea is that the IR producer communicates a desire
> to
> > > > > count(x)
> > > > > > > to a
> > > > > > > > > >> consumer, and  it's up to the consumer to figure out
> how to
> > > > turn
> > > > > > > that
> > > > > > > > > count
> > > > > > > > > >> into something that makes sense for itself. In your
> example
> > > > > > that's a
> > > > > > > > > series
> > > > > > > > > >> of partial counts followed by a sum.
> > > > > > > > > >>
> > > > > > > > > >
> > > > > > > > > > That said, I think there is a valid use case here where a
> > > > system
> > > > > > > might
> > > > > > > > > > make use of different engines to execute different
> > > (composable)
> > > > > > > layers
> > > > > > > > > > of a complex query.
> > > > > > > > > >
> > > > > > > > > > For example:
> > > > > > > > > >
> > > > > > > > > > * suppose you want to scan and do predicate pushdown on
> an
> > > > > unusual
> > > > > > > > > > data source that is only accessible from one particular
> > > engine
> > > > > but
> > > > > > > > > > * you need to do some analytical operation with the scan
> > > > results
> > > > > > that
> > > > > > > > > > is only supported by another engine
> > > > > > > > > >
> > > > > > > > > > You could decompose the query into two stages with an IR
> > > > > relational
> > > > > > > > > > expression for each stage and use then the engines
> together
> > > to
> > > > > > > > > > accomplish what you need (of course, you would need an
> > > > > > orchestration
> > > > > > > > > > layer to handle plumbing the query engine inputs and
> outputs
> > > > > > together
> > > > > > > > > > as Arrow streams). Supporting this kind of
> intra-application
> > > > > engine
> > > > > > > > > > heterogeneity is one of the motivations for the project.
> > > > > > > > > >
> > > > > > > > > >>>
> > > > > > > > > >>>
> > > > > > > > > >>>
> > > > > > > > > >>>
> > > > > > > > > >>>
> > > > > > > > > >>> On Wed, Aug 11, 2021 at 6:10 PM Phillip Cloud <
> > > > > cpcloud@gmail.com
> > > > > > >
> > > > > > > > > wrote:
> > > > > > > > > >>>
> > > > > > > > > >>>> Thanks Wes,
> > > > > > > > > >>>>
> > > > > > > > > >>>> Great to be back working on Arrow again and engaging
> with
> > > > the
> > > > > > > > > community.
> > > > > > > > > >>> I
> > > > > > > > > >>>> am really excited about this effort.
> > > > > > > > > >>>>
> > > > > > > > > >>>> I think there are a number of concerns I see as
> important
> > > to
> > > > > > > address
> > > > > > > > > in
> > > > > > > > > >>> the
> > > > > > > > > >>>> compute IR proposal:
> > > > > > > > > >>>>
> > > > > > > > > >>>> 1. Requirement for output types.
> > > > > > > > > >>>>
> > > > > > > > > >>>> I think that so far there's been many reasons for
> > > requiring
> > > > > > > > > conforming IR
> > > > > > > > > >>>> producers and consumers to adhere to output types,
> but I
> > > > > haven't
> > > > > > > seen
> > > > > > > > > a
> > > > > > > > > >>>> strong rationale for keeping them optional (in the
> > > semantic
> > > > > > > sense, not
> > > > > > > > > >>> WRT
> > > > > > > > > >>>> any particular serialization format's representation
> of
> > > > > > > optionality).
> > > > > > > > > >>>>
> > > > > > > > > >>>> I think a design that includes unambiguous semantics
> for
> > > > > output
> > > > > > > types
> > > > > > > > > (a
> > > > > > > > > >>>> consumer must produce a value of the requested type or
> > > it's
> > > > an
> > > > > > > > > >>>> error/non-conforming) is simpler to reason about for
> > > > > producers,
> > > > > > > and
> > > > > > > > > >>>> provides a strong guarantee for end users (humans or
> > > > machines
> > > > > > > > > >>> constructing
> > > > > > > > > >>>> IR from an API and expecting the thing they ask for
> back
> > > > from
> > > > > > the
> > > > > > > IR
> > > > > > > > > >>>> consumer).
> > > > > > > > > >>>>
> > > > > > > > > >>>> 2. Flexibility
> > > > > > > > > >>>>
> > > > > > > > > >>>> The current PR is currently unable to support what I
> think
> > > > are
> > > > > > > killer
> > > > > > > > > >>>> features of the IR: custom operators (relational or
> > > column)
> > > > > and
> > > > > > > UDFs.
> > > > > > > > > In
> > > > > > > > > >>> my
> > > > > > > > > >>>> mind, on top of the generalized compute description
> that
> > > the
> > > > > IR
> > > > > > > > > offers,
> > > > > > > > > >>> the
> > > > > > > > > >>>> ability for producers and consumers of IR to extend
> the IR
> > > > > > without
> > > > > > > > > >>> needing
> > > > > > > > > >>>> to modify Arrow or depend on anything except the
> format is
> > > > > > itself
> > > > > > > > > >>> something
> > > > > > > > > >>>> that is necessary to gain adoption.
> > > > > > > > > >>>>
> > > > > > > > > >>>> Developers will need to build custom relational
> operators
> > > > > (e.g.,
> > > > > > > > > scans of
> > > > > > > > > >>>> backends that don't exist anywhere for which a user
> has
> > > code
> > > > > to
> > > > > > > > > >>> implement)
> > > > > > > > > >>>> and custom functions (anything operating on a column
> that
> > > > > > doesn't
> > > > > > > > > already
> > > > > > > > > >>>> exist, really). Furthermore, I think we can actually
> drive
> > > > > > > building an
> > > > > > > > > >>>> Arrow consumer using the same API that an end user
> would
> > > use
> > > > > to
> > > > > > > extend
> > > > > > > > > >>> the
> > > > > > > > > >>>> IR.
> > > > > > > > > >>>>
> > > > > > > > > >>>> 3. Window Functions
> > > > > > > > > >>>>
> > > > > > > > > >>>> Window functions are, I think, an important part of
> the IR
> > > > > value
> > > > > > > > > >>>> proposition, as they are one of the more complex
> operators
> > > > in
> > > > > > > > > databases.
> > > > > > > > > >>> I
> > > > > > > > > >>>> think we need to have something in the initial IR
> proposal
> > > > to
> > > > > > > support
> > > > > > > > > >>> these
> > > > > > > > > >>>> operations.
> > > > > > > > > >>>>
> > > > > > > > > >>>> 4. Non relational Joins
> > > > > > > > > >>>>
> > > > > > > > > >>>> Things like as-of join and window join operators
> aren't
> > > yet
> > > > > > > fleshed
> > > > > > > > > out
> > > > > > > > > >>> in
> > > > > > > > > >>>> the IR, and I'm not sure they should be in scope for
> the
> > > > > initial
> > > > > > > > > >>> prototype.
> > > > > > > > > >>>> I think once we settle on a design, we can work the
> design
> > > > of
> > > > > > > these
> > > > > > > > > >>>> particular operators out during the initial
> prototype. I
> > > > think
> > > > > > the
> > > > > > > > > >>>> specification of these operators should basically be
> PR #2
> > > > > after
> > > > > > > the
> > > > > > > > > >>>> initial design lands.
> > > > > > > > > >>>>
> > > > > > > > > >>>> # Order of Work
> > > > > > > > > >>>>
> > > > > > > > > >>>> 1. Nail down the design. Anything else is a
> non-starter.
> > > > > > > > > >>>>
> > > > > > > > > >>>> 2. Prototype an IR producer using Ibis
> > > > > > > > > >>>>
> > > > > > > > > >>>> Ibis is IMO a good candidate for a first IR producer
> as it
> > > > > has a
> > > > > > > > > number
> > > > > > > > > >>> of
> > > > > > > > > >>>> desirable properties that make prototyping faster and
> > > allow
> > > > > for
> > > > > > > us to
> > > > > > > > > >>>> refine the design of the IR as needed based on how the
> > > > > > > implementation
> > > > > > > > > >>> goes:
> > > > > > > > > >>>> * It's written in Python so it has native support for
> > > nearly
> > > > > all
> > > > > > > of
> > > > > > > > > >>>> flatbuffers' features without having to creating
> bindings
> > > to
> > > > > > C++.
> > > > > > > > > >>>> * There's already a set of rules for type checking, as
> > > well
> > > > as
> > > > > > > APIs
> > > > > > > > > for
> > > > > > > > > >>>> constructing expression trees, which means we don't
> need
> > > to
> > > > > > worry
> > > > > > > > > about
> > > > > > > > > >>>> building a type checker for the prototype.
> > > > > > > > > >>>>
> > > > > > > > > >>>> 3. Prototype an IR consumer in C++
> > > > > > > > > >>>>
> > > > > > > > > >>>> I think in parallel to the producer prototype we can
> > > further
> > > > > > > inform
> > > > > > > > > the
> > > > > > > > > >>>> design from the consumer side by prototyping an IR
> > > consumer
> > > > in
> > > > > > > C++ . I
> > > > > > > > > >>> know
> > > > > > > > > >>>> Ben Kietzman has expressed interest in working on
> this.
> > > > > > > > > >>>>
> > > > > > > > > >>>> Very interested to hear others' thoughts.
> > > > > > > > > >>>>
> > > > > > > > > >>>> -Phillip
> > > > > > > > > >>>>
> > > > > > > > > >>>> On Tue, Aug 10, 2021 at 10:56 AM Wes McKinney <
> > > > > > > wesmckinn@gmail.com>
> > > > > > > > > >>> wrote:
> > > > > > > > > >>>>
> > > > > > > > > >>>>> Thank you for all the feedback and comments on the
> > > > document.
> > > > > > I'm
> > > > > > > on
> > > > > > > > > >>>>> vacation this week, so I'm delayed responding to
> > > > everything,
> > > > > > but
> > > > > > > I
> > > > > > > > > >>>>> will get to it as quickly as I can. I will be at
> VLDB in
> > > > > > > Copenhagen
> > > > > > > > > >>>>> next week if anyone would like to chat in person
> about
> > > it,
> > > > > and
> > > > > > > we can
> > > > > > > > > >>>>> relay the content of any discussions back to the
> > > > > > > document/PR/e-mail
> > > > > > > > > >>>>> thread.
> > > > > > > > > >>>>>
> > > > > > > > > >>>>> I know that Phillip Cloud expressed interest in
> working
> > > on
> > > > > the
> > > > > > > PR and
> > > > > > > > > >>>>> helping work through many of the details, so I'm
> glad to
> > > > have
> > > > > > the
> > > > > > > > > >>>>> help. If there are others who would like to work on
> the
> > > PR
> > > > or
> > > > > > dig
> > > > > > > > > into
> > > > > > > > > >>>>> the details, please let me know. We might need to
> figure
> > > > out
> > > > > > how
> > > > > > > to
> > > > > > > > > >>>>> accommodate "many cooks" by setting up the ComputeIR
> > > > project
> > > > > > > > > somewhere
> > > > > > > > > >>>>> separate from the format/ directory to permit it to
> exist
> > > > in
> > > > > a
> > > > > > > > > >>>>> Work-In-Progress status for a period of time until we
> > > work
> > > > > > > through
> > > > > > > > > the
> > > > > > > > > >>>>> various details and design concerns.
> > > > > > > > > >>>>>
> > > > > > > > > >>>>> Re Julian's comment
> > > > > > > > > >>>>>
> > > > > > > > > >>>>>> The biggest surprise is that this language does full
> > > > > > relational
> > > > > > > > > >>>>> operations. I was expecting that it would do
> fragments of
> > > > the
> > > > > > > > > >>> operations.
> > > > > > > > > >>>>>
> > > > > > > > > >>>>> There's a related but different (yet still
> interesting
> > > and
> > > > > > > worthy of
> > > > > > > > > >>>>> analysis) problem of creating an "engine language"
> that
> > > > > > describes
> > > > > > > > > more
> > > > > > > > > >>>>> mechanically the constituent parts of implementing
> the
> > > > > > relational
> > > > > > > > > >>>>> operators. To create a functional computation
> language
> > > with
> > > > > > > concrete
> > > > > > > > > >>>>> Arrow data structures as a top-level primitive sounds
> > > like
> > > > an
> > > > > > > > > >>>>> interesting research area where I could see something
> > > > > > developing
> > > > > > > > > >>>>> eventually.
> > > > > > > > > >>>>>
> > > > > > > > > >>>>> The main problem I'm interested in solving right now
> is
> > > > > > enabling
> > > > > > > > > front
> > > > > > > > > >>>>> ends that have sufficient understanding of relational
> > > > algebra
> > > > > > and
> > > > > > > > > data
> > > > > > > > > >>>>> frame operations to talk to engines without having
> to go
> > > > > > > backwards
> > > > > > > > > >>>>> from their logical query plans to SQL. So as
> mentioned in
> > > > the
> > > > > > > > > >>>>> document, being able to faithfully carry the
> relational
> > > > > > operator
> > > > > > > node
> > > > > > > > > >>>>> information generated by Calcite or Ibis or another
> > > system
> > > > > > would
> > > > > > > be
> > > > > > > > > >>>>> super useful. Defining the semantics of various
> kinds of
> > > > > > > user-defined
> > > > > > > > > >>>>> functions would also be helpful to standardize the
> > > > > > > > > >>>>> engine-to-user-language UDF/extension interface.
> > > > > > > > > >>>>>
> > > > > > > > > >>>>> On Tue, Aug 10, 2021 at 2:36 PM Dimitri Vorona <
> > > > > > > alendit@gmail.com>
> > > > > > > > > >>>> wrote:
> > > > > > > > > >>>>>>
> > > > > > > > > >>>>>> Hi Wes,
> > > > > > > > > >>>>>>
> > > > > > > > > >>>>>> cool initiative! Reminded me of "Building Advanced
> SQL
> > > > > > Analytics
> > > > > > > > > From
> > > > > > > > > >>>>>> Low-Level Plan Operators" from SIGMOD 2021 (
> > > > > > > > > >>>>>>
> http://db.in.tum.de/~kohn/papers/lolepops-sigmod21.pdf)
> > > > > which
> > > > > > > > > >>>> proposes a
> > > > > > > > > >>>>>> set of building block for advanced aggregation.
> > > > > > > > > >>>>>>
> > > > > > > > > >>>>>> Cheers,
> > > > > > > > > >>>>>> Dimitri.
> > > > > > > > > >>>>>>
> > > > > > > > > >>>>>> On Thu, Aug 5, 2021 at 7:59 PM Julian Hyde <
> > > > > > > jhyde.apache@gmail.com>
> > > > > > > > > >>>>> wrote:
> > > > > > > > > >>>>>>
> > > > > > > > > >>>>>>> Wes,
> > > > > > > > > >>>>>>>
> > > > > > > > > >>>>>>> Thanks for this. I’ve added comments to the doc
> and to
> > > > the
> > > > > > PR.
> > > > > > > > > >>>>>>>
> > > > > > > > > >>>>>>> The biggest surprise is that this language does
> full
> > > > > > relational
> > > > > > > > > >>>>>>> operations. I was expecting that it would do
> fragments
> > > of
> > > > > the
> > > > > > > > > >>>>> operations.
> > > > > > > > > >>>>>>> Consider join. A distributed hybrid hash join
> needs to
> > > > > > > partition
> > > > > > > > > >>> rows
> > > > > > > > > >>>>> into
> > > > > > > > > >>>>>>> output buffers based on a hash key, build hash
> tables,
> > > > > probe
> > > > > > > into
> > > > > > > > > >>>> hash
> > > > > > > > > >>>>>>> tables, scan hash tables for untouched
> “outer”rows, and
> > > > so
> > > > > > > forth.
> > > > > > > > > >>>>>>>
> > > > > > > > > >>>>>>> I see Arrow’s compute as delivering each of those
> > > > > operations,
> > > > > > > > > >>> working
> > > > > > > > > >>>>> on
> > > > > > > > > >>>>>>> perhaps a batch at a time, or a sequence of
> batches,
> > > with
> > > > > > some
> > > > > > > > > >>> other
> > > > > > > > > >>>>> system
> > > > > > > > > >>>>>>> coordinating those tasks. So I would expect to see
> > > > Arrow’s
> > > > > > > compute
> > > > > > > > > >>>>> language
> > > > > > > > > >>>>>>> mainly operating on batches rather than a table
> > > > > abstraction.
> > > > > > > > > >>>>>>>
> > > > > > > > > >>>>>>> Julian
> > > > > > > > > >>>>>>>
> > > > > > > > > >>>>>>>
> > > > > > > > > >>>>>>>> On Aug 2, 2021, at 5:16 PM, Wes McKinney <
> > > > > > wesmckinn@gmail.com
> > > > > > > >
> > > > > > > > > >>>>> wrote:
> > > > > > > > > >>>>>>>>
> > > > > > > > > >>>>>>>> hi folks,
> > > > > > > > > >>>>>>>>
> > > > > > > > > >>>>>>>> This idea came up in passing in the past -- given
> that
> > > > > there
> > > > > > > are
> > > > > > > > > >>>>>>>> multiple independent efforts to develop
> Arrow-native
> > > > query
> > > > > > > > > >>> engines
> > > > > > > > > >>>>>>>> (and surely many more to come), it seems like it
> would
> > > > be
> > > > > > > > > >>> valuable
> > > > > > > > > >>>> to
> > > > > > > > > >>>>>>>> have a way to enable user languages (like Java,
> > > Python,
> > > > R,
> > > > > > or
> > > > > > > > > >>> Rust,
> > > > > > > > > >>>>>>>> for example) to communicate with backend computing
> > > > engines
> > > > > > > (like
> > > > > > > > > >>>>>>>> DataFusion, or new computing capabilities being
> built
> > > in
> > > > > the
> > > > > > > > > >>> Arrow
> > > > > > > > > >>>>> C++
> > > > > > > > > >>>>>>>> library) in a fashion that is "lower-level" than
> SQL
> > > and
> > > > > > > > > >>>> specialized
> > > > > > > > > >>>>>>>> to Arrow's type system. So rather than leaving it
> to a
> > > > SQL
> > > > > > > > > >>> parser /
> > > > > > > > > >>>>>>>> analyzer framework to generate an expression tree
> of
> > > > > > > relational
> > > > > > > > > >>>>>>>> operators and then translate that to an
> Arrow-native
> > > > > > > > > >>>> compute-engine's
> > > > > > > > > >>>>>>>> internal grammer, a user framework could provide
> the
> > > > > desired
> > > > > > > > > >>>>>>>> Arrow-native expression tree / data manipulations
> > > > directly
> > > > > > and
> > > > > > > > > >>> skip
> > > > > > > > > >>>>>>>> the SQL altogether.
> > > > > > > > > >>>>>>>>
> > > > > > > > > >>>>>>>> The idea of creating a "serialized intermediate
> > > > > > representation
> > > > > > > > > >>>> (IR)"
> > > > > > > > > >>>>>>>> for Arrow compute operations would be to serve use
> > > cases
> > > > > > large
> > > > > > > > > >>> and
> > > > > > > > > >>>>>>>> small -- from the most complex TPC-* or time
> series
> > > > > database
> > > > > > > > > >>> query
> > > > > > > > > >>>> to
> > > > > > > > > >>>>>>>> the most simple array predicate/filter sent with
> an
> > > RPC
> > > > > > > request
> > > > > > > > > >>>> using
> > > > > > > > > >>>>>>>> Arrow Flight. It is deliberately language- and
> > > > > > > engine-agnostic,
> > > > > > > > > >>>> with
> > > > > > > > > >>>>>>>> the only commonality across usages being the Arrow
> > > > > columnar
> > > > > > > > > >>> format
> > > > > > > > > >>>>>>>> (schemas and array types). This would be better
> than
> > > > > leaving
> > > > > > > it
> > > > > > > > > >>> to
> > > > > > > > > >>>>>>>> each application to develop its own bespoke
> expression
> > > > > > > > > >>>>> representations
> > > > > > > > > >>>>>>>> for its needs.
> > > > > > > > > >>>>>>>>
> > > > > > > > > >>>>>>>> I spent a while thinking about this and wrote up a
> > > brain
> > > > > > dump
> > > > > > > RFC
> > > > > > > > > >>>>>>>> document [1] and accompanying pull request [2]
> that
> > > > makes
> > > > > > the
> > > > > > > > > >>>>> possibly
> > > > > > > > > >>>>>>>> controversial choice of using Flatbuffers to
> represent
> > > > the
> > > > > > > > > >>>> serialized
> > > > > > > > > >>>>>>>> IR. I discuss the rationale for the choice of
> > > > Flatbuffers
> > > > > in
> > > > > > > the
> > > > > > > > > >>>> RFC
> > > > > > > > > >>>>>>>> document. This PR is obviously deficient in many
> > > regards
> > > > > > > > > >>>> (incomplete,
> > > > > > > > > >>>>>>>> hacky, or unclear in places), and will need some
> help
> > > > from
> > > > > > > others
> > > > > > > > > >>>> to
> > > > > > > > > >>>>>>>> flesh out. I suspect that actually implementing
> the IR
> > > > > will
> > > > > > be
> > > > > > > > > >>>>>>>> necessary to work out many of the low-level
> details.
> > > > > > > > > >>>>>>>>
> > > > > > > > > >>>>>>>> Note that this IR is intended to be more of a
> > > "superset"
> > > > > > > project
> > > > > > > > > >>>> than
> > > > > > > > > >>>>>>>> a "lowest common denominator". So there may be
> things
> > > > that
> > > > > > it
> > > > > > > > > >>>>> includes
> > > > > > > > > >>>>>>>> which are only available in some engines (e.g.
> engines
> > > > > that
> > > > > > > have
> > > > > > > > > >>>>>>>> specialized handling of time series data).
> > > > > > > > > >>>>>>>>
> > > > > > > > > >>>>>>>> As some of my personal motivation for the project,
> > > > > > concurrent
> > > > > > > > > >>> with
> > > > > > > > > >>>>> the
> > > > > > > > > >>>>>>>> genesis of Apache Arrow, I started a Python
> project
> > > > called
> > > > > > > Ibis
> > > > > > > > > >>> [3]
> > > > > > > > > >>>>>>>> (which is similar to R's dplyr project) which
> serves
> > > as
> > > > a
> > > > > > > "Python
> > > > > > > > > >>>>>>>> analytical query IR builder" that is capable of
> > > > generating
> > > > > > > most
> > > > > > > > > >>> of
> > > > > > > > > >>>>> the
> > > > > > > > > >>>>>>>> SQL standard, targeting many different SQL
> dialects
> > > and
> > > > > > other
> > > > > > > > > >>>>> backends
> > > > > > > > > >>>>>>>> (like pandas). Microsoft ([4]) and Google ([5])
> have
> > > > used
> > > > > > this
> > > > > > > > > >>>>> library
> > > > > > > > > >>>>>>>> as a "many-SQL" middleware. As such, I would like
> to
> > > be
> > > > > able
> > > > > > > to
> > > > > > > > > >>>>>>>> translate between the in-memory "logical query"
> data
> > > > > > > structures
> > > > > > > > > >>> in
> > > > > > > > > >>>> a
> > > > > > > > > >>>>>>>> library like Ibis to a serialized format that can
> be
> > > > > > executed
> > > > > > > by
> > > > > > > > > >>>> many
> > > > > > > > > >>>>>>>> different Arrow-native query engines. The
> expression
> > > > > > > primitives
> > > > > > > > > >>>>>>>> available in Ibis should serve as helpful test
> cases,
> > > > too.
> > > > > > > > > >>>>>>>>
> > > > > > > > > >>>>>>>> I look forward to the community's comments on the
> RFC
> > > > > > document
> > > > > > > > > >>> [1]
> > > > > > > > > >>>>> and
> > > > > > > > > >>>>>>>> pull request [2] -- I realize that arriving at
> > > consensus
> > > > > on
> > > > > > a
> > > > > > > > > >>>> complex
> > > > > > > > > >>>>>>>> and ambitious project like this can be
> challenging so
> > > I
> > > > > > > recommend
> > > > > > > > > >>>>>>>> spending time on the "non-goals" section in the
> RFC
> > > and
> > > > > ask
> > > > > > > > > >>>> questions
> > > > > > > > > >>>>>>>> if you are unclear about the scope of what
> problems
> > > this
> > > > > is
> > > > > > > > > >>> trying
> > > > > > > > > >>>> to
> > > > > > > > > >>>>>>>> solve. I would be happy to give Edit access on
> the RFC
> > > > > > > document
> > > > > > > > > >>> to
> > > > > > > > > >>>>>>>> others and would consider ideas about how to move
> > > > forward
> > > > > > with
> > > > > > > > > >>>>>>>> something that is able to be implemented by
> different
> > > > > Arrow
> > > > > > > > > >>>> libraries
> > > > > > > > > >>>>>>>> in the reasonably near future.
> > > > > > > > > >>>>>>>>
> > > > > > > > > >>>>>>>> Thanks,
> > > > > > > > > >>>>>>>> Wes
> > > > > > > > > >>>>>>>>
> > > > > > > > > >>>>>>>> [1]:
> > > > > > > > > >>>>>>>
> > > > > > > > > >>>>>
> > > > > > > > > >>>>
> > > > > > > > > >>>
> > > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> https://docs.google.com/document/d/1C_XVOG7iFkl6cgWWMyzUoIjfKt-X2UxqagPJrla0bAE/edit#
> > > > > > > > > >>>>>>>> [2]: https://github.com/apache/arrow/pull/10856
> > > > > > > > > >>>>>>>> [3]: https://ibis-project.org/
> > > > > > > > > >>>>>>>> [4]:
> > > > > http://cidrdb.org/cidr2021/papers/cidr2021_paper08.pdf
> > > > > > > > > >>>>>>>> [5]:
> > > > > > > > > >>>>>>>
> > > > > > > > > >>>>>
> > > > > > > > > >>>>
> > > > > > > > > >>>
> > > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> https://cloud.google.com/blog/products/databases/automate-data-validation-with-dvt
> > > > > > > > > >>>>>>>
> > > > > > > > > >>>>>>>
> > > > > > > > > >>>>>
> > > > > > > > > >>>>
> > > > > > > > > >>>
> > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
>
>

Re: [DISCUSS] Developing an "Arrow Compute IR [Intermediate Representation]" to decouple language front ends from Arrow-native compute engines

Posted by Weston Pace <we...@gmail.com>.

My (incredibly naive) interpretation is that there are three problems to tackle.

1) How do you represent a graph and relational operators (join, union,
groupby, etc.)
 - The PR appears to be addressing this question fairly well
2) How does a frontend query a backend to know what UDFs are supported.
 - I don't see anything in the spec for this (some comments touch on
it) but it seems like it would be necessary to build any kind of
system.
3) Is there some well defined set of canonical UDFs that we can all
agree on the semantics for (e.g. addition, subtraction, etc.)
 - I thought, from earlier comments in this email thread, that the
goal was to avoid addressing this.  Although I think there is strong
value here as well.

So what is the scope of this initiative?  If it is just #1 for example
then I don't see any need to put types in the IR (and I've commented
as such in the PR).  From a relational perspective isn't a UDF just a
black box Table -> UDF -> Table?

On Mon, Aug 30, 2021 at 11:10 AM Phillip Cloud <cp...@gmail.com> wrote:
>
> Hey everyone,
>
> There's some interesting discussion around types and where their location
> is in the current PR [1] (and in fact whether to store them at all).
>
> It would be great to get some community feedback on this [2] part of the PR
> in particular, because the choice of whether to store types at all has
> important design consequences.
>
> [1]: https://github.com/apache/arrow/pull/10934
> [2]: https://github.com/apache/arrow/pull/10934/files#r697025313
>
> On Fri, Aug 27, 2021 at 2:11 AM Micah Kornfield <em...@gmail.com>
> wrote:
>
> > As an FYI, Iceberg is also considering an IR in relation to view support
> > [1].  I chimed in and pointed them to this thread and Wes's doc.  Phillip
> > and Jacques chimed in there as well.
> >
> > [1]
> >
> > https://mail-archives.apache.org/mod_mbox/iceberg-dev/202108.mbox/%3CCAKRVfm6h6WxQtp5fj8Yj8XWR1wFe8VohOkPuoZZGK-UHPhtwjQ%40mail.gmail.com%3E
> >
> > On Thu, Aug 26, 2021 at 12:40 PM Phillip Cloud <cp...@gmail.com> wrote:
> >
> > > Thanks for the feedback Jacques, very helpful. In the latest version of
> > the
> > > PR, I've tried to incorporate nearly all of these points.
> > >
> > > - I've incorporated most of what you had for dereferencing operations
> > into
> > > the PR, and gotten rid of schemas except on Read/Write relations.
> > > - With respect to properties, I've made a bunch more specific operators,
> > > and kept user-defined things special case-y.
> > > - I haven't incorporated anything close to physical-plan things, but I
> > > think that's a good follow up PR. Having separate representations for
> > > logical/physical plans seems like a waste of effort ultimately. I think
> > we
> > > can find a good balance.
> > > - Agree on UDF support, I think that will have to evolve as the rest of
> > the
> > > spec evolves. UDFs will need language-dedicated effort given the large
> > > variety of languages that folks will want to use to define functions.
> > >
> > > On a separate note, in an effort to move this project forward I'd like to
> > > do one final round of code review from anyone interested and then merge
> > the
> > > PR after that.
> > > This spec will be unstable for a while, until we can work out all the
> > > design kinks and edge cases, and I think getting this initial spec in is
> > > important to start that process.
> > >
> > >
> > > On Mon, Aug 23, 2021 at 1:53 PM Jacques Nadeau <ja...@apache.org>
> > wrote:
> > >
> > > > In a lucky turn of events, Phillip actually turned out to be in my neck
> > > of
> > > > the woods on Friday so we had a chance to sit down and discuss this. To
> > > > help, I actually shared something I had been working on a few months
> > ago
> > > > independently (before this discussion started).
> > > >
> > > > For reference:
> > > > Wes PR: https://github.com/apache/arrow/pull/10856
> > > > Ben PR: https://github.com/apache/arrow/pull/10934
> > > > Jacques PR: https://github.com/apache/arrow/pull/10979
> > > >
> > > > The high level points of feedback I have are:
> > > >
> > > >    - Ben PR feels too deconstructed. While I like the elegance and
> > > >    symmetry, I believe this will lead to substantially more work in
> > > moving
> > > >    from serialization format to something closer to what a system would
> > > > want
> > > >    to manipulate/consume. The reality is that there are a lot of really
> > > > known
> > > >    things and specializing the representation for these things will
> > > > ultimately
> > > >    make things easier to program with without error and easier to
> > debug.
> > > > (For
> > > >    example, imagine trying to inspect a plan in a debugging session
> > with
> > > > the
> > > >    Ben representation.) We should embrace the known things in the
> > > >    specification.
> > > >    - I believe that it is a mistake for the inner workings of the plan
> > to
> > > >    ever use field names. Only input (e.g. file read) and Output (e.g.
> > > > return
> > > >    to user or write to file) need to have field names. For the rest of
> > > the
> > > >    system, using field ordinals (determinant whether nested or flat) is
> > > > much
> > > >    cleaner and is how most execution systems work. For example, in
> > > Impala I
> > > >    believe it is called a slot. As I noted in the PR, Calcite as an
> > > > example is
> > > >    entirely ordinal based at the algebra level. Rowtypes contain field
> > > > names
> > > >    but they are actually basically pointless. Field references use
> > > > RexInputRef
> > > >    with ordinal based and rules around column order output (e.g. what
> > is
> > > > the
> > > >    field order of a join output) are determinant and done entirely at
> > an
> > > >    ordinal level. The only place where names should be used (besides
> > > >    input/output) is in the case of map keys. In that case, the names
> > are
> > > >    actually data, as opposed to scheme metadata. This is why I propose
> > a
> > > >    strongly structured dereference operation [1].
> > > >    - Properties should only be included in the serialization if they
> > are
> > > >    not easily re-derivable at plan consumption time. For example,
> > you'll
> > > > note
> > > >    that I don't store schema information for relational operation. Each
> > > >    function and relational operation should already know how a given
> > > input
> > > > is
> > > >    transformed to a given output. Capturing this information in the
> > > > plan/IR is
> > > >    excessive. In many ways, I compare it to the early use of
> > VectorLayout
> > > > [2]
> > > >    in Arrow schema. It may have provided some additional checksum of
> > the
> > > >    message but ultimately it caused more pain than it was worth (and
> > thus
> > > > we
> > > >    removed it before formalizing the specification). For reference, in
> > > the
> > > >    context of Dremio, we used to actually do this, send schema
> > > information
> > > >    around for all operations. We removed it because in many cases
> > > becoming
> > > > the
> > > >    majority of our internal plan serialization (imagine simple
> > operations
> > > > that
> > > >    are carrying 1000s of fields).
> > > >    - I suggest focusing on support for both logical and physical
> > > >    representations. The moment you start talking about optimization
> > > passes,
> > > >    many of those would probably be better being done at the logical
> > > level.
> > > > The
> > > >    overlap is really high.
> > > >    - I think a lot more work must be done before introducing UDFs and
> > > user
> > > >    defined relational operations. I see one goal being the possibility
> > of
> > > >    there being three systems: A -> B -> C. A is a IR producer. C is a
> > IR
> > > >    consumer and B is a IR filter or translator. In this situation, B
> > > > should be
> > > >    able to operate and do optimizations on a plan even if if there are
> > > > black
> > > >    box user defined operations. Being able to know the
> > > > properties-preservation
> > > >    or not of these operations is important to making decisions. For
> > > > example,
> > > >    does a user defined relational operation maintain sortedness?
> > > > Distribution?
> > > >    Is a defined UDF idempotent? As such, I think the definition of
> > those
> > > > black
> > > >    boxes should be much more structured. For example: it is a python
> > > >    relational operation named X stored in Y that maintains properties
> > 1,2
> > > > and
> > > >    disrupts property 3. Putting just a black box of bytes will
> > > > substantially
> > > >    reduce the compatibility and extensibility of the ecosystem of tools
> > > >    working against IR. I'd note that I wouldn't expect this to be a
> > > burden
> > > > to
> > > >    actual end users. By using sensible defaults, I still would expect
> > an
> > > > end
> > > >    user tool to support arbitrary user defined operations.
> > > >    - It might make sense to review the XML representation that Orca
> > uses
> > > >    [3]. I haven't looked at it recently but they had a strong goal of
> > > >    decoupling for most of its life (to use in both Greenplum and Hawq).
> > > It
> > > >    could be the most mature/formal serialization of query plans
> > > publically
> > > >    available.
> > > >
> > > >
> > > > [1]
> > > >
> > > >
> > >
> > https://github.com/apache/arrow/pull/10979/files#diff-e40fbc40cf7a131efd2cb098444931774cfad046b8665b38452258ffaa2e3423R34
> > > > [2]
> > > >
> > > >
> > >
> > https://github.com/apache/arrow/commit/611a4b951e24f4f967c3d382a2027dc035fc37f0
> > > > [3] https://github.com/greenplum-db/gporca
> > > >
> > > >
> > > > On Tue, Aug 17, 2021 at 11:14 AM Phillip Cloud <cp...@gmail.com>
> > > wrote:
> > > >
> > > > > On Tue, Aug 17, 2021 at 10:56 AM Wes McKinney <we...@gmail.com>
> > > > wrote:
> > > > >
> > > > > > Looking at Ben's alternate PR [1], having an IR that leans heavily
> > on
> > > > > > memory references to an out-of-band data sidecar seems like an
> > > > > > approach that would substantially ratchet up the implementation
> > > > > > complexity as producing the IR would then have the level of
> > > complexity
> > > > > > of producing the Arrow IPC format — when producing the "root" Plan
> > > > > > message, you must accumulate a list of the dependent serialized
> > > > > > submessages, placing the appropriate Buffer memory offset in the
> > Plan
> > > > > > message, like we do when producing the RecordBatch.buffers field.
> > > This
> > > > > > seems complicated to me as you must devise a custom binary protocol
> > > to
> > > > > > concatenate the serialized Plan and the messages it depends on
> > into a
> > > > > > single binary payload
> > > > > >
> > > > > > <ROOT PLAN>
> > > > > > <padding>
> > > > > > <Buffer 0>
> > > > > > <padding>
> > > > > > <Buffer 1>
> > > > > > <padding>
> > > > > > ...
> > > > > > <Buffer N - 1>
> > > > > >
> > > > > > (one purpose of FlatBufferBuilder is to spare you having to do this
> > > > > > yourself — some reasons we do it for the Arrow IPC format is
> > because
> > > > > > appending Arrow memory buffers directly to a FlatBufferBuilder
> > would
> > > > > > be inefficient — internal realloc calls — and Flatbuffers are
> > limited
> > > > > > to 2GB. Neither of these things are problems here)
> > > > > >
> > > > > > In general, I believe the output created by an IR producer should
> > be
> > > a
> > > > > > single serialized object without any out-of-band data sidecar —
> > this
> > > > > > is much simpler for implementers and we can still provide an
> > "escape
> > > > > > hatch" for user-defined operators and functions where the required
> > > > > > function/operator is passed opaquely as an embedded binary data.
> > > > >
> > > > >
> > > > >
> > > > > The serialization format (whether it is Flatbuffers or JSON, or
> > > > > > something else) should allow for data memoization, so if there is a
> > > > > > user-defined operator/function, or a relation that is used multiple
> > > > > > times throughout the query (potentially with a large schema), then
> > we
> > > > > > should ensure that the data need not be duplicated in the
> > > > > > serialization format unnecessarily — in Flatbuffers, you can
> > achieve
> > > > > > this by reusing offsets, but we could devise the data structures to
> > > > > > make the memoization of "expensive" objects more explicit.
> > > > > >
> > > > >
> > > > > I think this is something that would need to be explicitly encoded in
> > > > > the structures themselves if it's a hard requirement. I don't think
> > > this
> > > > > should block
> > > > > a prototype producer/consumer.
> > > > >
> > > > > Is there something in the second PR/design that precludes the reuse
> > of
> > > > > offsets?
> > > > > To my eye, the flatbuffers offset reuse mechanism works just as well
> > > > there.
> > > > >
> > > > >
> > > > > > I additionally think that it is important to provide as much
> > built-in
> > > > > > support for "canonical" operators/functions (such as the ones
> > > > > > implemented commonly by SQL engines) as possible, and to liberally
> > > > > > expand the catalogue of "built-in" capabilities. I would still
> > prefer
> > > > > > to have large unions/enums of built-in operators/functions and to
> > > > > > expand those unions/enums to accommodate new things when it is
> > > > > > demonstrated that there is a need to standardize things between
> > > > > > producers/consumers.
> > > > > >
> > > > >
> > > > > I think there's a middle ground where we add a bit of structure
> > > > (something
> > > > > like
> > > > > a descriptor from the first PR) to indicate whether a thing is
> > built-in
> > > > vs
> > > > > user-defined.
> > > > > It looks like Ben has pushed something like this to his PR.
> > > > >
> > > > > With that scheme, we have both flexibility and a small set of special
> > > > > builtins that make up
> > > > > a statically typed set for expressions and relational operators.
> > > > >
> > > > > I would really like to vet this PR with a prototype this week,
> > > > > to see whether we need to revisit any major choices. I don't think
> > > we'll
> > > > be
> > > > > able to
> > > > > anticipate all the consequences until we write some code.
> > > > >
> > > > >
> > > > > >
> > > > > > One of the beneficial properties of the Union/Enum approach for the
> > > > > > operator/function catalogues, is that when there are additions to
> > > > > > those enums, the generated Flatbuffers files will cause many
> > language
> > > > > > compilers to warn or error on unhandled enum cases. If all
> > > > > > function/operator names are strings, then you are essentially
> > > > > > reimplementing the functionality provided by enums by hand. I
> > > > > > initially used strings for all function references in my original
> > > > > > prototype, but I now think that using an enum for "built-ins" would
> > > be
> > > > > > superior (because of the code-generated enum interface) and not a
> > > > > > premature optimization.
> > > > > >
> > > > > > [1]: https://github.com/apache/arrow/pull/10934
> > > > > >
> > > > > > On Fri, Aug 13, 2021 at 11:26 PM Phillip Cloud <cp...@gmail.com>
> > > > > wrote:
> > > > > > >
> > > > > > > Hey all,
> > > > > > >
> > > > > > > Just wanted to give an update on the effort here.
> > > > > > >
> > > > > > > Ben Kietzman has created an alternative proposal to the initial
> > > > design
> > > > > > [1].
> > > > > > > It largely overlaps with the original, but differs in a few
> > > important
> > > > > > ways:
> > > > > > >
> > > > > > > * A big focus of the design is on flexibility, allowing
> > producers,
> > > > > > > consumers and ultimately end users of those systems the ability
> > to
> > > > > define
> > > > > > > custom operations in the graph.
> > > > > > > * There are very few predefined relational operations (project,
> > > > filter,
> > > > > > > join and a handful of others).
> > > > > > > * There are only 3 types of value expressions: literals, field
> > > > > > references,
> > > > > > > and function calls.
> > > > > > > * The model of evaluation is one that requires a final sink node,
> > > to
> > > > > > > indicate where the record batches from child relations end up (a
> > > > file,
> > > > > a
> > > > > > > socket, an in-memory buffer, etc).
> > > > > > >
> > > > > > > I've added notes [2] to the original Google doc (under the
> > > > Alternative
> > > > > > > Design Notes subheading), and a few pseudocode examples.
> > > > > > > Unfortunately, these went out of date as soon as Ben pushed the
> > PR
> > > > [3],
> > > > > > so
> > > > > > > I need to update those to reflect his changes. Regardless,
> > > > > > > the design is broadly the same, so it should still give a good
> > > > > indication
> > > > > > > of the details of the design.
> > > > > > >
> > > > > > > There are a decent number of review comments on the original PR
> > > that
> > > > I
> > > > > > plan
> > > > > > > to port over where they are still relevant.
> > > > > > > I also plan on adding support for window functions either tonight
> > > or
> > > > on
> > > > > > > Monday.
> > > > > > >
> > > > > > > Please review this design at your earliest convenience. Since
> > > > there's a
> > > > > > > fairly concrete set of types in flatbuffers that
> > > > > > > we can look at, ideally we can center discussion around the
> > details
> > > > in
> > > > > > the
> > > > > > > PR.
> > > > > > >
> > > > > > > Thanks!
> > > > > > >
> > > > > > > [1]: https://github.com/apache/arrow/pull/10856
> > > > > > > [2]:
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > https://docs.google.com/document/d/1C_XVOG7iFkl6cgWWMyzUoIjfKt-X2UxqagPJrla0bAE/edit#heading=h.4tfbbtaqzu13
> > > > > > > [3]: https://github.com/apache/arrow/pull/10934
> > > > > > >
> > > > > > > On Thu, Aug 12, 2021 at 3:55 PM Julian Hyde <
> > > jhyde.apache@gmail.com>
> > > > > > wrote:
> > > > > > >
> > > > > > > > > Wes wrote:
> > > > > > > > >
> > > > > > > > > Supporting this kind of intra-application engine
> > > > > > > > > heterogeneity is one of the motivations for the project.
> > > > > > > >
> > > > > > > > +1
> > > > > > > >
> > > > > > > > The data format is the natural interface between tasks.
> > (Defining
> > > > > > “task”
> > > > > > > > here as “something that is programmed using the IR”.) That is
> > > > Arrow’s
> > > > > > > > strength.
> > > > > > > >
> > > > > > > > So I think the IR should describe what each task should do, and
> > > > tasks
> > > > > > > > should be fairly small. Not whole relational operators,
> > operating
> > > > on
> > > > > > whole
> > > > > > > > tables, but pieces of relational operators, operating on
> > batches
> > > or
> > > > > > > > sequences of batches.
> > > > > > > >
> > > > > > > > Elsethread, someone mentioned the LoLePop concept and the
> > > > > > Kohn/Leis/Neuman
> > > > > > > > paper [1]. The LoLePop concept sounds good for our purposes.
> > > > > > > >
> > > > > > > > Julian
> > > > > > > >
> > > > > > > > [1] https://db.in.tum.de/~kohn/papers/lolepops-sigmod21.pdf
> > > > > > > >
> > > > > > > >
> > > > > > > > > On Aug 12, 2021, at 5:19 AM, Wes McKinney <
> > wesmckinn@gmail.com
> > > >
> > > > > > wrote:
> > > > > > > > >
> > > > > > > > > On Wed, Aug 11, 2021 at 11:22 PM Phillip Cloud <
> > > > cpcloud@gmail.com>
> > > > > > > > wrote:
> > > > > > > > >>
> > > > > > > > >> On Wed, Aug 11, 2021 at 4:48 PM Jorge Cardoso Leitão <
> > > > > > > > >> jorgecarleitao@gmail.com> wrote:
> > > > > > > > >>
> > > > > > > > >>> Couple of questions
> > > > > > > > >>>
> > > > > > > > >>> 1. Is the goal that IRs have equal semantics, i.e. given
> > > > > > (IR,data), the
> > > > > > > > >>> operation "(IR,data) - engine -> result" MUST be the same
> > for
> > > > all
> > > > > > > > "engine"?
> > > > > > > > >>>
> > > > > > > > >>
> > > > > > > > >> I think that might be a non-starter for mundane reasons:
> > > there's
> > > > > > > > probably
> > > > > > > > >> at least two engines
> > > > > > > > >> that disagree on the result of sum(x) where x is a floating
> > > > point
> > > > > > > > column.
> > > > > > > > >>
> > > > > > > > >>
> > > > > > > > >>> 2. if yes, imo we may need to worry about:
> > > > > > > > >>> * a definition of equality that implementations agree on.
> > > > > > > > >>> * agreement over what the semantics look like. For example,
> > > do
> > > > we
> > > > > > use
> > > > > > > > >>> kleene logic for AND and OR?
> > > > > > > > >>>
> > > > > > > > >>
> > > > > > > > >> WRT Kleene logic my thoughts are that the IR should support
> > > both
> > > > > > Kleene
> > > > > > > > and
> > > > > > > > >> non-Kleene,
> > > > > > > > >> and producers can choose their desired semantics.
> > > > > > > > >>
> > > > > > > > >> Ibis for example, would override the `&` operator in `a & b`
> > > to
> > > > > > produce
> > > > > > > > >> `KleeneAnd(Column(a), Column(b))`.
> > > > > > > > >>
> > > > > > > > >>
> > > > > > > > >>>
> > > > > > > > >>> To try to understand the gist, let's pick an aggregated
> > count
> > > > > over
> > > > > > a
> > > > > > > > >>> column: engines often do partial counts over partitions
> > > > followed
> > > > > > by a
> > > > > > > > final
> > > > > > > > >>> "sum" over the partial counts. Is the idea that the query
> > > > engine
> > > > > > would
> > > > > > > > >>> communicate with the compute engine via 2 IRs where one is
> > > > "count
> > > > > > me
> > > > > > > > these"
> > > > > > > > >>> the other is "sum me these"?
> > > > > > > > >>>
> > > > > > > > >>> Best,
> > > > > > > > >>> Jorge
> > > > > > > > >>>
> > > > > > > > >>
> > > > > > > > >> Not in its current incarnation.
> > > > > > > > >>
> > > > > > > > >> The idea is that the IR producer communicates a desire to
> > > > count(x)
> > > > > > to a
> > > > > > > > >> consumer, and  it's up to the consumer to figure out how to
> > > turn
> > > > > > that
> > > > > > > > count
> > > > > > > > >> into something that makes sense for itself. In your example
> > > > > that's a
> > > > > > > > series
> > > > > > > > >> of partial counts followed by a sum.
> > > > > > > > >>
> > > > > > > > >
> > > > > > > > > That said, I think there is a valid use case here where a
> > > system
> > > > > > might
> > > > > > > > > make use of different engines to execute different
> > (composable)
> > > > > > layers
> > > > > > > > > of a complex query.
> > > > > > > > >
> > > > > > > > > For example:
> > > > > > > > >
> > > > > > > > > * suppose you want to scan and do predicate pushdown on an
> > > > unusual
> > > > > > > > > data source that is only accessible from one particular
> > engine
> > > > but
> > > > > > > > > * you need to do some analytical operation with the scan
> > > results
> > > > > that
> > > > > > > > > is only supported by another engine
> > > > > > > > >
> > > > > > > > > You could decompose the query into two stages with an IR
> > > > relational
> > > > > > > > > expression for each stage and use then the engines together
> > to
> > > > > > > > > accomplish what you need (of course, you would need an
> > > > > orchestration
> > > > > > > > > layer to handle plumbing the query engine inputs and outputs
> > > > > together
> > > > > > > > > as Arrow streams). Supporting this kind of intra-application
> > > > engine
> > > > > > > > > heterogeneity is one of the motivations for the project.
> > > > > > > > >
> > > > > > > > >>>
> > > > > > > > >>>
> > > > > > > > >>>
> > > > > > > > >>>
> > > > > > > > >>>
> > > > > > > > >>> On Wed, Aug 11, 2021 at 6:10 PM Phillip Cloud <
> > > > cpcloud@gmail.com
> > > > > >
> > > > > > > > wrote:
> > > > > > > > >>>
> > > > > > > > >>>> Thanks Wes,
> > > > > > > > >>>>
> > > > > > > > >>>> Great to be back working on Arrow again and engaging with
> > > the
> > > > > > > > community.
> > > > > > > > >>> I
> > > > > > > > >>>> am really excited about this effort.
> > > > > > > > >>>>
> > > > > > > > >>>> I think there are a number of concerns I see as important
> > to
> > > > > > address
> > > > > > > > in
> > > > > > > > >>> the
> > > > > > > > >>>> compute IR proposal:
> > > > > > > > >>>>
> > > > > > > > >>>> 1. Requirement for output types.
> > > > > > > > >>>>
> > > > > > > > >>>> I think that so far there's been many reasons for
> > requiring
> > > > > > > > conforming IR
> > > > > > > > >>>> producers and consumers to adhere to output types, but I
> > > > haven't
> > > > > > seen
> > > > > > > > a
> > > > > > > > >>>> strong rationale for keeping them optional (in the
> > semantic
> > > > > > sense, not
> > > > > > > > >>> WRT
> > > > > > > > >>>> any particular serialization format's representation of
> > > > > > optionality).
> > > > > > > > >>>>
> > > > > > > > >>>> I think a design that includes unambiguous semantics for
> > > > output
> > > > > > types
> > > > > > > > (a
> > > > > > > > >>>> consumer must produce a value of the requested type or
> > it's
> > > an
> > > > > > > > >>>> error/non-conforming) is simpler to reason about for
> > > > producers,
> > > > > > and
> > > > > > > > >>>> provides a strong guarantee for end users (humans or
> > > machines
> > > > > > > > >>> constructing
> > > > > > > > >>>> IR from an API and expecting the thing they ask for back
> > > from
> > > > > the
> > > > > > IR
> > > > > > > > >>>> consumer).
> > > > > > > > >>>>
> > > > > > > > >>>> 2. Flexibility
> > > > > > > > >>>>
> > > > > > > > >>>> The current PR is currently unable to support what I think
> > > are
> > > > > > killer
> > > > > > > > >>>> features of the IR: custom operators (relational or
> > column)
> > > > and
> > > > > > UDFs.
> > > > > > > > In
> > > > > > > > >>> my
> > > > > > > > >>>> mind, on top of the generalized compute description that
> > the
> > > > IR
> > > > > > > > offers,
> > > > > > > > >>> the
> > > > > > > > >>>> ability for producers and consumers of IR to extend the IR
> > > > > without
> > > > > > > > >>> needing
> > > > > > > > >>>> to modify Arrow or depend on anything except the format is
> > > > > itself
> > > > > > > > >>> something
> > > > > > > > >>>> that is necessary to gain adoption.
> > > > > > > > >>>>
> > > > > > > > >>>> Developers will need to build custom relational operators
> > > > (e.g.,
> > > > > > > > scans of
> > > > > > > > >>>> backends that don't exist anywhere for which a user has
> > code
> > > > to
> > > > > > > > >>> implement)
> > > > > > > > >>>> and custom functions (anything operating on a column that
> > > > > doesn't
> > > > > > > > already
> > > > > > > > >>>> exist, really). Furthermore, I think we can actually drive
> > > > > > building an
> > > > > > > > >>>> Arrow consumer using the same API that an end user would
> > use
> > > > to
> > > > > > extend
> > > > > > > > >>> the
> > > > > > > > >>>> IR.
> > > > > > > > >>>>
> > > > > > > > >>>> 3. Window Functions
> > > > > > > > >>>>
> > > > > > > > >>>> Window functions are, I think, an important part of the IR
> > > > value
> > > > > > > > >>>> proposition, as they are one of the more complex operators
> > > in
> > > > > > > > databases.
> > > > > > > > >>> I
> > > > > > > > >>>> think we need to have something in the initial IR proposal
> > > to
> > > > > > support
> > > > > > > > >>> these
> > > > > > > > >>>> operations.
> > > > > > > > >>>>
> > > > > > > > >>>> 4. Non relational Joins
> > > > > > > > >>>>
> > > > > > > > >>>> Things like as-of join and window join operators aren't
> > yet
> > > > > > fleshed
> > > > > > > > out
> > > > > > > > >>> in
> > > > > > > > >>>> the IR, and I'm not sure they should be in scope for the
> > > > initial
> > > > > > > > >>> prototype.
> > > > > > > > >>>> I think once we settle on a design, we can work the design
> > > of
> > > > > > these
> > > > > > > > >>>> particular operators out during the initial prototype. I
> > > think
> > > > > the
> > > > > > > > >>>> specification of these operators should basically be PR #2
> > > > after
> > > > > > the
> > > > > > > > >>>> initial design lands.
> > > > > > > > >>>>
> > > > > > > > >>>> # Order of Work
> > > > > > > > >>>>
> > > > > > > > >>>> 1. Nail down the design. Anything else is a non-starter.
> > > > > > > > >>>>
> > > > > > > > >>>> 2. Prototype an IR producer using Ibis
> > > > > > > > >>>>
> > > > > > > > >>>> Ibis is IMO a good candidate for a first IR producer as it
> > > > has a
> > > > > > > > number
> > > > > > > > >>> of
> > > > > > > > >>>> desirable properties that make prototyping faster and
> > allow
> > > > for
> > > > > > us to
> > > > > > > > >>>> refine the design of the IR as needed based on how the
> > > > > > implementation
> > > > > > > > >>> goes:
> > > > > > > > >>>> * It's written in Python so it has native support for
> > nearly
> > > > all
> > > > > > of
> > > > > > > > >>>> flatbuffers' features without having to creating bindings
> > to
> > > > > C++.
> > > > > > > > >>>> * There's already a set of rules for type checking, as
> > well
> > > as
> > > > > > APIs
> > > > > > > > for
> > > > > > > > >>>> constructing expression trees, which means we don't need
> > to
> > > > > worry
> > > > > > > > about
> > > > > > > > >>>> building a type checker for the prototype.
> > > > > > > > >>>>
> > > > > > > > >>>> 3. Prototype an IR consumer in C++
> > > > > > > > >>>>
> > > > > > > > >>>> I think in parallel to the producer prototype we can
> > further
> > > > > > inform
> > > > > > > > the
> > > > > > > > >>>> design from the consumer side by prototyping an IR
> > consumer
> > > in
> > > > > > C++ . I
> > > > > > > > >>> know
> > > > > > > > >>>> Ben Kietzman has expressed interest in working on this.
> > > > > > > > >>>>
> > > > > > > > >>>> Very interested to hear others' thoughts.
> > > > > > > > >>>>
> > > > > > > > >>>> -Phillip
> > > > > > > > >>>>
> > > > > > > > >>>> On Tue, Aug 10, 2021 at 10:56 AM Wes McKinney <
> > > > > > wesmckinn@gmail.com>
> > > > > > > > >>> wrote:
> > > > > > > > >>>>
> > > > > > > > >>>>> Thank you for all the feedback and comments on the
> > > document.
> > > > > I'm
> > > > > > on
> > > > > > > > >>>>> vacation this week, so I'm delayed responding to
> > > everything,
> > > > > but
> > > > > > I
> > > > > > > > >>>>> will get to it as quickly as I can. I will be at VLDB in
> > > > > > Copenhagen
> > > > > > > > >>>>> next week if anyone would like to chat in person about
> > it,
> > > > and
> > > > > > we can
> > > > > > > > >>>>> relay the content of any discussions back to the
> > > > > > document/PR/e-mail
> > > > > > > > >>>>> thread.
> > > > > > > > >>>>>
> > > > > > > > >>>>> I know that Phillip Cloud expressed interest in working
> > on
> > > > the
> > > > > > PR and
> > > > > > > > >>>>> helping work through many of the details, so I'm glad to
> > > have
> > > > > the
> > > > > > > > >>>>> help. If there are others who would like to work on the
> > PR
> > > or
> > > > > dig
> > > > > > > > into
> > > > > > > > >>>>> the details, please let me know. We might need to figure
> > > out
> > > > > how
> > > > > > to
> > > > > > > > >>>>> accommodate "many cooks" by setting up the ComputeIR
> > > project
> > > > > > > > somewhere
> > > > > > > > >>>>> separate from the format/ directory to permit it to exist
> > > in
> > > > a
> > > > > > > > >>>>> Work-In-Progress status for a period of time until we
> > work
> > > > > > through
> > > > > > > > the
> > > > > > > > >>>>> various details and design concerns.
> > > > > > > > >>>>>
> > > > > > > > >>>>> Re Julian's comment
> > > > > > > > >>>>>
> > > > > > > > >>>>>> The biggest surprise is that this language does full
> > > > > relational
> > > > > > > > >>>>> operations. I was expecting that it would do fragments of
> > > the
> > > > > > > > >>> operations.
> > > > > > > > >>>>>
> > > > > > > > >>>>> There's a related but different (yet still interesting
> > and
> > > > > > worthy of
> > > > > > > > >>>>> analysis) problem of creating an "engine language" that
> > > > > describes
> > > > > > > > more
> > > > > > > > >>>>> mechanically the constituent parts of implementing the
> > > > > relational
> > > > > > > > >>>>> operators. To create a functional computation language
> > with
> > > > > > concrete
> > > > > > > > >>>>> Arrow data structures as a top-level primitive sounds
> > like
> > > an
> > > > > > > > >>>>> interesting research area where I could see something
> > > > > developing
> > > > > > > > >>>>> eventually.
> > > > > > > > >>>>>
> > > > > > > > >>>>> The main problem I'm interested in solving right now is
> > > > > enabling
> > > > > > > > front
> > > > > > > > >>>>> ends that have sufficient understanding of relational
> > > algebra
> > > > > and
> > > > > > > > data
> > > > > > > > >>>>> frame operations to talk to engines without having to go
> > > > > > backwards
> > > > > > > > >>>>> from their logical query plans to SQL. So as mentioned in
> > > the
> > > > > > > > >>>>> document, being able to faithfully carry the relational
> > > > > operator
> > > > > > node
> > > > > > > > >>>>> information generated by Calcite or Ibis or another
> > system
> > > > > would
> > > > > > be
> > > > > > > > >>>>> super useful. Defining the semantics of various kinds of
> > > > > > user-defined
> > > > > > > > >>>>> functions would also be helpful to standardize the
> > > > > > > > >>>>> engine-to-user-language UDF/extension interface.
> > > > > > > > >>>>>
> > > > > > > > >>>>> On Tue, Aug 10, 2021 at 2:36 PM Dimitri Vorona <
> > > > > > alendit@gmail.com>
> > > > > > > > >>>> wrote:
> > > > > > > > >>>>>>
> > > > > > > > >>>>>> Hi Wes,
> > > > > > > > >>>>>>
> > > > > > > > >>>>>> cool initiative! Reminded me of "Building Advanced SQL
> > > > > Analytics
> > > > > > > > From
> > > > > > > > >>>>>> Low-Level Plan Operators" from SIGMOD 2021 (
> > > > > > > > >>>>>> http://db.in.tum.de/~kohn/papers/lolepops-sigmod21.pdf)
> > > > which
> > > > > > > > >>>> proposes a
> > > > > > > > >>>>>> set of building block for advanced aggregation.
> > > > > > > > >>>>>>
> > > > > > > > >>>>>> Cheers,
> > > > > > > > >>>>>> Dimitri.
> > > > > > > > >>>>>>
> > > > > > > > >>>>>> On Thu, Aug 5, 2021 at 7:59 PM Julian Hyde <
> > > > > > jhyde.apache@gmail.com>
> > > > > > > > >>>>> wrote:
> > > > > > > > >>>>>>
> > > > > > > > >>>>>>> Wes,
> > > > > > > > >>>>>>>
> > > > > > > > >>>>>>> Thanks for this. I’ve added comments to the doc and to
> > > the
> > > > > PR.
> > > > > > > > >>>>>>>
> > > > > > > > >>>>>>> The biggest surprise is that this language does full
> > > > > relational
> > > > > > > > >>>>>>> operations. I was expecting that it would do fragments
> > of
> > > > the
> > > > > > > > >>>>> operations.
> > > > > > > > >>>>>>> Consider join. A distributed hybrid hash join needs to
> > > > > > partition
> > > > > > > > >>> rows
> > > > > > > > >>>>> into
> > > > > > > > >>>>>>> output buffers based on a hash key, build hash tables,
> > > > probe
> > > > > > into
> > > > > > > > >>>> hash
> > > > > > > > >>>>>>> tables, scan hash tables for untouched “outer”rows, and
> > > so
> > > > > > forth.
> > > > > > > > >>>>>>>
> > > > > > > > >>>>>>> I see Arrow’s compute as delivering each of those
> > > > operations,
> > > > > > > > >>> working
> > > > > > > > >>>>> on
> > > > > > > > >>>>>>> perhaps a batch at a time, or a sequence of batches,
> > with
> > > > > some
> > > > > > > > >>> other
> > > > > > > > >>>>> system
> > > > > > > > >>>>>>> coordinating those tasks. So I would expect to see
> > > Arrow’s
> > > > > > compute
> > > > > > > > >>>>> language
> > > > > > > > >>>>>>> mainly operating on batches rather than a table
> > > > abstraction.
> > > > > > > > >>>>>>>
> > > > > > > > >>>>>>> Julian
> > > > > > > > >>>>>>>
> > > > > > > > >>>>>>>
> > > > > > > > >>>>>>>> On Aug 2, 2021, at 5:16 PM, Wes McKinney <
> > > > > wesmckinn@gmail.com
> > > > > > >
> > > > > > > > >>>>> wrote:
> > > > > > > > >>>>>>>>
> > > > > > > > >>>>>>>> hi folks,
> > > > > > > > >>>>>>>>
> > > > > > > > >>>>>>>> This idea came up in passing in the past -- given that
> > > > there
> > > > > > are
> > > > > > > > >>>>>>>> multiple independent efforts to develop Arrow-native
> > > query
> > > > > > > > >>> engines
> > > > > > > > >>>>>>>> (and surely many more to come), it seems like it would
> > > be
> > > > > > > > >>> valuable
> > > > > > > > >>>> to
> > > > > > > > >>>>>>>> have a way to enable user languages (like Java,
> > Python,
> > > R,
> > > > > or
> > > > > > > > >>> Rust,
> > > > > > > > >>>>>>>> for example) to communicate with backend computing
> > > engines
> > > > > > (like
> > > > > > > > >>>>>>>> DataFusion, or new computing capabilities being built
> > in
> > > > the
> > > > > > > > >>> Arrow
> > > > > > > > >>>>> C++
> > > > > > > > >>>>>>>> library) in a fashion that is "lower-level" than SQL
> > and
> > > > > > > > >>>> specialized
> > > > > > > > >>>>>>>> to Arrow's type system. So rather than leaving it to a
> > > SQL
> > > > > > > > >>> parser /
> > > > > > > > >>>>>>>> analyzer framework to generate an expression tree of
> > > > > > relational
> > > > > > > > >>>>>>>> operators and then translate that to an Arrow-native
> > > > > > > > >>>> compute-engine's
> > > > > > > > >>>>>>>> internal grammer, a user framework could provide the
> > > > desired
> > > > > > > > >>>>>>>> Arrow-native expression tree / data manipulations
> > > directly
> > > > > and
> > > > > > > > >>> skip
> > > > > > > > >>>>>>>> the SQL altogether.
> > > > > > > > >>>>>>>>
> > > > > > > > >>>>>>>> The idea of creating a "serialized intermediate
> > > > > representation
> > > > > > > > >>>> (IR)"
> > > > > > > > >>>>>>>> for Arrow compute operations would be to serve use
> > cases
> > > > > large
> > > > > > > > >>> and
> > > > > > > > >>>>>>>> small -- from the most complex TPC-* or time series
> > > > database
> > > > > > > > >>> query
> > > > > > > > >>>> to
> > > > > > > > >>>>>>>> the most simple array predicate/filter sent with an
> > RPC
> > > > > > request
> > > > > > > > >>>> using
> > > > > > > > >>>>>>>> Arrow Flight. It is deliberately language- and
> > > > > > engine-agnostic,
> > > > > > > > >>>> with
> > > > > > > > >>>>>>>> the only commonality across usages being the Arrow
> > > > columnar
> > > > > > > > >>> format
> > > > > > > > >>>>>>>> (schemas and array types). This would be better than
> > > > leaving
> > > > > > it
> > > > > > > > >>> to
> > > > > > > > >>>>>>>> each application to develop its own bespoke expression
> > > > > > > > >>>>> representations
> > > > > > > > >>>>>>>> for its needs.
> > > > > > > > >>>>>>>>
> > > > > > > > >>>>>>>> I spent a while thinking about this and wrote up a
> > brain
> > > > > dump
> > > > > > RFC
> > > > > > > > >>>>>>>> document [1] and accompanying pull request [2] that
> > > makes
> > > > > the
> > > > > > > > >>>>> possibly
> > > > > > > > >>>>>>>> controversial choice of using Flatbuffers to represent
> > > the
> > > > > > > > >>>> serialized
> > > > > > > > >>>>>>>> IR. I discuss the rationale for the choice of
> > > Flatbuffers
> > > > in
> > > > > > the
> > > > > > > > >>>> RFC
> > > > > > > > >>>>>>>> document. This PR is obviously deficient in many
> > regards
> > > > > > > > >>>> (incomplete,
> > > > > > > > >>>>>>>> hacky, or unclear in places), and will need some help
> > > from
> > > > > > others
> > > > > > > > >>>> to
> > > > > > > > >>>>>>>> flesh out. I suspect that actually implementing the IR
> > > > will
> > > > > be
> > > > > > > > >>>>>>>> necessary to work out many of the low-level details.
> > > > > > > > >>>>>>>>
> > > > > > > > >>>>>>>> Note that this IR is intended to be more of a
> > "superset"
> > > > > > project
> > > > > > > > >>>> than
> > > > > > > > >>>>>>>> a "lowest common denominator". So there may be things
> > > that
> > > > > it
> > > > > > > > >>>>> includes
> > > > > > > > >>>>>>>> which are only available in some engines (e.g. engines
> > > > that
> > > > > > have
> > > > > > > > >>>>>>>> specialized handling of time series data).
> > > > > > > > >>>>>>>>
> > > > > > > > >>>>>>>> As some of my personal motivation for the project,
> > > > > concurrent
> > > > > > > > >>> with
> > > > > > > > >>>>> the
> > > > > > > > >>>>>>>> genesis of Apache Arrow, I started a Python project
> > > called
> > > > > > Ibis
> > > > > > > > >>> [3]
> > > > > > > > >>>>>>>> (which is similar to R's dplyr project) which serves
> > as
> > > a
> > > > > > "Python
> > > > > > > > >>>>>>>> analytical query IR builder" that is capable of
> > > generating
> > > > > > most
> > > > > > > > >>> of
> > > > > > > > >>>>> the
> > > > > > > > >>>>>>>> SQL standard, targeting many different SQL dialects
> > and
> > > > > other
> > > > > > > > >>>>> backends
> > > > > > > > >>>>>>>> (like pandas). Microsoft ([4]) and Google ([5]) have
> > > used
> > > > > this
> > > > > > > > >>>>> library
> > > > > > > > >>>>>>>> as a "many-SQL" middleware. As such, I would like to
> > be
> > > > able
> > > > > > to
> > > > > > > > >>>>>>>> translate between the in-memory "logical query" data
> > > > > > structures
> > > > > > > > >>> in
> > > > > > > > >>>> a
> > > > > > > > >>>>>>>> library like Ibis to a serialized format that can be
> > > > > executed
> > > > > > by
> > > > > > > > >>>> many
> > > > > > > > >>>>>>>> different Arrow-native query engines. The expression
> > > > > > primitives
> > > > > > > > >>>>>>>> available in Ibis should serve as helpful test cases,
> > > too.
> > > > > > > > >>>>>>>>
> > > > > > > > >>>>>>>> I look forward to the community's comments on the RFC
> > > > > document
> > > > > > > > >>> [1]
> > > > > > > > >>>>> and
> > > > > > > > >>>>>>>> pull request [2] -- I realize that arriving at
> > consensus
> > > > on
> > > > > a
> > > > > > > > >>>> complex
> > > > > > > > >>>>>>>> and ambitious project like this can be challenging so
> > I
> > > > > > recommend
> > > > > > > > >>>>>>>> spending time on the "non-goals" section in the RFC
> > and
> > > > ask
> > > > > > > > >>>> questions
> > > > > > > > >>>>>>>> if you are unclear about the scope of what problems
> > this
> > > > is
> > > > > > > > >>> trying
> > > > > > > > >>>> to
> > > > > > > > >>>>>>>> solve. I would be happy to give Edit access on the RFC
> > > > > > document
> > > > > > > > >>> to
> > > > > > > > >>>>>>>> others and would consider ideas about how to move
> > > forward
> > > > > with
> > > > > > > > >>>>>>>> something that is able to be implemented by different
> > > > Arrow
> > > > > > > > >>>> libraries
> > > > > > > > >>>>>>>> in the reasonably near future.
> > > > > > > > >>>>>>>>
> > > > > > > > >>>>>>>> Thanks,
> > > > > > > > >>>>>>>> Wes
> > > > > > > > >>>>>>>>
> > > > > > > > >>>>>>>> [1]:
> > > > > > > > >>>>>>>
> > > > > > > > >>>>>
> > > > > > > > >>>>
> > > > > > > > >>>
> > > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > https://docs.google.com/document/d/1C_XVOG7iFkl6cgWWMyzUoIjfKt-X2UxqagPJrla0bAE/edit#
> > > > > > > > >>>>>>>> [2]: https://github.com/apache/arrow/pull/10856
> > > > > > > > >>>>>>>> [3]: https://ibis-project.org/
> > > > > > > > >>>>>>>> [4]:
> > > > http://cidrdb.org/cidr2021/papers/cidr2021_paper08.pdf
> > > > > > > > >>>>>>>> [5]:
> > > > > > > > >>>>>>>
> > > > > > > > >>>>>
> > > > > > > > >>>>
> > > > > > > > >>>
> > > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > https://cloud.google.com/blog/products/databases/automate-data-validation-with-dvt
> > > > > > > > >>>>>>>
> > > > > > > > >>>>>>>
> > > > > > > > >>>>>
> > > > > > > > >>>>
> > > > > > > > >>>
> > > > > > > >
> > > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >

Re: [DISCUSS] Developing an "Arrow Compute IR [Intermediate Representation]" to decouple language front ends from Arrow-native compute engines

Posted by Phillip Cloud <cp...@gmail.com>.

Hey everyone,

There's some interesting discussion around types and where their location
is in the current PR [1] (and in fact whether to store them at all).

It would be great to get some community feedback on this [2] part of the PR
in particular, because the choice of whether to store types at all has
important design consequences.

[1]: https://github.com/apache/arrow/pull/10934
[2]: https://github.com/apache/arrow/pull/10934/files#r697025313

On Fri, Aug 27, 2021 at 2:11 AM Micah Kornfield <em...@gmail.com>
wrote:

> As an FYI, Iceberg is also considering an IR in relation to view support
> [1].  I chimed in and pointed them to this thread and Wes's doc.  Phillip
> and Jacques chimed in there as well.
>
> [1]
>
> https://mail-archives.apache.org/mod_mbox/iceberg-dev/202108.mbox/%3CCAKRVfm6h6WxQtp5fj8Yj8XWR1wFe8VohOkPuoZZGK-UHPhtwjQ%40mail.gmail.com%3E
>
> On Thu, Aug 26, 2021 at 12:40 PM Phillip Cloud <cp...@gmail.com> wrote:
>
> > Thanks for the feedback Jacques, very helpful. In the latest version of
> the
> > PR, I've tried to incorporate nearly all of these points.
> >
> > - I've incorporated most of what you had for dereferencing operations
> into
> > the PR, and gotten rid of schemas except on Read/Write relations.
> > - With respect to properties, I've made a bunch more specific operators,
> > and kept user-defined things special case-y.
> > - I haven't incorporated anything close to physical-plan things, but I
> > think that's a good follow up PR. Having separate representations for
> > logical/physical plans seems like a waste of effort ultimately. I think
> we
> > can find a good balance.
> > - Agree on UDF support, I think that will have to evolve as the rest of
> the
> > spec evolves. UDFs will need language-dedicated effort given the large
> > variety of languages that folks will want to use to define functions.
> >
> > On a separate note, in an effort to move this project forward I'd like to
> > do one final round of code review from anyone interested and then merge
> the
> > PR after that.
> > This spec will be unstable for a while, until we can work out all the
> > design kinks and edge cases, and I think getting this initial spec in is
> > important to start that process.
> >
> >
> > On Mon, Aug 23, 2021 at 1:53 PM Jacques Nadeau <ja...@apache.org>
> wrote:
> >
> > > In a lucky turn of events, Phillip actually turned out to be in my neck
> > of
> > > the woods on Friday so we had a chance to sit down and discuss this. To
> > > help, I actually shared something I had been working on a few months
> ago
> > > independently (before this discussion started).
> > >
> > > For reference:
> > > Wes PR: https://github.com/apache/arrow/pull/10856
> > > Ben PR: https://github.com/apache/arrow/pull/10934
> > > Jacques PR: https://github.com/apache/arrow/pull/10979
> > >
> > > The high level points of feedback I have are:
> > >
> > >    - Ben PR feels too deconstructed. While I like the elegance and
> > >    symmetry, I believe this will lead to substantially more work in
> > moving
> > >    from serialization format to something closer to what a system would
> > > want
> > >    to manipulate/consume. The reality is that there are a lot of really
> > > known
> > >    things and specializing the representation for these things will
> > > ultimately
> > >    make things easier to program with without error and easier to
> debug.
> > > (For
> > >    example, imagine trying to inspect a plan in a debugging session
> with
> > > the
> > >    Ben representation.) We should embrace the known things in the
> > >    specification.
> > >    - I believe that it is a mistake for the inner workings of the plan
> to
> > >    ever use field names. Only input (e.g. file read) and Output (e.g.
> > > return
> > >    to user or write to file) need to have field names. For the rest of
> > the
> > >    system, using field ordinals (determinant whether nested or flat) is
> > > much
> > >    cleaner and is how most execution systems work. For example, in
> > Impala I
> > >    believe it is called a slot. As I noted in the PR, Calcite as an
> > > example is
> > >    entirely ordinal based at the algebra level. Rowtypes contain field
> > > names
> > >    but they are actually basically pointless. Field references use
> > > RexInputRef
> > >    with ordinal based and rules around column order output (e.g. what
> is
> > > the
> > >    field order of a join output) are determinant and done entirely at
> an
> > >    ordinal level. The only place where names should be used (besides
> > >    input/output) is in the case of map keys. In that case, the names
> are
> > >    actually data, as opposed to scheme metadata. This is why I propose
> a
> > >    strongly structured dereference operation [1].
> > >    - Properties should only be included in the serialization if they
> are
> > >    not easily re-derivable at plan consumption time. For example,
> you'll
> > > note
> > >    that I don't store schema information for relational operation. Each
> > >    function and relational operation should already know how a given
> > input
> > > is
> > >    transformed to a given output. Capturing this information in the
> > > plan/IR is
> > >    excessive. In many ways, I compare it to the early use of
> VectorLayout
> > > [2]
> > >    in Arrow schema. It may have provided some additional checksum of
> the
> > >    message but ultimately it caused more pain than it was worth (and
> thus
> > > we
> > >    removed it before formalizing the specification). For reference, in
> > the
> > >    context of Dremio, we used to actually do this, send schema
> > information
> > >    around for all operations. We removed it because in many cases
> > becoming
> > > the
> > >    majority of our internal plan serialization (imagine simple
> operations
> > > that
> > >    are carrying 1000s of fields).
> > >    - I suggest focusing on support for both logical and physical
> > >    representations. The moment you start talking about optimization
> > passes,
> > >    many of those would probably be better being done at the logical
> > level.
> > > The
> > >    overlap is really high.
> > >    - I think a lot more work must be done before introducing UDFs and
> > user
> > >    defined relational operations. I see one goal being the possibility
> of
> > >    there being three systems: A -> B -> C. A is a IR producer. C is a
> IR
> > >    consumer and B is a IR filter or translator. In this situation, B
> > > should be
> > >    able to operate and do optimizations on a plan even if if there are
> > > black
> > >    box user defined operations. Being able to know the
> > > properties-preservation
> > >    or not of these operations is important to making decisions. For
> > > example,
> > >    does a user defined relational operation maintain sortedness?
> > > Distribution?
> > >    Is a defined UDF idempotent? As such, I think the definition of
> those
> > > black
> > >    boxes should be much more structured. For example: it is a python
> > >    relational operation named X stored in Y that maintains properties
> 1,2
> > > and
> > >    disrupts property 3. Putting just a black box of bytes will
> > > substantially
> > >    reduce the compatibility and extensibility of the ecosystem of tools
> > >    working against IR. I'd note that I wouldn't expect this to be a
> > burden
> > > to
> > >    actual end users. By using sensible defaults, I still would expect
> an
> > > end
> > >    user tool to support arbitrary user defined operations.
> > >    - It might make sense to review the XML representation that Orca
> uses
> > >    [3]. I haven't looked at it recently but they had a strong goal of
> > >    decoupling for most of its life (to use in both Greenplum and Hawq).
> > It
> > >    could be the most mature/formal serialization of query plans
> > publically
> > >    available.
> > >
> > >
> > > [1]
> > >
> > >
> >
> https://github.com/apache/arrow/pull/10979/files#diff-e40fbc40cf7a131efd2cb098444931774cfad046b8665b38452258ffaa2e3423R34
> > > [2]
> > >
> > >
> >
> https://github.com/apache/arrow/commit/611a4b951e24f4f967c3d382a2027dc035fc37f0
> > > [3] https://github.com/greenplum-db/gporca
> > >
> > >
> > > On Tue, Aug 17, 2021 at 11:14 AM Phillip Cloud <cp...@gmail.com>
> > wrote:
> > >
> > > > On Tue, Aug 17, 2021 at 10:56 AM Wes McKinney <we...@gmail.com>
> > > wrote:
> > > >
> > > > > Looking at Ben's alternate PR [1], having an IR that leans heavily
> on
> > > > > memory references to an out-of-band data sidecar seems like an
> > > > > approach that would substantially ratchet up the implementation
> > > > > complexity as producing the IR would then have the level of
> > complexity
> > > > > of producing the Arrow IPC format — when producing the "root" Plan
> > > > > message, you must accumulate a list of the dependent serialized
> > > > > submessages, placing the appropriate Buffer memory offset in the
> Plan
> > > > > message, like we do when producing the RecordBatch.buffers field.
> > This
> > > > > seems complicated to me as you must devise a custom binary protocol
> > to
> > > > > concatenate the serialized Plan and the messages it depends on
> into a
> > > > > single binary payload
> > > > >
> > > > > <ROOT PLAN>
> > > > > <padding>
> > > > > <Buffer 0>
> > > > > <padding>
> > > > > <Buffer 1>
> > > > > <padding>
> > > > > ...
> > > > > <Buffer N - 1>
> > > > >
> > > > > (one purpose of FlatBufferBuilder is to spare you having to do this
> > > > > yourself — some reasons we do it for the Arrow IPC format is
> because
> > > > > appending Arrow memory buffers directly to a FlatBufferBuilder
> would
> > > > > be inefficient — internal realloc calls — and Flatbuffers are
> limited
> > > > > to 2GB. Neither of these things are problems here)
> > > > >
> > > > > In general, I believe the output created by an IR producer should
> be
> > a
> > > > > single serialized object without any out-of-band data sidecar —
> this
> > > > > is much simpler for implementers and we can still provide an
> "escape
> > > > > hatch" for user-defined operators and functions where the required
> > > > > function/operator is passed opaquely as an embedded binary data.
> > > >
> > > >
> > > >
> > > > The serialization format (whether it is Flatbuffers or JSON, or
> > > > > something else) should allow for data memoization, so if there is a
> > > > > user-defined operator/function, or a relation that is used multiple
> > > > > times throughout the query (potentially with a large schema), then
> we
> > > > > should ensure that the data need not be duplicated in the
> > > > > serialization format unnecessarily — in Flatbuffers, you can
> achieve
> > > > > this by reusing offsets, but we could devise the data structures to
> > > > > make the memoization of "expensive" objects more explicit.
> > > > >
> > > >
> > > > I think this is something that would need to be explicitly encoded in
> > > > the structures themselves if it's a hard requirement. I don't think
> > this
> > > > should block
> > > > a prototype producer/consumer.
> > > >
> > > > Is there something in the second PR/design that precludes the reuse
> of
> > > > offsets?
> > > > To my eye, the flatbuffers offset reuse mechanism works just as well
> > > there.
> > > >
> > > >
> > > > > I additionally think that it is important to provide as much
> built-in
> > > > > support for "canonical" operators/functions (such as the ones
> > > > > implemented commonly by SQL engines) as possible, and to liberally
> > > > > expand the catalogue of "built-in" capabilities. I would still
> prefer
> > > > > to have large unions/enums of built-in operators/functions and to
> > > > > expand those unions/enums to accommodate new things when it is
> > > > > demonstrated that there is a need to standardize things between
> > > > > producers/consumers.
> > > > >
> > > >
> > > > I think there's a middle ground where we add a bit of structure
> > > (something
> > > > like
> > > > a descriptor from the first PR) to indicate whether a thing is
> built-in
> > > vs
> > > > user-defined.
> > > > It looks like Ben has pushed something like this to his PR.
> > > >
> > > > With that scheme, we have both flexibility and a small set of special
> > > > builtins that make up
> > > > a statically typed set for expressions and relational operators.
> > > >
> > > > I would really like to vet this PR with a prototype this week,
> > > > to see whether we need to revisit any major choices. I don't think
> > we'll
> > > be
> > > > able to
> > > > anticipate all the consequences until we write some code.
> > > >
> > > >
> > > > >
> > > > > One of the beneficial properties of the Union/Enum approach for the
> > > > > operator/function catalogues, is that when there are additions to
> > > > > those enums, the generated Flatbuffers files will cause many
> language
> > > > > compilers to warn or error on unhandled enum cases. If all
> > > > > function/operator names are strings, then you are essentially
> > > > > reimplementing the functionality provided by enums by hand. I
> > > > > initially used strings for all function references in my original
> > > > > prototype, but I now think that using an enum for "built-ins" would
> > be
> > > > > superior (because of the code-generated enum interface) and not a
> > > > > premature optimization.
> > > > >
> > > > > [1]: https://github.com/apache/arrow/pull/10934
> > > > >
> > > > > On Fri, Aug 13, 2021 at 11:26 PM Phillip Cloud <cp...@gmail.com>
> > > > wrote:
> > > > > >
> > > > > > Hey all,
> > > > > >
> > > > > > Just wanted to give an update on the effort here.
> > > > > >
> > > > > > Ben Kietzman has created an alternative proposal to the initial
> > > design
> > > > > [1].
> > > > > > It largely overlaps with the original, but differs in a few
> > important
> > > > > ways:
> > > > > >
> > > > > > * A big focus of the design is on flexibility, allowing
> producers,
> > > > > > consumers and ultimately end users of those systems the ability
> to
> > > > define
> > > > > > custom operations in the graph.
> > > > > > * There are very few predefined relational operations (project,
> > > filter,
> > > > > > join and a handful of others).
> > > > > > * There are only 3 types of value expressions: literals, field
> > > > > references,
> > > > > > and function calls.
> > > > > > * The model of evaluation is one that requires a final sink node,
> > to
> > > > > > indicate where the record batches from child relations end up (a
> > > file,
> > > > a
> > > > > > socket, an in-memory buffer, etc).
> > > > > >
> > > > > > I've added notes [2] to the original Google doc (under the
> > > Alternative
> > > > > > Design Notes subheading), and a few pseudocode examples.
> > > > > > Unfortunately, these went out of date as soon as Ben pushed the
> PR
> > > [3],
> > > > > so
> > > > > > I need to update those to reflect his changes. Regardless,
> > > > > > the design is broadly the same, so it should still give a good
> > > > indication
> > > > > > of the details of the design.
> > > > > >
> > > > > > There are a decent number of review comments on the original PR
> > that
> > > I
> > > > > plan
> > > > > > to port over where they are still relevant.
> > > > > > I also plan on adding support for window functions either tonight
> > or
> > > on
> > > > > > Monday.
> > > > > >
> > > > > > Please review this design at your earliest convenience. Since
> > > there's a
> > > > > > fairly concrete set of types in flatbuffers that
> > > > > > we can look at, ideally we can center discussion around the
> details
> > > in
> > > > > the
> > > > > > PR.
> > > > > >
> > > > > > Thanks!
> > > > > >
> > > > > > [1]: https://github.com/apache/arrow/pull/10856
> > > > > > [2]:
> > > > > >
> > > > >
> > > >
> > >
> >
> https://docs.google.com/document/d/1C_XVOG7iFkl6cgWWMyzUoIjfKt-X2UxqagPJrla0bAE/edit#heading=h.4tfbbtaqzu13
> > > > > > [3]: https://github.com/apache/arrow/pull/10934
> > > > > >
> > > > > > On Thu, Aug 12, 2021 at 3:55 PM Julian Hyde <
> > jhyde.apache@gmail.com>
> > > > > wrote:
> > > > > >
> > > > > > > > Wes wrote:
> > > > > > > >
> > > > > > > > Supporting this kind of intra-application engine
> > > > > > > > heterogeneity is one of the motivations for the project.
> > > > > > >
> > > > > > > +1
> > > > > > >
> > > > > > > The data format is the natural interface between tasks.
> (Defining
> > > > > “task”
> > > > > > > here as “something that is programmed using the IR”.) That is
> > > Arrow’s
> > > > > > > strength.
> > > > > > >
> > > > > > > So I think the IR should describe what each task should do, and
> > > tasks
> > > > > > > should be fairly small. Not whole relational operators,
> operating
> > > on
> > > > > whole
> > > > > > > tables, but pieces of relational operators, operating on
> batches
> > or
> > > > > > > sequences of batches.
> > > > > > >
> > > > > > > Elsethread, someone mentioned the LoLePop concept and the
> > > > > Kohn/Leis/Neuman
> > > > > > > paper [1]. The LoLePop concept sounds good for our purposes.
> > > > > > >
> > > > > > > Julian
> > > > > > >
> > > > > > > [1] https://db.in.tum.de/~kohn/papers/lolepops-sigmod21.pdf
> > > > > > >
> > > > > > >
> > > > > > > > On Aug 12, 2021, at 5:19 AM, Wes McKinney <
> wesmckinn@gmail.com
> > >
> > > > > wrote:
> > > > > > > >
> > > > > > > > On Wed, Aug 11, 2021 at 11:22 PM Phillip Cloud <
> > > cpcloud@gmail.com>
> > > > > > > wrote:
> > > > > > > >>
> > > > > > > >> On Wed, Aug 11, 2021 at 4:48 PM Jorge Cardoso Leitão <
> > > > > > > >> jorgecarleitao@gmail.com> wrote:
> > > > > > > >>
> > > > > > > >>> Couple of questions
> > > > > > > >>>
> > > > > > > >>> 1. Is the goal that IRs have equal semantics, i.e. given
> > > > > (IR,data), the
> > > > > > > >>> operation "(IR,data) - engine -> result" MUST be the same
> for
> > > all
> > > > > > > "engine"?
> > > > > > > >>>
> > > > > > > >>
> > > > > > > >> I think that might be a non-starter for mundane reasons:
> > there's
> > > > > > > probably
> > > > > > > >> at least two engines
> > > > > > > >> that disagree on the result of sum(x) where x is a floating
> > > point
> > > > > > > column.
> > > > > > > >>
> > > > > > > >>
> > > > > > > >>> 2. if yes, imo we may need to worry about:
> > > > > > > >>> * a definition of equality that implementations agree on.
> > > > > > > >>> * agreement over what the semantics look like. For example,
> > do
> > > we
> > > > > use
> > > > > > > >>> kleene logic for AND and OR?
> > > > > > > >>>
> > > > > > > >>
> > > > > > > >> WRT Kleene logic my thoughts are that the IR should support
> > both
> > > > > Kleene
> > > > > > > and
> > > > > > > >> non-Kleene,
> > > > > > > >> and producers can choose their desired semantics.
> > > > > > > >>
> > > > > > > >> Ibis for example, would override the `&` operator in `a & b`
> > to
> > > > > produce
> > > > > > > >> `KleeneAnd(Column(a), Column(b))`.
> > > > > > > >>
> > > > > > > >>
> > > > > > > >>>
> > > > > > > >>> To try to understand the gist, let's pick an aggregated
> count
> > > > over
> > > > > a
> > > > > > > >>> column: engines often do partial counts over partitions
> > > followed
> > > > > by a
> > > > > > > final
> > > > > > > >>> "sum" over the partial counts. Is the idea that the query
> > > engine
> > > > > would
> > > > > > > >>> communicate with the compute engine via 2 IRs where one is
> > > "count
> > > > > me
> > > > > > > these"
> > > > > > > >>> the other is "sum me these"?
> > > > > > > >>>
> > > > > > > >>> Best,
> > > > > > > >>> Jorge
> > > > > > > >>>
> > > > > > > >>
> > > > > > > >> Not in its current incarnation.
> > > > > > > >>
> > > > > > > >> The idea is that the IR producer communicates a desire to
> > > count(x)
> > > > > to a
> > > > > > > >> consumer, and  it's up to the consumer to figure out how to
> > turn
> > > > > that
> > > > > > > count
> > > > > > > >> into something that makes sense for itself. In your example
> > > > that's a
> > > > > > > series
> > > > > > > >> of partial counts followed by a sum.
> > > > > > > >>
> > > > > > > >
> > > > > > > > That said, I think there is a valid use case here where a
> > system
> > > > > might
> > > > > > > > make use of different engines to execute different
> (composable)
> > > > > layers
> > > > > > > > of a complex query.
> > > > > > > >
> > > > > > > > For example:
> > > > > > > >
> > > > > > > > * suppose you want to scan and do predicate pushdown on an
> > > unusual
> > > > > > > > data source that is only accessible from one particular
> engine
> > > but
> > > > > > > > * you need to do some analytical operation with the scan
> > results
> > > > that
> > > > > > > > is only supported by another engine
> > > > > > > >
> > > > > > > > You could decompose the query into two stages with an IR
> > > relational
> > > > > > > > expression for each stage and use then the engines together
> to
> > > > > > > > accomplish what you need (of course, you would need an
> > > > orchestration
> > > > > > > > layer to handle plumbing the query engine inputs and outputs
> > > > together
> > > > > > > > as Arrow streams). Supporting this kind of intra-application
> > > engine
> > > > > > > > heterogeneity is one of the motivations for the project.
> > > > > > > >
> > > > > > > >>>
> > > > > > > >>>
> > > > > > > >>>
> > > > > > > >>>
> > > > > > > >>>
> > > > > > > >>> On Wed, Aug 11, 2021 at 6:10 PM Phillip Cloud <
> > > cpcloud@gmail.com
> > > > >
> > > > > > > wrote:
> > > > > > > >>>
> > > > > > > >>>> Thanks Wes,
> > > > > > > >>>>
> > > > > > > >>>> Great to be back working on Arrow again and engaging with
> > the
> > > > > > > community.
> > > > > > > >>> I
> > > > > > > >>>> am really excited about this effort.
> > > > > > > >>>>
> > > > > > > >>>> I think there are a number of concerns I see as important
> to
> > > > > address
> > > > > > > in
> > > > > > > >>> the
> > > > > > > >>>> compute IR proposal:
> > > > > > > >>>>
> > > > > > > >>>> 1. Requirement for output types.
> > > > > > > >>>>
> > > > > > > >>>> I think that so far there's been many reasons for
> requiring
> > > > > > > conforming IR
> > > > > > > >>>> producers and consumers to adhere to output types, but I
> > > haven't
> > > > > seen
> > > > > > > a
> > > > > > > >>>> strong rationale for keeping them optional (in the
> semantic
> > > > > sense, not
> > > > > > > >>> WRT
> > > > > > > >>>> any particular serialization format's representation of
> > > > > optionality).
> > > > > > > >>>>
> > > > > > > >>>> I think a design that includes unambiguous semantics for
> > > output
> > > > > types
> > > > > > > (a
> > > > > > > >>>> consumer must produce a value of the requested type or
> it's
> > an
> > > > > > > >>>> error/non-conforming) is simpler to reason about for
> > > producers,
> > > > > and
> > > > > > > >>>> provides a strong guarantee for end users (humans or
> > machines
> > > > > > > >>> constructing
> > > > > > > >>>> IR from an API and expecting the thing they ask for back
> > from
> > > > the
> > > > > IR
> > > > > > > >>>> consumer).
> > > > > > > >>>>
> > > > > > > >>>> 2. Flexibility
> > > > > > > >>>>
> > > > > > > >>>> The current PR is currently unable to support what I think
> > are
> > > > > killer
> > > > > > > >>>> features of the IR: custom operators (relational or
> column)
> > > and
> > > > > UDFs.
> > > > > > > In
> > > > > > > >>> my
> > > > > > > >>>> mind, on top of the generalized compute description that
> the
> > > IR
> > > > > > > offers,
> > > > > > > >>> the
> > > > > > > >>>> ability for producers and consumers of IR to extend the IR
> > > > without
> > > > > > > >>> needing
> > > > > > > >>>> to modify Arrow or depend on anything except the format is
> > > > itself
> > > > > > > >>> something
> > > > > > > >>>> that is necessary to gain adoption.
> > > > > > > >>>>
> > > > > > > >>>> Developers will need to build custom relational operators
> > > (e.g.,
> > > > > > > scans of
> > > > > > > >>>> backends that don't exist anywhere for which a user has
> code
> > > to
> > > > > > > >>> implement)
> > > > > > > >>>> and custom functions (anything operating on a column that
> > > > doesn't
> > > > > > > already
> > > > > > > >>>> exist, really). Furthermore, I think we can actually drive
> > > > > building an
> > > > > > > >>>> Arrow consumer using the same API that an end user would
> use
> > > to
> > > > > extend
> > > > > > > >>> the
> > > > > > > >>>> IR.
> > > > > > > >>>>
> > > > > > > >>>> 3. Window Functions
> > > > > > > >>>>
> > > > > > > >>>> Window functions are, I think, an important part of the IR
> > > value
> > > > > > > >>>> proposition, as they are one of the more complex operators
> > in
> > > > > > > databases.
> > > > > > > >>> I
> > > > > > > >>>> think we need to have something in the initial IR proposal
> > to
> > > > > support
> > > > > > > >>> these
> > > > > > > >>>> operations.
> > > > > > > >>>>
> > > > > > > >>>> 4. Non relational Joins
> > > > > > > >>>>
> > > > > > > >>>> Things like as-of join and window join operators aren't
> yet
> > > > > fleshed
> > > > > > > out
> > > > > > > >>> in
> > > > > > > >>>> the IR, and I'm not sure they should be in scope for the
> > > initial
> > > > > > > >>> prototype.
> > > > > > > >>>> I think once we settle on a design, we can work the design
> > of
> > > > > these
> > > > > > > >>>> particular operators out during the initial prototype. I
> > think
> > > > the
> > > > > > > >>>> specification of these operators should basically be PR #2
> > > after
> > > > > the
> > > > > > > >>>> initial design lands.
> > > > > > > >>>>
> > > > > > > >>>> # Order of Work
> > > > > > > >>>>
> > > > > > > >>>> 1. Nail down the design. Anything else is a non-starter.
> > > > > > > >>>>
> > > > > > > >>>> 2. Prototype an IR producer using Ibis
> > > > > > > >>>>
> > > > > > > >>>> Ibis is IMO a good candidate for a first IR producer as it
> > > has a
> > > > > > > number
> > > > > > > >>> of
> > > > > > > >>>> desirable properties that make prototyping faster and
> allow
> > > for
> > > > > us to
> > > > > > > >>>> refine the design of the IR as needed based on how the
> > > > > implementation
> > > > > > > >>> goes:
> > > > > > > >>>> * It's written in Python so it has native support for
> nearly
> > > all
> > > > > of
> > > > > > > >>>> flatbuffers' features without having to creating bindings
> to
> > > > C++.
> > > > > > > >>>> * There's already a set of rules for type checking, as
> well
> > as
> > > > > APIs
> > > > > > > for
> > > > > > > >>>> constructing expression trees, which means we don't need
> to
> > > > worry
> > > > > > > about
> > > > > > > >>>> building a type checker for the prototype.
> > > > > > > >>>>
> > > > > > > >>>> 3. Prototype an IR consumer in C++
> > > > > > > >>>>
> > > > > > > >>>> I think in parallel to the producer prototype we can
> further
> > > > > inform
> > > > > > > the
> > > > > > > >>>> design from the consumer side by prototyping an IR
> consumer
> > in
> > > > > C++ . I
> > > > > > > >>> know
> > > > > > > >>>> Ben Kietzman has expressed interest in working on this.
> > > > > > > >>>>
> > > > > > > >>>> Very interested to hear others' thoughts.
> > > > > > > >>>>
> > > > > > > >>>> -Phillip
> > > > > > > >>>>
> > > > > > > >>>> On Tue, Aug 10, 2021 at 10:56 AM Wes McKinney <
> > > > > wesmckinn@gmail.com>
> > > > > > > >>> wrote:
> > > > > > > >>>>
> > > > > > > >>>>> Thank you for all the feedback and comments on the
> > document.
> > > > I'm
> > > > > on
> > > > > > > >>>>> vacation this week, so I'm delayed responding to
> > everything,
> > > > but
> > > > > I
> > > > > > > >>>>> will get to it as quickly as I can. I will be at VLDB in
> > > > > Copenhagen
> > > > > > > >>>>> next week if anyone would like to chat in person about
> it,
> > > and
> > > > > we can
> > > > > > > >>>>> relay the content of any discussions back to the
> > > > > document/PR/e-mail
> > > > > > > >>>>> thread.
> > > > > > > >>>>>
> > > > > > > >>>>> I know that Phillip Cloud expressed interest in working
> on
> > > the
> > > > > PR and
> > > > > > > >>>>> helping work through many of the details, so I'm glad to
> > have
> > > > the
> > > > > > > >>>>> help. If there are others who would like to work on the
> PR
> > or
> > > > dig
> > > > > > > into
> > > > > > > >>>>> the details, please let me know. We might need to figure
> > out
> > > > how
> > > > > to
> > > > > > > >>>>> accommodate "many cooks" by setting up the ComputeIR
> > project
> > > > > > > somewhere
> > > > > > > >>>>> separate from the format/ directory to permit it to exist
> > in
> > > a
> > > > > > > >>>>> Work-In-Progress status for a period of time until we
> work
> > > > > through
> > > > > > > the
> > > > > > > >>>>> various details and design concerns.
> > > > > > > >>>>>
> > > > > > > >>>>> Re Julian's comment
> > > > > > > >>>>>
> > > > > > > >>>>>> The biggest surprise is that this language does full
> > > > relational
> > > > > > > >>>>> operations. I was expecting that it would do fragments of
> > the
> > > > > > > >>> operations.
> > > > > > > >>>>>
> > > > > > > >>>>> There's a related but different (yet still interesting
> and
> > > > > worthy of
> > > > > > > >>>>> analysis) problem of creating an "engine language" that
> > > > describes
> > > > > > > more
> > > > > > > >>>>> mechanically the constituent parts of implementing the
> > > > relational
> > > > > > > >>>>> operators. To create a functional computation language
> with
> > > > > concrete
> > > > > > > >>>>> Arrow data structures as a top-level primitive sounds
> like
> > an
> > > > > > > >>>>> interesting research area where I could see something
> > > > developing
> > > > > > > >>>>> eventually.
> > > > > > > >>>>>
> > > > > > > >>>>> The main problem I'm interested in solving right now is
> > > > enabling
> > > > > > > front
> > > > > > > >>>>> ends that have sufficient understanding of relational
> > algebra
> > > > and
> > > > > > > data
> > > > > > > >>>>> frame operations to talk to engines without having to go
> > > > > backwards
> > > > > > > >>>>> from their logical query plans to SQL. So as mentioned in
> > the
> > > > > > > >>>>> document, being able to faithfully carry the relational
> > > > operator
> > > > > node
> > > > > > > >>>>> information generated by Calcite or Ibis or another
> system
> > > > would
> > > > > be
> > > > > > > >>>>> super useful. Defining the semantics of various kinds of
> > > > > user-defined
> > > > > > > >>>>> functions would also be helpful to standardize the
> > > > > > > >>>>> engine-to-user-language UDF/extension interface.
> > > > > > > >>>>>
> > > > > > > >>>>> On Tue, Aug 10, 2021 at 2:36 PM Dimitri Vorona <
> > > > > alendit@gmail.com>
> > > > > > > >>>> wrote:
> > > > > > > >>>>>>
> > > > > > > >>>>>> Hi Wes,
> > > > > > > >>>>>>
> > > > > > > >>>>>> cool initiative! Reminded me of "Building Advanced SQL
> > > > Analytics
> > > > > > > From
> > > > > > > >>>>>> Low-Level Plan Operators" from SIGMOD 2021 (
> > > > > > > >>>>>> http://db.in.tum.de/~kohn/papers/lolepops-sigmod21.pdf)
> > > which
> > > > > > > >>>> proposes a
> > > > > > > >>>>>> set of building block for advanced aggregation.
> > > > > > > >>>>>>
> > > > > > > >>>>>> Cheers,
> > > > > > > >>>>>> Dimitri.
> > > > > > > >>>>>>
> > > > > > > >>>>>> On Thu, Aug 5, 2021 at 7:59 PM Julian Hyde <
> > > > > jhyde.apache@gmail.com>
> > > > > > > >>>>> wrote:
> > > > > > > >>>>>>
> > > > > > > >>>>>>> Wes,
> > > > > > > >>>>>>>
> > > > > > > >>>>>>> Thanks for this. I’ve added comments to the doc and to
> > the
> > > > PR.
> > > > > > > >>>>>>>
> > > > > > > >>>>>>> The biggest surprise is that this language does full
> > > > relational
> > > > > > > >>>>>>> operations. I was expecting that it would do fragments
> of
> > > the
> > > > > > > >>>>> operations.
> > > > > > > >>>>>>> Consider join. A distributed hybrid hash join needs to
> > > > > partition
> > > > > > > >>> rows
> > > > > > > >>>>> into
> > > > > > > >>>>>>> output buffers based on a hash key, build hash tables,
> > > probe
> > > > > into
> > > > > > > >>>> hash
> > > > > > > >>>>>>> tables, scan hash tables for untouched “outer”rows, and
> > so
> > > > > forth.
> > > > > > > >>>>>>>
> > > > > > > >>>>>>> I see Arrow’s compute as delivering each of those
> > > operations,
> > > > > > > >>> working
> > > > > > > >>>>> on
> > > > > > > >>>>>>> perhaps a batch at a time, or a sequence of batches,
> with
> > > > some
> > > > > > > >>> other
> > > > > > > >>>>> system
> > > > > > > >>>>>>> coordinating those tasks. So I would expect to see
> > Arrow’s
> > > > > compute
> > > > > > > >>>>> language
> > > > > > > >>>>>>> mainly operating on batches rather than a table
> > > abstraction.
> > > > > > > >>>>>>>
> > > > > > > >>>>>>> Julian
> > > > > > > >>>>>>>
> > > > > > > >>>>>>>
> > > > > > > >>>>>>>> On Aug 2, 2021, at 5:16 PM, Wes McKinney <
> > > > wesmckinn@gmail.com
> > > > > >
> > > > > > > >>>>> wrote:
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>> hi folks,
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>> This idea came up in passing in the past -- given that
> > > there
> > > > > are
> > > > > > > >>>>>>>> multiple independent efforts to develop Arrow-native
> > query
> > > > > > > >>> engines
> > > > > > > >>>>>>>> (and surely many more to come), it seems like it would
> > be
> > > > > > > >>> valuable
> > > > > > > >>>> to
> > > > > > > >>>>>>>> have a way to enable user languages (like Java,
> Python,
> > R,
> > > > or
> > > > > > > >>> Rust,
> > > > > > > >>>>>>>> for example) to communicate with backend computing
> > engines
> > > > > (like
> > > > > > > >>>>>>>> DataFusion, or new computing capabilities being built
> in
> > > the
> > > > > > > >>> Arrow
> > > > > > > >>>>> C++
> > > > > > > >>>>>>>> library) in a fashion that is "lower-level" than SQL
> and
> > > > > > > >>>> specialized
> > > > > > > >>>>>>>> to Arrow's type system. So rather than leaving it to a
> > SQL
> > > > > > > >>> parser /
> > > > > > > >>>>>>>> analyzer framework to generate an expression tree of
> > > > > relational
> > > > > > > >>>>>>>> operators and then translate that to an Arrow-native
> > > > > > > >>>> compute-engine's
> > > > > > > >>>>>>>> internal grammer, a user framework could provide the
> > > desired
> > > > > > > >>>>>>>> Arrow-native expression tree / data manipulations
> > directly
> > > > and
> > > > > > > >>> skip
> > > > > > > >>>>>>>> the SQL altogether.
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>> The idea of creating a "serialized intermediate
> > > > representation
> > > > > > > >>>> (IR)"
> > > > > > > >>>>>>>> for Arrow compute operations would be to serve use
> cases
> > > > large
> > > > > > > >>> and
> > > > > > > >>>>>>>> small -- from the most complex TPC-* or time series
> > > database
> > > > > > > >>> query
> > > > > > > >>>> to
> > > > > > > >>>>>>>> the most simple array predicate/filter sent with an
> RPC
> > > > > request
> > > > > > > >>>> using
> > > > > > > >>>>>>>> Arrow Flight. It is deliberately language- and
> > > > > engine-agnostic,
> > > > > > > >>>> with
> > > > > > > >>>>>>>> the only commonality across usages being the Arrow
> > > columnar
> > > > > > > >>> format
> > > > > > > >>>>>>>> (schemas and array types). This would be better than
> > > leaving
> > > > > it
> > > > > > > >>> to
> > > > > > > >>>>>>>> each application to develop its own bespoke expression
> > > > > > > >>>>> representations
> > > > > > > >>>>>>>> for its needs.
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>> I spent a while thinking about this and wrote up a
> brain
> > > > dump
> > > > > RFC
> > > > > > > >>>>>>>> document [1] and accompanying pull request [2] that
> > makes
> > > > the
> > > > > > > >>>>> possibly
> > > > > > > >>>>>>>> controversial choice of using Flatbuffers to represent
> > the
> > > > > > > >>>> serialized
> > > > > > > >>>>>>>> IR. I discuss the rationale for the choice of
> > Flatbuffers
> > > in
> > > > > the
> > > > > > > >>>> RFC
> > > > > > > >>>>>>>> document. This PR is obviously deficient in many
> regards
> > > > > > > >>>> (incomplete,
> > > > > > > >>>>>>>> hacky, or unclear in places), and will need some help
> > from
> > > > > others
> > > > > > > >>>> to
> > > > > > > >>>>>>>> flesh out. I suspect that actually implementing the IR
> > > will
> > > > be
> > > > > > > >>>>>>>> necessary to work out many of the low-level details.
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>> Note that this IR is intended to be more of a
> "superset"
> > > > > project
> > > > > > > >>>> than
> > > > > > > >>>>>>>> a "lowest common denominator". So there may be things
> > that
> > > > it
> > > > > > > >>>>> includes
> > > > > > > >>>>>>>> which are only available in some engines (e.g. engines
> > > that
> > > > > have
> > > > > > > >>>>>>>> specialized handling of time series data).
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>> As some of my personal motivation for the project,
> > > > concurrent
> > > > > > > >>> with
> > > > > > > >>>>> the
> > > > > > > >>>>>>>> genesis of Apache Arrow, I started a Python project
> > called
> > > > > Ibis
> > > > > > > >>> [3]
> > > > > > > >>>>>>>> (which is similar to R's dplyr project) which serves
> as
> > a
> > > > > "Python
> > > > > > > >>>>>>>> analytical query IR builder" that is capable of
> > generating
> > > > > most
> > > > > > > >>> of
> > > > > > > >>>>> the
> > > > > > > >>>>>>>> SQL standard, targeting many different SQL dialects
> and
> > > > other
> > > > > > > >>>>> backends
> > > > > > > >>>>>>>> (like pandas). Microsoft ([4]) and Google ([5]) have
> > used
> > > > this
> > > > > > > >>>>> library
> > > > > > > >>>>>>>> as a "many-SQL" middleware. As such, I would like to
> be
> > > able
> > > > > to
> > > > > > > >>>>>>>> translate between the in-memory "logical query" data
> > > > > structures
> > > > > > > >>> in
> > > > > > > >>>> a
> > > > > > > >>>>>>>> library like Ibis to a serialized format that can be
> > > > executed
> > > > > by
> > > > > > > >>>> many
> > > > > > > >>>>>>>> different Arrow-native query engines. The expression
> > > > > primitives
> > > > > > > >>>>>>>> available in Ibis should serve as helpful test cases,
> > too.
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>> I look forward to the community's comments on the RFC
> > > > document
> > > > > > > >>> [1]
> > > > > > > >>>>> and
> > > > > > > >>>>>>>> pull request [2] -- I realize that arriving at
> consensus
> > > on
> > > > a
> > > > > > > >>>> complex
> > > > > > > >>>>>>>> and ambitious project like this can be challenging so
> I
> > > > > recommend
> > > > > > > >>>>>>>> spending time on the "non-goals" section in the RFC
> and
> > > ask
> > > > > > > >>>> questions
> > > > > > > >>>>>>>> if you are unclear about the scope of what problems
> this
> > > is
> > > > > > > >>> trying
> > > > > > > >>>> to
> > > > > > > >>>>>>>> solve. I would be happy to give Edit access on the RFC
> > > > > document
> > > > > > > >>> to
> > > > > > > >>>>>>>> others and would consider ideas about how to move
> > forward
> > > > with
> > > > > > > >>>>>>>> something that is able to be implemented by different
> > > Arrow
> > > > > > > >>>> libraries
> > > > > > > >>>>>>>> in the reasonably near future.
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>> Thanks,
> > > > > > > >>>>>>>> Wes
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>> [1]:
> > > > > > > >>>>>>>
> > > > > > > >>>>>
> > > > > > > >>>>
> > > > > > > >>>
> > > > > > >
> > > > >
> > > >
> > >
> >
> https://docs.google.com/document/d/1C_XVOG7iFkl6cgWWMyzUoIjfKt-X2UxqagPJrla0bAE/edit#
> > > > > > > >>>>>>>> [2]: https://github.com/apache/arrow/pull/10856
> > > > > > > >>>>>>>> [3]: https://ibis-project.org/
> > > > > > > >>>>>>>> [4]:
> > > http://cidrdb.org/cidr2021/papers/cidr2021_paper08.pdf
> > > > > > > >>>>>>>> [5]:
> > > > > > > >>>>>>>
> > > > > > > >>>>>
> > > > > > > >>>>
> > > > > > > >>>
> > > > > > >
> > > > >
> > > >
> > >
> >
> https://cloud.google.com/blog/products/databases/automate-data-validation-with-dvt
> > > > > > > >>>>>>>
> > > > > > > >>>>>>>
> > > > > > > >>>>>
> > > > > > > >>>>
> > > > > > > >>>
> > > > > > >
> > > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] Developing an "Arrow Compute IR [Intermediate Representation]" to decouple language front ends from Arrow-native compute engines

Posted by Micah Kornfield <em...@gmail.com>.

As an FYI, Iceberg is also considering an IR in relation to view support
[1].  I chimed in and pointed them to this thread and Wes's doc.  Phillip
and Jacques chimed in there as well.

[1]
https://mail-archives.apache.org/mod_mbox/iceberg-dev/202108.mbox/%3CCAKRVfm6h6WxQtp5fj8Yj8XWR1wFe8VohOkPuoZZGK-UHPhtwjQ%40mail.gmail.com%3E

On Thu, Aug 26, 2021 at 12:40 PM Phillip Cloud <cp...@gmail.com> wrote:

> Thanks for the feedback Jacques, very helpful. In the latest version of the
> PR, I've tried to incorporate nearly all of these points.
>
> - I've incorporated most of what you had for dereferencing operations into
> the PR, and gotten rid of schemas except on Read/Write relations.
> - With respect to properties, I've made a bunch more specific operators,
> and kept user-defined things special case-y.
> - I haven't incorporated anything close to physical-plan things, but I
> think that's a good follow up PR. Having separate representations for
> logical/physical plans seems like a waste of effort ultimately. I think we
> can find a good balance.
> - Agree on UDF support, I think that will have to evolve as the rest of the
> spec evolves. UDFs will need language-dedicated effort given the large
> variety of languages that folks will want to use to define functions.
>
> On a separate note, in an effort to move this project forward I'd like to
> do one final round of code review from anyone interested and then merge the
> PR after that.
> This spec will be unstable for a while, until we can work out all the
> design kinks and edge cases, and I think getting this initial spec in is
> important to start that process.
>
>
> On Mon, Aug 23, 2021 at 1:53 PM Jacques Nadeau <ja...@apache.org> wrote:
>
> > In a lucky turn of events, Phillip actually turned out to be in my neck
> of
> > the woods on Friday so we had a chance to sit down and discuss this. To
> > help, I actually shared something I had been working on a few months ago
> > independently (before this discussion started).
> >
> > For reference:
> > Wes PR: https://github.com/apache/arrow/pull/10856
> > Ben PR: https://github.com/apache/arrow/pull/10934
> > Jacques PR: https://github.com/apache/arrow/pull/10979
> >
> > The high level points of feedback I have are:
> >
> >    - Ben PR feels too deconstructed. While I like the elegance and
> >    symmetry, I believe this will lead to substantially more work in
> moving
> >    from serialization format to something closer to what a system would
> > want
> >    to manipulate/consume. The reality is that there are a lot of really
> > known
> >    things and specializing the representation for these things will
> > ultimately
> >    make things easier to program with without error and easier to debug.
> > (For
> >    example, imagine trying to inspect a plan in a debugging session with
> > the
> >    Ben representation.) We should embrace the known things in the
> >    specification.
> >    - I believe that it is a mistake for the inner workings of the plan to
> >    ever use field names. Only input (e.g. file read) and Output (e.g.
> > return
> >    to user or write to file) need to have field names. For the rest of
> the
> >    system, using field ordinals (determinant whether nested or flat) is
> > much
> >    cleaner and is how most execution systems work. For example, in
> Impala I
> >    believe it is called a slot. As I noted in the PR, Calcite as an
> > example is
> >    entirely ordinal based at the algebra level. Rowtypes contain field
> > names
> >    but they are actually basically pointless. Field references use
> > RexInputRef
> >    with ordinal based and rules around column order output (e.g. what is
> > the
> >    field order of a join output) are determinant and done entirely at an
> >    ordinal level. The only place where names should be used (besides
> >    input/output) is in the case of map keys. In that case, the names are
> >    actually data, as opposed to scheme metadata. This is why I propose a
> >    strongly structured dereference operation [1].
> >    - Properties should only be included in the serialization if they are
> >    not easily re-derivable at plan consumption time. For example, you'll
> > note
> >    that I don't store schema information for relational operation. Each
> >    function and relational operation should already know how a given
> input
> > is
> >    transformed to a given output. Capturing this information in the
> > plan/IR is
> >    excessive. In many ways, I compare it to the early use of VectorLayout
> > [2]
> >    in Arrow schema. It may have provided some additional checksum of the
> >    message but ultimately it caused more pain than it was worth (and thus
> > we
> >    removed it before formalizing the specification). For reference, in
> the
> >    context of Dremio, we used to actually do this, send schema
> information
> >    around for all operations. We removed it because in many cases
> becoming
> > the
> >    majority of our internal plan serialization (imagine simple operations
> > that
> >    are carrying 1000s of fields).
> >    - I suggest focusing on support for both logical and physical
> >    representations. The moment you start talking about optimization
> passes,
> >    many of those would probably be better being done at the logical
> level.
> > The
> >    overlap is really high.
> >    - I think a lot more work must be done before introducing UDFs and
> user
> >    defined relational operations. I see one goal being the possibility of
> >    there being three systems: A -> B -> C. A is a IR producer. C is a IR
> >    consumer and B is a IR filter or translator. In this situation, B
> > should be
> >    able to operate and do optimizations on a plan even if if there are
> > black
> >    box user defined operations. Being able to know the
> > properties-preservation
> >    or not of these operations is important to making decisions. For
> > example,
> >    does a user defined relational operation maintain sortedness?
> > Distribution?
> >    Is a defined UDF idempotent? As such, I think the definition of those
> > black
> >    boxes should be much more structured. For example: it is a python
> >    relational operation named X stored in Y that maintains properties 1,2
> > and
> >    disrupts property 3. Putting just a black box of bytes will
> > substantially
> >    reduce the compatibility and extensibility of the ecosystem of tools
> >    working against IR. I'd note that I wouldn't expect this to be a
> burden
> > to
> >    actual end users. By using sensible defaults, I still would expect an
> > end
> >    user tool to support arbitrary user defined operations.
> >    - It might make sense to review the XML representation that Orca uses
> >    [3]. I haven't looked at it recently but they had a strong goal of
> >    decoupling for most of its life (to use in both Greenplum and Hawq).
> It
> >    could be the most mature/formal serialization of query plans
> publically
> >    available.
> >
> >
> > [1]
> >
> >
> https://github.com/apache/arrow/pull/10979/files#diff-e40fbc40cf7a131efd2cb098444931774cfad046b8665b38452258ffaa2e3423R34
> > [2]
> >
> >
> https://github.com/apache/arrow/commit/611a4b951e24f4f967c3d382a2027dc035fc37f0
> > [3] https://github.com/greenplum-db/gporca
> >
> >
> > On Tue, Aug 17, 2021 at 11:14 AM Phillip Cloud <cp...@gmail.com>
> wrote:
> >
> > > On Tue, Aug 17, 2021 at 10:56 AM Wes McKinney <we...@gmail.com>
> > wrote:
> > >
> > > > Looking at Ben's alternate PR [1], having an IR that leans heavily on
> > > > memory references to an out-of-band data sidecar seems like an
> > > > approach that would substantially ratchet up the implementation
> > > > complexity as producing the IR would then have the level of
> complexity
> > > > of producing the Arrow IPC format — when producing the "root" Plan
> > > > message, you must accumulate a list of the dependent serialized
> > > > submessages, placing the appropriate Buffer memory offset in the Plan
> > > > message, like we do when producing the RecordBatch.buffers field.
> This
> > > > seems complicated to me as you must devise a custom binary protocol
> to
> > > > concatenate the serialized Plan and the messages it depends on into a
> > > > single binary payload
> > > >
> > > > <ROOT PLAN>
> > > > <padding>
> > > > <Buffer 0>
> > > > <padding>
> > > > <Buffer 1>
> > > > <padding>
> > > > ...
> > > > <Buffer N - 1>
> > > >
> > > > (one purpose of FlatBufferBuilder is to spare you having to do this
> > > > yourself — some reasons we do it for the Arrow IPC format is because
> > > > appending Arrow memory buffers directly to a FlatBufferBuilder would
> > > > be inefficient — internal realloc calls — and Flatbuffers are limited
> > > > to 2GB. Neither of these things are problems here)
> > > >
> > > > In general, I believe the output created by an IR producer should be
> a
> > > > single serialized object without any out-of-band data sidecar — this
> > > > is much simpler for implementers and we can still provide an "escape
> > > > hatch" for user-defined operators and functions where the required
> > > > function/operator is passed opaquely as an embedded binary data.
> > >
> > >
> > >
> > > The serialization format (whether it is Flatbuffers or JSON, or
> > > > something else) should allow for data memoization, so if there is a
> > > > user-defined operator/function, or a relation that is used multiple
> > > > times throughout the query (potentially with a large schema), then we
> > > > should ensure that the data need not be duplicated in the
> > > > serialization format unnecessarily — in Flatbuffers, you can achieve
> > > > this by reusing offsets, but we could devise the data structures to
> > > > make the memoization of "expensive" objects more explicit.
> > > >
> > >
> > > I think this is something that would need to be explicitly encoded in
> > > the structures themselves if it's a hard requirement. I don't think
> this
> > > should block
> > > a prototype producer/consumer.
> > >
> > > Is there something in the second PR/design that precludes the reuse of
> > > offsets?
> > > To my eye, the flatbuffers offset reuse mechanism works just as well
> > there.
> > >
> > >
> > > > I additionally think that it is important to provide as much built-in
> > > > support for "canonical" operators/functions (such as the ones
> > > > implemented commonly by SQL engines) as possible, and to liberally
> > > > expand the catalogue of "built-in" capabilities. I would still prefer
> > > > to have large unions/enums of built-in operators/functions and to
> > > > expand those unions/enums to accommodate new things when it is
> > > > demonstrated that there is a need to standardize things between
> > > > producers/consumers.
> > > >
> > >
> > > I think there's a middle ground where we add a bit of structure
> > (something
> > > like
> > > a descriptor from the first PR) to indicate whether a thing is built-in
> > vs
> > > user-defined.
> > > It looks like Ben has pushed something like this to his PR.
> > >
> > > With that scheme, we have both flexibility and a small set of special
> > > builtins that make up
> > > a statically typed set for expressions and relational operators.
> > >
> > > I would really like to vet this PR with a prototype this week,
> > > to see whether we need to revisit any major choices. I don't think
> we'll
> > be
> > > able to
> > > anticipate all the consequences until we write some code.
> > >
> > >
> > > >
> > > > One of the beneficial properties of the Union/Enum approach for the
> > > > operator/function catalogues, is that when there are additions to
> > > > those enums, the generated Flatbuffers files will cause many language
> > > > compilers to warn or error on unhandled enum cases. If all
> > > > function/operator names are strings, then you are essentially
> > > > reimplementing the functionality provided by enums by hand. I
> > > > initially used strings for all function references in my original
> > > > prototype, but I now think that using an enum for "built-ins" would
> be
> > > > superior (because of the code-generated enum interface) and not a
> > > > premature optimization.
> > > >
> > > > [1]: https://github.com/apache/arrow/pull/10934
> > > >
> > > > On Fri, Aug 13, 2021 at 11:26 PM Phillip Cloud <cp...@gmail.com>
> > > wrote:
> > > > >
> > > > > Hey all,
> > > > >
> > > > > Just wanted to give an update on the effort here.
> > > > >
> > > > > Ben Kietzman has created an alternative proposal to the initial
> > design
> > > > [1].
> > > > > It largely overlaps with the original, but differs in a few
> important
> > > > ways:
> > > > >
> > > > > * A big focus of the design is on flexibility, allowing producers,
> > > > > consumers and ultimately end users of those systems the ability to
> > > define
> > > > > custom operations in the graph.
> > > > > * There are very few predefined relational operations (project,
> > filter,
> > > > > join and a handful of others).
> > > > > * There are only 3 types of value expressions: literals, field
> > > > references,
> > > > > and function calls.
> > > > > * The model of evaluation is one that requires a final sink node,
> to
> > > > > indicate where the record batches from child relations end up (a
> > file,
> > > a
> > > > > socket, an in-memory buffer, etc).
> > > > >
> > > > > I've added notes [2] to the original Google doc (under the
> > Alternative
> > > > > Design Notes subheading), and a few pseudocode examples.
> > > > > Unfortunately, these went out of date as soon as Ben pushed the PR
> > [3],
> > > > so
> > > > > I need to update those to reflect his changes. Regardless,
> > > > > the design is broadly the same, so it should still give a good
> > > indication
> > > > > of the details of the design.
> > > > >
> > > > > There are a decent number of review comments on the original PR
> that
> > I
> > > > plan
> > > > > to port over where they are still relevant.
> > > > > I also plan on adding support for window functions either tonight
> or
> > on
> > > > > Monday.
> > > > >
> > > > > Please review this design at your earliest convenience. Since
> > there's a
> > > > > fairly concrete set of types in flatbuffers that
> > > > > we can look at, ideally we can center discussion around the details
> > in
> > > > the
> > > > > PR.
> > > > >
> > > > > Thanks!
> > > > >
> > > > > [1]: https://github.com/apache/arrow/pull/10856
> > > > > [2]:
> > > > >
> > > >
> > >
> >
> https://docs.google.com/document/d/1C_XVOG7iFkl6cgWWMyzUoIjfKt-X2UxqagPJrla0bAE/edit#heading=h.4tfbbtaqzu13
> > > > > [3]: https://github.com/apache/arrow/pull/10934
> > > > >
> > > > > On Thu, Aug 12, 2021 at 3:55 PM Julian Hyde <
> jhyde.apache@gmail.com>
> > > > wrote:
> > > > >
> > > > > > > Wes wrote:
> > > > > > >
> > > > > > > Supporting this kind of intra-application engine
> > > > > > > heterogeneity is one of the motivations for the project.
> > > > > >
> > > > > > +1
> > > > > >
> > > > > > The data format is the natural interface between tasks. (Defining
> > > > “task”
> > > > > > here as “something that is programmed using the IR”.) That is
> > Arrow’s
> > > > > > strength.
> > > > > >
> > > > > > So I think the IR should describe what each task should do, and
> > tasks
> > > > > > should be fairly small. Not whole relational operators, operating
> > on
> > > > whole
> > > > > > tables, but pieces of relational operators, operating on batches
> or
> > > > > > sequences of batches.
> > > > > >
> > > > > > Elsethread, someone mentioned the LoLePop concept and the
> > > > Kohn/Leis/Neuman
> > > > > > paper [1]. The LoLePop concept sounds good for our purposes.
> > > > > >
> > > > > > Julian
> > > > > >
> > > > > > [1] https://db.in.tum.de/~kohn/papers/lolepops-sigmod21.pdf
> > > > > >
> > > > > >
> > > > > > > On Aug 12, 2021, at 5:19 AM, Wes McKinney <wesmckinn@gmail.com
> >
> > > > wrote:
> > > > > > >
> > > > > > > On Wed, Aug 11, 2021 at 11:22 PM Phillip Cloud <
> > cpcloud@gmail.com>
> > > > > > wrote:
> > > > > > >>
> > > > > > >> On Wed, Aug 11, 2021 at 4:48 PM Jorge Cardoso Leitão <
> > > > > > >> jorgecarleitao@gmail.com> wrote:
> > > > > > >>
> > > > > > >>> Couple of questions
> > > > > > >>>
> > > > > > >>> 1. Is the goal that IRs have equal semantics, i.e. given
> > > > (IR,data), the
> > > > > > >>> operation "(IR,data) - engine -> result" MUST be the same for
> > all
> > > > > > "engine"?
> > > > > > >>>
> > > > > > >>
> > > > > > >> I think that might be a non-starter for mundane reasons:
> there's
> > > > > > probably
> > > > > > >> at least two engines
> > > > > > >> that disagree on the result of sum(x) where x is a floating
> > point
> > > > > > column.
> > > > > > >>
> > > > > > >>
> > > > > > >>> 2. if yes, imo we may need to worry about:
> > > > > > >>> * a definition of equality that implementations agree on.
> > > > > > >>> * agreement over what the semantics look like. For example,
> do
> > we
> > > > use
> > > > > > >>> kleene logic for AND and OR?
> > > > > > >>>
> > > > > > >>
> > > > > > >> WRT Kleene logic my thoughts are that the IR should support
> both
> > > > Kleene
> > > > > > and
> > > > > > >> non-Kleene,
> > > > > > >> and producers can choose their desired semantics.
> > > > > > >>
> > > > > > >> Ibis for example, would override the `&` operator in `a & b`
> to
> > > > produce
> > > > > > >> `KleeneAnd(Column(a), Column(b))`.
> > > > > > >>
> > > > > > >>
> > > > > > >>>
> > > > > > >>> To try to understand the gist, let's pick an aggregated count
> > > over
> > > > a
> > > > > > >>> column: engines often do partial counts over partitions
> > followed
> > > > by a
> > > > > > final
> > > > > > >>> "sum" over the partial counts. Is the idea that the query
> > engine
> > > > would
> > > > > > >>> communicate with the compute engine via 2 IRs where one is
> > "count
> > > > me
> > > > > > these"
> > > > > > >>> the other is "sum me these"?
> > > > > > >>>
> > > > > > >>> Best,
> > > > > > >>> Jorge
> > > > > > >>>
> > > > > > >>
> > > > > > >> Not in its current incarnation.
> > > > > > >>
> > > > > > >> The idea is that the IR producer communicates a desire to
> > count(x)
> > > > to a
> > > > > > >> consumer, and  it's up to the consumer to figure out how to
> turn
> > > > that
> > > > > > count
> > > > > > >> into something that makes sense for itself. In your example
> > > that's a
> > > > > > series
> > > > > > >> of partial counts followed by a sum.
> > > > > > >>
> > > > > > >
> > > > > > > That said, I think there is a valid use case here where a
> system
> > > > might
> > > > > > > make use of different engines to execute different (composable)
> > > > layers
> > > > > > > of a complex query.
> > > > > > >
> > > > > > > For example:
> > > > > > >
> > > > > > > * suppose you want to scan and do predicate pushdown on an
> > unusual
> > > > > > > data source that is only accessible from one particular engine
> > but
> > > > > > > * you need to do some analytical operation with the scan
> results
> > > that
> > > > > > > is only supported by another engine
> > > > > > >
> > > > > > > You could decompose the query into two stages with an IR
> > relational
> > > > > > > expression for each stage and use then the engines together to
> > > > > > > accomplish what you need (of course, you would need an
> > > orchestration
> > > > > > > layer to handle plumbing the query engine inputs and outputs
> > > together
> > > > > > > as Arrow streams). Supporting this kind of intra-application
> > engine
> > > > > > > heterogeneity is one of the motivations for the project.
> > > > > > >
> > > > > > >>>
> > > > > > >>>
> > > > > > >>>
> > > > > > >>>
> > > > > > >>>
> > > > > > >>> On Wed, Aug 11, 2021 at 6:10 PM Phillip Cloud <
> > cpcloud@gmail.com
> > > >
> > > > > > wrote:
> > > > > > >>>
> > > > > > >>>> Thanks Wes,
> > > > > > >>>>
> > > > > > >>>> Great to be back working on Arrow again and engaging with
> the
> > > > > > community.
> > > > > > >>> I
> > > > > > >>>> am really excited about this effort.
> > > > > > >>>>
> > > > > > >>>> I think there are a number of concerns I see as important to
> > > > address
> > > > > > in
> > > > > > >>> the
> > > > > > >>>> compute IR proposal:
> > > > > > >>>>
> > > > > > >>>> 1. Requirement for output types.
> > > > > > >>>>
> > > > > > >>>> I think that so far there's been many reasons for requiring
> > > > > > conforming IR
> > > > > > >>>> producers and consumers to adhere to output types, but I
> > haven't
> > > > seen
> > > > > > a
> > > > > > >>>> strong rationale for keeping them optional (in the semantic
> > > > sense, not
> > > > > > >>> WRT
> > > > > > >>>> any particular serialization format's representation of
> > > > optionality).
> > > > > > >>>>
> > > > > > >>>> I think a design that includes unambiguous semantics for
> > output
> > > > types
> > > > > > (a
> > > > > > >>>> consumer must produce a value of the requested type or it's
> an
> > > > > > >>>> error/non-conforming) is simpler to reason about for
> > producers,
> > > > and
> > > > > > >>>> provides a strong guarantee for end users (humans or
> machines
> > > > > > >>> constructing
> > > > > > >>>> IR from an API and expecting the thing they ask for back
> from
> > > the
> > > > IR
> > > > > > >>>> consumer).
> > > > > > >>>>
> > > > > > >>>> 2. Flexibility
> > > > > > >>>>
> > > > > > >>>> The current PR is currently unable to support what I think
> are
> > > > killer
> > > > > > >>>> features of the IR: custom operators (relational or column)
> > and
> > > > UDFs.
> > > > > > In
> > > > > > >>> my
> > > > > > >>>> mind, on top of the generalized compute description that the
> > IR
> > > > > > offers,
> > > > > > >>> the
> > > > > > >>>> ability for producers and consumers of IR to extend the IR
> > > without
> > > > > > >>> needing
> > > > > > >>>> to modify Arrow or depend on anything except the format is
> > > itself
> > > > > > >>> something
> > > > > > >>>> that is necessary to gain adoption.
> > > > > > >>>>
> > > > > > >>>> Developers will need to build custom relational operators
> > (e.g.,
> > > > > > scans of
> > > > > > >>>> backends that don't exist anywhere for which a user has code
> > to
> > > > > > >>> implement)
> > > > > > >>>> and custom functions (anything operating on a column that
> > > doesn't
> > > > > > already
> > > > > > >>>> exist, really). Furthermore, I think we can actually drive
> > > > building an
> > > > > > >>>> Arrow consumer using the same API that an end user would use
> > to
> > > > extend
> > > > > > >>> the
> > > > > > >>>> IR.
> > > > > > >>>>
> > > > > > >>>> 3. Window Functions
> > > > > > >>>>
> > > > > > >>>> Window functions are, I think, an important part of the IR
> > value
> > > > > > >>>> proposition, as they are one of the more complex operators
> in
> > > > > > databases.
> > > > > > >>> I
> > > > > > >>>> think we need to have something in the initial IR proposal
> to
> > > > support
> > > > > > >>> these
> > > > > > >>>> operations.
> > > > > > >>>>
> > > > > > >>>> 4. Non relational Joins
> > > > > > >>>>
> > > > > > >>>> Things like as-of join and window join operators aren't yet
> > > > fleshed
> > > > > > out
> > > > > > >>> in
> > > > > > >>>> the IR, and I'm not sure they should be in scope for the
> > initial
> > > > > > >>> prototype.
> > > > > > >>>> I think once we settle on a design, we can work the design
> of
> > > > these
> > > > > > >>>> particular operators out during the initial prototype. I
> think
> > > the
> > > > > > >>>> specification of these operators should basically be PR #2
> > after
> > > > the
> > > > > > >>>> initial design lands.
> > > > > > >>>>
> > > > > > >>>> # Order of Work
> > > > > > >>>>
> > > > > > >>>> 1. Nail down the design. Anything else is a non-starter.
> > > > > > >>>>
> > > > > > >>>> 2. Prototype an IR producer using Ibis
> > > > > > >>>>
> > > > > > >>>> Ibis is IMO a good candidate for a first IR producer as it
> > has a
> > > > > > number
> > > > > > >>> of
> > > > > > >>>> desirable properties that make prototyping faster and allow
> > for
> > > > us to
> > > > > > >>>> refine the design of the IR as needed based on how the
> > > > implementation
> > > > > > >>> goes:
> > > > > > >>>> * It's written in Python so it has native support for nearly
> > all
> > > > of
> > > > > > >>>> flatbuffers' features without having to creating bindings to
> > > C++.
> > > > > > >>>> * There's already a set of rules for type checking, as well
> as
> > > > APIs
> > > > > > for
> > > > > > >>>> constructing expression trees, which means we don't need to
> > > worry
> > > > > > about
> > > > > > >>>> building a type checker for the prototype.
> > > > > > >>>>
> > > > > > >>>> 3. Prototype an IR consumer in C++
> > > > > > >>>>
> > > > > > >>>> I think in parallel to the producer prototype we can further
> > > > inform
> > > > > > the
> > > > > > >>>> design from the consumer side by prototyping an IR consumer
> in
> > > > C++ . I
> > > > > > >>> know
> > > > > > >>>> Ben Kietzman has expressed interest in working on this.
> > > > > > >>>>
> > > > > > >>>> Very interested to hear others' thoughts.
> > > > > > >>>>
> > > > > > >>>> -Phillip
> > > > > > >>>>
> > > > > > >>>> On Tue, Aug 10, 2021 at 10:56 AM Wes McKinney <
> > > > wesmckinn@gmail.com>
> > > > > > >>> wrote:
> > > > > > >>>>
> > > > > > >>>>> Thank you for all the feedback and comments on the
> document.
> > > I'm
> > > > on
> > > > > > >>>>> vacation this week, so I'm delayed responding to
> everything,
> > > but
> > > > I
> > > > > > >>>>> will get to it as quickly as I can. I will be at VLDB in
> > > > Copenhagen
> > > > > > >>>>> next week if anyone would like to chat in person about it,
> > and
> > > > we can
> > > > > > >>>>> relay the content of any discussions back to the
> > > > document/PR/e-mail
> > > > > > >>>>> thread.
> > > > > > >>>>>
> > > > > > >>>>> I know that Phillip Cloud expressed interest in working on
> > the
> > > > PR and
> > > > > > >>>>> helping work through many of the details, so I'm glad to
> have
> > > the
> > > > > > >>>>> help. If there are others who would like to work on the PR
> or
> > > dig
> > > > > > into
> > > > > > >>>>> the details, please let me know. We might need to figure
> out
> > > how
> > > > to
> > > > > > >>>>> accommodate "many cooks" by setting up the ComputeIR
> project
> > > > > > somewhere
> > > > > > >>>>> separate from the format/ directory to permit it to exist
> in
> > a
> > > > > > >>>>> Work-In-Progress status for a period of time until we work
> > > > through
> > > > > > the
> > > > > > >>>>> various details and design concerns.
> > > > > > >>>>>
> > > > > > >>>>> Re Julian's comment
> > > > > > >>>>>
> > > > > > >>>>>> The biggest surprise is that this language does full
> > > relational
> > > > > > >>>>> operations. I was expecting that it would do fragments of
> the
> > > > > > >>> operations.
> > > > > > >>>>>
> > > > > > >>>>> There's a related but different (yet still interesting and
> > > > worthy of
> > > > > > >>>>> analysis) problem of creating an "engine language" that
> > > describes
> > > > > > more
> > > > > > >>>>> mechanically the constituent parts of implementing the
> > > relational
> > > > > > >>>>> operators. To create a functional computation language with
> > > > concrete
> > > > > > >>>>> Arrow data structures as a top-level primitive sounds like
> an
> > > > > > >>>>> interesting research area where I could see something
> > > developing
> > > > > > >>>>> eventually.
> > > > > > >>>>>
> > > > > > >>>>> The main problem I'm interested in solving right now is
> > > enabling
> > > > > > front
> > > > > > >>>>> ends that have sufficient understanding of relational
> algebra
> > > and
> > > > > > data
> > > > > > >>>>> frame operations to talk to engines without having to go
> > > > backwards
> > > > > > >>>>> from their logical query plans to SQL. So as mentioned in
> the
> > > > > > >>>>> document, being able to faithfully carry the relational
> > > operator
> > > > node
> > > > > > >>>>> information generated by Calcite or Ibis or another system
> > > would
> > > > be
> > > > > > >>>>> super useful. Defining the semantics of various kinds of
> > > > user-defined
> > > > > > >>>>> functions would also be helpful to standardize the
> > > > > > >>>>> engine-to-user-language UDF/extension interface.
> > > > > > >>>>>
> > > > > > >>>>> On Tue, Aug 10, 2021 at 2:36 PM Dimitri Vorona <
> > > > alendit@gmail.com>
> > > > > > >>>> wrote:
> > > > > > >>>>>>
> > > > > > >>>>>> Hi Wes,
> > > > > > >>>>>>
> > > > > > >>>>>> cool initiative! Reminded me of "Building Advanced SQL
> > > Analytics
> > > > > > From
> > > > > > >>>>>> Low-Level Plan Operators" from SIGMOD 2021 (
> > > > > > >>>>>> http://db.in.tum.de/~kohn/papers/lolepops-sigmod21.pdf)
> > which
> > > > > > >>>> proposes a
> > > > > > >>>>>> set of building block for advanced aggregation.
> > > > > > >>>>>>
> > > > > > >>>>>> Cheers,
> > > > > > >>>>>> Dimitri.
> > > > > > >>>>>>
> > > > > > >>>>>> On Thu, Aug 5, 2021 at 7:59 PM Julian Hyde <
> > > > jhyde.apache@gmail.com>
> > > > > > >>>>> wrote:
> > > > > > >>>>>>
> > > > > > >>>>>>> Wes,
> > > > > > >>>>>>>
> > > > > > >>>>>>> Thanks for this. I’ve added comments to the doc and to
> the
> > > PR.
> > > > > > >>>>>>>
> > > > > > >>>>>>> The biggest surprise is that this language does full
> > > relational
> > > > > > >>>>>>> operations. I was expecting that it would do fragments of
> > the
> > > > > > >>>>> operations.
> > > > > > >>>>>>> Consider join. A distributed hybrid hash join needs to
> > > > partition
> > > > > > >>> rows
> > > > > > >>>>> into
> > > > > > >>>>>>> output buffers based on a hash key, build hash tables,
> > probe
> > > > into
> > > > > > >>>> hash
> > > > > > >>>>>>> tables, scan hash tables for untouched “outer”rows, and
> so
> > > > forth.
> > > > > > >>>>>>>
> > > > > > >>>>>>> I see Arrow’s compute as delivering each of those
> > operations,
> > > > > > >>> working
> > > > > > >>>>> on
> > > > > > >>>>>>> perhaps a batch at a time, or a sequence of batches, with
> > > some
> > > > > > >>> other
> > > > > > >>>>> system
> > > > > > >>>>>>> coordinating those tasks. So I would expect to see
> Arrow’s
> > > > compute
> > > > > > >>>>> language
> > > > > > >>>>>>> mainly operating on batches rather than a table
> > abstraction.
> > > > > > >>>>>>>
> > > > > > >>>>>>> Julian
> > > > > > >>>>>>>
> > > > > > >>>>>>>
> > > > > > >>>>>>>> On Aug 2, 2021, at 5:16 PM, Wes McKinney <
> > > wesmckinn@gmail.com
> > > > >
> > > > > > >>>>> wrote:
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> hi folks,
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> This idea came up in passing in the past -- given that
> > there
> > > > are
> > > > > > >>>>>>>> multiple independent efforts to develop Arrow-native
> query
> > > > > > >>> engines
> > > > > > >>>>>>>> (and surely many more to come), it seems like it would
> be
> > > > > > >>> valuable
> > > > > > >>>> to
> > > > > > >>>>>>>> have a way to enable user languages (like Java, Python,
> R,
> > > or
> > > > > > >>> Rust,
> > > > > > >>>>>>>> for example) to communicate with backend computing
> engines
> > > > (like
> > > > > > >>>>>>>> DataFusion, or new computing capabilities being built in
> > the
> > > > > > >>> Arrow
> > > > > > >>>>> C++
> > > > > > >>>>>>>> library) in a fashion that is "lower-level" than SQL and
> > > > > > >>>> specialized
> > > > > > >>>>>>>> to Arrow's type system. So rather than leaving it to a
> SQL
> > > > > > >>> parser /
> > > > > > >>>>>>>> analyzer framework to generate an expression tree of
> > > > relational
> > > > > > >>>>>>>> operators and then translate that to an Arrow-native
> > > > > > >>>> compute-engine's
> > > > > > >>>>>>>> internal grammer, a user framework could provide the
> > desired
> > > > > > >>>>>>>> Arrow-native expression tree / data manipulations
> directly
> > > and
> > > > > > >>> skip
> > > > > > >>>>>>>> the SQL altogether.
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> The idea of creating a "serialized intermediate
> > > representation
> > > > > > >>>> (IR)"
> > > > > > >>>>>>>> for Arrow compute operations would be to serve use cases
> > > large
> > > > > > >>> and
> > > > > > >>>>>>>> small -- from the most complex TPC-* or time series
> > database
> > > > > > >>> query
> > > > > > >>>> to
> > > > > > >>>>>>>> the most simple array predicate/filter sent with an RPC
> > > > request
> > > > > > >>>> using
> > > > > > >>>>>>>> Arrow Flight. It is deliberately language- and
> > > > engine-agnostic,
> > > > > > >>>> with
> > > > > > >>>>>>>> the only commonality across usages being the Arrow
> > columnar
> > > > > > >>> format
> > > > > > >>>>>>>> (schemas and array types). This would be better than
> > leaving
> > > > it
> > > > > > >>> to
> > > > > > >>>>>>>> each application to develop its own bespoke expression
> > > > > > >>>>> representations
> > > > > > >>>>>>>> for its needs.
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> I spent a while thinking about this and wrote up a brain
> > > dump
> > > > RFC
> > > > > > >>>>>>>> document [1] and accompanying pull request [2] that
> makes
> > > the
> > > > > > >>>>> possibly
> > > > > > >>>>>>>> controversial choice of using Flatbuffers to represent
> the
> > > > > > >>>> serialized
> > > > > > >>>>>>>> IR. I discuss the rationale for the choice of
> Flatbuffers
> > in
> > > > the
> > > > > > >>>> RFC
> > > > > > >>>>>>>> document. This PR is obviously deficient in many regards
> > > > > > >>>> (incomplete,
> > > > > > >>>>>>>> hacky, or unclear in places), and will need some help
> from
> > > > others
> > > > > > >>>> to
> > > > > > >>>>>>>> flesh out. I suspect that actually implementing the IR
> > will
> > > be
> > > > > > >>>>>>>> necessary to work out many of the low-level details.
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> Note that this IR is intended to be more of a "superset"
> > > > project
> > > > > > >>>> than
> > > > > > >>>>>>>> a "lowest common denominator". So there may be things
> that
> > > it
> > > > > > >>>>> includes
> > > > > > >>>>>>>> which are only available in some engines (e.g. engines
> > that
> > > > have
> > > > > > >>>>>>>> specialized handling of time series data).
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> As some of my personal motivation for the project,
> > > concurrent
> > > > > > >>> with
> > > > > > >>>>> the
> > > > > > >>>>>>>> genesis of Apache Arrow, I started a Python project
> called
> > > > Ibis
> > > > > > >>> [3]
> > > > > > >>>>>>>> (which is similar to R's dplyr project) which serves as
> a
> > > > "Python
> > > > > > >>>>>>>> analytical query IR builder" that is capable of
> generating
> > > > most
> > > > > > >>> of
> > > > > > >>>>> the
> > > > > > >>>>>>>> SQL standard, targeting many different SQL dialects and
> > > other
> > > > > > >>>>> backends
> > > > > > >>>>>>>> (like pandas). Microsoft ([4]) and Google ([5]) have
> used
> > > this
> > > > > > >>>>> library
> > > > > > >>>>>>>> as a "many-SQL" middleware. As such, I would like to be
> > able
> > > > to
> > > > > > >>>>>>>> translate between the in-memory "logical query" data
> > > > structures
> > > > > > >>> in
> > > > > > >>>> a
> > > > > > >>>>>>>> library like Ibis to a serialized format that can be
> > > executed
> > > > by
> > > > > > >>>> many
> > > > > > >>>>>>>> different Arrow-native query engines. The expression
> > > > primitives
> > > > > > >>>>>>>> available in Ibis should serve as helpful test cases,
> too.
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> I look forward to the community's comments on the RFC
> > > document
> > > > > > >>> [1]
> > > > > > >>>>> and
> > > > > > >>>>>>>> pull request [2] -- I realize that arriving at consensus
> > on
> > > a
> > > > > > >>>> complex
> > > > > > >>>>>>>> and ambitious project like this can be challenging so I
> > > > recommend
> > > > > > >>>>>>>> spending time on the "non-goals" section in the RFC and
> > ask
> > > > > > >>>> questions
> > > > > > >>>>>>>> if you are unclear about the scope of what problems this
> > is
> > > > > > >>> trying
> > > > > > >>>> to
> > > > > > >>>>>>>> solve. I would be happy to give Edit access on the RFC
> > > > document
> > > > > > >>> to
> > > > > > >>>>>>>> others and would consider ideas about how to move
> forward
> > > with
> > > > > > >>>>>>>> something that is able to be implemented by different
> > Arrow
> > > > > > >>>> libraries
> > > > > > >>>>>>>> in the reasonably near future.
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> Thanks,
> > > > > > >>>>>>>> Wes
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> [1]:
> > > > > > >>>>>>>
> > > > > > >>>>>
> > > > > > >>>>
> > > > > > >>>
> > > > > >
> > > >
> > >
> >
> https://docs.google.com/document/d/1C_XVOG7iFkl6cgWWMyzUoIjfKt-X2UxqagPJrla0bAE/edit#
> > > > > > >>>>>>>> [2]: https://github.com/apache/arrow/pull/10856
> > > > > > >>>>>>>> [3]: https://ibis-project.org/
> > > > > > >>>>>>>> [4]:
> > http://cidrdb.org/cidr2021/papers/cidr2021_paper08.pdf
> > > > > > >>>>>>>> [5]:
> > > > > > >>>>>>>
> > > > > > >>>>>
> > > > > > >>>>
> > > > > > >>>
> > > > > >
> > > >
> > >
> >
> https://cloud.google.com/blog/products/databases/automate-data-validation-with-dvt
> > > > > > >>>>>>>
> > > > > > >>>>>>>
> > > > > > >>>>>
> > > > > > >>>>
> > > > > > >>>
> > > > > >
> > > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] Developing an "Arrow Compute IR [Intermediate Representation]" to decouple language front ends from Arrow-native compute engines

Posted by Phillip Cloud <cp...@gmail.com>.

Thanks for the feedback Jacques, very helpful. In the latest version of the
PR, I've tried to incorporate nearly all of these points.

- I've incorporated most of what you had for dereferencing operations into
the PR, and gotten rid of schemas except on Read/Write relations.
- With respect to properties, I've made a bunch more specific operators,
and kept user-defined things special case-y.
- I haven't incorporated anything close to physical-plan things, but I
think that's a good follow up PR. Having separate representations for
logical/physical plans seems like a waste of effort ultimately. I think we
can find a good balance.
- Agree on UDF support, I think that will have to evolve as the rest of the
spec evolves. UDFs will need language-dedicated effort given the large
variety of languages that folks will want to use to define functions.

On a separate note, in an effort to move this project forward I'd like to
do one final round of code review from anyone interested and then merge the
PR after that.
This spec will be unstable for a while, until we can work out all the
design kinks and edge cases, and I think getting this initial spec in is
important to start that process.


On Mon, Aug 23, 2021 at 1:53 PM Jacques Nadeau <ja...@apache.org> wrote:

> In a lucky turn of events, Phillip actually turned out to be in my neck of
> the woods on Friday so we had a chance to sit down and discuss this. To
> help, I actually shared something I had been working on a few months ago
> independently (before this discussion started).
>
> For reference:
> Wes PR: https://github.com/apache/arrow/pull/10856
> Ben PR: https://github.com/apache/arrow/pull/10934
> Jacques PR: https://github.com/apache/arrow/pull/10979
>
> The high level points of feedback I have are:
>
>    - Ben PR feels too deconstructed. While I like the elegance and
>    symmetry, I believe this will lead to substantially more work in moving
>    from serialization format to something closer to what a system would
> want
>    to manipulate/consume. The reality is that there are a lot of really
> known
>    things and specializing the representation for these things will
> ultimately
>    make things easier to program with without error and easier to debug.
> (For
>    example, imagine trying to inspect a plan in a debugging session with
> the
>    Ben representation.) We should embrace the known things in the
>    specification.
>    - I believe that it is a mistake for the inner workings of the plan to
>    ever use field names. Only input (e.g. file read) and Output (e.g.
> return
>    to user or write to file) need to have field names. For the rest of the
>    system, using field ordinals (determinant whether nested or flat) is
> much
>    cleaner and is how most execution systems work. For example, in Impala I
>    believe it is called a slot. As I noted in the PR, Calcite as an
> example is
>    entirely ordinal based at the algebra level. Rowtypes contain field
> names
>    but they are actually basically pointless. Field references use
> RexInputRef
>    with ordinal based and rules around column order output (e.g. what is
> the
>    field order of a join output) are determinant and done entirely at an
>    ordinal level. The only place where names should be used (besides
>    input/output) is in the case of map keys. In that case, the names are
>    actually data, as opposed to scheme metadata. This is why I propose a
>    strongly structured dereference operation [1].
>    - Properties should only be included in the serialization if they are
>    not easily re-derivable at plan consumption time. For example, you'll
> note
>    that I don't store schema information for relational operation. Each
>    function and relational operation should already know how a given input
> is
>    transformed to a given output. Capturing this information in the
> plan/IR is
>    excessive. In many ways, I compare it to the early use of VectorLayout
> [2]
>    in Arrow schema. It may have provided some additional checksum of the
>    message but ultimately it caused more pain than it was worth (and thus
> we
>    removed it before formalizing the specification). For reference, in the
>    context of Dremio, we used to actually do this, send schema information
>    around for all operations. We removed it because in many cases becoming
> the
>    majority of our internal plan serialization (imagine simple operations
> that
>    are carrying 1000s of fields).
>    - I suggest focusing on support for both logical and physical
>    representations. The moment you start talking about optimization passes,
>    many of those would probably be better being done at the logical level.
> The
>    overlap is really high.
>    - I think a lot more work must be done before introducing UDFs and user
>    defined relational operations. I see one goal being the possibility of
>    there being three systems: A -> B -> C. A is a IR producer. C is a IR
>    consumer and B is a IR filter or translator. In this situation, B
> should be
>    able to operate and do optimizations on a plan even if if there are
> black
>    box user defined operations. Being able to know the
> properties-preservation
>    or not of these operations is important to making decisions. For
> example,
>    does a user defined relational operation maintain sortedness?
> Distribution?
>    Is a defined UDF idempotent? As such, I think the definition of those
> black
>    boxes should be much more structured. For example: it is a python
>    relational operation named X stored in Y that maintains properties 1,2
> and
>    disrupts property 3. Putting just a black box of bytes will
> substantially
>    reduce the compatibility and extensibility of the ecosystem of tools
>    working against IR. I'd note that I wouldn't expect this to be a burden
> to
>    actual end users. By using sensible defaults, I still would expect an
> end
>    user tool to support arbitrary user defined operations.
>    - It might make sense to review the XML representation that Orca uses
>    [3]. I haven't looked at it recently but they had a strong goal of
>    decoupling for most of its life (to use in both Greenplum and Hawq). It
>    could be the most mature/formal serialization of query plans publically
>    available.
>
>
> [1]
>
> https://github.com/apache/arrow/pull/10979/files#diff-e40fbc40cf7a131efd2cb098444931774cfad046b8665b38452258ffaa2e3423R34
> [2]
>
> https://github.com/apache/arrow/commit/611a4b951e24f4f967c3d382a2027dc035fc37f0
> [3] https://github.com/greenplum-db/gporca
>
>
> On Tue, Aug 17, 2021 at 11:14 AM Phillip Cloud <cp...@gmail.com> wrote:
>
> > On Tue, Aug 17, 2021 at 10:56 AM Wes McKinney <we...@gmail.com>
> wrote:
> >
> > > Looking at Ben's alternate PR [1], having an IR that leans heavily on
> > > memory references to an out-of-band data sidecar seems like an
> > > approach that would substantially ratchet up the implementation
> > > complexity as producing the IR would then have the level of complexity
> > > of producing the Arrow IPC format — when producing the "root" Plan
> > > message, you must accumulate a list of the dependent serialized
> > > submessages, placing the appropriate Buffer memory offset in the Plan
> > > message, like we do when producing the RecordBatch.buffers field. This
> > > seems complicated to me as you must devise a custom binary protocol to
> > > concatenate the serialized Plan and the messages it depends on into a
> > > single binary payload
> > >
> > > <ROOT PLAN>
> > > <padding>
> > > <Buffer 0>
> > > <padding>
> > > <Buffer 1>
> > > <padding>
> > > ...
> > > <Buffer N - 1>
> > >
> > > (one purpose of FlatBufferBuilder is to spare you having to do this
> > > yourself — some reasons we do it for the Arrow IPC format is because
> > > appending Arrow memory buffers directly to a FlatBufferBuilder would
> > > be inefficient — internal realloc calls — and Flatbuffers are limited
> > > to 2GB. Neither of these things are problems here)
> > >
> > > In general, I believe the output created by an IR producer should be a
> > > single serialized object without any out-of-band data sidecar — this
> > > is much simpler for implementers and we can still provide an "escape
> > > hatch" for user-defined operators and functions where the required
> > > function/operator is passed opaquely as an embedded binary data.
> >
> >
> >
> > The serialization format (whether it is Flatbuffers or JSON, or
> > > something else) should allow for data memoization, so if there is a
> > > user-defined operator/function, or a relation that is used multiple
> > > times throughout the query (potentially with a large schema), then we
> > > should ensure that the data need not be duplicated in the
> > > serialization format unnecessarily — in Flatbuffers, you can achieve
> > > this by reusing offsets, but we could devise the data structures to
> > > make the memoization of "expensive" objects more explicit.
> > >
> >
> > I think this is something that would need to be explicitly encoded in
> > the structures themselves if it's a hard requirement. I don't think this
> > should block
> > a prototype producer/consumer.
> >
> > Is there something in the second PR/design that precludes the reuse of
> > offsets?
> > To my eye, the flatbuffers offset reuse mechanism works just as well
> there.
> >
> >
> > > I additionally think that it is important to provide as much built-in
> > > support for "canonical" operators/functions (such as the ones
> > > implemented commonly by SQL engines) as possible, and to liberally
> > > expand the catalogue of "built-in" capabilities. I would still prefer
> > > to have large unions/enums of built-in operators/functions and to
> > > expand those unions/enums to accommodate new things when it is
> > > demonstrated that there is a need to standardize things between
> > > producers/consumers.
> > >
> >
> > I think there's a middle ground where we add a bit of structure
> (something
> > like
> > a descriptor from the first PR) to indicate whether a thing is built-in
> vs
> > user-defined.
> > It looks like Ben has pushed something like this to his PR.
> >
> > With that scheme, we have both flexibility and a small set of special
> > builtins that make up
> > a statically typed set for expressions and relational operators.
> >
> > I would really like to vet this PR with a prototype this week,
> > to see whether we need to revisit any major choices. I don't think we'll
> be
> > able to
> > anticipate all the consequences until we write some code.
> >
> >
> > >
> > > One of the beneficial properties of the Union/Enum approach for the
> > > operator/function catalogues, is that when there are additions to
> > > those enums, the generated Flatbuffers files will cause many language
> > > compilers to warn or error on unhandled enum cases. If all
> > > function/operator names are strings, then you are essentially
> > > reimplementing the functionality provided by enums by hand. I
> > > initially used strings for all function references in my original
> > > prototype, but I now think that using an enum for "built-ins" would be
> > > superior (because of the code-generated enum interface) and not a
> > > premature optimization.
> > >
> > > [1]: https://github.com/apache/arrow/pull/10934
> > >
> > > On Fri, Aug 13, 2021 at 11:26 PM Phillip Cloud <cp...@gmail.com>
> > wrote:
> > > >
> > > > Hey all,
> > > >
> > > > Just wanted to give an update on the effort here.
> > > >
> > > > Ben Kietzman has created an alternative proposal to the initial
> design
> > > [1].
> > > > It largely overlaps with the original, but differs in a few important
> > > ways:
> > > >
> > > > * A big focus of the design is on flexibility, allowing producers,
> > > > consumers and ultimately end users of those systems the ability to
> > define
> > > > custom operations in the graph.
> > > > * There are very few predefined relational operations (project,
> filter,
> > > > join and a handful of others).
> > > > * There are only 3 types of value expressions: literals, field
> > > references,
> > > > and function calls.
> > > > * The model of evaluation is one that requires a final sink node, to
> > > > indicate where the record batches from child relations end up (a
> file,
> > a
> > > > socket, an in-memory buffer, etc).
> > > >
> > > > I've added notes [2] to the original Google doc (under the
> Alternative
> > > > Design Notes subheading), and a few pseudocode examples.
> > > > Unfortunately, these went out of date as soon as Ben pushed the PR
> [3],
> > > so
> > > > I need to update those to reflect his changes. Regardless,
> > > > the design is broadly the same, so it should still give a good
> > indication
> > > > of the details of the design.
> > > >
> > > > There are a decent number of review comments on the original PR that
> I
> > > plan
> > > > to port over where they are still relevant.
> > > > I also plan on adding support for window functions either tonight or
> on
> > > > Monday.
> > > >
> > > > Please review this design at your earliest convenience. Since
> there's a
> > > > fairly concrete set of types in flatbuffers that
> > > > we can look at, ideally we can center discussion around the details
> in
> > > the
> > > > PR.
> > > >
> > > > Thanks!
> > > >
> > > > [1]: https://github.com/apache/arrow/pull/10856
> > > > [2]:
> > > >
> > >
> >
> https://docs.google.com/document/d/1C_XVOG7iFkl6cgWWMyzUoIjfKt-X2UxqagPJrla0bAE/edit#heading=h.4tfbbtaqzu13
> > > > [3]: https://github.com/apache/arrow/pull/10934
> > > >
> > > > On Thu, Aug 12, 2021 at 3:55 PM Julian Hyde <jh...@gmail.com>
> > > wrote:
> > > >
> > > > > > Wes wrote:
> > > > > >
> > > > > > Supporting this kind of intra-application engine
> > > > > > heterogeneity is one of the motivations for the project.
> > > > >
> > > > > +1
> > > > >
> > > > > The data format is the natural interface between tasks. (Defining
> > > “task”
> > > > > here as “something that is programmed using the IR”.) That is
> Arrow’s
> > > > > strength.
> > > > >
> > > > > So I think the IR should describe what each task should do, and
> tasks
> > > > > should be fairly small. Not whole relational operators, operating
> on
> > > whole
> > > > > tables, but pieces of relational operators, operating on batches or
> > > > > sequences of batches.
> > > > >
> > > > > Elsethread, someone mentioned the LoLePop concept and the
> > > Kohn/Leis/Neuman
> > > > > paper [1]. The LoLePop concept sounds good for our purposes.
> > > > >
> > > > > Julian
> > > > >
> > > > > [1] https://db.in.tum.de/~kohn/papers/lolepops-sigmod21.pdf
> > > > >
> > > > >
> > > > > > On Aug 12, 2021, at 5:19 AM, Wes McKinney <we...@gmail.com>
> > > wrote:
> > > > > >
> > > > > > On Wed, Aug 11, 2021 at 11:22 PM Phillip Cloud <
> cpcloud@gmail.com>
> > > > > wrote:
> > > > > >>
> > > > > >> On Wed, Aug 11, 2021 at 4:48 PM Jorge Cardoso Leitão <
> > > > > >> jorgecarleitao@gmail.com> wrote:
> > > > > >>
> > > > > >>> Couple of questions
> > > > > >>>
> > > > > >>> 1. Is the goal that IRs have equal semantics, i.e. given
> > > (IR,data), the
> > > > > >>> operation "(IR,data) - engine -> result" MUST be the same for
> all
> > > > > "engine"?
> > > > > >>>
> > > > > >>
> > > > > >> I think that might be a non-starter for mundane reasons: there's
> > > > > probably
> > > > > >> at least two engines
> > > > > >> that disagree on the result of sum(x) where x is a floating
> point
> > > > > column.
> > > > > >>
> > > > > >>
> > > > > >>> 2. if yes, imo we may need to worry about:
> > > > > >>> * a definition of equality that implementations agree on.
> > > > > >>> * agreement over what the semantics look like. For example, do
> we
> > > use
> > > > > >>> kleene logic for AND and OR?
> > > > > >>>
> > > > > >>
> > > > > >> WRT Kleene logic my thoughts are that the IR should support both
> > > Kleene
> > > > > and
> > > > > >> non-Kleene,
> > > > > >> and producers can choose their desired semantics.
> > > > > >>
> > > > > >> Ibis for example, would override the `&` operator in `a & b` to
> > > produce
> > > > > >> `KleeneAnd(Column(a), Column(b))`.
> > > > > >>
> > > > > >>
> > > > > >>>
> > > > > >>> To try to understand the gist, let's pick an aggregated count
> > over
> > > a
> > > > > >>> column: engines often do partial counts over partitions
> followed
> > > by a
> > > > > final
> > > > > >>> "sum" over the partial counts. Is the idea that the query
> engine
> > > would
> > > > > >>> communicate with the compute engine via 2 IRs where one is
> "count
> > > me
> > > > > these"
> > > > > >>> the other is "sum me these"?
> > > > > >>>
> > > > > >>> Best,
> > > > > >>> Jorge
> > > > > >>>
> > > > > >>
> > > > > >> Not in its current incarnation.
> > > > > >>
> > > > > >> The idea is that the IR producer communicates a desire to
> count(x)
> > > to a
> > > > > >> consumer, and  it's up to the consumer to figure out how to turn
> > > that
> > > > > count
> > > > > >> into something that makes sense for itself. In your example
> > that's a
> > > > > series
> > > > > >> of partial counts followed by a sum.
> > > > > >>
> > > > > >
> > > > > > That said, I think there is a valid use case here where a system
> > > might
> > > > > > make use of different engines to execute different (composable)
> > > layers
> > > > > > of a complex query.
> > > > > >
> > > > > > For example:
> > > > > >
> > > > > > * suppose you want to scan and do predicate pushdown on an
> unusual
> > > > > > data source that is only accessible from one particular engine
> but
> > > > > > * you need to do some analytical operation with the scan results
> > that
> > > > > > is only supported by another engine
> > > > > >
> > > > > > You could decompose the query into two stages with an IR
> relational
> > > > > > expression for each stage and use then the engines together to
> > > > > > accomplish what you need (of course, you would need an
> > orchestration
> > > > > > layer to handle plumbing the query engine inputs and outputs
> > together
> > > > > > as Arrow streams). Supporting this kind of intra-application
> engine
> > > > > > heterogeneity is one of the motivations for the project.
> > > > > >
> > > > > >>>
> > > > > >>>
> > > > > >>>
> > > > > >>>
> > > > > >>>
> > > > > >>> On Wed, Aug 11, 2021 at 6:10 PM Phillip Cloud <
> cpcloud@gmail.com
> > >
> > > > > wrote:
> > > > > >>>
> > > > > >>>> Thanks Wes,
> > > > > >>>>
> > > > > >>>> Great to be back working on Arrow again and engaging with the
> > > > > community.
> > > > > >>> I
> > > > > >>>> am really excited about this effort.
> > > > > >>>>
> > > > > >>>> I think there are a number of concerns I see as important to
> > > address
> > > > > in
> > > > > >>> the
> > > > > >>>> compute IR proposal:
> > > > > >>>>
> > > > > >>>> 1. Requirement for output types.
> > > > > >>>>
> > > > > >>>> I think that so far there's been many reasons for requiring
> > > > > conforming IR
> > > > > >>>> producers and consumers to adhere to output types, but I
> haven't
> > > seen
> > > > > a
> > > > > >>>> strong rationale for keeping them optional (in the semantic
> > > sense, not
> > > > > >>> WRT
> > > > > >>>> any particular serialization format's representation of
> > > optionality).
> > > > > >>>>
> > > > > >>>> I think a design that includes unambiguous semantics for
> output
> > > types
> > > > > (a
> > > > > >>>> consumer must produce a value of the requested type or it's an
> > > > > >>>> error/non-conforming) is simpler to reason about for
> producers,
> > > and
> > > > > >>>> provides a strong guarantee for end users (humans or machines
> > > > > >>> constructing
> > > > > >>>> IR from an API and expecting the thing they ask for back from
> > the
> > > IR
> > > > > >>>> consumer).
> > > > > >>>>
> > > > > >>>> 2. Flexibility
> > > > > >>>>
> > > > > >>>> The current PR is currently unable to support what I think are
> > > killer
> > > > > >>>> features of the IR: custom operators (relational or column)
> and
> > > UDFs.
> > > > > In
> > > > > >>> my
> > > > > >>>> mind, on top of the generalized compute description that the
> IR
> > > > > offers,
> > > > > >>> the
> > > > > >>>> ability for producers and consumers of IR to extend the IR
> > without
> > > > > >>> needing
> > > > > >>>> to modify Arrow or depend on anything except the format is
> > itself
> > > > > >>> something
> > > > > >>>> that is necessary to gain adoption.
> > > > > >>>>
> > > > > >>>> Developers will need to build custom relational operators
> (e.g.,
> > > > > scans of
> > > > > >>>> backends that don't exist anywhere for which a user has code
> to
> > > > > >>> implement)
> > > > > >>>> and custom functions (anything operating on a column that
> > doesn't
> > > > > already
> > > > > >>>> exist, really). Furthermore, I think we can actually drive
> > > building an
> > > > > >>>> Arrow consumer using the same API that an end user would use
> to
> > > extend
> > > > > >>> the
> > > > > >>>> IR.
> > > > > >>>>
> > > > > >>>> 3. Window Functions
> > > > > >>>>
> > > > > >>>> Window functions are, I think, an important part of the IR
> value
> > > > > >>>> proposition, as they are one of the more complex operators in
> > > > > databases.
> > > > > >>> I
> > > > > >>>> think we need to have something in the initial IR proposal to
> > > support
> > > > > >>> these
> > > > > >>>> operations.
> > > > > >>>>
> > > > > >>>> 4. Non relational Joins
> > > > > >>>>
> > > > > >>>> Things like as-of join and window join operators aren't yet
> > > fleshed
> > > > > out
> > > > > >>> in
> > > > > >>>> the IR, and I'm not sure they should be in scope for the
> initial
> > > > > >>> prototype.
> > > > > >>>> I think once we settle on a design, we can work the design of
> > > these
> > > > > >>>> particular operators out during the initial prototype. I think
> > the
> > > > > >>>> specification of these operators should basically be PR #2
> after
> > > the
> > > > > >>>> initial design lands.
> > > > > >>>>
> > > > > >>>> # Order of Work
> > > > > >>>>
> > > > > >>>> 1. Nail down the design. Anything else is a non-starter.
> > > > > >>>>
> > > > > >>>> 2. Prototype an IR producer using Ibis
> > > > > >>>>
> > > > > >>>> Ibis is IMO a good candidate for a first IR producer as it
> has a
> > > > > number
> > > > > >>> of
> > > > > >>>> desirable properties that make prototyping faster and allow
> for
> > > us to
> > > > > >>>> refine the design of the IR as needed based on how the
> > > implementation
> > > > > >>> goes:
> > > > > >>>> * It's written in Python so it has native support for nearly
> all
> > > of
> > > > > >>>> flatbuffers' features without having to creating bindings to
> > C++.
> > > > > >>>> * There's already a set of rules for type checking, as well as
> > > APIs
> > > > > for
> > > > > >>>> constructing expression trees, which means we don't need to
> > worry
> > > > > about
> > > > > >>>> building a type checker for the prototype.
> > > > > >>>>
> > > > > >>>> 3. Prototype an IR consumer in C++
> > > > > >>>>
> > > > > >>>> I think in parallel to the producer prototype we can further
> > > inform
> > > > > the
> > > > > >>>> design from the consumer side by prototyping an IR consumer in
> > > C++ . I
> > > > > >>> know
> > > > > >>>> Ben Kietzman has expressed interest in working on this.
> > > > > >>>>
> > > > > >>>> Very interested to hear others' thoughts.
> > > > > >>>>
> > > > > >>>> -Phillip
> > > > > >>>>
> > > > > >>>> On Tue, Aug 10, 2021 at 10:56 AM Wes McKinney <
> > > wesmckinn@gmail.com>
> > > > > >>> wrote:
> > > > > >>>>
> > > > > >>>>> Thank you for all the feedback and comments on the document.
> > I'm
> > > on
> > > > > >>>>> vacation this week, so I'm delayed responding to everything,
> > but
> > > I
> > > > > >>>>> will get to it as quickly as I can. I will be at VLDB in
> > > Copenhagen
> > > > > >>>>> next week if anyone would like to chat in person about it,
> and
> > > we can
> > > > > >>>>> relay the content of any discussions back to the
> > > document/PR/e-mail
> > > > > >>>>> thread.
> > > > > >>>>>
> > > > > >>>>> I know that Phillip Cloud expressed interest in working on
> the
> > > PR and
> > > > > >>>>> helping work through many of the details, so I'm glad to have
> > the
> > > > > >>>>> help. If there are others who would like to work on the PR or
> > dig
> > > > > into
> > > > > >>>>> the details, please let me know. We might need to figure out
> > how
> > > to
> > > > > >>>>> accommodate "many cooks" by setting up the ComputeIR project
> > > > > somewhere
> > > > > >>>>> separate from the format/ directory to permit it to exist in
> a
> > > > > >>>>> Work-In-Progress status for a period of time until we work
> > > through
> > > > > the
> > > > > >>>>> various details and design concerns.
> > > > > >>>>>
> > > > > >>>>> Re Julian's comment
> > > > > >>>>>
> > > > > >>>>>> The biggest surprise is that this language does full
> > relational
> > > > > >>>>> operations. I was expecting that it would do fragments of the
> > > > > >>> operations.
> > > > > >>>>>
> > > > > >>>>> There's a related but different (yet still interesting and
> > > worthy of
> > > > > >>>>> analysis) problem of creating an "engine language" that
> > describes
> > > > > more
> > > > > >>>>> mechanically the constituent parts of implementing the
> > relational
> > > > > >>>>> operators. To create a functional computation language with
> > > concrete
> > > > > >>>>> Arrow data structures as a top-level primitive sounds like an
> > > > > >>>>> interesting research area where I could see something
> > developing
> > > > > >>>>> eventually.
> > > > > >>>>>
> > > > > >>>>> The main problem I'm interested in solving right now is
> > enabling
> > > > > front
> > > > > >>>>> ends that have sufficient understanding of relational algebra
> > and
> > > > > data
> > > > > >>>>> frame operations to talk to engines without having to go
> > > backwards
> > > > > >>>>> from their logical query plans to SQL. So as mentioned in the
> > > > > >>>>> document, being able to faithfully carry the relational
> > operator
> > > node
> > > > > >>>>> information generated by Calcite or Ibis or another system
> > would
> > > be
> > > > > >>>>> super useful. Defining the semantics of various kinds of
> > > user-defined
> > > > > >>>>> functions would also be helpful to standardize the
> > > > > >>>>> engine-to-user-language UDF/extension interface.
> > > > > >>>>>
> > > > > >>>>> On Tue, Aug 10, 2021 at 2:36 PM Dimitri Vorona <
> > > alendit@gmail.com>
> > > > > >>>> wrote:
> > > > > >>>>>>
> > > > > >>>>>> Hi Wes,
> > > > > >>>>>>
> > > > > >>>>>> cool initiative! Reminded me of "Building Advanced SQL
> > Analytics
> > > > > From
> > > > > >>>>>> Low-Level Plan Operators" from SIGMOD 2021 (
> > > > > >>>>>> http://db.in.tum.de/~kohn/papers/lolepops-sigmod21.pdf)
> which
> > > > > >>>> proposes a
> > > > > >>>>>> set of building block for advanced aggregation.
> > > > > >>>>>>
> > > > > >>>>>> Cheers,
> > > > > >>>>>> Dimitri.
> > > > > >>>>>>
> > > > > >>>>>> On Thu, Aug 5, 2021 at 7:59 PM Julian Hyde <
> > > jhyde.apache@gmail.com>
> > > > > >>>>> wrote:
> > > > > >>>>>>
> > > > > >>>>>>> Wes,
> > > > > >>>>>>>
> > > > > >>>>>>> Thanks for this. I’ve added comments to the doc and to the
> > PR.
> > > > > >>>>>>>
> > > > > >>>>>>> The biggest surprise is that this language does full
> > relational
> > > > > >>>>>>> operations. I was expecting that it would do fragments of
> the
> > > > > >>>>> operations.
> > > > > >>>>>>> Consider join. A distributed hybrid hash join needs to
> > > partition
> > > > > >>> rows
> > > > > >>>>> into
> > > > > >>>>>>> output buffers based on a hash key, build hash tables,
> probe
> > > into
> > > > > >>>> hash
> > > > > >>>>>>> tables, scan hash tables for untouched “outer”rows, and so
> > > forth.
> > > > > >>>>>>>
> > > > > >>>>>>> I see Arrow’s compute as delivering each of those
> operations,
> > > > > >>> working
> > > > > >>>>> on
> > > > > >>>>>>> perhaps a batch at a time, or a sequence of batches, with
> > some
> > > > > >>> other
> > > > > >>>>> system
> > > > > >>>>>>> coordinating those tasks. So I would expect to see Arrow’s
> > > compute
> > > > > >>>>> language
> > > > > >>>>>>> mainly operating on batches rather than a table
> abstraction.
> > > > > >>>>>>>
> > > > > >>>>>>> Julian
> > > > > >>>>>>>
> > > > > >>>>>>>
> > > > > >>>>>>>> On Aug 2, 2021, at 5:16 PM, Wes McKinney <
> > wesmckinn@gmail.com
> > > >
> > > > > >>>>> wrote:
> > > > > >>>>>>>>
> > > > > >>>>>>>> hi folks,
> > > > > >>>>>>>>
> > > > > >>>>>>>> This idea came up in passing in the past -- given that
> there
> > > are
> > > > > >>>>>>>> multiple independent efforts to develop Arrow-native query
> > > > > >>> engines
> > > > > >>>>>>>> (and surely many more to come), it seems like it would be
> > > > > >>> valuable
> > > > > >>>> to
> > > > > >>>>>>>> have a way to enable user languages (like Java, Python, R,
> > or
> > > > > >>> Rust,
> > > > > >>>>>>>> for example) to communicate with backend computing engines
> > > (like
> > > > > >>>>>>>> DataFusion, or new computing capabilities being built in
> the
> > > > > >>> Arrow
> > > > > >>>>> C++
> > > > > >>>>>>>> library) in a fashion that is "lower-level" than SQL and
> > > > > >>>> specialized
> > > > > >>>>>>>> to Arrow's type system. So rather than leaving it to a SQL
> > > > > >>> parser /
> > > > > >>>>>>>> analyzer framework to generate an expression tree of
> > > relational
> > > > > >>>>>>>> operators and then translate that to an Arrow-native
> > > > > >>>> compute-engine's
> > > > > >>>>>>>> internal grammer, a user framework could provide the
> desired
> > > > > >>>>>>>> Arrow-native expression tree / data manipulations directly
> > and
> > > > > >>> skip
> > > > > >>>>>>>> the SQL altogether.
> > > > > >>>>>>>>
> > > > > >>>>>>>> The idea of creating a "serialized intermediate
> > representation
> > > > > >>>> (IR)"
> > > > > >>>>>>>> for Arrow compute operations would be to serve use cases
> > large
> > > > > >>> and
> > > > > >>>>>>>> small -- from the most complex TPC-* or time series
> database
> > > > > >>> query
> > > > > >>>> to
> > > > > >>>>>>>> the most simple array predicate/filter sent with an RPC
> > > request
> > > > > >>>> using
> > > > > >>>>>>>> Arrow Flight. It is deliberately language- and
> > > engine-agnostic,
> > > > > >>>> with
> > > > > >>>>>>>> the only commonality across usages being the Arrow
> columnar
> > > > > >>> format
> > > > > >>>>>>>> (schemas and array types). This would be better than
> leaving
> > > it
> > > > > >>> to
> > > > > >>>>>>>> each application to develop its own bespoke expression
> > > > > >>>>> representations
> > > > > >>>>>>>> for its needs.
> > > > > >>>>>>>>
> > > > > >>>>>>>> I spent a while thinking about this and wrote up a brain
> > dump
> > > RFC
> > > > > >>>>>>>> document [1] and accompanying pull request [2] that makes
> > the
> > > > > >>>>> possibly
> > > > > >>>>>>>> controversial choice of using Flatbuffers to represent the
> > > > > >>>> serialized
> > > > > >>>>>>>> IR. I discuss the rationale for the choice of Flatbuffers
> in
> > > the
> > > > > >>>> RFC
> > > > > >>>>>>>> document. This PR is obviously deficient in many regards
> > > > > >>>> (incomplete,
> > > > > >>>>>>>> hacky, or unclear in places), and will need some help from
> > > others
> > > > > >>>> to
> > > > > >>>>>>>> flesh out. I suspect that actually implementing the IR
> will
> > be
> > > > > >>>>>>>> necessary to work out many of the low-level details.
> > > > > >>>>>>>>
> > > > > >>>>>>>> Note that this IR is intended to be more of a "superset"
> > > project
> > > > > >>>> than
> > > > > >>>>>>>> a "lowest common denominator". So there may be things that
> > it
> > > > > >>>>> includes
> > > > > >>>>>>>> which are only available in some engines (e.g. engines
> that
> > > have
> > > > > >>>>>>>> specialized handling of time series data).
> > > > > >>>>>>>>
> > > > > >>>>>>>> As some of my personal motivation for the project,
> > concurrent
> > > > > >>> with
> > > > > >>>>> the
> > > > > >>>>>>>> genesis of Apache Arrow, I started a Python project called
> > > Ibis
> > > > > >>> [3]
> > > > > >>>>>>>> (which is similar to R's dplyr project) which serves as a
> > > "Python
> > > > > >>>>>>>> analytical query IR builder" that is capable of generating
> > > most
> > > > > >>> of
> > > > > >>>>> the
> > > > > >>>>>>>> SQL standard, targeting many different SQL dialects and
> > other
> > > > > >>>>> backends
> > > > > >>>>>>>> (like pandas). Microsoft ([4]) and Google ([5]) have used
> > this
> > > > > >>>>> library
> > > > > >>>>>>>> as a "many-SQL" middleware. As such, I would like to be
> able
> > > to
> > > > > >>>>>>>> translate between the in-memory "logical query" data
> > > structures
> > > > > >>> in
> > > > > >>>> a
> > > > > >>>>>>>> library like Ibis to a serialized format that can be
> > executed
> > > by
> > > > > >>>> many
> > > > > >>>>>>>> different Arrow-native query engines. The expression
> > > primitives
> > > > > >>>>>>>> available in Ibis should serve as helpful test cases, too.
> > > > > >>>>>>>>
> > > > > >>>>>>>> I look forward to the community's comments on the RFC
> > document
> > > > > >>> [1]
> > > > > >>>>> and
> > > > > >>>>>>>> pull request [2] -- I realize that arriving at consensus
> on
> > a
> > > > > >>>> complex
> > > > > >>>>>>>> and ambitious project like this can be challenging so I
> > > recommend
> > > > > >>>>>>>> spending time on the "non-goals" section in the RFC and
> ask
> > > > > >>>> questions
> > > > > >>>>>>>> if you are unclear about the scope of what problems this
> is
> > > > > >>> trying
> > > > > >>>> to
> > > > > >>>>>>>> solve. I would be happy to give Edit access on the RFC
> > > document
> > > > > >>> to
> > > > > >>>>>>>> others and would consider ideas about how to move forward
> > with
> > > > > >>>>>>>> something that is able to be implemented by different
> Arrow
> > > > > >>>> libraries
> > > > > >>>>>>>> in the reasonably near future.
> > > > > >>>>>>>>
> > > > > >>>>>>>> Thanks,
> > > > > >>>>>>>> Wes
> > > > > >>>>>>>>
> > > > > >>>>>>>> [1]:
> > > > > >>>>>>>
> > > > > >>>>>
> > > > > >>>>
> > > > > >>>
> > > > >
> > >
> >
> https://docs.google.com/document/d/1C_XVOG7iFkl6cgWWMyzUoIjfKt-X2UxqagPJrla0bAE/edit#
> > > > > >>>>>>>> [2]: https://github.com/apache/arrow/pull/10856
> > > > > >>>>>>>> [3]: https://ibis-project.org/
> > > > > >>>>>>>> [4]:
> http://cidrdb.org/cidr2021/papers/cidr2021_paper08.pdf
> > > > > >>>>>>>> [5]:
> > > > > >>>>>>>
> > > > > >>>>>
> > > > > >>>>
> > > > > >>>
> > > > >
> > >
> >
> https://cloud.google.com/blog/products/databases/automate-data-validation-with-dvt
> > > > > >>>>>>>
> > > > > >>>>>>>
> > > > > >>>>>
> > > > > >>>>
> > > > > >>>
> > > > >
> > > > >
> > >
> >
>

Re: [DISCUSS] Developing an "Arrow Compute IR [Intermediate Representation]" to decouple language front ends from Arrow-native compute engines

Posted by Jacques Nadeau <ja...@apache.org>.

In a lucky turn of events, Phillip actually turned out to be in my neck of
the woods on Friday so we had a chance to sit down and discuss this. To
help, I actually shared something I had been working on a few months ago
independently (before this discussion started).

For reference:
Wes PR: https://github.com/apache/arrow/pull/10856
Ben PR: https://github.com/apache/arrow/pull/10934
Jacques PR: https://github.com/apache/arrow/pull/10979

The high level points of feedback I have are:

   - Ben PR feels too deconstructed. While I like the elegance and
   symmetry, I believe this will lead to substantially more work in moving
   from serialization format to something closer to what a system would want
   to manipulate/consume. The reality is that there are a lot of really known
   things and specializing the representation for these things will ultimately
   make things easier to program with without error and easier to debug. (For
   example, imagine trying to inspect a plan in a debugging session with the
   Ben representation.) We should embrace the known things in the
   specification.
   - I believe that it is a mistake for the inner workings of the plan to
   ever use field names. Only input (e.g. file read) and Output (e.g. return
   to user or write to file) need to have field names. For the rest of the
   system, using field ordinals (determinant whether nested or flat) is much
   cleaner and is how most execution systems work. For example, in Impala I
   believe it is called a slot. As I noted in the PR, Calcite as an example is
   entirely ordinal based at the algebra level. Rowtypes contain field names
   but they are actually basically pointless. Field references use RexInputRef
   with ordinal based and rules around column order output (e.g. what is the
   field order of a join output) are determinant and done entirely at an
   ordinal level. The only place where names should be used (besides
   input/output) is in the case of map keys. In that case, the names are
   actually data, as opposed to scheme metadata. This is why I propose a
   strongly structured dereference operation [1].
   - Properties should only be included in the serialization if they are
   not easily re-derivable at plan consumption time. For example, you'll note
   that I don't store schema information for relational operation. Each
   function and relational operation should already know how a given input is
   transformed to a given output. Capturing this information in the plan/IR is
   excessive. In many ways, I compare it to the early use of VectorLayout [2]
   in Arrow schema. It may have provided some additional checksum of the
   message but ultimately it caused more pain than it was worth (and thus we
   removed it before formalizing the specification). For reference, in the
   context of Dremio, we used to actually do this, send schema information
   around for all operations. We removed it because in many cases becoming the
   majority of our internal plan serialization (imagine simple operations that
   are carrying 1000s of fields).
   - I suggest focusing on support for both logical and physical
   representations. The moment you start talking about optimization passes,
   many of those would probably be better being done at the logical level. The
   overlap is really high.
   - I think a lot more work must be done before introducing UDFs and user
   defined relational operations. I see one goal being the possibility of
   there being three systems: A -> B -> C. A is a IR producer. C is a IR
   consumer and B is a IR filter or translator. In this situation, B should be
   able to operate and do optimizations on a plan even if if there are black
   box user defined operations. Being able to know the properties-preservation
   or not of these operations is important to making decisions. For example,
   does a user defined relational operation maintain sortedness? Distribution?
   Is a defined UDF idempotent? As such, I think the definition of those black
   boxes should be much more structured. For example: it is a python
   relational operation named X stored in Y that maintains properties 1,2 and
   disrupts property 3. Putting just a black box of bytes will substantially
   reduce the compatibility and extensibility of the ecosystem of tools
   working against IR. I'd note that I wouldn't expect this to be a burden to
   actual end users. By using sensible defaults, I still would expect an end
   user tool to support arbitrary user defined operations.
   - It might make sense to review the XML representation that Orca uses
   [3]. I haven't looked at it recently but they had a strong goal of
   decoupling for most of its life (to use in both Greenplum and Hawq). It
   could be the most mature/formal serialization of query plans publically
   available.


[1]
https://github.com/apache/arrow/pull/10979/files#diff-e40fbc40cf7a131efd2cb098444931774cfad046b8665b38452258ffaa2e3423R34
[2]
https://github.com/apache/arrow/commit/611a4b951e24f4f967c3d382a2027dc035fc37f0
[3] https://github.com/greenplum-db/gporca


On Tue, Aug 17, 2021 at 11:14 AM Phillip Cloud <cp...@gmail.com> wrote:

> On Tue, Aug 17, 2021 at 10:56 AM Wes McKinney <we...@gmail.com> wrote:
>
> > Looking at Ben's alternate PR [1], having an IR that leans heavily on
> > memory references to an out-of-band data sidecar seems like an
> > approach that would substantially ratchet up the implementation
> > complexity as producing the IR would then have the level of complexity
> > of producing the Arrow IPC format — when producing the "root" Plan
> > message, you must accumulate a list of the dependent serialized
> > submessages, placing the appropriate Buffer memory offset in the Plan
> > message, like we do when producing the RecordBatch.buffers field. This
> > seems complicated to me as you must devise a custom binary protocol to
> > concatenate the serialized Plan and the messages it depends on into a
> > single binary payload
> >
> > <ROOT PLAN>
> > <padding>
> > <Buffer 0>
> > <padding>
> > <Buffer 1>
> > <padding>
> > ...
> > <Buffer N - 1>
> >
> > (one purpose of FlatBufferBuilder is to spare you having to do this
> > yourself — some reasons we do it for the Arrow IPC format is because
> > appending Arrow memory buffers directly to a FlatBufferBuilder would
> > be inefficient — internal realloc calls — and Flatbuffers are limited
> > to 2GB. Neither of these things are problems here)
> >
> > In general, I believe the output created by an IR producer should be a
> > single serialized object without any out-of-band data sidecar — this
> > is much simpler for implementers and we can still provide an "escape
> > hatch" for user-defined operators and functions where the required
> > function/operator is passed opaquely as an embedded binary data.
>
>
>
> The serialization format (whether it is Flatbuffers or JSON, or
> > something else) should allow for data memoization, so if there is a
> > user-defined operator/function, or a relation that is used multiple
> > times throughout the query (potentially with a large schema), then we
> > should ensure that the data need not be duplicated in the
> > serialization format unnecessarily — in Flatbuffers, you can achieve
> > this by reusing offsets, but we could devise the data structures to
> > make the memoization of "expensive" objects more explicit.
> >
>
> I think this is something that would need to be explicitly encoded in
> the structures themselves if it's a hard requirement. I don't think this
> should block
> a prototype producer/consumer.
>
> Is there something in the second PR/design that precludes the reuse of
> offsets?
> To my eye, the flatbuffers offset reuse mechanism works just as well there.
>
>
> > I additionally think that it is important to provide as much built-in
> > support for "canonical" operators/functions (such as the ones
> > implemented commonly by SQL engines) as possible, and to liberally
> > expand the catalogue of "built-in" capabilities. I would still prefer
> > to have large unions/enums of built-in operators/functions and to
> > expand those unions/enums to accommodate new things when it is
> > demonstrated that there is a need to standardize things between
> > producers/consumers.
> >
>
> I think there's a middle ground where we add a bit of structure (something
> like
> a descriptor from the first PR) to indicate whether a thing is built-in vs
> user-defined.
> It looks like Ben has pushed something like this to his PR.
>
> With that scheme, we have both flexibility and a small set of special
> builtins that make up
> a statically typed set for expressions and relational operators.
>
> I would really like to vet this PR with a prototype this week,
> to see whether we need to revisit any major choices. I don't think we'll be
> able to
> anticipate all the consequences until we write some code.
>
>
> >
> > One of the beneficial properties of the Union/Enum approach for the
> > operator/function catalogues, is that when there are additions to
> > those enums, the generated Flatbuffers files will cause many language
> > compilers to warn or error on unhandled enum cases. If all
> > function/operator names are strings, then you are essentially
> > reimplementing the functionality provided by enums by hand. I
> > initially used strings for all function references in my original
> > prototype, but I now think that using an enum for "built-ins" would be
> > superior (because of the code-generated enum interface) and not a
> > premature optimization.
> >
> > [1]: https://github.com/apache/arrow/pull/10934
> >
> > On Fri, Aug 13, 2021 at 11:26 PM Phillip Cloud <cp...@gmail.com>
> wrote:
> > >
> > > Hey all,
> > >
> > > Just wanted to give an update on the effort here.
> > >
> > > Ben Kietzman has created an alternative proposal to the initial design
> > [1].
> > > It largely overlaps with the original, but differs in a few important
> > ways:
> > >
> > > * A big focus of the design is on flexibility, allowing producers,
> > > consumers and ultimately end users of those systems the ability to
> define
> > > custom operations in the graph.
> > > * There are very few predefined relational operations (project, filter,
> > > join and a handful of others).
> > > * There are only 3 types of value expressions: literals, field
> > references,
> > > and function calls.
> > > * The model of evaluation is one that requires a final sink node, to
> > > indicate where the record batches from child relations end up (a file,
> a
> > > socket, an in-memory buffer, etc).
> > >
> > > I've added notes [2] to the original Google doc (under the Alternative
> > > Design Notes subheading), and a few pseudocode examples.
> > > Unfortunately, these went out of date as soon as Ben pushed the PR [3],
> > so
> > > I need to update those to reflect his changes. Regardless,
> > > the design is broadly the same, so it should still give a good
> indication
> > > of the details of the design.
> > >
> > > There are a decent number of review comments on the original PR that I
> > plan
> > > to port over where they are still relevant.
> > > I also plan on adding support for window functions either tonight or on
> > > Monday.
> > >
> > > Please review this design at your earliest convenience. Since there's a
> > > fairly concrete set of types in flatbuffers that
> > > we can look at, ideally we can center discussion around the details in
> > the
> > > PR.
> > >
> > > Thanks!
> > >
> > > [1]: https://github.com/apache/arrow/pull/10856
> > > [2]:
> > >
> >
> https://docs.google.com/document/d/1C_XVOG7iFkl6cgWWMyzUoIjfKt-X2UxqagPJrla0bAE/edit#heading=h.4tfbbtaqzu13
> > > [3]: https://github.com/apache/arrow/pull/10934
> > >
> > > On Thu, Aug 12, 2021 at 3:55 PM Julian Hyde <jh...@gmail.com>
> > wrote:
> > >
> > > > > Wes wrote:
> > > > >
> > > > > Supporting this kind of intra-application engine
> > > > > heterogeneity is one of the motivations for the project.
> > > >
> > > > +1
> > > >
> > > > The data format is the natural interface between tasks. (Defining
> > “task”
> > > > here as “something that is programmed using the IR”.) That is Arrow’s
> > > > strength.
> > > >
> > > > So I think the IR should describe what each task should do, and tasks
> > > > should be fairly small. Not whole relational operators, operating on
> > whole
> > > > tables, but pieces of relational operators, operating on batches or
> > > > sequences of batches.
> > > >
> > > > Elsethread, someone mentioned the LoLePop concept and the
> > Kohn/Leis/Neuman
> > > > paper [1]. The LoLePop concept sounds good for our purposes.
> > > >
> > > > Julian
> > > >
> > > > [1] https://db.in.tum.de/~kohn/papers/lolepops-sigmod21.pdf
> > > >
> > > >
> > > > > On Aug 12, 2021, at 5:19 AM, Wes McKinney <we...@gmail.com>
> > wrote:
> > > > >
> > > > > On Wed, Aug 11, 2021 at 11:22 PM Phillip Cloud <cp...@gmail.com>
> > > > wrote:
> > > > >>
> > > > >> On Wed, Aug 11, 2021 at 4:48 PM Jorge Cardoso Leitão <
> > > > >> jorgecarleitao@gmail.com> wrote:
> > > > >>
> > > > >>> Couple of questions
> > > > >>>
> > > > >>> 1. Is the goal that IRs have equal semantics, i.e. given
> > (IR,data), the
> > > > >>> operation "(IR,data) - engine -> result" MUST be the same for all
> > > > "engine"?
> > > > >>>
> > > > >>
> > > > >> I think that might be a non-starter for mundane reasons: there's
> > > > probably
> > > > >> at least two engines
> > > > >> that disagree on the result of sum(x) where x is a floating point
> > > > column.
> > > > >>
> > > > >>
> > > > >>> 2. if yes, imo we may need to worry about:
> > > > >>> * a definition of equality that implementations agree on.
> > > > >>> * agreement over what the semantics look like. For example, do we
> > use
> > > > >>> kleene logic for AND and OR?
> > > > >>>
> > > > >>
> > > > >> WRT Kleene logic my thoughts are that the IR should support both
> > Kleene
> > > > and
> > > > >> non-Kleene,
> > > > >> and producers can choose their desired semantics.
> > > > >>
> > > > >> Ibis for example, would override the `&` operator in `a & b` to
> > produce
> > > > >> `KleeneAnd(Column(a), Column(b))`.
> > > > >>
> > > > >>
> > > > >>>
> > > > >>> To try to understand the gist, let's pick an aggregated count
> over
> > a
> > > > >>> column: engines often do partial counts over partitions followed
> > by a
> > > > final
> > > > >>> "sum" over the partial counts. Is the idea that the query engine
> > would
> > > > >>> communicate with the compute engine via 2 IRs where one is "count
> > me
> > > > these"
> > > > >>> the other is "sum me these"?
> > > > >>>
> > > > >>> Best,
> > > > >>> Jorge
> > > > >>>
> > > > >>
> > > > >> Not in its current incarnation.
> > > > >>
> > > > >> The idea is that the IR producer communicates a desire to count(x)
> > to a
> > > > >> consumer, and  it's up to the consumer to figure out how to turn
> > that
> > > > count
> > > > >> into something that makes sense for itself. In your example
> that's a
> > > > series
> > > > >> of partial counts followed by a sum.
> > > > >>
> > > > >
> > > > > That said, I think there is a valid use case here where a system
> > might
> > > > > make use of different engines to execute different (composable)
> > layers
> > > > > of a complex query.
> > > > >
> > > > > For example:
> > > > >
> > > > > * suppose you want to scan and do predicate pushdown on an unusual
> > > > > data source that is only accessible from one particular engine but
> > > > > * you need to do some analytical operation with the scan results
> that
> > > > > is only supported by another engine
> > > > >
> > > > > You could decompose the query into two stages with an IR relational
> > > > > expression for each stage and use then the engines together to
> > > > > accomplish what you need (of course, you would need an
> orchestration
> > > > > layer to handle plumbing the query engine inputs and outputs
> together
> > > > > as Arrow streams). Supporting this kind of intra-application engine
> > > > > heterogeneity is one of the motivations for the project.
> > > > >
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>> On Wed, Aug 11, 2021 at 6:10 PM Phillip Cloud <cpcloud@gmail.com
> >
> > > > wrote:
> > > > >>>
> > > > >>>> Thanks Wes,
> > > > >>>>
> > > > >>>> Great to be back working on Arrow again and engaging with the
> > > > community.
> > > > >>> I
> > > > >>>> am really excited about this effort.
> > > > >>>>
> > > > >>>> I think there are a number of concerns I see as important to
> > address
> > > > in
> > > > >>> the
> > > > >>>> compute IR proposal:
> > > > >>>>
> > > > >>>> 1. Requirement for output types.
> > > > >>>>
> > > > >>>> I think that so far there's been many reasons for requiring
> > > > conforming IR
> > > > >>>> producers and consumers to adhere to output types, but I haven't
> > seen
> > > > a
> > > > >>>> strong rationale for keeping them optional (in the semantic
> > sense, not
> > > > >>> WRT
> > > > >>>> any particular serialization format's representation of
> > optionality).
> > > > >>>>
> > > > >>>> I think a design that includes unambiguous semantics for output
> > types
> > > > (a
> > > > >>>> consumer must produce a value of the requested type or it's an
> > > > >>>> error/non-conforming) is simpler to reason about for producers,
> > and
> > > > >>>> provides a strong guarantee for end users (humans or machines
> > > > >>> constructing
> > > > >>>> IR from an API and expecting the thing they ask for back from
> the
> > IR
> > > > >>>> consumer).
> > > > >>>>
> > > > >>>> 2. Flexibility
> > > > >>>>
> > > > >>>> The current PR is currently unable to support what I think are
> > killer
> > > > >>>> features of the IR: custom operators (relational or column) and
> > UDFs.
> > > > In
> > > > >>> my
> > > > >>>> mind, on top of the generalized compute description that the IR
> > > > offers,
> > > > >>> the
> > > > >>>> ability for producers and consumers of IR to extend the IR
> without
> > > > >>> needing
> > > > >>>> to modify Arrow or depend on anything except the format is
> itself
> > > > >>> something
> > > > >>>> that is necessary to gain adoption.
> > > > >>>>
> > > > >>>> Developers will need to build custom relational operators (e.g.,
> > > > scans of
> > > > >>>> backends that don't exist anywhere for which a user has code to
> > > > >>> implement)
> > > > >>>> and custom functions (anything operating on a column that
> doesn't
> > > > already
> > > > >>>> exist, really). Furthermore, I think we can actually drive
> > building an
> > > > >>>> Arrow consumer using the same API that an end user would use to
> > extend
> > > > >>> the
> > > > >>>> IR.
> > > > >>>>
> > > > >>>> 3. Window Functions
> > > > >>>>
> > > > >>>> Window functions are, I think, an important part of the IR value
> > > > >>>> proposition, as they are one of the more complex operators in
> > > > databases.
> > > > >>> I
> > > > >>>> think we need to have something in the initial IR proposal to
> > support
> > > > >>> these
> > > > >>>> operations.
> > > > >>>>
> > > > >>>> 4. Non relational Joins
> > > > >>>>
> > > > >>>> Things like as-of join and window join operators aren't yet
> > fleshed
> > > > out
> > > > >>> in
> > > > >>>> the IR, and I'm not sure they should be in scope for the initial
> > > > >>> prototype.
> > > > >>>> I think once we settle on a design, we can work the design of
> > these
> > > > >>>> particular operators out during the initial prototype. I think
> the
> > > > >>>> specification of these operators should basically be PR #2 after
> > the
> > > > >>>> initial design lands.
> > > > >>>>
> > > > >>>> # Order of Work
> > > > >>>>
> > > > >>>> 1. Nail down the design. Anything else is a non-starter.
> > > > >>>>
> > > > >>>> 2. Prototype an IR producer using Ibis
> > > > >>>>
> > > > >>>> Ibis is IMO a good candidate for a first IR producer as it has a
> > > > number
> > > > >>> of
> > > > >>>> desirable properties that make prototyping faster and allow for
> > us to
> > > > >>>> refine the design of the IR as needed based on how the
> > implementation
> > > > >>> goes:
> > > > >>>> * It's written in Python so it has native support for nearly all
> > of
> > > > >>>> flatbuffers' features without having to creating bindings to
> C++.
> > > > >>>> * There's already a set of rules for type checking, as well as
> > APIs
> > > > for
> > > > >>>> constructing expression trees, which means we don't need to
> worry
> > > > about
> > > > >>>> building a type checker for the prototype.
> > > > >>>>
> > > > >>>> 3. Prototype an IR consumer in C++
> > > > >>>>
> > > > >>>> I think in parallel to the producer prototype we can further
> > inform
> > > > the
> > > > >>>> design from the consumer side by prototyping an IR consumer in
> > C++ . I
> > > > >>> know
> > > > >>>> Ben Kietzman has expressed interest in working on this.
> > > > >>>>
> > > > >>>> Very interested to hear others' thoughts.
> > > > >>>>
> > > > >>>> -Phillip
> > > > >>>>
> > > > >>>> On Tue, Aug 10, 2021 at 10:56 AM Wes McKinney <
> > wesmckinn@gmail.com>
> > > > >>> wrote:
> > > > >>>>
> > > > >>>>> Thank you for all the feedback and comments on the document.
> I'm
> > on
> > > > >>>>> vacation this week, so I'm delayed responding to everything,
> but
> > I
> > > > >>>>> will get to it as quickly as I can. I will be at VLDB in
> > Copenhagen
> > > > >>>>> next week if anyone would like to chat in person about it, and
> > we can
> > > > >>>>> relay the content of any discussions back to the
> > document/PR/e-mail
> > > > >>>>> thread.
> > > > >>>>>
> > > > >>>>> I know that Phillip Cloud expressed interest in working on the
> > PR and
> > > > >>>>> helping work through many of the details, so I'm glad to have
> the
> > > > >>>>> help. If there are others who would like to work on the PR or
> dig
> > > > into
> > > > >>>>> the details, please let me know. We might need to figure out
> how
> > to
> > > > >>>>> accommodate "many cooks" by setting up the ComputeIR project
> > > > somewhere
> > > > >>>>> separate from the format/ directory to permit it to exist in a
> > > > >>>>> Work-In-Progress status for a period of time until we work
> > through
> > > > the
> > > > >>>>> various details and design concerns.
> > > > >>>>>
> > > > >>>>> Re Julian's comment
> > > > >>>>>
> > > > >>>>>> The biggest surprise is that this language does full
> relational
> > > > >>>>> operations. I was expecting that it would do fragments of the
> > > > >>> operations.
> > > > >>>>>
> > > > >>>>> There's a related but different (yet still interesting and
> > worthy of
> > > > >>>>> analysis) problem of creating an "engine language" that
> describes
> > > > more
> > > > >>>>> mechanically the constituent parts of implementing the
> relational
> > > > >>>>> operators. To create a functional computation language with
> > concrete
> > > > >>>>> Arrow data structures as a top-level primitive sounds like an
> > > > >>>>> interesting research area where I could see something
> developing
> > > > >>>>> eventually.
> > > > >>>>>
> > > > >>>>> The main problem I'm interested in solving right now is
> enabling
> > > > front
> > > > >>>>> ends that have sufficient understanding of relational algebra
> and
> > > > data
> > > > >>>>> frame operations to talk to engines without having to go
> > backwards
> > > > >>>>> from their logical query plans to SQL. So as mentioned in the
> > > > >>>>> document, being able to faithfully carry the relational
> operator
> > node
> > > > >>>>> information generated by Calcite or Ibis or another system
> would
> > be
> > > > >>>>> super useful. Defining the semantics of various kinds of
> > user-defined
> > > > >>>>> functions would also be helpful to standardize the
> > > > >>>>> engine-to-user-language UDF/extension interface.
> > > > >>>>>
> > > > >>>>> On Tue, Aug 10, 2021 at 2:36 PM Dimitri Vorona <
> > alendit@gmail.com>
> > > > >>>> wrote:
> > > > >>>>>>
> > > > >>>>>> Hi Wes,
> > > > >>>>>>
> > > > >>>>>> cool initiative! Reminded me of "Building Advanced SQL
> Analytics
> > > > From
> > > > >>>>>> Low-Level Plan Operators" from SIGMOD 2021 (
> > > > >>>>>> http://db.in.tum.de/~kohn/papers/lolepops-sigmod21.pdf) which
> > > > >>>> proposes a
> > > > >>>>>> set of building block for advanced aggregation.
> > > > >>>>>>
> > > > >>>>>> Cheers,
> > > > >>>>>> Dimitri.
> > > > >>>>>>
> > > > >>>>>> On Thu, Aug 5, 2021 at 7:59 PM Julian Hyde <
> > jhyde.apache@gmail.com>
> > > > >>>>> wrote:
> > > > >>>>>>
> > > > >>>>>>> Wes,
> > > > >>>>>>>
> > > > >>>>>>> Thanks for this. I’ve added comments to the doc and to the
> PR.
> > > > >>>>>>>
> > > > >>>>>>> The biggest surprise is that this language does full
> relational
> > > > >>>>>>> operations. I was expecting that it would do fragments of the
> > > > >>>>> operations.
> > > > >>>>>>> Consider join. A distributed hybrid hash join needs to
> > partition
> > > > >>> rows
> > > > >>>>> into
> > > > >>>>>>> output buffers based on a hash key, build hash tables, probe
> > into
> > > > >>>> hash
> > > > >>>>>>> tables, scan hash tables for untouched “outer”rows, and so
> > forth.
> > > > >>>>>>>
> > > > >>>>>>> I see Arrow’s compute as delivering each of those operations,
> > > > >>> working
> > > > >>>>> on
> > > > >>>>>>> perhaps a batch at a time, or a sequence of batches, with
> some
> > > > >>> other
> > > > >>>>> system
> > > > >>>>>>> coordinating those tasks. So I would expect to see Arrow’s
> > compute
> > > > >>>>> language
> > > > >>>>>>> mainly operating on batches rather than a table abstraction.
> > > > >>>>>>>
> > > > >>>>>>> Julian
> > > > >>>>>>>
> > > > >>>>>>>
> > > > >>>>>>>> On Aug 2, 2021, at 5:16 PM, Wes McKinney <
> wesmckinn@gmail.com
> > >
> > > > >>>>> wrote:
> > > > >>>>>>>>
> > > > >>>>>>>> hi folks,
> > > > >>>>>>>>
> > > > >>>>>>>> This idea came up in passing in the past -- given that there
> > are
> > > > >>>>>>>> multiple independent efforts to develop Arrow-native query
> > > > >>> engines
> > > > >>>>>>>> (and surely many more to come), it seems like it would be
> > > > >>> valuable
> > > > >>>> to
> > > > >>>>>>>> have a way to enable user languages (like Java, Python, R,
> or
> > > > >>> Rust,
> > > > >>>>>>>> for example) to communicate with backend computing engines
> > (like
> > > > >>>>>>>> DataFusion, or new computing capabilities being built in the
> > > > >>> Arrow
> > > > >>>>> C++
> > > > >>>>>>>> library) in a fashion that is "lower-level" than SQL and
> > > > >>>> specialized
> > > > >>>>>>>> to Arrow's type system. So rather than leaving it to a SQL
> > > > >>> parser /
> > > > >>>>>>>> analyzer framework to generate an expression tree of
> > relational
> > > > >>>>>>>> operators and then translate that to an Arrow-native
> > > > >>>> compute-engine's
> > > > >>>>>>>> internal grammer, a user framework could provide the desired
> > > > >>>>>>>> Arrow-native expression tree / data manipulations directly
> and
> > > > >>> skip
> > > > >>>>>>>> the SQL altogether.
> > > > >>>>>>>>
> > > > >>>>>>>> The idea of creating a "serialized intermediate
> representation
> > > > >>>> (IR)"
> > > > >>>>>>>> for Arrow compute operations would be to serve use cases
> large
> > > > >>> and
> > > > >>>>>>>> small -- from the most complex TPC-* or time series database
> > > > >>> query
> > > > >>>> to
> > > > >>>>>>>> the most simple array predicate/filter sent with an RPC
> > request
> > > > >>>> using
> > > > >>>>>>>> Arrow Flight. It is deliberately language- and
> > engine-agnostic,
> > > > >>>> with
> > > > >>>>>>>> the only commonality across usages being the Arrow columnar
> > > > >>> format
> > > > >>>>>>>> (schemas and array types). This would be better than leaving
> > it
> > > > >>> to
> > > > >>>>>>>> each application to develop its own bespoke expression
> > > > >>>>> representations
> > > > >>>>>>>> for its needs.
> > > > >>>>>>>>
> > > > >>>>>>>> I spent a while thinking about this and wrote up a brain
> dump
> > RFC
> > > > >>>>>>>> document [1] and accompanying pull request [2] that makes
> the
> > > > >>>>> possibly
> > > > >>>>>>>> controversial choice of using Flatbuffers to represent the
> > > > >>>> serialized
> > > > >>>>>>>> IR. I discuss the rationale for the choice of Flatbuffers in
> > the
> > > > >>>> RFC
> > > > >>>>>>>> document. This PR is obviously deficient in many regards
> > > > >>>> (incomplete,
> > > > >>>>>>>> hacky, or unclear in places), and will need some help from
> > others
> > > > >>>> to
> > > > >>>>>>>> flesh out. I suspect that actually implementing the IR will
> be
> > > > >>>>>>>> necessary to work out many of the low-level details.
> > > > >>>>>>>>
> > > > >>>>>>>> Note that this IR is intended to be more of a "superset"
> > project
> > > > >>>> than
> > > > >>>>>>>> a "lowest common denominator". So there may be things that
> it
> > > > >>>>> includes
> > > > >>>>>>>> which are only available in some engines (e.g. engines that
> > have
> > > > >>>>>>>> specialized handling of time series data).
> > > > >>>>>>>>
> > > > >>>>>>>> As some of my personal motivation for the project,
> concurrent
> > > > >>> with
> > > > >>>>> the
> > > > >>>>>>>> genesis of Apache Arrow, I started a Python project called
> > Ibis
> > > > >>> [3]
> > > > >>>>>>>> (which is similar to R's dplyr project) which serves as a
> > "Python
> > > > >>>>>>>> analytical query IR builder" that is capable of generating
> > most
> > > > >>> of
> > > > >>>>> the
> > > > >>>>>>>> SQL standard, targeting many different SQL dialects and
> other
> > > > >>>>> backends
> > > > >>>>>>>> (like pandas). Microsoft ([4]) and Google ([5]) have used
> this
> > > > >>>>> library
> > > > >>>>>>>> as a "many-SQL" middleware. As such, I would like to be able
> > to
> > > > >>>>>>>> translate between the in-memory "logical query" data
> > structures
> > > > >>> in
> > > > >>>> a
> > > > >>>>>>>> library like Ibis to a serialized format that can be
> executed
> > by
> > > > >>>> many
> > > > >>>>>>>> different Arrow-native query engines. The expression
> > primitives
> > > > >>>>>>>> available in Ibis should serve as helpful test cases, too.
> > > > >>>>>>>>
> > > > >>>>>>>> I look forward to the community's comments on the RFC
> document
> > > > >>> [1]
> > > > >>>>> and
> > > > >>>>>>>> pull request [2] -- I realize that arriving at consensus on
> a
> > > > >>>> complex
> > > > >>>>>>>> and ambitious project like this can be challenging so I
> > recommend
> > > > >>>>>>>> spending time on the "non-goals" section in the RFC and ask
> > > > >>>> questions
> > > > >>>>>>>> if you are unclear about the scope of what problems this is
> > > > >>> trying
> > > > >>>> to
> > > > >>>>>>>> solve. I would be happy to give Edit access on the RFC
> > document
> > > > >>> to
> > > > >>>>>>>> others and would consider ideas about how to move forward
> with
> > > > >>>>>>>> something that is able to be implemented by different Arrow
> > > > >>>> libraries
> > > > >>>>>>>> in the reasonably near future.
> > > > >>>>>>>>
> > > > >>>>>>>> Thanks,
> > > > >>>>>>>> Wes
> > > > >>>>>>>>
> > > > >>>>>>>> [1]:
> > > > >>>>>>>
> > > > >>>>>
> > > > >>>>
> > > > >>>
> > > >
> >
> https://docs.google.com/document/d/1C_XVOG7iFkl6cgWWMyzUoIjfKt-X2UxqagPJrla0bAE/edit#
> > > > >>>>>>>> [2]: https://github.com/apache/arrow/pull/10856
> > > > >>>>>>>> [3]: https://ibis-project.org/
> > > > >>>>>>>> [4]: http://cidrdb.org/cidr2021/papers/cidr2021_paper08.pdf
> > > > >>>>>>>> [5]:
> > > > >>>>>>>
> > > > >>>>>
> > > > >>>>
> > > > >>>
> > > >
> >
> https://cloud.google.com/blog/products/databases/automate-data-validation-with-dvt
> > > > >>>>>>>
> > > > >>>>>>>
> > > > >>>>>
> > > > >>>>
> > > > >>>
> > > >
> > > >
> >
>

Re: [DISCUSS] Developing an "Arrow Compute IR [Intermediate Representation]" to decouple language front ends from Arrow-native compute engines

Posted by Phillip Cloud <cp...@gmail.com>.

On Tue, Aug 17, 2021 at 10:56 AM Wes McKinney <we...@gmail.com> wrote:

> Looking at Ben's alternate PR [1], having an IR that leans heavily on
> memory references to an out-of-band data sidecar seems like an
> approach that would substantially ratchet up the implementation
> complexity as producing the IR would then have the level of complexity
> of producing the Arrow IPC format — when producing the "root" Plan
> message, you must accumulate a list of the dependent serialized
> submessages, placing the appropriate Buffer memory offset in the Plan
> message, like we do when producing the RecordBatch.buffers field. This
> seems complicated to me as you must devise a custom binary protocol to
> concatenate the serialized Plan and the messages it depends on into a
> single binary payload
>
> <ROOT PLAN>
> <padding>
> <Buffer 0>
> <padding>
> <Buffer 1>
> <padding>
> ...
> <Buffer N - 1>
>
> (one purpose of FlatBufferBuilder is to spare you having to do this
> yourself — some reasons we do it for the Arrow IPC format is because
> appending Arrow memory buffers directly to a FlatBufferBuilder would
> be inefficient — internal realloc calls — and Flatbuffers are limited
> to 2GB. Neither of these things are problems here)
>
> In general, I believe the output created by an IR producer should be a
> single serialized object without any out-of-band data sidecar — this
> is much simpler for implementers and we can still provide an "escape
> hatch" for user-defined operators and functions where the required
> function/operator is passed opaquely as an embedded binary data.



The serialization format (whether it is Flatbuffers or JSON, or
> something else) should allow for data memoization, so if there is a
> user-defined operator/function, or a relation that is used multiple
> times throughout the query (potentially with a large schema), then we
> should ensure that the data need not be duplicated in the
> serialization format unnecessarily — in Flatbuffers, you can achieve
> this by reusing offsets, but we could devise the data structures to
> make the memoization of "expensive" objects more explicit.
>

I think this is something that would need to be explicitly encoded in
the structures themselves if it's a hard requirement. I don't think this
should block
a prototype producer/consumer.

Is there something in the second PR/design that precludes the reuse of
offsets?
To my eye, the flatbuffers offset reuse mechanism works just as well there.


> I additionally think that it is important to provide as much built-in
> support for "canonical" operators/functions (such as the ones
> implemented commonly by SQL engines) as possible, and to liberally
> expand the catalogue of "built-in" capabilities. I would still prefer
> to have large unions/enums of built-in operators/functions and to
> expand those unions/enums to accommodate new things when it is
> demonstrated that there is a need to standardize things between
> producers/consumers.
>

I think there's a middle ground where we add a bit of structure (something
like
a descriptor from the first PR) to indicate whether a thing is built-in vs
user-defined.
It looks like Ben has pushed something like this to his PR.

With that scheme, we have both flexibility and a small set of special
builtins that make up
a statically typed set for expressions and relational operators.

I would really like to vet this PR with a prototype this week,
to see whether we need to revisit any major choices. I don't think we'll be
able to
anticipate all the consequences until we write some code.


>
> One of the beneficial properties of the Union/Enum approach for the
> operator/function catalogues, is that when there are additions to
> those enums, the generated Flatbuffers files will cause many language
> compilers to warn or error on unhandled enum cases. If all
> function/operator names are strings, then you are essentially
> reimplementing the functionality provided by enums by hand. I
> initially used strings for all function references in my original
> prototype, but I now think that using an enum for "built-ins" would be
> superior (because of the code-generated enum interface) and not a
> premature optimization.
>
> [1]: https://github.com/apache/arrow/pull/10934
>
> On Fri, Aug 13, 2021 at 11:26 PM Phillip Cloud <cp...@gmail.com> wrote:
> >
> > Hey all,
> >
> > Just wanted to give an update on the effort here.
> >
> > Ben Kietzman has created an alternative proposal to the initial design
> [1].
> > It largely overlaps with the original, but differs in a few important
> ways:
> >
> > * A big focus of the design is on flexibility, allowing producers,
> > consumers and ultimately end users of those systems the ability to define
> > custom operations in the graph.
> > * There are very few predefined relational operations (project, filter,
> > join and a handful of others).
> > * There are only 3 types of value expressions: literals, field
> references,
> > and function calls.
> > * The model of evaluation is one that requires a final sink node, to
> > indicate where the record batches from child relations end up (a file, a
> > socket, an in-memory buffer, etc).
> >
> > I've added notes [2] to the original Google doc (under the Alternative
> > Design Notes subheading), and a few pseudocode examples.
> > Unfortunately, these went out of date as soon as Ben pushed the PR [3],
> so
> > I need to update those to reflect his changes. Regardless,
> > the design is broadly the same, so it should still give a good indication
> > of the details of the design.
> >
> > There are a decent number of review comments on the original PR that I
> plan
> > to port over where they are still relevant.
> > I also plan on adding support for window functions either tonight or on
> > Monday.
> >
> > Please review this design at your earliest convenience. Since there's a
> > fairly concrete set of types in flatbuffers that
> > we can look at, ideally we can center discussion around the details in
> the
> > PR.
> >
> > Thanks!
> >
> > [1]: https://github.com/apache/arrow/pull/10856
> > [2]:
> >
> https://docs.google.com/document/d/1C_XVOG7iFkl6cgWWMyzUoIjfKt-X2UxqagPJrla0bAE/edit#heading=h.4tfbbtaqzu13
> > [3]: https://github.com/apache/arrow/pull/10934
> >
> > On Thu, Aug 12, 2021 at 3:55 PM Julian Hyde <jh...@gmail.com>
> wrote:
> >
> > > > Wes wrote:
> > > >
> > > > Supporting this kind of intra-application engine
> > > > heterogeneity is one of the motivations for the project.
> > >
> > > +1
> > >
> > > The data format is the natural interface between tasks. (Defining
> “task”
> > > here as “something that is programmed using the IR”.) That is Arrow’s
> > > strength.
> > >
> > > So I think the IR should describe what each task should do, and tasks
> > > should be fairly small. Not whole relational operators, operating on
> whole
> > > tables, but pieces of relational operators, operating on batches or
> > > sequences of batches.
> > >
> > > Elsethread, someone mentioned the LoLePop concept and the
> Kohn/Leis/Neuman
> > > paper [1]. The LoLePop concept sounds good for our purposes.
> > >
> > > Julian
> > >
> > > [1] https://db.in.tum.de/~kohn/papers/lolepops-sigmod21.pdf
> > >
> > >
> > > > On Aug 12, 2021, at 5:19 AM, Wes McKinney <we...@gmail.com>
> wrote:
> > > >
> > > > On Wed, Aug 11, 2021 at 11:22 PM Phillip Cloud <cp...@gmail.com>
> > > wrote:
> > > >>
> > > >> On Wed, Aug 11, 2021 at 4:48 PM Jorge Cardoso Leitão <
> > > >> jorgecarleitao@gmail.com> wrote:
> > > >>
> > > >>> Couple of questions
> > > >>>
> > > >>> 1. Is the goal that IRs have equal semantics, i.e. given
> (IR,data), the
> > > >>> operation "(IR,data) - engine -> result" MUST be the same for all
> > > "engine"?
> > > >>>
> > > >>
> > > >> I think that might be a non-starter for mundane reasons: there's
> > > probably
> > > >> at least two engines
> > > >> that disagree on the result of sum(x) where x is a floating point
> > > column.
> > > >>
> > > >>
> > > >>> 2. if yes, imo we may need to worry about:
> > > >>> * a definition of equality that implementations agree on.
> > > >>> * agreement over what the semantics look like. For example, do we
> use
> > > >>> kleene logic for AND and OR?
> > > >>>
> > > >>
> > > >> WRT Kleene logic my thoughts are that the IR should support both
> Kleene
> > > and
> > > >> non-Kleene,
> > > >> and producers can choose their desired semantics.
> > > >>
> > > >> Ibis for example, would override the `&` operator in `a & b` to
> produce
> > > >> `KleeneAnd(Column(a), Column(b))`.
> > > >>
> > > >>
> > > >>>
> > > >>> To try to understand the gist, let's pick an aggregated count over
> a
> > > >>> column: engines often do partial counts over partitions followed
> by a
> > > final
> > > >>> "sum" over the partial counts. Is the idea that the query engine
> would
> > > >>> communicate with the compute engine via 2 IRs where one is "count
> me
> > > these"
> > > >>> the other is "sum me these"?
> > > >>>
> > > >>> Best,
> > > >>> Jorge
> > > >>>
> > > >>
> > > >> Not in its current incarnation.
> > > >>
> > > >> The idea is that the IR producer communicates a desire to count(x)
> to a
> > > >> consumer, and  it's up to the consumer to figure out how to turn
> that
> > > count
> > > >> into something that makes sense for itself. In your example that's a
> > > series
> > > >> of partial counts followed by a sum.
> > > >>
> > > >
> > > > That said, I think there is a valid use case here where a system
> might
> > > > make use of different engines to execute different (composable)
> layers
> > > > of a complex query.
> > > >
> > > > For example:
> > > >
> > > > * suppose you want to scan and do predicate pushdown on an unusual
> > > > data source that is only accessible from one particular engine but
> > > > * you need to do some analytical operation with the scan results that
> > > > is only supported by another engine
> > > >
> > > > You could decompose the query into two stages with an IR relational
> > > > expression for each stage and use then the engines together to
> > > > accomplish what you need (of course, you would need an orchestration
> > > > layer to handle plumbing the query engine inputs and outputs together
> > > > as Arrow streams). Supporting this kind of intra-application engine
> > > > heterogeneity is one of the motivations for the project.
> > > >
> > > >>>
> > > >>>
> > > >>>
> > > >>>
> > > >>>
> > > >>> On Wed, Aug 11, 2021 at 6:10 PM Phillip Cloud <cp...@gmail.com>
> > > wrote:
> > > >>>
> > > >>>> Thanks Wes,
> > > >>>>
> > > >>>> Great to be back working on Arrow again and engaging with the
> > > community.
> > > >>> I
> > > >>>> am really excited about this effort.
> > > >>>>
> > > >>>> I think there are a number of concerns I see as important to
> address
> > > in
> > > >>> the
> > > >>>> compute IR proposal:
> > > >>>>
> > > >>>> 1. Requirement for output types.
> > > >>>>
> > > >>>> I think that so far there's been many reasons for requiring
> > > conforming IR
> > > >>>> producers and consumers to adhere to output types, but I haven't
> seen
> > > a
> > > >>>> strong rationale for keeping them optional (in the semantic
> sense, not
> > > >>> WRT
> > > >>>> any particular serialization format's representation of
> optionality).
> > > >>>>
> > > >>>> I think a design that includes unambiguous semantics for output
> types
> > > (a
> > > >>>> consumer must produce a value of the requested type or it's an
> > > >>>> error/non-conforming) is simpler to reason about for producers,
> and
> > > >>>> provides a strong guarantee for end users (humans or machines
> > > >>> constructing
> > > >>>> IR from an API and expecting the thing they ask for back from the
> IR
> > > >>>> consumer).
> > > >>>>
> > > >>>> 2. Flexibility
> > > >>>>
> > > >>>> The current PR is currently unable to support what I think are
> killer
> > > >>>> features of the IR: custom operators (relational or column) and
> UDFs.
> > > In
> > > >>> my
> > > >>>> mind, on top of the generalized compute description that the IR
> > > offers,
> > > >>> the
> > > >>>> ability for producers and consumers of IR to extend the IR without
> > > >>> needing
> > > >>>> to modify Arrow or depend on anything except the format is itself
> > > >>> something
> > > >>>> that is necessary to gain adoption.
> > > >>>>
> > > >>>> Developers will need to build custom relational operators (e.g.,
> > > scans of
> > > >>>> backends that don't exist anywhere for which a user has code to
> > > >>> implement)
> > > >>>> and custom functions (anything operating on a column that doesn't
> > > already
> > > >>>> exist, really). Furthermore, I think we can actually drive
> building an
> > > >>>> Arrow consumer using the same API that an end user would use to
> extend
> > > >>> the
> > > >>>> IR.
> > > >>>>
> > > >>>> 3. Window Functions
> > > >>>>
> > > >>>> Window functions are, I think, an important part of the IR value
> > > >>>> proposition, as they are one of the more complex operators in
> > > databases.
> > > >>> I
> > > >>>> think we need to have something in the initial IR proposal to
> support
> > > >>> these
> > > >>>> operations.
> > > >>>>
> > > >>>> 4. Non relational Joins
> > > >>>>
> > > >>>> Things like as-of join and window join operators aren't yet
> fleshed
> > > out
> > > >>> in
> > > >>>> the IR, and I'm not sure they should be in scope for the initial
> > > >>> prototype.
> > > >>>> I think once we settle on a design, we can work the design of
> these
> > > >>>> particular operators out during the initial prototype. I think the
> > > >>>> specification of these operators should basically be PR #2 after
> the
> > > >>>> initial design lands.
> > > >>>>
> > > >>>> # Order of Work
> > > >>>>
> > > >>>> 1. Nail down the design. Anything else is a non-starter.
> > > >>>>
> > > >>>> 2. Prototype an IR producer using Ibis
> > > >>>>
> > > >>>> Ibis is IMO a good candidate for a first IR producer as it has a
> > > number
> > > >>> of
> > > >>>> desirable properties that make prototyping faster and allow for
> us to
> > > >>>> refine the design of the IR as needed based on how the
> implementation
> > > >>> goes:
> > > >>>> * It's written in Python so it has native support for nearly all
> of
> > > >>>> flatbuffers' features without having to creating bindings to C++.
> > > >>>> * There's already a set of rules for type checking, as well as
> APIs
> > > for
> > > >>>> constructing expression trees, which means we don't need to worry
> > > about
> > > >>>> building a type checker for the prototype.
> > > >>>>
> > > >>>> 3. Prototype an IR consumer in C++
> > > >>>>
> > > >>>> I think in parallel to the producer prototype we can further
> inform
> > > the
> > > >>>> design from the consumer side by prototyping an IR consumer in
> C++ . I
> > > >>> know
> > > >>>> Ben Kietzman has expressed interest in working on this.
> > > >>>>
> > > >>>> Very interested to hear others' thoughts.
> > > >>>>
> > > >>>> -Phillip
> > > >>>>
> > > >>>> On Tue, Aug 10, 2021 at 10:56 AM Wes McKinney <
> wesmckinn@gmail.com>
> > > >>> wrote:
> > > >>>>
> > > >>>>> Thank you for all the feedback and comments on the document. I'm
> on
> > > >>>>> vacation this week, so I'm delayed responding to everything, but
> I
> > > >>>>> will get to it as quickly as I can. I will be at VLDB in
> Copenhagen
> > > >>>>> next week if anyone would like to chat in person about it, and
> we can
> > > >>>>> relay the content of any discussions back to the
> document/PR/e-mail
> > > >>>>> thread.
> > > >>>>>
> > > >>>>> I know that Phillip Cloud expressed interest in working on the
> PR and
> > > >>>>> helping work through many of the details, so I'm glad to have the
> > > >>>>> help. If there are others who would like to work on the PR or dig
> > > into
> > > >>>>> the details, please let me know. We might need to figure out how
> to
> > > >>>>> accommodate "many cooks" by setting up the ComputeIR project
> > > somewhere
> > > >>>>> separate from the format/ directory to permit it to exist in a
> > > >>>>> Work-In-Progress status for a period of time until we work
> through
> > > the
> > > >>>>> various details and design concerns.
> > > >>>>>
> > > >>>>> Re Julian's comment
> > > >>>>>
> > > >>>>>> The biggest surprise is that this language does full relational
> > > >>>>> operations. I was expecting that it would do fragments of the
> > > >>> operations.
> > > >>>>>
> > > >>>>> There's a related but different (yet still interesting and
> worthy of
> > > >>>>> analysis) problem of creating an "engine language" that describes
> > > more
> > > >>>>> mechanically the constituent parts of implementing the relational
> > > >>>>> operators. To create a functional computation language with
> concrete
> > > >>>>> Arrow data structures as a top-level primitive sounds like an
> > > >>>>> interesting research area where I could see something developing
> > > >>>>> eventually.
> > > >>>>>
> > > >>>>> The main problem I'm interested in solving right now is enabling
> > > front
> > > >>>>> ends that have sufficient understanding of relational algebra and
> > > data
> > > >>>>> frame operations to talk to engines without having to go
> backwards
> > > >>>>> from their logical query plans to SQL. So as mentioned in the
> > > >>>>> document, being able to faithfully carry the relational operator
> node
> > > >>>>> information generated by Calcite or Ibis or another system would
> be
> > > >>>>> super useful. Defining the semantics of various kinds of
> user-defined
> > > >>>>> functions would also be helpful to standardize the
> > > >>>>> engine-to-user-language UDF/extension interface.
> > > >>>>>
> > > >>>>> On Tue, Aug 10, 2021 at 2:36 PM Dimitri Vorona <
> alendit@gmail.com>
> > > >>>> wrote:
> > > >>>>>>
> > > >>>>>> Hi Wes,
> > > >>>>>>
> > > >>>>>> cool initiative! Reminded me of "Building Advanced SQL Analytics
> > > From
> > > >>>>>> Low-Level Plan Operators" from SIGMOD 2021 (
> > > >>>>>> http://db.in.tum.de/~kohn/papers/lolepops-sigmod21.pdf) which
> > > >>>> proposes a
> > > >>>>>> set of building block for advanced aggregation.
> > > >>>>>>
> > > >>>>>> Cheers,
> > > >>>>>> Dimitri.
> > > >>>>>>
> > > >>>>>> On Thu, Aug 5, 2021 at 7:59 PM Julian Hyde <
> jhyde.apache@gmail.com>
> > > >>>>> wrote:
> > > >>>>>>
> > > >>>>>>> Wes,
> > > >>>>>>>
> > > >>>>>>> Thanks for this. I’ve added comments to the doc and to the PR.
> > > >>>>>>>
> > > >>>>>>> The biggest surprise is that this language does full relational
> > > >>>>>>> operations. I was expecting that it would do fragments of the
> > > >>>>> operations.
> > > >>>>>>> Consider join. A distributed hybrid hash join needs to
> partition
> > > >>> rows
> > > >>>>> into
> > > >>>>>>> output buffers based on a hash key, build hash tables, probe
> into
> > > >>>> hash
> > > >>>>>>> tables, scan hash tables for untouched “outer”rows, and so
> forth.
> > > >>>>>>>
> > > >>>>>>> I see Arrow’s compute as delivering each of those operations,
> > > >>> working
> > > >>>>> on
> > > >>>>>>> perhaps a batch at a time, or a sequence of batches, with some
> > > >>> other
> > > >>>>> system
> > > >>>>>>> coordinating those tasks. So I would expect to see Arrow’s
> compute
> > > >>>>> language
> > > >>>>>>> mainly operating on batches rather than a table abstraction.
> > > >>>>>>>
> > > >>>>>>> Julian
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>>> On Aug 2, 2021, at 5:16 PM, Wes McKinney <wesmckinn@gmail.com
> >
> > > >>>>> wrote:
> > > >>>>>>>>
> > > >>>>>>>> hi folks,
> > > >>>>>>>>
> > > >>>>>>>> This idea came up in passing in the past -- given that there
> are
> > > >>>>>>>> multiple independent efforts to develop Arrow-native query
> > > >>> engines
> > > >>>>>>>> (and surely many more to come), it seems like it would be
> > > >>> valuable
> > > >>>> to
> > > >>>>>>>> have a way to enable user languages (like Java, Python, R, or
> > > >>> Rust,
> > > >>>>>>>> for example) to communicate with backend computing engines
> (like
> > > >>>>>>>> DataFusion, or new computing capabilities being built in the
> > > >>> Arrow
> > > >>>>> C++
> > > >>>>>>>> library) in a fashion that is "lower-level" than SQL and
> > > >>>> specialized
> > > >>>>>>>> to Arrow's type system. So rather than leaving it to a SQL
> > > >>> parser /
> > > >>>>>>>> analyzer framework to generate an expression tree of
> relational
> > > >>>>>>>> operators and then translate that to an Arrow-native
> > > >>>> compute-engine's
> > > >>>>>>>> internal grammer, a user framework could provide the desired
> > > >>>>>>>> Arrow-native expression tree / data manipulations directly and
> > > >>> skip
> > > >>>>>>>> the SQL altogether.
> > > >>>>>>>>
> > > >>>>>>>> The idea of creating a "serialized intermediate representation
> > > >>>> (IR)"
> > > >>>>>>>> for Arrow compute operations would be to serve use cases large
> > > >>> and
> > > >>>>>>>> small -- from the most complex TPC-* or time series database
> > > >>> query
> > > >>>> to
> > > >>>>>>>> the most simple array predicate/filter sent with an RPC
> request
> > > >>>> using
> > > >>>>>>>> Arrow Flight. It is deliberately language- and
> engine-agnostic,
> > > >>>> with
> > > >>>>>>>> the only commonality across usages being the Arrow columnar
> > > >>> format
> > > >>>>>>>> (schemas and array types). This would be better than leaving
> it
> > > >>> to
> > > >>>>>>>> each application to develop its own bespoke expression
> > > >>>>> representations
> > > >>>>>>>> for its needs.
> > > >>>>>>>>
> > > >>>>>>>> I spent a while thinking about this and wrote up a brain dump
> RFC
> > > >>>>>>>> document [1] and accompanying pull request [2] that makes the
> > > >>>>> possibly
> > > >>>>>>>> controversial choice of using Flatbuffers to represent the
> > > >>>> serialized
> > > >>>>>>>> IR. I discuss the rationale for the choice of Flatbuffers in
> the
> > > >>>> RFC
> > > >>>>>>>> document. This PR is obviously deficient in many regards
> > > >>>> (incomplete,
> > > >>>>>>>> hacky, or unclear in places), and will need some help from
> others
> > > >>>> to
> > > >>>>>>>> flesh out. I suspect that actually implementing the IR will be
> > > >>>>>>>> necessary to work out many of the low-level details.
> > > >>>>>>>>
> > > >>>>>>>> Note that this IR is intended to be more of a "superset"
> project
> > > >>>> than
> > > >>>>>>>> a "lowest common denominator". So there may be things that it
> > > >>>>> includes
> > > >>>>>>>> which are only available in some engines (e.g. engines that
> have
> > > >>>>>>>> specialized handling of time series data).
> > > >>>>>>>>
> > > >>>>>>>> As some of my personal motivation for the project, concurrent
> > > >>> with
> > > >>>>> the
> > > >>>>>>>> genesis of Apache Arrow, I started a Python project called
> Ibis
> > > >>> [3]
> > > >>>>>>>> (which is similar to R's dplyr project) which serves as a
> "Python
> > > >>>>>>>> analytical query IR builder" that is capable of generating
> most
> > > >>> of
> > > >>>>> the
> > > >>>>>>>> SQL standard, targeting many different SQL dialects and other
> > > >>>>> backends
> > > >>>>>>>> (like pandas). Microsoft ([4]) and Google ([5]) have used this
> > > >>>>> library
> > > >>>>>>>> as a "many-SQL" middleware. As such, I would like to be able
> to
> > > >>>>>>>> translate between the in-memory "logical query" data
> structures
> > > >>> in
> > > >>>> a
> > > >>>>>>>> library like Ibis to a serialized format that can be executed
> by
> > > >>>> many
> > > >>>>>>>> different Arrow-native query engines. The expression
> primitives
> > > >>>>>>>> available in Ibis should serve as helpful test cases, too.
> > > >>>>>>>>
> > > >>>>>>>> I look forward to the community's comments on the RFC document
> > > >>> [1]
> > > >>>>> and
> > > >>>>>>>> pull request [2] -- I realize that arriving at consensus on a
> > > >>>> complex
> > > >>>>>>>> and ambitious project like this can be challenging so I
> recommend
> > > >>>>>>>> spending time on the "non-goals" section in the RFC and ask
> > > >>>> questions
> > > >>>>>>>> if you are unclear about the scope of what problems this is
> > > >>> trying
> > > >>>> to
> > > >>>>>>>> solve. I would be happy to give Edit access on the RFC
> document
> > > >>> to
> > > >>>>>>>> others and would consider ideas about how to move forward with
> > > >>>>>>>> something that is able to be implemented by different Arrow
> > > >>>> libraries
> > > >>>>>>>> in the reasonably near future.
> > > >>>>>>>>
> > > >>>>>>>> Thanks,
> > > >>>>>>>> Wes
> > > >>>>>>>>
> > > >>>>>>>> [1]:
> > > >>>>>>>
> > > >>>>>
> > > >>>>
> > > >>>
> > >
> https://docs.google.com/document/d/1C_XVOG7iFkl6cgWWMyzUoIjfKt-X2UxqagPJrla0bAE/edit#
> > > >>>>>>>> [2]: https://github.com/apache/arrow/pull/10856
> > > >>>>>>>> [3]: https://ibis-project.org/
> > > >>>>>>>> [4]: http://cidrdb.org/cidr2021/papers/cidr2021_paper08.pdf
> > > >>>>>>>> [5]:
> > > >>>>>>>
> > > >>>>>
> > > >>>>
> > > >>>
> > >
> https://cloud.google.com/blog/products/databases/automate-data-validation-with-dvt
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>
> > > >>>>
> > > >>>
> > >
> > >
>

Re: [DISCUSS] Developing an "Arrow Compute IR [Intermediate Representation]" to decouple language front ends from Arrow-native compute engines

Posted by Wes McKinney <we...@gmail.com>.

Looking at Ben's alternate PR [1], having an IR that leans heavily on
memory references to an out-of-band data sidecar seems like an
approach that would substantially ratchet up the implementation
complexity as producing the IR would then have the level of complexity
of producing the Arrow IPC format — when producing the "root" Plan
message, you must accumulate a list of the dependent serialized
submessages, placing the appropriate Buffer memory offset in the Plan
message, like we do when producing the RecordBatch.buffers field. This
seems complicated to me as you must devise a custom binary protocol to
concatenate the serialized Plan and the messages it depends on into a
single binary payload

<ROOT PLAN>
<padding>
<Buffer 0>
<padding>
<Buffer 1>
<padding>
...
<Buffer N - 1>

(one purpose of FlatBufferBuilder is to spare you having to do this
yourself — some reasons we do it for the Arrow IPC format is because
appending Arrow memory buffers directly to a FlatBufferBuilder would
be inefficient — internal realloc calls — and Flatbuffers are limited
to 2GB. Neither of these things are problems here)

In general, I believe the output created by an IR producer should be a
single serialized object without any out-of-band data sidecar — this
is much simpler for implementers and we can still provide an "escape
hatch" for user-defined operators and functions where the required
function/operator is passed opaquely as an embedded binary data.

The serialization format (whether it is Flatbuffers or JSON, or
something else) should allow for data memoization, so if there is a
user-defined operator/function, or a relation that is used multiple
times throughout the query (potentially with a large schema), then we
should ensure that the data need not be duplicated in the
serialization format unnecessarily — in Flatbuffers, you can achieve
this by reusing offsets, but we could devise the data structures to
make the memoization of "expensive" objects more explicit.

I additionally think that it is important to provide as much built-in
support for "canonical" operators/functions (such as the ones
implemented commonly by SQL engines) as possible, and to liberally
expand the catalogue of "built-in" capabilities. I would still prefer
to have large unions/enums of built-in operators/functions and to
expand those unions/enums to accommodate new things when it is
demonstrated that there is a need to standardize things between
producers/consumers.

One of the beneficial properties of the Union/Enum approach for the
operator/function catalogues, is that when there are additions to
those enums, the generated Flatbuffers files will cause many language
compilers to warn or error on unhandled enum cases. If all
function/operator names are strings, then you are essentially
reimplementing the functionality provided by enums by hand. I
initially used strings for all function references in my original
prototype, but I now think that using an enum for "built-ins" would be
superior (because of the code-generated enum interface) and not a
premature optimization.

[1]: https://github.com/apache/arrow/pull/10934

On Fri, Aug 13, 2021 at 11:26 PM Phillip Cloud <cp...@gmail.com> wrote:
>
> Hey all,
>
> Just wanted to give an update on the effort here.
>
> Ben Kietzman has created an alternative proposal to the initial design [1].
> It largely overlaps with the original, but differs in a few important ways:
>
> * A big focus of the design is on flexibility, allowing producers,
> consumers and ultimately end users of those systems the ability to define
> custom operations in the graph.
> * There are very few predefined relational operations (project, filter,
> join and a handful of others).
> * There are only 3 types of value expressions: literals, field references,
> and function calls.
> * The model of evaluation is one that requires a final sink node, to
> indicate where the record batches from child relations end up (a file, a
> socket, an in-memory buffer, etc).
>
> I've added notes [2] to the original Google doc (under the Alternative
> Design Notes subheading), and a few pseudocode examples.
> Unfortunately, these went out of date as soon as Ben pushed the PR [3], so
> I need to update those to reflect his changes. Regardless,
> the design is broadly the same, so it should still give a good indication
> of the details of the design.
>
> There are a decent number of review comments on the original PR that I plan
> to port over where they are still relevant.
> I also plan on adding support for window functions either tonight or on
> Monday.
>
> Please review this design at your earliest convenience. Since there's a
> fairly concrete set of types in flatbuffers that
> we can look at, ideally we can center discussion around the details in the
> PR.
>
> Thanks!
>
> [1]: https://github.com/apache/arrow/pull/10856
> [2]:
> https://docs.google.com/document/d/1C_XVOG7iFkl6cgWWMyzUoIjfKt-X2UxqagPJrla0bAE/edit#heading=h.4tfbbtaqzu13
> [3]: https://github.com/apache/arrow/pull/10934
>
> On Thu, Aug 12, 2021 at 3:55 PM Julian Hyde <jh...@gmail.com> wrote:
>
> > > Wes wrote:
> > >
> > > Supporting this kind of intra-application engine
> > > heterogeneity is one of the motivations for the project.
> >
> > +1
> >
> > The data format is the natural interface between tasks. (Defining “task”
> > here as “something that is programmed using the IR”.) That is Arrow’s
> > strength.
> >
> > So I think the IR should describe what each task should do, and tasks
> > should be fairly small. Not whole relational operators, operating on whole
> > tables, but pieces of relational operators, operating on batches or
> > sequences of batches.
> >
> > Elsethread, someone mentioned the LoLePop concept and the Kohn/Leis/Neuman
> > paper [1]. The LoLePop concept sounds good for our purposes.
> >
> > Julian
> >
> > [1] https://db.in.tum.de/~kohn/papers/lolepops-sigmod21.pdf
> >
> >
> > > On Aug 12, 2021, at 5:19 AM, Wes McKinney <we...@gmail.com> wrote:
> > >
> > > On Wed, Aug 11, 2021 at 11:22 PM Phillip Cloud <cp...@gmail.com>
> > wrote:
> > >>
> > >> On Wed, Aug 11, 2021 at 4:48 PM Jorge Cardoso Leitão <
> > >> jorgecarleitao@gmail.com> wrote:
> > >>
> > >>> Couple of questions
> > >>>
> > >>> 1. Is the goal that IRs have equal semantics, i.e. given (IR,data), the
> > >>> operation "(IR,data) - engine -> result" MUST be the same for all
> > "engine"?
> > >>>
> > >>
> > >> I think that might be a non-starter for mundane reasons: there's
> > probably
> > >> at least two engines
> > >> that disagree on the result of sum(x) where x is a floating point
> > column.
> > >>
> > >>
> > >>> 2. if yes, imo we may need to worry about:
> > >>> * a definition of equality that implementations agree on.
> > >>> * agreement over what the semantics look like. For example, do we use
> > >>> kleene logic for AND and OR?
> > >>>
> > >>
> > >> WRT Kleene logic my thoughts are that the IR should support both Kleene
> > and
> > >> non-Kleene,
> > >> and producers can choose their desired semantics.
> > >>
> > >> Ibis for example, would override the `&` operator in `a & b` to produce
> > >> `KleeneAnd(Column(a), Column(b))`.
> > >>
> > >>
> > >>>
> > >>> To try to understand the gist, let's pick an aggregated count over a
> > >>> column: engines often do partial counts over partitions followed by a
> > final
> > >>> "sum" over the partial counts. Is the idea that the query engine would
> > >>> communicate with the compute engine via 2 IRs where one is "count me
> > these"
> > >>> the other is "sum me these"?
> > >>>
> > >>> Best,
> > >>> Jorge
> > >>>
> > >>
> > >> Not in its current incarnation.
> > >>
> > >> The idea is that the IR producer communicates a desire to count(x) to a
> > >> consumer, and  it's up to the consumer to figure out how to turn that
> > count
> > >> into something that makes sense for itself. In your example that's a
> > series
> > >> of partial counts followed by a sum.
> > >>
> > >
> > > That said, I think there is a valid use case here where a system might
> > > make use of different engines to execute different (composable) layers
> > > of a complex query.
> > >
> > > For example:
> > >
> > > * suppose you want to scan and do predicate pushdown on an unusual
> > > data source that is only accessible from one particular engine but
> > > * you need to do some analytical operation with the scan results that
> > > is only supported by another engine
> > >
> > > You could decompose the query into two stages with an IR relational
> > > expression for each stage and use then the engines together to
> > > accomplish what you need (of course, you would need an orchestration
> > > layer to handle plumbing the query engine inputs and outputs together
> > > as Arrow streams). Supporting this kind of intra-application engine
> > > heterogeneity is one of the motivations for the project.
> > >
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>> On Wed, Aug 11, 2021 at 6:10 PM Phillip Cloud <cp...@gmail.com>
> > wrote:
> > >>>
> > >>>> Thanks Wes,
> > >>>>
> > >>>> Great to be back working on Arrow again and engaging with the
> > community.
> > >>> I
> > >>>> am really excited about this effort.
> > >>>>
> > >>>> I think there are a number of concerns I see as important to address
> > in
> > >>> the
> > >>>> compute IR proposal:
> > >>>>
> > >>>> 1. Requirement for output types.
> > >>>>
> > >>>> I think that so far there's been many reasons for requiring
> > conforming IR
> > >>>> producers and consumers to adhere to output types, but I haven't seen
> > a
> > >>>> strong rationale for keeping them optional (in the semantic sense, not
> > >>> WRT
> > >>>> any particular serialization format's representation of optionality).
> > >>>>
> > >>>> I think a design that includes unambiguous semantics for output types
> > (a
> > >>>> consumer must produce a value of the requested type or it's an
> > >>>> error/non-conforming) is simpler to reason about for producers, and
> > >>>> provides a strong guarantee for end users (humans or machines
> > >>> constructing
> > >>>> IR from an API and expecting the thing they ask for back from the IR
> > >>>> consumer).
> > >>>>
> > >>>> 2. Flexibility
> > >>>>
> > >>>> The current PR is currently unable to support what I think are killer
> > >>>> features of the IR: custom operators (relational or column) and UDFs.
> > In
> > >>> my
> > >>>> mind, on top of the generalized compute description that the IR
> > offers,
> > >>> the
> > >>>> ability for producers and consumers of IR to extend the IR without
> > >>> needing
> > >>>> to modify Arrow or depend on anything except the format is itself
> > >>> something
> > >>>> that is necessary to gain adoption.
> > >>>>
> > >>>> Developers will need to build custom relational operators (e.g.,
> > scans of
> > >>>> backends that don't exist anywhere for which a user has code to
> > >>> implement)
> > >>>> and custom functions (anything operating on a column that doesn't
> > already
> > >>>> exist, really). Furthermore, I think we can actually drive building an
> > >>>> Arrow consumer using the same API that an end user would use to extend
> > >>> the
> > >>>> IR.
> > >>>>
> > >>>> 3. Window Functions
> > >>>>
> > >>>> Window functions are, I think, an important part of the IR value
> > >>>> proposition, as they are one of the more complex operators in
> > databases.
> > >>> I
> > >>>> think we need to have something in the initial IR proposal to support
> > >>> these
> > >>>> operations.
> > >>>>
> > >>>> 4. Non relational Joins
> > >>>>
> > >>>> Things like as-of join and window join operators aren't yet fleshed
> > out
> > >>> in
> > >>>> the IR, and I'm not sure they should be in scope for the initial
> > >>> prototype.
> > >>>> I think once we settle on a design, we can work the design of these
> > >>>> particular operators out during the initial prototype. I think the
> > >>>> specification of these operators should basically be PR #2 after the
> > >>>> initial design lands.
> > >>>>
> > >>>> # Order of Work
> > >>>>
> > >>>> 1. Nail down the design. Anything else is a non-starter.
> > >>>>
> > >>>> 2. Prototype an IR producer using Ibis
> > >>>>
> > >>>> Ibis is IMO a good candidate for a first IR producer as it has a
> > number
> > >>> of
> > >>>> desirable properties that make prototyping faster and allow for us to
> > >>>> refine the design of the IR as needed based on how the implementation
> > >>> goes:
> > >>>> * It's written in Python so it has native support for nearly all of
> > >>>> flatbuffers' features without having to creating bindings to C++.
> > >>>> * There's already a set of rules for type checking, as well as APIs
> > for
> > >>>> constructing expression trees, which means we don't need to worry
> > about
> > >>>> building a type checker for the prototype.
> > >>>>
> > >>>> 3. Prototype an IR consumer in C++
> > >>>>
> > >>>> I think in parallel to the producer prototype we can further inform
> > the
> > >>>> design from the consumer side by prototyping an IR consumer in C++ . I
> > >>> know
> > >>>> Ben Kietzman has expressed interest in working on this.
> > >>>>
> > >>>> Very interested to hear others' thoughts.
> > >>>>
> > >>>> -Phillip
> > >>>>
> > >>>> On Tue, Aug 10, 2021 at 10:56 AM Wes McKinney <we...@gmail.com>
> > >>> wrote:
> > >>>>
> > >>>>> Thank you for all the feedback and comments on the document. I'm on
> > >>>>> vacation this week, so I'm delayed responding to everything, but I
> > >>>>> will get to it as quickly as I can. I will be at VLDB in Copenhagen
> > >>>>> next week if anyone would like to chat in person about it, and we can
> > >>>>> relay the content of any discussions back to the document/PR/e-mail
> > >>>>> thread.
> > >>>>>
> > >>>>> I know that Phillip Cloud expressed interest in working on the PR and
> > >>>>> helping work through many of the details, so I'm glad to have the
> > >>>>> help. If there are others who would like to work on the PR or dig
> > into
> > >>>>> the details, please let me know. We might need to figure out how to
> > >>>>> accommodate "many cooks" by setting up the ComputeIR project
> > somewhere
> > >>>>> separate from the format/ directory to permit it to exist in a
> > >>>>> Work-In-Progress status for a period of time until we work through
> > the
> > >>>>> various details and design concerns.
> > >>>>>
> > >>>>> Re Julian's comment
> > >>>>>
> > >>>>>> The biggest surprise is that this language does full relational
> > >>>>> operations. I was expecting that it would do fragments of the
> > >>> operations.
> > >>>>>
> > >>>>> There's a related but different (yet still interesting and worthy of
> > >>>>> analysis) problem of creating an "engine language" that describes
> > more
> > >>>>> mechanically the constituent parts of implementing the relational
> > >>>>> operators. To create a functional computation language with concrete
> > >>>>> Arrow data structures as a top-level primitive sounds like an
> > >>>>> interesting research area where I could see something developing
> > >>>>> eventually.
> > >>>>>
> > >>>>> The main problem I'm interested in solving right now is enabling
> > front
> > >>>>> ends that have sufficient understanding of relational algebra and
> > data
> > >>>>> frame operations to talk to engines without having to go backwards
> > >>>>> from their logical query plans to SQL. So as mentioned in the
> > >>>>> document, being able to faithfully carry the relational operator node
> > >>>>> information generated by Calcite or Ibis or another system would be
> > >>>>> super useful. Defining the semantics of various kinds of user-defined
> > >>>>> functions would also be helpful to standardize the
> > >>>>> engine-to-user-language UDF/extension interface.
> > >>>>>
> > >>>>> On Tue, Aug 10, 2021 at 2:36 PM Dimitri Vorona <al...@gmail.com>
> > >>>> wrote:
> > >>>>>>
> > >>>>>> Hi Wes,
> > >>>>>>
> > >>>>>> cool initiative! Reminded me of "Building Advanced SQL Analytics
> > From
> > >>>>>> Low-Level Plan Operators" from SIGMOD 2021 (
> > >>>>>> http://db.in.tum.de/~kohn/papers/lolepops-sigmod21.pdf) which
> > >>>> proposes a
> > >>>>>> set of building block for advanced aggregation.
> > >>>>>>
> > >>>>>> Cheers,
> > >>>>>> Dimitri.
> > >>>>>>
> > >>>>>> On Thu, Aug 5, 2021 at 7:59 PM Julian Hyde <jh...@gmail.com>
> > >>>>> wrote:
> > >>>>>>
> > >>>>>>> Wes,
> > >>>>>>>
> > >>>>>>> Thanks for this. I’ve added comments to the doc and to the PR.
> > >>>>>>>
> > >>>>>>> The biggest surprise is that this language does full relational
> > >>>>>>> operations. I was expecting that it would do fragments of the
> > >>>>> operations.
> > >>>>>>> Consider join. A distributed hybrid hash join needs to partition
> > >>> rows
> > >>>>> into
> > >>>>>>> output buffers based on a hash key, build hash tables, probe into
> > >>>> hash
> > >>>>>>> tables, scan hash tables for untouched “outer”rows, and so forth.
> > >>>>>>>
> > >>>>>>> I see Arrow’s compute as delivering each of those operations,
> > >>> working
> > >>>>> on
> > >>>>>>> perhaps a batch at a time, or a sequence of batches, with some
> > >>> other
> > >>>>> system
> > >>>>>>> coordinating those tasks. So I would expect to see Arrow’s compute
> > >>>>> language
> > >>>>>>> mainly operating on batches rather than a table abstraction.
> > >>>>>>>
> > >>>>>>> Julian
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>> On Aug 2, 2021, at 5:16 PM, Wes McKinney <we...@gmail.com>
> > >>>>> wrote:
> > >>>>>>>>
> > >>>>>>>> hi folks,
> > >>>>>>>>
> > >>>>>>>> This idea came up in passing in the past -- given that there are
> > >>>>>>>> multiple independent efforts to develop Arrow-native query
> > >>> engines
> > >>>>>>>> (and surely many more to come), it seems like it would be
> > >>> valuable
> > >>>> to
> > >>>>>>>> have a way to enable user languages (like Java, Python, R, or
> > >>> Rust,
> > >>>>>>>> for example) to communicate with backend computing engines (like
> > >>>>>>>> DataFusion, or new computing capabilities being built in the
> > >>> Arrow
> > >>>>> C++
> > >>>>>>>> library) in a fashion that is "lower-level" than SQL and
> > >>>> specialized
> > >>>>>>>> to Arrow's type system. So rather than leaving it to a SQL
> > >>> parser /
> > >>>>>>>> analyzer framework to generate an expression tree of relational
> > >>>>>>>> operators and then translate that to an Arrow-native
> > >>>> compute-engine's
> > >>>>>>>> internal grammer, a user framework could provide the desired
> > >>>>>>>> Arrow-native expression tree / data manipulations directly and
> > >>> skip
> > >>>>>>>> the SQL altogether.
> > >>>>>>>>
> > >>>>>>>> The idea of creating a "serialized intermediate representation
> > >>>> (IR)"
> > >>>>>>>> for Arrow compute operations would be to serve use cases large
> > >>> and
> > >>>>>>>> small -- from the most complex TPC-* or time series database
> > >>> query
> > >>>> to
> > >>>>>>>> the most simple array predicate/filter sent with an RPC request
> > >>>> using
> > >>>>>>>> Arrow Flight. It is deliberately language- and engine-agnostic,
> > >>>> with
> > >>>>>>>> the only commonality across usages being the Arrow columnar
> > >>> format
> > >>>>>>>> (schemas and array types). This would be better than leaving it
> > >>> to
> > >>>>>>>> each application to develop its own bespoke expression
> > >>>>> representations
> > >>>>>>>> for its needs.
> > >>>>>>>>
> > >>>>>>>> I spent a while thinking about this and wrote up a brain dump RFC
> > >>>>>>>> document [1] and accompanying pull request [2] that makes the
> > >>>>> possibly
> > >>>>>>>> controversial choice of using Flatbuffers to represent the
> > >>>> serialized
> > >>>>>>>> IR. I discuss the rationale for the choice of Flatbuffers in the
> > >>>> RFC
> > >>>>>>>> document. This PR is obviously deficient in many regards
> > >>>> (incomplete,
> > >>>>>>>> hacky, or unclear in places), and will need some help from others
> > >>>> to
> > >>>>>>>> flesh out. I suspect that actually implementing the IR will be
> > >>>>>>>> necessary to work out many of the low-level details.
> > >>>>>>>>
> > >>>>>>>> Note that this IR is intended to be more of a "superset" project
> > >>>> than
> > >>>>>>>> a "lowest common denominator". So there may be things that it
> > >>>>> includes
> > >>>>>>>> which are only available in some engines (e.g. engines that have
> > >>>>>>>> specialized handling of time series data).
> > >>>>>>>>
> > >>>>>>>> As some of my personal motivation for the project, concurrent
> > >>> with
> > >>>>> the
> > >>>>>>>> genesis of Apache Arrow, I started a Python project called Ibis
> > >>> [3]
> > >>>>>>>> (which is similar to R's dplyr project) which serves as a "Python
> > >>>>>>>> analytical query IR builder" that is capable of generating most
> > >>> of
> > >>>>> the
> > >>>>>>>> SQL standard, targeting many different SQL dialects and other
> > >>>>> backends
> > >>>>>>>> (like pandas). Microsoft ([4]) and Google ([5]) have used this
> > >>>>> library
> > >>>>>>>> as a "many-SQL" middleware. As such, I would like to be able to
> > >>>>>>>> translate between the in-memory "logical query" data structures
> > >>> in
> > >>>> a
> > >>>>>>>> library like Ibis to a serialized format that can be executed by
> > >>>> many
> > >>>>>>>> different Arrow-native query engines. The expression primitives
> > >>>>>>>> available in Ibis should serve as helpful test cases, too.
> > >>>>>>>>
> > >>>>>>>> I look forward to the community's comments on the RFC document
> > >>> [1]
> > >>>>> and
> > >>>>>>>> pull request [2] -- I realize that arriving at consensus on a
> > >>>> complex
> > >>>>>>>> and ambitious project like this can be challenging so I recommend
> > >>>>>>>> spending time on the "non-goals" section in the RFC and ask
> > >>>> questions
> > >>>>>>>> if you are unclear about the scope of what problems this is
> > >>> trying
> > >>>> to
> > >>>>>>>> solve. I would be happy to give Edit access on the RFC document
> > >>> to
> > >>>>>>>> others and would consider ideas about how to move forward with
> > >>>>>>>> something that is able to be implemented by different Arrow
> > >>>> libraries
> > >>>>>>>> in the reasonably near future.
> > >>>>>>>>
> > >>>>>>>> Thanks,
> > >>>>>>>> Wes
> > >>>>>>>>
> > >>>>>>>> [1]:
> > >>>>>>>
> > >>>>>
> > >>>>
> > >>>
> > https://docs.google.com/document/d/1C_XVOG7iFkl6cgWWMyzUoIjfKt-X2UxqagPJrla0bAE/edit#
> > >>>>>>>> [2]: https://github.com/apache/arrow/pull/10856
> > >>>>>>>> [3]: https://ibis-project.org/
> > >>>>>>>> [4]: http://cidrdb.org/cidr2021/papers/cidr2021_paper08.pdf
> > >>>>>>>> [5]:
> > >>>>>>>
> > >>>>>
> > >>>>
> > >>>
> > https://cloud.google.com/blog/products/databases/automate-data-validation-with-dvt
> > >>>>>>>
> > >>>>>>>
> > >>>>>
> > >>>>
> > >>>
> >
> >

Re: [DISCUSS] Developing an "Arrow Compute IR [Intermediate Representation]" to decouple language front ends from Arrow-native compute engines

Posted by Phillip Cloud <cp...@gmail.com>.

Hey all,

Just wanted to give an update on the effort here.

Ben Kietzman has created an alternative proposal to the initial design [1].
It largely overlaps with the original, but differs in a few important ways:

* A big focus of the design is on flexibility, allowing producers,
consumers and ultimately end users of those systems the ability to define
custom operations in the graph.
* There are very few predefined relational operations (project, filter,
join and a handful of others).
* There are only 3 types of value expressions: literals, field references,
and function calls.
* The model of evaluation is one that requires a final sink node, to
indicate where the record batches from child relations end up (a file, a
socket, an in-memory buffer, etc).

I've added notes [2] to the original Google doc (under the Alternative
Design Notes subheading), and a few pseudocode examples.
Unfortunately, these went out of date as soon as Ben pushed the PR [3], so
I need to update those to reflect his changes. Regardless,
the design is broadly the same, so it should still give a good indication
of the details of the design.

There are a decent number of review comments on the original PR that I plan
to port over where they are still relevant.
I also plan on adding support for window functions either tonight or on
Monday.

Please review this design at your earliest convenience. Since there's a
fairly concrete set of types in flatbuffers that
we can look at, ideally we can center discussion around the details in the
PR.

Thanks!

[1]: https://github.com/apache/arrow/pull/10856
[2]:
https://docs.google.com/document/d/1C_XVOG7iFkl6cgWWMyzUoIjfKt-X2UxqagPJrla0bAE/edit#heading=h.4tfbbtaqzu13
[3]: https://github.com/apache/arrow/pull/10934

On Thu, Aug 12, 2021 at 3:55 PM Julian Hyde <jh...@gmail.com> wrote:

> > Wes wrote:
> >
> > Supporting this kind of intra-application engine
> > heterogeneity is one of the motivations for the project.
>
> +1
>
> The data format is the natural interface between tasks. (Defining “task”
> here as “something that is programmed using the IR”.) That is Arrow’s
> strength.
>
> So I think the IR should describe what each task should do, and tasks
> should be fairly small. Not whole relational operators, operating on whole
> tables, but pieces of relational operators, operating on batches or
> sequences of batches.
>
> Elsethread, someone mentioned the LoLePop concept and the Kohn/Leis/Neuman
> paper [1]. The LoLePop concept sounds good for our purposes.
>
> Julian
>
> [1] https://db.in.tum.de/~kohn/papers/lolepops-sigmod21.pdf
>
>
> > On Aug 12, 2021, at 5:19 AM, Wes McKinney <we...@gmail.com> wrote:
> >
> > On Wed, Aug 11, 2021 at 11:22 PM Phillip Cloud <cp...@gmail.com>
> wrote:
> >>
> >> On Wed, Aug 11, 2021 at 4:48 PM Jorge Cardoso Leitão <
> >> jorgecarleitao@gmail.com> wrote:
> >>
> >>> Couple of questions
> >>>
> >>> 1. Is the goal that IRs have equal semantics, i.e. given (IR,data), the
> >>> operation "(IR,data) - engine -> result" MUST be the same for all
> "engine"?
> >>>
> >>
> >> I think that might be a non-starter for mundane reasons: there's
> probably
> >> at least two engines
> >> that disagree on the result of sum(x) where x is a floating point
> column.
> >>
> >>
> >>> 2. if yes, imo we may need to worry about:
> >>> * a definition of equality that implementations agree on.
> >>> * agreement over what the semantics look like. For example, do we use
> >>> kleene logic for AND and OR?
> >>>
> >>
> >> WRT Kleene logic my thoughts are that the IR should support both Kleene
> and
> >> non-Kleene,
> >> and producers can choose their desired semantics.
> >>
> >> Ibis for example, would override the `&` operator in `a & b` to produce
> >> `KleeneAnd(Column(a), Column(b))`.
> >>
> >>
> >>>
> >>> To try to understand the gist, let's pick an aggregated count over a
> >>> column: engines often do partial counts over partitions followed by a
> final
> >>> "sum" over the partial counts. Is the idea that the query engine would
> >>> communicate with the compute engine via 2 IRs where one is "count me
> these"
> >>> the other is "sum me these"?
> >>>
> >>> Best,
> >>> Jorge
> >>>
> >>
> >> Not in its current incarnation.
> >>
> >> The idea is that the IR producer communicates a desire to count(x) to a
> >> consumer, and  it's up to the consumer to figure out how to turn that
> count
> >> into something that makes sense for itself. In your example that's a
> series
> >> of partial counts followed by a sum.
> >>
> >
> > That said, I think there is a valid use case here where a system might
> > make use of different engines to execute different (composable) layers
> > of a complex query.
> >
> > For example:
> >
> > * suppose you want to scan and do predicate pushdown on an unusual
> > data source that is only accessible from one particular engine but
> > * you need to do some analytical operation with the scan results that
> > is only supported by another engine
> >
> > You could decompose the query into two stages with an IR relational
> > expression for each stage and use then the engines together to
> > accomplish what you need (of course, you would need an orchestration
> > layer to handle plumbing the query engine inputs and outputs together
> > as Arrow streams). Supporting this kind of intra-application engine
> > heterogeneity is one of the motivations for the project.
> >
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> On Wed, Aug 11, 2021 at 6:10 PM Phillip Cloud <cp...@gmail.com>
> wrote:
> >>>
> >>>> Thanks Wes,
> >>>>
> >>>> Great to be back working on Arrow again and engaging with the
> community.
> >>> I
> >>>> am really excited about this effort.
> >>>>
> >>>> I think there are a number of concerns I see as important to address
> in
> >>> the
> >>>> compute IR proposal:
> >>>>
> >>>> 1. Requirement for output types.
> >>>>
> >>>> I think that so far there's been many reasons for requiring
> conforming IR
> >>>> producers and consumers to adhere to output types, but I haven't seen
> a
> >>>> strong rationale for keeping them optional (in the semantic sense, not
> >>> WRT
> >>>> any particular serialization format's representation of optionality).
> >>>>
> >>>> I think a design that includes unambiguous semantics for output types
> (a
> >>>> consumer must produce a value of the requested type or it's an
> >>>> error/non-conforming) is simpler to reason about for producers, and
> >>>> provides a strong guarantee for end users (humans or machines
> >>> constructing
> >>>> IR from an API and expecting the thing they ask for back from the IR
> >>>> consumer).
> >>>>
> >>>> 2. Flexibility
> >>>>
> >>>> The current PR is currently unable to support what I think are killer
> >>>> features of the IR: custom operators (relational or column) and UDFs.
> In
> >>> my
> >>>> mind, on top of the generalized compute description that the IR
> offers,
> >>> the
> >>>> ability for producers and consumers of IR to extend the IR without
> >>> needing
> >>>> to modify Arrow or depend on anything except the format is itself
> >>> something
> >>>> that is necessary to gain adoption.
> >>>>
> >>>> Developers will need to build custom relational operators (e.g.,
> scans of
> >>>> backends that don't exist anywhere for which a user has code to
> >>> implement)
> >>>> and custom functions (anything operating on a column that doesn't
> already
> >>>> exist, really). Furthermore, I think we can actually drive building an
> >>>> Arrow consumer using the same API that an end user would use to extend
> >>> the
> >>>> IR.
> >>>>
> >>>> 3. Window Functions
> >>>>
> >>>> Window functions are, I think, an important part of the IR value
> >>>> proposition, as they are one of the more complex operators in
> databases.
> >>> I
> >>>> think we need to have something in the initial IR proposal to support
> >>> these
> >>>> operations.
> >>>>
> >>>> 4. Non relational Joins
> >>>>
> >>>> Things like as-of join and window join operators aren't yet fleshed
> out
> >>> in
> >>>> the IR, and I'm not sure they should be in scope for the initial
> >>> prototype.
> >>>> I think once we settle on a design, we can work the design of these
> >>>> particular operators out during the initial prototype. I think the
> >>>> specification of these operators should basically be PR #2 after the
> >>>> initial design lands.
> >>>>
> >>>> # Order of Work
> >>>>
> >>>> 1. Nail down the design. Anything else is a non-starter.
> >>>>
> >>>> 2. Prototype an IR producer using Ibis
> >>>>
> >>>> Ibis is IMO a good candidate for a first IR producer as it has a
> number
> >>> of
> >>>> desirable properties that make prototyping faster and allow for us to
> >>>> refine the design of the IR as needed based on how the implementation
> >>> goes:
> >>>> * It's written in Python so it has native support for nearly all of
> >>>> flatbuffers' features without having to creating bindings to C++.
> >>>> * There's already a set of rules for type checking, as well as APIs
> for
> >>>> constructing expression trees, which means we don't need to worry
> about
> >>>> building a type checker for the prototype.
> >>>>
> >>>> 3. Prototype an IR consumer in C++
> >>>>
> >>>> I think in parallel to the producer prototype we can further inform
> the
> >>>> design from the consumer side by prototyping an IR consumer in C++ . I
> >>> know
> >>>> Ben Kietzman has expressed interest in working on this.
> >>>>
> >>>> Very interested to hear others' thoughts.
> >>>>
> >>>> -Phillip
> >>>>
> >>>> On Tue, Aug 10, 2021 at 10:56 AM Wes McKinney <we...@gmail.com>
> >>> wrote:
> >>>>
> >>>>> Thank you for all the feedback and comments on the document. I'm on
> >>>>> vacation this week, so I'm delayed responding to everything, but I
> >>>>> will get to it as quickly as I can. I will be at VLDB in Copenhagen
> >>>>> next week if anyone would like to chat in person about it, and we can
> >>>>> relay the content of any discussions back to the document/PR/e-mail
> >>>>> thread.
> >>>>>
> >>>>> I know that Phillip Cloud expressed interest in working on the PR and
> >>>>> helping work through many of the details, so I'm glad to have the
> >>>>> help. If there are others who would like to work on the PR or dig
> into
> >>>>> the details, please let me know. We might need to figure out how to
> >>>>> accommodate "many cooks" by setting up the ComputeIR project
> somewhere
> >>>>> separate from the format/ directory to permit it to exist in a
> >>>>> Work-In-Progress status for a period of time until we work through
> the
> >>>>> various details and design concerns.
> >>>>>
> >>>>> Re Julian's comment
> >>>>>
> >>>>>> The biggest surprise is that this language does full relational
> >>>>> operations. I was expecting that it would do fragments of the
> >>> operations.
> >>>>>
> >>>>> There's a related but different (yet still interesting and worthy of
> >>>>> analysis) problem of creating an "engine language" that describes
> more
> >>>>> mechanically the constituent parts of implementing the relational
> >>>>> operators. To create a functional computation language with concrete
> >>>>> Arrow data structures as a top-level primitive sounds like an
> >>>>> interesting research area where I could see something developing
> >>>>> eventually.
> >>>>>
> >>>>> The main problem I'm interested in solving right now is enabling
> front
> >>>>> ends that have sufficient understanding of relational algebra and
> data
> >>>>> frame operations to talk to engines without having to go backwards
> >>>>> from their logical query plans to SQL. So as mentioned in the
> >>>>> document, being able to faithfully carry the relational operator node
> >>>>> information generated by Calcite or Ibis or another system would be
> >>>>> super useful. Defining the semantics of various kinds of user-defined
> >>>>> functions would also be helpful to standardize the
> >>>>> engine-to-user-language UDF/extension interface.
> >>>>>
> >>>>> On Tue, Aug 10, 2021 at 2:36 PM Dimitri Vorona <al...@gmail.com>
> >>>> wrote:
> >>>>>>
> >>>>>> Hi Wes,
> >>>>>>
> >>>>>> cool initiative! Reminded me of "Building Advanced SQL Analytics
> From
> >>>>>> Low-Level Plan Operators" from SIGMOD 2021 (
> >>>>>> http://db.in.tum.de/~kohn/papers/lolepops-sigmod21.pdf) which
> >>>> proposes a
> >>>>>> set of building block for advanced aggregation.
> >>>>>>
> >>>>>> Cheers,
> >>>>>> Dimitri.
> >>>>>>
> >>>>>> On Thu, Aug 5, 2021 at 7:59 PM Julian Hyde <jh...@gmail.com>
> >>>>> wrote:
> >>>>>>
> >>>>>>> Wes,
> >>>>>>>
> >>>>>>> Thanks for this. I’ve added comments to the doc and to the PR.
> >>>>>>>
> >>>>>>> The biggest surprise is that this language does full relational
> >>>>>>> operations. I was expecting that it would do fragments of the
> >>>>> operations.
> >>>>>>> Consider join. A distributed hybrid hash join needs to partition
> >>> rows
> >>>>> into
> >>>>>>> output buffers based on a hash key, build hash tables, probe into
> >>>> hash
> >>>>>>> tables, scan hash tables for untouched “outer”rows, and so forth.
> >>>>>>>
> >>>>>>> I see Arrow’s compute as delivering each of those operations,
> >>> working
> >>>>> on
> >>>>>>> perhaps a batch at a time, or a sequence of batches, with some
> >>> other
> >>>>> system
> >>>>>>> coordinating those tasks. So I would expect to see Arrow’s compute
> >>>>> language
> >>>>>>> mainly operating on batches rather than a table abstraction.
> >>>>>>>
> >>>>>>> Julian
> >>>>>>>
> >>>>>>>
> >>>>>>>> On Aug 2, 2021, at 5:16 PM, Wes McKinney <we...@gmail.com>
> >>>>> wrote:
> >>>>>>>>
> >>>>>>>> hi folks,
> >>>>>>>>
> >>>>>>>> This idea came up in passing in the past -- given that there are
> >>>>>>>> multiple independent efforts to develop Arrow-native query
> >>> engines
> >>>>>>>> (and surely many more to come), it seems like it would be
> >>> valuable
> >>>> to
> >>>>>>>> have a way to enable user languages (like Java, Python, R, or
> >>> Rust,
> >>>>>>>> for example) to communicate with backend computing engines (like
> >>>>>>>> DataFusion, or new computing capabilities being built in the
> >>> Arrow
> >>>>> C++
> >>>>>>>> library) in a fashion that is "lower-level" than SQL and
> >>>> specialized
> >>>>>>>> to Arrow's type system. So rather than leaving it to a SQL
> >>> parser /
> >>>>>>>> analyzer framework to generate an expression tree of relational
> >>>>>>>> operators and then translate that to an Arrow-native
> >>>> compute-engine's
> >>>>>>>> internal grammer, a user framework could provide the desired
> >>>>>>>> Arrow-native expression tree / data manipulations directly and
> >>> skip
> >>>>>>>> the SQL altogether.
> >>>>>>>>
> >>>>>>>> The idea of creating a "serialized intermediate representation
> >>>> (IR)"
> >>>>>>>> for Arrow compute operations would be to serve use cases large
> >>> and
> >>>>>>>> small -- from the most complex TPC-* or time series database
> >>> query
> >>>> to
> >>>>>>>> the most simple array predicate/filter sent with an RPC request
> >>>> using
> >>>>>>>> Arrow Flight. It is deliberately language- and engine-agnostic,
> >>>> with
> >>>>>>>> the only commonality across usages being the Arrow columnar
> >>> format
> >>>>>>>> (schemas and array types). This would be better than leaving it
> >>> to
> >>>>>>>> each application to develop its own bespoke expression
> >>>>> representations
> >>>>>>>> for its needs.
> >>>>>>>>
> >>>>>>>> I spent a while thinking about this and wrote up a brain dump RFC
> >>>>>>>> document [1] and accompanying pull request [2] that makes the
> >>>>> possibly
> >>>>>>>> controversial choice of using Flatbuffers to represent the
> >>>> serialized
> >>>>>>>> IR. I discuss the rationale for the choice of Flatbuffers in the
> >>>> RFC
> >>>>>>>> document. This PR is obviously deficient in many regards
> >>>> (incomplete,
> >>>>>>>> hacky, or unclear in places), and will need some help from others
> >>>> to
> >>>>>>>> flesh out. I suspect that actually implementing the IR will be
> >>>>>>>> necessary to work out many of the low-level details.
> >>>>>>>>
> >>>>>>>> Note that this IR is intended to be more of a "superset" project
> >>>> than
> >>>>>>>> a "lowest common denominator". So there may be things that it
> >>>>> includes
> >>>>>>>> which are only available in some engines (e.g. engines that have
> >>>>>>>> specialized handling of time series data).
> >>>>>>>>
> >>>>>>>> As some of my personal motivation for the project, concurrent
> >>> with
> >>>>> the
> >>>>>>>> genesis of Apache Arrow, I started a Python project called Ibis
> >>> [3]
> >>>>>>>> (which is similar to R's dplyr project) which serves as a "Python
> >>>>>>>> analytical query IR builder" that is capable of generating most
> >>> of
> >>>>> the
> >>>>>>>> SQL standard, targeting many different SQL dialects and other
> >>>>> backends
> >>>>>>>> (like pandas). Microsoft ([4]) and Google ([5]) have used this
> >>>>> library
> >>>>>>>> as a "many-SQL" middleware. As such, I would like to be able to
> >>>>>>>> translate between the in-memory "logical query" data structures
> >>> in
> >>>> a
> >>>>>>>> library like Ibis to a serialized format that can be executed by
> >>>> many
> >>>>>>>> different Arrow-native query engines. The expression primitives
> >>>>>>>> available in Ibis should serve as helpful test cases, too.
> >>>>>>>>
> >>>>>>>> I look forward to the community's comments on the RFC document
> >>> [1]
> >>>>> and
> >>>>>>>> pull request [2] -- I realize that arriving at consensus on a
> >>>> complex
> >>>>>>>> and ambitious project like this can be challenging so I recommend
> >>>>>>>> spending time on the "non-goals" section in the RFC and ask
> >>>> questions
> >>>>>>>> if you are unclear about the scope of what problems this is
> >>> trying
> >>>> to
> >>>>>>>> solve. I would be happy to give Edit access on the RFC document
> >>> to
> >>>>>>>> others and would consider ideas about how to move forward with
> >>>>>>>> something that is able to be implemented by different Arrow
> >>>> libraries
> >>>>>>>> in the reasonably near future.
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> Wes
> >>>>>>>>
> >>>>>>>> [1]:
> >>>>>>>
> >>>>>
> >>>>
> >>>
> https://docs.google.com/document/d/1C_XVOG7iFkl6cgWWMyzUoIjfKt-X2UxqagPJrla0bAE/edit#
> >>>>>>>> [2]: https://github.com/apache/arrow/pull/10856
> >>>>>>>> [3]: https://ibis-project.org/
> >>>>>>>> [4]: http://cidrdb.org/cidr2021/papers/cidr2021_paper08.pdf
> >>>>>>>> [5]:
> >>>>>>>
> >>>>>
> >>>>
> >>>
> https://cloud.google.com/blog/products/databases/automate-data-validation-with-dvt
> >>>>>>>
> >>>>>>>
> >>>>>
> >>>>
> >>>
>
>

Re: [DISCUSS] Developing an "Arrow Compute IR [Intermediate Representation]" to decouple language front ends from Arrow-native compute engines

Posted by Julian Hyde <jh...@gmail.com>.

> Wes wrote:
>
> Supporting this kind of intra-application engine
> heterogeneity is one of the motivations for the project.

+1

The data format is the natural interface between tasks. (Defining “task” here as “something that is programmed using the IR”.) That is Arrow’s strength.

So I think the IR should describe what each task should do, and tasks should be fairly small. Not whole relational operators, operating on whole tables, but pieces of relational operators, operating on batches or sequences of batches. 

Elsethread, someone mentioned the LoLePop concept and the Kohn/Leis/Neuman paper [1]. The LoLePop concept sounds good for our purposes.

Julian

[1] https://db.in.tum.de/~kohn/papers/lolepops-sigmod21.pdf


> On Aug 12, 2021, at 5:19 AM, Wes McKinney <we...@gmail.com> wrote:
> 
> On Wed, Aug 11, 2021 at 11:22 PM Phillip Cloud <cp...@gmail.com> wrote:
>> 
>> On Wed, Aug 11, 2021 at 4:48 PM Jorge Cardoso Leitão <
>> jorgecarleitao@gmail.com> wrote:
>> 
>>> Couple of questions
>>> 
>>> 1. Is the goal that IRs have equal semantics, i.e. given (IR,data), the
>>> operation "(IR,data) - engine -> result" MUST be the same for all "engine"?
>>> 
>> 
>> I think that might be a non-starter for mundane reasons: there's probably
>> at least two engines
>> that disagree on the result of sum(x) where x is a floating point column.
>> 
>> 
>>> 2. if yes, imo we may need to worry about:
>>> * a definition of equality that implementations agree on.
>>> * agreement over what the semantics look like. For example, do we use
>>> kleene logic for AND and OR?
>>> 
>> 
>> WRT Kleene logic my thoughts are that the IR should support both Kleene and
>> non-Kleene,
>> and producers can choose their desired semantics.
>> 
>> Ibis for example, would override the `&` operator in `a & b` to produce
>> `KleeneAnd(Column(a), Column(b))`.
>> 
>> 
>>> 
>>> To try to understand the gist, let's pick an aggregated count over a
>>> column: engines often do partial counts over partitions followed by a final
>>> "sum" over the partial counts. Is the idea that the query engine would
>>> communicate with the compute engine via 2 IRs where one is "count me these"
>>> the other is "sum me these"?
>>> 
>>> Best,
>>> Jorge
>>> 
>> 
>> Not in its current incarnation.
>> 
>> The idea is that the IR producer communicates a desire to count(x) to a
>> consumer, and  it's up to the consumer to figure out how to turn that count
>> into something that makes sense for itself. In your example that's a series
>> of partial counts followed by a sum.
>> 
> 
> That said, I think there is a valid use case here where a system might
> make use of different engines to execute different (composable) layers
> of a complex query.
> 
> For example:
> 
> * suppose you want to scan and do predicate pushdown on an unusual
> data source that is only accessible from one particular engine but
> * you need to do some analytical operation with the scan results that
> is only supported by another engine
> 
> You could decompose the query into two stages with an IR relational
> expression for each stage and use then the engines together to
> accomplish what you need (of course, you would need an orchestration
> layer to handle plumbing the query engine inputs and outputs together
> as Arrow streams). Supporting this kind of intra-application engine
> heterogeneity is one of the motivations for the project.
> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On Wed, Aug 11, 2021 at 6:10 PM Phillip Cloud <cp...@gmail.com> wrote:
>>> 
>>>> Thanks Wes,
>>>> 
>>>> Great to be back working on Arrow again and engaging with the community.
>>> I
>>>> am really excited about this effort.
>>>> 
>>>> I think there are a number of concerns I see as important to address in
>>> the
>>>> compute IR proposal:
>>>> 
>>>> 1. Requirement for output types.
>>>> 
>>>> I think that so far there's been many reasons for requiring conforming IR
>>>> producers and consumers to adhere to output types, but I haven't seen a
>>>> strong rationale for keeping them optional (in the semantic sense, not
>>> WRT
>>>> any particular serialization format's representation of optionality).
>>>> 
>>>> I think a design that includes unambiguous semantics for output types (a
>>>> consumer must produce a value of the requested type or it's an
>>>> error/non-conforming) is simpler to reason about for producers, and
>>>> provides a strong guarantee for end users (humans or machines
>>> constructing
>>>> IR from an API and expecting the thing they ask for back from the IR
>>>> consumer).
>>>> 
>>>> 2. Flexibility
>>>> 
>>>> The current PR is currently unable to support what I think are killer
>>>> features of the IR: custom operators (relational or column) and UDFs. In
>>> my
>>>> mind, on top of the generalized compute description that the IR offers,
>>> the
>>>> ability for producers and consumers of IR to extend the IR without
>>> needing
>>>> to modify Arrow or depend on anything except the format is itself
>>> something
>>>> that is necessary to gain adoption.
>>>> 
>>>> Developers will need to build custom relational operators (e.g., scans of
>>>> backends that don't exist anywhere for which a user has code to
>>> implement)
>>>> and custom functions (anything operating on a column that doesn't already
>>>> exist, really). Furthermore, I think we can actually drive building an
>>>> Arrow consumer using the same API that an end user would use to extend
>>> the
>>>> IR.
>>>> 
>>>> 3. Window Functions
>>>> 
>>>> Window functions are, I think, an important part of the IR value
>>>> proposition, as they are one of the more complex operators in databases.
>>> I
>>>> think we need to have something in the initial IR proposal to support
>>> these
>>>> operations.
>>>> 
>>>> 4. Non relational Joins
>>>> 
>>>> Things like as-of join and window join operators aren't yet fleshed out
>>> in
>>>> the IR, and I'm not sure they should be in scope for the initial
>>> prototype.
>>>> I think once we settle on a design, we can work the design of these
>>>> particular operators out during the initial prototype. I think the
>>>> specification of these operators should basically be PR #2 after the
>>>> initial design lands.
>>>> 
>>>> # Order of Work
>>>> 
>>>> 1. Nail down the design. Anything else is a non-starter.
>>>> 
>>>> 2. Prototype an IR producer using Ibis
>>>> 
>>>> Ibis is IMO a good candidate for a first IR producer as it has a number
>>> of
>>>> desirable properties that make prototyping faster and allow for us to
>>>> refine the design of the IR as needed based on how the implementation
>>> goes:
>>>> * It's written in Python so it has native support for nearly all of
>>>> flatbuffers' features without having to creating bindings to C++.
>>>> * There's already a set of rules for type checking, as well as APIs for
>>>> constructing expression trees, which means we don't need to worry about
>>>> building a type checker for the prototype.
>>>> 
>>>> 3. Prototype an IR consumer in C++
>>>> 
>>>> I think in parallel to the producer prototype we can further inform the
>>>> design from the consumer side by prototyping an IR consumer in C++ . I
>>> know
>>>> Ben Kietzman has expressed interest in working on this.
>>>> 
>>>> Very interested to hear others' thoughts.
>>>> 
>>>> -Phillip
>>>> 
>>>> On Tue, Aug 10, 2021 at 10:56 AM Wes McKinney <we...@gmail.com>
>>> wrote:
>>>> 
>>>>> Thank you for all the feedback and comments on the document. I'm on
>>>>> vacation this week, so I'm delayed responding to everything, but I
>>>>> will get to it as quickly as I can. I will be at VLDB in Copenhagen
>>>>> next week if anyone would like to chat in person about it, and we can
>>>>> relay the content of any discussions back to the document/PR/e-mail
>>>>> thread.
>>>>> 
>>>>> I know that Phillip Cloud expressed interest in working on the PR and
>>>>> helping work through many of the details, so I'm glad to have the
>>>>> help. If there are others who would like to work on the PR or dig into
>>>>> the details, please let me know. We might need to figure out how to
>>>>> accommodate "many cooks" by setting up the ComputeIR project somewhere
>>>>> separate from the format/ directory to permit it to exist in a
>>>>> Work-In-Progress status for a period of time until we work through the
>>>>> various details and design concerns.
>>>>> 
>>>>> Re Julian's comment
>>>>> 
>>>>>> The biggest surprise is that this language does full relational
>>>>> operations. I was expecting that it would do fragments of the
>>> operations.
>>>>> 
>>>>> There's a related but different (yet still interesting and worthy of
>>>>> analysis) problem of creating an "engine language" that describes more
>>>>> mechanically the constituent parts of implementing the relational
>>>>> operators. To create a functional computation language with concrete
>>>>> Arrow data structures as a top-level primitive sounds like an
>>>>> interesting research area where I could see something developing
>>>>> eventually.
>>>>> 
>>>>> The main problem I'm interested in solving right now is enabling front
>>>>> ends that have sufficient understanding of relational algebra and data
>>>>> frame operations to talk to engines without having to go backwards
>>>>> from their logical query plans to SQL. So as mentioned in the
>>>>> document, being able to faithfully carry the relational operator node
>>>>> information generated by Calcite or Ibis or another system would be
>>>>> super useful. Defining the semantics of various kinds of user-defined
>>>>> functions would also be helpful to standardize the
>>>>> engine-to-user-language UDF/extension interface.
>>>>> 
>>>>> On Tue, Aug 10, 2021 at 2:36 PM Dimitri Vorona <al...@gmail.com>
>>>> wrote:
>>>>>> 
>>>>>> Hi Wes,
>>>>>> 
>>>>>> cool initiative! Reminded me of "Building Advanced SQL Analytics From
>>>>>> Low-Level Plan Operators" from SIGMOD 2021 (
>>>>>> http://db.in.tum.de/~kohn/papers/lolepops-sigmod21.pdf) which
>>>> proposes a
>>>>>> set of building block for advanced aggregation.
>>>>>> 
>>>>>> Cheers,
>>>>>> Dimitri.
>>>>>> 
>>>>>> On Thu, Aug 5, 2021 at 7:59 PM Julian Hyde <jh...@gmail.com>
>>>>> wrote:
>>>>>> 
>>>>>>> Wes,
>>>>>>> 
>>>>>>> Thanks for this. I’ve added comments to the doc and to the PR.
>>>>>>> 
>>>>>>> The biggest surprise is that this language does full relational
>>>>>>> operations. I was expecting that it would do fragments of the
>>>>> operations.
>>>>>>> Consider join. A distributed hybrid hash join needs to partition
>>> rows
>>>>> into
>>>>>>> output buffers based on a hash key, build hash tables, probe into
>>>> hash
>>>>>>> tables, scan hash tables for untouched “outer”rows, and so forth.
>>>>>>> 
>>>>>>> I see Arrow’s compute as delivering each of those operations,
>>> working
>>>>> on
>>>>>>> perhaps a batch at a time, or a sequence of batches, with some
>>> other
>>>>> system
>>>>>>> coordinating those tasks. So I would expect to see Arrow’s compute
>>>>> language
>>>>>>> mainly operating on batches rather than a table abstraction.
>>>>>>> 
>>>>>>> Julian
>>>>>>> 
>>>>>>> 
>>>>>>>> On Aug 2, 2021, at 5:16 PM, Wes McKinney <we...@gmail.com>
>>>>> wrote:
>>>>>>>> 
>>>>>>>> hi folks,
>>>>>>>> 
>>>>>>>> This idea came up in passing in the past -- given that there are
>>>>>>>> multiple independent efforts to develop Arrow-native query
>>> engines
>>>>>>>> (and surely many more to come), it seems like it would be
>>> valuable
>>>> to
>>>>>>>> have a way to enable user languages (like Java, Python, R, or
>>> Rust,
>>>>>>>> for example) to communicate with backend computing engines (like
>>>>>>>> DataFusion, or new computing capabilities being built in the
>>> Arrow
>>>>> C++
>>>>>>>> library) in a fashion that is "lower-level" than SQL and
>>>> specialized
>>>>>>>> to Arrow's type system. So rather than leaving it to a SQL
>>> parser /
>>>>>>>> analyzer framework to generate an expression tree of relational
>>>>>>>> operators and then translate that to an Arrow-native
>>>> compute-engine's
>>>>>>>> internal grammer, a user framework could provide the desired
>>>>>>>> Arrow-native expression tree / data manipulations directly and
>>> skip
>>>>>>>> the SQL altogether.
>>>>>>>> 
>>>>>>>> The idea of creating a "serialized intermediate representation
>>>> (IR)"
>>>>>>>> for Arrow compute operations would be to serve use cases large
>>> and
>>>>>>>> small -- from the most complex TPC-* or time series database
>>> query
>>>> to
>>>>>>>> the most simple array predicate/filter sent with an RPC request
>>>> using
>>>>>>>> Arrow Flight. It is deliberately language- and engine-agnostic,
>>>> with
>>>>>>>> the only commonality across usages being the Arrow columnar
>>> format
>>>>>>>> (schemas and array types). This would be better than leaving it
>>> to
>>>>>>>> each application to develop its own bespoke expression
>>>>> representations
>>>>>>>> for its needs.
>>>>>>>> 
>>>>>>>> I spent a while thinking about this and wrote up a brain dump RFC
>>>>>>>> document [1] and accompanying pull request [2] that makes the
>>>>> possibly
>>>>>>>> controversial choice of using Flatbuffers to represent the
>>>> serialized
>>>>>>>> IR. I discuss the rationale for the choice of Flatbuffers in the
>>>> RFC
>>>>>>>> document. This PR is obviously deficient in many regards
>>>> (incomplete,
>>>>>>>> hacky, or unclear in places), and will need some help from others
>>>> to
>>>>>>>> flesh out. I suspect that actually implementing the IR will be
>>>>>>>> necessary to work out many of the low-level details.
>>>>>>>> 
>>>>>>>> Note that this IR is intended to be more of a "superset" project
>>>> than
>>>>>>>> a "lowest common denominator". So there may be things that it
>>>>> includes
>>>>>>>> which are only available in some engines (e.g. engines that have
>>>>>>>> specialized handling of time series data).
>>>>>>>> 
>>>>>>>> As some of my personal motivation for the project, concurrent
>>> with
>>>>> the
>>>>>>>> genesis of Apache Arrow, I started a Python project called Ibis
>>> [3]
>>>>>>>> (which is similar to R's dplyr project) which serves as a "Python
>>>>>>>> analytical query IR builder" that is capable of generating most
>>> of
>>>>> the
>>>>>>>> SQL standard, targeting many different SQL dialects and other
>>>>> backends
>>>>>>>> (like pandas). Microsoft ([4]) and Google ([5]) have used this
>>>>> library
>>>>>>>> as a "many-SQL" middleware. As such, I would like to be able to
>>>>>>>> translate between the in-memory "logical query" data structures
>>> in
>>>> a
>>>>>>>> library like Ibis to a serialized format that can be executed by
>>>> many
>>>>>>>> different Arrow-native query engines. The expression primitives
>>>>>>>> available in Ibis should serve as helpful test cases, too.
>>>>>>>> 
>>>>>>>> I look forward to the community's comments on the RFC document
>>> [1]
>>>>> and
>>>>>>>> pull request [2] -- I realize that arriving at consensus on a
>>>> complex
>>>>>>>> and ambitious project like this can be challenging so I recommend
>>>>>>>> spending time on the "non-goals" section in the RFC and ask
>>>> questions
>>>>>>>> if you are unclear about the scope of what problems this is
>>> trying
>>>> to
>>>>>>>> solve. I would be happy to give Edit access on the RFC document
>>> to
>>>>>>>> others and would consider ideas about how to move forward with
>>>>>>>> something that is able to be implemented by different Arrow
>>>> libraries
>>>>>>>> in the reasonably near future.
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> Wes
>>>>>>>> 
>>>>>>>> [1]:
>>>>>>> 
>>>>> 
>>>> 
>>> https://docs.google.com/document/d/1C_XVOG7iFkl6cgWWMyzUoIjfKt-X2UxqagPJrla0bAE/edit#
>>>>>>>> [2]: https://github.com/apache/arrow/pull/10856
>>>>>>>> [3]: https://ibis-project.org/
>>>>>>>> [4]: http://cidrdb.org/cidr2021/papers/cidr2021_paper08.pdf
>>>>>>>> [5]:
>>>>>>> 
>>>>> 
>>>> 
>>> https://cloud.google.com/blog/products/databases/automate-data-validation-with-dvt
>>>>>>> 
>>>>>>> 
>>>>> 
>>>> 
>>>

Re: [DISCUSS] Developing an "Arrow Compute IR [Intermediate Representation]" to decouple language front ends from Arrow-native compute engines

Posted by Wes McKinney <we...@gmail.com>.

On Wed, Aug 11, 2021 at 11:22 PM Phillip Cloud <cp...@gmail.com> wrote:
>
> On Wed, Aug 11, 2021 at 4:48 PM Jorge Cardoso Leitão <
> jorgecarleitao@gmail.com> wrote:
>
> > Couple of questions
> >
> > 1. Is the goal that IRs have equal semantics, i.e. given (IR,data), the
> > operation "(IR,data) - engine -> result" MUST be the same for all "engine"?
> >
>
> I think that might be a non-starter for mundane reasons: there's probably
> at least two engines
> that disagree on the result of sum(x) where x is a floating point column.
>
>
> > 2. if yes, imo we may need to worry about:
> > * a definition of equality that implementations agree on.
> > * agreement over what the semantics look like. For example, do we use
> > kleene logic for AND and OR?
> >
>
> WRT Kleene logic my thoughts are that the IR should support both Kleene and
> non-Kleene,
> and producers can choose their desired semantics.
>
> Ibis for example, would override the `&` operator in `a & b` to produce
> `KleeneAnd(Column(a), Column(b))`.
>
>
> >
> > To try to understand the gist, let's pick an aggregated count over a
> > column: engines often do partial counts over partitions followed by a final
> > "sum" over the partial counts. Is the idea that the query engine would
> > communicate with the compute engine via 2 IRs where one is "count me these"
> > the other is "sum me these"?
> >
> > Best,
> > Jorge
> >
>
> Not in its current incarnation.
>
> The idea is that the IR producer communicates a desire to count(x) to a
> consumer, and  it's up to the consumer to figure out how to turn that count
> into something that makes sense for itself. In your example that's a series
> of partial counts followed by a sum.
>

That said, I think there is a valid use case here where a system might
make use of different engines to execute different (composable) layers
of a complex query.

For example:

* suppose you want to scan and do predicate pushdown on an unusual
data source that is only accessible from one particular engine but
* you need to do some analytical operation with the scan results that
is only supported by another engine

You could decompose the query into two stages with an IR relational
expression for each stage and use then the engines together to
accomplish what you need (of course, you would need an orchestration
layer to handle plumbing the query engine inputs and outputs together
as Arrow streams). Supporting this kind of intra-application engine
heterogeneity is one of the motivations for the project.

> >
> >
> >
> >
> >
> > On Wed, Aug 11, 2021 at 6:10 PM Phillip Cloud <cp...@gmail.com> wrote:
> >
> > > Thanks Wes,
> > >
> > > Great to be back working on Arrow again and engaging with the community.
> > I
> > > am really excited about this effort.
> > >
> > > I think there are a number of concerns I see as important to address in
> > the
> > > compute IR proposal:
> > >
> > > 1. Requirement for output types.
> > >
> > > I think that so far there's been many reasons for requiring conforming IR
> > > producers and consumers to adhere to output types, but I haven't seen a
> > > strong rationale for keeping them optional (in the semantic sense, not
> > WRT
> > > any particular serialization format's representation of optionality).
> > >
> > > I think a design that includes unambiguous semantics for output types (a
> > > consumer must produce a value of the requested type or it's an
> > > error/non-conforming) is simpler to reason about for producers, and
> > > provides a strong guarantee for end users (humans or machines
> > constructing
> > > IR from an API and expecting the thing they ask for back from the IR
> > > consumer).
> > >
> > > 2. Flexibility
> > >
> > > The current PR is currently unable to support what I think are killer
> > > features of the IR: custom operators (relational or column) and UDFs. In
> > my
> > > mind, on top of the generalized compute description that the IR offers,
> > the
> > > ability for producers and consumers of IR to extend the IR without
> > needing
> > > to modify Arrow or depend on anything except the format is itself
> > something
> > > that is necessary to gain adoption.
> > >
> > > Developers will need to build custom relational operators (e.g., scans of
> > > backends that don't exist anywhere for which a user has code to
> > implement)
> > > and custom functions (anything operating on a column that doesn't already
> > > exist, really). Furthermore, I think we can actually drive building an
> > > Arrow consumer using the same API that an end user would use to extend
> > the
> > > IR.
> > >
> > > 3. Window Functions
> > >
> > > Window functions are, I think, an important part of the IR value
> > > proposition, as they are one of the more complex operators in databases.
> > I
> > > think we need to have something in the initial IR proposal to support
> > these
> > > operations.
> > >
> > > 4. Non relational Joins
> > >
> > > Things like as-of join and window join operators aren't yet fleshed out
> > in
> > > the IR, and I'm not sure they should be in scope for the initial
> > prototype.
> > > I think once we settle on a design, we can work the design of these
> > > particular operators out during the initial prototype. I think the
> > > specification of these operators should basically be PR #2 after the
> > > initial design lands.
> > >
> > > # Order of Work
> > >
> > > 1. Nail down the design. Anything else is a non-starter.
> > >
> > > 2. Prototype an IR producer using Ibis
> > >
> > > Ibis is IMO a good candidate for a first IR producer as it has a number
> > of
> > > desirable properties that make prototyping faster and allow for us to
> > > refine the design of the IR as needed based on how the implementation
> > goes:
> > > * It's written in Python so it has native support for nearly all of
> > > flatbuffers' features without having to creating bindings to C++.
> > > * There's already a set of rules for type checking, as well as APIs for
> > > constructing expression trees, which means we don't need to worry about
> > > building a type checker for the prototype.
> > >
> > > 3. Prototype an IR consumer in C++
> > >
> > > I think in parallel to the producer prototype we can further inform the
> > > design from the consumer side by prototyping an IR consumer in C++ . I
> > know
> > > Ben Kietzman has expressed interest in working on this.
> > >
> > > Very interested to hear others' thoughts.
> > >
> > > -Phillip
> > >
> > > On Tue, Aug 10, 2021 at 10:56 AM Wes McKinney <we...@gmail.com>
> > wrote:
> > >
> > > > Thank you for all the feedback and comments on the document. I'm on
> > > > vacation this week, so I'm delayed responding to everything, but I
> > > > will get to it as quickly as I can. I will be at VLDB in Copenhagen
> > > > next week if anyone would like to chat in person about it, and we can
> > > > relay the content of any discussions back to the document/PR/e-mail
> > > > thread.
> > > >
> > > > I know that Phillip Cloud expressed interest in working on the PR and
> > > > helping work through many of the details, so I'm glad to have the
> > > > help. If there are others who would like to work on the PR or dig into
> > > > the details, please let me know. We might need to figure out how to
> > > > accommodate "many cooks" by setting up the ComputeIR project somewhere
> > > > separate from the format/ directory to permit it to exist in a
> > > > Work-In-Progress status for a period of time until we work through the
> > > > various details and design concerns.
> > > >
> > > > Re Julian's comment
> > > >
> > > > > The biggest surprise is that this language does full relational
> > > > operations. I was expecting that it would do fragments of the
> > operations.
> > > >
> > > > There's a related but different (yet still interesting and worthy of
> > > > analysis) problem of creating an "engine language" that describes more
> > > > mechanically the constituent parts of implementing the relational
> > > > operators. To create a functional computation language with concrete
> > > > Arrow data structures as a top-level primitive sounds like an
> > > > interesting research area where I could see something developing
> > > > eventually.
> > > >
> > > > The main problem I'm interested in solving right now is enabling front
> > > > ends that have sufficient understanding of relational algebra and data
> > > > frame operations to talk to engines without having to go backwards
> > > > from their logical query plans to SQL. So as mentioned in the
> > > > document, being able to faithfully carry the relational operator node
> > > > information generated by Calcite or Ibis or another system would be
> > > > super useful. Defining the semantics of various kinds of user-defined
> > > > functions would also be helpful to standardize the
> > > > engine-to-user-language UDF/extension interface.
> > > >
> > > > On Tue, Aug 10, 2021 at 2:36 PM Dimitri Vorona <al...@gmail.com>
> > > wrote:
> > > > >
> > > > > Hi Wes,
> > > > >
> > > > > cool initiative! Reminded me of "Building Advanced SQL Analytics From
> > > > > Low-Level Plan Operators" from SIGMOD 2021 (
> > > > > http://db.in.tum.de/~kohn/papers/lolepops-sigmod21.pdf) which
> > > proposes a
> > > > > set of building block for advanced aggregation.
> > > > >
> > > > > Cheers,
> > > > > Dimitri.
> > > > >
> > > > > On Thu, Aug 5, 2021 at 7:59 PM Julian Hyde <jh...@gmail.com>
> > > > wrote:
> > > > >
> > > > > > Wes,
> > > > > >
> > > > > > Thanks for this. I’ve added comments to the doc and to the PR.
> > > > > >
> > > > > > The biggest surprise is that this language does full relational
> > > > > > operations. I was expecting that it would do fragments of the
> > > > operations.
> > > > > > Consider join. A distributed hybrid hash join needs to partition
> > rows
> > > > into
> > > > > > output buffers based on a hash key, build hash tables, probe into
> > > hash
> > > > > > tables, scan hash tables for untouched “outer”rows, and so forth.
> > > > > >
> > > > > > I see Arrow’s compute as delivering each of those operations,
> > working
> > > > on
> > > > > > perhaps a batch at a time, or a sequence of batches, with some
> > other
> > > > system
> > > > > > coordinating those tasks. So I would expect to see Arrow’s compute
> > > > language
> > > > > > mainly operating on batches rather than a table abstraction.
> > > > > >
> > > > > > Julian
> > > > > >
> > > > > >
> > > > > > > On Aug 2, 2021, at 5:16 PM, Wes McKinney <we...@gmail.com>
> > > > wrote:
> > > > > > >
> > > > > > > hi folks,
> > > > > > >
> > > > > > > This idea came up in passing in the past -- given that there are
> > > > > > > multiple independent efforts to develop Arrow-native query
> > engines
> > > > > > > (and surely many more to come), it seems like it would be
> > valuable
> > > to
> > > > > > > have a way to enable user languages (like Java, Python, R, or
> > Rust,
> > > > > > > for example) to communicate with backend computing engines (like
> > > > > > > DataFusion, or new computing capabilities being built in the
> > Arrow
> > > > C++
> > > > > > > library) in a fashion that is "lower-level" than SQL and
> > > specialized
> > > > > > > to Arrow's type system. So rather than leaving it to a SQL
> > parser /
> > > > > > > analyzer framework to generate an expression tree of relational
> > > > > > > operators and then translate that to an Arrow-native
> > > compute-engine's
> > > > > > > internal grammer, a user framework could provide the desired
> > > > > > > Arrow-native expression tree / data manipulations directly and
> > skip
> > > > > > > the SQL altogether.
> > > > > > >
> > > > > > > The idea of creating a "serialized intermediate representation
> > > (IR)"
> > > > > > > for Arrow compute operations would be to serve use cases large
> > and
> > > > > > > small -- from the most complex TPC-* or time series database
> > query
> > > to
> > > > > > > the most simple array predicate/filter sent with an RPC request
> > > using
> > > > > > > Arrow Flight. It is deliberately language- and engine-agnostic,
> > > with
> > > > > > > the only commonality across usages being the Arrow columnar
> > format
> > > > > > > (schemas and array types). This would be better than leaving it
> > to
> > > > > > > each application to develop its own bespoke expression
> > > > representations
> > > > > > > for its needs.
> > > > > > >
> > > > > > > I spent a while thinking about this and wrote up a brain dump RFC
> > > > > > > document [1] and accompanying pull request [2] that makes the
> > > > possibly
> > > > > > > controversial choice of using Flatbuffers to represent the
> > > serialized
> > > > > > > IR. I discuss the rationale for the choice of Flatbuffers in the
> > > RFC
> > > > > > > document. This PR is obviously deficient in many regards
> > > (incomplete,
> > > > > > > hacky, or unclear in places), and will need some help from others
> > > to
> > > > > > > flesh out. I suspect that actually implementing the IR will be
> > > > > > > necessary to work out many of the low-level details.
> > > > > > >
> > > > > > > Note that this IR is intended to be more of a "superset" project
> > > than
> > > > > > > a "lowest common denominator". So there may be things that it
> > > > includes
> > > > > > > which are only available in some engines (e.g. engines that have
> > > > > > > specialized handling of time series data).
> > > > > > >
> > > > > > > As some of my personal motivation for the project, concurrent
> > with
> > > > the
> > > > > > > genesis of Apache Arrow, I started a Python project called Ibis
> > [3]
> > > > > > > (which is similar to R's dplyr project) which serves as a "Python
> > > > > > > analytical query IR builder" that is capable of generating most
> > of
> > > > the
> > > > > > > SQL standard, targeting many different SQL dialects and other
> > > > backends
> > > > > > > (like pandas). Microsoft ([4]) and Google ([5]) have used this
> > > > library
> > > > > > > as a "many-SQL" middleware. As such, I would like to be able to
> > > > > > > translate between the in-memory "logical query" data structures
> > in
> > > a
> > > > > > > library like Ibis to a serialized format that can be executed by
> > > many
> > > > > > > different Arrow-native query engines. The expression primitives
> > > > > > > available in Ibis should serve as helpful test cases, too.
> > > > > > >
> > > > > > > I look forward to the community's comments on the RFC document
> > [1]
> > > > and
> > > > > > > pull request [2] -- I realize that arriving at consensus on a
> > > complex
> > > > > > > and ambitious project like this can be challenging so I recommend
> > > > > > > spending time on the "non-goals" section in the RFC and ask
> > > questions
> > > > > > > if you are unclear about the scope of what problems this is
> > trying
> > > to
> > > > > > > solve. I would be happy to give Edit access on the RFC document
> > to
> > > > > > > others and would consider ideas about how to move forward with
> > > > > > > something that is able to be implemented by different Arrow
> > > libraries
> > > > > > > in the reasonably near future.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Wes
> > > > > > >
> > > > > > > [1]:
> > > > > >
> > > >
> > >
> > https://docs.google.com/document/d/1C_XVOG7iFkl6cgWWMyzUoIjfKt-X2UxqagPJrla0bAE/edit#
> > > > > > > [2]: https://github.com/apache/arrow/pull/10856
> > > > > > > [3]: https://ibis-project.org/
> > > > > > > [4]: http://cidrdb.org/cidr2021/papers/cidr2021_paper08.pdf
> > > > > > > [5]:
> > > > > >
> > > >
> > >
> > https://cloud.google.com/blog/products/databases/automate-data-validation-with-dvt
> > > > > >
> > > > > >
> > > >
> > >
> >

Re: [DISCUSS] Developing an "Arrow Compute IR [Intermediate Representation]" to decouple language front ends from Arrow-native compute engines

Posted by Phillip Cloud <cp...@gmail.com>.

On Wed, Aug 11, 2021 at 4:48 PM Jorge Cardoso Leitão <
jorgecarleitao@gmail.com> wrote:

> Couple of questions
>
> 1. Is the goal that IRs have equal semantics, i.e. given (IR,data), the
> operation "(IR,data) - engine -> result" MUST be the same for all "engine"?
>

I think that might be a non-starter for mundane reasons: there's probably
at least two engines
that disagree on the result of sum(x) where x is a floating point column.


> 2. if yes, imo we may need to worry about:
> * a definition of equality that implementations agree on.
> * agreement over what the semantics look like. For example, do we use
> kleene logic for AND and OR?
>

WRT Kleene logic my thoughts are that the IR should support both Kleene and
non-Kleene,
and producers can choose their desired semantics.

Ibis for example, would override the `&` operator in `a & b` to produce
`KleeneAnd(Column(a), Column(b))`.


>
> To try to understand the gist, let's pick an aggregated count over a
> column: engines often do partial counts over partitions followed by a final
> "sum" over the partial counts. Is the idea that the query engine would
> communicate with the compute engine via 2 IRs where one is "count me these"
> the other is "sum me these"?
>
> Best,
> Jorge
>

Not in its current incarnation.

The idea is that the IR producer communicates a desire to count(x) to a
consumer, and  it's up to the consumer to figure out how to turn that count
into something that makes sense for itself. In your example that's a series
of partial counts followed by a sum.


>
>
>
>
>
> On Wed, Aug 11, 2021 at 6:10 PM Phillip Cloud <cp...@gmail.com> wrote:
>
> > Thanks Wes,
> >
> > Great to be back working on Arrow again and engaging with the community.
> I
> > am really excited about this effort.
> >
> > I think there are a number of concerns I see as important to address in
> the
> > compute IR proposal:
> >
> > 1. Requirement for output types.
> >
> > I think that so far there's been many reasons for requiring conforming IR
> > producers and consumers to adhere to output types, but I haven't seen a
> > strong rationale for keeping them optional (in the semantic sense, not
> WRT
> > any particular serialization format's representation of optionality).
> >
> > I think a design that includes unambiguous semantics for output types (a
> > consumer must produce a value of the requested type or it's an
> > error/non-conforming) is simpler to reason about for producers, and
> > provides a strong guarantee for end users (humans or machines
> constructing
> > IR from an API and expecting the thing they ask for back from the IR
> > consumer).
> >
> > 2. Flexibility
> >
> > The current PR is currently unable to support what I think are killer
> > features of the IR: custom operators (relational or column) and UDFs. In
> my
> > mind, on top of the generalized compute description that the IR offers,
> the
> > ability for producers and consumers of IR to extend the IR without
> needing
> > to modify Arrow or depend on anything except the format is itself
> something
> > that is necessary to gain adoption.
> >
> > Developers will need to build custom relational operators (e.g., scans of
> > backends that don't exist anywhere for which a user has code to
> implement)
> > and custom functions (anything operating on a column that doesn't already
> > exist, really). Furthermore, I think we can actually drive building an
> > Arrow consumer using the same API that an end user would use to extend
> the
> > IR.
> >
> > 3. Window Functions
> >
> > Window functions are, I think, an important part of the IR value
> > proposition, as they are one of the more complex operators in databases.
> I
> > think we need to have something in the initial IR proposal to support
> these
> > operations.
> >
> > 4. Non relational Joins
> >
> > Things like as-of join and window join operators aren't yet fleshed out
> in
> > the IR, and I'm not sure they should be in scope for the initial
> prototype.
> > I think once we settle on a design, we can work the design of these
> > particular operators out during the initial prototype. I think the
> > specification of these operators should basically be PR #2 after the
> > initial design lands.
> >
> > # Order of Work
> >
> > 1. Nail down the design. Anything else is a non-starter.
> >
> > 2. Prototype an IR producer using Ibis
> >
> > Ibis is IMO a good candidate for a first IR producer as it has a number
> of
> > desirable properties that make prototyping faster and allow for us to
> > refine the design of the IR as needed based on how the implementation
> goes:
> > * It's written in Python so it has native support for nearly all of
> > flatbuffers' features without having to creating bindings to C++.
> > * There's already a set of rules for type checking, as well as APIs for
> > constructing expression trees, which means we don't need to worry about
> > building a type checker for the prototype.
> >
> > 3. Prototype an IR consumer in C++
> >
> > I think in parallel to the producer prototype we can further inform the
> > design from the consumer side by prototyping an IR consumer in C++ . I
> know
> > Ben Kietzman has expressed interest in working on this.
> >
> > Very interested to hear others' thoughts.
> >
> > -Phillip
> >
> > On Tue, Aug 10, 2021 at 10:56 AM Wes McKinney <we...@gmail.com>
> wrote:
> >
> > > Thank you for all the feedback and comments on the document. I'm on
> > > vacation this week, so I'm delayed responding to everything, but I
> > > will get to it as quickly as I can. I will be at VLDB in Copenhagen
> > > next week if anyone would like to chat in person about it, and we can
> > > relay the content of any discussions back to the document/PR/e-mail
> > > thread.
> > >
> > > I know that Phillip Cloud expressed interest in working on the PR and
> > > helping work through many of the details, so I'm glad to have the
> > > help. If there are others who would like to work on the PR or dig into
> > > the details, please let me know. We might need to figure out how to
> > > accommodate "many cooks" by setting up the ComputeIR project somewhere
> > > separate from the format/ directory to permit it to exist in a
> > > Work-In-Progress status for a period of time until we work through the
> > > various details and design concerns.
> > >
> > > Re Julian's comment
> > >
> > > > The biggest surprise is that this language does full relational
> > > operations. I was expecting that it would do fragments of the
> operations.
> > >
> > > There's a related but different (yet still interesting and worthy of
> > > analysis) problem of creating an "engine language" that describes more
> > > mechanically the constituent parts of implementing the relational
> > > operators. To create a functional computation language with concrete
> > > Arrow data structures as a top-level primitive sounds like an
> > > interesting research area where I could see something developing
> > > eventually.
> > >
> > > The main problem I'm interested in solving right now is enabling front
> > > ends that have sufficient understanding of relational algebra and data
> > > frame operations to talk to engines without having to go backwards
> > > from their logical query plans to SQL. So as mentioned in the
> > > document, being able to faithfully carry the relational operator node
> > > information generated by Calcite or Ibis or another system would be
> > > super useful. Defining the semantics of various kinds of user-defined
> > > functions would also be helpful to standardize the
> > > engine-to-user-language UDF/extension interface.
> > >
> > > On Tue, Aug 10, 2021 at 2:36 PM Dimitri Vorona <al...@gmail.com>
> > wrote:
> > > >
> > > > Hi Wes,
> > > >
> > > > cool initiative! Reminded me of "Building Advanced SQL Analytics From
> > > > Low-Level Plan Operators" from SIGMOD 2021 (
> > > > http://db.in.tum.de/~kohn/papers/lolepops-sigmod21.pdf) which
> > proposes a
> > > > set of building block for advanced aggregation.
> > > >
> > > > Cheers,
> > > > Dimitri.
> > > >
> > > > On Thu, Aug 5, 2021 at 7:59 PM Julian Hyde <jh...@gmail.com>
> > > wrote:
> > > >
> > > > > Wes,
> > > > >
> > > > > Thanks for this. I’ve added comments to the doc and to the PR.
> > > > >
> > > > > The biggest surprise is that this language does full relational
> > > > > operations. I was expecting that it would do fragments of the
> > > operations.
> > > > > Consider join. A distributed hybrid hash join needs to partition
> rows
> > > into
> > > > > output buffers based on a hash key, build hash tables, probe into
> > hash
> > > > > tables, scan hash tables for untouched “outer”rows, and so forth.
> > > > >
> > > > > I see Arrow’s compute as delivering each of those operations,
> working
> > > on
> > > > > perhaps a batch at a time, or a sequence of batches, with some
> other
> > > system
> > > > > coordinating those tasks. So I would expect to see Arrow’s compute
> > > language
> > > > > mainly operating on batches rather than a table abstraction.
> > > > >
> > > > > Julian
> > > > >
> > > > >
> > > > > > On Aug 2, 2021, at 5:16 PM, Wes McKinney <we...@gmail.com>
> > > wrote:
> > > > > >
> > > > > > hi folks,
> > > > > >
> > > > > > This idea came up in passing in the past -- given that there are
> > > > > > multiple independent efforts to develop Arrow-native query
> engines
> > > > > > (and surely many more to come), it seems like it would be
> valuable
> > to
> > > > > > have a way to enable user languages (like Java, Python, R, or
> Rust,
> > > > > > for example) to communicate with backend computing engines (like
> > > > > > DataFusion, or new computing capabilities being built in the
> Arrow
> > > C++
> > > > > > library) in a fashion that is "lower-level" than SQL and
> > specialized
> > > > > > to Arrow's type system. So rather than leaving it to a SQL
> parser /
> > > > > > analyzer framework to generate an expression tree of relational
> > > > > > operators and then translate that to an Arrow-native
> > compute-engine's
> > > > > > internal grammer, a user framework could provide the desired
> > > > > > Arrow-native expression tree / data manipulations directly and
> skip
> > > > > > the SQL altogether.
> > > > > >
> > > > > > The idea of creating a "serialized intermediate representation
> > (IR)"
> > > > > > for Arrow compute operations would be to serve use cases large
> and
> > > > > > small -- from the most complex TPC-* or time series database
> query
> > to
> > > > > > the most simple array predicate/filter sent with an RPC request
> > using
> > > > > > Arrow Flight. It is deliberately language- and engine-agnostic,
> > with
> > > > > > the only commonality across usages being the Arrow columnar
> format
> > > > > > (schemas and array types). This would be better than leaving it
> to
> > > > > > each application to develop its own bespoke expression
> > > representations
> > > > > > for its needs.
> > > > > >
> > > > > > I spent a while thinking about this and wrote up a brain dump RFC
> > > > > > document [1] and accompanying pull request [2] that makes the
> > > possibly
> > > > > > controversial choice of using Flatbuffers to represent the
> > serialized
> > > > > > IR. I discuss the rationale for the choice of Flatbuffers in the
> > RFC
> > > > > > document. This PR is obviously deficient in many regards
> > (incomplete,
> > > > > > hacky, or unclear in places), and will need some help from others
> > to
> > > > > > flesh out. I suspect that actually implementing the IR will be
> > > > > > necessary to work out many of the low-level details.
> > > > > >
> > > > > > Note that this IR is intended to be more of a "superset" project
> > than
> > > > > > a "lowest common denominator". So there may be things that it
> > > includes
> > > > > > which are only available in some engines (e.g. engines that have
> > > > > > specialized handling of time series data).
> > > > > >
> > > > > > As some of my personal motivation for the project, concurrent
> with
> > > the
> > > > > > genesis of Apache Arrow, I started a Python project called Ibis
> [3]
> > > > > > (which is similar to R's dplyr project) which serves as a "Python
> > > > > > analytical query IR builder" that is capable of generating most
> of
> > > the
> > > > > > SQL standard, targeting many different SQL dialects and other
> > > backends
> > > > > > (like pandas). Microsoft ([4]) and Google ([5]) have used this
> > > library
> > > > > > as a "many-SQL" middleware. As such, I would like to be able to
> > > > > > translate between the in-memory "logical query" data structures
> in
> > a
> > > > > > library like Ibis to a serialized format that can be executed by
> > many
> > > > > > different Arrow-native query engines. The expression primitives
> > > > > > available in Ibis should serve as helpful test cases, too.
> > > > > >
> > > > > > I look forward to the community's comments on the RFC document
> [1]
> > > and
> > > > > > pull request [2] -- I realize that arriving at consensus on a
> > complex
> > > > > > and ambitious project like this can be challenging so I recommend
> > > > > > spending time on the "non-goals" section in the RFC and ask
> > questions
> > > > > > if you are unclear about the scope of what problems this is
> trying
> > to
> > > > > > solve. I would be happy to give Edit access on the RFC document
> to
> > > > > > others and would consider ideas about how to move forward with
> > > > > > something that is able to be implemented by different Arrow
> > libraries
> > > > > > in the reasonably near future.
> > > > > >
> > > > > > Thanks,
> > > > > > Wes
> > > > > >
> > > > > > [1]:
> > > > >
> > >
> >
> https://docs.google.com/document/d/1C_XVOG7iFkl6cgWWMyzUoIjfKt-X2UxqagPJrla0bAE/edit#
> > > > > > [2]: https://github.com/apache/arrow/pull/10856
> > > > > > [3]: https://ibis-project.org/
> > > > > > [4]: http://cidrdb.org/cidr2021/papers/cidr2021_paper08.pdf
> > > > > > [5]:
> > > > >
> > >
> >
> https://cloud.google.com/blog/products/databases/automate-data-validation-with-dvt
> > > > >
> > > > >
> > >
> >
>

Re: [DISCUSS] Developing an "Arrow Compute IR [Intermediate Representation]" to decouple language front ends from Arrow-native compute engines

Posted by Jorge Cardoso Leitão <jo...@gmail.com>.

Couple of questions

1. Is the goal that IRs have equal semantics, i.e. given (IR,data), the
operation "(IR,data) - engine -> result" MUST be the same for all "engine"?
2. if yes, imo we may need to worry about:
* a definition of equality that implementations agree on.
* agreement over what the semantics look like. For example, do we use
kleene logic for AND and OR?

To try to understand the gist, let's pick an aggregated count over a
column: engines often do partial counts over partitions followed by a final
"sum" over the partial counts. Is the idea that the query engine would
communicate with the compute engine via 2 IRs where one is "count me these"
the other is "sum me these"?

Best,
Jorge





On Wed, Aug 11, 2021 at 6:10 PM Phillip Cloud <cp...@gmail.com> wrote:

> Thanks Wes,
>
> Great to be back working on Arrow again and engaging with the community. I
> am really excited about this effort.
>
> I think there are a number of concerns I see as important to address in the
> compute IR proposal:
>
> 1. Requirement for output types.
>
> I think that so far there's been many reasons for requiring conforming IR
> producers and consumers to adhere to output types, but I haven't seen a
> strong rationale for keeping them optional (in the semantic sense, not WRT
> any particular serialization format's representation of optionality).
>
> I think a design that includes unambiguous semantics for output types (a
> consumer must produce a value of the requested type or it's an
> error/non-conforming) is simpler to reason about for producers, and
> provides a strong guarantee for end users (humans or machines constructing
> IR from an API and expecting the thing they ask for back from the IR
> consumer).
>
> 2. Flexibility
>
> The current PR is currently unable to support what I think are killer
> features of the IR: custom operators (relational or column) and UDFs. In my
> mind, on top of the generalized compute description that the IR offers, the
> ability for producers and consumers of IR to extend the IR without needing
> to modify Arrow or depend on anything except the format is itself something
> that is necessary to gain adoption.
>
> Developers will need to build custom relational operators (e.g., scans of
> backends that don't exist anywhere for which a user has code to implement)
> and custom functions (anything operating on a column that doesn't already
> exist, really). Furthermore, I think we can actually drive building an
> Arrow consumer using the same API that an end user would use to extend the
> IR.
>
> 3. Window Functions
>
> Window functions are, I think, an important part of the IR value
> proposition, as they are one of the more complex operators in databases. I
> think we need to have something in the initial IR proposal to support these
> operations.
>
> 4. Non relational Joins
>
> Things like as-of join and window join operators aren't yet fleshed out in
> the IR, and I'm not sure they should be in scope for the initial prototype.
> I think once we settle on a design, we can work the design of these
> particular operators out during the initial prototype. I think the
> specification of these operators should basically be PR #2 after the
> initial design lands.
>
> # Order of Work
>
> 1. Nail down the design. Anything else is a non-starter.
>
> 2. Prototype an IR producer using Ibis
>
> Ibis is IMO a good candidate for a first IR producer as it has a number of
> desirable properties that make prototyping faster and allow for us to
> refine the design of the IR as needed based on how the implementation goes:
> * It's written in Python so it has native support for nearly all of
> flatbuffers' features without having to creating bindings to C++.
> * There's already a set of rules for type checking, as well as APIs for
> constructing expression trees, which means we don't need to worry about
> building a type checker for the prototype.
>
> 3. Prototype an IR consumer in C++
>
> I think in parallel to the producer prototype we can further inform the
> design from the consumer side by prototyping an IR consumer in C++ . I know
> Ben Kietzman has expressed interest in working on this.
>
> Very interested to hear others' thoughts.
>
> -Phillip
>
> On Tue, Aug 10, 2021 at 10:56 AM Wes McKinney <we...@gmail.com> wrote:
>
> > Thank you for all the feedback and comments on the document. I'm on
> > vacation this week, so I'm delayed responding to everything, but I
> > will get to it as quickly as I can. I will be at VLDB in Copenhagen
> > next week if anyone would like to chat in person about it, and we can
> > relay the content of any discussions back to the document/PR/e-mail
> > thread.
> >
> > I know that Phillip Cloud expressed interest in working on the PR and
> > helping work through many of the details, so I'm glad to have the
> > help. If there are others who would like to work on the PR or dig into
> > the details, please let me know. We might need to figure out how to
> > accommodate "many cooks" by setting up the ComputeIR project somewhere
> > separate from the format/ directory to permit it to exist in a
> > Work-In-Progress status for a period of time until we work through the
> > various details and design concerns.
> >
> > Re Julian's comment
> >
> > > The biggest surprise is that this language does full relational
> > operations. I was expecting that it would do fragments of the operations.
> >
> > There's a related but different (yet still interesting and worthy of
> > analysis) problem of creating an "engine language" that describes more
> > mechanically the constituent parts of implementing the relational
> > operators. To create a functional computation language with concrete
> > Arrow data structures as a top-level primitive sounds like an
> > interesting research area where I could see something developing
> > eventually.
> >
> > The main problem I'm interested in solving right now is enabling front
> > ends that have sufficient understanding of relational algebra and data
> > frame operations to talk to engines without having to go backwards
> > from their logical query plans to SQL. So as mentioned in the
> > document, being able to faithfully carry the relational operator node
> > information generated by Calcite or Ibis or another system would be
> > super useful. Defining the semantics of various kinds of user-defined
> > functions would also be helpful to standardize the
> > engine-to-user-language UDF/extension interface.
> >
> > On Tue, Aug 10, 2021 at 2:36 PM Dimitri Vorona <al...@gmail.com>
> wrote:
> > >
> > > Hi Wes,
> > >
> > > cool initiative! Reminded me of "Building Advanced SQL Analytics From
> > > Low-Level Plan Operators" from SIGMOD 2021 (
> > > http://db.in.tum.de/~kohn/papers/lolepops-sigmod21.pdf) which
> proposes a
> > > set of building block for advanced aggregation.
> > >
> > > Cheers,
> > > Dimitri.
> > >
> > > On Thu, Aug 5, 2021 at 7:59 PM Julian Hyde <jh...@gmail.com>
> > wrote:
> > >
> > > > Wes,
> > > >
> > > > Thanks for this. I’ve added comments to the doc and to the PR.
> > > >
> > > > The biggest surprise is that this language does full relational
> > > > operations. I was expecting that it would do fragments of the
> > operations.
> > > > Consider join. A distributed hybrid hash join needs to partition rows
> > into
> > > > output buffers based on a hash key, build hash tables, probe into
> hash
> > > > tables, scan hash tables for untouched “outer”rows, and so forth.
> > > >
> > > > I see Arrow’s compute as delivering each of those operations, working
> > on
> > > > perhaps a batch at a time, or a sequence of batches, with some other
> > system
> > > > coordinating those tasks. So I would expect to see Arrow’s compute
> > language
> > > > mainly operating on batches rather than a table abstraction.
> > > >
> > > > Julian
> > > >
> > > >
> > > > > On Aug 2, 2021, at 5:16 PM, Wes McKinney <we...@gmail.com>
> > wrote:
> > > > >
> > > > > hi folks,
> > > > >
> > > > > This idea came up in passing in the past -- given that there are
> > > > > multiple independent efforts to develop Arrow-native query engines
> > > > > (and surely many more to come), it seems like it would be valuable
> to
> > > > > have a way to enable user languages (like Java, Python, R, or Rust,
> > > > > for example) to communicate with backend computing engines (like
> > > > > DataFusion, or new computing capabilities being built in the Arrow
> > C++
> > > > > library) in a fashion that is "lower-level" than SQL and
> specialized
> > > > > to Arrow's type system. So rather than leaving it to a SQL parser /
> > > > > analyzer framework to generate an expression tree of relational
> > > > > operators and then translate that to an Arrow-native
> compute-engine's
> > > > > internal grammer, a user framework could provide the desired
> > > > > Arrow-native expression tree / data manipulations directly and skip
> > > > > the SQL altogether.
> > > > >
> > > > > The idea of creating a "serialized intermediate representation
> (IR)"
> > > > > for Arrow compute operations would be to serve use cases large and
> > > > > small -- from the most complex TPC-* or time series database query
> to
> > > > > the most simple array predicate/filter sent with an RPC request
> using
> > > > > Arrow Flight. It is deliberately language- and engine-agnostic,
> with
> > > > > the only commonality across usages being the Arrow columnar format
> > > > > (schemas and array types). This would be better than leaving it to
> > > > > each application to develop its own bespoke expression
> > representations
> > > > > for its needs.
> > > > >
> > > > > I spent a while thinking about this and wrote up a brain dump RFC
> > > > > document [1] and accompanying pull request [2] that makes the
> > possibly
> > > > > controversial choice of using Flatbuffers to represent the
> serialized
> > > > > IR. I discuss the rationale for the choice of Flatbuffers in the
> RFC
> > > > > document. This PR is obviously deficient in many regards
> (incomplete,
> > > > > hacky, or unclear in places), and will need some help from others
> to
> > > > > flesh out. I suspect that actually implementing the IR will be
> > > > > necessary to work out many of the low-level details.
> > > > >
> > > > > Note that this IR is intended to be more of a "superset" project
> than
> > > > > a "lowest common denominator". So there may be things that it
> > includes
> > > > > which are only available in some engines (e.g. engines that have
> > > > > specialized handling of time series data).
> > > > >
> > > > > As some of my personal motivation for the project, concurrent with
> > the
> > > > > genesis of Apache Arrow, I started a Python project called Ibis [3]
> > > > > (which is similar to R's dplyr project) which serves as a "Python
> > > > > analytical query IR builder" that is capable of generating most of
> > the
> > > > > SQL standard, targeting many different SQL dialects and other
> > backends
> > > > > (like pandas). Microsoft ([4]) and Google ([5]) have used this
> > library
> > > > > as a "many-SQL" middleware. As such, I would like to be able to
> > > > > translate between the in-memory "logical query" data structures in
> a
> > > > > library like Ibis to a serialized format that can be executed by
> many
> > > > > different Arrow-native query engines. The expression primitives
> > > > > available in Ibis should serve as helpful test cases, too.
> > > > >
> > > > > I look forward to the community's comments on the RFC document [1]
> > and
> > > > > pull request [2] -- I realize that arriving at consensus on a
> complex
> > > > > and ambitious project like this can be challenging so I recommend
> > > > > spending time on the "non-goals" section in the RFC and ask
> questions
> > > > > if you are unclear about the scope of what problems this is trying
> to
> > > > > solve. I would be happy to give Edit access on the RFC document to
> > > > > others and would consider ideas about how to move forward with
> > > > > something that is able to be implemented by different Arrow
> libraries
> > > > > in the reasonably near future.
> > > > >
> > > > > Thanks,
> > > > > Wes
> > > > >
> > > > > [1]:
> > > >
> >
> https://docs.google.com/document/d/1C_XVOG7iFkl6cgWWMyzUoIjfKt-X2UxqagPJrla0bAE/edit#
> > > > > [2]: https://github.com/apache/arrow/pull/10856
> > > > > [3]: https://ibis-project.org/
> > > > > [4]: http://cidrdb.org/cidr2021/papers/cidr2021_paper08.pdf
> > > > > [5]:
> > > >
> >
> https://cloud.google.com/blog/products/databases/automate-data-validation-with-dvt
> > > >
> > > >
> >
>

Re: [DISCUSS] Developing an "Arrow Compute IR [Intermediate Representation]" to decouple language front ends from Arrow-native compute engines

Posted by Phillip Cloud <cp...@gmail.com>.

Thanks Wes,

Great to be back working on Arrow again and engaging with the community. I
am really excited about this effort.

I think there are a number of concerns I see as important to address in the
compute IR proposal:

1. Requirement for output types.

I think that so far there's been many reasons for requiring conforming IR
producers and consumers to adhere to output types, but I haven't seen a
strong rationale for keeping them optional (in the semantic sense, not WRT
any particular serialization format's representation of optionality).

I think a design that includes unambiguous semantics for output types (a
consumer must produce a value of the requested type or it's an
error/non-conforming) is simpler to reason about for producers, and
provides a strong guarantee for end users (humans or machines constructing
IR from an API and expecting the thing they ask for back from the IR
consumer).

2. Flexibility

The current PR is currently unable to support what I think are killer
features of the IR: custom operators (relational or column) and UDFs. In my
mind, on top of the generalized compute description that the IR offers, the
ability for producers and consumers of IR to extend the IR without needing
to modify Arrow or depend on anything except the format is itself something
that is necessary to gain adoption.

Developers will need to build custom relational operators (e.g., scans of
backends that don't exist anywhere for which a user has code to implement)
and custom functions (anything operating on a column that doesn't already
exist, really). Furthermore, I think we can actually drive building an
Arrow consumer using the same API that an end user would use to extend the
IR.

3. Window Functions

Window functions are, I think, an important part of the IR value
proposition, as they are one of the more complex operators in databases. I
think we need to have something in the initial IR proposal to support these
operations.

4. Non relational Joins

Things like as-of join and window join operators aren't yet fleshed out in
the IR, and I'm not sure they should be in scope for the initial prototype.
I think once we settle on a design, we can work the design of these
particular operators out during the initial prototype. I think the
specification of these operators should basically be PR #2 after the
initial design lands.

# Order of Work

1. Nail down the design. Anything else is a non-starter.

2. Prototype an IR producer using Ibis

Ibis is IMO a good candidate for a first IR producer as it has a number of
desirable properties that make prototyping faster and allow for us to
refine the design of the IR as needed based on how the implementation goes:
* It's written in Python so it has native support for nearly all of
flatbuffers' features without having to creating bindings to C++.
* There's already a set of rules for type checking, as well as APIs for
constructing expression trees, which means we don't need to worry about
building a type checker for the prototype.

3. Prototype an IR consumer in C++

I think in parallel to the producer prototype we can further inform the
design from the consumer side by prototyping an IR consumer in C++ . I know
Ben Kietzman has expressed interest in working on this.

Very interested to hear others' thoughts.

-Phillip

On Tue, Aug 10, 2021 at 10:56 AM Wes McKinney <we...@gmail.com> wrote:

> Thank you for all the feedback and comments on the document. I'm on
> vacation this week, so I'm delayed responding to everything, but I
> will get to it as quickly as I can. I will be at VLDB in Copenhagen
> next week if anyone would like to chat in person about it, and we can
> relay the content of any discussions back to the document/PR/e-mail
> thread.
>
> I know that Phillip Cloud expressed interest in working on the PR and
> helping work through many of the details, so I'm glad to have the
> help. If there are others who would like to work on the PR or dig into
> the details, please let me know. We might need to figure out how to
> accommodate "many cooks" by setting up the ComputeIR project somewhere
> separate from the format/ directory to permit it to exist in a
> Work-In-Progress status for a period of time until we work through the
> various details and design concerns.
>
> Re Julian's comment
>
> > The biggest surprise is that this language does full relational
> operations. I was expecting that it would do fragments of the operations.
>
> There's a related but different (yet still interesting and worthy of
> analysis) problem of creating an "engine language" that describes more
> mechanically the constituent parts of implementing the relational
> operators. To create a functional computation language with concrete
> Arrow data structures as a top-level primitive sounds like an
> interesting research area where I could see something developing
> eventually.
>
> The main problem I'm interested in solving right now is enabling front
> ends that have sufficient understanding of relational algebra and data
> frame operations to talk to engines without having to go backwards
> from their logical query plans to SQL. So as mentioned in the
> document, being able to faithfully carry the relational operator node
> information generated by Calcite or Ibis or another system would be
> super useful. Defining the semantics of various kinds of user-defined
> functions would also be helpful to standardize the
> engine-to-user-language UDF/extension interface.
>
> On Tue, Aug 10, 2021 at 2:36 PM Dimitri Vorona <al...@gmail.com> wrote:
> >
> > Hi Wes,
> >
> > cool initiative! Reminded me of "Building Advanced SQL Analytics From
> > Low-Level Plan Operators" from SIGMOD 2021 (
> > http://db.in.tum.de/~kohn/papers/lolepops-sigmod21.pdf) which proposes a
> > set of building block for advanced aggregation.
> >
> > Cheers,
> > Dimitri.
> >
> > On Thu, Aug 5, 2021 at 7:59 PM Julian Hyde <jh...@gmail.com>
> wrote:
> >
> > > Wes,
> > >
> > > Thanks for this. I’ve added comments to the doc and to the PR.
> > >
> > > The biggest surprise is that this language does full relational
> > > operations. I was expecting that it would do fragments of the
> operations.
> > > Consider join. A distributed hybrid hash join needs to partition rows
> into
> > > output buffers based on a hash key, build hash tables, probe into hash
> > > tables, scan hash tables for untouched “outer”rows, and so forth.
> > >
> > > I see Arrow’s compute as delivering each of those operations, working
> on
> > > perhaps a batch at a time, or a sequence of batches, with some other
> system
> > > coordinating those tasks. So I would expect to see Arrow’s compute
> language
> > > mainly operating on batches rather than a table abstraction.
> > >
> > > Julian
> > >
> > >
> > > > On Aug 2, 2021, at 5:16 PM, Wes McKinney <we...@gmail.com>
> wrote:
> > > >
> > > > hi folks,
> > > >
> > > > This idea came up in passing in the past -- given that there are
> > > > multiple independent efforts to develop Arrow-native query engines
> > > > (and surely many more to come), it seems like it would be valuable to
> > > > have a way to enable user languages (like Java, Python, R, or Rust,
> > > > for example) to communicate with backend computing engines (like
> > > > DataFusion, or new computing capabilities being built in the Arrow
> C++
> > > > library) in a fashion that is "lower-level" than SQL and specialized
> > > > to Arrow's type system. So rather than leaving it to a SQL parser /
> > > > analyzer framework to generate an expression tree of relational
> > > > operators and then translate that to an Arrow-native compute-engine's
> > > > internal grammer, a user framework could provide the desired
> > > > Arrow-native expression tree / data manipulations directly and skip
> > > > the SQL altogether.
> > > >
> > > > The idea of creating a "serialized intermediate representation (IR)"
> > > > for Arrow compute operations would be to serve use cases large and
> > > > small -- from the most complex TPC-* or time series database query to
> > > > the most simple array predicate/filter sent with an RPC request using
> > > > Arrow Flight. It is deliberately language- and engine-agnostic, with
> > > > the only commonality across usages being the Arrow columnar format
> > > > (schemas and array types). This would be better than leaving it to
> > > > each application to develop its own bespoke expression
> representations
> > > > for its needs.
> > > >
> > > > I spent a while thinking about this and wrote up a brain dump RFC
> > > > document [1] and accompanying pull request [2] that makes the
> possibly
> > > > controversial choice of using Flatbuffers to represent the serialized
> > > > IR. I discuss the rationale for the choice of Flatbuffers in the RFC
> > > > document. This PR is obviously deficient in many regards (incomplete,
> > > > hacky, or unclear in places), and will need some help from others to
> > > > flesh out. I suspect that actually implementing the IR will be
> > > > necessary to work out many of the low-level details.
> > > >
> > > > Note that this IR is intended to be more of a "superset" project than
> > > > a "lowest common denominator". So there may be things that it
> includes
> > > > which are only available in some engines (e.g. engines that have
> > > > specialized handling of time series data).
> > > >
> > > > As some of my personal motivation for the project, concurrent with
> the
> > > > genesis of Apache Arrow, I started a Python project called Ibis [3]
> > > > (which is similar to R's dplyr project) which serves as a "Python
> > > > analytical query IR builder" that is capable of generating most of
> the
> > > > SQL standard, targeting many different SQL dialects and other
> backends
> > > > (like pandas). Microsoft ([4]) and Google ([5]) have used this
> library
> > > > as a "many-SQL" middleware. As such, I would like to be able to
> > > > translate between the in-memory "logical query" data structures in a
> > > > library like Ibis to a serialized format that can be executed by many
> > > > different Arrow-native query engines. The expression primitives
> > > > available in Ibis should serve as helpful test cases, too.
> > > >
> > > > I look forward to the community's comments on the RFC document [1]
> and
> > > > pull request [2] -- I realize that arriving at consensus on a complex
> > > > and ambitious project like this can be challenging so I recommend
> > > > spending time on the "non-goals" section in the RFC and ask questions
> > > > if you are unclear about the scope of what problems this is trying to
> > > > solve. I would be happy to give Edit access on the RFC document to
> > > > others and would consider ideas about how to move forward with
> > > > something that is able to be implemented by different Arrow libraries
> > > > in the reasonably near future.
> > > >
> > > > Thanks,
> > > > Wes
> > > >
> > > > [1]:
> > >
> https://docs.google.com/document/d/1C_XVOG7iFkl6cgWWMyzUoIjfKt-X2UxqagPJrla0bAE/edit#
> > > > [2]: https://github.com/apache/arrow/pull/10856
> > > > [3]: https://ibis-project.org/
> > > > [4]: http://cidrdb.org/cidr2021/papers/cidr2021_paper08.pdf
> > > > [5]:
> > >
> https://cloud.google.com/blog/products/databases/automate-data-validation-with-dvt
> > >
> > >
>

Re: [DISCUSS] Developing an "Arrow Compute IR [Intermediate Representation]" to decouple language front ends from Arrow-native compute engines

Posted by Wes McKinney <we...@gmail.com>.

Thank you for all the feedback and comments on the document. I'm on
vacation this week, so I'm delayed responding to everything, but I
will get to it as quickly as I can. I will be at VLDB in Copenhagen
next week if anyone would like to chat in person about it, and we can
relay the content of any discussions back to the document/PR/e-mail
thread.

I know that Phillip Cloud expressed interest in working on the PR and
helping work through many of the details, so I'm glad to have the
help. If there are others who would like to work on the PR or dig into
the details, please let me know. We might need to figure out how to
accommodate "many cooks" by setting up the ComputeIR project somewhere
separate from the format/ directory to permit it to exist in a
Work-In-Progress status for a period of time until we work through the
various details and design concerns.

Re Julian's comment

> The biggest surprise is that this language does full relational operations. I was expecting that it would do fragments of the operations.

There's a related but different (yet still interesting and worthy of
analysis) problem of creating an "engine language" that describes more
mechanically the constituent parts of implementing the relational
operators. To create a functional computation language with concrete
Arrow data structures as a top-level primitive sounds like an
interesting research area where I could see something developing
eventually.

The main problem I'm interested in solving right now is enabling front
ends that have sufficient understanding of relational algebra and data
frame operations to talk to engines without having to go backwards
from their logical query plans to SQL. So as mentioned in the
document, being able to faithfully carry the relational operator node
information generated by Calcite or Ibis or another system would be
super useful. Defining the semantics of various kinds of user-defined
functions would also be helpful to standardize the
engine-to-user-language UDF/extension interface.

On Tue, Aug 10, 2021 at 2:36 PM Dimitri Vorona <al...@gmail.com> wrote:
>
> Hi Wes,
>
> cool initiative! Reminded me of "Building Advanced SQL Analytics From
> Low-Level Plan Operators" from SIGMOD 2021 (
> http://db.in.tum.de/~kohn/papers/lolepops-sigmod21.pdf) which proposes a
> set of building block for advanced aggregation.
>
> Cheers,
> Dimitri.
>
> On Thu, Aug 5, 2021 at 7:59 PM Julian Hyde <jh...@gmail.com> wrote:
>
> > Wes,
> >
> > Thanks for this. I’ve added comments to the doc and to the PR.
> >
> > The biggest surprise is that this language does full relational
> > operations. I was expecting that it would do fragments of the operations.
> > Consider join. A distributed hybrid hash join needs to partition rows into
> > output buffers based on a hash key, build hash tables, probe into hash
> > tables, scan hash tables for untouched “outer”rows, and so forth.
> >
> > I see Arrow’s compute as delivering each of those operations, working on
> > perhaps a batch at a time, or a sequence of batches, with some other system
> > coordinating those tasks. So I would expect to see Arrow’s compute language
> > mainly operating on batches rather than a table abstraction.
> >
> > Julian
> >
> >
> > > On Aug 2, 2021, at 5:16 PM, Wes McKinney <we...@gmail.com> wrote:
> > >
> > > hi folks,
> > >
> > > This idea came up in passing in the past -- given that there are
> > > multiple independent efforts to develop Arrow-native query engines
> > > (and surely many more to come), it seems like it would be valuable to
> > > have a way to enable user languages (like Java, Python, R, or Rust,
> > > for example) to communicate with backend computing engines (like
> > > DataFusion, or new computing capabilities being built in the Arrow C++
> > > library) in a fashion that is "lower-level" than SQL and specialized
> > > to Arrow's type system. So rather than leaving it to a SQL parser /
> > > analyzer framework to generate an expression tree of relational
> > > operators and then translate that to an Arrow-native compute-engine's
> > > internal grammer, a user framework could provide the desired
> > > Arrow-native expression tree / data manipulations directly and skip
> > > the SQL altogether.
> > >
> > > The idea of creating a "serialized intermediate representation (IR)"
> > > for Arrow compute operations would be to serve use cases large and
> > > small -- from the most complex TPC-* or time series database query to
> > > the most simple array predicate/filter sent with an RPC request using
> > > Arrow Flight. It is deliberately language- and engine-agnostic, with
> > > the only commonality across usages being the Arrow columnar format
> > > (schemas and array types). This would be better than leaving it to
> > > each application to develop its own bespoke expression representations
> > > for its needs.
> > >
> > > I spent a while thinking about this and wrote up a brain dump RFC
> > > document [1] and accompanying pull request [2] that makes the possibly
> > > controversial choice of using Flatbuffers to represent the serialized
> > > IR. I discuss the rationale for the choice of Flatbuffers in the RFC
> > > document. This PR is obviously deficient in many regards (incomplete,
> > > hacky, or unclear in places), and will need some help from others to
> > > flesh out. I suspect that actually implementing the IR will be
> > > necessary to work out many of the low-level details.
> > >
> > > Note that this IR is intended to be more of a "superset" project than
> > > a "lowest common denominator". So there may be things that it includes
> > > which are only available in some engines (e.g. engines that have
> > > specialized handling of time series data).
> > >
> > > As some of my personal motivation for the project, concurrent with the
> > > genesis of Apache Arrow, I started a Python project called Ibis [3]
> > > (which is similar to R's dplyr project) which serves as a "Python
> > > analytical query IR builder" that is capable of generating most of the
> > > SQL standard, targeting many different SQL dialects and other backends
> > > (like pandas). Microsoft ([4]) and Google ([5]) have used this library
> > > as a "many-SQL" middleware. As such, I would like to be able to
> > > translate between the in-memory "logical query" data structures in a
> > > library like Ibis to a serialized format that can be executed by many
> > > different Arrow-native query engines. The expression primitives
> > > available in Ibis should serve as helpful test cases, too.
> > >
> > > I look forward to the community's comments on the RFC document [1] and
> > > pull request [2] -- I realize that arriving at consensus on a complex
> > > and ambitious project like this can be challenging so I recommend
> > > spending time on the "non-goals" section in the RFC and ask questions
> > > if you are unclear about the scope of what problems this is trying to
> > > solve. I would be happy to give Edit access on the RFC document to
> > > others and would consider ideas about how to move forward with
> > > something that is able to be implemented by different Arrow libraries
> > > in the reasonably near future.
> > >
> > > Thanks,
> > > Wes
> > >
> > > [1]:
> > https://docs.google.com/document/d/1C_XVOG7iFkl6cgWWMyzUoIjfKt-X2UxqagPJrla0bAE/edit#
> > > [2]: https://github.com/apache/arrow/pull/10856
> > > [3]: https://ibis-project.org/
> > > [4]: http://cidrdb.org/cidr2021/papers/cidr2021_paper08.pdf
> > > [5]:
> > https://cloud.google.com/blog/products/databases/automate-data-validation-with-dvt
> >
> >

Re: [DISCUSS] Developing an "Arrow Compute IR [Intermediate Representation]" to decouple language front ends from Arrow-native compute engines

Posted by Dimitri Vorona <al...@gmail.com>.

Hi Wes,

cool initiative! Reminded me of "Building Advanced SQL Analytics From
Low-Level Plan Operators" from SIGMOD 2021 (
http://db.in.tum.de/~kohn/papers/lolepops-sigmod21.pdf) which proposes a
set of building block for advanced aggregation.

Cheers,
Dimitri.

On Thu, Aug 5, 2021 at 7:59 PM Julian Hyde <jh...@gmail.com> wrote:

> Wes,
>
> Thanks for this. I’ve added comments to the doc and to the PR.
>
> The biggest surprise is that this language does full relational
> operations. I was expecting that it would do fragments of the operations.
> Consider join. A distributed hybrid hash join needs to partition rows into
> output buffers based on a hash key, build hash tables, probe into hash
> tables, scan hash tables for untouched “outer”rows, and so forth.
>
> I see Arrow’s compute as delivering each of those operations, working on
> perhaps a batch at a time, or a sequence of batches, with some other system
> coordinating those tasks. So I would expect to see Arrow’s compute language
> mainly operating on batches rather than a table abstraction.
>
> Julian
>
>
> > On Aug 2, 2021, at 5:16 PM, Wes McKinney <we...@gmail.com> wrote:
> >
> > hi folks,
> >
> > This idea came up in passing in the past -- given that there are
> > multiple independent efforts to develop Arrow-native query engines
> > (and surely many more to come), it seems like it would be valuable to
> > have a way to enable user languages (like Java, Python, R, or Rust,
> > for example) to communicate with backend computing engines (like
> > DataFusion, or new computing capabilities being built in the Arrow C++
> > library) in a fashion that is "lower-level" than SQL and specialized
> > to Arrow's type system. So rather than leaving it to a SQL parser /
> > analyzer framework to generate an expression tree of relational
> > operators and then translate that to an Arrow-native compute-engine's
> > internal grammer, a user framework could provide the desired
> > Arrow-native expression tree / data manipulations directly and skip
> > the SQL altogether.
> >
> > The idea of creating a "serialized intermediate representation (IR)"
> > for Arrow compute operations would be to serve use cases large and
> > small -- from the most complex TPC-* or time series database query to
> > the most simple array predicate/filter sent with an RPC request using
> > Arrow Flight. It is deliberately language- and engine-agnostic, with
> > the only commonality across usages being the Arrow columnar format
> > (schemas and array types). This would be better than leaving it to
> > each application to develop its own bespoke expression representations
> > for its needs.
> >
> > I spent a while thinking about this and wrote up a brain dump RFC
> > document [1] and accompanying pull request [2] that makes the possibly
> > controversial choice of using Flatbuffers to represent the serialized
> > IR. I discuss the rationale for the choice of Flatbuffers in the RFC
> > document. This PR is obviously deficient in many regards (incomplete,
> > hacky, or unclear in places), and will need some help from others to
> > flesh out. I suspect that actually implementing the IR will be
> > necessary to work out many of the low-level details.
> >
> > Note that this IR is intended to be more of a "superset" project than
> > a "lowest common denominator". So there may be things that it includes
> > which are only available in some engines (e.g. engines that have
> > specialized handling of time series data).
> >
> > As some of my personal motivation for the project, concurrent with the
> > genesis of Apache Arrow, I started a Python project called Ibis [3]
> > (which is similar to R's dplyr project) which serves as a "Python
> > analytical query IR builder" that is capable of generating most of the
> > SQL standard, targeting many different SQL dialects and other backends
> > (like pandas). Microsoft ([4]) and Google ([5]) have used this library
> > as a "many-SQL" middleware. As such, I would like to be able to
> > translate between the in-memory "logical query" data structures in a
> > library like Ibis to a serialized format that can be executed by many
> > different Arrow-native query engines. The expression primitives
> > available in Ibis should serve as helpful test cases, too.
> >
> > I look forward to the community's comments on the RFC document [1] and
> > pull request [2] -- I realize that arriving at consensus on a complex
> > and ambitious project like this can be challenging so I recommend
> > spending time on the "non-goals" section in the RFC and ask questions
> > if you are unclear about the scope of what problems this is trying to
> > solve. I would be happy to give Edit access on the RFC document to
> > others and would consider ideas about how to move forward with
> > something that is able to be implemented by different Arrow libraries
> > in the reasonably near future.
> >
> > Thanks,
> > Wes
> >
> > [1]:
> https://docs.google.com/document/d/1C_XVOG7iFkl6cgWWMyzUoIjfKt-X2UxqagPJrla0bAE/edit#
> > [2]: https://github.com/apache/arrow/pull/10856
> > [3]: https://ibis-project.org/
> > [4]: http://cidrdb.org/cidr2021/papers/cidr2021_paper08.pdf
> > [5]:
> https://cloud.google.com/blog/products/databases/automate-data-validation-with-dvt
>
>

Re: [DISCUSS] Developing an "Arrow Compute IR [Intermediate Representation]" to decouple language front ends from Arrow-native compute engines

Posted by Julian Hyde <jh...@gmail.com>.

Wes,

Thanks for this. I’ve added comments to the doc and to the PR.

The biggest surprise is that this language does full relational operations. I was expecting that it would do fragments of the operations. Consider join. A distributed hybrid hash join needs to partition rows into output buffers based on a hash key, build hash tables, probe into hash tables, scan hash tables for untouched “outer”rows, and so forth.

I see Arrow’s compute as delivering each of those operations, working on perhaps a batch at a time, or a sequence of batches, with some other system coordinating those tasks. So I would expect to see Arrow’s compute language mainly operating on batches rather than a table abstraction.

Julian


> On Aug 2, 2021, at 5:16 PM, Wes McKinney <we...@gmail.com> wrote:
> 
> hi folks,
> 
> This idea came up in passing in the past -- given that there are
> multiple independent efforts to develop Arrow-native query engines
> (and surely many more to come), it seems like it would be valuable to
> have a way to enable user languages (like Java, Python, R, or Rust,
> for example) to communicate with backend computing engines (like
> DataFusion, or new computing capabilities being built in the Arrow C++
> library) in a fashion that is "lower-level" than SQL and specialized
> to Arrow's type system. So rather than leaving it to a SQL parser /
> analyzer framework to generate an expression tree of relational
> operators and then translate that to an Arrow-native compute-engine's
> internal grammer, a user framework could provide the desired
> Arrow-native expression tree / data manipulations directly and skip
> the SQL altogether.
> 
> The idea of creating a "serialized intermediate representation (IR)"
> for Arrow compute operations would be to serve use cases large and
> small -- from the most complex TPC-* or time series database query to
> the most simple array predicate/filter sent with an RPC request using
> Arrow Flight. It is deliberately language- and engine-agnostic, with
> the only commonality across usages being the Arrow columnar format
> (schemas and array types). This would be better than leaving it to
> each application to develop its own bespoke expression representations
> for its needs.
> 
> I spent a while thinking about this and wrote up a brain dump RFC
> document [1] and accompanying pull request [2] that makes the possibly
> controversial choice of using Flatbuffers to represent the serialized
> IR. I discuss the rationale for the choice of Flatbuffers in the RFC
> document. This PR is obviously deficient in many regards (incomplete,
> hacky, or unclear in places), and will need some help from others to
> flesh out. I suspect that actually implementing the IR will be
> necessary to work out many of the low-level details.
> 
> Note that this IR is intended to be more of a "superset" project than
> a "lowest common denominator". So there may be things that it includes
> which are only available in some engines (e.g. engines that have
> specialized handling of time series data).
> 
> As some of my personal motivation for the project, concurrent with the
> genesis of Apache Arrow, I started a Python project called Ibis [3]
> (which is similar to R's dplyr project) which serves as a "Python
> analytical query IR builder" that is capable of generating most of the
> SQL standard, targeting many different SQL dialects and other backends
> (like pandas). Microsoft ([4]) and Google ([5]) have used this library
> as a "many-SQL" middleware. As such, I would like to be able to
> translate between the in-memory "logical query" data structures in a
> library like Ibis to a serialized format that can be executed by many
> different Arrow-native query engines. The expression primitives
> available in Ibis should serve as helpful test cases, too.
> 
> I look forward to the community's comments on the RFC document [1] and
> pull request [2] -- I realize that arriving at consensus on a complex
> and ambitious project like this can be challenging so I recommend
> spending time on the "non-goals" section in the RFC and ask questions
> if you are unclear about the scope of what problems this is trying to
> solve. I would be happy to give Edit access on the RFC document to
> others and would consider ideas about how to move forward with
> something that is able to be implemented by different Arrow libraries
> in the reasonably near future.
> 
> Thanks,
> Wes
> 
> [1]: https://docs.google.com/document/d/1C_XVOG7iFkl6cgWWMyzUoIjfKt-X2UxqagPJrla0bAE/edit#
> [2]: https://github.com/apache/arrow/pull/10856
> [3]: https://ibis-project.org/
> [4]: http://cidrdb.org/cidr2021/papers/cidr2021_paper08.pdf
> [5]: https://cloud.google.com/blog/products/databases/automate-data-validation-with-dvt

Fwd: [DISCUSS] Developing an "Arrow Compute IR [Intermediate Representation]" to decouple language front ends from Arrow-native compute engines

Posted by Julian Hyde <jh...@gmail.com>.

On dev@arrow, Wes McKinney just started a discussion of a language for “programming” Arrow. One of its explicit goals is for systems like Calcite to generate such code as their physical plan.

This would be great for Calcite. Arrow started as an excellent runtime for high-performance data processing.

Julian


> Begin forwarded message:
> 
> From: Wes McKinney <we...@gmail.com>
> Subject: [DISCUSS] Developing an "Arrow Compute IR [Intermediate Representation]" to decouple language front ends from Arrow-native compute engines
> Date: August 2, 2021 at 5:16:50 PM PDT
> To: dev <de...@arrow.apache.org>
> Reply-To: dev@arrow.apache.org
> 
> hi folks,
> 
> This idea came up in passing in the past -- given that there are
> multiple independent efforts to develop Arrow-native query engines
> (and surely many more to come), it seems like it would be valuable to
> have a way to enable user languages (like Java, Python, R, or Rust,
> for example) to communicate with backend computing engines (like
> DataFusion, or new computing capabilities being built in the Arrow C++
> library) in a fashion that is "lower-level" than SQL and specialized
> to Arrow's type system. So rather than leaving it to a SQL parser /
> analyzer framework to generate an expression tree of relational
> operators and then translate that to an Arrow-native compute-engine's
> internal grammer, a user framework could provide the desired
> Arrow-native expression tree / data manipulations directly and skip
> the SQL altogether.
> 
> The idea of creating a "serialized intermediate representation (IR)"
> for Arrow compute operations would be to serve use cases large and
> small -- from the most complex TPC-* or time series database query to
> the most simple array predicate/filter sent with an RPC request using
> Arrow Flight. It is deliberately language- and engine-agnostic, with
> the only commonality across usages being the Arrow columnar format
> (schemas and array types). This would be better than leaving it to
> each application to develop its own bespoke expression representations
> for its needs.
> 
> I spent a while thinking about this and wrote up a brain dump RFC
> document [1] and accompanying pull request [2] that makes the possibly
> controversial choice of using Flatbuffers to represent the serialized
> IR. I discuss the rationale for the choice of Flatbuffers in the RFC
> document. This PR is obviously deficient in many regards (incomplete,
> hacky, or unclear in places), and will need some help from others to
> flesh out. I suspect that actually implementing the IR will be
> necessary to work out many of the low-level details.
> 
> Note that this IR is intended to be more of a "superset" project than
> a "lowest common denominator". So there may be things that it includes
> which are only available in some engines (e.g. engines that have
> specialized handling of time series data).
> 
> As some of my personal motivation for the project, concurrent with the
> genesis of Apache Arrow, I started a Python project called Ibis [3]
> (which is similar to R's dplyr project) which serves as a "Python
> analytical query IR builder" that is capable of generating most of the
> SQL standard, targeting many different SQL dialects and other backends
> (like pandas). Microsoft ([4]) and Google ([5]) have used this library
> as a "many-SQL" middleware. As such, I would like to be able to
> translate between the in-memory "logical query" data structures in a
> library like Ibis to a serialized format that can be executed by many
> different Arrow-native query engines. The expression primitives
> available in Ibis should serve as helpful test cases, too.
> 
> I look forward to the community's comments on the RFC document [1] and
> pull request [2] -- I realize that arriving at consensus on a complex
> and ambitious project like this can be challenging so I recommend
> spending time on the "non-goals" section in the RFC and ask questions
> if you are unclear about the scope of what problems this is trying to
> solve. I would be happy to give Edit access on the RFC document to
> others and would consider ideas about how to move forward with
> something that is able to be implemented by different Arrow libraries
> in the reasonably near future.
> 
> Thanks,
> Wes
> 
> [1]: https://docs.google.com/document/d/1C_XVOG7iFkl6cgWWMyzUoIjfKt-X2UxqagPJrla0bAE/edit#
> [2]: https://github.com/apache/arrow/pull/10856
> [3]: https://ibis-project.org/
> [4]: http://cidrdb.org/cidr2021/papers/cidr2021_paper08.pdf
> [5]: https://cloud.google.com/blog/products/databases/automate-data-validation-with-dvt