Posted to dev@arrow.apache.org by Wes McKinney <we...@gmail.com> on 2019/01/05 21:02:29 UTC

Arrow Rust roadmapping [was Re: [Gandiva] Representing logical query plans in protobuf]

hi Neville,

On Sat, Jan 5, 2019 at 2:37 PM Neville Dipale <ne...@gmail.com> wrote:
>
> Hi Andy & Wes,
>
> Apologies if I go off-topic a bit, I hope my thoughts are related though.
>
> I'm a new contributor to Arrow, but I've been using and following it since
> the feather days. I'm interested in contributing to Rust, as that aligns
> more with my day job(s).
>
> I think we (rather, the Rust contributors) can benefit from direction via a
> roadmap for a few releases (or for 2019), so that new contributors can find
> it easier to add value.
>
> What I've observed so far (on Rust and sometimes other languages) is that
> although there's the grand goal that exists (e.g. Ursa Labs' intentions), a
> lot of work scheduling is haphazard. For example, a lot of JIRAs are opened
> by developers, and then a PR is submitted not long after. Bug reports are
> exceptions. This phenomenon, even if minor, makes it difficult for someone
> to pick up work and contribute.
>
> I would propose the following for Rust and other less-maturely-supported
> languages like C#:
>
> 1. We look at gaps relative to python/cpp feature-wise, and create JIRAs
> for functionality that doesn't yet exist. For example, Rust doesn't have
> date/time support (I created a JIRA a few weeks ago)
> 2. For some of these features where more effort is required, provide some
> rough outline of what needs to be done.
> 3. For components/features that are common across languages (CSV), agree on
> an overall design which languages can abide by as far as possible. CPP might
> be the template, but it's already likely that Go and Rust are doing their
> own thing, which might lead to inconsistent UX to Arrow users down the
> line. Such a design might already exist, but I haven't seen anything yet.
> This can include creating common test data like is being done with Parquet.
>

I don't mean to dismiss this concern, but leadership (what you are
asking for) in any kind of software project (whether open source or
not) is a _lot_ of work. If there is not an individual with the time
and space to effectively be a "product manager" then this work
generally does not happen, and development gets done on an ad hoc
basis based on what features people need to build the applications
they are working on.

One of the roles I've played over the last ~3 years in the project is
the chief JIRA wrangler for the C++ and Python implementations.
According to

https://cwiki.apache.org/confluence/display/ARROW/JIRA+Health+Dashboard

I have created almost 1500 JIRAs. If you want this work to happen for
Rust or one of the other implementations, generally either you will
have to do it, or you will have to find someone to compensate to do
it. Otherwise there may be a volunteer who will step up, but there's
no guarantee.

> Beyond in-memory rep, computing kernels are the hot thing on Arrow right
> now, with Gandiva being the crown jewel. We currently have *array_ops* in
> Rust, where Andy's been adding some operations (sum, add, mul, etc.).
>
> 4. I think we need some explicit decision-making on whether to continue
> this route, which might not be mutually exclusive to future Gandiva
> bindings (based on Wes' comments on what Gandiva's role is).

Apache projects are effectively do-ocracies that operate on the basis
of consensus. It is up to the contributors of the subcomponents to
self-manage what gets built and what does not get built. Much
consensus may be "lazy" (where absence of opinions implies consent).
It is a good idea to discuss objectives and requirements on the
mailing list so there is a written record of what the consensus was.
In the absence of "yay" arguments from contributors, ultimately the
decisions get made by the people doing the work.

> 5. If *array_ops* is the way to go, we could define the types of ops we
> want to support. This could be as easy as looking at what Pandas or Spark
> support. Having a growing suite of functions could encourage users to build
> on Arrow like DataFusion is doing. This would also help Andy push the SQL
> parser and DF's query engine (per goals and roadmap) while other people do
> the grunt-work of various functions.
>
> I believe the above could make it clearer for newbies like me, to
> contribute more to Arrow, and give us a better sense of what we can and
> can't do with Arrow in our daily applications.

It would be helpful to have a development roadmap for Rust, or for the
other language implementations. With luck someone will be able to
volunteer to take on this work.

- Wes

>
> Thanks
> Neville
>
>
> On Sat, 5 Jan 2019 at 20:29, Andy Grove <an...@gmail.com> wrote:
>
> > Wes,
> >
> > That makes sense.
> >
> > I'll create a fresh PR to add a new protobuf under the Rust module for now
> > (even though this won't be Rust specific).
> >
> > Thanks,
> >
> > Andy.
> >
> >
> > On Sat, Jan 5, 2019 at 9:19 AM Wes McKinney <we...@gmail.com> wrote:
> >
> > > hey Andy,
> > >
> > > I replied on GitHub and then saw your e-mail thread.
> > >
> > > The Gandiva library as it stands right now is not a query engine or an
> > > execution engine, properly speaking. It is a subgraph compiler for
> > > creating accelerated expressions for use inside another execution or
> > > query engine, like it is being used now in Dremio.
> > >
> > > For this reason I am -1 on adding logical query plan definitions to
> > > Gandiva until a more rigorous design effort takes place to decide
> > > where to build an actual query/execution engine (which includes file /
> > > dataset scanners, projections, joins, aggregates, filters, etc.) in
> > > C++. My preference is to start building a from-the-ground-up system
> > > that will depend on Gandiva to compile expressions during execution.
> > > Among other things, I don't think it is necessarily a good idea to
> > > require a query engine to depend on LLVM, so tight coupling to an
> > > LLVM-based component may not be desirable.
> > >
> > > In the meantime, if you want to start creating an (experimental)
> > > Protobuf / Flatbuffer definition to define a general query execution
> > > plan (that lives outside Gandiva for the time being) to assist with
> > > building a query engine in Rust, I think that is fine, but I want to
> > > make sure we are being deliberate and layering the project components
> > > in a good way
> > >
> > > - Wes
> > >
> > > On Sat, Jan 5, 2019 at 8:15 AM Andy Grove <an...@gmail.com> wrote:
> > > >
> > > > I have created a PR to start a discussion around representing logical
> > > > query plans in Gandiva (ARROW-4163).
> > > >
> > > > https://github.com/apache/arrow/pull/3319
> > > >
> > > > I think that adding the various steps such as projection, selection,
> > > > sort, and so on is fairly simple and not contentious. The harder part is
> > > > how we represent data sources, since this likely has different meanings
> > > > for different use cases. My thought is that we can register data sources
> > > > by name (similar to CREATE EXTERNAL TABLE in Hadoop) or tie this into the
> > > > IPC metadata somehow so we can pass memory addresses and schema
> > > > information.
> > > >
> > > > I would love to hear others' thoughts on this.
> > > >
> > > > Thanks,
> > > >
> > > > Andy.
> > >
> >
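
For readers following the quoted discussion above, here is a minimal sketch of the plan steps Andy lists (projection, selection, sort, and data sources registered by name), written as plain Rust enums rather than the Protobuf definitions from the PR. Every name in it (Expr, LogicalPlan, TableScan, example_plan, the "trips" table) is hypothetical and chosen for illustration; none of it is taken from Arrow, Gandiva, or DataFusion.

```rust
// Hypothetical sketch: these types are illustrative only and are not
// the Arrow, Gandiva, or DataFusion API.

/// A scalar expression appearing in a plan.
#[derive(Debug, Clone, PartialEq)]
pub enum Expr {
    /// Reference to a column by name.
    Column(String),
    /// 64-bit integer literal.
    Literal(i64),
    /// `left > right`
    Gt(Box<Expr>, Box<Expr>),
}

/// A logical plan node; every step except the scan wraps its input.
#[derive(Debug, Clone, PartialEq)]
pub enum LogicalPlan {
    /// Data source registered by name, to be resolved by the engine.
    TableScan { name: String },
    /// Keep only rows for which `predicate` is true.
    Selection { predicate: Expr, input: Box<LogicalPlan> },
    /// Keep only the listed expressions as output columns.
    Projection { exprs: Vec<Expr>, input: Box<LogicalPlan> },
    /// Order rows by the given expressions.
    Sort { keys: Vec<Expr>, input: Box<LogicalPlan> },
}

impl LogicalPlan {
    /// Walk down to the leaf and return the registered source name.
    pub fn source_name(&self) -> &str {
        match self {
            LogicalPlan::TableScan { name } => name.as_str(),
            LogicalPlan::Selection { input, .. }
            | LogicalPlan::Projection { input, .. }
            | LogicalPlan::Sort { input, .. } => input.source_name(),
        }
    }
}

/// Builds the plan for:
/// SELECT city FROM trips WHERE fare > 10 ORDER BY city
pub fn example_plan() -> LogicalPlan {
    let scan = LogicalPlan::TableScan { name: "trips".to_string() };
    let filtered = LogicalPlan::Selection {
        predicate: Expr::Gt(
            Box::new(Expr::Column("fare".to_string())),
            Box::new(Expr::Literal(10)),
        ),
        input: Box::new(scan),
    };
    let projected = LogicalPlan::Projection {
        exprs: vec![Expr::Column("city".to_string())],
        input: Box::new(filtered),
    };
    LogicalPlan::Sort {
        keys: vec![Expr::Column("city".to_string())],
        input: Box::new(projected),
    }
}
```

Referring to the source only by name keeps the plan free of memory addresses and file details, which is one of the two options raised in the quoted message; the IPC metadata route is not sketched here.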

Re: Arrow Rust roadmapping [was Re: [Gandiva] Representing logical query plans in protobuf]

Posted by "Uwe L. Korn" <ma...@uwekorn.com>.

Hello Andy,

one thing that came up in past discussions, and that also made me more open to the parquet-cpp merge, is that merging code into a repo doesn't mean it will always reside there. Apache has the infrastructure and guidelines to split part of a project off into a separate one. This is how the Java part of Arrow historically came out of Drill, and other projects like Calcite have spin-offs like Avatica that turn(ed) into their own projects (you should also have a look at Calcite; it might be useful for DataFusion).

Thus, if there is further interest in the Arrow Rust community, you should definitely consider whether some of this code would *currently* develop faster as part of the Arrow repo. If a separate community forms around it in the future and it is no longer tightly coupled to core Arrow development, make it a separate project again.

Cheers,

Uwe

Re: Arrow Rust roadmapping [was Re: [Gandiva] Representing logical query plans in protobuf]

Posted by Andy Grove <an...@gmail.com>.

Hi Wes,

Yes, I have a SQL parser (actually this is a separate crate) and DataFusion
has the query planner and execution engine. Here is a blog post from last
summer with some performance comparisons with Apache Spark:

https://andygrove.io/2018/05/datafusion-aggregate-performance/

I have recently been updating the code to work with my fork of Arrow and
currently it only works with CSV and not Parquet, but adding Parquet
support again will be simple once the Arrow reader is added (others are
working on this already).

I guess I should write this up in more detail and we can open it up to a
vote here to see if there is an appetite to donate and support this code
here?

Thanks,

Andy.

Re: Arrow Rust roadmapping [was Re: [Gandiva] Representing logical query plans in protobuf]

Posted by Wes McKinney <we...@gmail.com>.

hi Andy,

On Sat, Jan 5, 2019 at 3:59 PM Andy Grove <an...@gmail.com> wrote:
>
> Thanks Neville for starting this discussion.
>
> The next set of things I am interested in now that we have some primitive
> operators in place is performing aggregate queries over a sequence of
> RecordBatches (in fact I just got that working in DataFusion this morning)
> and then moving onto other SQL features such as ORDER BY and then adding
> support for scalar and array UDFs (supporting native Rust functions and/or
> calling shared objects that can be authored in C/C++ or Rust).
>
> I am also interested in the arrow reader for Parquet that is in progress,
> and then introducing a common data source trait to wrap CSV, Parquet, and
> future data sources.
>
> I like the idea of being able to define logical query plans in the Arrow
> project but maybe that only makes sense if there is an execution engine to
> use it.
>
> I am open to donating the current DataFusion code as a Rust-native
> execution engine for Arrow if that makes sense (currently supports
> projection, selection, and simple aggregates). Maybe having SQL directly in
> the Arrow project is going too far? However, the logical query plan +
> execution might make good sense.

This sounds pretty interesting.

If you have a SQL front end for the query engine and there is a
willingness from the Rust developers to maintain the code, it doesn't
sound unreasonable.

>
> Thanks,
>
> Andy.

Re: Arrow Rust roadmapping [was Re: [Gandiva] Representing logical query plans in protobuf]

Posted by Andy Grove <an...@gmail.com>.

Thanks Neville for starting this discussion.

The next set of things I am interested in now that we have some primitive
operators in place is performing aggregate queries over a sequence of
RecordBatches (in fact I just got that working in DataFusion this morning)
and then moving onto other SQL features such as ORDER BY and then adding
support for scalar and array UDFs (supporting native Rust functions and/or
calling shared objects that can be authored in C/C++ or Rust).

I am also interested in the arrow reader for Parquet that is in progress,
and then introducing a common data source trait to wrap CSV, Parquet, and
future data sources.
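
As a rough illustration of the two ideas above (a common data source trait, and an aggregate computed over a sequence of batches), here is a minimal std-only sketch. Batch, DataSource, MemorySource, and sum_all are hypothetical names invented for this example; they stand in for Arrow's RecordBatch and for real CSV or Parquet readers, and are not the Arrow or DataFusion API.

```rust
// Hypothetical sketch using only std; `Batch`, `DataSource`, and
// `MemorySource` are made-up names, not the Arrow or DataFusion API.

/// Stand-in for an Arrow RecordBatch: a single nullable Int64 column.
pub struct Batch {
    pub values: Vec<Option<i64>>,
}

/// Common interface that CSV, Parquet, or other readers could implement.
pub trait DataSource {
    /// Returns the next batch, or `None` when the source is exhausted.
    fn next_batch(&mut self) -> Option<Batch>;
}

/// In-memory source, standing in for a real file reader.
pub struct MemorySource {
    batches: std::vec::IntoIter<Batch>,
}

impl MemorySource {
    pub fn new(batches: Vec<Batch>) -> Self {
        MemorySource { batches: batches.into_iter() }
    }
}

impl DataSource for MemorySource {
    fn next_batch(&mut self) -> Option<Batch> {
        self.batches.next()
    }
}

/// SUM over every batch from the source, skipping nulls.
/// Returns `None` if no non-null value was seen (SQL-style).
pub fn sum_all(source: &mut dyn DataSource) -> Option<i64> {
    let mut total: Option<i64> = None;
    while let Some(batch) = source.next_batch() {
        // `flatten` drops the `None` (null) entries.
        for v in batch.values.into_iter().flatten() {
            total = Some(total.unwrap_or(0) + v);
        }
    }
    total
}
```

Because sum_all only sees the trait, the same aggregate would work unchanged over any reader that implements it, which is the point of wrapping the sources behind one interface.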

I like the idea of being able to define logical query plans in the Arrow
project but maybe that only makes sense if there is an execution engine to
use it.

I am open to donating the current DataFusion code as a Rust-native
execution engine for Arrow if that makes sense (currently supports
projection, selection, and simple aggregates). Maybe having SQL directly in
the Arrow project is going too far? However, the logical query plan +
execution might make good sense.

Thanks,

Andy.

Re: Arrow Rust roadmapping [was Re: [Gandiva] Representing logical query plans in protobuf]

Posted by Neville Dipale <ne...@gmail.com>.

Hi Wes,

I'm aware of your views on the amount of work that leadership on OSS
projects takes, and as for the time aspect, one only has to look at
someone's local timezone to see the hours and days they work.

To be proactive, I'll hash together a rough roadmap for Rust, and share
it on the mailing list when it's ready. Where I need guidance on features,
I'll put in enough research so I don't spend other contributors' time on
open-ended problems.

Thanks for responding, really appreciate it

Neville

On Sat, 5 Jan 2019 at 23:03, Wes McKinney <we...@gmail.com> wrote:

> hi Neville,
>
> On Sat, Jan 5, 2019 at 2:37 PM Neville Dipale <ne...@gmail.com>
> wrote:
> >
> > Hi Andy & Wes,
> >
> > Apologies if I go off-topic a bit, I hope my thoughts are related though.
> >
> > I'm a new contributor to Arrow, but I've been using and following it since
> > the feather days. I'm interested in contributing to Rust, as that aligns
> > more with my day job(s).
> >
> > I think we (rather, the Rust contributors) can benefit from direction via a
> > roadmap for a few releases (or for 2019), so that new contributors can find
> > it easier to add value.
> >
> > What I've observed so far (on Rust and sometimes other languages) is that
> > although there's the grand goal that exists (e.g. Ursa Labs' intentions), a
> > lot of work scheduling is haphazard. For example, a lot of JIRAs are opened
> > by developers, and then a PR is submitted not long after. Bug reports are
> > exceptions. This phenomenon, even if minor, makes it difficult for someone
> > to pick up work and contribute.
> >
> > I would propose the following for Rust and other less-maturely-supported
> > languages like C#:
> >
> > 1. We look at gaps relative to python/cpp feature-wise, and create JIRAs
> > for functionality that doesn't yet exist. For example, Rust doesn't have
> > date/time support (I created a JIRA a few weeks ago)
> > 2. For some of these features where more effort is required, provide some
> > rough outline of what needs to be done.
> > 3. For components/features that are common across languages (CSV), agree on
> > overall design which languages can abide to as far as possible. CPP might
> > be the template, but it's already likely that Go and Rust are doing their
> > own thing, which might lead to inconsistent UX to Arrow users down the
> > line. Such a design might already exist, but I haven't seen anything yet.
> > This can include creating common test data like is being done with Parquet.
> >
>
> I don't mean to dismiss this concern, but leadership (what you are
> asking for) in any kind of software project (whether open source or
> not) is a _lot_ of work. If there is not an individual with the time
> and space to effectively be a "product manager" then this work
> generally does not happen, and development gets done on an ad hoc
> basis based on what features people need to build the applications
> they are working on.
>
> One of the roles I've played over the last ~3 years in the project is
> the chief JIRA wrangler for the C++ and Python implementations.
> According to
>
> https://cwiki.apache.org/confluence/display/ARROW/JIRA+Health+Dashboard
>
> I have created almost 1500 JIRAs. If you want this work to happen for
> Rust or one of the other implementations, generally either you will
> have to do it, or you will have to find someone to compensate to do
> it. Otherwise there may be a volunteer who will step up, but there's
> no guarantee.
>
> > Beyond in-memory rep, computing kernels are the hot thing on Arrow right
> > now, with Gandiva being the crown jewel. We currently have *array_ops* in
> > Rust, where Andy's been adding some operations (sum, add, mul, etc.).
> >
> > 4. I think we need some explicit decision-making on whether to continue
> > this route, which might not be mutually exclusive to future Gandiva
> > bindings (based on Wes' comments on what Gandiva's role is).
>
> Apache projects are effectively do-ocracies that operate on the basis
> of consensus. It is up to the contributors of the subcomponents to
> self-manage what gets built and what does not get built. Much
> consensus may be "lazy" (where absence of opinions implies consent).
> It is a good idea to discuss objectives and requirements on the
> mailing list so there is a written record of what the consensus was.
> In the absence of "yay" arguments from contributors, ultimately the
> decisions get made by the people doing the work.
>
> > 5. If *array_ops* is the way to go, we could define the types of ops we
> > want to support. This could be as easy as looking at what Pandas or Spark
> > support. Having a growing suite of functions could encourage users to build
> > on Arrow like DataFusion is doing. This would also help Andy push the SQL
> > parser and DF's query engine (per goals and roadmap) while other people do
> > the grunt-work of various functions.
> >
> > I believe the above could make it clearer for newbies like me, to
> > contribute more to Arrow, and give us a better sense of what we can and
> > can't do with Arrow in our daily applications.
>
> It would be helpful to have a development roadmap for Rust, or for the
> other language implementations. With luck someone will be able to
> volunteer to take on this work.
>
> - Wes
>
> >
> > Thanks
> > Neville
> >
> >
> > On Sat, 5 Jan 2019 at 20:29, Andy Grove <an...@gmail.com> wrote:
> >
> > > Wes,
> > >
> > > That makes sense.
> > >
> > > I'll create a fresh PR to add a new protobuf under the Rust module for now
> > > (even though this won't be Rust specific).
> > >
> > > Thanks,
> > >
> > > Andy.
> > >
> > >
> > > On Sat, Jan 5, 2019 at 9:19 AM Wes McKinney <we...@gmail.com> wrote:
> > >
> > > > hey Andy,
> > > >
> > > > I replied on GitHub and then saw your e-mail thread.
> > > >
> > > > The Gandiva library as it stands right now is not a query engine or an
> > > > execution engine, properly speaking. It is a subgraph compiler for
> > > > creating accelerated expressions for use inside another execution or
> > > > query engine, like it is being used now in Dremio.
> > > >
> > > > For this reason I am -1 on adding logical query plan definitions to
> > > > Gandiva until a more rigorous design effort takes place to decide
> > > > where to build an actual query/execution engine (which includes file /
> > > > dataset scanners, projections, joins, aggregates, filters, etc.) in
> > > > C++. My preference is to start building a from-the-ground-up system
> > > > that will depend on Gandiva to compile expressions during execution.
> > > > Among other things, I don't think it is necessarily a good idea to
> > > > require a query engine to depend on LLVM, so tight coupling to an
> > > > LLVM-based component may not be desirable.
> > > >
> > > > In the meantime, if you want to start creating an (experimental)
> > > > Protobuf / Flatbuffer definition to define a general query execution
> > > > plan (that lives outside Gandiva for the time being) to assist with
> > > > building a query engine in Rust, I think that is fine, but I want to
> > > > make sure we are being deliberate and layering the project components
> > > > in a good way.
> > > >
> > > > - Wes
> > > >
> > > > On Sat, Jan 5, 2019 at 8:15 AM Andy Grove <an...@gmail.com> wrote:
> > > > >
> > > > > I have created a PR to start a discussion around representing logical query
> > > > > plans in Gandiva (ARROW-4163).
> > > > >
> > > > > https://github.com/apache/arrow/pull/3319
> > > > >
> > > > > I think that adding the various steps such as projection, selection, sort,
> > > > > and so on are fairly simple and not contentious. The harder part is how we
> > > > > represent data sources since this likely has different meanings to
> > > > > different use cases. My thought is that we can register data sources by
> > > > > name (similar to CREATE EXTERNAL TABLE in Hadoop) or tie this into the IPC
> > > > > meta-data somehow so we can pass memory addresses and schema information.
> > > > >
> > > > > I would love to hear others' thoughts on this.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Andy.
> > > >
> > >
>