Posted to dev@arrow.apache.org by Andrew Lamb <al...@influxdata.com> on 2022/12/26 18:12:24 UTC

[DISCUSS] State of the Arrow Project 2022

Hi all,

I am very excited and honored to help steer the Arrow Project this year as
Arrow PMC Chair.

Something Kou suggested, and the PMC thought would be valuable, is to have
a small retrospective about the state of the project and where we want to
take it. I would like to try doing so via a “state of the project” type
discussion on this mailing list, inspired by an example from Apache Calcite
[1].

I welcome any / all comments on the following topics: What things /
activities, if any, do you think the Apache Arrow Community should:

1. Continue
2. Start
3. Stop

My thoughts are below.

Andrew

[1] https://lists.apache.org/thread/tx8gw3vxc4kwfzjs6q2gqwgywnsm1zbf

Continue:

I hope we can continue to encourage and support community growth, focused
especially on supporting the sub projects and their leadership. I also
would like to continue and grow the outward facing evangelism about the
project with blog posts and presentations.

Start:

Lower the barrier to contributing, and to accepting contributions, even
further, especially for casual contributors. I see the move from JIRA to
GitHub issues as one example of lowering this barrier (by reducing the
required account maintenance). I would love to see additional improvements
in areas like documentation, examples, no-invite-needed chat, etc.

Stop:

It would be nice to stop (reduce) the reliance on the relatively small
number of core contributors for code review. I don’t have any particular
insight on how to accomplish this, and suspect we will always have less
review capacity than we would like, but it would be nice to encourage the
growth.

Re: [DISCUSS] State of the Arrow Project 2022

Posted by Weston Pace <we...@gmail.com>.

Start:

There have been a few calls in the past for an improved workflow for
reviewing PRs.  I think a bot that highlights pull requests that need
attention (e.g. those with no reviews in the "changes requested" state,
along with some way of knowing how long each has been waiting) would be
helpful.
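
As a rough illustration of the selection rule such a bot might apply, here
is a minimal Python sketch. It makes no GitHub calls. The record fields
(`number`, `state`, `review_states`, `last_activity`) and the seven-day
threshold are assumptions made up for this example, not an existing Arrow
tool or API.

```python
from datetime import datetime, timedelta, timezone


def needs_attention(pr, now, max_wait=timedelta(days=7)):
    """True for an open PR with no "changes requested" review that has waited too long."""
    if pr["state"] != "open":
        return False
    if "CHANGES_REQUESTED" in pr["review_states"]:
        # The next step belongs to the author, not to the reviewers.
        return False
    return now - pr["last_activity"] >= max_wait


def triage(prs, now, max_wait=timedelta(days=7)):
    """Return the PRs that need attention, longest-waiting first."""
    waiting = [pr for pr in prs if needs_attention(pr, now, max_wait)]
    return sorted(waiting, key=lambda pr: pr["last_activity"])


if __name__ == "__main__":
    now = datetime(2023, 1, 8, tzinfo=timezone.utc)
    # Hypothetical records; a real bot would build these from the hosting API.
    prs = [
        {"number": 1, "state": "open", "review_states": [],
         "last_activity": datetime(2022, 12, 20, tzinfo=timezone.utc)},
        {"number": 2, "state": "open", "review_states": ["CHANGES_REQUESTED"],
         "last_activity": datetime(2022, 12, 1, tzinfo=timezone.utc)},
    ]
    for pr in triage(prs, now):
        print(pr["number"], (now - pr["last_activity"]).days, "days waiting")
```

Sorting by last activity puts the longest-waiting PRs at the top of whatever
report the bot posts.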

There has been some discussion on how we encourage community
contribution.  I think, in addition to these suggestions, we should
create content that encourages and guides community maintenance.  This
might help with the PR review burden as well. For example, one does
not need to be an expert in the language or the design in order to
review and make suggestions on user facing API changes (in fact, one
might argue that users are less likely to make incorrect decisions due
to design burden).  Someone that knows a language well but not
necessarily the library design could still help check for style issues
(e.g. moves, references, etc. in C++).  Github/ASF have a
"contributor" role that can be given to individuals that allows them
some labelling / PR maintenance permissions without write access.  We
could have a policy for this and encourage people to help close /
revive stale PRs (or ping potential reviewers if it looks like a PR is
stale because it is lacking review).

Continue:

The C data interface, streaming interface, ADBC, etc. are very cool
and allow for more seamless intra-process communication.  However,
this is a rather complex topic.  For example, it is not obvious why
something called "the C data interface" would be useful when
connecting (for example) a Rust library to a Java library.  I think we
should continue to develop these APIs as well as spread general
developer awareness of the topics.
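
To make the "why would a C interface help Rust talk to Java" point a bit
more concrete: the C data interface is a pair of plain C struct layouts, so
any runtime with a C FFI can fill one in or read one. The sketch below
mirrors `struct ArrowArray` with Python's ctypes and plays both producer and
consumer in one process. It is a simplified illustration only: the helper
names (`export_int32`, `import_int32`, `_keep_alive`) are invented for this
example, the companion `ArrowSchema` struct is omitted, and the consumer
simply assumes it was handed an int32 array with no nulls.

```python
import ctypes


class ArrowArray(ctypes.Structure):
    """Mirror of `struct ArrowArray`; fields are attached below because the struct refers to itself."""


# The release callback receives a pointer to the struct being released.
ReleaseFn = ctypes.CFUNCTYPE(None, ctypes.POINTER(ArrowArray))

ArrowArray._fields_ = [
    ("length", ctypes.c_int64),
    ("null_count", ctypes.c_int64),
    ("offset", ctypes.c_int64),
    ("n_buffers", ctypes.c_int64),
    ("n_children", ctypes.c_int64),
    ("buffers", ctypes.POINTER(ctypes.c_void_p)),
    ("children", ctypes.POINTER(ctypes.POINTER(ArrowArray))),
    ("dictionary", ctypes.POINTER(ArrowArray)),
    ("release", ReleaseFn),
    ("private_data", ctypes.c_void_p),
]

# Producer-owned memory, keyed by struct address, kept alive until release.
_keep_alive = {}


@ReleaseFn
def _release(array_ptr):
    """Producer's release callback: drop the buffers and mark the struct released."""
    array = array_ptr.contents
    _keep_alive.pop(ctypes.addressof(array), None)
    array.release = ReleaseFn()  # a NULL release pointer marks the struct as released


def export_int32(values):
    """Producer side: describe an int32 column without nulls."""
    data = (ctypes.c_int32 * len(values))(*values)
    # Buffer 0 is the validity bitmap (NULL: no nulls), buffer 1 holds the values.
    buffers = (ctypes.c_void_p * 2)(None, ctypes.addressof(data))
    array = ArrowArray()
    array.length = len(values)
    array.null_count = 0
    array.offset = 0
    array.n_buffers = 2
    array.n_children = 0
    array.buffers = ctypes.cast(buffers, ctypes.POINTER(ctypes.c_void_p))
    array.release = _release
    _keep_alive[ctypes.addressof(array)] = (data, buffers)
    return array


def import_int32(array):
    """Consumer side: read the values through raw pointers, then release."""
    data = ctypes.cast(array.buffers[1], ctypes.POINTER(ctypes.c_int32))
    values = [data[array.offset + i] for i in range(array.length)]
    array.release(ctypes.byref(array))
    return values


if __name__ == "__main__":
    print(import_int32(export_int32([1, 2, 3])))
```

Nothing in the consumer depends on who produced the struct, which is the
whole point: a Rust library and a Java library only need to agree on this
layout, not on each other's object models.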

The community focus of the Arrow project is great.  A lot of effort is
put into CI, various build environments, and aiming to support as wide
a developer base as possible.  At the same time there are various
independent Arrow projects that overlap to some degree and yet they
all (from what I can see) encourage and inspire each other.  I don't
know how unique this is in open source development but I've found it
to be non-existent in private development.  I want to congratulate
this community for being so awesome.  Development is always a series
of ups and downs and I'd encourage us all, even when frustrated, to
continue extending the helping hand to the community (both
contributors and users).

On Sat, Jan 7, 2023 at 3:23 PM Jacob Wujciak
<ja...@voltrondata.com.invalid> wrote:
>
> +1 to the existing suggestions, this is such a great thread, great start to
> the year!
> A theme from this thread I would like to pick up is "community ux": The
> community overview on the arrow page is quite small and afaik the different
> sync calls and Zulip are not documented anywhere, so we should improve that
> (ideally with more graphics too :D). With the number of different arrow
> subprojects continuing to grow it might be nice to have a sort of
> 'community overview' that helps people new to arrow orient themselves in
> and explore the arrow ecosystem. Ideally this would also be accessible from
> github as I think there are a number of people that enter the arrow world
> purely through gh.
>
> I think for the roadmaps we could use projects in the respective repos. As
> reference the official GH roadmap does this too [1].
>
> I have opened a PR with a PR template [2]; please review and add feedback
> on the wording. I adapted the existing Rust templates and the hint that gets
> posted when the title is malformed.
>
> [1]: https://github.com/orgs/github/projects/4247
> [2]: https://github.com/apache/arrow/pull/15250
>

Re: [DISCUSS] State of the Arrow Project 2022

Posted by Jacob Wujciak <ja...@voltrondata.com.INVALID>.

+1 to the existing suggestions, this is such a great thread, great start to
the year!
A theme from this thread I would like to pick up is "community ux": The
community overview on the arrow page is quite small and afaik the different
sync calls and Zulip are not documented anywhere, so we should improve that
(ideally with more graphics too :D). With the number of different arrow
subprojects continuing to grow it might be nice to have a sort of
'community overview' that helps people new to arrow orient themselves in
and explore the arrow ecosystem. Ideally this would also be accessible from
github as I think there are a number of people that enter the arrow world
purely through gh.

I think for the roadmaps we could use projects in the respective repos. As
reference the official GH roadmap does this too [1].

I have opened a PR with a PR template [2]; please review and add feedback
on the wording. I adapted the existing Rust templates and the hint that gets
posted when the title is malformed.
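
For readers unfamiliar with this kind of title check, here is a minimal
Python sketch of the idea. The title convention it enforces
(`GH-<issue number>: [<Component>] <summary>`, or `MINOR:` in place of the
issue reference) is an assumption for illustration; the actual rule and hint
wording are whatever the linked PR and the project's automation define.

```python
import re

# Hypothetical convention, assumed for illustration only.
TITLE_PATTERN = re.compile(r"^(GH-\d+|MINOR): (\[[^\]]+\])+ \S.*$")


def title_hint(title):
    """Return None when the title matches, otherwise a hint to post on the PR."""
    if TITLE_PATTERN.match(title):
        return None
    return (
        "PR title should look like "
        "'GH-<issue number>: [<Component>] <summary>'; got: " + repr(title)
    )


if __name__ == "__main__":
    for title in ["GH-15232: [Docs] Add PR template", "Add PR template"]:
        print(title, "->", title_hint(title))
```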

[1]: https://github.com/orgs/github/projects/4247
[2]: https://github.com/apache/arrow/pull/15250

On Sat, Jan 7, 2023 at 8:53 PM Andrew Lamb <al...@influxdata.com> wrote:

> We have used pull request templates in the various rust projects to good
> effect: most PRs clearly describe what they are doing and why.
>
> For your reference, they are at arrow-rs[1] and arrow-datafusion[2].
>
> [1]
>
> https://raw.githubusercontent.com/apache/arrow-rs/master/.github/pull_request_template.md
> [2]
>
> https://raw.githubusercontent.com/apache/arrow-datafusion/master/.github/pull_request_template.md
>
> On Fri, Jan 6, 2023 at 11:18 PM Will Jones <wi...@gmail.com>
> wrote:
>
> > Thanks, Kevin.
> >
> > > Documenting a process for determining who should be included on a code
> > > review would be helpful.
> >
> > That's a good idea. We have a docs page directed at contributors, but I'm
> > not sure how many people have read it [1]. This would be a good addition
> > to it. (There's also a good guide on reviewing contributions [2].) I also
> > like the idea of pull request templates, and it seems like if we provide a
> > link in the template to this overview, more of our contributors would read
> > the guide. I have created an issue for this [3].
> >
> > Also +1 on more diagrams. I've created a couple recently (for example [4])
> > and hope to make more.
> >
> > [1] https://arrow.apache.org/docs/developers/overview.html
> > [2] https://arrow.apache.org/docs/developers/reviewing.html
> > [3] https://github.com/apache/arrow/issues/15232
> > [4] https://arrow.apache.org/docs/format/Glossary.html#term-table
> >
> > On Fri, Jan 6, 2023 at 12:26 PM Kevin Gurney <kg...@mathworks.com>
> > wrote:
> >
> > > Thank you for starting this discussion, Andrew!
> > >
> > > Fiona, Sreehari, and I thought a bit about this, and I've summarized some
> > > of our thoughts below.
> > >
> > > Continue:
> > >
> > > 1. +1 to Will's suggestion about roadmaps for sub-projects. This is
> > > something that would be helpful for the MATLAB interface, for example. We
> > > would also be interested in the possibility of exploring a MATLAB sync
> > > call if it would be of interest to other community members.
> > >
> > > 2. Continue focusing on building an inclusive developer community. Finish
> > > the work required to rename the master branch to main. Consider running
> > > automated checks on pull requests using a tool like alex [1] to prevent
> > > use of inappropriate language and terminology.
> > >
> > > Start:
> > >
> > > 1. Add more visuals and diagrams to the documentation. It can be pretty
> > > overwhelming for new community members to look at the in-depth Arrow C++
> > > documentation and be able to quickly get a high-level understanding of
> > > how the various data structures (e.g. buffer, array, chunked array, record
> > > batch, table, field, schema, data type, etc.) relate to one another.
> > > Having more visuals with clear labels that show the relationship between
> > > these key concepts would be very helpful. This also applies to other parts
> > > of the documentation, like the CI systems (e.g. crossbow), which have a
> > > lot of moving parts.
> > >
> > > 2. Use pull request templates. This would hopefully make it easier for
> > > both new and existing contributors to describe their changes in a focused
> > > and clear way to others. For example, when making pull requests related
> > > to the MATLAB interface, we've been trying to follow a fairly consistent
> > > pattern for pull request descriptions which includes sections like
> > > "Overview", "Implementation", "Testing", "Future Directions", "Notes",
> > > etc.
> > >
> > > Stop:
> > >
> > > 1. +1 to Andrew's point about the reliance on a small number of core
> > > contributors for code reviews. Documenting a process for determining who
> > > should be included on a code review would be helpful.
> > >
> > > [1] https://github.com/get-alex/alex
> > >
> > > ________________________________
> > > From: Dewey Dunnington <de...@voltrondata.com.INVALID>
> > > Sent: Tuesday, January 3, 2023 2:33 PM
> > > To: dev@arrow.apache.org <de...@arrow.apache.org>
> > > Subject: Re: [DISCUSS] State of the Arrow Project 2022
> > >
> > > First, a +1000 on Will's blog post! [1]
> > >
> > > Continue:
> > >
> > > Building tools that benefit users of all languages, with particular kudos
> > > to ADBC for providing an ABI-stable way to write database drivers that
> > > can be used by practitioners in C++, Ruby, Python, Java, Go, and (soon!) R.
> > >
> > > Start:
> > >
> > > I wonder if this is the year that we can find a way to write compute
> > > functions in such a way that separate implementations don't have to exist
> > > for C++, Go, and Rust (and maybe others I don't know about).
> > >
> > > Stop:
> > >
> > > Will's comment that we should stop building data scientist-facing tools
> > > under the Arrow name struck a particular chord with me...the R package is
> > > very much data scientist facing and we have a rather large disjoint
> > > between the technical capacity of our users and the technical capacity
> > > required to contribute to the package (e.g., maintaining a development
> > > Arrow C++ install). The types of things we have to do to make
> > > RecordBatchReader, Arrays, Buffer, RecordBatch and Table structures
> > > available to R users and the types of things we have to do to provide an
> > > Acero dplyr backend are vastly different.
> > >
> > > [1] https://www.datawill.io/posts/apache-arrow-2022-reflection/
> > >
> > > On Thu, Dec 29, 2022 at 4:09 PM Jacob Wujciak
> > > <ja...@voltrondata.com.invalid>
> > > wrote:
> > >
> > > > This is a great idea, I will add some thoughts later but just wanted to
> > > > quickly add that the Zulip Chat [1] was recently switched to allow
> > > > anyone to register without the need for an invite link!
> > > > [1]: https://ursalabs.zulipchat.com/
> > > >
> > > >
> > > > On Wed, Dec 28, 2022 at 11:27 PM Will Jones <will.jones127@gmail.com
> >
> > > > wrote:
> > > >
> > > > > Thanks for suggesting this Andrew.
> > > > >
> > > > > I just uploaded a blog post with my thoughts in long form [1]. Here
> > > > > are some suggestions pulled from that:
> > > > >
> > > > > Continue:
> > > > >
> > > > > I hope we will continue prioritizing updating the spec for new array
> > > > > formats. [2] I think this is very important for avoiding
> > > > > fragmentation and may even open opportunities for consolidation in
> > > > > the C++ ecosystem.
> > > > >
> > > > > +1 on additional improvements for documentation, examples, no-invite
> > > > > chats. I am particularly keen on seeing evangelism for our protocols;
> > > > > existing ones like C Data Interface aren't nearly as widely known as
> > > > > they ought to be and I'm excited for new ones like ADBC.
> > > > >
> > > > > Start:
> > > > >
> > > > > Find ways for each subproject to publicly develop a clear roadmap.
> > > > > Otherwise by default these discussions happen in private, either
> > > > > between individual ICs or within corporate environments. Some
> > > > > subprojects, such as Acero could likely use their own sync call to
> > > > > help facilitate this, even if on a slower cadence than the main
> > > > > biweekly call.
> > > > >
> > > > > Also, other sync calls might consider adapting to the sync call note
> > > > > style used in the Rust projects, where all notes are in one google
> > > > > doc [3] rather than spread across main mailing list threads. That
> > > > > seems like a format that would make it easy for new contributors to
> > > > > catch up on the major focuses of the project.
> > > > >
> > > > > Stop:
> > > > >
> > > > > Don't create end-user (e.g. data scientist) facing tools under the
> > > > > name Arrow; prefer keeping separate brand identities for those tools
> > > > > and keeping arrow libraries as developer-facing libraries.
> > > > >
> > > > > [1] https://www.datawill.io/posts/apache-arrow-2022-reflection/
> > > > > [2] https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq
> > > > > [3] https://docs.google.com/document/d/1atCVnoff5SR4eM4Lwf2M1BBJTY6g3_HUNR6qswYJW_U/edit#heading=h.qkuvi08gk4qa
> > > > >

Re: [DISCUSS] State of the Arrow Project 2022

Posted by Andrew Lamb <al...@influxdata.com>.

We have used pull request templates in the various rust projects to good
effect: most PRs clearly describe what they are doing and why.

For your reference, they are at arrow-rs[1] and arrow-datafusion[2].

[1]
https://raw.githubusercontent.com/apache/arrow-rs/master/.github/pull_request_template.md
[2]
https://raw.githubusercontent.com/apache/arrow-datafusion/master/.github/pull_request_template.md

On Fri, Jan 6, 2023 at 11:18 PM Will Jones <wi...@gmail.com> wrote:

> Thanks, Kevin.
>
> > Documenting a process for determining who should be included on a code
> > review would be helpful.
>
> That's a good idea. We have a docs page directed at contributors, but I'm
> not sure how many people have read it [1]. This would be a good addition to
> it. (There's also a good guide on reviewing contributions [2].) I also like
> the idea of pull request templates, and it seems like if we provide a link
> in the template to this overview, more of our contributors would read the
> guide. I have created an issue for this [3].
>
>  Also +1 on more diagrams. I've created a couple recently (for example [4])
> and hope to make more.
>
> [1] https://arrow.apache.org/docs/developers/overview.html
> [2] https://arrow.apache.org/docs/developers/reviewing.html
> [3] https://github.com/apache/arrow/issues/15232
> [4] https://arrow.apache.org/docs/format/Glossary.html#term-table
>
> On Fri, Jan 6, 2023 at 12:26 PM Kevin Gurney <kg...@mathworks.com>
> wrote:
>
> > Thank you for starting this discussion, Andrew!
> >
> > Fiona, Sreehari, and I thought a bit about this, and I've summarized some
> > of our thoughts below.
> >
> > Continue:
> >
> > 1. +1 to Will's suggestion about roadmaps for sub-projects. This is
> > something that would be helpful for the MATLAB interface, for example. We
> > would also be interested in the possibility of exploring a MATLAB sync
> > call if it would be of interest to other community members.
> >
> > 2. Continue focusing on building an inclusive developer community. Finish
> > the work required to rename the master branch to main. Consider running
> > automated checks on pull requests using a tool like alex [1] to prevent
> > use of inappropriate language and terminology.
> >
> > Start:
> >
> > 1. Add more visuals and diagrams to the documentation. It can be pretty
> > overwhelming for new community members to look at the in-depth Arrow C++
> > documentation and be able to quickly get a high-level understanding of
> > how the various data structures (e.g. buffer, array, chunked array, record
> > batch, table, field, schema, data type, etc.) relate to one another.
> > Having more visuals with clear labels that show the relationship between
> > these key concepts would be very helpful. This also applies to other parts
> > of the documentation, like the CI systems (e.g. crossbow), which have a
> > lot of moving parts.
> >
> > 2. Use pull request templates. This would hopefully make it easier for
> > both new and existing contributors to describe their changes in a focused
> > and clear way to others. For example, when making pull requests related
> > to the MATLAB interface, we've been trying to follow a fairly consistent
> > pattern for pull request descriptions which includes sections like
> > "Overview", "Implementation", "Testing", "Future Directions", "Notes",
> > etc.
> >
> > Stop:
> >
> > 1. +1 to Andrew's point about the reliance on a small number of core
> > contributors for code reviews. Documenting a process for determining who
> > should be included on a code review would be helpful.
> >
> > [1] https://github.com/get-alex/alex
> >
> > ________________________________
> > From: Dewey Dunnington <de...@voltrondata.com.INVALID>
> > Sent: Tuesday, January 3, 2023 2:33 PM
> > To: dev@arrow.apache.org <de...@arrow.apache.org>
> > Subject: Re: [DISCUSS] State of the Arrow Project 2022
> >
> > First, a +1000 on Will's blog post! [1]
> >
> > Continue:
> >
> > Building tools that benefit users of all languages, with particular kudos
> > to ADBC for providing an ABI-stable way to write database drivers that
> can
> > be used by practitioners in C++, Ruby, Python, Java, Go, and (soon!) R.
> >
> > Start:
> >
> > I wonder if this is the year that we can find a way to write compute
> > functions in such a way that separate implementations don't have to exist
> > for C++, Go, and Rust (and maybe others I don't know about).
> >
> > Stop:
> >
> > Will's comment that we should stop building data scientist-facing tools
> > under the Arrow name struck a particular chord with me...the R package is
> > very much data scientist facing and we have a rather large disjoint
> between
> > the technical capacity of our users and the technical capacity required
> to
> > contribute to the package (e.g., maintaining a development Arrow C++
> > install). The types of things we have to do to make RecordBatchReader,
> > Arrays, Buffer, RecordBatch and Table structures available to R users and
> > the types of things we have to do to provide an Acero dplyr backend are
> > vastly different.
> >
> > [1] https://www.datawill.io/posts/apache-arrow-2022-reflection/<
> > https://www.datawill.io/posts/apache-arrow-2022-reflection>
> >
> > On Thu, Dec 29, 2022 at 4:09 PM Jacob Wujciak
> > <ja...@voltrondata.com.invalid>
> > wrote:
> >
> > > This is a great idea, I will add some thoughts later but just wanted to
> > > quickly add that the Zulip Chat [1] was recently switched to allow
> anyone
> > > to register without the need for an invite link!
> > > [1]: https://ursalabs.zulipchat.com/<https://ursalabs.zulipchat.com>
> > >
> > >
> > > On Wed, Dec 28, 2022 at 11:27 PM Will Jones <wi...@gmail.com>
> > > wrote:
> > >
> > > > Thanks for suggesting this Andrew.
> > > >
> > > > I just uploaded a blog post with my thoughts in long form [1]. Here
> are
> > > > some suggestions pulled from that:
> > > >
> > > > Continue:
> > > >
> > > > I hope we will continue prioritizing updating the spec for new array
> > > > formats. [2] I think this is very important for avoiding
> fragmentation
> > > and
> > > > may even open opportunities for consolidation in the C++ ecosystem.
> > > >
> > > > +1 on additional improvements for documentation, examples, no-invite
> > > chats.
> > > > I am particularly keen on seeing evangelism for our protocols;
> existing
> > > > ones like C Data Interface aren't nearly as widely known as they
> ought
> > to
> > > > be and I'm excited for new ones like ADBC.
> > > >
> > > > Start:
> > > >
> > > > Find ways for each subproject to publicly develop a clear roadmap.
> > > > Otherwise by default these discussions happen in private, either
> > between
> > > > individual ICs or within corporate environments. Some subprojects,
> such
> > > as
> > > > Acero could likely use their own sync call to help facilitate this,
> > even
> > > if
> > > > on a slower cadence than the main biweekly call.
> > > >
> > > > Also, other sync calls might consider adapting to the sync call note
> > > style
> > > > used in the Rust projects, where all notes are in one google doc [3]
> > > rather
> > > > than spread across main mailing list threads. That seems like a
> format
> > > that
> > > > would make it easy for new contributors to catch up on the major
> > focuses
> > > of
> > > > the project.
> > > >
> > > > Stop:
> > > >
> > > > Don't create end-user (e.g. data scientist) facing tools under the
> name
> > > > Arrow; prefer keeping separate brand identities for those tools and
> > > keeping
> > > > arrow libraries as developer-facing libraries.
> > > >
> > > > [1] https://www.datawill.io/posts/apache-arrow-2022-reflection/<
> > https://www.datawill.io/posts/apache-arrow-2022-reflection/>
> > > > [2] https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq
> <
> > https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq>
> > > > [3]
> > > >
> > > >
> > >
> >
> https://docs.google.com/document/d/1atCVnoff5SR4eM4Lwf2M1BBJTY6g3_HUNR6qswYJW_U/edit#heading=h.qkuvi08gk4qa
> > <
> >
> https://docs.google.com/document/d/1atCVnoff5SR4eM4Lwf2M1BBJTY6g3_HUNR6qswYJW_U/edit#heading=h.qkuvi08gk4qa
> > >
> > > >
> > > > On Mon, Dec 26, 2022 at 10:12 AM Andrew Lamb <al...@influxdata.com>
> > > wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > I am very excited and honored to help steer the Arrow Project this
> > year
> > > > as
> > > > > Arrow PMC Chair.
> > > > >
> > > > > Something Kou suggested, and the PMC thought would be valuable, is
> to
> > > > have
> > > > > a small retrospective about the state of the project and where we
> > want
> > > to
> > > > > take it. I would like to try doing so via a “state of the project”
> > > type
> > > > > discussion on this mailing list, inspired by an example from Apache
> > > > Calcite
> > > > > [1].
> > > > >
> > > > > I welcome any / all comments on the following topics: What things /
> > > > > activities, if any, do you you think the Apache Arrow Community
> > should:
> > > > >
> > > > > 1. Continue
> > > > > 2. Start
> > > > > 3. Stop
> > > > >
> > > > > My thoughts are below.
> > > > >
> > > > > Andrew
> > > > >
> > > > > [1]
> https://lists.apache.org/thread/tx8gw3vxc4kwfzjs6q2gqwgywnsm1zbf
> > <https://lists.apache.org/thread/tx8gw3vxc4kwfzjs6q2gqwgywnsm1zbf>
> > > > >
> > > > > Continue:
> > > > >
> > > > > I hope we can continue to encourage and support community growth,
> > > focused
> > > > > especially on supporting the sub projects and their leadership. I
> > also
> > > > > would like to continue and grow the outward facing evangelism about
> > the
> > > > > project with blog posts and presentations.
> > > > >
> > > > > Start:
> > > > >
> > > > > Lower the barrier to contributors and accepting those contributions
> > > even
> > > > > more, especially for casual contributors. The move to github issues
> > > from
> > > > > JIRA I see as one example of lowering this barrier (by reducing the
> > > > > required account maintenance). I would love to see additional
> > > > improvements
> > > > > in areas like documentation, examples, no-invite-needed chat, etc.
> > > > >
> > > > > Stop:
> > > > >
> > > > > It would be nice to stop (reduce) the reliance on the relatively
> > small
> > > > > number of core contributors for code review. I don’t have any
> > > particular
> > > > > insight on how to accomplish this, and suspect we will always have
> > less
> > > > > review capacity than we would like, but it would be nice to
> encourage
> > > the
> > > > > growth.
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] State of the Arrow Project 2022

Posted by Will Jones <wi...@gmail.com>.

Thanks, Kevin.

> Documenting a process for determining who should be included on a code
> review would be helpful.

That's a good idea. We have a docs page directed at contributors, but I'm
not sure how many people have read it [1]. This would be a good addition to
it. (There's also a good guide on reviewing contributions [2].) I also like
the idea of pull request templates, and it seems like if we provide a link
in the template to this overview, more of our contributors would read the
guide. I have created an issue for this [3].

Also +1 on more diagrams. I've created a couple recently (for example [4])
and hope to make more.

[1] https://arrow.apache.org/docs/developers/overview.html
[2] https://arrow.apache.org/docs/developers/reviewing.html
[3] https://github.com/apache/arrow/issues/15232
[4] https://arrow.apache.org/docs/format/Glossary.html#term-table

On Fri, Jan 6, 2023 at 12:26 PM Kevin Gurney <kg...@mathworks.com> wrote:

> Thank you for starting this discussion, Andrew!
>
> Fiona, Sreehari, and I thought a bit about this, and I've summarized some
> of our thoughts below.
>
> Continue:
>
> 1. +1 to Will's suggestion about roadmaps for sub-projects. This is
> something that would be helpful for the MATLAB interface, for example. We
> would also be interested in the possibility of exploring a MATLAB sync call
> if it would be of interest to other community members.
>
> 2. Continue focusing on building an inclusive developer community. Finish
> the work required to rename the master branch to main. Consider running
> automated checks on pull requests using a tool like alex [1] to prevent use
> of inappropriate language and terminology.
>
> Start:
>
> 1. Add more visuals and diagrams to the documentation. It can be pretty
> overwhelming for new community members to look at the in-depth Arrow C++
> documentation and be able to quickly get a high-level understanding of how
> the various data structures (e.g. buffer, array, chunked array, record
> batch, table, field, schema, data type, etc.) relate to one another. Having
> more visuals with clear labels that show the relationship between these key
> concepts would be very helpful. This also applies to other parts of the
> documentation, like the CI systems (e.g. crossbow), which have a lot of
> moving parts.
>
> 2. Use pull request templates. This would hopefully make it easier for
> both new and existing contributors to describe their changes in a focused
> and clear way to others. For example, when making pull requests related to
> the MATLAB interface, we've been trying to follow a fairly consistent
> pattern for pull request descriptions which includes sections like
> "Overview", "Implementation", "Testing", "Future Directions", "Notes", etc.
>
> Stop:
>
> 1. +1 to Andrew's point about the reliance on a small number of core
> contributors for code reviews. Documenting a process for determining who
> should be included on a code review would be helpful.
>
> [1] https://github.com/get-alex/alex

Re: [DISCUSS] State of the Arrow Project 2022

Posted by Kevin Gurney <kg...@mathworks.com>.

Thank you for starting this discussion, Andrew!

Fiona, Sreehari, and I thought a bit about this, and I've summarized some of our thoughts below.

Continue:

1. +1 to Will's suggestion about roadmaps for sub-projects. This is something that would be helpful for the MATLAB interface, for example. We would also be interested in the possibility of exploring a MATLAB sync call if it would be of interest to other community members.

2. Continue focusing on building an inclusive developer community. Finish the work required to rename the master branch to main. Consider running automated checks on pull requests using a tool like alex [1] to prevent use of inappropriate language and terminology.

Start:

1. Add more visuals and diagrams to the documentation. It can be pretty overwhelming for new community members to look at the in-depth Arrow C++ documentation and be able to quickly get a high-level understanding of how the various data structures (e.g. buffer, array, chunked array, record batch, table, field, schema, data type, etc.) relate to one another. Having more visuals with clear labels that show the relationship between these key concepts would be very helpful. This also applies to other parts of the documentation, like the CI systems (e.g. crossbow), which have a lot of moving parts.

2. Use pull request templates. This would hopefully make it easier for both new and existing contributors to describe their changes in a focused and clear way to others. For example, when making pull requests related to the MATLAB interface, we've been trying to follow a fairly consistent pattern for pull request descriptions which includes sections like "Overview", "Implementation", "Testing", "Future Directions", "Notes", etc.
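
To make item 1 under Start a bit more concrete, here is a rough sketch of how the data structures listed there relate to one another. It is a toy model in plain Python, not the Arrow C++ or pyarrow API: the class layout is simplified, the helper `int32_array` and the `INT32` constant are made up for illustration, and real arrays carry details (validity bitmaps, offsets, nested children) that are left out here.

```python
import struct
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DataType:
    """Logical type of a column, e.g. int32 or utf8."""
    name: str


@dataclass(frozen=True)
class Field:
    """A named column slot: a name plus a data type."""
    name: str
    type: DataType


@dataclass
class Schema:
    """An ordered list of fields describing a record batch or table."""
    fields: List[Field]


@dataclass
class Buffer:
    """A contiguous block of bytes."""
    data: bytes


@dataclass
class Array:
    """One contiguous run of values: a type, a length, and its buffers."""
    type: DataType
    length: int
    buffers: List[Buffer]


@dataclass
class ChunkedArray:
    """A logical column made of several arrays that share one type."""
    type: DataType
    chunks: List[Array]

    def __post_init__(self):
        if any(chunk.type != self.type for chunk in self.chunks):
            raise TypeError("all chunks must share the chunked array's type")

    @property
    def length(self) -> int:
        return sum(chunk.length for chunk in self.chunks)


def _check_columns(schema: Schema, columns) -> None:
    """Columns must match the schema one-to-one and have equal lengths."""
    if len(columns) != len(schema.fields):
        raise ValueError("one column is required per schema field")
    for field, column in zip(schema.fields, columns):
        if column.type != field.type:
            raise TypeError(f"column {field.name!r} does not match its field type")
    if len({column.length for column in columns}) > 1:
        raise ValueError("all columns must have the same length")


@dataclass
class RecordBatch:
    """A schema plus one contiguous Array per field."""
    schema: Schema
    columns: List[Array]

    def __post_init__(self):
        _check_columns(self.schema, self.columns)


@dataclass
class Table:
    """A schema plus one ChunkedArray per field."""
    schema: Schema
    columns: List[ChunkedArray]

    def __post_init__(self):
        _check_columns(self.schema, self.columns)

    @property
    def num_rows(self) -> int:
        return self.columns[0].length if self.columns else 0


INT32 = DataType("int32")


def int32_array(values: List[int]) -> Array:
    """Hypothetical helper: pack ints into a little-endian values buffer."""
    data = struct.pack(f"<{len(values)}i", *values)
    return Array(INT32, len(values), [Buffer(data)])
```

Reading it bottom-up: a table is a schema plus chunked arrays, a chunked array is a list of arrays, and an array is a type plus buffers. A record batch is the same idea as a table, but with exactly one contiguous array per field. A diagram in the docs could show the same containment relationships.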

Stop:

1. +1 to Andrew's point about the reliance on a small number of core contributors for code reviews. Documenting a process for determining who should be included on a code review would be helpful.

[1] https://github.com/get-alex/alex

________________________________
From: Dewey Dunnington <de...@voltrondata.com.INVALID>
Sent: Tuesday, January 3, 2023 2:33 PM
To: dev@arrow.apache.org <de...@arrow.apache.org>
Subject: Re: [DISCUSS] State of the Arrow Project 2022

First, a +1000 on Will's blog post! [1]

Continue:

Building tools that benefit users of all languages, with particular kudos
to ADBC for providing an ABI-stable way to write database drivers that can
be used by practitioners in C++, Ruby, Python, Java, Go, and (soon!) R.

Start:

I wonder if this is the year that we can find a way to write compute
functions in such a way that separate implementations don't have to exist
for C++, Go, and Rust (and maybe others I don't know about).

Stop:

Will's comment that we should stop building data scientist-facing tools
under the Arrow name struck a particular chord with me...the R package is
very much data scientist facing and we have a rather large disjoint between
the technical capacity of our users and the technical capacity required to
contribute to the package (e.g., maintaining a development Arrow C++
install). The types of things we have to do to make RecordBatchReader,
Arrays, Buffer, RecordBatch and Table structures available to R users and
the types of things we have to do to provide an Acero dplyr backend are
vastly different.

[1] https://www.datawill.io/posts/apache-arrow-2022-reflection/

Re: [DISCUSS] State of the Arrow Project 2022

Posted by Dewey Dunnington <de...@voltrondata.com.INVALID>.

First, a +1000 on Will's blog post! [1]

Continue:

Building tools that benefit users of all languages, with particular kudos
to ADBC for providing an ABI-stable way to write database drivers that can
be used by practitioners in C++, Ruby, Python, Java, Go, and (soon!) R.

Start:

I wonder if this is the year that we can find a way to write compute
functions in such a way that separate implementations don't have to exist
for C++, Go, and Rust (and maybe others I don't know about).

Stop:

Will's comment that we should stop building data scientist-facing tools
under the Arrow name struck a particular chord with me... the R package is
very much data scientist facing and we have a rather large disconnect between
the technical capacity of our users and the technical capacity required to
contribute to the package (e.g., maintaining a development Arrow C++
install). The types of things we have to do to make RecordBatchReader,
Arrays, Buffer, RecordBatch and Table structures available to R users and
the types of things we have to do to provide an Acero dplyr backend are
vastly different.

[1] https://www.datawill.io/posts/apache-arrow-2022-reflection/

On Thu, Dec 29, 2022 at 4:09 PM Jacob Wujciak <ja...@voltrondata.com.invalid>
wrote:

> This is a great idea, I will add some thoughts later but just wanted to
> quickly add that the Zulip Chat [1] was recently switched to allow anyone
> to register without the need for an invite link!
> [1]: https://ursalabs.zulipchat.com/

Re: [DISCUSS] State of the Arrow Project 2022

Posted by Jacob Wujciak <ja...@voltrondata.com.INVALID>.

This is a great idea. I will add some thoughts later, but I just wanted to
quickly add that the Zulip Chat [1] was recently switched to allow anyone
to register without the need for an invite link!
[1]: https://ursalabs.zulipchat.com/


On Wed, Dec 28, 2022 at 11:27 PM Will Jones <wi...@gmail.com> wrote:

> Thanks for suggesting this, Andrew.
>
> I just uploaded a blog post with my thoughts in long form [1]. Here are
> some suggestions pulled from that:
>
> Continue:
>
> I hope we will continue prioritizing updating the spec for new array
> formats [2]. I think this is very important for avoiding fragmentation, and
> it may even open opportunities for consolidation in the C++ ecosystem.
>
> +1 on additional improvements for documentation, examples, and no-invite
> chats. I am particularly keen on seeing evangelism for our protocols; existing
> ones like the C Data Interface aren't nearly as widely known as they ought to
> be, and I'm excited for new ones like ADBC.
> Start:
>
> Find ways for each subproject to publicly develop a clear roadmap.
> Otherwise, by default, these discussions happen in private, either between
> individual ICs or within corporate environments. Some subprojects, such as
> Acero, could likely use their own sync call to help facilitate this, even if
> on a slower cadence than the main biweekly call.
>
> Also, other sync calls might consider adopting the sync call note style
> used in the Rust projects, where all notes are in one Google Doc [3] rather
> than spread across main mailing list threads. That seems like a format that
> would make it easy for new contributors to catch up on the major focuses of
> the project.
>
> Stop:
>
> Don't create end-user (e.g. data scientist) facing tools under the name
> Arrow; prefer keeping separate brand identities for those tools and keeping
> Arrow libraries as developer-facing libraries.
>
> [1] https://www.datawill.io/posts/apache-arrow-2022-reflection/
> [2] https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq
> [3] https://docs.google.com/document/d/1atCVnoff5SR4eM4Lwf2M1BBJTY6g3_HUNR6qswYJW_U/edit#heading=h.qkuvi08gk4qa
>
> On Mon, Dec 26, 2022 at 10:12 AM Andrew Lamb <al...@influxdata.com> wrote:
>
> > Hi all,
> >
> > I am very excited and honored to help steer the Arrow Project this year as
> > Arrow PMC Chair.
> >
> > Something Kou suggested, and the PMC thought would be valuable, is to have
> > a small retrospective about the state of the project and where we want to
> > take it. I would like to try doing so via a “state of the project” type
> > discussion on this mailing list, inspired by an example from Apache Calcite
> > [1].
> >
> > I welcome any / all comments on the following topics: What things /
> > activities, if any, do you think the Apache Arrow Community should:
> >
> > 1. Continue
> > 2. Start
> > 3. Stop
> >
> > My thoughts are below.
> >
> > Andrew
> >
> > [1] https://lists.apache.org/thread/tx8gw3vxc4kwfzjs6q2gqwgywnsm1zbf
> >
> > Continue:
> >
> > I hope we can continue to encourage and support community growth, focused
> > especially on supporting the sub projects and their leadership. I also
> > would like to continue and grow the outward facing evangelism about the
> > project with blog posts and presentations.
> >
> > Start:
> >
> > Lower the barrier to contributors and accepting those contributions even
> > more, especially for casual contributors. The move to GitHub issues from
> > JIRA I see as one example of lowering this barrier (by reducing the
> > required account maintenance). I would love to see additional improvements
> > in areas like documentation, examples, no-invite-needed chat, etc.
> >
> > Stop:
> >
> > It would be nice to stop (reduce) the reliance on the relatively small
> > number of core contributors for code review. I don’t have any particular
> > insight on how to accomplish this, and suspect we will always have less
> > review capacity than we would like, but it would be nice to encourage the
> > growth.
> >
>

Re: [DISCUSS] State of the Arrow Project 2022

Posted by Will Jones <wi...@gmail.com>.

Thanks for suggesting this, Andrew.

I just uploaded a blog post with my thoughts in long form [1]. Here are
some suggestions pulled from that:

Continue:

I hope we will continue prioritizing updating the spec for new array
formats [2]. I think this is very important for avoiding fragmentation, and
it may even open opportunities for consolidation in the C++ ecosystem.

+1 on additional improvements for documentation, examples, and no-invite
chats. I am particularly keen on seeing evangelism for our protocols; existing
ones like the C Data Interface aren't nearly as widely known as they ought to
be, and I'm excited for new ones like ADBC.

Start:

Find ways for each subproject to publicly develop a clear roadmap.
Otherwise, by default, these discussions happen in private, either between
individual ICs or within corporate environments. Some subprojects, such as
Acero, could likely use their own sync call to help facilitate this, even if
on a slower cadence than the main biweekly call.

Also, other sync calls might consider adopting the sync call note style
used in the Rust projects, where all notes are in one Google Doc [3] rather
than spread across main mailing list threads. That seems like a format that
would make it easy for new contributors to catch up on the major focuses of
the project.

Stop:

Don't create end-user (e.g. data scientist) facing tools under the name
Arrow; prefer keeping separate brand identities for those tools and keeping
Arrow libraries as developer-facing libraries.

[1] https://www.datawill.io/posts/apache-arrow-2022-reflection/
[2] https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq
[3] https://docs.google.com/document/d/1atCVnoff5SR4eM4Lwf2M1BBJTY6g3_HUNR6qswYJW_U/edit#heading=h.qkuvi08gk4qa

On Mon, Dec 26, 2022 at 10:12 AM Andrew Lamb <al...@influxdata.com> wrote:

> Hi all,
>
> I am very excited and honored to help steer the Arrow Project this year as
> Arrow PMC Chair.
>
> Something Kou suggested, and the PMC thought would be valuable, is to have
> a small retrospective about the state of the project and where we want to
> take it. I would like to try doing so via a “state of the project” type
> discussion on this mailing list, inspired by an example from Apache Calcite
> [1].
>
> I welcome any / all comments on the following topics: What things /
> activities, if any, do you think the Apache Arrow Community should:
>
> 1. Continue
> 2. Start
> 3. Stop
>
> My thoughts are below.
>
> Andrew
>
> [1] https://lists.apache.org/thread/tx8gw3vxc4kwfzjs6q2gqwgywnsm1zbf
>
> Continue:
>
> I hope we can continue to encourage and support community growth, focused
> especially on supporting the sub projects and their leadership. I also
> would like to continue and grow the outward facing evangelism about the
> project with blog posts and presentations.
>
> Start:
>
> Lower the barrier to contributors and accepting those contributions even
> more, especially for casual contributors. The move to GitHub issues from
> JIRA I see as one example of lowering this barrier (by reducing the
> required account maintenance). I would love to see additional improvements
> in areas like documentation, examples, no-invite-needed chat, etc.
>
> Stop:
>
> It would be nice to stop (reduce) the reliance on the relatively small
> number of core contributors for code review. I don’t have any particular
> insight on how to accomplish this, and suspect we will always have less
> review capacity than we would like, but it would be nice to encourage the
> growth.
>