Posted to dev@arrow.apache.org by QP Hou <ho...@gmail.com> on 2021/11/07 00:52:49 UTC

[DISCUSS] Community maintained extension repos for Datafusion

Hi all,

I would like to propose a new and more community-friendly governance
model for community contributed and maintained extensions for the
datafusion project.

Over the last year, many datafusion extensions have been proposed and
created by the community, including the java binding, s3 and hdfs[1]
object storage implementations, etc. Right now this code is or will
be hosted in individual github namespaces for the following
reasons:

* Most of these extensions are not considered part of the Datafusion
core, so the current maintainers prefer to not have them managed in
the main repository. The current python binding and ballista code base
is already adding a decent amount of overhead to our development
process. Adding more dependent crates will slow us down further
without much upside.

* Considering the overhead of the official Apache release process,
current Datafusion PMC members don't have the bandwidth to manage
individual releases for these extensions. None of the authors of
these extensions are Arrow PMC members, so they won't have the access
to drive the Apache releases by themselves.

Therefore, I am proposing that we create an unofficial shared Github
organization to host these Datafusion contrib type projects that are
only maintained by non-PMC community members. I think this is strictly
better than hosting these extension projects in personal github
namespaces. If any of these extensions end up getting significant
involvement or interest from Datafusion committers, then we can
promote them into official projects and provide official Apache style
release support.

Other alternatives I have considered are:

* Keep these projects under personal namespaces and only link them in
Datafusion's documentation.

* Manage these extensions using experimental repos. But as far as I
know, the code owners still need to be PMC members in order to
perform crates.io releases, and experimental repos are not intended
for long-running projects with no goal of eventual archival.

* Create a dedicated mono repo named apache/datafusion-contrib to host
these extensions. However, this approach also requires PMC members to
get involved for crates.io releases if I understand it correctly.

I am curious if this is something that could be done under the Apache
governance model? My main goal is to create an unofficial incubator
type space for community members to develop and collaborate on
extensions that may or may not be adopted as official extensions in
the future.

[1]: https://github.com/apache/arrow-datafusion/pull/1223

Thanks,
QP