Posted to general@incubator.apache.org by Thomas Weise <th...@apache.org> on 2019/01/13 22:34:16 UTC

[VOTE] Accept Hudi into the Apache Incubator

Hi all,

Following the discussion of the Hudi proposal in [1], this is a vote
on accepting Hudi into the Apache Incubator,
per the ASF policy [2] and voting rules [3].

A vote for accepting a new Apache Incubator podling is a
majority vote. Everyone is welcome to vote, but only
Incubator PMC member votes are binding.

This vote will run for at least 72 hours. Please VOTE as
follows:

[ ] +1 Accept Hudi into the Apache Incubator
[ ] +0 Abstain
[ ] -1 Do not accept Hudi into the Apache Incubator because ...

The proposal is included below, but you can also access it on
the wiki [4].

Thanks for reviewing and voting,
Thomas

[1]
https://lists.apache.org/thread.html/12e2bdaa095d68dae6f8731e473d3d43885783177d1b7e3ff2f65b6d@%3Cgeneral.incubator.apache.org%3E

[2]
https://incubator.apache.org/policy/incubation.html#approval_of_proposal_by_sponsor

[3] http://www.apache.org/foundation/voting.html

[4] https://wiki.apache.org/incubator/HudiProposal



= Hudi Proposal =

== Abstract ==

Hudi is a big-data storage library that provides atomic upserts and
incremental data streams.

Hudi manages data stored in Apache Hadoop and other API-compatible
distributed file systems/cloud stores.

== Proposal ==

Hudi provides the ability to atomically upsert datasets with new values in
near-real time, making data available quickly to existing query engines
like Apache Hive, Apache Spark, & Presto. Additionally, Hudi provides a
sequence of changes to a dataset from a given point-in-time to enable
incremental data pipelines that yield greater efficiency & lower latency
than their typical batch counterparts. By carefully managing the number &
sizes of files, Hudi greatly aids both query engines (e.g., always providing
well-sized files) and underlying storage (e.g., reducing HDFS NameNode
memory consumption).
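
As a rough illustration of the two primitives described above (atomic
upserts and a sequence of changes from a given point in time), here is a
minimal sketch in Python. It is a toy in-memory model, not Hudi's API or
storage layout; the ToyDataset class, its method names, and the integer
commit times are all assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDataset:
    """Hypothetical in-memory stand-in for a keyed dataset (not Hudi's API)."""
    rows: dict = field(default_factory=dict)     # record key -> (commit_time, value)
    commits: list = field(default_factory=list)  # commit times, in increasing order

    def upsert(self, commit_time, records):
        """Apply a batch as one commit: insert new keys, overwrite existing ones."""
        if self.commits and commit_time <= self.commits[-1]:
            raise ValueError("commit times must increase")
        # Build the new state on a copy and swap it in at the end, so a reader
        # never observes a half-applied batch (a stand-in for atomicity).
        staged = dict(self.rows)
        for key, value in records.items():
            staged[key] = (commit_time, value)
        self.rows = staged
        self.commits.append(commit_time)

    def snapshot(self):
        """Latest value per key, as a batch-style query would see it."""
        return {key: value for key, (_, value) in self.rows.items()}

    def changes_since(self, commit_time):
        """Records written by commits strictly after commit_time."""
        return {key: value
                for key, (ts, value) in self.rows.items()
                if ts > commit_time}


ds = ToyDataset()
ds.upsert(1, {"trip-1": "requested", "trip-2": "requested"})
ds.upsert(2, {"trip-1": "completed", "trip-3": "requested"})
```

In this sketch, a downstream job that last processed commit 1 would call
ds.changes_since(1) and receive only the records touched by commit 2,
instead of rescanning the whole dataset as a bulk batch job would.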

Hudi is largely implemented as an Apache Spark library that reads/writes
data from/to a Hadoop-compatible filesystem. SQL queries on Hudi datasets are
supported via specialized Apache Hadoop input formats that understand
Hudi’s storage layout. Currently, Hudi manages datasets using a combination
of Apache Parquet & Apache Avro file/serialization formats.

== Background ==

Apache Hadoop distributed filesystem (HDFS) & other compatible cloud
storage systems (e.g., Amazon S3, Google Cloud, Microsoft Azure) serve as
longer term analytical storage for thousands of organizations. Typical
analytical datasets are built by reading data from a source (e.g., upstream
databases, messaging buses, or other datasets), transforming the data,
writing results back to storage, & making it available for analytical
queries--all of this typically accomplished in batch jobs which operate in
a bulk fashion on partitions of datasets. Such a style of processing
typically incurs large delays in making data available to queries as well
as a lot of complexity in carefully partitioning datasets to guarantee
latency SLAs.

The need for fresher/faster analytics has increased enormously in the past
few years, as evidenced by the popularity of stream processing systems like
Apache Spark, Apache Flink, and messaging systems like Apache Kafka. By
using an updatable state store to incrementally compute & instantly reflect
new results to queries and using a “tailable” messaging bus to publish
these results to other downstream jobs, such systems employ a different
approach to building analytical datasets. Even though this approach yields
low latency, the amount of data managed in such real-time data-marts is
typically limited in comparison to the aforementioned longer term storage
options. As a result, the overall data architecture has become more complex
with more moving parts and specialized systems, leading to duplication of
data and a strain on usability.

Hudi takes a hybrid approach. Instead of moving vast amounts of batch data
to streaming systems, we simply add the streaming primitives (upserts &
incremental consumption) onto existing batch processing technologies. We
believe that by adding some missing blocks to an existing Hadoop stack, we
are able to provide similar capabilities right on top of Hadoop at a
reduced cost and with an increased efficiency, greatly simplifying the
overall architecture in the process.

Hudi was originally developed at Uber (original name “Hoodie”) to address
such broad inefficiencies in ingest & ETL & ML pipelines across Uber’s data
ecosystem that required the upsert & incremental consumption primitives
supported by Hudi.

== Rationale ==

We truly believe the capabilities supported by Hudi would be increasingly
useful for big-data ecosystems, as data volumes & the need for faster data
continue to increase. A detailed description of target use-cases can be
found at https://uber.github.io/hudi/use_cases.html.

Given our reliance on so many great Apache projects, we believe that the
Apache way of open source community driven development will enable us to
evolve Hudi in collaboration with a diverse set of contributors who can
bring new ideas into the project.

== Initial Goals ==

 * Move the existing codebase, website, documentation, and mailing lists to
an Apache-hosted infrastructure.
 * Integrate with the Apache development process.
 * Ensure all dependencies are compliant with Apache License version 2.0.
 * Incrementally develop and release per Apache guidelines.

== Current Status ==

Hudi is a stable project used in production at Uber since 2016 and was open
sourced under the Apache License, Version 2.0 in 2017. At Uber, Hudi
manages 4000+ tables holding several petabytes, bringing our Hadoop
warehouse from several hours of data delays to under 30 minutes, over the
past two years. The source code is currently hosted at github.com (
https://github.com/uber/hudi ), which will seed the Apache git repository.

=== Meritocracy ===

We are fully committed to open, transparent, & meritocratic interactions
with our community. In fact, one of the primary motivations for us to enter
the incubation process is to be able to rely on Apache best practices that
can ensure meritocracy. This will eventually help incorporate the best
ideas back into the project & enable contributors to continue investing
their time in the project. Current guidelines (
https://uber.github.io/hudi/community.html#becoming-a-committer) have
already put in place a meritocratic process which we will replace with
Apache guidelines during incubation.

=== Community ===

The Hudi community is fairly young, since the project was open sourced only
in early 2017. Currently, Hudi has committers from Uber & Snowflake. We have
a vibrant set of contributors (~46 members in our Slack channel), including
people from Shopify, DoubleVerify, Vungle, & others, who have either
submitted patches or filed issues, with Hudi pipelines either in early
production or testing stages. Our primary goal during the incubation would be
to grow the community and groom our existing active contributors into
committers.

=== Core Developers ===

Current core developers work at Uber & Snowflake. We are confident that
incubation will help us grow a diverse community in an open & collaborative
way.

=== Alignment ===

Hudi is designed as a general purpose analytical storage abstraction that
integrates with multiple Apache projects: Apache Spark, Apache Hive, Apache
Hadoop. It was built using multiple Apache projects, including Apache
Parquet and Apache Avro, that support near-real time analytics right on top
of existing Apache Hadoop data lakes. Our sincere hope is that being a part
of the Apache foundation would enable us to drive the future of the project
in alignment with the other Apache projects for the benefit of thousands of
organizations that already leverage these projects.

== Known Risks ==

=== Orphaned products ===

The risk of abandonment of Hudi is low. It is used in production at Uber
for petabytes of data and other companies (mentioned in community section)
are either evaluating or in the early stage for production use. Uber is
committed to further development of the project and to investing resources
in the Apache processes & building the community during the incubation
period.

=== Inexperience with Open Source ===

Even though the initial committers are new to the Apache world, some have
considerable open source experience - Vinoth Chandar (LinkedIn Voldemort,
Chromium), Prasanna Rajaperumal (Cloudera experience), Zeeshan Qureshi
(Chromium) & Balaji Varadarajan (LinkedIn Databus). We have been
successfully managing the current open source community, answering questions
and taking feedback already. Moreover, we hope to obtain guidance and
mentorship from current ASF members to help us succeed with the incubation.

=== Length of Incubation ===

We expect the project to be in incubation for 2 years or less.

=== Homogenous Developers ===

Currently, the lead developers for Hudi are from Uber. However, we have an
active set of early contributors/collaborators from Shopify, DoubleVerify
and Vungle, who we hope will increase the diversity going forward. Once
again, a primary motivation for incubation is to facilitate this in the
Apache way.

=== Reliance on Salaried Developers ===

Both the current committers & early contributors have several years of core
expertise around data systems. Current committers are very passionate about
the project and have already invested hundreds of hours towards helping &
building the community. Thus, even with employer changes, we expect they
will be able to actively engage in the project either because they will be
working in similar areas even with newer employers or out of belief in the
project.

=== Relationships with Other Apache Products ===

To the best of our knowledge, there are no projects directly competing with
Hudi that offer the full feature set - namely upserts, incremental
streams, efficient storage/file management, snapshot isolation/rollbacks -
in a coherent way. However, some projects share common goals and technical
elements and we will highlight them here. Hive ACID/Kudu both offer upsert
capabilities without storage management/incremental streams. The recent
Iceberg project offers similar snapshot isolation/rollbacks, but not
upserts or other data plane features. A detailed comparison with their
trade-offs can be found at https://uber.github.io/hudi/comparison.html.

We are committed to open collaboration with such Apache projects and to
incorporating changes into Hudi or contributing patches to other projects,
with the goal of making it easier for the community at large to adopt these
open source technologies.

=== Excessive Fascination with the Apache Brand ===

This proposal is not for the purpose of generating publicity. We have
already been doing talks/meetups independently that have helped us build
our community. We are drawn towards Apache as a potential way of ensuring
that our open source community management is successful early on so Hudi
can evolve into a broadly accepted--and used--method of managing data on
Hadoop.

== Documentation ==
Detailed documentation can be found at https://uber.github.io/hudi/

== Initial Source ==

The codebase is currently hosted on GitHub: https://github.com/uber/hudi .
During incubation, the codebase will be migrated to Apache
infrastructure. The source code is already Apache 2.0 licensed.

== Source and Intellectual Property Submission Plan ==

Current code is Apache 2.0 licensed and the copyright is assigned to Uber.
If the project enters the Incubator, Uber will transfer the source code &
trademark ownership to the ASF via a Software Grant Agreement.

== External Dependencies ==

Non-Apache dependencies are listed below:

 * JCommander (1.48) Apache-2.0
 * Kryo (4.0.0) BSD-2-Clause
 * Kryo (2.21) BSD-3-Clause
 * Jackson-annotations (2.6.4) Apache-2.0
 * Jackson-annotations (2.6.5) Apache-2.0
 * jackson-databind (2.6.4) Apache-2.0
 * jackson-databind (2.6.5) Apache-2.0
 * Jackson datatype: Guava (2.9.4) Apache-2.0
 * docker-java (3.1.0-rc-3) Apache-2.0
 * Guava: Google Core Libraries for Java (20.0) Apache-2.0
 * bijection-avro (0.9.2) Apache-2.0
 * com.twitter.common:objectsize (0.0.12) Apache-2.0
 * Ascii Table (0.2.5) Apache-2.0
 * config (3.0.0) Apache-2.0
 * utils (3.0.0) Apache-2.0
 * kafka-avro-serializer (3.0.0) Apache-2.0
 * kafka-schema-registry-client (3.0.0) Apache-2.0
 * Metrics Core (3.1.1) Apache-2.0
 * Graphite Integration for Metrics (3.1.1) Apache-2.0
 * Joda-Time (2.9.6) Apache-2.0
 * JUnit CPL-1.0
 * Awaitility (3.1.2) Apache-2.0
 * jersey-connectors-apache (2.17) GPL-2.0-only CDDL-1.0
 * jersey-container-servlet-core (2.17) GPL-2.0-only CDDL-1.0
 * jersey-core-server (2.17) GPL-2.0-only CDDL-1.0
 * htrace-core (3.0.4) Apache-2.0
 * Mockito (1.10.19) MIT
 * scalatest (3.0.1) Apache-2.0
 * Spring Shell (1.2.0.RELEASE) Apache-2.0

All of them are Apache compatible

== Cryptography ==

No cryptographic libraries are used.

== Required Resources ==

=== Mailing lists ===

 * private@hudi.incubator.apache.org (with moderated subscriptions)
 * dev@hudi.incubator.apache.org
 * commits@hudi.incubator.apache.org
 * user@hudi.incubator.apache.org

=== Git Repositories ===

Git is the preferred source control system:
git://git.apache.org/incubator-hudi

=== Issue Tracking ===

We prefer to use the Apache gitbox integration to sync GitHub & Apache
infrastructure, and rely on GitHub issues & pull requests for community
engagement. If this is not possible, then we prefer JIRA: Hudi (HUDI).

== Initial Committers ==

 * Vinoth Chandar (vinoth at uber dot com) (Uber)
 * Nishith Agarwal (nagarwal at uber dot com) (Uber)
 * Balaji Varadarajan (varadarb at uber dot com) (Uber)
 * Prasanna Rajaperumal (prasanna dot raj at gmail dot com) (Snowflake)
 * Zeeshan Qureshi (zeeshan dot qureshi at shopify dot com) (Shopify)
 * Anbu Cheeralan (alunarbeach at gmail dot com) (DoubleVerify)

== Sponsors ==

=== Champion ===
Julien Le Dem (julien at apache dot org)

=== Nominated Mentors ===

 * Luciano Resende (lresende at apache dot org)
 * Thomas Weise (thw at apache dot org)
 * Kishore Gopalakrishna (kishoreg at apache dot org)
 * Suneel Marthi (smarthi at apache dot org)

=== Sponsoring Entity ===

The Incubator PMC

Re: [VOTE] Accept Hudi into the Apache Incubator

Posted by Akira Ajisaka <aa...@apache.org>.

+1 (binding)

-Akira

On Tue, Jan 15, 2019 at 10:25, Jakob Homan <jg...@gmail.com> wrote:
>
> +1 (binding)
>
> -Jakob
>
> On Mon, Jan 14, 2019 at 5:22 PM Mayank Bansal <ma...@gmail.com> wrote:
> >
> > +1
> >
> > On Mon, Jan 14, 2019 at 5:11 PM Mohammad Islam <mi...@yahoo.com.invalid>
> > wrote:
> >
> > >  +1
> > > On Monday, January 14, 2019, 12:46:48 PM PST, Kenneth Knowles <kenn@apache.org> wrote:
> > >
> > >  +1
> > >
> > > On Mon, Jan 14, 2019 at 9:38 AM Felix Cheung <fe...@apache.org>
> > > wrote:
> > >
> > > > +1
> > > >
> > > >
> > > > On Mon, Jan 14, 2019 at 3:20 AM Suneel Marthi
> > > > <su...@yahoo.com.invalid> wrote:
> > > >
> > > > > +1
> > > > >
> > > > > Sent from my iPhone
> > > > >
> > > > > > On Jan 13, 2019, at 5:34 PM, Thomas Weise <th...@apache.org> wrote:
> > > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > Following the discussion of the Hudi proposal in [1], this is a vote
> > > > > > on accepting Hudi into the Apache Incubator,
> > > > > > per the ASF policy [2] and voting rules [3].
> > > > > >
> > > > > > A vote for accepting a new Apache Incubator podling is a
> > > > > > majority vote. Everyone is welcome to vote, only
> > > > > > Incubator PMC member votes are binding.
> > > > > >
> > > > > > This vote will run for at least 72 hours. Please VOTE as
> > > > > > follows:
> > > > > >
> > > > > > [ ] +1 Accept Hudi into the Apache Incubator
> > > > > > [ ] +0 Abstain
> > > > > > [ ] -1 Do not accept Hudi into the Apache Incubator because ...
> > > > > >
> > > > > > The proposal is included below, but you can also access it on
> > > > > > the wiki [4].
> > > > > >
> > > > > > Thanks for reviewing and voting,
> > > > > > Thomas
> > > > > >
> > > > > > [1]
> > > > > >
> > > > >
> > > >
> > > https://lists.apache.org/thread.html/12e2bdaa095d68dae6f8731e473d3d43885783177d1b7e3ff2f65b6d@%3Cgeneral.incubator.apache.org%3E
> > > > > >
> > > > > > [2]
> > > > > >
> > > > >
> > > >
> > > https://incubator.apache.org/policy/incubation.html#approval_of_proposal_by_sponsor
> > > > > >
> > > > > > [3] http://www.apache.org/foundation/voting.html
> > > > > >
> > > > > > [4] https://wiki.apache.org/incubator/HudiProposal
> > > > > >
> > > > > >
> > > > > >
> > > > > > = Hudi Proposal =
> > > > > >
> > > > > > == Abstract ==
> > > > > >
> > > > > > Hudi is a big-data storage library, that provides atomic upserts and
> > > > > > incremental data streams.
> > > > > >
> > > > > > Hudi manages data stored in Apache Hadoop and other API compatible
> > > > > > distributed file systems/cloud stores.
> > > > > >
> > > > > > == Proposal ==
> > > > > >
> > > > > > Hudi provides the ability to atomically upsert datasets with new
> > > values
> > > > > in
> > > > > > near-real time, making data available quickly to existing query
> > > engines
> > > > > > like Apache Hive, Apache Spark, & Presto. Additionally, Hudi
> > > provides a
> > > > > > sequence of changes to a dataset from a given point-in-time to enable
> > > > > > incremental data pipelines that yield greater efficiency & latency
> > > than
> > > > > > their typical batch counterparts. By carefully managing number of
> > > > files &
> > > > > > sizes, Hudi greatly aids both query engines (e.g: always providing
> > > > > > well-sized files) and underlying storage (e.g: HDFS NameNode memory
> > > > > > consumption).
> > > > > >
> > > > > > Hudi is largely implemented as an Apache Spark library that
> > > > reads/writes
> > > > > > data from/to Hadoop compatible filesystem. SQL queries on Hudi
> > > datasets
> > > > > are
> > > > > > supported via specialized Apache Hadoop input formats, that
> > > understand
> > > > > > Hudi’s storage layout. Currently, Hudi manages datasets using a
> > > > > combination
> > > > > > of Apache Parquet & Apache Avro file/serialization formats.
> > > > > >
> > > > > > == Background ==
> > > > > >
> > > > > > Apache Hadoop distributed filesystem (HDFS) & other compatible cloud
> > > > > > storage systems (e.g: Amazon S3, Google Cloud, Microsoft Azure) serve
> > > > as
> > > > > > longer term analytical storage for thousands of organizations.
> > > Typical
> > > > > > analytical datasets are built by reading data from a source (e.g:
> > > > > upstream
> > > > > > databases, messaging buses, or other datasets), transforming the
> > > data,
> > > > > > writing results back to storage, & making it available for analytical
> > > > > > queries--all of this typically accomplished in batch jobs which
> > > operate
> > > > > in
> > > > > > a bulk fashion on partitions of datasets. Such a style of processing
> > > > > > typically incurs large delays in making data available to queries as
> > > > well
> > > > > > as lot of complexity in carefully partitioning datasets to guarantee
> > > > > > latency SLAs.
> > > > > >
> > > > > > The need for fresher/faster analytics has increased enormously in the
> > > > > past
> > > > > > few years, as evidenced by the popularity of Stream processing
> > > systems
> > > > > like
> > > > > > Apache Spark, Apache Flink, and messaging systems like Apache Kafka.
> > > By
> > > > > > using updateable state store to incrementally compute & instantly
> > > > reflect
> > > > > > new results to queries and using a “tailable” messaging bus to
> > > publish
> > > > > > these results to other downstream jobs, such systems employ a
> > > different
> > > > > > approach to building analytical dataset. Even though this approach
> > > > yields
> > > > > > low latency, the amount of data managed in such real-time data-marts
> > > is
> > > > > > typically limited in comparison to the aforementioned longer term
> > > > storage
> > > > > > options. As a result, the overall data architecture has become more
> > > > > complex
> > > > > > with more moving parts and specialized systems, leading to
> > > duplication
> > > > of
> > > > > > data and a strain on usability.
> > > > > >
> > > > > > Hudi takes a hybrid approach. Instead of moving vast amounts of batch
> > > > > data
> > > > > > to streaming systems, we simply add the streaming primitives
> > > (upserts &
> > > > > > incremental consumption) onto existing batch processing technologies.
> > > > We
> > > > > > believe that by adding some missing blocks to an existing Hadoop
> > > stack,
> > > > > we
> > > > > > are able to a provide similar capabilities right on top of Hadoop at
> > > a
> > > > > > reduced cost and with an increased efficiency, greatly simplifying
> > > the
> > > > > > overall architecture in the process.
> > > > > >
> > > > > > Hudi was originally developed at Uber (original name “Hoodie”) to
> > > > address
> > > > > > such broad inefficiencies in ingest & ETL & ML pipelines across
> > > Uber’s
> > > > > data
> > > > > > ecosystem that required the upsert & incremental consumption
> > > primitives
> > > > > > supported by Hudi.
> > > > > >
> > > > > > == Rationale ==
> > > > > >
> > > > > > We truly believe the capabilities supported by Hudi would be
> > > > increasingly
> > > > > > useful for big-data ecosystems, as data volumes & need for faster
> > > data
> > > > > > continue to increase. A detailed description of target use-cases can
> > > be
> > > > > > found at https://uber.github.io/hudi/use_cases.html.
> > > > > >
> > > > > > Given our reliance on so many great Apache projects, we believe that
> > > > the
> > > > > > Apache way of open source community driven development will enable us
> > > > to
> > > > > > evolve Hudi in collaboration with a diverse set of contributors who
> > > can
> > > > > > bring new ideas into the project.
> > > > > >
> > > > > > == Initial Goals ==
> > > > > >
> > > > > > * Move the existing codebase, website, documentation, and mailing
> > > lists
> > > > > to
> > > > > > an Apache-hosted infrastructure.
> > > > > > * Integrate with the Apache development process.
> > > > > > * Ensure all dependencies are compliant with Apache License version
> > > > 2.0.
> > > > > > * Incrementally develop and release per Apache guidelines.
> > > > > >
> > > > > > == Current Status ==
> > > > > >
> > > > > > Hudi is a stable project used in production at Uber since 2016 and
> > > was
> > > > > open
> > > > > > sourced under the Apache License, Version 2.0 in 2017. At Uber, Hudi
> > > > > > manages 4000+ tables holding several petabytes, bringing our Hadoop
> > > > > > warehouse from several hours of data delays to under 30 minutes, over
> > > > the
> > > > > > past two years. The source code is currently hosted at github.com (
> > > > > > https://github.com/uber/hudi ), which will seed the Apache git
> > > > > repository.
> > > > > >
> > > > > > === Meritocracy ===
> > > > > >
> > > > > > We are fully committed to open, transparent, & meritocratic
> > > > interactions
> > > > > > with our community. In fact, one of the primary motivations for us to
> > > > > enter
> > > > > > the incubation process is to be able to rely on Apache best practices
> > > > > that
> > > > > > can ensure meritocracy. This will eventually help incorporate the
> > > best
> > > > > > ideas back into the project & enable contributors to continue
> > > investing
> > > > > > their time in the project. Current guidelines (
> > > > > > https://uber.github.io/hudi/community.html#becoming-a-committer)
> > > have
> > > > > > already put in place a meritocratic process which we will replace
> > > with
> > > > > > Apache guidelines during incubation.
> > > > > >
> > > > > > === Community ===
> > > > > >
> > > > > > Hudi community is fairly young, since the project was open sourced
> > > only
> > > > > in
> > > > > > early 2017. Currently, Hudi has committers from Uber & Snowflake. We
> > > > > have a
> > > > > > vibrant set of contributors (~46 members in our slack channel)
> > > > including
> > > > > > Shopify, DoubleVerify and Vungle & others, who have either submitted
> > > > > > patches or filed issues with hudi pipelines either in early
> > > production
> > > > or
> > > > > > testing stages. Our primary goal during the incubation would be to
> > > grow
> > > > > the
> > > > > > community and groom our existing active contributors into committers.
> > > > > >
> > > > > > === Core Developers ===
> > > > > >
> > > > > > Current core developers work at Uber & Snowflake. We are confident
> > > that
> > > > > > incubation will help us grow a diverse community in a open &
> > > > > collaborative
> > > > > > way.
> > > > > >
> > > > > > === Alignment ===
> > > > > >
> > > > > > Hudi is designed as a general purpose analytical storage abstraction
> > > > that
> > > > > > integrates with multiple Apache projects: Apache Spark, Apache Hive,
> > > > > Apache
> > > > > > Hadoop. It was built using multiple Apache projects, including Apache
> > > > > > Parquet and Apache Avro, that support near-real time analytics right
> > > on
> > > > > top
> > > > > > of existing Apache Hadoop data lakes. Our sincere hope is that being
> > > a
> > > > > part
> > > > > > of the Apache foundation would enable us to drive the future of the
> > > > > project
> > > > > > in alignment with the other Apache projects for the benefit of
> > > > thousands
> > > > > of
> > > > > > organizations that already leverage these projects.
> > > > > >
> > > > > > == Known Risks ==
> > > > > >
> > > > > > === Orphaned products ===
> > > > > >
> > > > > > The risk of abandonment of Hudi is low. It is used in production at
> > > > Uber
> > > > > > for petabytes of data and other companies (mentioned in community
> > > > > section)
> > > > > > are either evaluating or in the early stage for production use. Uber
> > > is
> > > > > > committed to further development of the project and invest resources
> > > > > > towards the Apache processes & building the community, during
> > > > incubation
> > > > > > period.
> > > > > >
> > > > > > === Inexperience with Open Source ===
> > > > > >
> > > > > > Even though the initial committers are new to the Apache world, some
> > > > have
> > > > > > considerable open source experience - Vinoth Chandar (Linkedin
> > > > voldemort,
> > > > > > Chromium), Prasanna Rajaperumal (Cloudera experience), Zeeshan
> > > Qureshi
> > > > > > (Chromium) & Balaji Varadarajan (Linkedin Databus). We have been
> > > > > > successfully managing the current open source community answering
> > > > > questions
> > > > > > and taking feedback already. Moreover, we hope to obtain guidance and
> > > > > > mentorship from current ASF members to help us succeed with the
> > > > > incubation.
> > > > > >
> > > > > > === Length of Incubation ===
> > > > > >
> > > > > > We expect the project be in incubation for 2 years or less.
> > > > > >
> > > > > > === Homogenous Developers ===
> > > > > >
> > > > > > Currently, the lead developers for Hudi are from Uber. However, we
> > > have
> > > > > an
> > > > > > active set of early contributors/collaborators from Shopify,
> > > > DoubleVerify
> > > > > > and Vungle, that we hope will increase the diversity going forward.
> > > > Once
> > > > > > again, a primary motivation for incubation is to facilitate this in
> > > the
> > > > > > Apache way.
> > > > > >
> > > > > > === Reliance on Salaried Developers ===
> > > > > >
> > > > > > Both the current committers & early contributors have several years
> > > of
> > > > > core
> > > > > > expertise around data systems. Current committers are very passionate
> > > > > about
> > > > > > the project and have already invested hundreds of hours towards
> > > > helping &
> > > > > > building the community. Thus, even with employer changes, we expect
> > > > they
> > > > > > will be able to actively engage in the project either because they
> > > will
> > > > > be
> > > > > > working in similar areas even with newer employers or out of belief
> > > in
> > > > > the
> > > > > > project.
> > > > > >
> > > > > > === Relationships with Other Apache Products ===
> > > > > >
> > > > > > To the best of our knowledge, there are no direct competing projects
> > > > with
> > > > > > Hudi that offer all of the feature set namely - upserts, incremental
> > > > > > streams, efficient storage/file management, snapshot
> > > > isolation/rollbacks
> > > > > -
> > > > > > in a coherent way. However, some projects share common goals and
> > > > > technical
> > > > > > elements and we will highlight them here. Hive ACID/Kudu both offer
> > > > > upsert
> > > > > > capabilities without storage management/incremental streams. The
> > > recent
> > > > > > Iceberg project offers similar snapshot isolation/rollbacks, but not
> > > > > > upserts or other data plane features. A detailed comparison with
> > > their
> > > > > > trade-offs can be found at
> > > https://uber.github.io/hudi/comparison.html
> > > > .
> > > > > >
> > > > > > We are committed to open collaboration with such Apache projects and
> > > > > > incorporate changes to Hudi or contribute patches to other projects,
> > > > with
> > > > > > the goal of making it easier for the community at large, to adopt
> > > these
> > > > > > open source technologies.
> > > > > >
> > > > > > === Excessive Fascination with the Apache Brand ===
> > > > > >
> > > > > > This proposal is not for the purpose of generating publicity. We have
> > > > > > already been doing talks/meetups independently that have helped us
> > > > build
> > > > > > our community. We are drawn towards Apache as a potential way of
> > > > ensuring
> > > > > > that our open source community management is successful early on so
> > > > hudi
> > > > > > can evolve into a broadly accepted--and used--method of managing data
> > > > on
> > > > > > Hadoop.
> > > > > >
> > > > > > == Documentation ==
> > > > > > [1] Detailed documentation can be found at
> > > > https://uber.github.io/hudi/
> > > > > >
> > > > > > == Initial Source ==
> > > > > >
> > > > > > The codebase is currently hosted on Github:
> > > > https://github.com/uber/hudi
> > > > > .
> > > > > > During incubation, the codebase will be migrated to an Apache
> > > > > > infrastructure. The source code already has an Apache 2.0 licensed.
> > > > > >
> > > > > > == Source and Intellectual Property Submission Plan ==
> > > > > >
> > > > > > Current code is Apache 2.0 licensed and the copyright is assigned to
> > > > > Uber.
> > > > > > If the project enters incubator, Uber will transfer the source code &
> > > > > > trademark ownership to ASF via a Software Grant Agreement
> > > > > >
> > > > > > == External Dependencies ==
> > > > > >
> > > > > > Non apache dependencies are listed below
> > > > > >
> > > > > > * JCommander (1.48) Apache-2.0
> > > > > > * Kryo (4.0.0) BSD-2-Clause
> > > > > > * Kryo (2.21) BSD-3-Clause
> > > > > > * Jackson-annotations (2.6.4) Apache-2.0
> > > > > > * Jackson-annotations (2.6.5) Apache-2.0
> > > > > > * jackson-databind (2.6.4) Apache-2.0
> > > > > > * jackson-databind (2.6.5) Apache-2.0
> > > > > > * Jackson datatype: Guava (2.9.4) Apache-2.0
> > > > > > * docker-java (3.1.0-rc-3) Apache-2.0
> > > > > > * Guava: Google Core Libraries for Java (20.0) Apache-2.0
> > > > > > * bijection-avro (0.9.2) Apache-2.0
> > > > > > * com.twitter.common:objectsize (0.0.12) Apache-2.0
> > > > > > * Ascii Table (0.2.5) Apache-2.0
> > > > > > * config (3.0.0) Apache-2.0
> > > > > > * utils (3.0.0) Apache-2.0
> > > > > > * kafka-avro-serializer (3.0.0) Apache-2.0
> > > > > > * kafka-schema-registry-client (3.0.0) Apache-2.0
> > > > > > * Metrics Core (3.1.1) Apache-2.0
> > > > > > * Graphite Integration for Metrics (3.1.1) Apache-2.0
> > > > > > * Joda-Time (2.9.6) Apache-2.0
> > > > > > * JUnit CPL-1.0
> > > > > > * Awaitility (3.1.2) Apache-2.0
> > > > > > * jersey-connectors-apache (2.17) GPL-2.0-only CDDL-1.0
> > > > > > * jersey-container-servlet-core (2.17) GPL-2.0-only CDDL-1.0
> > > > > > * jersey-core-server (2.17) GPL-2.0-only CDDL-1.0
> > > > > > * htrace-core (3.0.4) Apache-2.0
> > > > > > * Mockito (1.10.19) MIT
> > > > > > * scalatest (3.0.1) Apache-2.0
> > > > > > * Spring Shell (1.2.0.RELEASE) Apache-2.0
> > > > > >
> > > > > > All of them are Apache compatible
> > > > > >
> > > > > > == Cryptography ==
> > > > > >
> > > > > > No cryptographic libraries used
> > > > > >
> > > > > > == Required Resources ==
> > > > > >
> > > > > > === Mailing lists ===
> > > > > >
> > > > > > * private@hudi.incubator.apache.org (with moderated subscriptions)
> > > > > > * dev@hudi.incubator.apache.org
> > > > > > * commits@hudi.incubator.apache.org
> > > > > > * user@hudi.incubator.apache.org
> > > > > >
> > > > > > === Git Repositories ===
> > > > > >
> > > > > > Git is the preferred source control system: git://git.apache.org/incubator-hudi
> > > > > >
> > > > > > === Issue Tracking ===
> > > > > >
> > > > > > We prefer to use the Apache gitbox integration to sync GitHub & Apache
> > > > > > infrastructure, and rely on GitHub issues & pull requests for community
> > > > > > engagement. If this is not possible, then we prefer JIRA: Hudi (HUDI)
> > > > > >
> > > > > > == Initial Committers ==
> > > > > >
> > > > > > * Vinoth Chandar (vinoth at uber dot com) (Uber)
> > > > > > * Nishith Agarwal (nagarwal at uber dot com) (Uber)
> > > > > > * Balaji Varadarajan (varadarb at uber dot com) (Uber)
> > > > > > * Prasanna Rajaperumal (prasanna dot raj at gmail dot com) (Snowflake)
> > > > > > * Zeeshan Qureshi (zeeshan dot qureshi at shopify dot com) (Shopify)
> > > > > > * Anbu Cheeralan (alunarbeach at gmail dot com) (DoubleVerify)
> > > > > >
> > > > > > == Sponsors ==
> > > > > >
> > > > > > === Champion ===
> > > > > > Julien Le Dem (julien at apache dot org)
> > > > > >
> > > > > > === Nominated Mentors ===
> > > > > >
> > > > > > * Luciano Resende (lresende at apache dot org)
> > > > > > * Thomas Weise (thw at apache dot org)
> > > > > > * Kishore Gopalakrishna (kishoreg at apache dot org)
> > > > > > * Suneel Marthi (smarthi at apache dot org)
> > > > > >
> > > > > > === Sponsoring Entity ===
> > > > > >
> > > > > > The Incubator PMC
> > > > >
> > > >
> >
> > --
> > Thanks and Regards,
> > Mayank
> > Cell: 408-718-9370
>

---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
For additional commands, e-mail: general-help@incubator.apache.org

Re: [VOTE] Accept Hudi into the Apache Incubator

Posted by Jakob Homan <jg...@gmail.com>.

+1 (binding)

-Jakob

On Mon, Jan 14, 2019 at 5:22 PM Mayank Bansal <ma...@gmail.com> wrote:
>
> +1
>
> On Mon, Jan 14, 2019 at 5:11 PM Mohammad Islam <mi...@yahoo.com.invalid>
> wrote:
>
> >  +1
> >     On Monday, January 14, 2019, 12:46:48 PM PST, Kenneth Knowles <
> > kenn@apache.org> wrote:
> >
> >  +1
> >
> > On Mon, Jan 14, 2019 at 9:38 AM Felix Cheung <fe...@apache.org>
> > wrote:
> >
> > > +1
> > >
> > >
> > > On Mon, Jan 14, 2019 at 3:20 AM Suneel Marthi
> > > <su...@yahoo.com.invalid> wrote:
> > >
> > > > +1
> > > >
> > > > Sent from my iPhone
> > > >
>
> --
> Thanks and Regards,
> Mayank
> Cell: 408-718-9370

Re: [VOTE] Accept Hudi into the Apache Incubator

Posted by Mayank Bansal <ma...@gmail.com>.

+1

On Mon, Jan 14, 2019 at 5:11 PM Mohammad Islam <mi...@yahoo.com.invalid>
wrote:

>  +1
>     On Monday, January 14, 2019, 12:46:48 PM PST, Kenneth Knowles <
> kenn@apache.org> wrote:
>
>  +1
>
> On Mon, Jan 14, 2019 at 9:38 AM Felix Cheung <fe...@apache.org>
> wrote:
>
> > +1
> >
> >
> > On Mon, Jan 14, 2019 at 3:20 AM Suneel Marthi
> > <su...@yahoo.com.invalid> wrote:
> >
> > > +1
> > >
> > > Sent from my iPhone
> > >
> > > > On Jan 13, 2019, at 5:34 PM, Thomas Weise <th...@apache.org> wrote:
> > > >
> > > > Hi all,
> > > >
> > > > Following the discussion of the Hudi proposal in [1], this is a vote
> > > > on accepting Hudi into the Apache Incubator,
> > > > per the ASF policy [2] and voting rules [3].
> > > >
> > > > A vote for accepting a new Apache Incubator podling is a
> > > > majority vote. Everyone is welcome to vote, only
> > > > Incubator PMC member votes are binding.
> > > >
> > > > This vote will run for at least 72 hours. Please VOTE as
> > > > follows:
> > > >
> > > > [ ] +1 Accept Hudi into the Apache Incubator
> > > > [ ] +0 Abstain
> > > > [ ] -1 Do not accept Hudi into the Apache Incubator because ...
> > > >
> > > > The proposal is included below, but you can also access it on
> > > > the wiki [4].
> > > >
> > > > Thanks for reviewing and voting,
> > > > Thomas
> > > >
> > > > [1]
> > > > https://lists.apache.org/thread.html/12e2bdaa095d68dae6f8731e473d3d43885783177d1b7e3ff2f65b6d@%3Cgeneral.incubator.apache.org%3E
> > > >
> > > > [2]
> > > > https://incubator.apache.org/policy/incubation.html#approval_of_proposal_by_sponsor
> > > >
> > > > [3] http://www.apache.org/foundation/voting.html
> > > >
> > > > [4] https://wiki.apache.org/incubator/HudiProposal
> > > >
> > > >
> > > >
> > > > = Hudi Proposal =
> > > >
> > > > == Abstract ==
> > > >
> > > > Hudi is a big-data storage library, that provides atomic upserts and
> > > > incremental data streams.
> > > >
> > > > Hudi manages data stored in Apache Hadoop and other API compatible
> > > > distributed file systems/cloud stores.
> > > >
> > > > == Proposal ==
> > > >
> > > > Hudi provides the ability to atomically upsert datasets with new values in
> > > > near-real time, making data available quickly to existing query engines
> > > > like Apache Hive, Apache Spark, & Presto. Additionally, Hudi provides a
> > > > sequence of changes to a dataset from a given point-in-time to enable
> > > > incremental data pipelines that yield greater efficiency & lower latency than
> > > > their typical batch counterparts. By carefully managing the number of files &
> > > > their sizes, Hudi greatly aids both query engines (e.g., always providing
> > > > well-sized files) and underlying storage (e.g., HDFS NameNode memory
> > > > consumption).
> > > >
> > > > Hudi is largely implemented as an Apache Spark library that reads/writes
> > > > data from/to Hadoop-compatible filesystems. SQL queries on Hudi datasets are
> > > > supported via specialized Apache Hadoop input formats that understand
> > > > Hudi’s storage layout. Currently, Hudi manages datasets using a combination
> > > > of Apache Parquet & Apache Avro file/serialization formats.
> > > >
> > > > == Background ==
> > > >
> > > > Apache Hadoop distributed filesystem (HDFS) & other compatible cloud
> > > > storage systems (e.g., Amazon S3, Google Cloud, Microsoft Azure) serve as
> > > > longer-term analytical storage for thousands of organizations. Typical
> > > > analytical datasets are built by reading data from a source (e.g., upstream
> > > > databases, messaging buses, or other datasets), transforming the data,
> > > > writing results back to storage, & making it available for analytical
> > > > queries--all of this typically accomplished in batch jobs which operate in
> > > > a bulk fashion on partitions of datasets. Such a style of processing
> > > > typically incurs large delays in making data available to queries as well
> > > > as a lot of complexity in carefully partitioning datasets to guarantee
> > > > latency SLAs.
> > > >
> > > > The need for fresher/faster analytics has increased enormously in the past
> > > > few years, as evidenced by the popularity of stream processing systems like
> > > > Apache Spark, Apache Flink, and messaging systems like Apache Kafka. By
> > > > using an updateable state store to incrementally compute & instantly reflect
> > > > new results to queries and using a “tailable” messaging bus to publish
> > > > these results to other downstream jobs, such systems employ a different
> > > > approach to building analytical datasets. Even though this approach yields
> > > > low latency, the amount of data managed in such real-time data-marts is
> > > > typically limited in comparison to the aforementioned longer-term storage
> > > > options. As a result, the overall data architecture has become more complex
> > > > with more moving parts and specialized systems, leading to duplication of
> > > > data and a strain on usability.
> > > >
> > > > Hudi takes a hybrid approach. Instead of moving vast amounts of batch data
> > > > to streaming systems, we simply add the streaming primitives (upserts &
> > > > incremental consumption) onto existing batch processing technologies. We
> > > > believe that by adding some missing blocks to an existing Hadoop stack, we
> > > > are able to provide similar capabilities right on top of Hadoop at a
> > > > reduced cost and with an increased efficiency, greatly simplifying the
> > > > overall architecture in the process.
> > > >
> > > > Hudi was originally developed at Uber (original name “Hoodie”) to address
> > > > such broad inefficiencies in ingest & ETL & ML pipelines across Uber’s data
> > > > ecosystem that required the upsert & incremental consumption primitives
> > > > supported by Hudi.
> > > >
> > > > == Rationale ==
> > > >
> > > > We truly believe the capabilities supported by Hudi would be increasingly
> > > > useful for big-data ecosystems, as data volumes & need for faster data
> > > > continue to increase. A detailed description of target use-cases can be
> > > > found at https://uber.github.io/hudi/use_cases.html.
> > > >
> > > > Given our reliance on so many great Apache projects, we believe that the
> > > > Apache way of open source community driven development will enable us to
> > > > evolve Hudi in collaboration with a diverse set of contributors who can
> > > > bring new ideas into the project.
> > > >
> > > > == Initial Goals ==
> > > >
> > > > * Move the existing codebase, website, documentation, and mailing lists to
> > > > an Apache-hosted infrastructure.
> > > > * Integrate with the Apache development process.
> > > > * Ensure all dependencies are compliant with Apache License version 2.0.
> > > > * Incrementally develop and release per Apache guidelines.
> > > >
> > > > == Current Status ==
> > > >
> > > > Hudi is a stable project used in production at Uber since 2016 and was open
> > > > sourced under the Apache License, Version 2.0 in 2017. At Uber, Hudi
> > > > manages 4000+ tables holding several petabytes, bringing our Hadoop
> > > > warehouse from several hours of data delays to under 30 minutes, over the
> > > > past two years. The source code is currently hosted at github.com (
> > > > https://github.com/uber/hudi ), which will seed the Apache git repository.
> > > >
> > > > === Meritocracy ===
> > > >
> > > > We are fully committed to open, transparent, & meritocratic interactions
> > > > with our community. In fact, one of the primary motivations for us to enter
> > > > the incubation process is to be able to rely on Apache best practices that
> > > > can ensure meritocracy. This will eventually help incorporate the best
> > > > ideas back into the project & enable contributors to continue investing
> > > > their time in the project. Current guidelines (
> > > > https://uber.github.io/hudi/community.html#becoming-a-committer) have
> > > > already put in place a meritocratic process which we will replace with
> > > > Apache guidelines during incubation.
> > > >
> > > > === Community ===
> > > >
> > > > The Hudi community is fairly young, since the project was open sourced only
> > > > in early 2017. Currently, Hudi has committers from Uber & Snowflake. We have
> > > > a vibrant set of contributors (~46 members in our Slack channel) including
> > > > Shopify, DoubleVerify and Vungle & others, who have either submitted
> > > > patches or filed issues with Hudi pipelines either in early production or
> > > > testing stages. Our primary goal during the incubation would be to grow the
> > > > community and groom our existing active contributors into committers.
> > > >
> > > > === Core Developers ===
> > > >
> > > > Current core developers work at Uber & Snowflake. We are confident that
> > > > incubation will help us grow a diverse community in an open & collaborative
> > > > way.
> > > >
> > > > === Alignment ===
> > > >
> > > > Hudi is designed as a general purpose analytical storage abstraction that
> > > > integrates with multiple Apache projects: Apache Spark, Apache Hive, Apache
> > > > Hadoop. It was built using multiple Apache projects, including Apache
> > > > Parquet and Apache Avro, that support near-real time analytics right on top
> > > > of existing Apache Hadoop data lakes. Our sincere hope is that being a part
> > > > of the Apache foundation would enable us to drive the future of the project
> > > > in alignment with the other Apache projects for the benefit of thousands of
> > > > organizations that already leverage these projects.
> > > >
> > > > == Known Risks ==
> > > >
> > > > === Orphaned products ===
> > > >
> > > > The risk of abandonment of Hudi is low. It is used in production at
> > Uber
> > > > for petabytes of data and other companies (mentioned in community
> > > section)
> > > > are either evaluating or in the early stage for production use. Uber
> is
> > > > committed to further development of the project and invest resources
> > > > towards the Apache processes & building the community, during
> > incubation
> > > > period.
> > > >
> > > > === Inexperience with Open Source ===
> > > >
> > > > Even though the initial committers are new to the Apache world, some
> > have
> > > > considerable open source experience - Vinoth Chandar (Linkedin
> > voldemort,
> > > > Chromium), Prasanna Rajaperumal (Cloudera experience), Zeeshan
> Qureshi
> > > > (Chromium) & Balaji Varadarajan (Linkedin Databus). We have been
> > > > successfully managing the current open source community answering
> > > questions
> > > > and taking feedback already. Moreover, we hope to obtain guidance and
> > > > mentorship from current ASF members to help us succeed with the
> > > incubation.
> > > >
> > > > === Length of Incubation ===
> > > >
> > > > We expect the project be in incubation for 2 years or less.
> > > >
> > > > === Homogenous Developers ===
> > > >
> > > > Currently, the lead developers for Hudi are from Uber. However, we
> have
> > > an
> > > > active set of early contributors/collaborators from Shopify,
> > DoubleVerify
> > > > and Vungle, that we hope will increase the diversity going forward.
> > Once
> > > > again, a primary motivation for incubation is to facilitate this in
> the
> > > > Apache way.
> > > >
> > > > === Reliance on Salaried Developers ===
> > > >
> > > > Both the current committers & early contributors have several years
> of
> > > core
> > > > expertise around data systems. Current committers are very passionate
> > > about
> > > > the project and have already invested hundreds of hours towards
> > helping &
> > > > building the community. Thus, even with employer changes, we expect
> > they
> > > > will be able to actively engage in the project either because they
> will
> > > be
> > > > working in similar areas even with newer employers or out of belief
> in
> > > the
> > > > project.
> > > >
> > > > === Relationships with Other Apache Products ===
> > > >
> > > > To the best of our knowledge, there are no direct competing projects
> > with
> > > > Hudi that offer all of the feature set namely - upserts, incremental
> > > > streams, efficient storage/file management, snapshot
> > isolation/rollbacks
> > > -
> > > > in a coherent way. However, some projects share common goals and
> > > technical
> > > > elements and we will highlight them here. Hive ACID/Kudu both offer
> > > upsert
> > > > capabilities without storage management/incremental streams. The
> recent
> > > > Iceberg project offers similar snapshot isolation/rollbacks, but not
> > > > upserts or other data plane features. A detailed comparison with
> their
> > > > trade-offs can be found at
> https://uber.github.io/hudi/comparison.html
> > .
> > > >
> > > > We are committed to open collaboration with such Apache projects and
> > > > incorporate changes to Hudi or contribute patches to other projects,
> > with
> > > > the goal of making it easier for the community at large, to adopt
> these
> > > > open source technologies.
> > > >
> > > > === Excessive Fascination with the Apache Brand ===
> > > >
> > > > This proposal is not for the purpose of generating publicity. We have
> > > > already been doing talks/meetups independently that have helped us
> > build
> > > > our community. We are drawn towards Apache as a potential way of
> > ensuring
> > > > that our open source community management is successful early on so
> > hudi
> > > > can evolve into a broadly accepted--and used--method of managing data
> > on
> > > > Hadoop.
> > > >
> > > > == Documentation ==
> > > > [1] Detailed documentation can be found at
> > https://uber.github.io/hudi/
> > > >
> > > > == Initial Source ==
> > > >
> > > > The codebase is currently hosted on Github:
> > https://github.com/uber/hudi
> > > .
> > > > During incubation, the codebase will be migrated to an Apache
> > > > infrastructure. The source code already has an Apache 2.0 licensed.
> > > >
> > > > == Source and Intellectual Property Submission Plan ==
> > > >
> > > > Current code is Apache 2.0 licensed and the copyright is assigned to
> > > Uber.
> > > > If the project enters incubator, Uber will transfer the source code &
> > > > trademark ownership to ASF via a Software Grant Agreement
> > > >
> > > > == External Dependencies ==
> > > >
> > > > Non apache dependencies are listed below
> > > >
> > > > * JCommander (1.48) Apache-2.0
> > > > * Kryo (4.0.0) BSD-2-Clause
> > > > * Kryo (2.21) BSD-3-Clause
> > > > * Jackson-annotations (2.6.4) Apache-2.0
> > > > * Jackson-annotations (2.6.5) Apache-2.0
> > > > * jackson-databind (2.6.4) Apache-2.0
> > > > * jackson-databind (2.6.5) Apache-2.0
> > > > * Jackson datatype: Guava (2.9.4) Apache-2.0
> > > > * docker-java (3.1.0-rc-3) Apache-2.0
> > > > * Guava: Google Core Libraries for Java (20.0) Apache-2.0
> > > > * bijection-avro (0.9.2) Apache-2.0
> > > > * com.twitter.common:objectsize (0.0.12) Apache-2.0
> > > > * Ascii Table (0.2.5) Apache-2.0
> > > > * config (3.0.0) Apache-2.0
> > > > * utils (3.0.0) Apache-2.0
> > > > * kafka-avro-serializer (3.0.0) Apache-2.0
> > > > * kafka-schema-registry-client (3.0.0) Apache-2.0
> > > > * Metrics Core (3.1.1) Apache-2.0
> > > > * Graphite Integration for Metrics (3.1.1) Apache-2.0
> > > > * Joda-Time (2.9.6) Apache-2.0
> > > > * JUnit CPL-1.0
> > > > * Awaitility (3.1.2) Apache-2.0
> > > > * jersey-connectors-apache (2.17) GPL-2.0-only CDDL-1.0
> > > > * jersey-container-servlet-core (2.17) GPL-2.0-only CDDL-1.0
> > > > * jersey-core-server (2.17) GPL-2.0-only CDDL-1.0
> > > > * htrace-core (3.0.4) Apache-2.0
> > > > * Mockito (1.10.19) MIT
> > > > * scalatest (3.0.1) Apache-2.0
> > > > * Spring Shell (1.2.0.RELEASE) Apache-2.0
> > > >
> > > > All of them are Apache compatible
> > > >
> > > > == Cryptography ==
> > > >
> > > > No cryptographic libraries used
> > > >
> > > > == Required Resources ==
> > > >
> > > > === Mailing lists ===
> > > >
> > > > * private@hudi.incubator.apache.org (with moderated subscriptions)
> > > > * dev@hudi.incubator.apache.org
> > > > * commits@hudi.incubator.apache.org
> > > > * user@hudi.incubator.apache.org
> > > >
> > > > === Git Repositories ===
> > > >
> > > > Git is the preferred source control system: git://
> > > > git.apache.org/incubator-hudi
> > > >
> > > > === Issue Tracking ===
> > > >
> > > > We prefer to use the Apache gitbox integration to sync Github &
> Apache
> > > > infrastructure, and rely on Github issues & pull requests for
> community
> > > > engagement. If this is not possible, then we prefer JIRA: Hudi (HUDI)
> > > >
> > > > == Initial Committers ==
> > > >
> > > > * Vinoth Chandar (vinoth at uber dot com) (Uber)
> > > > * Nishith Agarwal (nagarwal at uber dot com) (Uber)
> > > > * Balaji Varadarajan (varadarb at uber dot com) (Uber)
> > > > * Prasanna Rajaperumal (prasanna dot raj at gmail dot com)
> (Snowflake)
> > > > * Zeeshan Qureshi (zeeshan dot qureshi at shopify dot com) (Shopify)
> > > > * Anbu Cheeralan (alunarbeach at gmail dot com) (DoubleVerify)
> > > >
> > > > == Sponsors ==
> > > >
> > > > === Champion ===
> > > > Julien Le Dem (julien at apache dot org)
> > > >
> > > > === Nominated Mentors ===
> > > >
> > > > * Luciano Resende (lresende at apache dot org)
> > > > * Thomas Weise (thw at apache dot org
> > > > * Kishore Gopalakrishna (kishoreg at apache dot org)
> > > > * Suneel Marthi (smarthi at apache dot org)
> > > >
> > > > === Sponsoring Entity ===
> > > >
> > > > The Incubator PMC
> > >
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
> > > For additional commands, e-mail: general-help@incubator.apache.org
> > >
> > >
> >

-- 
Thanks and Regards,
Mayank
Cell: 408-718-9370

Re: [VOTE] Accept Hudi into the Apache Incubator

Posted by Mohammad Islam <mi...@yahoo.com.INVALID>.

+1

On Monday, January 14, 2019, 12:46:48 PM PST, Kenneth Knowles <ke...@apache.org> wrote:

+1

On Mon, Jan 14, 2019 at 9:38 AM Felix Cheung <fe...@apache.org> wrote:

> +1
>
>
> On Mon, Jan 14, 2019 at 3:20 AM Suneel Marthi
> <su...@yahoo.com.invalid> wrote:
>
> > +1
> >
> > Sent from my iPhone

Re: [VOTE] Accept Hudi into the Apache Incubator

Posted by Kenneth Knowles <ke...@apache.org>.

+1

On Mon, Jan 14, 2019 at 9:38 AM Felix Cheung <fe...@apache.org> wrote:

> +1
>
>
> On Mon, Jan 14, 2019 at 3:20 AM Suneel Marthi
> <su...@yahoo.com.invalid> wrote:
>
> > +1
> >
> > Sent from my iPhone
> >
> > > On Jan 13, 2019, at 5:34 PM, Thomas Weise <th...@apache.org> wrote:
> > >
> > > Hi all,
> > >
> > > Following the discussion of the Hudi proposal in [1], this is a vote
> > > on accepting Hudi into the Apache Incubator,
> > > per the ASF policy [2] and voting rules [3].
> > >
> > > A vote for accepting a new Apache Incubator podling is a
> > > majority vote. Everyone is welcome to vote, only
> > > Incubator PMC member votes are binding.
> > >
> > > This vote will run for at least 72 hours. Please VOTE as
> > > follows:
> > >
> > > [ ] +1 Accept Hudi into the Apache Incubator
> > > [ ] +0 Abstain
> > > [ ] -1 Do not accept Hudi into the Apache Incubator because ...
> > >
> > > The proposal is included below, but you can also access it on
> > > the wiki [4].
> > >
> > > Thanks for reviewing and voting,
> > > Thomas
> > >
> > > [1]
> > >
> >
> https://lists.apache.org/thread.html/12e2bdaa095d68dae6f8731e473d3d43885783177d1b7e3ff2f65b6d@%3Cgeneral.incubator.apache.org%3E
> > >
> > > [2]
> > >
> >
> https://incubator.apache.org/policy/incubation.html#approval_of_proposal_by_sponsor
> > >
> > > [3] http://www.apache.org/foundation/voting.html
> > >
> > > [4] https://wiki.apache.org/incubator/HudiProposal
> > >
> > >
> > >
> > > = Hudi Proposal =
> > >
> > > == Abstract ==
> > >
> > > Hudi is a big-data storage library, that provides atomic upserts and
> > > incremental data streams.
> > >
> > > Hudi manages data stored in Apache Hadoop and other API compatible
> > > distributed file systems/cloud stores.
> > >
> > > == Proposal ==
> > >
> > > Hudi provides the ability to atomically upsert datasets with new values
> > in
> > > near-real time, making data available quickly to existing query engines
> > > like Apache Hive, Apache Spark, & Presto. Additionally, Hudi provides a
> > > sequence of changes to a dataset from a given point-in-time to enable
> > > incremental data pipelines that yield greater efficiency & latency than
> > > their typical batch counterparts. By carefully managing number of
> files &
> > > sizes, Hudi greatly aids both query engines (e.g: always providing
> > > well-sized files) and underlying storage (e.g: HDFS NameNode memory
> > > consumption).
> > >
> > > Hudi is largely implemented as an Apache Spark library that
> reads/writes
> > > data from/to Hadoop compatible filesystem. SQL queries on Hudi datasets
> > are
> > > supported via specialized Apache Hadoop input formats, that understand
> > > Hudi’s storage layout. Currently, Hudi manages datasets using a
> > combination
> > > of Apache Parquet & Apache Avro file/serialization formats.
> > >
> > > == Background ==
> > >
> > > Apache Hadoop distributed filesystem (HDFS) & other compatible cloud
> > > storage systems (e.g: Amazon S3, Google Cloud, Microsoft Azure) serve
> as
> > > longer term analytical storage for thousands of organizations. Typical
> > > analytical datasets are built by reading data from a source (e.g:
> > upstream
> > > databases, messaging buses, or other datasets), transforming the data,
> > > writing results back to storage, & making it available for analytical
> > > queries--all of this typically accomplished in batch jobs which operate
> > in
> > > a bulk fashion on partitions of datasets. Such a style of processing
> > > typically incurs large delays in making data available to queries as
> well
> > > as lot of complexity in carefully partitioning datasets to guarantee
> > > latency SLAs.
> > >
> > > The need for fresher/faster analytics has increased enormously in the
> > past
> > > few years, as evidenced by the popularity of Stream processing systems
> > like
> > > Apache Spark, Apache Flink, and messaging systems like Apache Kafka. By
> > > using updateable state store to incrementally compute & instantly
> reflect
> > > new results to queries and using a “tailable” messaging bus to publish
> > > these results to other downstream jobs, such systems employ a different
> > > approach to building analytical dataset. Even though this approach
> yields
> > > low latency, the amount of data managed in such real-time data-marts is
> > > typically limited in comparison to the aforementioned longer term
> storage
> > > options. As a result, the overall data architecture has become more
> > complex
> > > with more moving parts and specialized systems, leading to duplication
> of
> > > data and a strain on usability.
> > >
> > > Hudi takes a hybrid approach. Instead of moving vast amounts of batch
> > data
> > > to streaming systems, we simply add the streaming primitives (upserts &
> > > incremental consumption) onto existing batch processing technologies.
> We
> > > believe that by adding some missing blocks to an existing Hadoop stack,
> > we
> > > are able to a provide similar capabilities right on top of Hadoop at a
> > > reduced cost and with an increased efficiency, greatly simplifying the
> > > overall architecture in the process.
> > >
> > > Hudi was originally developed at Uber (original name “Hoodie”) to
> address
> > > such broad inefficiencies in ingest & ETL & ML pipelines across Uber’s
> > data
> > > ecosystem that required the upsert & incremental consumption primitives
> > > supported by Hudi.
> > >
> > > == Rationale ==
> > >
> > > We truly believe the capabilities supported by Hudi would be
> increasingly
> > > useful for big-data ecosystems, as data volumes & need for faster data
> > > continue to increase. A detailed description of target use-cases can be
> > > found at https://uber.github.io/hudi/use_cases.html.
> > >
> > > Given our reliance on so many great Apache projects, we believe that
> the
> > > Apache way of open source community driven development will enable us
> to
> > > evolve Hudi in collaboration with a diverse set of contributors who can
> > > bring new ideas into the project.
> > >
> > > == Initial Goals ==
> > >
> > > * Move the existing codebase, website, documentation, and mailing lists
> > > to an Apache-hosted infrastructure.
> > > * Integrate with the Apache development process.
> > > * Ensure all dependencies are compliant with Apache License version 2.0.
> > > * Incrementally develop and release per Apache guidelines.
> > >
> > > == Current Status ==
> > >
> > > Hudi is a stable project used in production at Uber since 2016 and was
> > > open sourced under the Apache License, Version 2.0 in 2017. At Uber,
> > > Hudi manages 4000+ tables holding several petabytes, bringing our Hadoop
> > > warehouse from several hours of data delays to under 30 minutes, over
> > > the past two years. The source code is currently hosted at github.com (
> > > https://github.com/uber/hudi ), which will seed the Apache git
> > > repository.
> > >
> > > === Meritocracy ===
> > >
> > > We are fully committed to open, transparent, & meritocratic
> > > interactions with our community. In fact, one of the primary motivations
> > > for us to enter the incubation process is to be able to rely on Apache
> > > best practices that can ensure meritocracy. This will eventually help
> > > incorporate the best ideas back into the project & enable contributors
> > > to continue investing their time in the project. Current guidelines (
> > > https://uber.github.io/hudi/community.html#becoming-a-committer) have
> > > already put in place a meritocratic process which we will replace with
> > > Apache guidelines during incubation.
> > >
> > > === Community ===
> > >
> > > The Hudi community is fairly young, since the project was open sourced
> > > only in early 2017. Currently, Hudi has committers from Uber &
> > > Snowflake. We have a vibrant set of contributors (~46 members in our
> > > slack channel) including Shopify, DoubleVerify and Vungle & others, who
> > > have either submitted patches or filed issues with hudi pipelines either
> > > in early production or testing stages. Our primary goal during the
> > > incubation would be to grow the community and groom our existing active
> > > contributors into committers.
> > >
> > > === Core Developers ===
> > >
> > > Current core developers work at Uber & Snowflake. We are confident that
> > > incubation will help us grow a diverse community in an open &
> > > collaborative way.
> > >
> > > === Alignment ===
> > >
> > > Hudi is designed as a general purpose analytical storage abstraction
> > > that integrates with multiple Apache projects: Apache Spark, Apache
> > > Hive, Apache Hadoop. It was built using multiple Apache projects,
> > > including Apache Parquet and Apache Avro, that support near-real time
> > > analytics right on top of existing Apache Hadoop data lakes. Our sincere
> > > hope is that being a part of the Apache foundation would enable us to
> > > drive the future of the project in alignment with the other Apache
> > > projects for the benefit of thousands of organizations that already
> > > leverage these projects.
> > >
> > > == Known Risks ==
> > >
> > > === Orphaned products ===
> > >
> > > The risk of abandonment of Hudi is low. It is used in production at
> > > Uber for petabytes of data and other companies (mentioned in community
> > > section) are either evaluating or in the early stage for production use.
> > > Uber is committed to further development of the project and to investing
> > > resources towards the Apache processes & building the community during
> > > the incubation period.
> > >
> > > === Inexperience with Open Source ===
> > >
> > > Even though the initial committers are new to the Apache world, some
> > > have considerable open source experience - Vinoth Chandar (Linkedin
> > > voldemort, Chromium), Prasanna Rajaperumal (Cloudera experience),
> > > Zeeshan Qureshi (Chromium) & Balaji Varadarajan (Linkedin Databus). We
> > > have been successfully managing the current open source community
> > > answering questions and taking feedback already. Moreover, we hope to
> > > obtain guidance and mentorship from current ASF members to help us
> > > succeed with the incubation.
> > >
> > > === Length of Incubation ===
> > >
> > > We expect the project to be in incubation for 2 years or less.
> > >
> > > === Homogenous Developers ===
> > >
> > > Currently, the lead developers for Hudi are from Uber. However, we have
> > > an active set of early contributors/collaborators from Shopify,
> > > DoubleVerify and Vungle, that we hope will increase the diversity going
> > > forward. Once again, a primary motivation for incubation is to
> > > facilitate this in the Apache way.
> > >
> > > === Reliance on Salaried Developers ===
> > >
> > > Both the current committers & early contributors have several years of
> > > core expertise around data systems. Current committers are very
> > > passionate about the project and have already invested hundreds of hours
> > > towards helping & building the community. Thus, even with employer
> > > changes, we expect they will be able to actively engage in the project
> > > either because they will be working in similar areas even with newer
> > > employers or out of belief in the project.
> > >
> > > === Relationships with Other Apache Products ===
> > >
> > > To the best of our knowledge, there are no direct competing projects
> > > with Hudi that offer all of the feature set namely - upserts,
> > > incremental streams, efficient storage/file management, snapshot
> > > isolation/rollbacks - in a coherent way. However, some projects share
> > > common goals and technical elements and we will highlight them here.
> > > Hive ACID/Kudu both offer upsert capabilities without storage
> > > management/incremental streams. The recent Iceberg project offers
> > > similar snapshot isolation/rollbacks, but not upserts or other data
> > > plane features. A detailed comparison with their trade-offs can be found
> > > at https://uber.github.io/hudi/comparison.html.
> > >
> > > We are committed to open collaboration with such Apache projects and
> > > to incorporating changes to Hudi or contributing patches to other
> > > projects, with the goal of making it easier for the community at large
> > > to adopt these open source technologies.
> > >
> > > === Excessive Fascination with the Apache Brand ===
> > >
> > > This proposal is not for the purpose of generating publicity. We have
> > > already been doing talks/meetups independently that have helped us
> > > build our community. We are drawn towards Apache as a potential way of
> > > ensuring that our open source community management is successful early
> > > on so hudi can evolve into a broadly accepted--and used--method of
> > > managing data on Hadoop.
> > >
> > > == Documentation ==
> > > [1] Detailed documentation can be found at https://uber.github.io/hudi/
> > >
> > > == Initial Source ==
> > >
> > > The codebase is currently hosted on Github: https://github.com/uber/hudi.
> > > During incubation, the codebase will be migrated to an Apache
> > > infrastructure. The source code is already Apache 2.0 licensed.
> > >
> > > == Source and Intellectual Property Submission Plan ==
> > >
> > > Current code is Apache 2.0 licensed and the copyright is assigned to
> > > Uber. If the project enters the Incubator, Uber will transfer the source
> > > code & trademark ownership to the ASF via a Software Grant Agreement.
> > >
> > > == External Dependencies ==
> > >
> > > Non apache dependencies are listed below
> > >
> > > * JCommander (1.48) Apache-2.0
> > > * Kryo (4.0.0) BSD-2-Clause
> > > * Kryo (2.21) BSD-3-Clause
> > > * Jackson-annotations (2.6.4) Apache-2.0
> > > * Jackson-annotations (2.6.5) Apache-2.0
> > > * jackson-databind (2.6.4) Apache-2.0
> > > * jackson-databind (2.6.5) Apache-2.0
> > > * Jackson datatype: Guava (2.9.4) Apache-2.0
> > > * docker-java (3.1.0-rc-3) Apache-2.0
> > > * Guava: Google Core Libraries for Java (20.0) Apache-2.0
> > > * bijection-avro (0.9.2) Apache-2.0
> > > * com.twitter.common:objectsize (0.0.12) Apache-2.0
> > > * Ascii Table (0.2.5) Apache-2.0
> > > * config (3.0.0) Apache-2.0
> > > * utils (3.0.0) Apache-2.0
> > > * kafka-avro-serializer (3.0.0) Apache-2.0
> > > * kafka-schema-registry-client (3.0.0) Apache-2.0
> > > * Metrics Core (3.1.1) Apache-2.0
> > > * Graphite Integration for Metrics (3.1.1) Apache-2.0
> > > * Joda-Time (2.9.6) Apache-2.0
> > > * JUnit CPL-1.0
> > > * Awaitility (3.1.2) Apache-2.0
> > > * jersey-connectors-apache (2.17) GPL-2.0-only CDDL-1.0
> > > * jersey-container-servlet-core (2.17) GPL-2.0-only CDDL-1.0
> > > * jersey-core-server (2.17) GPL-2.0-only CDDL-1.0
> > > * htrace-core (3.0.4) Apache-2.0
> > > * Mockito (1.10.19) MIT
> > > * scalatest (3.0.1) Apache-2.0
> > > * Spring Shell (1.2.0.RELEASE) Apache-2.0
> > >
> > > All of them are Apache compatible
> > >
> > > == Cryptography ==
> > >
> > > No cryptographic libraries used
> > >
> > > == Required Resources ==
> > >
> > > === Mailing lists ===
> > >
> > > * private@hudi.incubator.apache.org (with moderated subscriptions)
> > > * dev@hudi.incubator.apache.org
> > > * commits@hudi.incubator.apache.org
> > > * user@hudi.incubator.apache.org
> > >
> > > === Git Repositories ===
> > >
> > > Git is the preferred source control system:
> > > git://git.apache.org/incubator-hudi
> > >
> > > === Issue Tracking ===
> > >
> > > We prefer to use the Apache gitbox integration to sync Github & Apache
> > > infrastructure, and rely on Github issues & pull requests for community
> > > engagement. If this is not possible, then we prefer JIRA: Hudi (HUDI)
> > >
> > > == Initial Committers ==
> > >
> > > * Vinoth Chandar (vinoth at uber dot com) (Uber)
> > > * Nishith Agarwal (nagarwal at uber dot com) (Uber)
> > > * Balaji Varadarajan (varadarb at uber dot com) (Uber)
> > > * Prasanna Rajaperumal (prasanna dot raj at gmail dot com) (Snowflake)
> > > * Zeeshan Qureshi (zeeshan dot qureshi at shopify dot com) (Shopify)
> > > * Anbu Cheeralan (alunarbeach at gmail dot com) (DoubleVerify)
> > >
> > > == Sponsors ==
> > >
> > > === Champion ===
> > > Julien Le Dem (julien at apache dot org)
> > >
> > > === Nominated Mentors ===
> > >
> > > * Luciano Resende (lresende at apache dot org)
> > > * Thomas Weise (thw at apache dot org)
> > > * Kishore Gopalakrishna (kishoreg at apache dot org)
> > > * Suneel Marthi (smarthi at apache dot org)
> > >
> > > === Sponsoring Entity ===
> > >
> > > The Incubator PMC

Re: [VOTE] Accept Hudi into the Apache Incubator

Posted by Felix Cheung <fe...@apache.org>.

+1


On Mon, Jan 14, 2019 at 3:20 AM Suneel Marthi
<su...@yahoo.com.invalid> wrote:

> +1
>
> Sent from my iPhone

Re: [VOTE] Accept Hudi into the Apache Incubator

Posted by Suneel Marthi <su...@yahoo.com.INVALID>.

+1 

Sent from my iPhone

> On Jan 13, 2019, at 5:34 PM, Thomas Weise <th...@apache.org> wrote:
> 
> Hi all,
> 
> Following the discussion of the Hudi proposal in [1], this is a vote
> on accepting Hudi into the Apache Incubator,
> per the ASF policy [2] and voting rules [3].
> 
> A vote for accepting a new Apache Incubator podling is a
> majority vote. Everyone is welcome to vote, only
> Incubator PMC member votes are binding.
> 
> This vote will run for at least 72 hours. Please VOTE as
> follows:
> 
> [ ] +1 Accept Hudi into the Apache Incubator
> [ ] +0 Abstain
> [ ] -1 Do not accept Hudi into the Apache Incubator because ...
> 
> The proposal is included below, but you can also access it on
> the wiki [4].
> 
> Thanks for reviewing and voting,
> Thomas
> 
> [1]
> https://lists.apache.org/thread.html/12e2bdaa095d68dae6f8731e473d3d43885783177d1b7e3ff2f65b6d@%3Cgeneral.incubator.apache.org%3E
> 
> [2]
> https://incubator.apache.org/policy/incubation.html#approval_of_proposal_by_sponsor
> 
> [3] http://www.apache.org/foundation/voting.html
> 
> [4] https://wiki.apache.org/incubator/HudiProposal
> 
> 
> 
> = Hudi Proposal =
> 
> == Abstract ==
> 
> Hudi is a big-data storage library, that provides atomic upserts and
> incremental data streams.
> 
> Hudi manages data stored in Apache Hadoop and other API compatible
> distributed file systems/cloud stores.
> 
> == Proposal ==
> 
> Hudi provides the ability to atomically upsert datasets with new values in
> near-real time, making data available quickly to existing query engines
> like Apache Hive, Apache Spark, & Presto. Additionally, Hudi provides a
> sequence of changes to a dataset from a given point-in-time to enable
> incremental data pipelines that yield greater efficiency & lower latency than
> their typical batch counterparts. By carefully managing number of files &
> sizes, Hudi greatly aids both query engines (e.g: always providing
> well-sized files) and underlying storage (e.g: HDFS NameNode memory
> consumption).
> 
> Hudi is largely implemented as an Apache Spark library that reads/writes
> data from/to Hadoop compatible filesystem. SQL queries on Hudi datasets are
> supported via specialized Apache Hadoop input formats, that understand
> Hudi’s storage layout. Currently, Hudi manages datasets using a combination
> of Apache Parquet & Apache Avro file/serialization formats.
> 
> == Background ==
> 
> Apache Hadoop distributed filesystem (HDFS) & other compatible cloud
> storage systems (e.g: Amazon S3, Google Cloud, Microsoft Azure) serve as
> longer term analytical storage for thousands of organizations. Typical
> analytical datasets are built by reading data from a source (e.g: upstream
> databases, messaging buses, or other datasets), transforming the data,
> writing results back to storage, & making it available for analytical
> queries--all of this typically accomplished in batch jobs which operate in
> a bulk fashion on partitions of datasets. Such a style of processing
> typically incurs large delays in making data available to queries as well
> as a lot of complexity in carefully partitioning datasets to guarantee
> latency SLAs.
> 
> The need for fresher/faster analytics has increased enormously in the past
> few years, as evidenced by the popularity of Stream processing systems like
> Apache Spark, Apache Flink, and messaging systems like Apache Kafka. By
> using an updateable state store to incrementally compute & instantly reflect
> new results to queries and using a “tailable” messaging bus to publish
> these results to other downstream jobs, such systems employ a different
> approach to building analytical datasets. Even though this approach yields
> low latency, the amount of data managed in such real-time data-marts is
> typically limited in comparison to the aforementioned longer term storage
> options. As a result, the overall data architecture has become more complex
> with more moving parts and specialized systems, leading to duplication of
> data and a strain on usability.
> 
> Hudi takes a hybrid approach. Instead of moving vast amounts of batch data
> to streaming systems, we simply add the streaming primitives (upserts &
> incremental consumption) onto existing batch processing technologies. We
> believe that by adding some missing blocks to an existing Hadoop stack, we
> are able to provide similar capabilities right on top of Hadoop at a
> reduced cost and with an increased efficiency, greatly simplifying the
> overall architecture in the process.
> 
> Hudi was originally developed at Uber (original name “Hoodie”) to address
> such broad inefficiencies in ingest & ETL & ML pipelines across Uber’s data
> ecosystem that required the upsert & incremental consumption primitives
> supported by Hudi.
> 
> == Rationale ==
> 
> We truly believe the capabilities supported by Hudi would be increasingly
> useful for big-data ecosystems, as data volumes & need for faster data
> continue to increase. A detailed description of target use-cases can be
> found at https://uber.github.io/hudi/use_cases.html.
> 
> Given our reliance on so many great Apache projects, we believe that the
> Apache way of open source community driven development will enable us to
> evolve Hudi in collaboration with a diverse set of contributors who can
> bring new ideas into the project.
> 
> == Initial Goals ==
> 
> * Move the existing codebase, website, documentation, and mailing lists to
> an Apache-hosted infrastructure.
> * Integrate with the Apache development process.
> * Ensure all dependencies are compliant with Apache License version 2.0.
> * Incrementally develop and release per Apache guidelines.
> 
> == Current Status ==
> 
> Hudi is a stable project used in production at Uber since 2016 and was open
> sourced under the Apache License, Version 2.0 in 2017. At Uber, Hudi
> manages 4000+ tables holding several petabytes, bringing our Hadoop
> warehouse from several hours of data delays to under 30 minutes, over the
> past two years. The source code is currently hosted at github.com (
> https://github.com/uber/hudi ), which will seed the Apache git repository.
> 
> === Meritocracy ===
> 
> We are fully committed to open, transparent, & meritocratic interactions
> with our community. In fact, one of the primary motivations for us to enter
> the incubation process is to be able to rely on Apache best practices that
> can ensure meritocracy. This will eventually help incorporate the best
> ideas back into the project & enable contributors to continue investing
> their time in the project. Current guidelines (
> https://uber.github.io/hudi/community.html#becoming-a-committer) have
> already put in place a meritocratic process which we will replace with
> Apache guidelines during incubation.
> 
> === Community ===
> 
> The Hudi community is fairly young, since the project was open sourced only in
> early 2017. Currently, Hudi has committers from Uber & Snowflake. We have a
> vibrant set of contributors (~46 members in our slack channel) including
> Shopify, DoubleVerify and Vungle & others, who have either submitted
> patches or filed issues with Hudi pipelines either in early production or
> testing stages. Our primary goal during the incubation would be to grow the
> community and groom our existing active contributors into committers.
> 
> === Core Developers ===
> 
> Current core developers work at Uber & Snowflake. We are confident that
> incubation will help us grow a diverse community in an open & collaborative
> way.
> 
> === Alignment ===
> 
> Hudi is designed as a general purpose analytical storage abstraction that
> integrates with multiple Apache projects: Apache Spark, Apache Hive, Apache
> Hadoop. It was built using multiple Apache projects, including Apache
> Parquet and Apache Avro, that support near-real time analytics right on top
> of existing Apache Hadoop data lakes. Our sincere hope is that being a part
> of the Apache foundation would enable us to drive the future of the project
> in alignment with the other Apache projects for the benefit of thousands of
> organizations that already leverage these projects.
> 
> == Known Risks ==
> 
> === Orphaned products ===
> 
> The risk of abandonment of Hudi is low. It is used in production at Uber
> for petabytes of data and other companies (mentioned in community section)
> are either evaluating it or in the early stages of production use. Uber is
> committed to further development of the project and to investing resources
> towards the Apache processes & building the community during the incubation
> period.
> 
> === Inexperience with Open Source ===
> 
> Even though the initial committers are new to the Apache world, some have
> considerable open source experience - Vinoth Chandar (LinkedIn Voldemort,
> Chromium), Prasanna Rajaperumal (Cloudera experience), Zeeshan Qureshi
> (Chromium) & Balaji Varadarajan (LinkedIn Databus). We have been
> successfully managing the current open source community answering questions
> and taking feedback already. Moreover, we hope to obtain guidance and
> mentorship from current ASF members to help us succeed with the incubation.
> 
> === Length of Incubation ===
> 
> We expect the project to be in incubation for 2 years or less.
> 
> === Homogenous Developers ===
> 
> Currently, the lead developers for Hudi are from Uber. However, we have an
> active set of early contributors/collaborators from Shopify, DoubleVerify
> and Vungle, that we hope will increase the diversity going forward. Once
> again, a primary motivation for incubation is to facilitate this in the
> Apache way.
> 
> === Reliance on Salaried Developers ===
> 
> Both the current committers & early contributors have several years of core
> expertise around data systems. Current committers are very passionate about
> the project and have already invested hundreds of hours towards helping &
> building the community. Thus, even with employer changes, we expect they
> will be able to actively engage in the project either because they will be
> working in similar areas even with newer employers or out of belief in the
> project.
> 
> === Relationships with Other Apache Products ===
> 
> To the best of our knowledge, no project competes directly with Hudi by
> offering its full feature set (upserts, incremental streams, efficient
> storage/file management, snapshot isolation/rollbacks) in a coherent
> way. However, some projects share common goals and technical
> elements and we will highlight them here. Hive ACID/Kudu both offer upsert
> capabilities without storage management/incremental streams. The recent
> Iceberg project offers similar snapshot isolation/rollbacks, but not
> upserts or other data plane features. A detailed comparison with their
> trade-offs can be found at https://uber.github.io/hudi/comparison.html.
> 
> We are committed to open collaboration with such Apache projects and to
> incorporating changes to Hudi or contributing patches to other projects,
> with the goal of making it easier for the community at large to adopt
> these open source technologies.
> 
> === Excessive Fascination with the Apache Brand ===
> 
> This proposal is not for the purpose of generating publicity. We have
> already been doing talks/meetups independently that have helped us build
> our community. We are drawn towards Apache as a potential way of ensuring
> that our open source community management is successful early on so Hudi
> can evolve into a broadly accepted--and used--method of managing data on
> Hadoop.
> 
> == Documentation ==
> [1] Detailed documentation can be found at https://uber.github.io/hudi/
> 
> == Initial Source ==
> 
> The codebase is currently hosted on Github: https://github.com/uber/hudi .
> During incubation, the codebase will be migrated to an Apache
> infrastructure. The source code is already Apache 2.0 licensed.
> 
> == Source and Intellectual Property Submission Plan ==
> 
> Current code is Apache 2.0 licensed and the copyright is assigned to Uber.
> If the project enters the incubator, Uber will transfer the source code &
> trademark ownership to the ASF via a Software Grant Agreement.
> 
> == External Dependencies ==
> 
> Non-Apache dependencies are listed below:
> 
> * JCommander (1.48) Apache-2.0
> * Kryo (4.0.0) BSD-2-Clause
> * Kryo (2.21) BSD-3-Clause
> * Jackson-annotations (2.6.4) Apache-2.0
> * Jackson-annotations (2.6.5) Apache-2.0
> * jackson-databind (2.6.4) Apache-2.0
> * jackson-databind (2.6.5) Apache-2.0
> * Jackson datatype: Guava (2.9.4) Apache-2.0
> * docker-java (3.1.0-rc-3) Apache-2.0
> * Guava: Google Core Libraries for Java (20.0) Apache-2.0
> * bijection-avro (0.9.2) Apache-2.0
> * com.twitter.common:objectsize (0.0.12) Apache-2.0
> * Ascii Table (0.2.5) Apache-2.0
> * config (3.0.0) Apache-2.0
> * utils (3.0.0) Apache-2.0
> * kafka-avro-serializer (3.0.0) Apache-2.0
> * kafka-schema-registry-client (3.0.0) Apache-2.0
> * Metrics Core (3.1.1) Apache-2.0
> * Graphite Integration for Metrics (3.1.1) Apache-2.0
> * Joda-Time (2.9.6) Apache-2.0
> * JUnit CPL-1.0
> * Awaitility (3.1.2) Apache-2.0
> * jersey-connectors-apache (2.17) GPL-2.0-only CDDL-1.0
> * jersey-container-servlet-core (2.17) GPL-2.0-only CDDL-1.0
> * jersey-core-server (2.17) GPL-2.0-only CDDL-1.0
> * htrace-core (3.0.4) Apache-2.0
> * Mockito (1.10.19) MIT
> * scalatest (3.0.1) Apache-2.0
> * Spring Shell (1.2.0.RELEASE) Apache-2.0
> 
> All of them are Apache compatible
> 
> == Cryptography ==
> 
> No cryptographic libraries used
> 
> == Required Resources ==
> 
> === Mailing lists ===
> 
> * private@hudi.incubator.apache.org (with moderated subscriptions)
> * dev@hudi.incubator.apache.org
> * commits@hudi.incubator.apache.org
> * user@hudi.incubator.apache.org
> 
> === Git Repositories ===
> 
> Git is the preferred source control system: git://
> git.apache.org/incubator-hudi
> 
> === Issue Tracking ===
> 
> We prefer to use the Apache gitbox integration to sync Github & Apache
> infrastructure, and rely on Github issues & pull requests for community
> engagement. If this is not possible, then we prefer JIRA: Hudi (HUDI)
> 
> == Initial Committers ==
> 
> * Vinoth Chandar (vinoth at uber dot com) (Uber)
> * Nishith Agarwal (nagarwal at uber dot com) (Uber)
> * Balaji Varadarajan (varadarb at uber dot com) (Uber)
> * Prasanna Rajaperumal (prasanna dot raj at gmail dot com) (Snowflake)
> * Zeeshan Qureshi (zeeshan dot qureshi at shopify dot com) (Shopify)
> * Anbu Cheeralan (alunarbeach at gmail dot com) (DoubleVerify)
> 
> == Sponsors ==
> 
> === Champion ===
> Julien Le Dem (julien at apache dot org)
> 
> === Nominated Mentors ===
> 
> * Luciano Resende (lresende at apache dot org)
> * Thomas Weise (thw at apache dot org)
> * Kishore Gopalakrishna (kishoreg at apache dot org)
> * Suneel Marthi (smarthi at apache dot org)
> 
> === Sponsoring Entity ===
> 
> The Incubator PMC



Re: [VOTE] Accept Hudi into the Apache Incubator

Posted by Furkan KAMACI <fu...@gmail.com>.

+1

On Wed, 16 Jan 2019 at 01:40, Vinayakumar B <vi...@apache.org> wrote:

> +1
>
> - Vinay
>
> On Tue, 15 Jan 2019, 10:56 am Hongtao Gao <hanahmily@gmail.com> wrote:
>
> > +1
> >
> > Hongtao Gao
> >

Re: [VOTE] Accept Hudi into the Apache Incubator

Posted by Vinayakumar B <vi...@apache.org>.

+1

- Vinay

On Tue, 15 Jan 2019, 10:56 am Hongtao Gao <hanahmily@gmail.com> wrote:

> +1
>
> Hongtao Gao
>
>
> Thomas Weise <th...@apache.org> wrote on Mon, 14 Jan 2019 at 6:34 AM:
>
> > Hi all,
> >
> > Following the discussion of the Hudi proposal in [1], this is a vote
> > on accepting Hudi into the Apache Incubator,
> > per the ASF policy [2] and voting rules [3].
> >
> > A vote for accepting a new Apache Incubator podling is a
> > majority vote. Everyone is welcome to vote, only
> > Incubator PMC member votes are binding.
> >
> > This vote will run for at least 72 hours. Please VOTE as
> > follows:
> >
> > [ ] +1 Accept Hudi into the Apache Incubator
> > [ ] +0 Abstain
> > [ ] -1 Do not accept Hudi into the Apache Incubator because ...
> >
> > The proposal is included below, but you can also access it on
> > the wiki [4].
> >
> > Thanks for reviewing and voting,
> > Thomas
> >
> > [1]
> >
> >
> https://lists.apache.org/thread.html/12e2bdaa095d68dae6f8731e473d3d43885783177d1b7e3ff2f65b6d@%3Cgeneral.incubator.apache.org%3E
> >
> > [2]
> >
> >
> https://incubator.apache.org/policy/incubation.html#approval_of_proposal_by_sponsor
> >
> > [3] http://www.apache.org/foundation/voting.html
> >
> > [4] https://wiki.apache.org/incubator/HudiProposal
> >
> >
> >
> > = Hudi Proposal =
> >
> > == Abstract ==
> >
> > Hudi is a big-data storage library, that provides atomic upserts and
> > incremental data streams.
> >
> > Hudi manages data stored in Apache Hadoop and other API compatible
> > distributed file systems/cloud stores.
> >
> > == Proposal ==
> >
> > Hudi provides the ability to atomically upsert datasets with new values
> in
> > near-real time, making data available quickly to existing query engines
> > like Apache Hive, Apache Spark, & Presto. Additionally, Hudi provides a
> > sequence of changes to a dataset from a given point-in-time to enable
> > incremental data pipelines that yield greater efficiency & lower latency than
> > their typical batch counterparts. By carefully managing number of files &
> > sizes, Hudi greatly aids both query engines (e.g: always providing
> > well-sized files) and underlying storage (e.g: HDFS NameNode memory
> > consumption).
> >
> > Hudi is largely implemented as an Apache Spark library that reads/writes
> > data from/to a Hadoop-compatible filesystem. SQL queries on Hudi datasets
> are
> > supported via specialized Apache Hadoop input formats, that understand
> > Hudi’s storage layout. Currently, Hudi manages datasets using a
> combination
> > of Apache Parquet & Apache Avro file/serialization formats.
> >
> > == Background ==
> >
> > Apache Hadoop distributed filesystem (HDFS) & other compatible cloud
> > storage systems (e.g: Amazon S3, Google Cloud, Microsoft Azure) serve as
> > longer term analytical storage for thousands of organizations. Typical
> > analytical datasets are built by reading data from a source (e.g:
> upstream
> > databases, messaging buses, or other datasets), transforming the data,
> > writing results back to storage, & making it available for analytical
> > queries--all of this typically accomplished in batch jobs which operate
> in
> > a bulk fashion on partitions of datasets. Such a style of processing
> > typically incurs large delays in making data available to queries as well
> > as a lot of complexity in carefully partitioning datasets to guarantee
> > latency SLAs.
> >
> > The need for fresher/faster analytics has increased enormously in the
> past
> > few years, as evidenced by the popularity of Stream processing systems
> like
> > Apache Spark, Apache Flink, and messaging systems like Apache Kafka. By
> > using an updateable state store to incrementally compute & instantly reflect
> > new results to queries and using a “tailable” messaging bus to publish
> > these results to other downstream jobs, such systems employ a different
> > approach to building analytical datasets. Even though this approach yields
> > low latency, the amount of data managed in such real-time data-marts is
> > typically limited in comparison to the aforementioned longer term storage
> > options. As a result, the overall data architecture has become more
> complex
> > with more moving parts and specialized systems, leading to duplication of
> > data and a strain on usability.
> >
> > Hudi takes a hybrid approach. Instead of moving vast amounts of batch
> data
> > to streaming systems, we simply add the streaming primitives (upserts &
> > incremental consumption) onto existing batch processing technologies. We
> > believe that by adding some missing blocks to an existing Hadoop stack,
> we
> > are able to provide similar capabilities right on top of Hadoop at a
> > reduced cost and with an increased efficiency, greatly simplifying the
> > overall architecture in the process.
> >
> > Hudi was originally developed at Uber (original name “Hoodie”) to address
> > such broad inefficiencies in ingest & ETL & ML pipelines across Uber’s
> data
> > ecosystem that required the upsert & incremental consumption primitives
> > supported by Hudi.
> >
> > == Rationale ==
> >
> > We truly believe the capabilities supported by Hudi would be increasingly
> > useful for big-data ecosystems, as data volumes & the need for faster data
> > continue to increase. A detailed description of target use-cases can be
> > found at https://uber.github.io/hudi/use_cases.html.
> >
> > Given our reliance on so many great Apache projects, we believe that the
> > Apache way of open source community driven development will enable us to
> > evolve Hudi in collaboration with a diverse set of contributors who can
> > bring new ideas into the project.
> >
> > == Initial Goals ==
> >
> >  * Move the existing codebase, website, documentation, and mailing lists
> to
> > an Apache-hosted infrastructure.
> >  * Integrate with the Apache development process.
> >  * Ensure all dependencies are compliant with Apache License version 2.0.
> >  * Incrementally develop and release per Apache guidelines.
> >
> > == Current Status ==
> >
> > Hudi is a stable project used in production at Uber since 2016 and was
> open
> > sourced under the Apache License, Version 2.0 in 2017. At Uber, Hudi
> > manages 4000+ tables holding several petabytes, bringing our Hadoop
> > warehouse from several hours of data delays to under 30 minutes, over the
> > past two years. The source code is currently hosted at github.com (
> > https://github.com/uber/hudi ), which will seed the Apache git
> repository.
> >
> > === Meritocracy ===
> >
> > We are fully committed to open, transparent, & meritocratic interactions
> > with our community. In fact, one of the primary motivations for us to
> enter
> > the incubation process is to be able to rely on Apache best practices
> that
> > can ensure meritocracy. This will eventually help incorporate the best
> > ideas back into the project & enable contributors to continue investing
> > their time in the project. Current guidelines (
> > https://uber.github.io/hudi/community.html#becoming-a-committer) have
> > already put in place a meritocratic process which we will replace with
> > Apache guidelines during incubation.
> >
> > === Community ===
> >
> > Hudi community is fairly young, since the project was open sourced only
> in
> > early 2017. Currently, Hudi has committers from Uber & Snowflake. We
> have a
> > vibrant set of contributors (~46 members in our slack channel) including
> > Shopify, DoubleVerify and Vungle & others, who have either submitted
> > patches or filed issues with hudi pipelines either in early production or
> > testing stages. Our primary goal during the incubation would be to grow
> the
> > community and groom our existing active contributors into committers.
> >
> > === Core Developers ===
> >
> > Current core developers work at Uber & Snowflake. We are confident that
> > incubation will help us grow a diverse community in a open &
> collaborative
> > way.
> >
> > === Alignment ===
> >
> > Hudi is designed as a general purpose analytical storage abstraction that
> > integrates with multiple Apache projects: Apache Spark, Apache Hive,
> Apache
> > Hadoop. It was built using multiple Apache projects, including Apache
> > Parquet and Apache Avro, that support near-real time analytics right on
> top
> > of existing Apache Hadoop data lakes. Our sincere hope is that being a
> part
> > of the Apache foundation would enable us to drive the future of the
> project
> > in alignment with the other Apache projects for the benefit of thousands
> of
> > organizations that already leverage these projects.
> >
> > == Known Risks ==
> >
> > === Orphaned products ===
> >
> > The risk of abandonment of Hudi is low. It is used in production at Uber
> > for petabytes of data and other companies (mentioned in community
> section)
> > are either evaluating or in the early stage for production use. Uber is
> > committed to further development of the project and invest resources
> > towards the Apache processes & building the community, during incubation
> > period.
> >
> > === Inexperience with Open Source ===
> >
> > Even though the initial committers are new to the Apache world, some have
> > considerable open source experience - Vinoth Chandar (Linkedin voldemort,
> > Chromium), Prasanna Rajaperumal (Cloudera experience), Zeeshan Qureshi
> > (Chromium) & Balaji Varadarajan (Linkedin Databus). We have been
> > successfully managing the current open source community answering
> questions
> > and taking feedback already. Moreover, we hope to obtain guidance and
> > mentorship from current ASF members to help us succeed with the
> incubation.
> >
> > === Length of Incubation ===
> >
> > We expect the project be in incubation for 2 years or less.
> >
> > === Homogenous Developers ===
> >
> > Currently, the lead developers for Hudi are from Uber. However, we have
> an
> > active set of early contributors/collaborators from Shopify, DoubleVerify
> > and Vungle, that we hope will increase the diversity going forward. Once
> > again, a primary motivation for incubation is to facilitate this in the
> > Apache way.
> >
> > === Reliance on Salaried Developers ===
> >
> > Both the current committers & early contributors have several years of
> core
> > expertise around data systems. Current committers are very passionate
> about
> > the project and have already invested hundreds of hours towards helping &
> > building the community. Thus, even with employer changes, we expect they
> > will be able to actively engage in the project either because they will
> be
> > working in similar areas even with newer employers or out of belief in
> the
> > project.
> >
> > === Relationships with Other Apache Products ===
> >
> > To the best of our knowledge, there are no direct competing projects with
> > Hudi that offer all of the feature set namely - upserts, incremental
> > streams, efficient storage/file management, snapshot isolation/rollbacks
> -
> > in a coherent way. However, some projects share common goals and
> technical
> > elements and we will highlight them here. Hive ACID/Kudu both offer
> upsert
> > capabilities without storage management/incremental streams. The recent
> > Iceberg project offers similar snapshot isolation/rollbacks, but not
> > upserts or other data plane features. A detailed comparison with their
> > trade-offs can be found at https://uber.github.io/hudi/comparison.html.
> >
> > We are committed to open collaboration with such Apache projects and
> > incorporate changes to Hudi or contribute patches to other projects, with
> > the goal of making it easier for the community at large, to adopt these
> > open source technologies.
> >
> > === Excessive Fascination with the Apache Brand ===
> >
> > This proposal is not for the purpose of generating publicity. We have
> > already been doing talks/meetups independently that have helped us build
> > our community. We are drawn towards Apache as a potential way of ensuring
> > that our open source community management is successful early on so  hudi
> > can evolve into a broadly accepted--and used--method of managing data on
> > Hadoop.
> >
> > == Documentation ==
> > [1] Detailed documentation can be found at https://uber.github.io/hudi/
> >
> > == Initial Source ==
> >
> > The codebase is currently hosted on Github: https://github.com/uber/hudi
> .
> > During incubation, the codebase will be migrated to an Apache
> > infrastructure. The source code already has an Apache 2.0 licensed.
> >
> > == Source and Intellectual Property Submission Plan ==
> >
> > Current code is Apache 2.0 licensed and the copyright is assigned to
> Uber.
> > If the project enters incubator, Uber will transfer the source code &
> > trademark ownership to ASF via a Software Grant Agreement
> >
> > == External Dependencies ==
> >
> > Non apache dependencies are listed below
> >
> >  * JCommander (1.48) Apache-2.0
> >  * Kryo (4.0.0) BSD-2-Clause
> >  * Kryo (2.21) BSD-3-Clause
> >  * Jackson-annotations (2.6.4) Apache-2.0
> >  * Jackson-annotations (2.6.5) Apache-2.0
> >  * jackson-databind (2.6.4) Apache-2.0
> >  * jackson-databind (2.6.5) Apache-2.0
> >  * Jackson datatype: Guava (2.9.4) Apache-2.0
> >  * docker-java (3.1.0-rc-3) Apache-2.0
> >  * Guava: Google Core Libraries for Java (20.0) Apache-2.0
> >  * bijection-avro (0.9.2) Apache-2.0
> >  * com.twitter.common:objectsize (0.0.12) Apache-2.0
> >  * Ascii Table (0.2.5) Apache-2.0
> >  * config (3.0.0) Apache-2.0
> >  * utils (3.0.0) Apache-2.0
> >  * kafka-avro-serializer (3.0.0) Apache-2.0
> >  * kafka-schema-registry-client (3.0.0) Apache-2.0
> >  * Metrics Core (3.1.1) Apache-2.0
> >  * Graphite Integration for Metrics (3.1.1) Apache-2.0
> >  * Joda-Time (2.9.6) Apache-2.0
> >  * JUnit CPL-1.0
> >  * Awaitility (3.1.2) Apache-2.0
> >  * jersey-connectors-apache (2.17) GPL-2.0-only CDDL-1.0
> >  * jersey-container-servlet-core (2.17) GPL-2.0-only CDDL-1.0
> >  * jersey-core-server (2.17) GPL-2.0-only CDDL-1.0
> >  * htrace-core (3.0.4) Apache-2.0
> >  * Mockito (1.10.19) MIT
> >  * scalatest (3.0.1) Apache-2.0
> >  * Spring Shell (1.2.0.RELEASE) Apache-2.0
> >
> > All of them are Apache compatible
> >
> > == Cryptography ==
> >
> > No cryptographic libraries used
> >
> > == Required Resources ==
> >
> > === Mailing lists ===
> >
> >  * private@hudi.incubator.apache.org (with moderated subscriptions)
> >  * dev@hudi.incubator.apache.org
> >  * commits@hudi.incubator.apache.org
> >  * user@hudi.incubator.apache.org
> >
> > === Git Repositories ===
> >
> > Git is the preferred source control system: git://
> > git.apache.org/incubator-hudi
> >
> > === Issue Tracking ===
> >
> > We prefer to use the Apache gitbox integration to sync Github & Apache
> > infrastructure, and rely on Github issues & pull requests for community
> > engagement. If this is not possible, then we prefer JIRA: Hudi (HUDI)
> >
> > == Initial Committers ==
> >
> >  * Vinoth Chandar (vinoth at uber dot com) (Uber)
> >  * Nishith Agarwal (nagarwal at uber dot com) (Uber)
> >  * Balaji Varadarajan (varadarb at uber dot com) (Uber)
> >  * Prasanna Rajaperumal (prasanna dot raj at gmail dot com) (Snowflake)
> >  * Zeeshan Qureshi (zeeshan dot qureshi at shopify dot com) (Shopify)
> >  * Anbu Cheeralan (alunarbeach at gmail dot com) (DoubleVerify)
> >
> > == Sponsors ==
> >
> > === Champion ===
> > Julien Le Dem (julien at apache dot org)
> >
> > === Nominated Mentors ===
> >
> >  * Luciano Resende (lresende at apache dot org)
> >  * Thomas Weise (thw at apache dot org
> >  * Kishore Gopalakrishna (kishoreg at apache dot org)
> >  * Suneel Marthi (smarthi at apache dot org)
> >
> > === Sponsoring Entity ===
> >
> > The Incubator PMC
> >
>

Re: [VOTE] Accept Hudi into the Apache Incubator

Posted by Hongtao Gao <ha...@gmail.com>.

+1

Hongtao Gao



[RESULT] [VOTE] Accept Hudi into the Apache Incubator

Posted by Thomas Weise <th...@apache.org>.

The vote for accepting Hudi into the Apache Incubator passes with 11
binding +1 votes, 5 non-binding +1 votes and no other votes.

Thanks for voting!

+1 votes:

Luciano Resende*
Pierre Smits
Suneel Marthi*
Felix Cheung*
Kenneth Knowles*
Mohammad Islam
Mayank Bansal
Jakob Homan*
Akira Ajisaka*
Gosling Von*
Matt Sicker*
Brahma Reddy Battula
Hongtao Gao
Vinayakumar B*
Furkan Kamaci*
Thomas Weise*

* = binding


On Sun, Jan 13, 2019 at 2:34 PM Thomas Weise <th...@apache.org> wrote:

> Hi all,
>
> Following the discussion of the Hudi proposal in [1], this is a vote
> on accepting Hudi into the Apache Incubator,
> per the ASF policy [2] and voting rules [3].
>
> A vote for accepting a new Apache Incubator podling is a
> majority vote. Everyone is welcome to vote, only
> Incubator PMC member votes are binding.
>
> This vote will run for at least 72 hours. Please VOTE as
> follows:
>
> [ ] +1 Accept Hudi into the Apache Incubator
> [ ] +0 Abstain
> [ ] -1 Do not accept Hudi into the Apache Incubator because ...
>
> The proposal is included below, but you can also access it on
> the wiki [4].
>
> Thanks for reviewing and voting,
> Thomas
>
> [1]
> https://lists.apache.org/thread.html/12e2bdaa095d68dae6f8731e473d3d43885783177d1b7e3ff2f65b6d@%3Cgeneral.incubator.apache.org%3E
>
> [2]
> https://incubator.apache.org/policy/incubation.html#approval_of_proposal_by_sponsor
>
> [3] http://www.apache.org/foundation/voting.html
>
> [4] https://wiki.apache.org/incubator/HudiProposal
>
>
>
> = Hudi Proposal =
>
> == Abstract ==
>
> Hudi is a big-data storage library, that provides atomic upserts and
> incremental data streams.
>
> Hudi manages data stored in Apache Hadoop and other API compatible
> distributed file systems/cloud stores.
>
> == Proposal ==
>
> Hudi provides the ability to atomically upsert datasets with new values in
> near-real time, making data available quickly to existing query engines
> like Apache Hive, Apache Spark, & Presto. Additionally, Hudi provides a
> sequence of changes to a dataset from a given point-in-time to enable
> incremental data pipelines that yield greater efficiency & latency than
> their typical batch counterparts. By carefully managing number of files &
> sizes, Hudi greatly aids both query engines (e.g: always providing
> well-sized files) and underlying storage (e.g: HDFS NameNode memory
> consumption).
>
> Hudi is largely implemented as an Apache Spark library that reads/writes
> data from/to Hadoop compatible filesystem. SQL queries on Hudi datasets are
> supported via specialized Apache Hadoop input formats, that understand
> Hudi’s storage layout. Currently, Hudi manages datasets using a combination
> of Apache Parquet & Apache Avro file/serialization formats.
>
> == Background ==
>
> Apache Hadoop distributed filesystem (HDFS) & other compatible cloud
> storage systems (e.g: Amazon S3, Google Cloud, Microsoft Azure) serve as
> longer term analytical storage for thousands of organizations. Typical
> analytical datasets are built by reading data from a source (e.g: upstream
> databases, messaging buses, or other datasets), transforming the data,
> writing results back to storage, & making it available for analytical
> queries--all of this typically accomplished in batch jobs which operate in
> a bulk fashion on partitions of datasets. Such a style of processing
> typically incurs large delays in making data available to queries as well
> as lot of complexity in carefully partitioning datasets to guarantee
> latency SLAs.
>
> The need for fresher/faster analytics has increased enormously in the past
> few years, as evidenced by the popularity of Stream processing systems like
> Apache Spark, Apache Flink, and messaging systems like Apache Kafka. By
> using updateable state store to incrementally compute & instantly reflect
> new results to queries and using a “tailable” messaging bus to publish
> these results to other downstream jobs, such systems employ a different
> approach to building analytical dataset. Even though this approach yields
> low latency, the amount of data managed in such real-time data-marts is
> typically limited in comparison to the aforementioned longer term storage
> options. As a result, the overall data architecture has become more complex
> with more moving parts and specialized systems, leading to duplication of
> data and a strain on usability.
>
> Hudi takes a hybrid approach. Instead of moving vast amounts of batch data
> to streaming systems, we simply add the streaming primitives (upserts &
> incremental consumption) onto existing batch processing technologies. We
> believe that by adding some missing blocks to an existing Hadoop stack, we
> are able to a provide similar capabilities right on top of Hadoop at a
> reduced cost and with an increased efficiency, greatly simplifying the
> overall architecture in the process.
>
> Hudi was originally developed at Uber (original name “Hoodie”) to address
> such broad inefficiencies in ingest & ETL & ML pipelines across Uber’s data
> ecosystem that required the upsert & incremental consumption primitives
> supported by Hudi.
>

Re: [VOTE] Accept Hudi into the Apache Incubator

Posted by Brahma Reddy Battula <br...@apache.org>.

+1 (non-binding).

Best choice for incremental processing
<https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop>.



--Brahma Reddy Battula

Re: [VOTE] Accept Hudi into the Apache Incubator

Posted by Pierre Smits <pi...@apache.org>.

+1

On Mon, 14 Jan 2019 at 00:02 Luciano Resende <lu...@gmail.com> wrote:

> +1 (binding)
>
> On Sun, Jan 13, 2019 at 2:34 PM Thomas Weise <th...@apache.org> wrote:
> >
> > Hi all,
> >
> > Following the discussion of the Hudi proposal in [1], this is a vote
> > on accepting Hudi into the Apache Incubator,
> > per the ASF policy [2] and voting rules [3].
> >
> > A vote for accepting a new Apache Incubator podling is a
> > majority vote. Everyone is welcome to vote, only
> > Incubator PMC member votes are binding.
> >
> > This vote will run for at least 72 hours. Please VOTE as
> > follows:
> >
> > [ ] +1 Accept Hudi into the Apache Incubator
> > [ ] +0 Abstain
> > [ ] -1 Do not accept Hudi into the Apache Incubator because ...
> >
> > The proposal is included below, but you can also access it on
> > the wiki [4].
> >
> > Thanks for reviewing and voting,
> > Thomas
> >
> > [1]
> >
> https://lists.apache.org/thread.html/12e2bdaa095d68dae6f8731e473d3d43885783177d1b7e3ff2f65b6d@%3Cgeneral.incubator.apache.org%3E
> >
> > [2]
> >
> https://incubator.apache.org/policy/incubation.html#approval_of_proposal_by_sponsor
> >
> > [3] http://www.apache.org/foundation/voting.html
> >
> > [4] https://wiki.apache.org/incubator/HudiProposal
> >
> >
> >
> > = Hudi Proposal =
> >
> > == Abstract ==
> >
> > Hudi is a big-data storage library, that provides atomic upserts and
> > incremental data streams.
> >
> > Hudi manages data stored in Apache Hadoop and other API compatible
> > distributed file systems/cloud stores.
> >
> > == Proposal ==
> >
> > Hudi provides the ability to atomically upsert datasets with new values
> in
> > near-real time, making data available quickly to existing query engines
> > like Apache Hive, Apache Spark, & Presto. Additionally, Hudi provides a
> > sequence of changes to a dataset from a given point-in-time to enable
> > incremental data pipelines that yield greater efficiency & latency than
> > their typical batch counterparts. By carefully managing number of files &
> > sizes, Hudi greatly aids both query engines (e.g: always providing
> > well-sized files) and underlying storage (e.g: HDFS NameNode memory
> > consumption).
> >
> > Hudi is largely implemented as an Apache Spark library that reads/writes
> > data from/to Hadoop compatible filesystem. SQL queries on Hudi datasets
> are
> > supported via specialized Apache Hadoop input formats, that understand
> > Hudi’s storage layout. Currently, Hudi manages datasets using a
> combination
> > of Apache Parquet & Apache Avro file/serialization formats.
> >
> > == Background ==
> >
> > Apache Hadoop distributed filesystem (HDFS) & other compatible cloud
> > storage systems (e.g: Amazon S3, Google Cloud, Microsoft Azure) serve as
> > longer term analytical storage for thousands of organizations. Typical
> > analytical datasets are built by reading data from a source (e.g:
> upstream
> > databases, messaging buses, or other datasets), transforming the data,
> > writing results back to storage, & making it available for analytical
> > queries--all of this typically accomplished in batch jobs which operate
> in
> > a bulk fashion on partitions of datasets. Such a style of processing
> > typically incurs large delays in making data available to queries as well
> > as lot of complexity in carefully partitioning datasets to guarantee
> > latency SLAs.
> >
> > The need for fresher/faster analytics has increased enormously in the
> past
> > few years, as evidenced by the popularity of Stream processing systems
> like
> > Apache Spark, Apache Flink, and messaging systems like Apache Kafka. By
> > using updateable state store to incrementally compute & instantly reflect
> > new results to queries and using a “tailable” messaging bus to publish
> > these results to other downstream jobs, such systems employ a different
> > approach to building analytical dataset. Even though this approach yields
> > low latency, the amount of data managed in such real-time data-marts is
> > typically limited in comparison to the aforementioned longer term storage
> > options. As a result, the overall data architecture has become more
> complex
> > with more moving parts and specialized systems, leading to duplication of
> > data and a strain on usability.
> >
> > Hudi takes a hybrid approach. Instead of moving vast amounts of batch
> data
> > to streaming systems, we simply add the streaming primitives (upserts &
> > incremental consumption) onto existing batch processing technologies. We
> > believe that by adding some missing blocks to an existing Hadoop stack,
> we
> > are able to a provide similar capabilities right on top of Hadoop at a
> > reduced cost and with an increased efficiency, greatly simplifying the
> > overall architecture in the process.
> >
> > Hudi was originally developed at Uber (original name “Hoodie”) to address
> > such broad inefficiencies in ingest & ETL & ML pipelines across Uber’s
> data
> > ecosystem that required the upsert & incremental consumption primitives
> > supported by Hudi.
> >
> > == Rationale ==
> >
> > We truly believe the capabilities supported by Hudi would be increasingly
> > useful for big-data ecosystems, as data volumes & need for faster data
> > continue to increase. A detailed description of target use-cases can be
> > found at https://uber.github.io/hudi/use_cases.html.
> >
> > Given our reliance on so many great Apache projects, we believe that the
> > Apache way of open source community driven development will enable us to
> > evolve Hudi in collaboration with a diverse set of contributors who can
> > bring new ideas into the project.
> >
> > == Initial Goals ==
> >
> >  * Move the existing codebase, website, documentation, and mailing lists
> to
> > an Apache-hosted infrastructure.
> >  * Integrate with the Apache development process.
> >  * Ensure all dependencies are compliant with Apache License version 2.0.
> >  * Incrementally develop and release per Apache guidelines.
> >
> > == Current Status ==
> >
> > Hudi is a stable project used in production at Uber since 2016 and was
> open
> > sourced under the Apache License, Version 2.0 in 2017. At Uber, Hudi
> > manages 4000+ tables holding several petabytes, bringing our Hadoop
> > warehouse from several hours of data delays to under 30 minutes, over the
> > past two years. The source code is currently hosted at github.com (
> > https://github.com/uber/hudi ), which will seed the Apache git
> repository.
> >
> > === Meritocracy ===
> >
> > We are fully committed to open, transparent, & meritocratic interactions
> > with our community. In fact, one of the primary motivations for us to
> enter
> > the incubation process is to be able to rely on Apache best practices
> that
> > can ensure meritocracy. This will eventually help incorporate the best
> > ideas back into the project & enable contributors to continue investing
> > their time in the project. Current guidelines (
> > https://uber.github.io/hudi/community.html#becoming-a-committer) have
> > already put in place a meritocratic process which we will replace with
> > Apache guidelines during incubation.
> >
> > === Community ===
> >
> > Hudi community is fairly young, since the project was open sourced only
> in
> > early 2017. Currently, Hudi has committers from Uber & Snowflake. We
> have a
> > vibrant set of contributors (~46 members in our slack channel) including
> > Shopify, DoubleVerify and Vungle & others, who have either submitted
> > patches or filed issues with hudi pipelines either in early production or
> > testing stages. Our primary goal during the incubation would be to grow
> the
> > community and groom our existing active contributors into committers.
> >
> > === Core Developers ===
> >
> > Current core developers work at Uber & Snowflake. We are confident that
> > incubation will help us grow a diverse community in a open &
> collaborative
> > way.
> >
> > === Alignment ===
> >
> > Hudi is designed as a general purpose analytical storage abstraction that
> > integrates with multiple Apache projects: Apache Spark, Apache Hive,
> Apache
> > Hadoop. It was built using multiple Apache projects, including Apache
> > Parquet and Apache Avro, that support near-real time analytics right on
> top
> > of existing Apache Hadoop data lakes. Our sincere hope is that being a
> part
> > of the Apache foundation would enable us to drive the future of the
> project
> > in alignment with the other Apache projects for the benefit of thousands
> of
> > organizations that already leverage these projects.
> >
> > == Known Risks ==
> >
> > === Orphaned products ===
> >
> > The risk of abandonment of Hudi is low. It is used in production at Uber
> > for petabytes of data and other companies (mentioned in community section)
> > are either evaluating or in the early stage for production use. Uber is
> > committed to further development of the project and will invest resources
> > towards the Apache processes & building the community during the
> > incubation period.
> >
> > === Inexperience with Open Source ===
> >
> > Even though the initial committers are new to the Apache world, some have
> > considerable open source experience - Vinoth Chandar (Linkedin voldemort,
> > Chromium), Prasanna Rajaperumal (Cloudera experience), Zeeshan Qureshi
> > (Chromium) & Balaji Varadarajan (Linkedin Databus). We have been
> > successfully managing the current open source community answering questions
> > and taking feedback already. Moreover, we hope to obtain guidance and
> > mentorship from current ASF members to help us succeed with the incubation.
> >
> > === Length of Incubation ===
> >
> > We expect the project to be in incubation for 2 years or less.
> >
> > === Homogenous Developers ===
> >
> > Currently, the lead developers for Hudi are from Uber. However, we have an
> > active set of early contributors/collaborators from Shopify, DoubleVerify
> > and Vungle, that we hope will increase the diversity going forward. Once
> > again, a primary motivation for incubation is to facilitate this in the
> > Apache way.
> >
> > === Reliance on Salaried Developers ===
> >
> > Both the current committers & early contributors have several years of core
> > expertise around data systems. Current committers are very passionate about
> > the project and have already invested hundreds of hours towards helping &
> > building the community. Thus, even with employer changes, we expect they
> > will be able to actively engage in the project either because they will be
> > working in similar areas even with newer employers or out of belief in the
> > project.
> >
> > === Relationships with Other Apache Products ===
> >
> > To the best of our knowledge, there are no direct competing projects with
> > Hudi that offer all of the feature set namely - upserts, incremental
> > streams, efficient storage/file management, snapshot isolation/rollbacks -
> > in a coherent way. However, some projects share common goals and technical
> > elements and we will highlight them here. Hive ACID/Kudu both offer upsert
> > capabilities without storage management/incremental streams. The recent
> > Iceberg project offers similar snapshot isolation/rollbacks, but not
> > upserts or other data plane features. A detailed comparison with their
> > trade-offs can be found at https://uber.github.io/hudi/comparison.html.
> >
> > We are committed to open collaboration with such Apache projects and
> > incorporate changes to Hudi or contribute patches to other projects, with
> > the goal of making it easier for the community at large, to adopt these
> > open source technologies.
> >
> > === Excessive Fascination with the Apache Brand ===
> >
> > This proposal is not for the purpose of generating publicity. We have
> > already been doing talks/meetups independently that have helped us build
> > our community. We are drawn towards Apache as a potential way of ensuring
> > that our open source community management is successful early on so Hudi
> > can evolve into a broadly accepted--and used--method of managing data on
> > Hadoop.
> >
> > == Documentation ==
> > [1] Detailed documentation can be found at https://uber.github.io/hudi/
> >
> > == Initial Source ==
> >
> > The codebase is currently hosted on Github: https://github.com/uber/hudi .
> > During incubation, the codebase will be migrated to Apache
> > infrastructure. The source code is already Apache 2.0 licensed.
> >
> > == Source and Intellectual Property Submission Plan ==
> >
> > Current code is Apache 2.0 licensed and the copyright is assigned to Uber.
> > If the project enters the incubator, Uber will transfer the source code &
> > trademark ownership to ASF via a Software Grant Agreement.
> >
> > == External Dependencies ==
> >
> > Non apache dependencies are listed below
> >
> >  * JCommander (1.48) Apache-2.0
> >  * Kryo (4.0.0) BSD-2-Clause
> >  * Kryo (2.21) BSD-3-Clause
> >  * Jackson-annotations (2.6.4) Apache-2.0
> >  * Jackson-annotations (2.6.5) Apache-2.0
> >  * jackson-databind (2.6.4) Apache-2.0
> >  * jackson-databind (2.6.5) Apache-2.0
> >  * Jackson datatype: Guava (2.9.4) Apache-2.0
> >  * docker-java (3.1.0-rc-3) Apache-2.0
> >  * Guava: Google Core Libraries for Java (20.0) Apache-2.0
> >  * bijection-avro (0.9.2) Apache-2.0
> >  * com.twitter.common:objectsize (0.0.12) Apache-2.0
> >  * Ascii Table (0.2.5) Apache-2.0
> >  * config (3.0.0) Apache-2.0
> >  * utils (3.0.0) Apache-2.0
> >  * kafka-avro-serializer (3.0.0) Apache-2.0
> >  * kafka-schema-registry-client (3.0.0) Apache-2.0
> >  * Metrics Core (3.1.1) Apache-2.0
> >  * Graphite Integration for Metrics (3.1.1) Apache-2.0
> >  * Joda-Time (2.9.6) Apache-2.0
> >  * JUnit CPL-1.0
> >  * Awaitility (3.1.2) Apache-2.0
> >  * jersey-connectors-apache (2.17) GPL-2.0-only CDDL-1.0
> >  * jersey-container-servlet-core (2.17) GPL-2.0-only CDDL-1.0
> >  * jersey-core-server (2.17) GPL-2.0-only CDDL-1.0
> >  * htrace-core (3.0.4) Apache-2.0
> >  * Mockito (1.10.19) MIT
> >  * scalatest (3.0.1) Apache-2.0
> >  * Spring Shell (1.2.0.RELEASE) Apache-2.0
> >
> > All of them are Apache compatible
> >
> > == Cryptography ==
> >
> > No cryptographic libraries used
> >
> > == Required Resources ==
> >
> > === Mailing lists ===
> >
> >  * private@hudi.incubator.apache.org (with moderated subscriptions)
> >  * dev@hudi.incubator.apache.org
> >  * commits@hudi.incubator.apache.org
> >  * user@hudi.incubator.apache.org
> >
> > === Git Repositories ===
> >
> > Git is the preferred source control system:
> > git://git.apache.org/incubator-hudi
> >
> > === Issue Tracking ===
> >
> > We prefer to use the Apache gitbox integration to sync Github & Apache
> > infrastructure, and rely on Github issues & pull requests for community
> > engagement. If this is not possible, then we prefer JIRA: Hudi (HUDI)
> >
> > == Initial Committers ==
> >
> >  * Vinoth Chandar (vinoth at uber dot com) (Uber)
> >  * Nishith Agarwal (nagarwal at uber dot com) (Uber)
> >  * Balaji Varadarajan (varadarb at uber dot com) (Uber)
> >  * Prasanna Rajaperumal (prasanna dot raj at gmail dot com) (Snowflake)
> >  * Zeeshan Qureshi (zeeshan dot qureshi at shopify dot com) (Shopify)
> >  * Anbu Cheeralan (alunarbeach at gmail dot com) (DoubleVerify)
> >
> > == Sponsors ==
> >
> > === Champion ===
> > Julien Le Dem (julien at apache dot org)
> >
> > === Nominated Mentors ===
> >
> >  * Luciano Resende (lresende at apache dot org)
> >  * Thomas Weise (thw at apache dot org)
> >  * Kishore Gopalakrishna (kishoreg at apache dot org)
> >  * Suneel Marthi (smarthi at apache dot org)
> >
> > === Sponsoring Entity ===
> >
> > The Incubator PMC
>
>
>
> --
> Luciano Resende
> http://twitter.com/lresende1975
> http://lresende.blogspot.com/
>
> --
Sent from my phone

Re: [VOTE] Accept Hudi into the Apache Incubator

Posted by Luciano Resende <lu...@gmail.com>.

+1 (binding)

On Sun, Jan 13, 2019 at 2:34 PM Thomas Weise <th...@apache.org> wrote:
>
> Hi all,
>
> Following the discussion of the Hudi proposal in [1], this is a vote
> on accepting Hudi into the Apache Incubator,
> per the ASF policy [2] and voting rules [3].
>
> A vote for accepting a new Apache Incubator podling is a
> majority vote. Everyone is welcome to vote, only
> Incubator PMC member votes are binding.
>
> This vote will run for at least 72 hours. Please VOTE as
> follows:
>
> [ ] +1 Accept Hudi into the Apache Incubator
> [ ] +0 Abstain
> [ ] -1 Do not accept Hudi into the Apache Incubator because ...
>
> The proposal is included below, but you can also access it on
> the wiki [4].
>
> Thanks for reviewing and voting,
> Thomas
>
> [1]
> https://lists.apache.org/thread.html/12e2bdaa095d68dae6f8731e473d3d43885783177d1b7e3ff2f65b6d@%3Cgeneral.incubator.apache.org%3E
>
> [2]
> https://incubator.apache.org/policy/incubation.html#approval_of_proposal_by_sponsor
>
> [3] http://www.apache.org/foundation/voting.html
>
> [4] https://wiki.apache.org/incubator/HudiProposal
>
>
>
> = Hudi Proposal =
>
> == Abstract ==
>
> Hudi is a big-data storage library, that provides atomic upserts and
> incremental data streams.
>
> Hudi manages data stored in Apache Hadoop and other API compatible
> distributed file systems/cloud stores.
>
> == Proposal ==
>
> Hudi provides the ability to atomically upsert datasets with new values in
> near-real time, making data available quickly to existing query engines
> like Apache Hive, Apache Spark, & Presto. Additionally, Hudi provides a
> sequence of changes to a dataset from a given point-in-time to enable
> incremental data pipelines that yield greater efficiency & latency than
> their typical batch counterparts. By carefully managing number of files &
> sizes, Hudi greatly aids both query engines (e.g: always providing
> well-sized files) and underlying storage (e.g: HDFS NameNode memory
> consumption).
>
> Hudi is largely implemented as an Apache Spark library that reads/writes
> data from/to Hadoop compatible filesystem. SQL queries on Hudi datasets are
> supported via specialized Apache Hadoop input formats, that understand
> Hudi’s storage layout. Currently, Hudi manages datasets using a combination
> of Apache Parquet & Apache Avro file/serialization formats.
>
> == Background ==
>
> Apache Hadoop distributed filesystem (HDFS) & other compatible cloud
> storage systems (e.g: Amazon S3, Google Cloud, Microsoft Azure) serve as
> longer term analytical storage for thousands of organizations. Typical
> analytical datasets are built by reading data from a source (e.g: upstream
> databases, messaging buses, or other datasets), transforming the data,
> writing results back to storage, & making it available for analytical
> queries--all of this typically accomplished in batch jobs which operate in
> a bulk fashion on partitions of datasets. Such a style of processing
> typically incurs large delays in making data available to queries as well
> as a lot of complexity in carefully partitioning datasets to guarantee
> latency SLAs.
>
> The need for fresher/faster analytics has increased enormously in the past
> few years, as evidenced by the popularity of Stream processing systems like
> Apache Spark, Apache Flink, and messaging systems like Apache Kafka. By
> using an updateable state store to incrementally compute & instantly reflect
> new results to queries and using a “tailable” messaging bus to publish
> these results to other downstream jobs, such systems employ a different
> approach to building analytical datasets. Even though this approach yields
> low latency, the amount of data managed in such real-time data-marts is
> typically limited in comparison to the aforementioned longer term storage
> options. As a result, the overall data architecture has become more complex
> with more moving parts and specialized systems, leading to duplication of
> data and a strain on usability.
>
> Hudi takes a hybrid approach. Instead of moving vast amounts of batch data
> to streaming systems, we simply add the streaming primitives (upserts &
> incremental consumption) onto existing batch processing technologies. We
> believe that by adding some missing blocks to an existing Hadoop stack, we
> are able to provide similar capabilities right on top of Hadoop at a
> reduced cost and with an increased efficiency, greatly simplifying the
> overall architecture in the process.
>
> Hudi was originally developed at Uber (original name “Hoodie”) to address
> such broad inefficiencies in ingest & ETL & ML pipelines across Uber’s data
> ecosystem that required the upsert & incremental consumption primitives
> supported by Hudi.
>
> == Rationale ==
>
> We truly believe the capabilities supported by Hudi would be increasingly
> useful for big-data ecosystems, as data volumes & need for faster data
> continue to increase. A detailed description of target use-cases can be
> found at https://uber.github.io/hudi/use_cases.html.
>
> Given our reliance on so many great Apache projects, we believe that the
> Apache way of open source community driven development will enable us to
> evolve Hudi in collaboration with a diverse set of contributors who can
> bring new ideas into the project.
>
> == Initial Goals ==
>
>  * Move the existing codebase, website, documentation, and mailing lists to
> an Apache-hosted infrastructure.
>  * Integrate with the Apache development process.
>  * Ensure all dependencies are compliant with Apache License version 2.0.
>  * Incrementally develop and release per Apache guidelines.
>
> == Current Status ==
>
> Hudi is a stable project used in production at Uber since 2016 and was open
> sourced under the Apache License, Version 2.0 in 2017. At Uber, Hudi
> manages 4000+ tables holding several petabytes, bringing our Hadoop
> warehouse from several hours of data delays to under 30 minutes, over the
> past two years. The source code is currently hosted at github.com (
> https://github.com/uber/hudi ), which will seed the Apache git repository.
>
> === Meritocracy ===
>
> We are fully committed to open, transparent, & meritocratic interactions
> with our community. In fact, one of the primary motivations for us to enter
> the incubation process is to be able to rely on Apache best practices that
> can ensure meritocracy. This will eventually help incorporate the best
> ideas back into the project & enable contributors to continue investing
> their time in the project. Current guidelines (
> https://uber.github.io/hudi/community.html#becoming-a-committer) have
> already put in place a meritocratic process which we will replace with
> Apache guidelines during incubation.
>
> === Community ===
>
> Hudi community is fairly young, since the project was open sourced only in
> early 2017. Currently, Hudi has committers from Uber & Snowflake. We have a
> vibrant set of contributors (~46 members in our slack channel) including
> Shopify, DoubleVerify and Vungle & others, who have either submitted
> patches or filed issues with hudi pipelines either in early production or
> testing stages. Our primary goal during the incubation would be to grow the
> community and groom our existing active contributors into committers.
>
> === Core Developers ===
>
> Current core developers work at Uber & Snowflake. We are confident that
> incubation will help us grow a diverse community in an open & collaborative
> way.
>
> === Alignment ===
>
> Hudi is designed as a general purpose analytical storage abstraction that
> integrates with multiple Apache projects: Apache Spark, Apache Hive, Apache
> Hadoop. It was built using multiple Apache projects, including Apache
> Parquet and Apache Avro, that support near-real time analytics right on top
> of existing Apache Hadoop data lakes. Our sincere hope is that being a part
> of the Apache foundation would enable us to drive the future of the project
> in alignment with the other Apache projects for the benefit of thousands of
> organizations that already leverage these projects.
>
> == Known Risks ==
>
> === Orphaned products ===
>
> The risk of abandonment of Hudi is low. It is used in production at Uber
> for petabytes of data and other companies (mentioned in community section)
> are either evaluating or in the early stage for production use. Uber is
> committed to further development of the project and will invest resources
> towards the Apache processes & building the community during the
> incubation period.
>
> === Inexperience with Open Source ===
>
> Even though the initial committers are new to the Apache world, some have
> considerable open source experience - Vinoth Chandar (Linkedin voldemort,
> Chromium), Prasanna Rajaperumal (Cloudera experience), Zeeshan Qureshi
> (Chromium) & Balaji Varadarajan (Linkedin Databus). We have been
> successfully managing the current open source community answering questions
> and taking feedback already. Moreover, we hope to obtain guidance and
> mentorship from current ASF members to help us succeed with the incubation.
>
> === Length of Incubation ===
>
> We expect the project to be in incubation for 2 years or less.
>
> === Homogenous Developers ===
>
> Currently, the lead developers for Hudi are from Uber. However, we have an
> active set of early contributors/collaborators from Shopify, DoubleVerify
> and Vungle, that we hope will increase the diversity going forward. Once
> again, a primary motivation for incubation is to facilitate this in the
> Apache way.
>
> === Reliance on Salaried Developers ===
>
> Both the current committers & early contributors have several years of core
> expertise around data systems. Current committers are very passionate about
> the project and have already invested hundreds of hours towards helping &
> building the community. Thus, even with employer changes, we expect they
> will be able to actively engage in the project either because they will be
> working in similar areas even with newer employers or out of belief in the
> project.
>
> === Relationships with Other Apache Products ===
>
> To the best of our knowledge, there are no direct competing projects with
> Hudi that offer all of the feature set namely - upserts, incremental
> streams, efficient storage/file management, snapshot isolation/rollbacks -
> in a coherent way. However, some projects share common goals and technical
> elements and we will highlight them here. Hive ACID/Kudu both offer upsert
> capabilities without storage management/incremental streams. The recent
> Iceberg project offers similar snapshot isolation/rollbacks, but not
> upserts or other data plane features. A detailed comparison with their
> trade-offs can be found at https://uber.github.io/hudi/comparison.html.
>
> We are committed to open collaboration with such Apache projects and
> incorporate changes to Hudi or contribute patches to other projects, with
> the goal of making it easier for the community at large, to adopt these
> open source technologies.
>
> === Excessive Fascination with the Apache Brand ===
>
> This proposal is not for the purpose of generating publicity. We have
> already been doing talks/meetups independently that have helped us build
> our community. We are drawn towards Apache as a potential way of ensuring
> that our open source community management is successful early on so Hudi
> can evolve into a broadly accepted--and used--method of managing data on
> Hadoop.
>
> == Documentation ==
> [1] Detailed documentation can be found at https://uber.github.io/hudi/
>
> == Initial Source ==
>
> The codebase is currently hosted on Github: https://github.com/uber/hudi .
> During incubation, the codebase will be migrated to Apache
> infrastructure. The source code is already Apache 2.0 licensed.
>
> == Source and Intellectual Property Submission Plan ==
>
> Current code is Apache 2.0 licensed and the copyright is assigned to Uber.
> If the project enters the incubator, Uber will transfer the source code &
> trademark ownership to ASF via a Software Grant Agreement.
>
> == External Dependencies ==
>
> Non apache dependencies are listed below
>
>  * JCommander (1.48) Apache-2.0
>  * Kryo (4.0.0) BSD-2-Clause
>  * Kryo (2.21) BSD-3-Clause
>  * Jackson-annotations (2.6.4) Apache-2.0
>  * Jackson-annotations (2.6.5) Apache-2.0
>  * jackson-databind (2.6.4) Apache-2.0
>  * jackson-databind (2.6.5) Apache-2.0
>  * Jackson datatype: Guava (2.9.4) Apache-2.0
>  * docker-java (3.1.0-rc-3) Apache-2.0
>  * Guava: Google Core Libraries for Java (20.0) Apache-2.0
>  * bijection-avro (0.9.2) Apache-2.0
>  * com.twitter.common:objectsize (0.0.12) Apache-2.0
>  * Ascii Table (0.2.5) Apache-2.0
>  * config (3.0.0) Apache-2.0
>  * utils (3.0.0) Apache-2.0
>  * kafka-avro-serializer (3.0.0) Apache-2.0
>  * kafka-schema-registry-client (3.0.0) Apache-2.0
>  * Metrics Core (3.1.1) Apache-2.0
>  * Graphite Integration for Metrics (3.1.1) Apache-2.0
>  * Joda-Time (2.9.6) Apache-2.0
>  * JUnit CPL-1.0
>  * Awaitility (3.1.2) Apache-2.0
>  * jersey-connectors-apache (2.17) GPL-2.0-only CDDL-1.0
>  * jersey-container-servlet-core (2.17) GPL-2.0-only CDDL-1.0
>  * jersey-core-server (2.17) GPL-2.0-only CDDL-1.0
>  * htrace-core (3.0.4) Apache-2.0
>  * Mockito (1.10.19) MIT
>  * scalatest (3.0.1) Apache-2.0
>  * Spring Shell (1.2.0.RELEASE) Apache-2.0
>
> All of them are Apache compatible
>
> == Cryptography ==
>
> No cryptographic libraries used
>
> == Required Resources ==
>
> === Mailing lists ===
>
>  * private@hudi.incubator.apache.org (with moderated subscriptions)
>  * dev@hudi.incubator.apache.org
>  * commits@hudi.incubator.apache.org
>  * user@hudi.incubator.apache.org
>
> === Git Repositories ===
>
> Git is the preferred source control system:
> git://git.apache.org/incubator-hudi
>
> === Issue Tracking ===
>
> We prefer to use the Apache gitbox integration to sync Github & Apache
> infrastructure, and rely on Github issues & pull requests for community
> engagement. If this is not possible, then we prefer JIRA: Hudi (HUDI)
>
> == Initial Committers ==
>
>  * Vinoth Chandar (vinoth at uber dot com) (Uber)
>  * Nishith Agarwal (nagarwal at uber dot com) (Uber)
>  * Balaji Varadarajan (varadarb at uber dot com) (Uber)
>  * Prasanna Rajaperumal (prasanna dot raj at gmail dot com) (Snowflake)
>  * Zeeshan Qureshi (zeeshan dot qureshi at shopify dot com) (Shopify)
>  * Anbu Cheeralan (alunarbeach at gmail dot com) (DoubleVerify)
>
> == Sponsors ==
>
> === Champion ===
> Julien Le Dem (julien at apache dot org)
>
> === Nominated Mentors ===
>
>  * Luciano Resende (lresende at apache dot org)
>  * Thomas Weise (thw at apache dot org)
>  * Kishore Gopalakrishna (kishoreg at apache dot org)
>  * Suneel Marthi (smarthi at apache dot org)
>
> === Sponsoring Entity ===
>
> The Incubator PMC



-- 
Luciano Resende
http://twitter.com/lresende1975
http://lresende.blogspot.com/

Re: [VOTE] Accept Hudi into the Apache Incubator

Posted by Thomas Weise <th...@apache.org>.

+1


On Wed, Jan 16, 2019 at 11:35 AM Matt Sicker <bo...@gmail.com> wrote:

> +1
>
> On Wed, 16 Jan 2019 at 01:25, Gosling Von <fe...@gmail.com> wrote:
> >
> > +1(binding)
> >
> > Best Regards,
> > Von Gosling
> >
> > > On Jan 14, 2019, at 6:34 AM, Thomas Weise <th...@apache.org> wrote:
> > >
> > > Hi all,
> > >
> > > Following the discussion of the Hudi proposal in [1], this is a vote
> > > on accepting Hudi into the Apache Incubator,
> > > per the ASF policy [2] and voting rules [3].
> > >
> > > A vote for accepting a new Apache Incubator podling is a
> > > majority vote. Everyone is welcome to vote, only
> > > Incubator PMC member votes are binding.
> > >
> > > This vote will run for at least 72 hours. Please VOTE as
> > > follows:
> > >
> > > [ ] +1 Accept Hudi into the Apache Incubator
> > > [ ] +0 Abstain
> > > [ ] -1 Do not accept Hudi into the Apache Incubator because ...
> > >
> > > The proposal is included below, but you can also access it on
> > > the wiki [4].
> > >
> > > Thanks for reviewing and voting,
> > > Thomas
> > >
> > > [1]
> > >
> https://lists.apache.org/thread.html/12e2bdaa095d68dae6f8731e473d3d43885783177d1b7e3ff2f65b6d@%3Cgeneral.incubator.apache.org%3E
> > >
> > > [2]
> > >
> https://incubator.apache.org/policy/incubation.html#approval_of_proposal_by_sponsor
> > >
> > > [3] http://www.apache.org/foundation/voting.html
> > >
> > > [4] https://wiki.apache.org/incubator/HudiProposal
> > >
> > >
> > >
> > > = Hudi Proposal =
> > >
> > > == Abstract ==
> > >
> > > Hudi is a big-data storage library, that provides atomic upserts and
> > > incremental data streams.
> > >
> > > Hudi manages data stored in Apache Hadoop and other API compatible
> > > distributed file systems/cloud stores.
> > >
> > > == Proposal ==
> > >
> > > > Hudi provides the ability to atomically upsert datasets with new values in
> > > > near-real time, making data available quickly to existing query engines
> > > > like Apache Hive, Apache Spark, & Presto. Additionally, Hudi provides a
> > > > sequence of changes to a dataset from a given point-in-time to enable
> > > > incremental data pipelines that yield greater efficiency & latency than
> > > > their typical batch counterparts. By carefully managing number of files &
> > > > sizes, Hudi greatly aids both query engines (e.g: always providing
> > > > well-sized files) and underlying storage (e.g: HDFS NameNode memory
> > > > consumption).
> > >
> > > > Hudi is largely implemented as an Apache Spark library that reads/writes
> > > > data from/to Hadoop compatible filesystem. SQL queries on Hudi datasets are
> > > > supported via specialized Apache Hadoop input formats, that understand
> > > > Hudi’s storage layout. Currently, Hudi manages datasets using a combination
> > > > of Apache Parquet & Apache Avro file/serialization formats.
> > >
> > > == Background ==
> > >
> > > > Apache Hadoop distributed filesystem (HDFS) & other compatible cloud
> > > > storage systems (e.g: Amazon S3, Google Cloud, Microsoft Azure) serve as
> > > > longer term analytical storage for thousands of organizations. Typical
> > > > analytical datasets are built by reading data from a source (e.g: upstream
> > > > databases, messaging buses, or other datasets), transforming the data,
> > > > writing results back to storage, & making it available for analytical
> > > > queries--all of this typically accomplished in batch jobs which operate in
> > > > a bulk fashion on partitions of datasets. Such a style of processing
> > > > typically incurs large delays in making data available to queries as well
> > > > as a lot of complexity in carefully partitioning datasets to guarantee
> > > > latency SLAs.
> > >
> > > > The need for fresher/faster analytics has increased enormously in the past
> > > > few years, as evidenced by the popularity of Stream processing systems like
> > > > Apache Spark, Apache Flink, and messaging systems like Apache Kafka. By
> > > > using an updateable state store to incrementally compute & instantly reflect
> > > > new results to queries and using a “tailable” messaging bus to publish
> > > > these results to other downstream jobs, such systems employ a different
> > > > approach to building analytical datasets. Even though this approach yields
> > > > low latency, the amount of data managed in such real-time data-marts is
> > > > typically limited in comparison to the aforementioned longer term storage
> > > > options. As a result, the overall data architecture has become more complex
> > > > with more moving parts and specialized systems, leading to duplication of
> > > > data and a strain on usability.
> > >
> > > > Hudi takes a hybrid approach. Instead of moving vast amounts of batch data
> > > > to streaming systems, we simply add the streaming primitives (upserts &
> > > > incremental consumption) onto existing batch processing technologies. We
> > > > believe that by adding some missing blocks to an existing Hadoop stack, we
> > > > are able to provide similar capabilities right on top of Hadoop at a
> > > > reduced cost and with an increased efficiency, greatly simplifying the
> > > > overall architecture in the process.
> > >
> > > > Hudi was originally developed at Uber (original name “Hoodie”) to address
> > > > such broad inefficiencies in ingest & ETL & ML pipelines across Uber’s data
> > > > ecosystem that required the upsert & incremental consumption primitives
> > > > supported by Hudi.
> > >
> > > == Rationale ==
> > >
> > > > We truly believe the capabilities supported by Hudi would be increasingly
> > > useful for big-data ecosystems, as data volumes & need for faster data
> > > continue to increase. A detailed description of target use-cases can be
> > > found at https://uber.github.io/hudi/use_cases.html.
> > >
> > > > Given our reliance on so many great Apache projects, we believe that the
> > > > Apache way of open source community driven development will enable us to
> > > evolve Hudi in collaboration with a diverse set of contributors who can
> > > bring new ideas into the project.
> > >
> > > == Initial Goals ==
> > >
> > > > * Move the existing codebase, website, documentation, and mailing lists to
> > > > an Apache-hosted infrastructure.
> > > > * Integrate with the Apache development process.
> > > > * Ensure all dependencies are compliant with Apache License version 2.0.
> > > > * Incrementally develop and release per Apache guidelines.
> > >
> > > == Current Status ==
> > >
> > > > Hudi is a stable project used in production at Uber since 2016 and was open
> > > > sourced under the Apache License, Version 2.0 in 2017. At Uber, Hudi
> > > > manages 4000+ tables holding several petabytes, bringing our Hadoop
> > > > warehouse from several hours of data delays to under 30 minutes, over the
> > > > past two years. The source code is currently hosted at github.com (
> > > > https://github.com/uber/hudi ), which will seed the Apache git repository.
> > >
> > > === Meritocracy ===
> > >
> > > > We are fully committed to open, transparent, & meritocratic interactions
> > > > with our community. In fact, one of the primary motivations for us to enter
> > > > the incubation process is to be able to rely on Apache best practices that
> > > can ensure meritocracy. This will eventually help incorporate the best
> > > ideas back into the project & enable contributors to continue investing
> > > their time in the project. Current guidelines (
> > > https://uber.github.io/hudi/community.html#becoming-a-committer) have
> > > already put in place a meritocratic process which we will replace with
> > > Apache guidelines during incubation.
> > >
> > > === Community ===
> > >
> > > > Hudi community is fairly young, since the project was open sourced only in
> > > > early 2017. Currently, Hudi has committers from Uber & Snowflake. We have a
> > > > vibrant set of contributors (~46 members in our slack channel) including
> > > > Shopify, DoubleVerify and Vungle & others, who have either submitted
> > > > patches or filed issues with hudi pipelines either in early production or
> > > > testing stages. Our primary goal during the incubation would be to grow the
> > > > community and groom our existing active contributors into committers.
> > >
> > > === Core Developers ===
> > >
> > > > Current core developers work at Uber & Snowflake. We are confident that
> > > > incubation will help us grow a diverse community in an open & collaborative
> > > > way.
> > >
> > > === Alignment ===
> > >
> > > > Hudi is designed as a general purpose analytical storage abstraction that
> > > > integrates with multiple Apache projects: Apache Spark, Apache Hive, Apache
> > > > Hadoop. It was built using multiple Apache projects, including Apache
> > > > Parquet and Apache Avro, that support near-real time analytics right on top
> > > > of existing Apache Hadoop data lakes. Our sincere hope is that being a part
> > > > of the Apache foundation would enable us to drive the future of the project
> > > > in alignment with the other Apache projects for the benefit of thousands of
> > > > organizations that already leverage these projects.
> > >
> > > == Known Risks ==
> > >
> > > === Orphaned products ===
> > >
> > > > The risk of abandonment of Hudi is low. It is used in production at Uber
> > > > for petabytes of data and other companies (mentioned in community section)
> > > > are either evaluating or in the early stage for production use. Uber is
> > > > committed to further development of the project and will invest resources
> > > > towards the Apache processes & building the community during the
> > > > incubation period.
> > >
> > > === Inexperience with Open Source ===
> > >
> > > Even though the initial committers are new to the Apache world, some
> have
> > > considerable open source experience - Vinoth Chandar (Linkedin
> voldemort,
> > > Chromium), Prasanna Rajaperumal (Cloudera experience), Zeeshan Qureshi
> > > (Chromium) & Balaji Varadarajan (Linkedin Databus). We have been
> > > successfully managing the current open source community answering
> questions
> > > and taking feedback already. Moreover, we hope to obtain guidance and
> > > mentorship from current ASF members to help us succeed with the
> incubation.
> > >
> > > === Length of Incubation ===
> > >
> > > We expect the project be in incubation for 2 years or less.
> > >
> > > === Homogenous Developers ===
> > >
> > > Currently, the lead developers for Hudi are from Uber. However, we
> have an
> > > active set of early contributors/collaborators from Shopify,
> DoubleVerify
> > > and Vungle, that we hope will increase the diversity going forward.
> Once
> > > again, a primary motivation for incubation is to facilitate this in the
> > > Apache way.
> > >
> > > === Reliance on Salaried Developers ===
> > >
> > > Both the current committers & early contributors have several years of
> core
> > > expertise around data systems. Current committers are very passionate
> about
> > > the project and have already invested hundreds of hours towards
> helping &
> > > building the community. Thus, even with employer changes, we expect
> they
> > > will be able to actively engage in the project either because they
> will be
> > > working in similar areas even with newer employers or out of belief in
> the
> > > project.
> > >
> > > === Relationships with Other Apache Products ===
> > >
> > > To the best of our knowledge, there are no direct competing projects
> with
> > > Hudi that offer all of the feature set namely - upserts, incremental
> > > streams, efficient storage/file management, snapshot
> isolation/rollbacks -
> > > in a coherent way. However, some projects share common goals and
> technical
> > > elements and we will highlight them here. Hive ACID/Kudu both offer
> upsert
> > > capabilities without storage management/incremental streams. The recent
> > > Iceberg project offers similar snapshot isolation/rollbacks, but not
> > > upserts or other data plane features. A detailed comparison with their
> > > trade-offs can be found at https://uber.github.io/hudi/comparison.html
> .
> > >
> > > We are committed to open collaboration with such Apache projects and
> > > incorporate changes to Hudi or contribute patches to other projects,
> with
> > > the goal of making it easier for the community at large, to adopt these
> > > open source technologies.
> > >
> > > === Excessive Fascination with the Apache Brand ===
> > >
> > > This proposal is not for the purpose of generating publicity. We have
> > > already been doing talks/meetups independently that have helped us
> build
> > > our community. We are drawn towards Apache as a potential way of
> ensuring
> > > that our open source community management is successful early on so
> hudi
> > > can evolve into a broadly accepted--and used--method of managing data
> on
> > > Hadoop.
> > >
> > > == Documentation ==
> > > [1] Detailed documentation can be found at
> https://uber.github.io/hudi/
> > >
> > > == Initial Source ==
> > >
> > > The codebase is currently hosted on Github:
> https://github.com/uber/hudi .
> > > During incubation, the codebase will be migrated to an Apache
> > > infrastructure. The source code already has an Apache 2.0 licensed.
> > >
> > > == Source and Intellectual Property Submission Plan ==
> > >
> > > Current code is Apache 2.0 licensed and the copyright is assigned to
> Uber.
> > > If the project enters incubator, Uber will transfer the source code &
> > > trademark ownership to ASF via a Software Grant Agreement
> > >
> > > == External Dependencies ==
> > >
> > > Non apache dependencies are listed below
> > >
> > > * JCommander (1.48) Apache-2.0
> > > * Kryo (4.0.0) BSD-2-Clause
> > > * Kryo (2.21) BSD-3-Clause
> > > * Jackson-annotations (2.6.4) Apache-2.0
> > > * Jackson-annotations (2.6.5) Apache-2.0
> > > * jackson-databind (2.6.4) Apache-2.0
> > > * jackson-databind (2.6.5) Apache-2.0
> > > * Jackson datatype: Guava (2.9.4) Apache-2.0
> > > * docker-java (3.1.0-rc-3) Apache-2.0
> > > * Guava: Google Core Libraries for Java (20.0) Apache-2.0
> > > * bijection-avro (0.9.2) Apache-2.0
> > > * com.twitter.common:objectsize (0.0.12) Apache-2.0
> > > * Ascii Table (0.2.5) Apache-2.0
> > > * config (3.0.0) Apache-2.0
> > > * utils (3.0.0) Apache-2.0
> > > * kafka-avro-serializer (3.0.0) Apache-2.0
> > > * kafka-schema-registry-client (3.0.0) Apache-2.0
> > > * Metrics Core (3.1.1) Apache-2.0
> > > * Graphite Integration for Metrics (3.1.1) Apache-2.0
> > > * Joda-Time (2.9.6) Apache-2.0
> > > * JUnit CPL-1.0
> > > * Awaitility (3.1.2) Apache-2.0
> > > * jersey-connectors-apache (2.17) GPL-2.0-only CDDL-1.0
> > > * jersey-container-servlet-core (2.17) GPL-2.0-only CDDL-1.0
> > > * jersey-core-server (2.17) GPL-2.0-only CDDL-1.0
> > > * htrace-core (3.0.4) Apache-2.0
> > > * Mockito (1.10.19) MIT
> > > * scalatest (3.0.1) Apache-2.0
> > > * Spring Shell (1.2.0.RELEASE) Apache-2.0
> > >
> > > All of them are Apache compatible
> > >
> > > == Cryptography ==
> > >
> > > No cryptographic libraries used
> > >
> > > == Required Resources ==
> > >
> > > === Mailing lists ===
> > >
> > > * private@hudi.incubator.apache.org (with moderated subscriptions)
> > > * dev@hudi.incubator.apache.org
> > > * commits@hudi.incubator.apache.org
> > > * user@hudi.incubator.apache.org
> > >
> > > === Git Repositories ===
> > >
> > > Git is the preferred source control system: git://
> > > git.apache.org/incubator-hudi
> > >
> > > === Issue Tracking ===
> > >
> > > We prefer to use the Apache gitbox integration to sync Github & Apache
> > > infrastructure, and rely on Github issues & pull requests for community
> > > engagement. If this is not possible, then we prefer JIRA: Hudi (HUDI)
> > >
> > > == Initial Committers ==
> > >
> > > * Vinoth Chandar (vinoth at uber dot com) (Uber)
> > > * Nishith Agarwal (nagarwal at uber dot com) (Uber)
> > > * Balaji Varadarajan (varadarb at uber dot com) (Uber)
> > > * Prasanna Rajaperumal (prasanna dot raj at gmail dot com) (Snowflake)
> > > * Zeeshan Qureshi (zeeshan dot qureshi at shopify dot com) (Shopify)
> > > * Anbu Cheeralan (alunarbeach at gmail dot com) (DoubleVerify)
> > >
> > > == Sponsors ==
> > >
> > > === Champion ===
> > > Julien Le Dem (julien at apache dot org)
> > >
> > > === Nominated Mentors ===
> > >
> > > * Luciano Resende (lresende at apache dot org)
> > > * Thomas Weise (thw at apache dot org
> > > * Kishore Gopalakrishna (kishoreg at apache dot org)
> > > * Suneel Marthi (smarthi at apache dot org)
> > >
> > > === Sponsoring Entity ===
> > >
> > > The Incubator PMC
> >

Re: [VOTE] Accept Hudi into the Apache Incubator

Posted by Matt Sicker <bo...@gmail.com>.

+1

On Wed, 16 Jan 2019 at 01:25, Gosling Von <fe...@gmail.com> wrote:
>
> +1(binding)
>
> Best Regards,
> Von Gosling
>


-- 
Matt Sicker <bo...@gmail.com>


Re: [VOTE] Accept Hudi into the Apache Incubator

Posted by Gosling Von <fe...@gmail.com>.

+1(binding)

Best Regards,
Von Gosling

> On Jan 14, 2019, at 6:34 AM, Thomas Weise <th...@apache.org> wrote:
> 
> Hi all,
> 
> Following the discussion of the Hudi proposal in [1], this is a vote
> on accepting Hudi into the Apache Incubator,
> per the ASF policy [2] and voting rules [3].
> 
> A vote for accepting a new Apache Incubator podling is a
> majority vote. Everyone is welcome to vote, but only
> Incubator PMC member votes are binding.
> 
> This vote will run for at least 72 hours. Please VOTE as
> follows:
> 
> [ ] +1 Accept Hudi into the Apache Incubator
> [ ] +0 Abstain
> [ ] -1 Do not accept Hudi into the Apache Incubator because ...
> 
> The proposal is included below, but you can also access it on
> the wiki [4].
> 
> Thanks for reviewing and voting,
> Thomas
> 
> [1]
> https://lists.apache.org/thread.html/12e2bdaa095d68dae6f8731e473d3d43885783177d1b7e3ff2f65b6d@%3Cgeneral.incubator.apache.org%3E
> 
> [2]
> https://incubator.apache.org/policy/incubation.html#approval_of_proposal_by_sponsor
> 
> [3] http://www.apache.org/foundation/voting.html
> 
> [4] https://wiki.apache.org/incubator/HudiProposal
> 
> 
> 
> = Hudi Proposal =
> 
> == Abstract ==
> 
> Hudi is a big-data storage library that provides atomic upserts and
> incremental data streams.
> 
> Hudi manages data stored in Apache Hadoop and other API-compatible
> distributed file systems/cloud stores.
> 
> == Proposal ==
> 
> Hudi provides the ability to atomically upsert datasets with new values in
> near-real time, making data available quickly to existing query engines
> like Apache Hive, Apache Spark, & Presto. Additionally, Hudi provides a
> sequence of changes to a dataset from a given point-in-time to enable
> incremental data pipelines that yield greater efficiency & lower latency
> than their typical batch counterparts. By carefully managing the number of
> files & their sizes, Hudi greatly aids both query engines (e.g., always
> providing well-sized files) and underlying storage (e.g., HDFS NameNode
> memory consumption).
> 
> Hudi is largely implemented as an Apache Spark library that reads/writes
> data from/to a Hadoop-compatible filesystem. SQL queries on Hudi datasets
> are supported via specialized Apache Hadoop input formats that understand
> Hudi’s storage layout. Currently, Hudi manages datasets using a combination
> of Apache Parquet & Apache Avro file/serialization formats.
> 
> == Background ==
> 
> Apache Hadoop distributed filesystem (HDFS) & other compatible cloud
> storage systems (e.g., Amazon S3, Google Cloud, Microsoft Azure) serve as
> longer-term analytical storage for thousands of organizations. Typical
> analytical datasets are built by reading data from a source (e.g., upstream
> databases, messaging buses, or other datasets), transforming the data,
> writing results back to storage, & making it available for analytical
> queries--all of this typically accomplished in batch jobs which operate in
> a bulk fashion on partitions of datasets. Such a style of processing
> typically incurs large delays in making data available to queries as well
> as a lot of complexity in carefully partitioning datasets to guarantee
> latency SLAs.
> 
> The need for fresher/faster analytics has increased enormously in the past
> few years, as evidenced by the popularity of stream processing systems like
> Apache Spark, Apache Flink, and messaging systems like Apache Kafka. By
> using an updateable state store to incrementally compute & instantly reflect
> new results to queries and using a “tailable” messaging bus to publish
> these results to other downstream jobs, such systems employ a different
> approach to building analytical datasets. Even though this approach yields
> low latency, the amount of data managed in such real-time data-marts is
> typically limited in comparison to the aforementioned longer-term storage
> options. As a result, the overall data architecture has become more complex
> with more moving parts and specialized systems, leading to duplication of
> data and a strain on usability.
> 
> Hudi takes a hybrid approach. Instead of moving vast amounts of batch data
> to streaming systems, we simply add the streaming primitives (upserts &
> incremental consumption) onto existing batch processing technologies. We
> believe that by adding some missing blocks to an existing Hadoop stack, we
> are able to provide similar capabilities right on top of Hadoop at a
> reduced cost and with an increased efficiency, greatly simplifying the
> overall architecture in the process.
> 
> Hudi was originally developed at Uber (original name “Hoodie”) to address
> such broad inefficiencies in ingest & ETL & ML pipelines across Uber’s data
> ecosystem that required the upsert & incremental consumption primitives
> supported by Hudi.
> 
> == Rationale ==
> 
> We truly believe the capabilities supported by Hudi would be increasingly
> useful for big-data ecosystems, as data volumes & the need for faster data
> continue to increase. A detailed description of target use-cases can be
> found at https://uber.github.io/hudi/use_cases.html.
> 
> Given our reliance on so many great Apache projects, we believe that the
> Apache way of open source community driven development will enable us to
> evolve Hudi in collaboration with a diverse set of contributors who can
> bring new ideas into the project.
> 
> == Initial Goals ==
> 
> * Move the existing codebase, website, documentation, and mailing lists to
> an Apache-hosted infrastructure.
> * Integrate with the Apache development process.
> * Ensure all dependencies are compliant with Apache License version 2.0.
> * Incrementally develop and release per Apache guidelines.
> 
> == Current Status ==
> 
> Hudi is a stable project used in production at Uber since 2016 and was open
> sourced under the Apache License, Version 2.0 in 2017. At Uber, Hudi
> manages 4000+ tables holding several petabytes, bringing our Hadoop
> warehouse from several hours of data delays to under 30 minutes, over the
> past two years. The source code is currently hosted at github.com (
> https://github.com/uber/hudi ), which will seed the Apache git repository.
> 
> === Meritocracy ===
> 
> We are fully committed to open, transparent, & meritocratic interactions
> with our community. In fact, one of the primary motivations for us to enter
> the incubation process is to be able to rely on Apache best practices that
> can ensure meritocracy. This will eventually help incorporate the best
> ideas back into the project & enable contributors to continue investing
> their time in the project. Current guidelines (
> https://uber.github.io/hudi/community.html#becoming-a-committer) have
> already put in place a meritocratic process which we will replace with
> Apache guidelines during incubation.
> 
> === Community ===
> 
> The Hudi community is fairly young, since the project was open sourced only
> in early 2017. Currently, Hudi has committers from Uber & Snowflake. We have
> a vibrant set of contributors (~46 members in our Slack channel), including
> Shopify, DoubleVerify, Vungle & others, who have either submitted patches or
> filed issues with Hudi pipelines in early production or testing stages. Our
> primary goal during incubation would be to grow the community and groom our
> existing active contributors into committers.
> 
> === Core Developers ===
> 
> Current core developers work at Uber & Snowflake. We are confident that
> incubation will help us grow a diverse community in an open & collaborative
> way.
> 
> === Alignment ===
> 
> Hudi is designed as a general purpose analytical storage abstraction that
> integrates with multiple Apache projects: Apache Spark, Apache Hive, Apache
> Hadoop. It was built using multiple Apache projects, including Apache
> Parquet and Apache Avro, that support near-real time analytics right on top
> of existing Apache Hadoop data lakes. Our sincere hope is that being a part
> of the Apache foundation would enable us to drive the future of the project
> in alignment with the other Apache projects for the benefit of thousands of
> organizations that already leverage these projects.
> 
> == Known Risks ==
> 
> === Orphaned products ===
> 
> The risk of abandonment of Hudi is low. It is used in production at Uber
> for petabytes of data and other companies (mentioned in community section)
> are either evaluating or in the early stage for production use. Uber is
> committed to further development of the project and to investing resources
> towards the Apache processes & building the community during the incubation
> period.
> 
> === Inexperience with Open Source ===
> 
> Even though the initial committers are new to the Apache world, some have
> considerable open source experience - Vinoth Chandar (LinkedIn Voldemort,
> Chromium), Prasanna Rajaperumal (Cloudera experience), Zeeshan Qureshi
> (Chromium) & Balaji Varadarajan (LinkedIn Databus). We have been
> successfully managing the current open source community answering questions
> and taking feedback already. Moreover, we hope to obtain guidance and
> mentorship from current ASF members to help us succeed with the incubation.
> 
> === Length of Incubation ===
> 
> We expect the project to be in incubation for 2 years or less.
> 
> === Homogenous Developers ===
> 
> Currently, the lead developers for Hudi are from Uber. However, we have an
> active set of early contributors/collaborators from Shopify, DoubleVerify
> and Vungle, that we hope will increase the diversity going forward. Once
> again, a primary motivation for incubation is to facilitate this in the
> Apache way.
> 
> === Reliance on Salaried Developers ===
> 
> Both the current committers & early contributors have several years of core
> expertise around data systems. Current committers are very passionate about
> the project and have already invested hundreds of hours towards helping &
> building the community. Thus, even with employer changes, we expect they
> will be able to actively engage in the project either because they will be
> working in similar areas even with newer employers or out of belief in the
> project.
> 
> === Relationships with Other Apache Products ===
> 
> To the best of our knowledge, no project competes directly with Hudi by
> offering its full feature set (upserts, incremental streams, efficient
> storage/file management, snapshot isolation/rollbacks) in a coherent way.
> However, some projects share common goals and technical elements, and we
> highlight them here. Hive ACID/Kudu both offer upsert
> capabilities without storage management/incremental streams. The recent
> Iceberg project offers similar snapshot isolation/rollbacks, but not
> upserts or other data plane features. A detailed comparison with their
> trade-offs can be found at https://uber.github.io/hudi/comparison.html.
> 
> We are committed to open collaboration with such Apache projects and to
> incorporating changes into Hudi or contributing patches to other projects,
> with the goal of making it easier for the community at large to adopt these
> open source technologies.
> 
> === Excessive Fascination with the Apache Brand ===
> 
> This proposal is not for the purpose of generating publicity. We have
> already been doing talks/meetups independently that have helped us build
> our community. We are drawn towards Apache as a potential way of ensuring
> that our open source community management is successful early on so Hudi
> can evolve into a broadly accepted--and used--method of managing data on
> Hadoop.
> 
> == Documentation ==
> Detailed documentation can be found at https://uber.github.io/hudi/
> 
> == Initial Source ==
> 
> The codebase is currently hosted on Github: https://github.com/uber/hudi .
> During incubation, the codebase will be migrated to Apache
> infrastructure. The source code is already Apache 2.0 licensed.
> 
> == Source and Intellectual Property Submission Plan ==
> 
> Current code is Apache 2.0 licensed and the copyright is assigned to Uber.
> If the project enters the Incubator, Uber will transfer the source code &
> trademark ownership to the ASF via a Software Grant Agreement.
> 
> == External Dependencies ==
> 
> Non-Apache dependencies are listed below:
> 
> * JCommander (1.48) Apache-2.0
> * Kryo (4.0.0) BSD-2-Clause
> * Kryo (2.21) BSD-3-Clause
> * Jackson-annotations (2.6.4) Apache-2.0
> * Jackson-annotations (2.6.5) Apache-2.0
> * jackson-databind (2.6.4) Apache-2.0
> * jackson-databind (2.6.5) Apache-2.0
> * Jackson datatype: Guava (2.9.4) Apache-2.0
> * docker-java (3.1.0-rc-3) Apache-2.0
> * Guava: Google Core Libraries for Java (20.0) Apache-2.0
> * bijection-avro (0.9.2) Apache-2.0
> * com.twitter.common:objectsize (0.0.12) Apache-2.0
> * Ascii Table (0.2.5) Apache-2.0
> * config (3.0.0) Apache-2.0
> * utils (3.0.0) Apache-2.0
> * kafka-avro-serializer (3.0.0) Apache-2.0
> * kafka-schema-registry-client (3.0.0) Apache-2.0
> * Metrics Core (3.1.1) Apache-2.0
> * Graphite Integration for Metrics (3.1.1) Apache-2.0
> * Joda-Time (2.9.6) Apache-2.0
> * JUnit CPL-1.0
> * Awaitility (3.1.2) Apache-2.0
> * jersey-connectors-apache (2.17) GPL-2.0-only CDDL-1.0
> * jersey-container-servlet-core (2.17) GPL-2.0-only CDDL-1.0
> * jersey-core-server (2.17) GPL-2.0-only CDDL-1.0
> * htrace-core (3.0.4) Apache-2.0
> * Mockito (1.10.19) MIT
> * scalatest (3.0.1) Apache-2.0
> * Spring Shell (1.2.0.RELEASE) Apache-2.0
> 
> All of them are Apache compatible
> 
> == Cryptography ==
> 
> No cryptographic libraries used
> 
> == Required Resources ==
> 
> === Mailing lists ===
> 
> * private@hudi.incubator.apache.org (with moderated subscriptions)
> * dev@hudi.incubator.apache.org
> * commits@hudi.incubator.apache.org
> * user@hudi.incubator.apache.org
> 
> === Git Repositories ===
> 
> Git is the preferred source control system:
> git://git.apache.org/incubator-hudi
> 
> === Issue Tracking ===
> 
> We prefer to use the Apache gitbox integration to sync Github & Apache
> infrastructure, and rely on Github issues & pull requests for community
> engagement. If this is not possible, then we prefer JIRA: Hudi (HUDI)
> 
> == Initial Committers ==
> 
> * Vinoth Chandar (vinoth at uber dot com) (Uber)
> * Nishith Agarwal (nagarwal at uber dot com) (Uber)
> * Balaji Varadarajan (varadarb at uber dot com) (Uber)
> * Prasanna Rajaperumal (prasanna dot raj at gmail dot com) (Snowflake)
> * Zeeshan Qureshi (zeeshan dot qureshi at shopify dot com) (Shopify)
> * Anbu Cheeralan (alunarbeach at gmail dot com) (DoubleVerify)
> 
> == Sponsors ==
> 
> === Champion ===
> Julien Le Dem (julien at apache dot org)
> 
> === Nominated Mentors ===
> 
> * Luciano Resende (lresende at apache dot org)
> * Thomas Weise (thw at apache dot org)
> * Kishore Gopalakrishna (kishoreg at apache dot org)
> * Suneel Marthi (smarthi at apache dot org)
> 
> === Sponsoring Entity ===
> 
> The Incubator PMC

