Posted to general@incubator.apache.org by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov> on 2015/01/15 03:21:10 UTC

[PROPOSAL] Apache AsterixDB Incubator

Hi Folks,

I am pleased to bring forth the Apache AsterixDB proposal to the
Apache Incubator as Champion, working in collaboration with the
team. Please find the wiki proposal here:

https://wiki.apache.org/incubator/AsterixDBProposal


Full text of the proposal is below. Please discuss and enjoy. I’ll
leave the discussion open for a week, and then look to call a VOTE
hopefully end of next week if all is well.

Cheers!
Chris Mattmann

=============================================================
Apache AsterixDB Proposal

Abstract

Apache AsterixDB is a scalable big data management system (BDMS) that
provides storage, management, and query capabilities for large
collections of semi-structured data.

Proposal

AsterixDB is a big data management system (BDMS) with a feature set
that makes it well-suited to needs such as web data warehousing and
social data storage and analysis. Feature-wise, AsterixDB has:

* A NoSQL style data model (ADM) based on extending JSON with object
  database concepts.
* An expressive and declarative query language (AQL) for querying
  semi-structured data.
* A runtime query execution engine, Hyracks, for partitioned-parallel
  execution of query plans.
* Partitioned LSM-based data storage and indexing for efficient
  ingestion of newly arriving data.
* Support for querying and indexing external data (e.g., in HDFS) as
  well as data stored within AsterixDB.
* A rich set of primitive data types, including support for spatial,
  temporal, and textual data.
* Indexing options that include B+ trees, R trees, and inverted
  keyword index support.
* Basic transactional (concurrency and recovery) capabilities akin to
  those of a NoSQL store.
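
To make the LSM bullet above more concrete, here is a minimal Python
sketch of the general log-structured merge idea: writes go to an
in-memory table, which is flushed to an immutable sorted run when it
fills up, and reads check the newest data first. The class name
TinyLSM and the memtable_limit parameter are hypothetical, and this is
an illustration of the technique in general, not AsterixDB's actual
storage code (it omits partitioning, merging of runs, and recovery).

```python
import bisect


class TinyLSM:
    """Toy LSM store: in-memory memtable plus immutable sorted runs."""

    def __init__(self, memtable_limit=2):
        self.memtable_limit = memtable_limit  # hypothetical flush threshold
        self.memtable = {}                    # newest writes, unsorted
        self.runs = []                        # sorted runs, oldest first

    def put(self, key, value):
        # Ingestion is cheap: only the in-memory table is touched.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Freeze the memtable as a sorted, immutable run.
        self.runs.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key, default=None):
        # The newest data wins: memtable first, then runs newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.runs):
            # (key,) sorts before any (key, value), so this finds the key.
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return default
```

A real LSM implementation would also merge runs in the background so
that lookups do not have to probe an ever-growing list of them.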


Background and Rationale

In the world of relational databases, the need to tackle data volumes
that exceed the capabilities of a single server led to the
development of “shared-nothing” parallel database systems several
decades ago. These systems spread data over a cluster based on a
partitioning strategy, such as hash partitioning, and queries are
processed by employing partitioned-parallel divide-and-conquer
techniques. Since these systems are fronted by a high-level,
declarative language (SQL), their users are shielded from the
complexities of parallel programming. Parallel database systems have
been an extremely successful application of parallel computing, and
quite a number of commercial products exist today.
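
As a rough illustration of the partitioned-parallel approach described
above, the following Python sketch hash-partitions rows across a
hypothetical four-node cluster, counts rows per key on each partition
independently, and then merges the partial results. The names
NUM_PARTITIONS, partition_of, and parallel_count are assumptions made
for this example; they are not taken from AsterixDB or from any
particular parallel database.

```python
import zlib
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

NUM_PARTITIONS = 4  # hypothetical cluster size


def partition_of(key: str) -> int:
    """Map a key to a partition using a stable hash (CRC32)."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


def hash_partition(rows):
    """Spread (key, value) rows over NUM_PARTITIONS lists."""
    partitions = [[] for _ in range(NUM_PARTITIONS)]
    for key, value in rows:
        partitions[partition_of(key)].append((key, value))
    return partitions


def local_count(partition):
    """Per-partition work: count the rows seen for each key."""
    return Counter(key for key, _ in partition)


def parallel_count(rows):
    """Divide and conquer: count on each partition, then merge."""
    partitions = hash_partition(rows)
    with ThreadPoolExecutor(max_workers=NUM_PARTITIONS) as pool:
        partials = list(pool.map(local_count, partitions))
    total = Counter()
    for partial in partials:
        total.update(partial)  # adds the partial counts together
    return dict(total)
```

Because all rows with the same key land in the same partition, each
partition can compute its groups without talking to the others, which
is what lets a declarative query be parallelized behind the scenes.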

In the distributed systems world, the Web brought a need to index and
query its huge content. SQL and relational databases were not the
answer, though shared-nothing clusters again emerged as the hardware
platform of choice. Google developed the Google File System (GFS) and
MapReduce programming model to allow programmers to store and process
Big Data by writing a few user-defined functions. The MapReduce
framework applies these functions in parallel to data instances in
distributed files (map) and to sorted groups of instances sharing a
common key (reduce) -- not unlike the partitioned parallelism in
parallel database systems. Apache's Hadoop MapReduce platform is the
most prominent implementation of this paradigm for the rest of the
Big Data community. On top of Hadoop and HDFS sit declarative
languages like Pig and Hive that each compile down to Hadoop
MapReduce jobs.
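
The map and reduce roles described above can be sketched in a few
lines of Python. This is a single-process stand-in for the framework,
written only to show how the two user-defined functions fit together;
run_mapreduce, map_fn, and reduce_fn are hypothetical names and do not
correspond to the Hadoop API.

```python
from itertools import groupby
from operator import itemgetter


def map_fn(line):
    """User-defined map: emit (word, 1) for each word in a line."""
    for word in line.split():
        yield (word.lower(), 1)


def reduce_fn(key, values):
    """User-defined reduce: sum the counts collected for one key."""
    return (key, sum(values))


def run_mapreduce(records, mapper, reducer):
    """Apply mapper to every record, sort by key, reduce each group."""
    # Map phase: apply the mapper to every input record.
    intermediate = [pair for record in records for pair in mapper(record)]
    # Shuffle/sort phase: sort by key so equal keys become adjacent.
    intermediate.sort(key=itemgetter(0))
    # Reduce phase: call the reducer once per group of equal keys.
    return [
        reducer(key, [value for _, value in group])
        for key, group in groupby(intermediate, key=itemgetter(0))
    ]
```

In a real deployment the map calls run in parallel over file splits
and the sorted groups are distributed across reducers; here everything
runs sequentially in one process.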

The big Web companies were also challenged by extreme user bases
(100s of millions of users) and needed fast simple lookups and
updates to very large keyed data sets like user profiles. SQL
databases were deemed either too expensive or not scalable, so the
“NoSQL movement” was born. The ASF now has HBase and Cassandra, two
popular key-value stores, in this space. MongoDB and Couchbase are
other open source alternatives (document stores).

It is evident from the rapidly growing popularity of "NoSQL" stores,
as well as the strong demand for Big Data analytics engines today,
that there is a strong (and growing!) need to store, process, *and*
query large volumes of semi-structured data in many application
areas. Until very recently, developers have had to "choose" between
using big data analytics engines like Apache Hive or Apache Spark,
which can do complex query processing and analysis over HDFS-resident
files, and flexible but low-function data stores like MongoDB or
Apache HBase. (The Apache Phoenix project,
http://phoenix.apache.org/, is a recent SQL-over-HBase effort that
aims to bridge between these choices.)

AsterixDB is a highly scalable data management system that can store,
index, and manage semi-structured data, e.g., much like MongoDB, but
it also supports a full-power query language with the expressiveness
of SQL (and more). Unlike analytics engines like Hive or Spark, it
stores and manages data, so AsterixDB can exploit its knowledge of
data partitioning and the availability of indexes to avoid always
scanning data set(s) to process queries. Somewhat surprisingly, there
is no open source parallel database system (relational or otherwise)
available to developers today -- AsterixDB aims to fill this need.
Since Apache is where the majority of today's most important Big
Data technologies live, the ASF seems like the obvious home for a
system like AsterixDB.

Current Status

The current version of AsterixDB was co-developed by a team of
faculty, staff, and students at UC Irvine and UC Riverside. The
project was initiated as a large NSF-sponsored project in 2009, the
goal of which was to combine the best ideas from the parallel
database world, the then new Hadoop world, and the semi-structured
(e.g., XML/JSON) data world in order to create a next-generation
BDMS. A first informal open source release was made four years later,
in June of 2013, under the Apache Software License 2.0.


Meritocracy

The current developers are familiar with meritocratic open source
development at Apache. Apache was chosen specifically because we want
to encourage this style of development for the project.


Community

While AsterixDB started as a university project, it has developed into
a community. A number of the initial committers started contributing
in academia and continue to actively participate and contribute after
graduation. We seek to further develop both the developer and user
communities. One ongoing way of broadening the community is through
academic collaborations (currently with IIT Mumbai in India and TU
Berlin in Germany). During incubation we will also explicitly seek
increased industrial participation.

Some indicators of the effort's development community and history can
be found at:
https://www.openhub.net/p/asterixdb/contributors?query=&sort=commits_12_mo
https://www.openhub.net/p/hyracks/contributors?query=&sort=commits_12_mo


Core Developers

The core developers of the project are diverse, although initially UC
Irvine heavy (roughly 50%) due to the project's origins at UCI. The
other 50% are from other academic institutions (UC Riverside and the
Hebrew University in Jerusalem) and companies (Couchbase, Facebook,
IBM, KACST Saudi Arabia, Oracle, Saudi Aramco, X15 Software).


Alignment

Apache is, by far, the most natural home for taking the AsterixDB
project forward. A large fraction of today's top Big Data
technologies have their homes in Apache, including Hadoop, YARN, Pig,
Hive, Spark, Flink, HBase, Cassandra and others. AsterixDB fills a
significant gap -- the parallel data management system gap -- that
exists in the Big Data open source world. It is well-aligned with a
number of the Apache projects, e.g., it has strong support for
accessing and indexing external data in HDFS, and it uses YARN as an
answer to basic cluster resource management. AsterixDB also seeks to
achieve an Apache-style development model; it is seeking a broader
community of contributors and users in order to achieve its full
potential and value to the Big Data community.

There are also a number of related Apache projects and dependencies,
which are covered below in the "Relationships with Other Apache
Products" section.


Known Risks

Orphaned products

Given the current level of intellectual investment in AsterixDB, the
risk of the project being abandoned is very small. The UCI/UCR
faculty team leads are highly incentivized to continue development
since the database groups at UC Irvine and UC Riverside are both
reliant on AsterixDB as a platform for long-term graduate research
projects. UC San Diego is also beginning to contribute to the code
base, and a collaboration involving public health applications is
forming with UCLA. The work on AsterixDB is managed via mailing list
discussions, supplemented by weekly project status meetings that are
summarized on the mailing list. Typical (local plus Skype-in)
attendance at the weekly status meetings runs at about 20 active
contributors.


Inexperience with Open Source

AsterixDB and Hyracks were completely developed in Open Source under
the ASL 2.0. The source code repositories, issue tracker, and mailing
lists are available on Google Code and discussions and decisions
happen on the mailing lists (which is necessary due to the geographic
distribution of the current developers).

Also a few of the initial committers have contributed to Apache
projects. Vinayak Borkar is a committer on the Apache Helix and
Apache VXQuery projects. Till Westmann is the VP VXQuery at the ASF
and an IPMC member. Preston Carman and Steven Jacobs are committers
on the Apache VXQuery project.


Relationships with Other Apache Products

Apache VXQuery is based on the Hyracks data-parallel runtime, which
is also included in the AsterixDB code base.

AsterixDB is closely related to Apache Hadoop. Included in AsterixDB
is support for accessing external data in HDFS (and Hive formats),
and resource management and system administration features are in the
process of being migrated to YARN.

AsterixDB's AQL query facilities offer comparable query power to
Apache's Pig and Hive systems for big data analytics. AsterixDB
differs in storing and indexing data and thus being able to quickly
answer small and medium queries without large HDFS data scans -
thereby targeting a different class of use cases.

AsterixDB's data storage and indexing facilities are similar to those
of HBase, but AsterixDB differs in being a much more complete and
queryable BDMS (not just a key-value style store).

AsterixDB's target use cases are not in-memory processing or
iterative algorithm support, making AsterixDB complementary to the
Apache Spark platform. (Spark interoperability is on our longer-term
to-do wishlist.)


Homogeneous Developers

As mentioned before the current community is already organizationally
and geographically distributed - and we would like to increase the
heterogeneity.


Reliance on Salaried Developers

Of the initial committers only 3 are full-time UCI staff. The other
committers are a mix of students, alumni who continue to contribute
to the effort, and individuals working with permission part-time (or
in spare time) on this project.


An Excessive Fascination with the Apache Brand

We believe in the processes, systems, and framework Apache has put in
place. Apache is also known to foster a great community around their
projects and provide exposure. While brand is important, our
fascination with it is not excessive. We believe that the ASF is the
right home for AsterixDB and that having AsterixDB inside of the ASF
will lead to a better long-term outcome for the Big Data community.


Documentation

Documentation and publications related to AsterixDB can be found at
http://asterixdb.ics.uci.edu/.


Initial Source

Current source resides in Google Code:
https://code.google.com/p/asterixdb/ (query language and upper system
layers) and https://code.google.com/p/hyracks/ (dataflow runtime
system and storage management libraries).


External Dependencies

AsterixDB depends on a number of Apache projects:

- Ant
- Avro
- ApacheDB JDO
- Commons
- Derby
- Hadoop
- Hive
- HTTPComponents
- Jakarta ORO
- Maven
- Tomcat
- Thrift
- Velocity
- Wicket
- Xerces

and other open source projects (organized by license):

-- ASL 2.0:
 - Jackson
 - Google Guava
 - Google Guice
 - JSON-simple
 - BoneCP
 - Microsoft Azure SDK
 - Netty
 - Rome
 - JetS3t
 - Groovy
 - Jettison
 - Plexus
 - Datanucleus (JDO)
 - Jetty
 - Twitter4J
 - Snappy-java

-- BSD:
 - Antlr
 - ObjectWeb ASM
 - Protobuf
 - JSCH
 - JavaCC
 - Paranamer
 - JLine
 - Stax
 - StringTemplate
 - xmlEnc

-- MIT
 - AppAssembler
 - SimpleLog4J

-- CDDL 1.0
 - Java Activation Framework
 - Java Transactions
 - Java Servlet API
 - Grizzly
 - gmbal
 - Glassfish

-- CDDL 1.1
 - Jersey
 - JAXB Reference Implementation

-- JSON License
 - JSON

-- EPL 1.0
 - JUnit

-- JDOM License
 - JDOM

-- Public Domain
 - xz
 - AOPAlliance

As all dependencies are managed using Apache Maven, none of the
external libraries need to be packaged in a source distribution.


Required Resources

Developer and user mailing lists

private@asterixdb.incubator.apache.org (with moderated subscriptions)
commits@asterixdb.incubator.apache.org
dev@asterixdb.incubator.apache.org
users@asterixdb.incubator.apache.org


A git repository

https://git-wip-us.apache.org/repos/asf/incubator-asterixdb.git


A JIRA issue tracker

https://issues.apache.org/jira/browse/ASTERIXDB


Initial Committers

The following is a list of the planned initial Apache committers (the
active subset of the committers for the current repository at Google
Code).

Abdullah Alamoudi (bamousaa@gmail.com)
Cameron Samak (eufery@gmail.com)
Chen Li (chenli@gmail.com)
Ian Maxon (imaxon@uci.edu)
Ildar Absalyamov (ildar.absalyamov@gmail.com)
Jianfeng Jia (jianfeng.jia@gmail.com)
Keren Ouaknine (kereno@gmail.com)
Markus Dreseler (apache@dreseler.de)
Mike Carey (dtabass@apache.org)
Murtadha Hubail (hubailmor@gmail.com)
Pouria Pirzadeh (pouria.pirzadeh@gmail.com)
Preston Carman (prestonc@apache.org)
Raman Grover (RamanGrover29@gmail.com)
Sattam Alsubaiee (salsubaiee@gmail.com)
Steven Jacobs (sjaco002@apache.org)
Taewoo Kim (wangsaeu@gmail.com)
Till Westmann (tillw@apache.org)
Vinayak Borkar (vinayakb@apache.org)
Yingyi Bu (buyingyi@gmail.com)
Young-Seok Kim (kisskys@gmail.com)
Zach Heilbron (zheilbron@gmail.com)


Affiliations

UC Irvine
- Mike Carey
- Chen Li
- Ian Maxon
- Yingyi Bu
- Raman Grover
- Pouria Pirzadeh
- Young-Seok Kim
- Cameron Samak
- Taewoo Kim
- Jianfeng Jia
- Murtadha Hubail
- Markus Dreseler

UC Riverside
- Ildar Absalyamov
- Preston Carman
- Steven Jacobs

Hebrew University
- Keren Ouaknine

Oracle
- Till Westmann

X15 Software
- Vinayak Borkar
- Zach Heilbron

KACST Saudi Arabia
- Sattam Alsubaiee

Saudi Aramco
- Abdullah Alamoudi

Carey, Li, and Maxon are full-time UCI staff, with the remaining UCI
(UC Irvine) and UCR (UC Riverside) affiliates being students. The
non-UC committers are a mix of alumni who continue to contribute to
the effort and individuals working with permission part-time (or in
spare time) on this project.


Sponsors

Champion

Chris Mattmann (NASA/JPL)

Nominated Mentors

TBD

Sponsoring Entity

The Apache Incubator





++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Re: [PROPOSAL] Apache AsterixDB Incubator

Posted by Till Westmann <we...@gmail.com>.

Hi,

if you read the proposal all the way to the end you will see that - while we do have some community and code - we don’t have mentors.
So if you like the proposal, please volunteer.

Cheers,
Till




Re: [PROPOSAL] Apache AsterixDB Incubator

Posted by Mike Carey <dt...@gmail.com>.

Thanks, Steve!!  (We'd love to talk there, BTW; the challenge is doing
so w/a teaching day-job.  We'll see if we can find a volunteer who's
not schedule-conflicted that week...!)

On 1/21/15 2:44 AM, Steve Loughran wrote:
> +1 for the proposal: I've a lot of respect for the team...I met some of
> them at a workshop in Germany a few years back along with the (then)
> Stratosphere project.
>
> I would volunteer as a mentor except I'm fairly overcommitted with other
> things (like the Slider incubating project). If it does need rounding out
> I'll add my name to the list.
>
> Mike: note that you have until Feb 1 to get a proposal for a paper in for
> ApacheCon: http://apachecon.com/
> You might want to think about doing that, as it's a great way to get known
> by the community.
>
> -Steve

Re: [PROPOSAL] Apache AsterixDB Incubator

Posted by Steve Loughran <st...@hortonworks.com>.

+1 for the proposal: I've a lot of respect for the team...I met some of
them at a workshop in Germany a few years back along with the (then)
Stratosphere project.

I would volunteer as a mentor except I'm fairly overcommitted with other
things (like the Slider incubating project). If it does need rounding out
I'll add my name to the list.

Mike: note that you have until Feb 1 to get a proposal for a paper in for
ApacheCon: http://apachecon.com/
You might want to think about doing that, as it's a great way to get known
by the community.

-Steve


On 15 January 2015 at 02:21, Mattmann, Chris A (3980) <
chris.a.mattmann@jpl.nasa.gov> wrote:

> Hi Folks,
>
> I am pleased to bring forth the Apache AsterixDB proposal to the
> Apache Incubator as Champion, working in collaboration with the
> team. Please find the wiki proposal here:
>
> https://wiki.apache.org/incubator/AsterixDBProposal
>
>
> Full text of the proposal is below. Please discuss and enjoy. I’ll
> leave the discussion open for a week, and then look to call a VOTE
> hopefully end of next week if all is well.
>
> Cheers!
> Chris Mattmann
>
> =============================================================
> Apache AsterixDB Proposal
>
> Abstract
>
> Apache AsterixDB is a scalable big data management system (BDMS) that
> provides storage, management, and query capabilities for large
> collections of semi-structured data.
>
> Proposal
>
> AsterixDB is a big data management system (BDMS) with a feature set
> that makes it well-suited to needs such as web data warehousing and
> social data storage and analysis. Specifically, AsterixDB has:
>
> * A NoSQL style data model (ADM) based on extending JSON with object
>   database concepts.
> * An expressive and declarative query language (AQL) for querying
>   semi-structured data.
> * A runtime query execution engine, Hyracks, for partitioned-parallel
>   execution of query plans.
> * Partitioned LSM-based data storage and indexing for efficient
>   ingestion of newly arriving data.
> * Support for querying and indexing external data (e.g., in HDFS) as
>   well as data stored within AsterixDB.
> * A rich set of primitive data types, including support for spatial,
>   temporal, and textual data.
> * Indexing options that include B+ trees, R trees, and inverted
>   keyword index support.
> * Basic transactional (concurrency and recovery) capabilities akin to
>   those of a NoSQL store.
>
>
> Background and Rationale
>
> In the world of relational databases, the need to tackle data volumes
> that exceed the capabilities of a single server led to the
> development of “shared-nothing” parallel database systems several
> decades ago. These systems spread data over a cluster based on a
> partitioning strategy, such as hash partitioning, and queries are
> processed by employing partitioned-parallel divide-and-conquer
> techniques. Since these systems are fronted by a high-level,
> declarative language (SQL), their users are shielded from the
> complexities of parallel programming. Parallel database systems have
> been an extremely successful application of parallel computing, and
> quite a number of commercial products exist today.
>
> In the distributed systems world, the Web brought a need to index and
> query its huge content. SQL and relational databases were not the
> answer, though shared-nothing clusters again emerged as the hardware
> platform of choice. Google developed the Google File System (GFS) and
> MapReduce programming model to allow programmers to store and process
> Big Data by writing a few user-defined functions. The MapReduce
> framework applies these functions in parallel to data instances in
> distributed files (map) and to sorted groups of instances sharing a
> common key (reduce) -- not unlike the partitioned parallelism in
> parallel database systems. Apache's Hadoop MapReduce platform is the
> most prominent implementation of this paradigm for the rest of the
> Big Data community. On top of Hadoop and HDFS sit declarative
> languages like Pig and Hive that each compile down to Hadoop
> MapReduce jobs.
>
> The big Web companies were also challenged by extreme user bases
> (100s of millions of users) and needed fast simple lookups and
> updates to very large keyed data sets like user profiles. SQL
> databases were deemed either too expensive or not scalable, so the
> “NoSQL movement” was born. The ASF now has HBase and Cassandra, two
> popular key-value stores, in this space. MongoDB and Couchbase are
> other open source alternatives (document stores).
>
> It is evident from the rapidly growing popularity of "NoSQL" stores,
> as well as the strong demand for Big Data analytics engines today,
> that there is a strong (and growing!) need to store, process, *and*
> query large volumes of semi-structured data in many application
> areas. Until very recently, developers have had to "choose" between
> using big data analytics engines like Apache Hive or Apache Spark,
> which can do complex query processing and analysis over HDFS-resident
> files, and flexible but low-function data stores like MongoDB or
> Apache HBase. (The Apache Phoenix project,
> http://phoenix.apache.org/, is a recent SQL-over-HBase effort that
> aims to bridge between these choices.)
>
> AsterixDB is a highly scalable data management system that can store,
> index, and manage semi-structured data, e.g., much like MongoDB, but
> it also supports a full-power query language with the expressiveness
> of SQL (and more). Unlike analytics engines like Hive or Spark, it
> stores and manages data, so AsterixDB can exploit its knowledge of
> data partitioning and the availability of indexes to avoid always
> scanning data set(s) to process queries. Somewhat surprisingly, there
> is no open source parallel database system (relational or otherwise)
> available to developers today -- AsterixDB aims to fill this need.
> Since Apache is where the majority of today's most important Big
> Data technologies live, the ASF seems like the obvious home for a
> system like AsterixDB.
>
> Current Status
>
> The current version of AsterixDB was co-developed by a team of
> faculty, staff, and students at UC Irvine and UC Riverside. The
> project was initiated as a large NSF-sponsored project in 2009, the
> goal of which was to combine the best ideas from the parallel
> database world, the then new Hadoop world, and the semi-structured
> (e.g., XML/JSON) data world in order to create a next-generation
> BDMS. A first informal open source release was made four years later,
> in June of 2013, under the Apache Software License 2.0.
>
>
> Meritocracy
>
> The current developers are familiar with meritocratic open source
> development at Apache. Apache was chosen specifically because we want
> to encourage this style of development for the project.
>
>
> Community
>
> While AsterixDB started as a university project, it has developed into
> a community. A number of the initial committers started contributing
> in academia and continue to actively participate and contribute after
> graduation. We seek to further grow the developer and user
> communities. One ongoing way of broadening the community is through
> academic collaborations (currently with IIT Mumbai in India and TU
> Berlin in Germany). During incubation we will also explicitly seek
> increased industrial participation.
>
> Some indicators of the effort's development community and history can
> be found at:
> https://www.openhub.net/p/asterixdb/contributors?query=&sort=commits_12_mo
> and
> https://www.openhub.net/p/hyracks/contributors?query=&sort=commits_12_mo
>
>
> Core Developers
>
> The core developers of the project are diverse, although initially UC
> Irvine heavy (roughly 50%) due to the project's origins at UCI. The
> other 50% are from other academic institutions (UC Riverside and the
> Hebrew University in Jerusalem) and companies (Couchbase, Facebook,
> IBM, KACST Saudi Arabia, Oracle, Saudi Aramco, X15 Software).
>
>
> Alignment
>
> Apache is, by far, the most natural home for taking the AsterixDB
> project forward. A large fraction of today's top Big Data
> technologies have their homes in Apache, including Hadoop, YARN, Pig,
> Hive, Spark, Flink, HBase, Cassandra and others. AsterixDB fills a
> significant gap -- the parallel data management system gap -- that
> exists in the Big Data open source world. It is well-aligned with a
> number of the Apache projects, e.g., it has strong support for
> accessing and indexing external data in HDFS, and it uses YARN as an
> answer to basic cluster resource management. AsterixDB also seeks to
> achieve an Apache-style development model; it is seeking a broader
> community of contributors and users in order to achieve its full
> potential and value to the Big Data community.
>
> There are also a number of related Apache projects and dependencies
> that will be mentioned below in the Relationships with Other Apache
> Products section.
>
>
> Known Risks
>
> Orphaned products
>
> Given the current level of intellectual investment in AsterixDB, the
> risk of the project being abandoned is very small. The UCI/UCR
> faculty team leads are highly incentivized to continue development
> since the database groups at UC Irvine and UC Riverside are both
> reliant on AsterixDB as a platform for long-term graduate research
> projects. UC San Diego is also beginning to contribute to the code
> base, and a collaboration involving public health applications is
> forming with UCLA. The work on AsterixDB is managed via mailing list
> discussions supplemented by weekly project status meetings, which are
> summarized on the mailing list. Typical (local plus Skype-in)
> attendance at the weekly status meetings runs at about 20 active
> contributors.
>
>
> Inexperience with Open Source
>
> AsterixDB and Hyracks were completely developed in Open Source under
> the ASL 2.0. The source code repositories, issue tracker, and mailing
> lists are available on Google Code and discussions and decisions
> happen on the mailing lists (which is necessary due to the geographic
> distribution of the current developers).
>
> Also, a few of the initial committers have contributed to Apache
> projects. Vinayak Borkar is a committer on the Apache Helix and
> Apache VXQuery projects. Till Westmann is the VP VXQuery at the ASF
> and an IPMC member. Preston Carman and Steven Jacobs are committers
> on the Apache VXQuery project.
>
>
> Relationships with Other Apache Products
>
> Apache VXQuery is based on the Hyracks data-parallel runtime, which
> is also included in the AsterixDB code base.
>
> AsterixDB is closely related to Apache Hadoop. Included in AsterixDB
> is support for accessing external data in HDFS (and Hive formats),
> and resource management and system administration features are in the
> process of being migrated to YARN.
>
> AsterixDB's AQL query facilities offer comparable query power to
> Apache's Pig and Hive systems for big data analytics. AsterixDB
> differs in storing and indexing data and thus being able to quickly
> answer small and medium queries without large HDFS data scans -
> thereby targeting a different class of use cases.
>
> AsterixDB's data storage and indexing facilities are similar to those
> of HBase, but AsterixDB differs in being a much more complete and
> queryable BDMS (not just a key-value style store).
>
> AsterixDB's target use cases are not in-memory processing or
> iterative algorithm support, making AsterixDB complementary to the
> Apache Spark platform. (Spark interoperability is on our longer-term
> to-do wishlist.)
>
>
> Homogeneous Developers
>
> As mentioned before, the current community is already organizationally
> and geographically distributed - and we would like to increase the
> heterogeneity.
>
>
> Reliance on Salaried Developers
>
> Of the initial committers only 3 are full-time UCI staff. The other
> committers are a mix of students, alumni who continue to contribute
> to the effort, and individuals working with permission part-time (or
> in spare time) on this project.
>
>
> An Excessive Fascination with the Apache Brand
>
> We believe in the processes, systems, and framework Apache has put in
> place. Apache is also known to foster a great community around its
> projects and provide exposure. While brand is important, our
> fascination with it is not excessive. We believe that the ASF is the
> right home for AsterixDB and that having AsterixDB inside of the ASF
> will lead to a better long-term outcome for the Big Data community.
>
>
> Documentation
>
> Documentation and publications related to AsterixDB can be found at
> http://asterixdb.ics.uci.edu/.
>
>
> Initial Source
>
> The current source resides on Google Code:
> https://code.google.com/p/asterixdb/ (query language and upper system
> layers) and https://code.google.com/p/hyracks/ (dataflow runtime
> system and storage management libraries).
>
>
> External Dependencies
>
> AsterixDB depends on a number of Apache projects:
>
> - Ant
> - Avro
> - ApacheDB JDO
> - Commons
> - Derby
> - Hadoop
> - Hive
> - HTTPComponents
> - Jakarta ORO
> - Maven
> - Tomcat
> - Thrift
> - Velocity
> - Wicket
> - Xerces
>
> and other open source projects (organized by license):
>
> -- ASL 2.0:
>  - Jackson
>  - Google Guava
>  - Google Guice
>  - JSON-simple
>  - BoneCP
>  - Microsoft Azure SDK
>  - Netty
>  - Rome
>  - JetS3t
>  - Groovy
>  - Jettison
>  - Plexus
>  - Datanucleus (JDO)
>  - Jetty
>  - Twitter4J
>  - Snappy-java
>
> -- BSD:
>  - Antlr
>  - ObjectWeb ASM
>  - Protobuf
>  - JSCH
>  - JavaCC
>  - Paranamer
>  - JLine
>  - Stax
>  - StringTemplate
>  - xmlEnc
>
> -- MIT
>  - AppAssembler
>  - SimpleLog4J
>
> -- CDDL 1.0
>  - Java Activation Framework
>  - Java Transactions
>  - Java Servlet API
>  - Grizzly
>  - gmbal
>  - Glassfish
>
> -- CDDL 1.1
>  - Jersey
>  - JAXB Reference Implementation
>
> -- JSON License
>  - JSON
>
> -- EPL 1.0
>  - JUnit
>
> -- JDOM License
>  - JDOM
>
> -- Public Domain
>  - xz
>  - AOPAlliance
>
> As all dependencies are managed using Apache Maven, none of the
> external libraries need to be packaged in a source distribution.
>
>
> Required Resources
>
> Developer and user mailing lists
>
> private@asterixdb.incubator.apache.org (with moderated subscriptions)
> commits@asterixdb.incubator.apache.org
> dev@asterixdb.incubator.apache.org
> users@asterixdb.incubator.apache.org
>
>
> A git repository
>
> https://git-wip-us.apache.org/repos/asf/incubator-asterixdb.git
>
>
> A JIRA issue tracker
>
> https://issues.apache.org/jira/browse/ASTERIXDB
>
>
> Initial Committers
>
> The following is a list of the planned initial Apache committers (the
> active subset of the committers for the current repository at Google
> Code).
>
> Abdullah Alamoudi (bamousaa@gmail.com)
> Cameron Samak (eufery@gmail.com)
> Chen Li (chenli@gmail.com)
> Ian Maxon (imaxon@uci.edu)
> Ildar Absalyamov (ildar.absalyamov@gmail.com)
> Jianfeng Jia (jianfeng.jia@gmail.com)
> Keren Ouaknine (kereno@gmail.com)
> Markus Dreseler (apache@dreseler.de)
> Mike Carey (dtabass@apache.org)
> Murtadha Hubail (hubailmor@gmail.com)
> Pouria Pirzadeh (pouria.pirzadeh@gmail.com)
> Preston Carman (prestonc@apache.org)
> Raman Grover (RamanGrover29@gmail.com)
> Sattam Alsubaiee (salsubaiee@gmail.com)
> Steven Jacobs (sjaco002@apache.org)
> Taewoo Kim (wangsaeu@gmail.com)
> Till Westmann (tillw@apache.org)
> Vinayak Borkar (vinayakb@apache.org)
> Yingyi Bu (buyingyi@gmail.com)
> Young-Seok Kim (kisskys@gmail.com)
> Zach Heilbron (zheilbron@gmail.com)
>
>
> Affiliations
>
> UC Irvine
> - Mike Carey
> - Chen Li
> - Ian Maxon
> - Yingyi Bu
> - Raman Grover
> - Pouria Pirzadeh
> - Young-Seok Kim
> - Cameron Samak
> - Taewoo Kim
> - Jianfeng Jia
> - Murtadha Hubail
> - Markus Dreseler
>
> UC Riverside
> - Ildar Absalyamov
> - Preston Carman
> - Steven Jacobs
>
> Hebrew University
> - Keren Ouaknine
>
> Oracle
> - Till Westmann
>
> X15 Software
> - Vinayak Borkar
> - Zach Heilbron
>
> KACST Saudi Arabia
> - Sattam Alsubaiee
>
> Saudi Aramco
> - Abdullah Alamoudi
>
> Carey, Li, and Maxon are full-time UCI staff, with the remaining UCI
> (UC Irvine) and UCR (UC Riverside) affiliates being students. The
> non-UC committers are a mix of alumni who continue to contribute to
> the effort and individuals working with permission part-time (or in
> spare time) on this project.
>
>
> Sponsors
>
> Champion
>
> Chris Mattmann (NASA/JPL)
>
> Nominated Mentors
>
> TBD
>
> Sponsoring Entity
>
> The Apache Incubator
>
>
>
>
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattmann@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>
>
>
>

Re: [PROPOSAL] Apache AsterixDB Incubator

Posted by Mike Carey <dt...@gmail.com>.

Ditto - thanks for the support!
Cheers,
Mike

On 1/19/15 5:39 PM, Till Westmann wrote:
>
>> On Jan 19, 2015, at 11:34 AM, jan i <jani@apache.org> wrote:
>>
>> Looks like a real challenging project, and the proposal looks as if 
>> it has already been through a couple of refinement rounds.
>>
>> Count on my +1, when it comes to voting.
>
> Will do!
>
> Thanks,
> Till
>
>>
>> rgds
>> jan i
>>
>> On 19 January 2015 at 19:26, Henry Saputra <henry.saputra@gmail.com> wrote:
>>
>>     +1 This is GREAT News!
>>
>>     Was watching and trying AsterixDB last year and looked in awesome
>>     shape.
>>
>>     I have my plate full but would love to help mentor this project
>>     to get it going to ASF if needed!
>>
>>     - Henry
>>
>

Re: [PROPOSAL] Apache AsterixDB Incubator

Posted by Till Westmann <ti...@westmann.org>.

> On Jan 19, 2015, at 11:34 AM, jan i <ja...@apache.org> wrote:
> 
> Looks like a real challenging project, and the proposal looks as if it has already been through a couple of refinement rounds.
> 
> Count on my +1, when it comes to voting.

Will do!

Thanks,
Till

> 
> rgds
> jan i
> 
> On 19 January 2015 at 19:26, Henry Saputra <henry.saputra@gmail.com> wrote:
> +1 This is GREAT News!
> 
> Was watching and trying AsterixDB last year and looked in awesome shape.
> 
> I have my plate full but would love to help mentor this project to get
> it going to ASF if needed!
> 
> - Henry
> 
> On Wed, Jan 14, 2015 at 6:21 PM, Mattmann, Chris A (3980)
> <chris.a.mattmann@jpl.nasa.gov> wrote:
> > Hi Folks,
> >
> > I am pleased to bring forth the Apache AsterixDB proposal to the
> > Apache Incubator as Champion, working in collaboration with the
> > team. Please find the wiki proposal here:
> >
> > https://wiki.apache.org/incubator/AsterixDBProposal
> >
> >
> > Full text of the proposal is below. Please discuss and enjoy. I’ll
> > leave the discussion open for a week, and then look to call a VOTE
> > hopefully end of next week if all is well.
> >
> > Cheers!
> > Chris Mattmann
> >
> > =============================================================
> > Apache AsterixDB Proposal
> >
> > Abstract
> >
> > Apache AsterixDB is a scalable big data management system (BDMS) that
> > provides storage, management, and query capabilities for large
> > collections of semi-structured data.
> >
> > Proposal
> >
> > AsterixDB is a big data management system (BDMS) with a feature set
> > that makes it well-suited to needs such as web data warehousing and
> > social data storage and analysis. Feature-wise, AsterixDB has:
> >
> > * A NoSQL style data model (ADM) based on extending JSON with object
> >   database concepts.
> > * An expressive and declarative query language (AQL) for querying
> >   semi-structured data.
> > * A runtime query execution engine, Hyracks, for partitioned-parallel
> >   execution of query plans.
> > * Partitioned LSM-based data storage and indexing for efficient
> >   ingestion of newly arriving data.
> > * Support for querying and indexing external data (e.g., in HDFS) as
> >   well as data stored within AsterixDB.
> > * A rich set of primitive data types, including support for spatial,
> >   temporal, and textual data.
> > * Indexing options that include B+ trees, R trees, and inverted
> >   keyword index support.
> > * Basic transactional (concurrency and recovery) capabilities akin to
> >   those of a NoSQL store.
> >
> >
> > Background and Rationale
> >
> > In the world of relational databases, the need to tackle data volumes
> > that exceed the capabilities of a single server led to the
> > development of “shared-nothing” parallel database systems several
> > decades ago. These systems spread data over a cluster based on a
> > partitioning strategy, such as hash partitioning, and queries are
> > processed by employing partitioned-parallel divide-and-conquer
> > techniques. Since these systems are fronted by a high-level,
> > declarative language (SQL), their users are shielded from the
> > complexities of parallel programming. Parallel database systems have
> > been an extremely successful application of parallel computing, and
> > quite a number of commercial products exist today.
> >
> > In the distributed systems world, the Web brought a need to index and
> > query its huge content. SQL and relational databases were not the
> > answer, though shared-nothing clusters again emerged as the hardware
> > platform of choice. Google developed the Google File System (GFS) and
> > MapReduce programming model to allow programmers to store and process
> > Big Data by writing a few user-defined functions. The MapReduce
> > framework applies these functions in parallel to data instances in
> > distributed files (map) and to sorted groups of instances sharing a
> > common key (reduce) -- not unlike the partitioned parallelism in
> > parallel database systems. Apache's Hadoop MapReduce platform is the
> > most prominent implementation of this paradigm for the rest of the
> > Big Data community. On top of Hadoop and HDFS sit declarative
> > languages like Pig and Hive that each compile down to Hadoop
> > MapReduce jobs.
> >
> > The big Web companies were also challenged by extreme user bases
> > (100s of millions of users) and needed fast simple lookups and
> > updates to very large keyed data sets like user profiles. SQL
> > databases were deemed either too expensive or not scalable, so the
> > “NoSQL movement” was born. The ASF now has HBase and Cassandra, two
> > popular key-value stores, in this space. MongoDB and Couchbase are
> > other open source alternatives (document stores).
> >
> > It is evident from the rapidly growing popularity of "NoSQL" stores,
> > as well as the strong demand for Big Data analytics engines today,
> > that there is a strong (and growing!) need to store, process, *and*
> > query large volumes of semi-structured data in many application
> > areas. Until very recently, developers have had to "choose" between
> > using big data analytics engines like Apache Hive or Apache Spark,
> > which can do complex query processing and analysis over HDFS-resident
> > files, and flexible but low-function data stores like MongoDB or
> > Apache HBase. (The Apache Phoenix project,
> > http://phoenix.apache.org/, is a recent SQL-over-HBase effort that
> > aims to bridge between these choices.)
> >
> > AsterixDB is a highly scalable data management system that can store,
> > index, and manage semi-structured data, e.g., much like MongoDB, but
> > it also supports a full-power query language with the expressiveness
> > of SQL (and more). Unlike analytics engines like Hive or Spark, it
> > stores and manages data, so AsterixDB can exploit its knowledge of
> > data partitioning and the availability of indexes to avoid always
> > scanning data set(s) to process queries. Somewhat surprisingly, there
> > is no open source parallel database system (relational or otherwise)
> > available to developers today -- AsterixDB aims to fill this need.
> > Since Apache is where the majority of today's most important Big
> > Data technologies live, the ASF seems like the obvious home for a
> > system like AsterixDB.
> >
> > Current Status
> >
> > The current version of AsterixDB was co-developed by a team of
> > faculty, staff, and students at UC Irvine and UC Riverside. The
> > project was initiated as a large NSF-sponsored project in 2009, the
> > goal of which was to combine the best ideas from the parallel
> > database world, the then new Hadoop world, and the semi-structured
> > (e.g., XML/JSON) data world in order to create a next-generation
> > BDMS. A first informal open source release was made four years later,
> > in June of 2013, under the Apache Software License 2.0.
> >
> >
> > Meritocracy
> >
> > The current developers are familiar with meritocratic open source
> > development at Apache. Apache was chosen specifically because we want
> > to encourage this style of development for the project.
> >
> >
> > Community
> >
> > While AsterixDB started as a university project it has developed into
> > a community. A number of the initial committers started contributing
> > in academia and continue to actively participate and contribute after
> > graduation. We also seek to further develop the developer and user
> > communities. One ongoing way to broaden the community is through
> > academic collaborations (currently with IIT Mumbai in India and TU
> > Berlin in Germany). During incubation we will also explicitly seek
> > increased industrial participation.
> >
> > Some indicators of the effort's development community and history can
> > be found at:
> > https://www.openhub.net/p/asterixdb/contributors?query=&sort=commits_12_mo,
> > https://www.openhub.net/p/hyracks/contributors?query=&sort=commits_12_mo
> >
> >
> > Core Developers
> >
> > The core developers of the project are diverse, although initially UC
> > Irvine heavy (roughly 50%) due to the project's origins at UCI. The
> > other 50% are from other academic institutions (UC Riverside and the
> > Hebrew University in Jerusalem) and companies (Couchbase, Facebook,
> > IBM, KACST Saudi Arabia, Oracle, Saudi Aramco, X15 Software).
> >
> >
> > Alignment
> >
> > Apache is, by far, the most natural home for taking the AsterixDB
> > project forward. A large fraction of today's top Big Data
> > technologies have their homes in Apache, including Hadoop, YARN, Pig,
> > Hive, Spark, Flink, HBase, Cassandra and others. AsterixDB fills a
> > significant gap -- the parallel data management system gap -- that
> > exists in the Big Data open source world. It is well-aligned with a
> > number of the Apache projects, e.g., it has strong support for
> > accessing and indexing external data in HDFS, and it uses YARN as an
> > answer to basic cluster resource management. AsterixDB also seeks to
> > achieve an Apache-style development model; it is seeking a broader
> > community of contributors and users in order to achieve its full
> > potential and value to the Big Data community.
> >
> > There are also a number of related Apache projects and dependencies
> > that will be mentioned below in the "Relationships with Other Apache
> > Products" section.
> >
> >
> > Known Risks
> >
> > Orphaned products
> >
> > Given the current level of intellectual investment in AsterixDB, the
> > risk of the project being abandoned is very small. The UCI/UCR
> > faculty team leads are highly incentivized to continue development
> > since the database groups at UC Irvine and UC Riverside are both
> > reliant on AsterixDB as a platform for long-term graduate research
> > projects. UC San Diego is also beginning to contribute to the code
> > base, and a collaboration involving public health applications is
> > forming with UCLA. The work on AsterixDB is managed via a mix of
> > mailing list discussions supplemented by weekly project status
> > meetings which are summarized on the mailing list. Typical (local
> > plus Skype-in) attendance at the weekly status meetings runs at about
> > 20 active contributors.
> >
> >
> > Inexperience with Open Source
> >
> > AsterixDB and Hyracks were completely developed in Open Source under
> > the ASL 2.0. The source code repositories, issue tracker, and mailing
> > lists are available on Google Code and discussions and decisions
> > happen on the mailing lists (which is necessary due to the geographic
> > distribution of the current developers).
> >
> > Also, a few of the initial committers have contributed to Apache
> > projects. Vinayak Borkar is a committer on the Apache Helix and
> > Apache VXQuery projects. Till Westmann is the VP VXQuery at the ASF
> > and an IPMC member. Preston Carman and Steven Jacobs are committers
> > on the Apache VXQuery project.
> >
> >
> > Relationships with Other Apache Products
> >
> > Apache VXQuery is based on the Hyracks data-parallel runtime, which
> > is also included in the AsterixDB code base.
> >
> > AsterixDB is closely related to Apache Hadoop. Included in AsterixDB
> > is support for accessing external data in HDFS (and Hive formats),
> > and resource management and system administration features are in the
> > process of being migrated to YARN.
> >
> > AsterixDB's AQL query facilities offer comparable query power to
> > Apache's Pig and Hive systems for big data analytics. AsterixDB
> > differs in storing and indexing data and thus being able to quickly
> > answer small and medium queries without large HDFS data scans --
> > thereby targeting a different class of use cases.
> >
> > AsterixDB's data storage and indexing facilities are similar to those
> > of HBase, but AsterixDB differs in being a much more complete and
> > queryable BDMS (not just a key-value style store).
> >
> > AsterixDB's target use cases are not in-memory processing or
> > iterative algorithm support, making AsterixDB complementary to the
> > Apache Spark platform. (Spark interoperability is on our longer-term
> > to-do wishlist.)
> >
> >
> > Homogeneous Developers
> >
> > As mentioned before, the current community is already organizationally
> > and geographically distributed -- and we would like to increase the
> > heterogeneity.
> >
> >
> > Reliance on Salaried Developers
> >
> > Of the initial committers, only 3 are full-time UCI staff. The other
> > committers are a mix of students, alumni who continue to contribute
> > to the effort, and individuals working with permission part-time (or
> > in spare time) on this project.
> >
> >
> > An Excessive Fascination with the Apache Brand
> >
> > We believe in the processes, systems, and framework Apache has put in
> > place. Apache is also known to foster a great community around its
> > projects and provide exposure. While brand is important, our
> > fascination with it is not excessive. We believe that the ASF is the
> > right home for AsterixDB and that having AsterixDB inside of the ASF
> > will lead to a better long-term outcome for the Big Data community.
> >
> >
> > Documentation
> >
> > Documentation and publications related to AsterixDB can be found at
> > http://asterixdb.ics.uci.edu/.
> >
> >
> > Initial Source
> >
> > Current source resides in Google Code:
> > https://code.google.com/p/asterixdb/ (query language and upper system
> > layers) and https://code.google.com/p/hyracks/ (dataflow runtime
> > system and storage management libraries).
> >
> >
> > External Dependencies
> >
> > AsterixDB depends on a number of Apache projects:
> >
> > - Ant
> > - Avro
> > - ApacheDB JDO
> > - Commons
> > - Derby
> > - Hadoop
> > - Hive
> > - HTTPComponents
> > - Jakarta ORO
> > - Maven
> > - Tomcat
> > - Thrift
> > - Velocity
> > - Wicket
> > - Xerces
> >
> > and other open source projects (organized by license):
> >
> > -- ASL 2.0:
> >  - Jackson
> >  - Google Guava
> >  - Google Guice
> >  - JSON-simple
> >  - BoneCP
> >  - Microsoft Azure SDK
> >  - Netty
> >  - Rome
> >  - JetS3t
> >  - Groovy
> >  - Jettison
> >  - Plexus
> >  - Datanucleus (JDO)
> >  - Jetty
> >  - Twitter4J
> >  - Snappy-java
> >
> > -- BSD:
> >  - Antlr
> >  - ObjectWeb ASM
> >  - Protobuf
> >  - JSCH
> >  - JavaCC
> >  - Paranamer
> >  - JLine
> >  - Stax
> >  - StringTemplate
> >  - xmlEnc
> >
> > -- MIT
> >  - AppAssembler
> >  - SimpleLog4J
> >
> > -- CDDL 1.0
> >  - Java Activation Framework
> >  - Java Transactions
> >  - Java Servlet API
> >  - Grizzly
> >  - gmbal
> >  - Glassfish
> >
> > -- CDDL 1.1
> >  - Jersey
> >  - JAXB Reference Implementation
> >
> > -- JSON License
> >  - JSON
> >
> > -- EPL 1.0
> >  - JUnit
> >
> > -- JDOM License
> >  - JDOM
> >
> > -- Public Domain
> >  - xz
> >  - AOPAlliance
> >
> > As all dependencies are managed using Apache Maven, none of the
> > external libraries need to be packaged in a source distribution.
> >
> >
> > Required Resources
> >
> > Developer and user mailing lists
> >
> > private@asterixdb.incubator.apache.org (with moderated subscriptions)
> > commits@asterixdb.incubator.apache.org
> > dev@asterixdb.incubator.apache.org
> > users@asterixdb.incubator.apache.org
> >
> >
> > A git repository
> >
> > https://git-wip-us.apache.org/repos/asf/incubator-asterixdb.git
> >
> >
> > A JIRA issue tracker
> >
> > https://issues.apache.org/jira/browse/ASTERIXDB
> >
> >
> > Initial Committers
> >
> > The following is a list of the planned initial Apache committers (the
> > active subset of the committers for the current repository at Google
> > Code).
> >
> > Abdullah Alamoudi (bamousaa@gmail.com)
> > Cameron Samak (eufery@gmail.com)
> > Chen Li (chenli@gmail.com)
> > Ian Maxon (imaxon@uci.edu)
> > Ildar Absalyamov (ildar.absalyamov@gmail.com)
> > Jianfeng Jia (jianfeng.jia@gmail.com)
> > Keren Ouaknine (kereno@gmail.com)
> > Markus Dreseler (apache@dreseler.de)
> > Mike Carey (dtabass@apache.org)
> > Murtadha Hubail (hubailmor@gmail.com)
> > Pouria Pirzadeh (pouria.pirzadeh@gmail.com)
> > Preston Carman (prestonc@apache.org)
> > Raman Grover (RamanGrover29@gmail.com)
> > Sattam Alsubaiee (salsubaiee@gmail.com)
> > Steven Jacobs (sjaco002@apache.org)
> > Taewoo Kim (wangsaeu@gmail.com)
> > Till Westmann (tillw@apache.org)
> > Vinayak Borkar (vinayakb@apache.org)
> > Yingyi Bu (buyingyi@gmail.com)
> > Young-Seok Kim (kisskys@gmail.com)
> > Zach Heilbron (zheilbron@gmail.com)
> >
> >
> > Affiliations
> >
> > UC Irvine
> > - Mike Carey
> > - Chen Li
> > - Ian Maxon
> > - Yingyi Bu
> > - Raman Grover
> > - Pouria Pirzadeh
> > - Young-Seok Kim
> > - Cameron Samak
> > - Taewoo Kim
> > - Jianfeng Jia
> > - Murtadha Hubail
> > - Markus Dreseler
> >
> > UC Riverside
> > - Ildar Absalyamov
> > - Preston Carman
> > - Steven Jacobs
> >
> > Hebrew University
> > - Keren Ouaknine
> >
> > Oracle
> > - Till Westmann
> >
> > X15 Software
> > - Vinayak Borkar
> > - Zach Heilbron
> >
> > KACST Saudi Arabia
> > - Sattam Alsubaiee
> >
> > Saudi Aramco
> > - Abdullah Alamoudi
> >
> > Carey, Li, and Maxon are full-time UCI staff, with the remaining UCI
> > (UC Irvine) and UCR (UC Riverside) affiliates being students. The
> > non-UC committers are a mix of alumni who continue to contribute to
> > the effort and individuals working with permission part-time (or in
> > spare time) on this project.
> >
> >
> > Sponsors
> >
> > Champion
> >
> > Chris Mattmann (NASA/JPL)
> >
> > Nominated Mentors
> >
> > TBD
> >
> > Sponsoring Entity
> >
> > The Apache Incubator
> >
> >
> >
> >
> >
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > Chris Mattmann, Ph.D.
> > Chief Architect
> > Instrument Software and Science Data Systems Section (398)
> > NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> > Office: 168-519, Mailstop: 168-527
> > Email: chris.a.mattmann@nasa.gov
> > WWW:  http://sunset.usc.edu/~mattmann/
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > Adjunct Associate Professor, Computer Science Department
> > University of Southern California, Los Angeles, CA 90089 USA
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >
> >
> >
> >
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
> For additional commands, e-mail: general-help@incubator.apache.org
> 
>

Re: [PROPOSAL] Apache AsterixDB Incubator

Posted by jan i <ja...@apache.org>.

Looks like a real challenging project, and the proposal looks as if it has
already been through a couple of refinement rounds.

Count on my +1, when it comes to voting.

rgds
jan i

On 19 January 2015 at 19:26, Henry Saputra <he...@gmail.com> wrote:

> +1 This is GREAT News!
>
> Was watching and trying AsterixDB last year and it looked in awesome shape.
>
> I have my plate full but would love to help mentor this project to get
> it going to ASF if needed!
>
> - Henry
>

Re: [PROPOSAL] Apache AsterixDB Incubator

Posted by Till Westmann <ti...@westmann.org>.

Thanks!

Till

> On Jan 20, 2015, at 11:06, Ted Dunning <te...@gmail.com> wrote:
> 
> 
> Added my name to the mentor list.
> 
> 
> 
>> On Tue, Jan 20, 2015 at 8:37 AM, Mike Carey <dt...@gmail.com> wrote:
>> Wonderful; thanks, Ted!!
>> Cheers,
>> Mike
>> 
>>> On 1/19/15 11:29 PM, Ted Dunning wrote:
>>> 
>>> Chris just asked me under separate cover. 
>>> 
>>> I am happy to help out as mentor.
>>> 
>>> 
>>> 
>>> On Mon, Jan 19, 2015 at 8:17 PM, Henry Saputra <he...@gmail.com> wrote:
>>>> Thanks Till,
>>>> 
>>>> Will try to solicit more mentors to help.
>>>> Especially since most of the initial committers have not been exposed
>>>> to contributing the Apache way.
>>>> 
>>>> - Henry
>>>> 
>>>> On Mon, Jan 19, 2015 at 5:28 PM, Till Westmann <ti...@westmann.org> wrote:
>>>> > Hi Henry,
>>>> >
>>>> > thanks! It’s great that you’ve seen (and liked) AsterixDB before.
>>>> >
>>>> > Even if your time is very limited we would be very happy to have you on board as a mentor.
>>>> > I’ll add you to the proposal.
>>>> >
>>>> > Cheers,
>>>> > Till
>>>> >
>>>> >> On Jan 19, 2015, at 10:26 AM, Henry Saputra <he...@gmail.com> wrote:
>>>> >>
>>>> >> +1 This is GREAT News!
>>>> >>
>>>> >> Was watching and trying AsterixDB last year and looked in awesome shape.
>>>> >>
>>>> >> I have my plate full but would love to help mentor this project to get
>>>> >> it going to ASF if needed!
>>>> >>
>>>> >> - Henry
>>>> >>
>>>> >> On Wed, Jan 14, 2015 at 6:21 PM, Mattmann, Chris A (3980)
>>>> >> <ch...@jpl.nasa.gov> wrote:
>>>> >>> Hi Folks,
>>>> >>>
>>>> >>> I am pleased to bring forth the Apache AsterixDB proposal to the
>>>> >>> Apache Incubator as Champion, working in collaboration with the
>>>> >>> team. Please find the wiki proposal here:
>>>> >>>
>>>> >>> https://wiki.apache.org/incubator/AsterixDBProposal
>>>> >>>
>>>> >>>
>>>> >>> Full text of the proposal is below. Please discuss and enjoy. I’ll
>>>> >>> leave the discussion open for a week, and then look to call a VOTE
>>>> >>> hopefully end of next week if all is well.
>>>> >>>
>>>> >>> Cheers!
>>>> >>> Chris Mattmann
>>>> >>>
>>>> >>> =============================================================
>>>> >>> Apache AsterixDB Proposal
>>>> >>>
>>>> >>> Abstract
>>>> >>>
>>>> >>> Apache AsterixDB is a scalable big data management system (BDMS) that
>>>> >>> provides storage, management, and query capabilities for large
>>>> >>> collections of semi-structured data.
>>>> >>>
>>>> >>> Proposal
>>>> >>>
>>>> >>> AsterixDB is a big data management system (BDMS) that makes it
>>>> >>> well-suited to needs such as web data warehousing and social data
>>>> >>> storage and analysis. Feature-wise, AsterixDB has:
>>>> >>>
>>>> >>> * A NoSQL style data model (ADM) based on extending JSON with object
>>>> >>>  database concepts.
>>>> >>> * An expressive and declarative query language (AQL) for querying
>>>> >>>  semi-structured data.
>>>> >>> * A runtime query execution engine, Hyracks, for partitioned-parallel
>>>> >>>  execution of query plans.
>>>> >>> * Partitioned LSM-based data storage and indexing for efficient
>>>> >>>  ingestion of newly arriving data.
>>>> >>> * Support for querying and indexing external data (e.g., in HDFS) as
>>>> >>>  well as data stored within AsterixDB.
>>>> >>> * A rich set of primitive data types, including support for spatial,
>>>> >>>  temporal, and textual data.
>>>> >>> * Indexing options that include B+ trees, R trees, and inverted
>>>> >>>  keyword index support.
>>>> >>> * Basic transactional (concurrency and recovery) capabilities akin to
>>>> >>>  those of a NoSQL store.
>>>> >>>
>>>> >>>
>>>> >>> Background and Rationale
>>>> >>>
>>>> >>> In the world of relational databases, the need to tackle data volumes
>>>> >>> that exceed the capabilities of a single server led to the
>>>> >>> development of “shared-nothing” parallel database systems several
>>>> >>> decades ago. These systems spread data over a cluster based on a
>>>> >>> partitioning strategy, such as hash partitioning, and queries are
>>>> >>> processed by employing partitioned-parallel divide-and-conquer
>>>> >>> techniques. Since these systems are fronted by a high-level,
>>>> >>> declarative language (SQL), their users are shielded from the
>>>> >>> complexities of parallel programming. Parallel database systems have
>>>> >>> been an extremely successful application of parallel computing, and
>>>> >>> quite a number of commercial products exist today.
>>>> >>>
>>>> >>> In the distributed systems world, the Web brought a need to index and
>>>> >>> query its huge content. SQL and relational databases were not the
>>>> >>> answer, though shared-nothing clusters again emerged as the hardware
>>>> >>> platform of choice. Google developed the Google File System (GFS) and
>>>> >>> MapReduce programming model to allow programmers to store and process
>>>> >>> Big Data by writing a few user-defined functions. The MapReduce
>>>> >>> framework applies these functions in parallel to data instances in
>>>> >>> distributed files (map) and to sorted groups of instances sharing a
>>>> >>> common key (reduce) -- not unlike the partitioned parallelism in
>>>> >>> parallel database systems. Apache's Hadoop MapReduce platform is the
>>>> >>> most prominent implementation of this paradigm for the rest of the
>>>> >>> Big Data community. On top of Hadoop and HDFS sit declarative
>>>> >>> languages like Pig and Hive that each compile down to Hadoop
>>>> >>> MapReduce jobs.
>>>> >>>
>>>> >>> The big Web companies were also challenged by extreme user bases
>>>> >>> (100s of millions of users) and needed fast simple lookups and
>>>> >>> updates to very large keyed data sets like user profiles. SQL
>>>> >>> databases were deemed either too expensive or not scalable, so the
>>>> >>> “NoSQL movement” was born. The ASF now has HBase and Cassandra, two
>>>> >>> popular key-value stores, in this space. MongoDB and Couchbase are
>>>> >>> other open source alternatives (document stores).
>>>> >>>
>>>> >>> It is evident from the rapidly growing popularity of "NoSQL" stores,
>>>> >>> as well as the strong demand for Big Data analytics engines today,
>>>> >>> that there is a strong (and growing!) need to store, process, *and*
>>>> >>> query large volumes of semi-structured data in many application
>>>> >>> areas. Until very recently, developers have had to "choose" between
>>>> >>> using big data analytics engines like Apache Hive or Apache Spark,
>>>> >>> which can do complex query processing and analysis over HDFS-resident
>>>> >>> files, and flexible but low-function data stores like MongoDB or
>>>> >>> Apache HBase. (The Apache Phoenix project,
>>>> >>> http://phoenix.apache.org/, is a recent SQL-over-HBase effort that
>>>> >>> aims to bridge between these choices.)
>>>> >>>
>>>> >>> AsterixDB is a highly scalable data management system that can store,
>>>> >>> index, and manage semi-structured data, e.g., much like MongoDB, but
>>>> >>> it also supports a full-power query language with the expressiveness
>>>> >>> of SQL (and more). Unlike analytics engines like Hive or Spark, it
>>>> >>> stores and manages data, so AsterixDB can exploit its knowledge of
>>>> >>> data partitioning and the availability of indexes to avoid always
>>>> >>> scanning data set(s) to process queries. Somewhat surprisingly, there
>>>> >>> is no open source parallel database system (relational or otherwise)
>>>> >>> available to developers today -- AsterixDB aims to fill this need.
>>>> >>> Since Apache is where the majority of today's most important Big
>>>> >>> Data technologies live, the ASF seems like the obvious home for a
>>>> >>> system like AsterixDB.
>>>> >>>
>>>> >>> Current Status
>>>> >>>
>>>> >>> The current version of AsterixDB was co-developed by a team of
>>>> >>> faculty, staff, and students at UC Irvine and UC Riverside. The
>>>> >>> project was initiated as a large NSF-sponsored project in 2009, the
>>>> >>> goal of which was to combine the best ideas from the parallel
>>>> >>> database world, the then new Hadoop world, and the semi-structured
>>>> >>> (e.g., XML/JSON) data world in order to create a next-generation
>>>> >>> BDMS. A first informal open source release was made four years later,
>>>> >>> in June of 2013, under the Apache Software License 2.0.
>>>> >>>
>>>> >>>
>>>> >>> Meritocracy
>>>> >>>
>>>> >>> The current developers are familiar with meritocratic open source
>>>> >>> development at Apache. Apache was chosen specifically because we want
>>>> >>> to encourage this style of development for the project.
>>>> >>>
>>>> >>>
>>>> >>> Community
>>>> >>>
>>>> >>> While AsterixDB started as a university project it has developed into
>>>> >>> a community. A number of the initial committers started contributing
>>>> >>> in academia and continue to actively participate and contribute after
>>>> >>> graduation. And we seek to further develop developer and user
>>>> >>> communities. One way to broaden the community that is ongoing is
>>>> >>> through academic collaborations (currently with IIT Mumbai in India
>>>> >>> and TU Berlin in Germany). During incubation we will also explicitly
>>>> >>> seek increased industrial participation.
>>>> >>>
>>>> >>> Some indicators of the effort's development community and history can
>>>> >>> be found at:
>>>> >>> https://www.openhub.net/p/asterixdb/contributors?query=&sort=commits_12_mo,
>>>> >>> https://www.openhub.net/p/hyracks/contributors?query=&sort=commits_12_mo
>>>> >>>
>>>> >>>
>>>> >>> Core Developers
>>>> >>>
>>>> >>> The core developers of the project are diverse, although initially UC
>>>> >>> Irvine heavy (roughly 50%) due to the project's origins at UCI. The
>>>> >>> other 50% are from other academic institutions (UC Riverside and the
>>>> >>> Hebrew University in Jerusalem) and companies (Couchbase, Facebook,
>>>> >>> IBM, KACST Saudi Arabia, Oracle, Saudi Aramco, X15 Software).
>>>> >>>
>>>> >>>
>>>> >>> Alignment
>>>> >>>
>>>> >>> Apache is, by far, the most natural home for taking the AsterixDB
>>>> >>> project forward. A large fraction of today's top Big Data
>>>> >>> technologies have their homes in Apache, including Hadoop, YARN, Pig,
>>>> >>> Hive, Spark, Flink, HBase, Cassandra and others. AsterixDB fills a
>>>> >>> significant gap -- the parallel data management system gap -- that
>>>> >>> exists in the Big Data open source world. It is well-aligned with a
>>>> >>> number of the Apache projects, e.g., it has strong support for
>>>> >>> accessing and indexing external data in HDFS, and it uses YARN as an
>>>> >>> answer to basic cluster resource management. AsterixDB also seeks to
>>>> >>> achieve an Apache-style development model; it is seeking a broader
>>>> >>> community of contributors and users in order to achieve its full
>>>> >>> potential and value to the Big Data community.
>>>> >>>
>>>> >>> There are also a number of related Apache projects and dependencies
>>>> >>> that will be mentioned below in the Relationships with Other Apache
>>>> >>> products section.
>>>> >>>
>>>> >>>
>>>> >>> Known Risks
>>>> >>>
>>>> >>> Orphaned products
>>>> >>>
>>>> >>> Given the current level of intellectual investment in AsterixDB, the
>>>> >>> risk of the project being abandoned is very small. The UCI/UCR
>>>> >>> faculty team leads are highly incentivized to continue development
>>>> >>> since the database groups at UC Irvine and UC Riverside are both
>>>> >>> reliant on AsterixDB as a platform for long-term graduate research
>>>> >>> projects. UC San Diego is also beginning to contribute to the code
>>>> >>> base, and a collaboration involving public health applications is
>>>> >>> forming with UCLA. The work on AsterixDB is managed via a mix of
>>>> >>> mailing list discussions supplemented by weekly project status
>>>> >>> meetings which are summarized on the mailing list. Typical (local
>>>> >>> plus Skype-in) attendance at the weekly status meetings runs at about
>>>> >>> 20 active contributors.
>>>> >>>
>>>> >>>
>>>> >>> Inexperience with Open Source
>>>> >>>
>>>> >>> AsterixDB and Hyracks were completely developed in Open Source under
>>>> >>> the ASL 2.0. The source code repositories, issue tracker, and mailing
>>>> >>> lists are available on Google Code and discussions and decisions
>>>> >>> happen on the mailing lists (which is necessary due to the geographic
>>>> >>> distribution of the current developers).
>>>> >>>
>>>> >>> Also a few of the initial committers have contributed to Apache
>>>> >>> projects. Vinayak Borkar is a committer on the Apache Helix and
>>>> >>> Apache VXQuery projects. Till Westmann is the VP VXQuery at the ASF
>>>> >>> and an IPMC member. Preston Carman and Steven Jacobs are committers
>>>> >>> on the Apache VXQuery project.
>>>> >>>
>>>> >>>
>>>> >>> Relationships with Other Apache Products
>>>> >>>
>>>> >>> Apache VXQuery is based on the Hyracks data-parallel runtime, which
>>>> >>> is also included in the AsterixDB code base.
>>>> >>>
>>>> >>> AsterixDB is closely related to Apache Hadoop. Included in AsterixDB
>>>> >>> is support for accessing external data in HDFS (and Hive formats),
>>>> >>> and resource management and system administration features are in the
>>>> >>> process of being migrated to YARN.
>>>> >>>
>>>> >>> AsterixDB's AQL query facilities offer comparable query power to
>>>> >>> Apache's Pig and Hive systems for big data analytics. AsterixDB
>>>> >>> differs in storing and indexing data and thus being able to quickly
>>>> >>> answer small and medium queries without large HDFS data scans -
>>>> >>> thereby targeting a different class of use cases.
>>>> >>>
>>>> >>> AsterixDB's data storage and indexing facilities are similar to those
>>>> >>> of HBase, but AsterixDB differs in being a much more complete and
>>>> >>> queryable BDMS (not just a key-value style store).
>>>> >>>
>>>> >>> AsterixDB's target use cases are not in-memory processing or
>>>> >>> iterative algorithm support, making AsterixDB complementary to the
>>>> >>> Apache Spark platform. (Spark interoperability is on our longer-term
>>>> >>> to-do wishlist.)
>>>> >>>
>>>> >>>
>>>> >>> Homogeneous Developers
>>>> >>>
>>>> >>> As mentioned before the current community is already organizationally
>>>> >>> and geographically distributed - and we would like to increase the
>>>> >>> heterogeneity.
>>>> >>>
>>>> >>>
>>>> >>> Reliance on Salaried Developers
>>>> >>>
>>>> >>> Of the initial committers only 3 are full-time UCI staff. The other
>>>> >>> committers are a mix of students, alumni who continue to contribute
>>>> >>> to the effort, and individuals working with permission part-time (or
>>>> >>> in spare time) on this project.
>>>> >>>
>>>> >>>
>>>> >>> An Excessive Fascination with the Apache Brand
>>>> >>>
>>>> >>> We believe in the processes, systems, and framework Apache has put in
>>>> >>> place. Apache is also known to foster a great community around their
>>>> >>> projects and provide exposure. While brand is important, our
>>>> >>> fascination with it is not excessive. We believe that the ASF is the
>>>> >>> right home for AsterixDB and that having AsterixDB inside of the ASF
>>>> >>> will lead to a better long-term outcome for the Big Data community.
>>>> >>>
>>>> >>>
>>>> >>> Documentation
>>>> >>>
>>>> >>> Documentation and publications related to AsterixDB can be found at
>>>> >>> http://asterixdb.ics.uci.edu/.
>>>> >>>
>>>> >>>
>>>> >>> Initial Source
>>>> >>>
>>>> >>> Current source resides in Google code:
>>>> >>> https://code.google.com/p/asterixdb/ (query language and upper system
>>>> >>> layers) and https://code.google.com/p/hyracks/ (dataflow runtime
>>>> >>> system and storage management libraries).
>>>> >>>
>>>> >>>
>>>> >>> External Dependencies
>>>> >>>
>>>> >>> AsterixDB depends on a number of Apache projects:
>>>> >>>
>>>> >>> - Ant
>>>> >>> - Avro
>>>> >>> - ApacheDB JDO
>>>> >>> - Commons
>>>> >>> - Derby
>>>> >>> - Hadoop
>>>> >>> - Hive
>>>> >>> - HTTPComponents
>>>> >>> - Jakarta ORO
>>>> >>> - Maven
>>>> >>> - Tomcat
>>>> >>> - Thrift
>>>> >>> - Velocity
>>>> >>> - Wicket
>>>> >>> - Xerces
>>>> >>>
>>>> >>> and other open source projects (organized by license):
>>>> >>>
>>>> >>> -- ASL 2.0:
>>>> >>> - Jackson
>>>> >>> - Google Guava
>>>> >>> - Google Guice
>>>> >>> - JSON-simple
>>>> >>> - BoneCP
>>>> >>> - Microsoft Azure SDK
>>>> >>> - Netty
>>>> >>> - Rome
>>>> >>> - JetS3t
>>>> >>> - Groovy
>>>> >>> - Jettison
>>>> >>> - Plexus
>>>> >>> - Datanucleus (JDO)
>>>> >>> - Jetty
>>>> >>> - Twitter4J
>>>> >>> - Snappy-java
>>>> >>>
>>>> >>> -- BSD:
>>>> >>> - Antlr
>>>> >>> - ObjectWeb ASM
>>>> >>> - Protobuf
>>>> >>> - JSCH
>>>> >>> - JavaCC
>>>> >>> - Paranamer
>>>> >>> - JLine
>>>> >>> - Stax
>>>> >>> - StringTemplate
>>>> >>> - xmlEnc
>>>> >>>
>>>> >>> -- MIT
>>>> >>> - AppAssembler
>>>> >>> - SimpleLog4J
>>>> >>>
>>>> >>> -- CDDL 1.0
>>>> >>> - Java Activation Framework
>>>> >>> - Java Transactions
>>>> >>> - Java Servlet API
>>>> >>> - Grizzly
>>>> >>> - gmbal
>>>> >>> - Glassfish
>>>> >>>
>>>> >>> -- CDDL 1.1
>>>> >>> - Jersey
>>>> >>> - JAXB Reference Implementation
>>>> >>>
>>>> >>> -- JSON License
>>>> >>> - JSON
>>>> >>>
>>>> >>> -- EPL 1.0
>>>> >>> - JUnit
>>>> >>>
>>>> >>> -- JDOM License
>>>> >>> - JDOM
>>>> >>>
>>>> >>> -- Public Domain
>>>> >>> - xz
>>>> >>> - AOPAlliance
>>>> >>>
>>>> >>> As all dependencies are managed using Apache Maven, none of the
>>>> >>> external libraries need to be packaged in a source distribution.
>>>> >>>
>>>> >>>
>>>> >>> Required Resources
>>>> >>>
>>>> >>> Developer and user mailing lists
>>>> >>>
>>>> >>> private@asterixdb.incubator.apache.org (with moderated subscriptions)
>>>> >>> commits@asterixdb.incubator.apache.org
>>>> >>> dev@asterixdb.incubator.apache.org
>>>> >>> users@asterixdb.incubator.apache.org
>>>> >>>
>>>> >>>
>>>> >>> A git repository
>>>> >>>
>>>> >>> https://git-wip-us.apache.org/repos/asf/incubator-asterixdb.git
>>>> >>>
>>>> >>>
>>>> >>> A JIRA issue tracker
>>>> >>>
>>>> >>> https://issues.apache.org/jira/browse/ASTERIXDB
>>>> >>>
>>>> >>>
>>>> >>> Initial Committers
>>>> >>>
>>>> >>> The following is a list of the planned initial Apache committers (the
>>>> >>> active subset of the committers for the current repository at Google
>>>> >>> code).
>>>> >>>
>>>> >>> Abdullah Alamoudi (bamousaa@gmail.com)
>>>> >>> Cameron Samak (eufery@gmail.com)
>>>> >>> Chen Li (chenli@gmail.com)
>>>> >>> Ian Maxon (imaxon@uci.edu)
>>>> >>> Ildar Absalyamov (ildar.absalyamov@gmail.com)
>>>> >>> Jianfeng Jia (jianfeng.jia@gmail.com)
>>>> >>> Keren Ouaknine (kereno@gmail.com)
>>>> >>> Markus Dreseler (apache@dreseler.de)
>>>> >>> Mike Carey (dtabass@apache.org)
>>>> >>> Murtadha Hubail (hubailmor@gmail.com)
>>>> >>> Pouria Pirzadeh (pouria.pirzadeh@gmail.com)
>>>> >>> Preston Carman (prestonc@apache.org)
>>>> >>> Raman Grover (RamanGrover29@gmail.com)
>>>> >>> Sattam Alsubaiee (salsubaiee@gmail.com)
>>>> >>> Steven Jacobs (sjaco002@apache.org)
>>>> >>> Taewoo Kim (wangsaeu@gmail.com)
>>>> >>> Till Westmann (tillw@apache.org)
>>>> >>> Vinayak Borkar (vinayakb@apache.org)
>>>> >>> Yingyi Bu (buyingyi@gmail.com)
>>>> >>> Young-Seok Kim (kisskys@gmail.com)
>>>> >>> Zach Heilbron (zheilbron@gmail.com)
>>>> >>>
>>>> >>>
>>>> >>> Affiliations
>>>> >>>
>>>> >>> UC Irvine
>>>> >>> - Mike Carey
>>>> >>> - Chen Li
>>>> >>> - Ian Maxon
>>>> >>> - Yingyi Bu
>>>> >>> - Raman Grover
>>>> >>> - Pouria Pirzadeh
>>>> >>> - Young-Seok Kim
>>>> >>> - Cameron Samak
>>>> >>> - Taewoo Kim
>>>> >>> - Jianfeng Jia
>>>> >>> - Murtadha Hubail
>>>> >>> - Markus Dreseler
>>>> >>>
>>>> >>> UC Riverside
>>>> >>> - Ildar Absalyamov
>>>> >>> - Preston Carman
>>>> >>> - Steven Jacobs
>>>> >>>
>>>> >>> Hebrew University
>>>> >>> - Keren Ouaknine
>>>> >>>
>>>> >>> Oracle
>>>> >>> - Till Westmann
>>>> >>>
>>>> >>> X15 Software
>>>> >>> - Vinayak Borkar
>>>> >>> - Zach Heilbron
>>>> >>>
>>>> >>> KACST Saudi Arabia
>>>> >>> - Sattam Alsubaiee
>>>> >>>
>>>> >>> Saudi Aramco
>>>> >>> - Abdullah Alamoudi
>>>> >>>
>>>> >>> Carey, Li, and Maxon are full-time UCI staff, with the remaining UCI
>>>> >>> (UC Irvine) and UCR (UC Riverside) affiliates being students. The
>>>> >>> non-UC committers are a mix of alumni who continue to contribute to
>>>> >>> the effort and individuals working with permission part-time (or in
>>>> >>> spare time) on this project.
>>>> >>>
>>>> >>>
>>>> >>> Sponsors
>>>> >>>
>>>> >>> Champion
>>>> >>>
>>>> >>> Chris Mattmann (NASA/JPL)
>>>> >>>
>>>> >>> Nominated Mentors
>>>> >>>
>>>> >>> TBD
>>>> >>>
>>>> >>> Sponsoring Entity
>>>> >>>
>>>> >>> The Apache Incubator
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>> >>> Chris Mattmann, Ph.D.
>>>> >>> Chief Architect
>>>> >>> Instrument Software and Science Data Systems Section (398)
>>>> >>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>>> >>> Office: 168-519, Mailstop: 168-527
>>>> >>> Email: chris.a.mattmann@nasa.gov
>>>> >>> WWW:  http://sunset.usc.edu/~mattmann/
>>>> >>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>> >>> Adjunct Associate Professor, Computer Science Department
>>>> >>> University of Southern California, Los Angeles, CA 90089 USA
>>>> >>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >
>>>> 
>

Re: [PROPOSAL] Apache AsterixDB Incubator

Posted by Ted Dunning <te...@gmail.com>.

Added my name to the mentor list.



On Tue, Jan 20, 2015 at 8:37 AM, Mike Carey <dt...@gmail.com> wrote:

>  Wonderful; thanks, Ted!!
> Cheers,
> Mike
>
>  On 1/19/15 11:29 PM, Ted Dunning wrote:
>
>
> Chris just asked me under separate cover.
>
>  I am happy to help out as mentor.
>
>
>
> On Mon, Jan 19, 2015 at 8:17 PM, Henry Saputra <he...@gmail.com> wrote:
>
>> Thanks Till,
>>
>> Will try to solicit more mentors to help.
>> Especially since most of the initial committers have not been exposed
>> to contributing the Apache way.
>>
>> - Henry
>>
>> On Mon, Jan 19, 2015 at 5:28 PM, Till Westmann <ti...@westmann.org> wrote:
>> > Hi Henry,
>> >
>> > thanks! It’s great that you’ve seen (and liked) AsterixDB before.
>> >
>> > Even if your time is very limited we would be very happy to have you on board as a mentor.
>> > I’ll add you to the proposal.
>> >
>> > Cheers,
>> > Till
>> >
>> >> On Jan 19, 2015, at 10:26 AM, Henry Saputra <he...@gmail.com> wrote:
>> >>
>> >> +1 This is GREAT News!
>> >>
>> >> Was watching and trying AsterixDB last year and looked in awesome shape.
>> >>
>> >> I have my plate full but would love to help mentor this project to get
>> >> it going to ASF if needed!
>> >>
>> >> - Henry
>> >>
>> >> On Wed, Jan 14, 2015 at 6:21 PM, Mattmann, Chris A (3980)
>> >> <ch...@jpl.nasa.gov> wrote:
>> >>> Hi Folks,
>> >>>
>> >>> I am pleased to bring forth the Apache AsterixDB proposal to the
>> >>> Apache Incubator as Champion, working in collaboration with the
>> >>> team. Please find the wiki proposal here:
>> >>>
>> >>> https://wiki.apache.org/incubator/AsterixDBProposal
>> >>>
>> >>>
>> >>> Full text of the proposal is below. Please discuss and enjoy. I’ll
>> >>> leave the discussion open for a week, and then look to call a VOTE
>> >>> hopefully end of next week if all is well.
>> >>>
>> >>> Cheers!
>> >>> Chris Mattmann
>> >>>
>> >>> =============================================================
>> >>> Apache AsterixDB Proposal
>> >>>
>> >>> Abstract
>> >>>
>> >>> Apache AsterixDB is a scalable big data management system (BDMS) that
>> >>> provides storage, management, and query capabilities for large
>> >>> collections of semi-structured data.
>> >>>
>> >>> Proposal
>> >>>
>> >>> AsterixDB is a big data management system (BDMS) that makes it
>> >>> well-suited to needs such as web data warehousing and social data
>> >>> storage and analysis. Feature-wise, AsterixDB has:
>> >>>
>> >>> * A NoSQL style data model (ADM) based on extending JSON with object
>> >>>  database concepts.
>> >>> * An expressive and declarative query language (AQL) for querying
>> >>>  semi-structured data.
>> >>> * A runtime query execution engine, Hyracks, for partitioned-parallel
>> >>>  execution of query plans.
>> >>> * Partitioned LSM-based data storage and indexing for efficient
>> >>>  ingestion of newly arriving data.
>> >>> * Support for querying and indexing external data (e.g., in HDFS) as
>> >>>  well as data stored within AsterixDB.
>> >>> * A rich set of primitive data types, including support for spatial,
>> >>>  temporal, and textual data.
>> >>> * Indexing options that include B+ trees, R trees, and inverted
>> >>>  keyword index support.
>> >>> * Basic transactional (concurrency and recovery) capabilities akin to
>> >>>  those of a NoSQL store.
>> >>>
>> >>>
>> >>> Background and Rationale
>> >>>
>> >>> In the world of relational databases, the need to tackle data volumes
>> >>> that exceed the capabilities of a single server led to the
>> >>> development of “shared-nothing” parallel database systems several
>> >>> decades ago. These systems spread data over a cluster based on a
>> >>> partitioning strategy, such as hash partitioning, and queries are
>> >>> processed by employing partitioned-parallel divide-and-conquer
>> >>> techniques. Since these systems are fronted by a high-level,
>> >>> declarative language (SQL), their users are shielded from the
>> >>> complexities of parallel programming. Parallel database systems have
>> >>> been an extremely successful application of parallel computing, and
>> >>> quite a number of commercial products exist today.
>> >>>
>> >>> In the distributed systems world, the Web brought a need to index and
>> >>> query its huge content. SQL and relational databases were not the
>> >>> answer, though shared-nothing clusters again emerged as the hardware
>> >>> platform of choice. Google developed the Google File System (GFS) and
>> >>> MapReduce programming model to allow programmers to store and process
>> >>> Big Data by writing a few user-defined functions. The MapReduce
>> >>> framework applies these functions in parallel to data instances in
>> >>> distributed files (map) and to sorted groups of instances sharing a
>> >>> common key (reduce) -- not unlike the partitioned parallelism in
>> >>> parallel database systems. Apache's Hadoop MapReduce platform is the
>> >>> most prominent implementation of this paradigm for the rest of the
>> >>> Big Data community. On top of Hadoop and HDFS sit declarative
>> >>> languages like Pig and Hive that each compile down to Hadoop
>> >>> MapReduce jobs.
>> >>>
>> >>> The big Web companies were also challenged by extreme user bases
>> >>> (100s of millions of users) and needed fast simple lookups and
>> >>> updates to very large keyed data sets like user profiles. SQL
>> >>> databases were deemed either too expensive or not scalable, so the
>> >>> “NoSQL movement” was born. The ASF now has HBase and Cassandra, two
>> >>> popular key-value stores, in this space. MongoDB and Couchbase are
>> >>> other open source alternatives (document stores).
>> >>>
>> >>> It is evident from the rapidly growing popularity of "NoSQL" stores,
>> >>> as well as the strong demand for Big Data analytics engines today,
>> >>> that there is a strong (and growing!) need to store, process, *and*
>> >>> query large volumes of semi-structured data in many application
>> >>> areas. Until very recently, developers have had to "choose" between
>> >>> using big data analytics engines like Apache Hive or Apache Spark,
>> >>> which can do complex query processing and analysis over HDFS-resident
>> >>> files, and flexible but low-function data stores like MongoDB or
>> >>> Apache HBase. (The Apache Phoenix project,
>> >>> http://phoenix.apache.org/, is a recent SQL-over-HBase effort that
>> >>> aims to bridge between these choices.)
>> >>>
>> >>> AsterixDB is a highly scalable data management system that can store,
>> >>> index, and manage semi-structured data, e.g., much like MongoDB, but
>> >>> it also supports a full-power query language with the expressiveness
>> >>> of SQL (and more). Unlike analytics engines like Hive or Spark, it
>> >>> stores and manages data, so AsterixDB can exploit its knowledge of
>> >>> data partitioning and the availability of indexes to avoid always
>> >>> scanning data set(s) to process queries. Somewhat surprisingly, there
>> >>> is no open source parallel database system (relational or otherwise)
>> >>> available to developers today -- AsterixDB aims to fill this need.
>> >>> Since Apache is where the majority of today's most important Big
>> >>> Data technologies live, the ASF seems like the obvious home for a
>> >>> system like AsterixDB.
>> >>>
>> >>> Current Status
>> >>>
>> >>> The current version of AsterixDB was co-developed by a team of
>> >>> faculty, staff, and students at UC Irvine and UC Riverside. The
>> >>> project was initiated as a large NSF-sponsored project in 2009, the
>> >>> goal of which was to combine the best ideas from the parallel
>> >>> database world, the then new Hadoop world, and the semi-structured
>> >>> (e.g., XML/JSON) data world in order to create a next-generation
>> >>> BDMS. A first informal open source release was made four years later,
>> >>> in June of 2013, under the Apache Software License 2.0.
>> >>>
>> >>>
>> >>> Meritocracy
>> >>>
>> >>> The current developers are familiar with meritocratic open source
>> >>> development at Apache. Apache was chosen specifically because we want
>> >>> to encourage this style of development for the project.
>> >>>
>> >>>
>> >>> Community
>> >>>
>> >>> While AsterixDB started as a university project it has developed into
>> >>> a community. A number of the initial committers started contributing
>> >>> in academia and continue to actively participate and contribute after
>> >>> graduation. And we seek to further develop developer and user
>> >>> communities. One way to broaden the community that is ongoing is
>> >>> through academic collaborations (currently with IIT Mumbai in India
>> >>> and TU Berlin in Germany). During incubation we will also explicitly
>> >>> seek increased industrial participation.
>> >>>
>> >>> Some indicators of the effort's development community and history can
>> >>> be found at:
>> >>> https://www.openhub.net/p/asterixdb/contributors?query=&sort=commits_12_mo,
>> >>> https://www.openhub.net/p/hyracks/contributors?query=&sort=commits_12_mo
>> >>>
>> >>>
>> >>> Core Developers
>> >>>
>> >>> The core developers of the project are diverse, although initially UC
>> >>> Irvine heavy (roughly 50%) due to the project's origins at UCI. The
>> >>> other 50% are from other academic institutions (UC Riverside and the
>> >>> Hebrew University in Jerusalem) and companies (Couchbase, Facebook,
>> >>> IBM, KACST Saudi Arabia, Oracle, Saudi Aramco, X15 Software).
>> >>>
>> >>>
>> >>> Alignment
>> >>>
>> >>> Apache is, by far, the most natural home for taking the AsterixDB
>> >>> project forward. A large fraction of today's top Big Data
>> >>> technologies have their homes in Apache, including Hadoop, YARN, Pig,
>> >>> Hive, Spark, Flink, HBase, Cassandra and others. AsterixDB fills a
>> >>> significant gap -- the parallel data management system gap -- that
>> >>> exists in the Big Data open source world. It is well-aligned with a
>> >>> number of the Apache projects, e.g., it has strong support for
>> >>> accessing and indexing external data in HDFS, and it uses YARN as an
>> >>> answer to basic cluster resource management. AsterixDB also seeks to
>> >>> achieve an Apache-style development model; it is seeking a broader
>> >>> community of contributors and users in order to achieve its full
>> >>> potential and value to the Big Data community.
>> >>>
>> >>> There are also a number of related Apache projects and dependencies,
>> >>> which are described below in the Relationships with Other Apache
>> >>> Products section.
>> >>>
>> >>>
>> >>> Known Risks
>> >>>
>> >>> Orphaned products
>> >>>
>> >>> Given the current level of intellectual investment in AsterixDB, the
>> >>> risk of the project being abandoned is very small. The UCI/UCR
>> >>> faculty team leads are highly incentivized to continue development
>> >>> since the database groups at UC Irvine and UC Riverside are both
>> >>> reliant on AsterixDB as a platform for long-term graduate research
>> >>> projects. UC San Diego is also beginning to contribute to the code
>> >>> base, and a collaboration involving public health applications is
>> >>> forming with UCLA. The work on AsterixDB is managed via a mix of
>> >>> mailing list discussions supplemented by weekly project status
>> >>> meetings which are summarized on the mailing list. Typical (local
>> >>> plus Skype-in) attendance at the weekly status meetings runs at about
>> >>> 20 active contributors.
>> >>>
>> >>>
>> >>> Inexperience with Open Source
>> >>>
>> >>> AsterixDB and Hyracks were developed entirely as open source under
>> >>> the ASL 2.0. The source code repositories, issue tracker, and mailing
>> >>> lists are available on Google Code, and discussions and decisions
>> >>> happen on the mailing lists (which is necessary due to the geographic
>> >>> distribution of the current developers).
>> >>>
>> >>> Also, a few of the initial committers have contributed to Apache
>> >>> projects. Vinayak Borkar is a committer on the Apache Helix and
>> >>> Apache VXQuery projects. Till Westmann is the VP VXQuery at the ASF
>> >>> and an IPMC member. Preston Carman and Steven Jacobs are committers
>> >>> on the Apache VXQuery project.
>> >>>
>> >>>
>> >>> Relationships with Other Apache Products
>> >>>
>> >>> Apache VXQuery is based on the Hyracks data-parallel runtime, which
>> >>> is also included in the AsterixDB code base.
>> >>>
>> >>> AsterixDB is closely related to Apache Hadoop. Included in AsterixDB
>> >>> is support for accessing external data in HDFS (and Hive formats),
>> >>> and resource management and system administration features are in the
>> >>> process of being migrated to YARN.
>> >>>
>> >>> AsterixDB's AQL query facilities offer comparable query power to
>> >>> Apache's Pig and Hive systems for big data analytics. AsterixDB
>> >>> differs in storing and indexing data and thus being able to quickly
>> >>> answer small and medium queries without large HDFS data scans -
>> >>> thereby targeting a different class of use cases.
>> >>>
>> >>> AsterixDB's data storage and indexing facilities are similar to those
>> >>> of HBase, but AsterixDB differs in being a much more complete and
>> >>> queryable BDMS (not just a key-value style store).
>> >>>
>> >>> AsterixDB's target use cases are not in-memory processing or
>> >>> iterative algorithm support, making AsterixDB complementary to the
>> >>> Apache Spark platform. (Spark interoperability is on our longer-term
>> >>> to-do wishlist.)
>> >>>
>> >>>
>> >>> Homogeneous Developers
>> >>>
>> >>> As mentioned before, the current community is already organizationally
>> >>> and geographically distributed, and we would like to increase its
>> >>> heterogeneity further.
>> >>>
>> >>>
>> >>> Reliance on Salaried Developers
>> >>>
>> >>> Of the initial committers, only 3 are full-time UCI staff. The other
>> >>> committers are a mix of students, alumni who continue to contribute
>> >>> to the effort, and individuals working with permission part-time (or
>> >>> in spare time) on this project.
>> >>>
>> >>>
>> >>> An Excessive Fascination with the Apache Brand
>> >>>
>> >>> We believe in the processes, systems, and framework Apache has put in
>> >>> place. Apache is also known to foster a great community around their
>> >>> projects and provide exposure. While brand is important, our
>> >>> fascination with it is not excessive. We believe that the ASF is the
>> >>> right home for AsterixDB and that having AsterixDB inside of the ASF
>> >>> will lead to a better long-term outcome for the Big Data community.
>> >>>
>> >>>
>> >>> Documentation
>> >>>
>> >>> Documentation and publications related to AsterixDB can be found at
>> >>> http://asterixdb.ics.uci.edu/.
>> >>>
>> >>>
>> >>> Initial Source
>> >>>
>> >>> Current source resides in Google Code:
>> >>> https://code.google.com/p/asterixdb/ (query language and upper system
>> >>> layers) and https://code.google.com/p/hyracks/ (dataflow runtime
>> >>> system and storage management libraries).
>> >>>
>> >>>
>> >>> External Dependencies
>> >>>
>> >>> AsterixDB depends on a number of Apache projects:
>> >>>
>> >>> - Ant
>> >>> - Avro
>> >>> - ApacheDB JDO
>> >>> - Commons
>> >>> - Derby
>> >>> - Hadoop
>> >>> - Hive
>> >>> - HTTPComponents
>> >>> - Jakarta ORO
>> >>> - Maven
>> >>> - Tomcat
>> >>> - Thrift
>> >>> - Velocity
>> >>> - Wicket
>> >>> - Xerces
>> >>>
>> >>> and other open source projects (organized by license):
>> >>>
>> >>> -- ASL 2.0
>> >>> - Jackson
>> >>> - Google Guava
>> >>> - Google Guice
>> >>> - JSON-simple
>> >>> - BoneCP
>> >>> - Microsoft Azure SDK
>> >>> - Netty
>> >>> - Rome
>> >>> - JetS3t
>> >>> - Groovy
>> >>> - Jettison
>> >>> - Plexus
>> >>> - Datanucleus (JDO)
>> >>> - Jetty
>> >>> - Twitter4J
>> >>> - Snappy-java
>> >>>
>> >>> -- BSD
>> >>> - Antlr
>> >>> - ObjectWeb ASM
>> >>> - Protobuf
>> >>> - JSCH
>> >>> - JavaCC
>> >>> - Paranamer
>> >>> - JLine
>> >>> - Stax
>> >>> - StringTemplate
>> >>> - xmlEnc
>> >>>
>> >>> -- MIT
>> >>> - AppAssembler
>> >>> - SimpleLog4J
>> >>>
>> >>> -- CDDL 1.0
>> >>> - Java Activation Framework
>> >>> - Java Transactions
>> >>> - Java Servlet API
>> >>> - Grizzly
>> >>> - gmbal
>> >>> - Glassfish
>> >>>
>> >>> -- CDDL 1.1
>> >>> - Jersey
>> >>> - JAXB Reference Implementation
>> >>>
>> >>> -- JSON License
>> >>> - JSON
>> >>>
>> >>> -- EPL 1.0
>> >>> - JUnit
>> >>>
>> >>> -- JDOM License
>> >>> - JDOM
>> >>>
>> >>> -- Public Domain
>> >>> - xz
>> >>> - AOPAlliance
>> >>>
>> >>> As all dependencies are managed using Apache Maven, none of the
>> >>> external libraries need to be packaged in a source distribution.
>> >>>
>> >>>
>> >>> Required Resources
>> >>>
>> >>> Developer and user mailing lists
>> >>>
>> >>> private@asterixdb.incubator.apache.org (with moderated subscriptions)
>> >>> commits@asterixdb.incubator.apache.org
>> >>> dev@asterixdb.incubator.apache.org
>> >>> users@asterixdb.incubator.apache.org
>> >>>
>> >>>
>> >>> A git repository
>> >>>
>> >>> https://git-wip-us.apache.org/repos/asf/incubator-asterixdb.git
>> >>>
>> >>>
>> >>> A JIRA issue tracker
>> >>>
>> >>> https://issues.apache.org/jira/browse/ASTERIXDB
>> >>>
>> >>>
>> >>> Initial Committers
>> >>>
>> >>> The following is a list of the planned initial Apache committers (the
>> >>> active subset of the committers for the current repository at Google
>> >>> Code).
>> >>>
>> >>> Abdullah Alamoudi (bamousaa@gmail.com)
>> >>> Cameron Samak (eufery@gmail.com)
>> >>> Chen Li (chenli@gmail.com)
>> >>> Ian Maxon (imaxon@uci.edu)
>> >>> Ildar Absalyamov (ildar.absalyamov@gmail.com)
>> >>> Jianfeng Jia (jianfeng.jia@gmail.com)
>> >>> Keren Ouaknine (kereno@gmail.com)
>> >>> Markus Dreseler (apache@dreseler.de)
>> >>> Mike Carey (dtabass@apache.org)
>> >>> Murtadha Hubail (hubailmor@gmail.com)
>> >>> Pouria Pirzadeh (pouria.pirzadeh@gmail.com)
>> >>> Preston Carman (prestonc@apache.org)
>> >>> Raman Grover (RamanGrover29@gmail.com)
>> >>> Sattam Alsubaiee (salsubaiee@gmail.com)
>> >>> Steven Jacobs (sjaco002@apache.org)
>> >>> Taewoo Kim (wangsaeu@gmail.com)
>> >>> Till Westmann (tillw@apache.org)
>> >>> Vinayak Borkar (vinayakb@apache.org)
>> >>> Yingyi Bu (buyingyi@gmail.com)
>> >>> Young-Seok Kim (kisskys@gmail.com)
>> >>> Zach Heilbron (zheilbron@gmail.com)
>> >>>
>> >>>
>> >>> Affiliations
>> >>>
>> >>> UC Irvine
>> >>> - Mike Carey
>> >>> - Chen Li
>> >>> - Ian Maxon
>> >>> - Yingyi Bu
>> >>> - Raman Grover
>> >>> - Pouria Pirzadeh
>> >>> - Young-Seok Kim
>> >>> - Cameron Samak
>> >>> - Taewoo Kim
>> >>> - Jianfeng Jia
>> >>> - Murtadha Hubail
>> >>> - Markus Dreseler
>> >>>
>> >>> UC Riverside
>> >>> - Ildar Absalyamov
>> >>> - Preston Carman
>> >>> - Steven Jacobs
>> >>>
>> >>> Hebrew University
>> >>> - Keren Ouaknine
>> >>>
>> >>> Oracle
>> >>> - Till Westmann
>> >>>
>> >>> X15 Software
>> >>> - Vinayak Borkar
>> >>> - Zach Heilbron
>> >>>
>> >>> KACST Saudi Arabia
>> >>> - Sattam Alsubaiee
>> >>>
>> >>> Saudi Aramco
>> >>> - Abdullah Alamoudi
>> >>>
>> >>> Carey, Li, and Maxon are full-time UCI staff, with the remaining UCI
>> >>> (UC Irvine) and UCR (UC Riverside) affiliates being students. The
>> >>> non-UC committers are a mix of alumni who continue to contribute to
>> >>> the effort and individuals working with permission part-time (or in
>> >>> spare time) on this project.
>> >>>
>> >>>
>> >>> Sponsors
>> >>>
>> >>> Champion
>> >>>
>> >>> Chris Mattmann (NASA/JPL)
>> >>>
>> >>> Nominated Mentors
>> >>>
>> >>> TBD
>> >>>
>> >>> Sponsoring Entity
>> >>>
>> >>> The Apache Incubator
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> >>> Chris Mattmann, Ph.D.
>> >>> Chief Architect
>> >>> Instrument Software and Science Data Systems Section (398)
>> >>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> >>> Office: 168-519, Mailstop: 168-527
>> >>> Email: chris.a.mattmann@nasa.gov
>> >>> WWW:  http://sunset.usc.edu/~mattmann/
>> >>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> >>> Adjunct Associate Professor, Computer Science Department
>> >>> University of Southern California, Los Angeles, CA 90089 USA
>> >>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> >>>
>> >>>
>> >>>
>> >>>
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
>> For additional commands, e-mail: general-help@incubator.apache.org
>>
>>
>
>

Re: [PROPOSAL] Apache AsterixDB Incubator

Posted by Mike Carey <dt...@gmail.com>.

Wonderful; thanks, Ted!!
Cheers,
Mike

On 1/19/15 11:29 PM, Ted Dunning wrote:
>
> Chris just asked me under separate cover.
>
> I am happy to help out as mentor.
>
>
>
> On Mon, Jan 19, 2015 at 8:17 PM, Henry Saputra
> <henry.saputra@gmail.com> wrote:
>
>     Thanks Till,
>
>     Will try to solicit more mentors to help.
>     Especially since most of the initial committers have not been exposed
>     to contributing the Apache way.
>
>     - Henry
>
>     On Mon, Jan 19, 2015 at 5:28 PM, Till Westmann <till@westmann.org>
>     wrote:
>     > Hi Henry,
>     >
>     > thanks! It’s great that you’ve seen (and liked) AsterixDB before.
>     >
>     > Even if your time is very limited we would be very happy to have
>     you on board as a mentor.
>     > I’ll add you to the proposal.
>     >
>     > Cheers,
>     > Till
>     >
>     >> On Jan 19, 2015, at 10:26 AM, Henry Saputra
>     >> <henry.saputra@gmail.com> wrote:
>     >>
>     >> +1 This is GREAT News!
>     >>
>     >> Was watching and trying AsterixDB last year and it looked to be in
>     >> awesome shape.
>     >>
>     >> I have my plate full but would love to help mentor this project
>     to get
>     >> it going to ASF if needed!
>     >>
>     >> - Henry
>     >>
>     >> On Wed, Jan 14, 2015 at 6:21 PM, Mattmann, Chris A (3980)
>     >> <chris.a.mattmann@jpl.nasa.gov> wrote:
>     >>> Hi Folks,
>     >>>
>     >>> I am pleased to bring forth the Apache AsterixDB proposal to the
>     >>> Apache Incubator as Champion, working in collaboration with the
>     >>> team. Please find the wiki proposal here:
>     >>>
>     >>> https://wiki.apache.org/incubator/AsterixDBProposal
>     >>>
>     >>>
>     >>> Full text of the proposal is below. Please discuss and enjoy. I’ll
>     >>> leave the discussion open for a week, and then look to call a VOTE
>     >>> hopefully end of next week if all is well.
>     >>>
>     >>> Cheers!
>     >>> Chris Mattmann
>     >>>
>     >>> =============================================================
>     >>> Apache AsterixDB Proposal
>     >>>
>     >>> Abstract
>     >>>
>     >>> Apache AsterixDB is a scalable big data management system
>     (BDMS) that
>     >>> provides storage, management, and query capabilities for large
>     >>> collections of semi-structured data.
>     >>>
>     >>> Proposal
>     >>>
>     >>> AsterixDB is a big data management system (BDMS) with a feature set
>     >>> that makes it well-suited to needs such as web data warehousing and
>     >>> social data storage and analysis. Feature-wise, AsterixDB has:
>     >>>
>     >>> * A NoSQL style data model (ADM) based on extending JSON with object
>     >>>   database concepts.
>     >>> * An expressive and declarative query language (AQL) for querying
>     >>>   semi-structured data.
>     >>> * A runtime query execution engine, Hyracks, for partitioned-parallel
>     >>>   execution of query plans.
>     >>> * Partitioned LSM-based data storage and indexing for efficient
>     >>>   ingestion of newly arriving data.
>     >>> * Support for querying and indexing external data (e.g., in HDFS) as
>     >>>   well as data stored within AsterixDB.
>     >>> * A rich set of primitive data types, including support for spatial,
>     >>>   temporal, and textual data.
>     >>> * Indexing options that include B+ trees, R trees, and inverted
>     >>>   keyword index support.
>     >>> * Basic transactional (concurrency and recovery) capabilities akin to
>     >>>   those of a NoSQL store.
>     >>>
>     >>>
>     >>> Background and Rationale
>     >>>
>     >>> In the world of relational databases, the need to tackle data
>     volumes
>     >>> that exceed the capabilities of a single server led to the
>     >>> development of “shared-nothing” parallel database systems several
>     >>> decades ago. These systems spread data over a cluster based on a
>     >>> partitioning strategy, such as hash partitioning, and queries are
>     >>> processed by employing partitioned-parallel divide-and-conquer
>     >>> techniques. Since these systems are fronted by a high-level,
>     >>> declarative language (SQL), their users are shielded from the
>     >>> complexities of parallel programming. Parallel database
>     systems have
>     >>> been an extremely successful application of parallel
>     computing, and
>     >>> quite a number of commercial products exist today.
>     >>>
>     >>> In the distributed systems world, the Web brought a need to
>     index and
>     >>> query its huge content. SQL and relational databases were not the
>     >>> answer, though shared-nothing clusters again emerged as the
>     hardware
>     >>> platform of choice. Google developed the Google File System
>     (GFS) and
>     >>> MapReduce programming model to allow programmers to store and
>     process
>     >>> Big Data by writing a few user-defined functions. The MapReduce
>     >>> framework applies these functions in parallel to data instances in
>     >>> distributed files (map) and to sorted groups of instances
>     sharing a
>     >>> common key (reduce) -- not unlike the partitioned parallelism in
>     >>> parallel database systems. Apache's Hadoop MapReduce platform
>     is the
>     >>> most prominent implementation of this paradigm for the rest of the
>     >>> Big Data community. On top of Hadoop and HDFS sit declarative
>     >>> languages like Pig and Hive that each compile down to Hadoop
>     >>> MapReduce jobs.
>     >>>
>     >>> The big Web companies were also challenged by extreme user bases
>     >>> (100s of millions of users) and needed fast simple lookups and
>     >>> updates to very large keyed data sets like user profiles. SQL
>     >>> databases were deemed either too expensive or not scalable, so the
>     >>> “NoSQL movement” was born. The ASF now has HBase and
>     Cassandra, two
>     >>> popular key-value stores, in this space. MongoDB and Couchbase are
>     >>> other open source alternatives (document stores).
>     >>>
>     >>> It is evident from the rapidly growing popularity of "NoSQL"
>     stores,
>     >>> as well as the strong demand for Big Data analytics engines today,
>     >>> that there is a strong (and growing!) need to store, process,
>     *and*
>     >>> query large volumes of semi-structured data in many application
>     >>> areas. Until very recently, developers have had to "choose" between
>     >>> using big data analytics engines like Apache Hive or Apache Spark,
>     >>> which can do complex query processing and analysis over
>     HDFS-resident
>     >>> files, and flexible but low-function data stores like MongoDB or
>     >>> Apache HBase. (The Apache Phoenix project,
>     >>> http://phoenix.apache.org/, is a recent SQL-over-HBase effort that
>     >>> aims to bridge between these choices.)
>     >>>
>     >>> AsterixDB is a highly scalable data management system that can
>     store,
>     >>> index, and manage semi-structured data, e.g., much like
>     MongoDB, but
>     >>> it also supports a full-power query language with the
>     expressiveness
>     >>> of SQL (and more). Unlike analytics engines like Hive or Spark, it
>     >>> stores and manages data, so AsterixDB can exploit its knowledge of
>     >>> data partitioning and the availability of indexes to avoid always
>     >>> scanning data set(s) to process queries. Somewhat
>     surprisingly, there
>     >>> is no open source parallel database system (relational or
>     otherwise)
>     >>> available to developers today -- AsterixDB aims to fill this need.
>     >>> Since Apache is where the majority of today's most important Big
>     >>> Data technologies live, the ASF seems like the obvious home for a
>     >>> system like AsterixDB.
>     >>>
>     >>> Current Status
>     >>>
>     >>> The current version of AsterixDB was co-developed by a team of
>     >>> faculty, staff, and students at UC Irvine and UC Riverside. The
>     >>> project was initiated as a large NSF-sponsored project in
>     2009, the
>     >>> goal of which was to combine the best ideas from the parallel
>     >>> database world, the then new Hadoop world, and the semi-structured
>     >>> (e.g., XML/JSON) data world in order to create a next-generation
>     >>> BDMS. A first informal open source release was made four years
>     later,
>     >>> in June of 2013, under the Apache Software License 2.0.
>     >>>
>     >>>
>     >>> Meritocracy
>     >>>
>     >>> The current developers are familiar with meritocratic open source
>     >>> development at Apache. Apache was chosen specifically because
>     we want
>     >>> to encourage this style of development for the project.
>     >>>
>     >>>
>     >>> Community
>     >>>
>     >>> While AsterixDB started as a university project, it has developed into
>     >>> a community. A number of the initial committers started contributing
>     >>> in academia and continue to actively participate and contribute after
>     >>> graduation. We seek to further grow the developer and user
>     >>> communities. One ongoing way of broadening the community is
>     >>> through academic collaborations (currently with IIT Mumbai in India
>     >>> and TU Berlin in Germany). During incubation we will also explicitly
>     >>> seek increased industrial participation.
>     >>>
>     >>> Some indicators of the effort's development community and history can
>     >>> be found at:
>     >>>
>     >>> https://www.openhub.net/p/asterixdb/contributors?query=&sort=commits_12_mo
>     >>> https://www.openhub.net/p/hyracks/contributors?query=&sort=commits_12_mo
>     >>>
>     >>>
>     >>> Core Developers
>     >>>
>     >>> The core developers of the project are diverse, although initially UC
>     >>> Irvine-heavy (roughly 50%) due to the project's origins at UCI. The
>     >>> other 50% are from other academic institutions (UC Riverside and the
>     >>> Hebrew University in Jerusalem) and companies (Couchbase, Facebook,
>     >>> IBM, KACST Saudi Arabia, Oracle, Saudi Aramco, X15 Software).
>     >>>
>     >>>
>     >>> Alignment
>     >>>
>     >>> Apache is, by far, the most natural home for taking the AsterixDB
>     >>> project forward. A large fraction of today's top Big Data
>     >>> technologies have their homes in Apache, including Hadoop,
>     YARN, Pig,
>     >>> Hive, Spark, Flink, HBase, Cassandra and others. AsterixDB fills a
>     >>> significant gap -- the parallel data management system gap -- that
>     >>> exists in the Big Data open source world. It is well-aligned
>     with a
>     >>> number of the Apache projects, e.g., it has strong support for
>     >>> accessing and indexing external data in HDFS, and it uses YARN
>     as an
>     >>> answer to basic cluster resource management. AsterixDB also
>     seeks to
>     >>> achieve an Apache-style development model; it is seeking a broader
>     >>> community of contributors and users in order to achieve its full
>     >>> potential and value to the Big Data community.
>     >>>
>     >>> There are also a number of related Apache projects and dependencies,
>     >>> which are described below in the Relationships with Other Apache
>     >>> Products section.
>     >>>
>     >>>
>     >>> Known Risks
>     >>>
>     >>> Orphaned products
>     >>>
>     >>> Given the current level of intellectual investment in
>     AsterixDB, the
>     >>> risk of the project being abandoned is very small. The UCI/UCR
>     >>> faculty team leads are highly incentivized to continue development
>     >>> since the database groups at UC Irvine and UC Riverside are both
>     >>> reliant on AsterixDB as a platform for long-term graduate research
>     >>> projects. UC San Diego is also beginning to contribute to the code
>     >>> base, and a collaboration involving public health applications is
>     >>> forming with UCLA. The work on AsterixDB is managed via a mix of
>     >>> mailing list discussions supplemented by weekly project status
>     >>> meetings which are summarized on the mailing list. Typical (local
>     >>> plus Skype-in) attendance at the weekly status meetings runs at about
>     >>> 20 active contributors.
>     >>>
>     >>>
>     >>> Inexperience with Open Source
>     >>>
>     >>> AsterixDB and Hyracks were developed entirely as open source under
>     >>> the ASL 2.0. The source code repositories, issue tracker, and mailing
>     >>> lists are available on Google Code, and discussions and decisions
>     >>> happen on the mailing lists (which is necessary due to the geographic
>     >>> distribution of the current developers).
>     >>>
>     >>> Also, a few of the initial committers have contributed to Apache
>     >>> projects. Vinayak Borkar is a committer on the Apache Helix and
>     >>> Apache VXQuery projects. Till Westmann is the VP VXQuery at
>     the ASF
>     >>> and an IPMC member. Preston Carman and Steven Jacobs are
>     committers
>     >>> on the Apache VXQuery project.
>     >>>
>     >>>
>     >>> Relationships with Other Apache Products
>     >>>
>     >>> Apache VXQuery is based on the Hyracks data-parallel runtime,
>     which
>     >>> is also included in the AsterixDB code base.
>     >>>
>     >>> AsterixDB is closely related to Apache Hadoop. Included in
>     AsterixDB
>     >>> is support for accessing external data in HDFS (and Hive formats),
>     >>> and resource management and system administration features are
>     in the
>     >>> process of being migrated to YARN.
>     >>>
>     >>> AsterixDB's AQL query facilities offer comparable query power to
>     >>> Apache's Pig and Hive systems for big data analytics. AsterixDB
>     >>> differs in storing and indexing data and thus being able to
>     quickly
>     >>> answer small and medium queries without large HDFS data scans -
>     >>> thereby targeting a different class of use cases.
>     >>>
>     >>> AsterixDB's data storage and indexing facilities are similar
>     to those
>     >>> of HBase, but AsterixDB differs in being a much more complete and
>     >>> queryable BDMS (not just a key-value style store).
>     >>>
>     >>> AsterixDB's target use cases are not in-memory processing or
>     >>> iterative algorithm support, making AsterixDB complementary to the
>     >>> Apache Spark platform. (Spark interoperability is on our
>     longer-term
>     >>> to-do wishlist.)
>     >>>
>     >>>
>     >>> Homogeneous Developers
>     >>>
>     >>> As mentioned before, the current community is already organizationally
>     >>> and geographically distributed, and we would like to increase its
>     >>> heterogeneity further.
>     >>>
>     >>>
>     >>> Reliance on Salaried Developers
>     >>>
>     >>> Of the initial committers, only 3 are full-time UCI staff. The other
>     >>> committers are a mix of students, alumni who continue to
>     contribute
>     >>> to the effort, and individuals working with permission
>     part-time (or
>     >>> in spare time) on this project.
>     >>>
>     >>>
>     >>> An Excessive Fascination with the Apache Brand
>     >>>
>     >>> We believe in the processes, systems, and framework Apache has
>     put in
>     >>> place. Apache is also known to foster a great community around
>     their
>     >>> projects and provide exposure. While brand is important, our
>     >>> fascination with it is not excessive. We believe that the ASF
>     is the
>     >>> right home for AsterixDB and that having AsterixDB inside of
>     the ASF
>     >>> will lead to a better long-term outcome for the Big Data
>     community.
>     >>>
>     >>>
>     >>> Documentation
>     >>>
>     >>> Documentation and publications related to AsterixDB can be
>     found at
>     >>> http://asterixdb.ics.uci.edu/.
>     >>>
>     >>>
>     >>> Initial Source
>     >>>
>     >>> Current source resides in Google Code:
>     >>> https://code.google.com/p/asterixdb/ (query language and upper
>     system
>     >>> layers) and https://code.google.com/p/hyracks/ (dataflow runtime
>     >>> system and storage management libraries).
>     >>>
>     >>>
>     >>> External Dependencies
>     >>>
>     >>> AsterixDB depends on a number of Apache projects:
>     >>>
>     >>> - Ant
>     >>> - Avro
>     >>> - ApacheDB JDO
>     >>> - Commons
>     >>> - Derby
>     >>> - Hadoop
>     >>> - Hive
>     >>> - HTTPComponents
>     >>> - Jakarta ORO
>     >>> - Maven
>     >>> - Tomcat
>     >>> - Thrift
>     >>> - Velocity
>     >>> - Wicket
>     >>> - Xerces
>     >>>
>     >>> and other open source projects (organized by license):
>     >>>
>     >>> -- ASL 2.0
>     >>> - Jackson
>     >>> - Google Guava
>     >>> - Google Guice
>     >>> - JSON-simple
>     >>> - BoneCP
>     >>> - Microsoft Azure SDK
>     >>> - Netty
>     >>> - Rome
>     >>> - JetS3t
>     >>> - Groovy
>     >>> - Jettison
>     >>> - Plexus
>     >>> - Datanucleus (JDO)
>     >>> - Jetty
>     >>> - Twitter4J
>     >>> - Snappy-java
>     >>>
>     >>> -- BSD
>     >>> - Antlr
>     >>> - ObjectWeb ASM
>     >>> - Protobuf
>     >>> - JSCH
>     >>> - JavaCC
>     >>> - Paranamer
>     >>> - JLine
>     >>> - Stax
>     >>> - StringTemplate
>     >>> - xmlEnc
>     >>>
>     >>> -- MIT
>     >>> - AppAssembler
>     >>> - SimpleLog4J
>     >>>
>     >>> -- CDDL 1.0
>     >>> - Java Activation Framework
>     >>> - Java Transactions
>     >>> - Java Servlet API
>     >>> - Grizzly
>     >>> - gmbal
>     >>> - Glassfish
>     >>>
>     >>> -- CDDL 1.1
>     >>> - Jersey
>     >>> - JAXB Reference Implementation
>     >>>
>     >>> -- JSON License
>     >>> - JSON
>     >>>
>     >>> -- EPL 1.0
>     >>> - JUnit
>     >>>
>     >>> -- JDOM License
>     >>> - JDOM
>     >>>
>     >>> -- Public Domain
>     >>> - xz
>     >>> - AOPAlliance
>     >>>
>     >>> As all dependencies are managed using Apache Maven, none of the
>     >>> external libraries need to be packaged in a source distribution.
>     >>>
>     >>>
>     >>> Required Resources
>     >>>
>     >>> Developer and user mailing lists
>     >>>
>     >>> private@asterixdb.incubator.apache.org (with moderated subscriptions)
>     >>> commits@asterixdb.incubator.apache.org
>     >>> dev@asterixdb.incubator.apache.org
>     >>> users@asterixdb.incubator.apache.org
>     >>>
>     >>>
>     >>> A git repository
>     >>>
>     >>> https://git-wip-us.apache.org/repos/asf/incubator-asterixdb.git
>     >>>
>     >>>
>     >>> A JIRA issue tracker
>     >>>
>     >>> https://issues.apache.org/jira/browse/ASTERIXDB
>     >>>
>     >>>
>     >>> Initial Committers
>     >>>
>     >>> The following is a list of the planned initial Apache committers (the
>     >>> active subset of the committers for the current repository at Google
>     >>> Code).
>     >>>
>     >>> Abdullah Alamoudi (bamousaa@gmail.com)
>     >>> Cameron Samak (eufery@gmail.com)
>     >>> Chen Li (chenli@gmail.com)
>     >>> Ian Maxon (imaxon@uci.edu)
>     >>> Ildar Absalyamov (ildar.absalyamov@gmail.com)
>     >>> Jianfeng Jia (jianfeng.jia@gmail.com)
>     >>> Keren Ouaknine (kereno@gmail.com)
>     >>> Markus Dreseler (apache@dreseler.de)
>     >>> Mike Carey (dtabass@apache.org)
>     >>> Murtadha Hubail (hubailmor@gmail.com)
>     >>> Pouria Pirzadeh (pouria.pirzadeh@gmail.com)
>     >>> Preston Carman (prestonc@apache.org)
>     >>> Raman Grover (RamanGrover29@gmail.com)
>     >>> Sattam Alsubaiee (salsubaiee@gmail.com)
>     >>> Steven Jacobs (sjaco002@apache.org)
>     >>> Taewoo Kim (wangsaeu@gmail.com)
>     >>> Till Westmann (tillw@apache.org)
>     >>> Vinayak Borkar (vinayakb@apache.org)
>     >>> Yingyi Bu (buyingyi@gmail.com)
>     >>> Young-Seok Kim (kisskys@gmail.com)
>     >>> Zach Heilbron (zheilbron@gmail.com)
>     >>>
>     >>>
>     >>> Affiliations
>     >>>
>     >>> UC Irvine
>     >>> - Mike Carey
>     >>> - Chen Li
>     >>> - Ian Maxon
>     >>> - Yingyi Bu
>     >>> - Raman Grover
>     >>> - Pouria Pirzadeh
>     >>> - Young-Seok Kim
>     >>> - Cameron Samak
>     >>> - Taewoo Kim
>     >>> - Jianfeng Jia
>     >>> - Murtadha Hubail
>     >>> - Markus Dreseler
>     >>>
>     >>> UC Riverside
>     >>> - Ildar Absalyamov
>     >>> - Preston Carman
>     >>> - Steven Jacobs
>     >>>
>     >>> Hebrew University
>     >>> - Keren Ouaknine
>     >>>
>     >>> Oracle
>     >>> - Till Westmann
>     >>>
>     >>> X15 Software
>     >>> - Vinayak Borkar
>     >>> - Zach Heilbron
>     >>>
>     >>> KACST Saudi Arabia
>     >>> - Sattam Alsubaiee
>     >>>
>     >>> Saudi Aramco
>     >>> - Abdullah Alamoudi
>     >>>
>     >>> Carey, Li, and Maxon are full-time UCI staff, with the remaining UCI
>     >>> (UC Irvine) and UCR (UC Riverside) affiliates being students. The
>     >>> non-UC committers are a mix of alumni who continue to contribute to
>     >>> the effort and individuals working with permission part-time (or in
>     >>> spare time) on this project.
>     >>>
>     >>>
>     >>> Sponsors
>     >>>
>     >>> Champion
>     >>>
>     >>> Chris Mattmann (NASA/JPL)
>     >>>
>     >>> Nominated Mentors
>     >>>
>     >>> TBD
>     >>>
>     >>> Sponsoring Entity
>     >>>
>     >>> The Apache Incubator
>     >>>
>     >>>
>     >>>
>     >>>
>     >>>
>     >>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>     >>> Chris Mattmann, Ph.D.
>     >>> Chief Architect
>     >>> Instrument Software and Science Data Systems Section (398)
>     >>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>     >>> Office: 168-519, Mailstop: 168-527
>     >>> Email: chris.a.mattmann@nasa.gov
>     >>> WWW: http://sunset.usc.edu/~mattmann/
>     >>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>     >>> Adjunct Associate Professor, Computer Science Department
>     >>> University of Southern California, Los Angeles, CA 90089 USA
>     >>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>     >>>
>     >>>
>     >>>
>     >>>
>     >
>
>     ---------------------------------------------------------------------
>     To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
>     For additional commands, e-mail: general-help@incubator.apache.org
>
>

Re: [PROPOSAL] Apache AsterixDB Incubator

Posted by Ted Dunning <te...@gmail.com>.

Chris just asked me under separate cover.

I am happy to help out as mentor.



On Mon, Jan 19, 2015 at 8:17 PM, Henry Saputra <he...@gmail.com>
wrote:

> Thanks Till,
>
> Will try to solicit more mentors to help.
> Especially since most of the initial committers have not been exposed to
> contributing the Apache way.
>
> - Henry
>
> On Mon, Jan 19, 2015 at 5:28 PM, Till Westmann <ti...@westmann.org> wrote:
> > Hi Henry,
> >
> > thanks! It’s great that you’ve seen (and liked) AsterixDB before.
> >
> > Even if your time is very limited we would be very happy to have you on
> > board as a mentor.
> > I’ll add you to the proposal.
> >
> > Cheers,
> > Till
> >
> >> On Jan 19, 2015, at 10:26 AM, Henry Saputra <he...@gmail.com> wrote:
> >>
> >> +1 This is GREAT News!
> >>
> >> Was watching and trying AsterixDB last year and it looked in awesome shape.
> >>
> >> I have my plate full but would love to help mentor this project to get
> >> it going to ASF if needed!
> >>
> >> - Henry
> >>
> >> On Wed, Jan 14, 2015 at 6:21 PM, Mattmann, Chris A (3980)
> >> <ch...@jpl.nasa.gov> wrote:
> >>> Hi Folks,
> >>>
> >>> I am pleased to bring forth the Apache AsterixDB proposal to the
> >>> Apache Incubator as Champion, working in collaboration with the
> >>> team. Please find the wiki proposal here:
> >>>
> >>> https://wiki.apache.org/incubator/AsterixDBProposal
> >>>
> >>>
> >>> Full text of the proposal is below. Please discuss and enjoy. I’ll
> >>> leave the discussion open for a week, and then look to call a VOTE
> >>> hopefully end of next week if all is well.
> >>>
> >>> Cheers!
> >>> Chris Mattmann
> >>>
> >>> =============================================================
> >>> Apache AsterixDB Proposal
> >>>
> >>> Abstract
> >>>
> >>> Apache AsterixDB is a scalable big data management system (BDMS) that
> >>> provides storage, management, and query capabilities for large
> >>> collections of semi-structured data.
> >>>
> >>> Proposal
> >>>
> >>> AsterixDB is a big data management system (BDMS) with a feature set that
> >>> makes it well-suited to needs such as web data warehousing and social
> >>> data storage and analysis. Feature-wise, AsterixDB has:
> >>>
> >>> * A NoSQL style data model (ADM) based on extending JSON with object
> >>>  database concepts.
> >>> * An expressive and declarative query language (AQL) for querying
> >>>  semi-structured data.
> >>> * A runtime query execution engine, Hyracks, for partitioned-parallel
> >>>  execution of query plans.
> >>> * Partitioned LSM-based data storage and indexing for efficient
> >>>  ingestion of newly arriving data.
> >>> * Support for querying and indexing external data (e.g., in HDFS) as
> >>>  well as data stored within AsterixDB.
> >>> * A rich set of primitive data types, including support for spatial,
> >>>  temporal, and textual data.
> >>> * Indexing options that include B+ trees, R trees, and inverted
> >>>  keyword index support.
> >>> * Basic transactional (concurrency and recovery) capabilities akin to
> >>>  those of a NoSQL store.
> >>>
> >>>
> >>> Background and Rationale
> >>>
> >>> In the world of relational databases, the need to tackle data volumes
> >>> that exceed the capabilities of a single server led to the
> >>> development of “shared-nothing” parallel database systems several
> >>> decades ago. These systems spread data over a cluster based on a
> >>> partitioning strategy, such as hash partitioning, and queries are
> >>> processed by employing partitioned-parallel divide-and-conquer
> >>> techniques. Since these systems are fronted by a high-level,
> >>> declarative language (SQL), their users are shielded from the
> >>> complexities of parallel programming. Parallel database systems have
> >>> been an extremely successful application of parallel computing, and
> >>> quite a number of commercial products exist today.
> >>>
> >>> In the distributed systems world, the Web brought a need to index and
> >>> query its huge content. SQL and relational databases were not the
> >>> answer, though shared-nothing clusters again emerged as the hardware
> >>> platform of choice. Google developed the Google File System (GFS) and
> >>> MapReduce programming model to allow programmers to store and process
> >>> Big Data by writing a few user-defined functions. The MapReduce
> >>> framework applies these functions in parallel to data instances in
> >>> distributed files (map) and to sorted groups of instances sharing a
> >>> common key (reduce) -- not unlike the partitioned parallelism in
> >>> parallel database systems. Apache's Hadoop MapReduce platform is the
> >>> most prominent implementation of this paradigm for the rest of the
> >>> Big Data community. On top of Hadoop and HDFS sit declarative
> >>> languages like Pig and Hive that each compile down to Hadoop
> >>> MapReduce jobs.
> >>>
> >>> The big Web companies were also challenged by extreme user bases
> >>> (100s of millions of users) and needed fast simple lookups and
> >>> updates to very large keyed data sets like user profiles. SQL
> >>> databases were deemed either too expensive or not scalable, so the
> >>> “NoSQL movement” was born. The ASF now has HBase and Cassandra, two
> >>> popular key-value stores, in this space. MongoDB and Couchbase are
> >>> other open source alternatives (document stores).
> >>>
> >>> It is evident from the rapidly growing popularity of "NoSQL" stores,
> >>> as well as the strong demand for Big Data analytics engines today,
> >>> that there is a strong (and growing!) need to store, process, *and*
> >>> query large volumes of semi-structured data in many application
> >>> areas. Until very recently, developers have had to "choose" between
> >>> using big data analytics engines like Apache Hive or Apache Spark,
> >>> which can do complex query processing and analysis over HDFS-resident
> >>> files, and flexible but low-function data stores like MongoDB or
> >>> Apache HBase. (The Apache Phoenix project,
> >>> http://phoenix.apache.org/, is a recent SQL-over-HBase effort that
> >>> aims to bridge between these choices.)
> >>>
> >>> AsterixDB is a highly scalable data management system that can store,
> >>> index, and manage semi-structured data, e.g., much like MongoDB, but
> >>> it also supports a full-power query language with the expressiveness
> >>> of SQL (and more). Unlike analytics engines like Hive or Spark, it
> >>> stores and manages data, so AsterixDB can exploit its knowledge of
> >>> data partitioning and the availability of indexes to avoid always
> >>> scanning data set(s) to process queries. Somewhat surprisingly, there
> >>> is no open source parallel database system (relational or otherwise)
> >>> available to developers today -- AsterixDB aims to fill this need.
> >>> Since Apache is where the majority of today's most important Big
> >>> Data technologies live, the ASF seems like the obvious home for a
> >>> system like AsterixDB.
> >>>
> >>> Current Status
> >>>
> >>> The current version of AsterixDB was co-developed by a team of
> >>> faculty, staff, and students at UC Irvine and UC Riverside. The
> >>> project was initiated as a large NSF-sponsored project in 2009, the
> >>> goal of which was to combine the best ideas from the parallel
> >>> database world, the then new Hadoop world, and the semi-structured
> >>> (e.g., XML/JSON) data world in order to create a next-generation
> >>> BDMS. A first informal open source release was made four years later,
> >>> in June of 2013, under the Apache Software License 2.0.
> >>>
> >>>
> >>> Meritocracy
> >>>
> >>> The current developers are familiar with meritocratic open source
> >>> development at Apache. Apache was chosen specifically because we want
> >>> to encourage this style of development for the project.
> >>>
> >>>
> >>> Community
> >>>
> >>> While AsterixDB started as a university project it has developed into
> >>> a community. A number of the initial committers started contributing
> >>> in academia and continue to actively participate and contribute after
> >>> graduation. And we seek to further develop developer and user
> >>> communities. One way to broaden the community that is ongoing is
> >>> through academic collaborations (currently with IIT Mumbai in India
> >>> and TU Berlin in Germany). During incubation we will also explicitly
> >>> seek increased industrial participation.
> >>>
> >>> Some indicators of the effort's development community and history can
> >>> be found at:
> >>>
> >>> https://www.openhub.net/p/asterixdb/contributors?query=&sort=commits_12_mo
> >>> https://www.openhub.net/p/hyracks/contributors?query=&sort=commits_12_mo
> >>>
> >>> Core Developers
> >>>
> >>> The core developers of the project are diverse, although initially UC
> >>> Irvine heavy (roughly 50%) due to the project's origins at UCI. The
> >>> other 50% are from other academic institutions (UC Riverside and the
> >>> Hebrew University in Jerusalem) and companies (Couchbase, Facebook,
> >>> IBM, KACST Saudi Arabia, Oracle, Saudi Aramco, X15 Software).
> >>>
> >>>
> >>> Alignment
> >>>
> >>> Apache is, by far, the most natural home for taking the AsterixDB
> >>> project forward. A large fraction of today's top Big Data
> >>> technologies have their homes in Apache, including Hadoop, YARN, Pig,
> >>> Hive, Spark, Flink, HBase, Cassandra and others. AsterixDB fills a
> >>> significant gap -- the parallel data management system gap -- that
> >>> exists in the Big Data open source world. It is well-aligned with a
> >>> number of the Apache projects, e.g., it has strong support for
> >>> accessing and indexing external data in HDFS, and it uses YARN as an
> >>> answer to basic cluster resource management. AsterixDB also seeks to
> >>> achieve an Apache-style development model; it is seeking a broader
> >>> community of contributors and users in order to achieve its full
> >>> potential and value to the Big Data community.
> >>>
> >>> There are also a number of related Apache projects and dependencies
> >>> that will be mentioned below in the Relationships with Other Apache
> >>> products section.
> >>>
> >>>
> >>> Known Risks
> >>>
> >>> Orphaned products
> >>>
> >>> Given the current level of intellectual investment in AsterixDB, the
> >>> risk of the project being abandoned is very small. The UCI/UCR
> >>> faculty team leads are highly incentivized to continue development
> >>> since the database groups at UC Irvine and UC Riverside are both
> >>> reliant on AsterixDB as a platform for long-term graduate research
> >>> projects. UC San Diego is also beginning to contribute to the code
> >>> base, and a collaboration involving public health applications is
> >>> forming with UCLA. The work on AsterixDB is managed via a mix of
> >>> mailing list discussions supplemented by weekly project status
> >>> meetings which are summarized on the mailing list. Typical (local
> >>> plus Skype-in) attendance at the weekly status meetings runs at about
> >>> 20 active contributors.
> >>>
> >>>
> >>> Inexperience with Open Source
> >>>
> >>> AsterixDB and Hyracks were completely developed in Open Source under
> >>> the ASL 2.0. The source code repositories, issue tracker, and mailing
> >>> lists are available on Google Code and discussions and decisions
> >>> happen on the mailing lists (which is necessary due to the geographic
> >>> distribution of the current developers).
> >>>
> >>> Also a few of the initial committers have contributed to Apache
> >>> projects. Vinayak Borkar is a committer on the Apache Helix and
> >>> Apache VXQuery projects. Till Westmann is the VP VXQuery at the ASF
> >>> and an IPMC member. Preston Carman and Steven Jacobs are committers
> >>> on the Apache VXQuery project.
> >>>
> >>>
> >>> Relationships with Other Apache Products
> >>>
> >>> Apache VXQuery is based on the Hyracks data-parallel runtime, which
> >>> is also included in the AsterixDB code base.
> >>>
> >>> AsterixDB is closely related to Apache Hadoop. Included in AsterixDB
> >>> is support for accessing external data in HDFS (and Hive formats),
> >>> and resource management and system administration features are in the
> >>> process of being migrated to YARN.
> >>>
> >>> AsterixDB's AQL query facilities offer comparable query power to
> >>> Apache's Pig and Hive systems for big data analytics. AsterixDB
> >>> differs in storing and indexing data and thus being able to quickly
> >>> answer small and medium queries without large HDFS data scans -
> >>> thereby targeting a different class of use cases.
> >>>
> >>> AsterixDB's data storage and indexing facilities are similar to those
> >>> of HBase, but AsterixDB differs in being a much more complete and
> >>> queryable BDMS (not just a key-value style store).
> >>>
> >>> AsterixDB's target use cases are not in-memory processing or
> >>> iterative algorithm support, making AsterixDB complementary to the
> >>> Apache Spark platform. (Spark interoperability is on our longer-term
> >>> to-do wishlist.)
> >>>
> >>>
> >>> Homogeneous Developers
> >>>
> >>> As mentioned before the current community is already organizationally
> >>> and geographically distributed - and we would like to increase the
> >>> heterogeneity.
> >>>
> >>>
> >>> Reliance on Salaried Developers
> >>>
> >>> Of the initial committers only 3 are full-time UCI staff. The other
> >>> committers are a mix of students, alumni who continue to contribute
> >>> to the effort, and individuals working with permission part-time (or
> >>> in spare time) on this project.
> >>>
> >>>
> >>> An Excessive Fascination with the Apache Brand
> >>>
> >>> We believe in the processes, systems, and framework Apache has put in
> >>> place. Apache is also known to foster a great community around their
> >>> projects and provide exposure. While brand is important, our
> >>> fascination with it is not excessive. We believe that the ASF is the
> >>> right home for AsterixDB and that having AsterixDB inside of the ASF
> >>> will lead to a better long-term outcome for the Big Data community.
> >>>
> >>>
> >>> Documentation
> >>>
> >>> Documentation and publications related to AsterixDB can be found at
> >>> http://asterixdb.ics.uci.edu/.
> >>>
> >>>
> >>> Initial Source
> >>>
> >>> Current source resides in Google Code:
> >>> https://code.google.com/p/asterixdb/ (query language and upper system
> >>> layers) and https://code.google.com/p/hyracks/ (dataflow runtime
> >>> system and storage management libraries).
> >>>
> >>>
> >>> External Dependencies
> >>>
> >>> AsterixDB depends on a number of Apache projects:
> >>>
> >>> - Ant
> >>> - Avro
> >>> - ApacheDB JDO
> >>> - Commons
> >>> - Derby
> >>> - Hadoop
> >>> - Hive
> >>> - HTTPComponents
> >>> - Jakarta ORO
> >>> - Maven
> >>> - Tomcat
> >>> - Thrift
> >>> - Velocity
> >>> - Wicket
> >>> - Xerces
> >>>
> >>> and other open source projects (organized by license):
> >>>
> >>> -- ASL 2.0:
> >>> - Jackson
> >>> - Google Guava
> >>> - Google Guice
> >>> - JSON-simple
> >>> - BoneCP
> >>> - Microsoft Azure SDK
> >>> - Netty
> >>> - Rome
> >>> - JetS3t
> >>> - Groovy
> >>> - Jettison
> >>> - Plexus
> >>> - Datanucleus (JDO)
> >>> - Jetty
> >>> - Twitter4J
> >>> - Snappy-java
> >>>
> >>> -- BSD:
> >>> - Antlr
> >>> - ObjectWeb ASM
> >>> - Protobuf
> >>> - JSCH
> >>> - JavaCC
> >>> - Paranamer
> >>> - JLine
> >>> - Stax
> >>> - StringTemplate
> >>> - xmlEnc
> >>>
> >>> -- MIT
> >>> - AppAssembler
> >>> - SimpleLog4J
> >>>
> >>> -- CDDL 1.0
> >>> - Java Activation Framework
> >>> - Java Transactions
> >>> - Java Servlet API
> >>> - Grizzly
> >>> - gmbal
> >>> - Glassfish
> >>>
> >>> -- CDDL 1.1
> >>> - Jersey
> >>> - JAXB Reference Implementation
> >>>
> >>> -- JSON License
> >>> - JSON
> >>>
> >>> -- EPL 1.0
> >>> - JUnit
> >>>
> >>> -- JDOM License
> >>> - JDOM
> >>>
> >>> -- Public Domain
> >>> - xz
> >>> - AOPAlliance
> >>>
> >>> As all dependencies are managed using Apache Maven, none of the
> >>> external libraries need to be packaged in a source distribution.
> >>>
> >>>
> >>> Required Resources
> >>>
> >>> Developer and user mailing lists
> >>>
> >>> private@asterixdb.incubator.apache.org (with moderated subscriptions)
> >>> commits@asterixdb.incubator.apache.org
> >>> dev@asterixdb.incubator.apache.org
> >>> users@asterixdb.incubator.apache.org
> >>>
> >>>
> >>> A git repository
> >>>
> >>> https://git-wip-us.apache.org/repos/asf/incubator-asterixdb.git
> >>>
> >>>
> >>> A JIRA issue tracker
> >>>
> >>> https://issues.apache.org/jira/browse/ASTERIXDB
> >>>
> >>>
> >>> Initial Committers
> >>>
> >>> The following is a list of the planned initial Apache committers (the
> >>> active subset of the committers for the current repository at Google
> >>> Code).
> >>>
> >>> Abdullah Alamoudi (bamousaa@gmail.com)
> >>> Cameron Samak (eufery@gmail.com)
> >>> Chen Li (chenli@gmail.com)
> >>> Ian Maxon (imaxon@uci.edu)
> >>> Ildar Absalyamov (ildar.absalyamov@gmail.com)
> >>> Jianfeng Jia (jianfeng.jia@gmail.com)
> >>> Keren Ouaknine (kereno@gmail.com)
> >>> Markus Dreseler (apache@dreseler.de)
> >>> Mike Carey (dtabass@apache.org)
> >>> Murtadha Hubail (hubailmor@gmail.com)
> >>> Pouria Pirzadeh (pouria.pirzadeh@gmail.com)
> >>> Preston Carman (prestonc@apache.org)
> >>> Raman Grover (RamanGrover29@gmail.com)
> >>> Sattam Alsubaiee (salsubaiee@gmail.com)
> >>> Steven Jacobs (sjaco002@apache.org)
> >>> Taewoo Kim (wangsaeu@gmail.com)
> >>> Till Westmann (tillw@apache.org)
> >>> Vinayak Borkar (vinayakb@apache.org)
> >>> Yingyi Bu (buyingyi@gmail.com)
> >>> Young-Seok Kim (kisskys@gmail.com)
> >>> Zach Heilbron (zheilbron@gmail.com)
> >>>
> >>>
> >>> Affiliations
> >>>
> >>> UC Irvine
> >>> - Mike Carey
> >>> - Chen Li
> >>> - Ian Maxon
> >>> - Yingyi Bu
> >>> - Raman Grover
> >>> - Pouria Pirzadeh
> >>> - Young-Seok Kim
> >>> - Cameron Samak
> >>> - Taewoo Kim
> >>> - Jianfeng Jia
> >>> - Murtadha Hubail
> >>> - Markus Dreseler
> >>>
> >>> UC Riverside
> >>> - Ildar Absalyamov
> >>> - Preston Carman
> >>> - Steven Jacobs
> >>>
> >>> Hebrew University
> >>> - Keren Ouaknine
> >>>
> >>> Oracle
> >>> - Till Westmann
> >>>
> >>> X15 Software
> >>> - Vinayak Borkar
> >>> - Zach Heilbron
> >>>
> >>> KACST Saudi Arabia
> >>> - Sattam Alsubaiee
> >>>
> >>> Saudi Aramco
> >>> - Abdullah Alamoudi
> >>>
> >>> Carey, Li, and Maxon are full-time UCI staff, with the remaining UCI
> >>> (UC Irvine) and UCR (UC Riverside) affiliates being students. The
> >>> non-UC committers are a mix of alumni who continue to contribute to
> >>> the effort and individuals working with permission part-time (or in
> >>> spare time) on this project.
> >>>
> >>>
> >>> Sponsors
> >>>
> >>> Champion
> >>>
> >>> Chris Mattmann (NASA/JPL)
> >>>
> >>> Nominated Mentors
> >>>
> >>> TBD
> >>>
> >>> Sponsoring Entity
> >>>
> >>> The Apache Incubator
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>> Chris Mattmann, Ph.D.
> >>> Chief Architect
> >>> Instrument Software and Science Data Systems Section (398)
> >>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> >>> Office: 168-519, Mailstop: 168-527
> >>> Email: chris.a.mattmann@nasa.gov
> >>> WWW:  http://sunset.usc.edu/~mattmann/
> >>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>> Adjunct Associate Professor, Computer Science Department
> >>> University of Southern California, Los Angeles, CA 90089 USA
> >>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>>
> >>>
> >>>
> >>>
> >
>
>
>

Re: [PROPOSAL] Apache AsterixDB Incubator

Posted by "Alan D. Cabrera" <li...@toolazydogs.com>.

Should be fine.


Regards,
Alan

> On Jan 19, 2015, at 8:27 PM, Till Westmann <ti...@westmann.org> wrote:
> 
> Thank you.
> So far we’ve added 3 slots for mentors on the proposal - I hope that’ll be sufficient even for the relatively large number of new committers.
> 
> Till
> 
>> On Jan 19, 2015, at 8:17 PM, Henry Saputra <he...@gmail.com> wrote:
>> 
>> Thanks Till,
>> 
>> Will try to solicit more mentors to help.
>> Especially since most of the initial committers have not been exposed to
>> contributing the Apache way.
>> 
>> - Henry
>> 
>> On Mon, Jan 19, 2015 at 5:28 PM, Till Westmann <ti...@westmann.org> wrote:
>>> Hi Henry,
>>> 
>>> thanks! It’s great that you’ve seen (and liked) AsterixDB before.
>>> 
>>> Even if your time is very limited we would be very happy to have you on board as a mentor.
>>> I’ll add you to the proposal.
>>> 
>>> Cheers,
>>> Till
>>> 
>>>> On Jan 19, 2015, at 10:26 AM, Henry Saputra <he...@gmail.com> wrote:
>>>> 
>>>> +1 This is GREAT News!
>>>> 
>>>> Was watching and trying AsterixDB last year and it looked in awesome shape.
>>>> 
>>>> I have my plate full but would love to help mentor this project to get
>>>> it going to ASF if needed!
>>>> 
>>>> - Henry
>>>> 
>>>> On Wed, Jan 14, 2015 at 6:21 PM, Mattmann, Chris A (3980)
>>>> <ch...@jpl.nasa.gov> wrote:
>>>>> Hi Folks,
>>>>> 
>>>>> I am pleased to bring forth the Apache AsterixDB proposal to the
>>>>> Apache Incubator as Champion, working in collaboration with the
>>>>> team. Please find the wiki proposal here:
>>>>> 
>>>>> https://wiki.apache.org/incubator/AsterixDBProposal
>>>>> 
>>>>> 
>>>>> Full text of the proposal is below. Please discuss and enjoy. I’ll
>>>>> leave the discussion open for a week, and then look to call a VOTE
>>>>> hopefully end of next week if all is well.
>>>>> 
>>>>> Cheers!
>>>>> Chris Mattmann
>>>>> 
>>>>> =============================================================
>>>>> Apache AsterixDB Proposal
>>>>> 
>>>>> Abstract
>>>>> 
>>>>> Apache AsterixDB is a scalable big data management system (BDMS) that
>>>>> provides storage, management, and query capabilities for large
>>>>> collections of semi-structured data.
>>>>> 
>>>>> Proposal
>>>>> 
>>>>> AsterixDB is a big data management system (BDMS) with a feature set that
>>>>> makes it well-suited to needs such as web data warehousing and social
>>>>> data storage and analysis. Feature-wise, AsterixDB has:
>>>>> 
>>>>> * A NoSQL style data model (ADM) based on extending JSON with object
>>>>> database concepts.
>>>>> * An expressive and declarative query language (AQL) for querying
>>>>> semi-structured data.
>>>>> * A runtime query execution engine, Hyracks, for partitioned-parallel
>>>>> execution of query plans.
>>>>> * Partitioned LSM-based data storage and indexing for efficient
>>>>> ingestion of newly arriving data.
>>>>> * Support for querying and indexing external data (e.g., in HDFS) as
>>>>> well as data stored within AsterixDB.
>>>>> * A rich set of primitive data types, including support for spatial,
>>>>> temporal, and textual data.
>>>>> * Indexing options that include B+ trees, R trees, and inverted
>>>>> keyword index support.
>>>>> * Basic transactional (concurrency and recovery) capabilities akin to
>>>>> those of a NoSQL store.
>>>>> 
>>>>> 
>>>>> Background and Rationale
>>>>> 
>>>>> In the world of relational databases, the need to tackle data volumes
>>>>> that exceed the capabilities of a single server led to the
>>>>> development of “shared-nothing” parallel database systems several
>>>>> decades ago. These systems spread data over a cluster based on a
>>>>> partitioning strategy, such as hash partitioning, and queries are
>>>>> processed by employing partitioned-parallel divide-and-conquer
>>>>> techniques. Since these systems are fronted by a high-level,
>>>>> declarative language (SQL), their users are shielded from the
>>>>> complexities of parallel programming. Parallel database systems have
>>>>> been an extremely successful application of parallel computing, and
>>>>> quite a number of commercial products exist today.
>>>>> 
>>>>> In the distributed systems world, the Web brought a need to index and
>>>>> query its huge content. SQL and relational databases were not the
>>>>> answer, though shared-nothing clusters again emerged as the hardware
>>>>> platform of choice. Google developed the Google File System (GFS) and
>>>>> MapReduce programming model to allow programmers to store and process
>>>>> Big Data by writing a few user-defined functions. The MapReduce
>>>>> framework applies these functions in parallel to data instances in
>>>>> distributed files (map) and to sorted groups of instances sharing a
>>>>> common key (reduce) -- not unlike the partitioned parallelism in
>>>>> parallel database systems. Apache's Hadoop MapReduce platform is the
>>>>> most prominent implementation of this paradigm for the rest of the
>>>>> Big Data community. On top of Hadoop and HDFS sit declarative
>>>>> languages like Pig and Hive that each compile down to Hadoop
>>>>> MapReduce jobs.
>>>>> 
>>>>> The big Web companies were also challenged by extreme user bases
>>>>> (100s of millions of users) and needed fast simple lookups and
>>>>> updates to very large keyed data sets like user profiles. SQL
>>>>> databases were deemed either too expensive or not scalable, so the
>>>>> “NoSQL movement” was born. The ASF now has HBase and Cassandra, two
>>>>> popular key-value stores, in this space. MongoDB and Couchbase are
>>>>> other open source alternatives (document stores).
>>>>> 
>>>>> It is evident from the rapidly growing popularity of "NoSQL" stores,
>>>>> as well as the strong demand for Big Data analytics engines today,
>>>>> that there is a strong (and growing!) need to store, process, *and*
>>>>> query large volumes of semi-structured data in many application
>>>>> areas. Until very recently, developers have had to "choose" between
>>>>> using big data analytics engines like Apache Hive or Apache Spark,
>>>>> which can do complex query processing and analysis over HDFS-resident
>>>>> files, and flexible but low-function data stores like MongoDB or
>>>>> Apache HBase. (The Apache Phoenix project,
>>>>> http://phoenix.apache.org/, is a recent SQL-over-HBase effort that
>>>>> aims to bridge between these choices.)
>>>>> 
>>>>> AsterixDB is a highly scalable data management system that can store,
>>>>> index, and manage semi-structured data, e.g., much like MongoDB, but
>>>>> it also supports a full-power query language with the expressiveness
>>>>> of SQL (and more). Unlike analytics engines like Hive or Spark, it
>>>>> stores and manages data, so AsterixDB can exploit its knowledge of
>>>>> data partitioning and the availability of indexes to avoid always
>>>>> scanning data set(s) to process queries. Somewhat surprisingly, there
>>>>> is no open source parallel database system (relational or otherwise)
>>>>> available to developers today -- AsterixDB aims to fill this need.
>>>>> Since Apache is where the majority of today's most important Big
>>>>> Data technologies live, the ASF seems like the obvious home for a
>>>>> system like AsterixDB.
>>>>> 
>>>>> Current Status
>>>>> 
>>>>> The current version of AsterixDB was co-developed by a team of
>>>>> faculty, staff, and students at UC Irvine and UC Riverside. The
>>>>> project was initiated as a large NSF-sponsored project in 2009, the
>>>>> goal of which was to combine the best ideas from the parallel
>>>>> database world, the then new Hadoop world, and the semi-structured
>>>>> (e.g., XML/JSON) data world in order to create a next-generation
>>>>> BDMS. A first informal open source release was made four years later,
>>>>> in June of 2013, under the Apache Software License 2.0.
>>>>> 
>>>>> 
>>>>> Meritocracy
>>>>> 
>>>>> The current developers are familiar with meritocratic open source
>>>>> development at Apache. Apache was chosen specifically because we want
>>>>> to encourage this style of development for the project.
>>>>> 
>>>>> 
>>>>> Community
>>>>> 
>>>>> While AsterixDB started as a university project it has developed into
>>>>> a community. A number of the initial committers started contributing
>>>>> in academia and continue to actively participate and contribute after
>>>>> graduation. And we seek to further develop developer and user
>>>>> communities. One way to broaden the community that is ongoing is
>>>>> through academic collaborations (currently with IIT Mumbai in India
>>>>> and TU Berlin in Germany). During incubation we will also explicitly
>>>>> seek increased industrial participation.
>>>>> 
>>>>> Some indicators of the effort's development community and history can
>>>>> be found at:
>>>>> https://www.openhub.net/p/asterixdb/contributors?query=&sort=commits_12_mo
>>>>> https://www.openhub.net/p/hyracks/contributors?query=&sort=commits_12_mo
>>>>> 
>>>>> 
>>>>> Core Developers
>>>>> 
>>>>> The core developers of the project are diverse, although initially UC
>>>>> Irvine heavy (roughly 50%) due to the project's origins at UCI. The
>>>>> other 50% are from other academic institutions (UC Riverside and the
>>>>> Hebrew University in Jerusalem) and companies (Couchbase, Facebook,
>>>>> IBM, KACST Saudi Arabia, Oracle, Saudi Aramco, X15 Software).
>>>>> 
>>>>> 
>>>>> Alignment
>>>>> 
>>>>> Apache is, by far, the most natural home for taking the AsterixDB
>>>>> project forward. A large fraction of today's top Big Data
>>>>> technologies have their homes in Apache, including Hadoop, YARN, Pig,
>>>>> Hive, Spark, Flink, HBase, Cassandra and others. AsterixDB fills a
>>>>> significant gap -- the parallel data management system gap -- that
>>>>> exists in the Big Data open source world. It is well-aligned with a
>>>>> number of the Apache projects, e.g., it has strong support for
>>>>> accessing and indexing external data in HDFS, and it uses YARN as an
>>>>> answer to basic cluster resource management. AsterixDB also seeks to
>>>>> achieve an Apache-style development model; it is seeking a broader
>>>>> community of contributors and users in order to achieve its full
>>>>> potential and value to the Big Data community.
>>>>> 
>>>>> There are also a number of related Apache projects and dependencies
>>>>> that will be mentioned below in the Relationships with Other Apache
>>>>> products section.
>>>>> 
>>>>> 
>>>>> Known Risks
>>>>> 
>>>>> Orphaned products
>>>>> 
>>>>> Given the current level of intellectual investment in AsterixDB, the
>>>>> risk of the project being abandoned is very small. The UCI/UCR
>>>>> faculty team leads are highly incentivized to continue development
>>>>> since the database groups at UC Irvine and UC Riverside are both
>>>>> reliant on AsterixDB as a platform for long-term graduate research
>>>>> projects. UC San Diego is also beginning to contribute to the code
>>>>> base, and a collaboration involving public health applications is
>>>>> forming with UCLA. The work on AsterixDB is managed via a mix of
>>>>> mailing list discussions supplemented by weekly project status
>>>>> meetings which are summarized on the mailing list. Typical (local
>>>>> plus Skype-in) attendance to the weekly status meetings runs at about
>>>>> 20 active contributors.
>>>>> 
>>>>> 
>>>>> Inexperience with Open Source
>>>>> 
>>>>> AsterixDB and Hyracks were completely developed in Open Source under
>>>>> the ASL 2.0. The source code repositories, issue tracker, and mailing
>>>>> lists are available on Google Code and discussions and decisions
>>>>> happen on the mailing lists (which is necessary due to the geographic
>>>>> distribution of the current developers).
>>>>> 
>>>>> Also a few of the initial committers have contributed to Apache
>>>>> projects. Vinayak Borkar is a committer on the Apache Helix and
>>>>> Apache VXQuery projects. Till Westmann is the VP VXQuery at the ASF
>>>>> and an IPMC member. Preston Carman and Steven Jacobs are committers
>>>>> on the Apache VXQuery project.
>>>>> 
>>>>> 
>>>>> Relationships with Other Apache Products
>>>>> 
>>>>> Apache VXQuery is based on the Hyracks data-parallel runtime, which
>>>>> is also included in the AsterixDB code base.
>>>>> 
>>>>> AsterixDB is closely related to Apache Hadoop. Included in AsterixDB
>>>>> is support for accessing external data in HDFS (and Hive formats),
>>>>> and resource management and system administration features are in the
>>>>> process of being migrated to YARN.
>>>>> 
>>>>> AsterixDB's AQL query facilities offer comparable query power to
>>>>> Apache's Pig and Hive systems for big data analytics. AsterixDB
>>>>> differs in storing and indexing data and thus being able to quickly
>>>>> answer small and medium queries without large HDFS data scans -
>>>>> thereby targeting a different class of use cases.
>>>>> 
>>>>> AsterixDB's data storage and indexing facilities are similar to those
>>>>> of HBase, but AsterixDB differs in being a much more complete and
>>>>> queryable BDMS (not just a key-value style store).
>>>>> 
>>>>> AsterixDB's target use cases are not in-memory processing or
>>>>> iterative algorithm support, making AsterixDB complementary to the
>>>>> Apache Spark platform. (Spark interoperability is on our longer-term
>>>>> to-do wishlist.)
>>>>> 
>>>>> 
>>>>> Homogeneous Developers
>>>>> 
>>>>> As mentioned before the current community is already organizationally
>>>>> and geographically distributed - and we would like to increase the
>>>>> heterogeneity.
>>>>> 
>>>>> 
>>>>> Reliance on Salaried Developers
>>>>> 
>>>>> Of the initial committers only 3 are full-time UCI staff. The other
>>>>> committers are a mix of students, alumni who continue to contribute
>>>>> to the effort, and individuals working with permission part-time (or
>>>>> in spare time) on this project.
>>>>> 
>>>>> 
>>>>> A Excessive Fascination with the Apache Brand
>>>>> 
>>>>> We believe in the processes, systems, and framework Apache has put in
>>>>> place. Apache is also known to foster a great community around their
>>>>> projects and provide exposure. While brand is important, our
>>>>> fascination with it is not excessive. We believe that the ASF is the
>>>>> right home for AsterixDB and that having AsterixDB inside of the ASF
>>>>> will lead to a better long-term outcome for the Big Data community.
>>>>> 
>>>>> 
>>>>> Documentation
>>>>> 
>>>>> Documentation and publications related to AsterixDB can be found at
>>>>> http://asterixdb.ics.uci.edu/.
>>>>> 
>>>>> 
>>>>> Initial Source
>>>>> 
>>>>> Current source resides in Google code:
>>>>> https://code.google.com/p/asterixdb/ (query language and upper system
>>>>> layers) and https://code.google.com/p/hyracks/ (dataflow runtime
>>>>> system and storage management libraries).
>>>>> 
>>>>> 
>>>>> External Dependencies
>>>>> 
>>>>> AsterixDB depends on a number of Apache projects:
>>>>> 
>>>>> - Ant
>>>>> - Avro
>>>>> - ApacheDB JDO
>>>>> - Commons
>>>>> - Derby
>>>>> - Hadoop
>>>>> - Hive
>>>>> - HTTPComponents
>>>>> - Jakarta ORO
>>>>> - Maven
>>>>> - Tomcat
>>>>> - Thrift
>>>>> - Velocity
>>>>> - Wicket
>>>>> - Xerces
>>>>> 
>>>>> and other open source projects (organized by license):
>>>>> 
>>>>> -- ASL 2.0:
>>>>> - Jackson
>>>>> - Google Guava
>>>>> - Google Guice
>>>>> - JSON-simple
>>>>> - BoneCP
>>>>> - Microsoft Azure SDK
>>>>> - Netty
>>>>> - Rome
>>>>> - JetS3t
>>>>> - Groovy
>>>>> - Jettison
>>>>> - Plexus
>>>>> - Datanucleus (JDO)
>>>>> - Jetty
>>>>> - Twitter4J
>>>>> - Snappy-java
>>>>> 
>>>>> -- BSD:
>>>>> - Antlr
>>>>> - ObjectWeb ASM
>>>>> - Protobuf
>>>>> - JSCH
>>>>> - JavaCC
>>>>> - Paranamer
>>>>> - JLine
>>>>> - Stax
>>>>> - StringTemplate
>>>>> - xmlEnc
>>>>> 
>>>>> -- MIT
>>>>> - AppAssembler
>>>>> - SimpleLog4J
>>>>> 
>>>>> -- CDDL 1.0
>>>>> - Java Activation Framework
>>>>> - Java Transactions
>>>>> - Java Servlet API
>>>>> - Grizzly
>>>>> - gmbal
>>>>> - Glassfish
>>>>> 
>>>>> -- CDDL 1.1
>>>>> - Jersey
>>>>> - JAXB Reference Implementation
>>>>> 
>>>>> -- JSON License
>>>>> - JSON
>>>>> 
>>>>> -- EPL 1.0
>>>>> - JUnit
>>>>> 
>>>>> -- JDOM License
>>>>> - JDOM
>>>>> 
>>>>> -- Public Domain
>>>>> - xz
>>>>> - AOPAlliance
>>>>> 
>>>>> As all dependencies are managed using Apache Maven, none of the
>>>>> external libraries need to be packaged in a source distribution.
>>>>> 
>>>>> 
>>>>> Required Resources
>>>>> 
>>>>> Developer and user mailing lists
>>>>> 
>>>>> private@asterixdb.incubator.apache.org (with moderated subscriptions)
>>>>> commits@asterixdb.incubator.apache.org
>>>>> dev@asterixdb.incubator.apache.org
>>>>> users@asterixdb.incubator.apache.org
>>>>> 
>>>>> 
>>>>> A git repository
>>>>> 
>>>>> https://git-wip-us.apache.org/repos/asf/incubator-asterixdb.git
>>>>> 
>>>>> 
>>>>> A JIRA issue tracker
>>>>> 
>>>>> https://issues.apache.org/jira/browse/ASTERIXDB
>>>>> 
>>>>> 
>>>>> Initial Committers
>>>>> 
>>>>> The following is a list of the planned initial Apache committers (the
>>>>> active subset of the committers for the current repository at Google
>>>>> code).
>>>>> 
>>>>> Abdullah Alamoudi (bamousaa@gmail.com)
>>>>> Cameron Samak (eufery@gmail.com)
>>>>> Chen Li (chenli@gmail.com)
>>>>> Ian Maxon (imaxon@uci.edu)
>>>>> Ildar Absalyamov (ildar.absalyamov@gmail.com)
>>>>> Jianfeng Jia (jianfeng.jia@gmail.com)
>>>>> Karen Ouaknine (kereno@gmail.com)
>>>>> Markus Dreseler (apache@dreseler.de)
>>>>> Mike Carey (dtabass@apache.org)
>>>>> Murtadha Hubail (hubailmor@gmail.com)
>>>>> Pouria Pirzadeh (pouria.pirzadeh@gmail.com)
>>>>> Preston Carman (prestonc@apache.org)
>>>>> Raman Grover (RamanGrover29@gmail.com)
>>>>> Sattam Alsubaiee (salsubaiee@gmail.com)
>>>>> Steven Jacobs (sjaco002@apache.org)
>>>>> Taewoo Kim (wangsaeu@gmail.com)
>>>>> Till Westmann (tillw@apache.org)
>>>>> Vinayak Borkar (vinayakb@apache.org)
>>>>> Yingyi Bu (buyingyi@gmail.com)
>>>>> Young-Seok Kim (kisskys@gmail.com)
>>>>> Zach Heilbron (zheilbron@gmail.com)
>>>>> 
>>>>> 
>>>>> Affiliations
>>>>> 
>>>>> UC Irvine
>>>>> - Mike Carey
>>>>> - Chen Li
>>>>> - Ian Maxon
>>>>> - Yingyi Bu
>>>>> - Raman Grover
>>>>> - Pouria Pirzadeh
>>>>> - Young-Seok Kim
>>>>> - Cameron Samak
>>>>> - Taewoo Kim
>>>>> - Jianfeng Jia
>>>>> - Murtadha Hubail
>>>>> - Markus Dreseler
>>>>> 
>>>>> UC Riverside
>>>>> - Ildar Absalyamov
>>>>> - Preston Carman
>>>>> - Steven Jacobs
>>>>> 
>>>>> Hebrew University
>>>>> - Keren Ouaknine
>>>>> 
>>>>> Oracle
>>>>> - Till Westmann
>>>>> 
>>>>> X15 Software
>>>>> - Vinayak Borkar
>>>>> - Zach Heilbron
>>>>> 
>>>>> KACST Saudi Arabia
>>>>> - Sattam Alsubaiee
>>>>> 
>>>>> Saudi Aramco
>>>>> - Abdullah Alamoudi
>>>>> 
>>>>> Carey, Li, and Maxon are full-time UCI staff, with the remaining UCI
>>>>> (UC Irvine) and UCR (UC Riverside) affiliates being students. The
>>>>> non-UC committers are a mix of alumni who continue to contribute to
>>>>> the effort and individuals working with permission part-time (or in
>>>>> spare time) on this project.
>>>>> 
>>>>> 
>>>>> Sponsors
>>>>> 
>>>>> Champion
>>>>> 
>>>>> Chris Mattmann (NASA/JPL)
>>>>> 
>>>>> Nominated Mentors
>>>>> 
>>>>> TBD
>>>>> 
>>>>> Sponsoring Entity
>>>>> 
>>>>> The Apache Incubator
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>> Chris Mattmann, Ph.D.
>>>>> Chief Architect
>>>>> Instrument Software and Science Data Systems Section (398)
>>>>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>>>> Office: 168-519, Mailstop: 168-527
>>>>> Email: chris.a.mattmann@nasa.gov
>>>>> WWW:  http://sunset.usc.edu/~mattmann/
>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>> Adjunct Associate Professor, Computer Science Department
>>>>> University of Southern California, Los Angeles, CA 90089 USA
>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>> 
> 



Re: [PROPOSAL] Apache AsterixDB Incubator

Posted by Till Westmann <ti...@westmann.org>.

Thank you.
So far we’ve added 3 slots for mentors on the proposal - I hope that’ll be sufficient even for the relatively large number of new committers.

Till

> On Jan 19, 2015, at 8:17 PM, Henry Saputra <he...@gmail.com> wrote:
> 
> Thanks Till,
> 
> Will try to solicit more mentors to help, especially since most of
> the initial committers have not been exposed to contributing the
> Apache way.
> 
> - Henry
> 
> On Mon, Jan 19, 2015 at 5:28 PM, Till Westmann <ti...@westmann.org> wrote:
>> Hi Henry,
>> 
>> thanks! It’s great that you’ve seen (and liked) AsterixDB before.
>> 
>> Even if your time is very limited we would be very happy to have you on board as a mentor.
>> I’ll add you to the proposal.
>> 
>> Cheers,
>> Till
>> 
>>> On Jan 19, 2015, at 10:26 AM, Henry Saputra <he...@gmail.com> wrote:
>>> 
>>> +1 This is GREAT News!
>>> 
>>> Was watching and trying AsterixDB last year and it looked to be in awesome shape.
>>> 
>>> I have my plate full but would love to help mentor this project to get
>>> it going to ASF if needed!
>>> 
>>> - Henry

Re: [PROPOSAL] Apache AsterixDB Incubator

Posted by Henry Saputra <he...@gmail.com>.

Thanks Till,

Will try to solicit more mentors to help, especially since most of
the initial committers have not been exposed to contributing the
Apache way.

- Henry

On Mon, Jan 19, 2015 at 5:28 PM, Till Westmann <ti...@westmann.org> wrote:
> Hi Henry,
>
> thanks! It’s great that you’ve seen (and liked) AsterixDB before.
>
> Even if your time is very limited we would be very happy to have you on board as a mentor.
> I’ll add you to the proposal.
>
> Cheers,
> Till
>
>> On Jan 19, 2015, at 10:26 AM, Henry Saputra <he...@gmail.com> wrote:
>>
>> +1 This is GREAT News!
>>
>> Was watching and trying AsterixDB last year and it looked to be in awesome shape.
>>
>> I have my plate full but would love to help mentor this project to get
>> it going to ASF if needed!
>>
>> - Henry
>>
>> On Wed, Jan 14, 2015 at 6:21 PM, Mattmann, Chris A (3980)
>> <ch...@jpl.nasa.gov> wrote:
>>> Hi Folks,
>>>
>>> I am pleased to bring forth the Apache AsterixDB proposal to the
>>> Apache Incubator as Champion, working in collaboration with the
>>> team. Please find the wiki proposal here:
>>>
>>> https://wiki.apache.org/incubator/AsterixDBProposal
>>>
>>>
>>> Full text of the proposal is below. Please discuss and enjoy. I’ll
>>> leave the discussion open for a week, and then look to call a VOTE
>>> hopefully end of next week if all is well.
>>>
>>> Cheers!
>>> Chris Mattmann
>>>
>>> =============================================================
>>> Apache AsterixDB Proposal
>>>
>>> Abstract
>>>
>>> Apache AsterixDB is a scalable big data management system (BDMS) that
>>> provides storage, management, and query capabilities for large
>>> collections of semi-structured data.
>>>
>>> Proposal
>>>
>>> AsterixDB is a big data management system (BDMS) with a feature set
>>> that makes it well-suited to needs such as web data warehousing and
>>> social data storage and analysis. Feature-wise, AsterixDB has:
>>>
>>> * A NoSQL style data model (ADM) based on extending JSON with object
>>>  database concepts.
>>> * An expressive and declarative query language (AQL) for querying
>>>  semi-structured data.
>>> * A runtime query execution engine, Hyracks, for partitioned-parallel
>>>  execution of query plans.
>>> * Partitioned LSM-based data storage and indexing for efficient
>>>  ingestion of newly arriving data.
>>> * Support for querying and indexing external data (e.g., in HDFS) as
>>>  well as data stored within AsterixDB.
>>> * A rich set of primitive data types, including support for spatial,
>>>  temporal, and textual data.
>>> * Indexing options that include B+ trees, R trees, and inverted
>>>  keyword index support.
>>> * Basic transactional (concurrency and recovery) capabilities akin to
>>>  those of a NoSQL store.
>>>
>>>
>>> Background and Rationale
>>>
>>> In the world of relational databases, the need to tackle data volumes
>>> that exceed the capabilities of a single server led to the
>>> development of “shared-nothing” parallel database systems several
>>> decades ago. These systems spread data over a cluster based on a
>>> partitioning strategy, such as hash partitioning, and queries are
>>> processed by employing partitioned-parallel divide-and-conquer
>>> techniques. Since these systems are fronted by a high-level,
>>> declarative language (SQL), their users are shielded from the
>>> complexities of parallel programming. Parallel database systems have
>>> been an extremely successful application of parallel computing, and
>>> quite a number of commercial products exist today.
>>>
>>> In the distributed systems world, the Web brought a need to index and
>>> query its huge content. SQL and relational databases were not the
>>> answer, though shared-nothing clusters again emerged as the hardware
>>> platform of choice. Google developed the Google File System (GFS) and
>>> MapReduce programming model to allow programmers to store and process
>>> Big Data by writing a few user-defined functions. The MapReduce
>>> framework applies these functions in parallel to data instances in
>>> distributed files (map) and to sorted groups of instances sharing a
>>> common key (reduce) -- not unlike the partitioned parallelism in
>>> parallel database systems. Apache's Hadoop MapReduce platform is the
>>> most prominent implementation of this paradigm for the rest of the
>>> Big Data community. On top of Hadoop and HDFS sit declarative
>>> languages like Pig and Hive that each compile down to Hadoop
>>> MapReduce jobs.
>>>
>>> The big Web companies were also challenged by extreme user bases
>>> (100s of millions of users) and needed fast simple lookups and
>>> updates to very large keyed data sets like user profiles. SQL
>>> databases were deemed either too expensive or not scalable, so the
>>> “NoSQL movement” was born. The ASF now has HBase and Cassandra, two
>>> popular key-value stores, in this space. MongoDB and Couchbase are
>>> other open source alternatives (document stores).
>>>
>>> It is evident from the rapidly growing popularity of "NoSQL" stores,
>>> as well as the strong demand for Big Data analytics engines today,
>>> that there is a strong (and growing!) need to store, process, *and*
>>> query large volumes of semi-structured data in many application
>>> areas. Until very recently, developers have had to ``choose'' between
>>> using big data analytics engines like Apache Hive or Apache Spark,
>>> which can do complex query processing and analysis over HDFS-resident
>>> files, and flexible but low-function data stores like MongoDB or
>>> Apache HBase. (The Apache Phoenix project,
>>> http://phoenix.apache.org/, is a recent SQL-over-HBase effort that
>>> aims to bridge between these choices.)
>>>
>>> AsterixDB is a highly scalable data management system that can store,
>>> index, and manage semi-structured data, e.g., much like MongoDB, but
>>> it also supports a full-power query language with the expressiveness
>>> of SQL (and more). Unlike analytics engines like Hive or Spark, it
>>> stores and manages data, so AsterixDB can exploit its knowledge of
>>> data partitioning and the availability of indexes to avoid always
>>> scanning data set(s) to process queries. Somewhat surprisingly, there
>>> is no open source parallel database system (relational or otherwise)
>>> available to developers today -- AsterixDB aims to fill this need.
>>> Since Apache is where the majority of the today's most important Big
>>> Data technologies live, the ASF seems like the obvious home for a
>>> system like AsterixDB.
>>>
>>> Current Status
>>>
>>> The current version of AsterixDB was co-developed by a team of
>>> faculty, staff, and students at UC Irvine and UC Riverside. The
>>> project was initiated as a large NSF-sponsored project in 2009, the
>>> goal of which was to combine the best ideas from the parallel
>>> database world, the then new Hadoop world, and the semi-structured
>>> (e.g., XML/JSON) data world in order to create a next-generation
>>> BDMS. A first informal open source release was made four years later,
>>> in June of 2013, under the Apache Software License 2.0.
>>>
>>>
>>> Meritocracy
>>>
>>> The current developers are familiar with meritocratic open source
>>> development at Apache. Apache was chosen specifically because we want
>>> to encourage this style of development for the project.
>>>
>>>
>>> Community
>>>
>>> While AsterixDB started as a university project it has developed into
>>> a community. A number of the initial committers started contributing
>>> in academia and continue to actively participate and contribute after
>>> graduation. And we seek to further develop developer and user
>>> communities. One way to broaden the community that is ongoing is
>>> through academic collaborations (currently with IIT Mumbai in India
>>> and TU Berlin in Germany). During incubation we will also explicitly
>>> seek increased industrial participation.
>>>
>>> Some indicators of the effort's development community and history can
>>> be
>>> found at:
>>> https://www.openhub.net/p/asterixdb/contributors?query=&sort=commits_12_mo,
>>> https://www.openhub.net/p/hyracks/contributors?query=&sort=commits_12_mo
>>>
>>>
>>> Core Developers
>>>
>>> The core developers of the project are diverse, although initially UC
>>> Irvine heavy (roughly 50) due to the project's origins at UCI. The
>>> other 50 are from other academic institutions (UC Riverside and the
>>> Hebrew University in Jerusalem) and companies (Couchbase, Facebook,
>>> IBM, KACST Saudi Arabia, Oracle, Saudi Aramco, X15 Software).
>>>
>>>
>>> Alignment
>>>
>>> Apache is, by far, the most natural home for taking the AsterixDB
>>> project forward. A large fraction of today's top Big Data
>>> technologies have their homes in Apache, including Hadoop, YARN, Pig,
>>> Hive, Spark, Flink, HBase, Cassandra and others. AsterixDB fills a
>>> significant gap -- the parallel data management system gap -- that
>>> exists in the Big Data open source world. It is well-aligned with a
>>> number of the Apache projects, e.g., it has strong support for
>>> accessing and indexing external data in HDFS, and it uses YARN as an
>>> answer to basic cluster resource management. AsterixDB also seeks to
>>> achieve an Apache-style development model; it is seeking a broader
>>> community of contributors and users in order to achieve its full
>>> potential and value to the Big Data community.
>>>
>>> There are also a number of related Apache projects and dependencies
>>> that will be mentioned below in the Relationships with Other Apache
>>> products section.
>>>
>>>
>>> Known Risks
>>>
>>> Orphaned products
>>>
>>> Given the current level of intellectual investment in AsterixDB, the
>>> risk of the project being abandoned is very small. The UCI/UCR
>>> faculty team leads are highly incentivized to continue development
>>> since the database groups at UC Irvine and UC Riverside are both
>>> reliant on AsterixDB as a platform for long-term graduate research
>>> projects. UC San Diego is also beginning to contribute to the code
>>> base, and a collaboration involving public health applications is
>>> forming with UCLA. The work on AsterixDB is managed via a mix of
>>> mailing list discussions supplemented by weekly project status
>>> meetings which are summarized on the mailing list. Typical (local
>>> plus Skype-in) attendance to the weekly status meetings runs at about
>>> 20 active contributors.
>>>
>>>
>>> Inexperience with Open Source
>>>
>>> AsterixDB and Hyracks were completely developed in Open Source under
>>> the ASL 2.0. The source code repositories, issue tracker, and mailing
>>> lists are available on Google Code and discussions and decisions
>>> happen on the mailing lists (which is necessary due to the geographic
>>> distribution of the current developers).
>>>
>>> Also a few of the initial committers have contributed to Apache
>>> projects. Vinayak Borkar is a committer on the Apache Helix and
>>> Apache VXQuery projects. Till Westmann is the VP VXQuery at the ASF
>>> and an IPMC member. Preston Carman and Steven Jacobs are committers
>>> on the Apache VXQuery project.
>>>
>>>
>>> Relationships with Other Apache Products
>>>
>>> Apache VXQuery is based on the Hyracks data-parallel runtime, which
>>> is also included in the AsterixDB code base.
>>>
>>> AsterixDB is closely related to Apache Hadoop. Included in AsterixDB
>>> is support for accessing external data in HDFS (and Hive formats),
>>> and resource management and system administration features are in the
>>> process of being migrated to YARN.
>>>
>>> AsterixDB's AQL query facilities offer comparable query power to
>>> Apache's Pig and Hive systems for big data analytics. AsterixDB
>>> differs in storing and indexing data and thus being able to quickly
>>> answer small and medium queries without large HDFS data scans -
>>> thereby targeting a different class of use cases.
>>>
>>> AsterixDB's data storage and indexing facilities are similar to those
>>> of HBase, but AsterixDB differs in being a much more complete and
>>> queryable BDMS (not just a key-value style store).
>>>
>>> AsterixDB's target use cases are not in-memory processing or
>>> iterative algorithm support, making AsterixDB complementary to the
>>> Apache Spark platform. (Spark interoperability is on our longer-term
>>> to-do wishlist.)
>>>
>>>
>>> Homogeneous Developers
>>>
>>> As mentioned before the current community is already organizationally
>>> and geographically distributed - and we would like to increase the
>>> heterogeneity.
>>>
>>>
>>> Reliance on Salaried Developers
>>>
>>> Of the initial committers only 3 are full-time UCI staff. The other
>>> committers are a mix of students, alumni who continue to contribute
>>> to the effort, and individuals working with permission part-time (or
>>> in spare time) on this project.
>>>
>>>
>>> A Excessive Fascination with the Apache Brand
>>>
>>> We believe in the processes, systems, and framework Apache has put in
>>> place. Apache is also known to foster a great community around their
>>> projects and provide exposure. While brand is important, our
>>> fascination with it is not excessive. We believe that the ASF is the
>>> right home for AsterixDB and that having AsterixDB inside of the ASF
>>> will lead to a better long-term outcome for the Big Data community.
>>>
>>>
>>> Documentation
>>>
>>> Documentation and publications related to AsterixDB can be found at
>>> http://asterixdb.ics.uci.edu/.
>>>
>>>
>>> Initial Source
>>>
>>> Current source resides in Google code:
>>> https://code.google.com/p/asterixdb/ (query language and upper system
>>> layers) and https://code.google.com/p/hyracks/ (dataflow runtime
>>> system and storage management libraries).
>>>
>>>
>>> External Dependencies
>>>
>>> AsterixDB depends on a number of Apache projects:
>>>
>>> - Ant
>>> - Avro
>>> - ApacheDB JDO
>>> - Commons
>>> - Derby
>>> - Hadoop
>>> - Hive
>>> - HTTPComponents
>>> - Jakarta ORO
>>> - Maven
>>> - Tomcat
>>> - Thrift
>>> - Velocity
>>> - Wicket
>>> - Xerces
>>>
>>> and other open source projects (organized by license):
>>>
>>> -- ASL 2.0:
>>> - Jackson
>>> - Google Guava
>>> - Google Guice
>>> - JSON-simple
>>> - BoneCP
>>> - Microsoft Azure SDK
>>> - Netty
>>> - Rome
>>> - JetS3t
>>> - Groovy
>>> - Jettison
>>> - Plexus
>>> - Datanucleus (JDO)
>>> - Jetty
>>> - Twitter4J
>>> - Snappy-java
>>>
>>> -- BSD:
>>> - Antlr
>>> - ObjectWeb ASM
>>> - Protobuf
>>> - JSCH
>>> - JavaCC
>>> - Paranamer
>>> - JLine
>>> - Stax
>>> - StringTemplate
>>> - xmlEnc
>>>
>>> -- MIT
>>> - AppAssembler
>>> - SimpleLog4J
>>>
>>> -- CDDL 1.0
>>> - Java Activation Framework
>>> - Java Transactions
>>> - Java Servlet API
>>> - Grizzly
>>> - gmbal
>>> - Glassfish
>>>
>>> -- CDDL 1.1
>>> - Jersey
>>> - JAXB Reference Implementation
>>>
>>> -- JSON License
>>> - JSON
>>>
>>> -- EPL 1.0
>>> - JUnit
>>>
>>> -- JDOM License
>>> - JDOM
>>>
>>> -- Public Domain
>>> - xz
>>> - AOPAlliance
>>>
>>> As all dependencies are managed using Apache Maven, none of the
>>> external libraries need to be packaged in a source distribution.
>>>
>>>
>>> Required Resources
>>>
>>> Developer and user mailing lists
>>>
>>> private@asterixdb.incubator.apache.org (with moderated subscriptions)
>>> commits@asterixdb.incubator.apache.org
>>> dev@asterixdb.incubator.apache.org
>>> users@asterixdb.incubator.apache.org
>>>
>>>
>>> A git repository
>>>
>>> https://git-wip-us.apache.org/repos/asf/incubator-asterixdb.git
>>>
>>>
>>> A JIRA issue tracker
>>>
>>> https://issues.apache.org/jira/browse/ASTERIXDB
>>>
>>>
>>> Initial Committers
>>>
>>> The following is a list of the planned initial Apache committers (the
>>> active subset of the committers for the current repository at Google
>>> code).
>>>
>>> Abdullah Alamoudi (bamousaa@gmail.com)
>>> Cameron Samak (eufery@gmail.com)
>>> Chen Li (chenli@gmail.com)
>>> Ian Maxon (imaxon@uci.edu)
>>> Ildar Absalyamov (ildar.absalyamov@gmail.com)
>>> Jianfeng Jia (jianfeng.jia@gmail.com)
>>> Karen Ouaknine (kereno@gmail.com)
>>> Markus Dreseler (apache@dreseler.de)
>>> Mike Carey (dtabass@apache.org)
>>> Murtadha Hubail (hubailmor@gmail.com)
>>> Pouria Pirzadeh (pouria.pirzadeh@gmail.com)
>>> Preston Carman (prestonc@apache.org)
>>> Raman Grover (RamanGrover29@gmail.com)
>>> Sattam Alsubaiee (salsubaiee@gmail.com)
>>> Steven Jacobs (sjaco002@apache.org)
>>> Taewoo Kim (wangsaeu@gmail.com)
>>> Till Westmann (tillw@apache.org)
>>> Vinayak Borkar (vinayakb@apache.org)
>>> Yingyi Bu (buyingyi@gmail.com)
>>> Young-Seok Kim (kisskys@gmail.com)
>>> Zach Heilbron (zheilbron@gmail.com)
>>>
>>>
>>> Affiliations
>>>
>>> UC Irvine
>>> - Mike Carey
>>> - Chen Li
>>> - Ian Maxon
>>> - Yingyi Bu
>>> - Raman Grover
>>> - Pouria Pirzadeh
>>> - Young-Seok Kim
>>> - Cameron Samak
>>> - Taewoo Kim
>>> - Jianfeng Jia
>>> - Murtadha Hubail
>>> - Markus Dreseler
>>>
>>> UC Riverside
>>> - Ildar Absalyamov
>>> - Preston Carman
>>> - Steven Jacobs
>>>
>>> Hebrew University
>>> - Keren Ouaknine
>>>
>>> Oracle
>>> - Till Westmann
>>>
>>> X15 Software
>>> - Vinayak Borkar
>>> - Zach Heilbron
>>>
>>> KACST Saudi Arabia
>>> - Sattam Alsubaiee
>>>
>>> Saudi Aramco
>>> - Abdullah Alamoudi
>>>
>>> Carey, Li, and Maxon are full-time UCI staff, with the remaining UCI
>>> (UC Irvine) and UCR (UC Riverside) affiliates being students. The
>>> non-UC committers are a mix of alumni who continue to contribute to
>>> the effort and individuals working with permission part-time (or in
>>> spare time) on this project.
>>>
>>>
>>> Sponsors
>>>
>>> Champion
>>>
>>> Chris Mattmann (NASA/JPL)
>>>
>>> Nominated Mentors
>>>
>>> TBD
>>>
>>> Sponsoring Entity
>>>
>>> The Apache Incubator
>>>
>>>
>>>
>>>
>>>
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> Chris Mattmann, Ph.D.
>>> Chief Architect
>>> Instrument Software and Science Data Systems Section (398)
>>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>> Office: 168-519, Mailstop: 168-527
>>> Email: chris.a.mattmann@nasa.gov
>>> WWW:  http://sunset.usc.edu/~mattmann/
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> Adjunct Associate Professor, Computer Science Department
>>> University of Southern California, Los Angeles, CA 90089 USA
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>
>>>
>>>
>>>
>

Re: [PROPOSAL] Apache AsterixDB Incubator

Posted by Mike Carey <dt...@gmail.com>.

Indeed - thanks!!
Cheers,
Mike

On 1/19/15 5:28 PM, Till Westmann wrote:
> Hi Henry,
>
> thanks! It’s great that you’ve seen (and liked) AsterixDB before.
>
> Even if your time is very limited we would be very happy to have you on board as a mentor.
> I’ll add you to the proposal.
>
> Cheers,
> Till
>
>> On Jan 19, 2015, at 10:26 AM, Henry Saputra <he...@gmail.com> wrote:
>>
>> +1 This is GREAT News!
>>
>> Was watching and trying AsterixDB last year and it looked in awesome shape.
>>
>> I have my plate full but would love to help mentor this project to get
>> it going to ASF if needed!
>>
>> - Henry
>>>
>>>
>>>
>>>

Re: [PROPOSAL] Apache AsterixDB Incubator

Posted by Till Westmann <ti...@westmann.org>.

Hi Henry,

Thanks! It’s great that you’ve seen (and liked) AsterixDB before.

Even if your time is very limited, we would be very happy to have you on board as a mentor.
I’ll add you to the proposal.

Cheers,
Till

> On Jan 19, 2015, at 10:26 AM, Henry Saputra <he...@gmail.com> wrote:
> 
> +1 This is GREAT News!
> 
> Was watching and trying AsterixDB last year and it looked in awesome shape.
> 
> I have my plate full but would love to help mentor this project to get
> it going to ASF if needed!
> 
> - Henry
> 

Re: [PROPOSAL] Apache AsterixDB Incubator

Posted by Henry Saputra <he...@gmail.com>.

+1 This is GREAT News!

Was watching and trying AsterixDB last year and it looked in awesome shape.

I have my plate full but would love to help mentor this project to get
it going to ASF if needed!

- Henry

On Wed, Jan 14, 2015 at 6:21 PM, Mattmann, Chris A (3980)
<ch...@jpl.nasa.gov> wrote:
> Hi Folks,
>
> I am pleased to bring forth the Apache AsterixDB proposal to the
> Apache Incubator as Champion, working in collaboration with the
> team. Please find the wiki proposal here:
>
> https://wiki.apache.org/incubator/AsterixDBProposal
>
>
> Full text of the proposal is below. Please discuss and enjoy. I’ll
> leave the discussion open for a week, and then look to call a VOTE
> hopefully end of next week if all is well.
>
> Cheers!
> Chris Mattmann
>
> =============================================================
> Apache AsterixDB Proposal
>
> Abstract
>
> Apache AsterixDB is a scalable big data management system (BDMS) that
> provides storage, management, and query capabilities for large
> collections of semi-structured data.
>
> Proposal
>
> AsterixDB is a big data management system (BDMS) that makes it
> well-suited to needs such as web data warehousing and social data
> storage and analysis. Feature-wise, AsterixDB has:
>
> * A NoSQL style data model (ADM) based on extending JSON with object
>   database concepts.
> * An expressive and declarative query language (AQL) for querying
>   semi-structured data.
> * A runtime query execution engine, Hyracks, for partitioned-parallel
>   execution of query plans.
> * Partitioned LSM-based data storage and indexing for efficient
>   ingestion of newly arriving data.
> * Support for querying and indexing external data (e.g., in HDFS) as
>   well as data stored within AsterixDB.
> * A rich set of primitive data types, including support for spatial,
>   temporal, and textual data.
> * Indexing options that include B+ trees, R trees, and inverted
>   keyword index support.
> * Basic transactional (concurrency and recovery) capabilities akin to
>   those of a NoSQL store.
>
>
> Background and Rationale
>
> In the world of relational databases, the need to tackle data volumes
> that exceed the capabilities of a single server led to the
> development of “shared-nothing” parallel database systems several
> decades ago. These systems spread data over a cluster based on a
> partitioning strategy, such as hash partitioning, and queries are
> processed by employing partitioned-parallel divide-and-conquer
> techniques. Since these systems are fronted by a high-level,
> declarative language (SQL), their users are shielded from the
> complexities of parallel programming. Parallel database systems have
> been an extremely successful application of parallel computing, and
> quite a number of commercial products exist today.
>
> In the distributed systems world, the Web brought a need to index and
> query its huge content. SQL and relational databases were not the
> answer, though shared-nothing clusters again emerged as the hardware
> platform of choice. Google developed the Google File System (GFS) and
> MapReduce programming model to allow programmers to store and process
> Big Data by writing a few user-defined functions. The MapReduce
> framework applies these functions in parallel to data instances in
> distributed files (map) and to sorted groups of instances sharing a
> common key (reduce) -- not unlike the partitioned parallelism in
> parallel database systems. Apache's Hadoop MapReduce platform is the
> most prominent implementation of this paradigm for the rest of the
> Big Data community. On top of Hadoop and HDFS sit declarative
> languages like Pig and Hive that each compile down to Hadoop
> MapReduce jobs.
>
> The big Web companies were also challenged by extreme user bases
> (100s of millions of users) and needed fast simple lookups and
> updates to very large keyed data sets like user profiles. SQL
> databases were deemed either too expensive or not scalable, so the
> “NoSQL movement” was born. The ASF now has HBase and Cassandra, two
> popular key-value stores, in this space. MongoDB and Couchbase are
> other open source alternatives (document stores).
>
> It is evident from the rapidly growing popularity of "NoSQL" stores,
> as well as the strong demand for Big Data analytics engines today,
> that there is a strong (and growing!) need to store, process, *and*
> query large volumes of semi-structured data in many application
> areas. Until very recently, developers have had to ``choose'' between
> using big data analytics engines like Apache Hive or Apache Spark,
> which can do complex query processing and analysis over HDFS-resident
> files, and flexible but low-function data stores like MongoDB or
> Apache HBase. (The Apache Phoenix project,
> http://phoenix.apache.org/, is a recent SQL-over-HBase effort that
> aims to bridge between these choices.)
>
> AsterixDB is a highly scalable data management system that can store,
> index, and manage semi-structured data, e.g., much like MongoDB, but
> it also supports a full-power query language with the expressiveness
> of SQL (and more). Unlike analytics engines like Hive or Spark, it
> stores and manages data, so AsterixDB can exploit its knowledge of
> data partitioning and the availability of indexes to avoid always
> scanning data set(s) to process queries. Somewhat surprisingly, there
> is no open source parallel database system (relational or otherwise)
> available to developers today -- AsterixDB aims to fill this need.
> Since Apache is where the majority of the today's most important Big
> Data technologies live, the ASF seems like the obvious home for a
> system like AsterixDB.
>
> Current Status
>
> The current version of AsterixDB was co-developed by a team of
> faculty, staff, and students at UC Irvine and UC Riverside. The
> project was initiated as a large NSF-sponsored project in 2009, the
> goal of which was to combine the best ideas from the parallel
> database world, the then new Hadoop world, and the semi-structured
> (e.g., XML/JSON) data world in order to create a next-generation
> BDMS. A first informal open source release was made four years later,
> in June of 2013, under the Apache Software License 2.0.
>
>
> Meritocracy
>
> The current developers are familiar with meritocratic open source
> development at Apache. Apache was chosen specifically because we want
> to encourage this style of development for the project.
>
>
> Community
>
> While AsterixDB started as a university project it has developed into
> a community. A number of the initial committers started contributing
> in academia and continue to actively participate and contribute after
> graduation. And we seek to further develop developer and user
> communities. One way to broaden the community that is ongoing is
> through academic collaborations (currently with IIT Mumbai in India
> and TU Berlin in Germany). During incubation we will also explicitly
> seek increased industrial participation.
>
> Some indicators of the effort's development community and history can
> be
> found at:
> https://www.openhub.net/p/asterixdb/contributors?query=&sort=commits_12_mo,
> https://www.openhub.net/p/hyracks/contributors?query=&sort=commits_12_mo
>
>
> Core Developers
>
> The core developers of the project are diverse, although initially UC
> Irvine heavy (roughly 50) due to the project's origins at UCI. The
> other 50 are from other academic institutions (UC Riverside and the
> Hebrew University in Jerusalem) and companies (Couchbase, Facebook,
> IBM, KACST Saudi Arabia, Oracle, Saudi Aramco, X15 Software).
>
>
> Alignment
>
> Apache is, by far, the most natural home for taking the AsterixDB
> project forward. A large fraction of today's top Big Data
> technologies have their homes in Apache, including Hadoop, YARN, Pig,
> Hive, Spark, Flink, HBase, Cassandra and others. AsterixDB fills a
> significant gap -- the parallel data management system gap -- that
> exists in the Big Data open source world. It is well-aligned with a
> number of the Apache projects, e.g., it has strong support for
> accessing and indexing external data in HDFS, and it uses YARN as an
> answer to basic cluster resource management. AsterixDB also seeks to
> achieve an Apache-style development model; it is seeking a broader
> community of contributors and users in order to achieve its full
> potential and value to the Big Data community.
>
> There are also a number of related Apache projects and dependencies
> that will be mentioned below in the Relationships with Other Apache
> products section.
>
>
> Known Risks
>
> Orphaned products
>
> Given the current level of intellectual investment in AsterixDB, the
> risk of the project being abandoned is very small. The UCI/UCR
> faculty team leads are highly incentivized to continue development
> since the database groups at UC Irvine and UC Riverside are both
> reliant on AsterixDB as a platform for long-term graduate research
> projects. UC San Diego is also beginning to contribute to the code
> base, and a collaboration involving public health applications is
> forming with UCLA. The work on AsterixDB is managed via a mix of
> mailing list discussions supplemented by weekly project status
> meetings which are summarized on the mailing list. Typical (local
> plus Skype-in) attendance to the weekly status meetings runs at about
> 20 active contributors.
>
>
> Inexperience with Open Source
>
> AsterixDB and Hyracks were completely developed in Open Source under
> the ASL 2.0. The source code repositories, issue tracker, and mailing
> lists are available on Google Code and discussions and decisions
> happen on the mailing lists (which is necessary due to the geographic
> distribution of the current developers).
>
> Also a few of the initial committers have contributed to Apache
> projects. Vinayak Borkar is a committer on the Apache Helix and
> Apache VXQuery projects. Till Westmann is the VP VXQuery at the ASF
> and an IPMC member. Preston Carman and Steven Jacobs are committers
> on the Apache VXQuery project.
>
>
> Relationships with Other Apache Products
>
> Apache VXQuery is based on the Hyracks data-parallel runtime, which
> is also included in the AsterixDB code base.
>
> AsterixDB is closely related to Apache Hadoop. Included in AsterixDB
> is support for accessing external data in HDFS (and Hive formats),
> and resource management and system administration features are in the
> process of being migrated to YARN.
>
> AsterixDB's AQL query facilities offer comparable query power to
> Apache's Pig and Hive systems for big data analytics. AsterixDB
> differs in storing and indexing data and thus being able to quickly
> answer small and medium queries without large HDFS data scans -
> thereby targeting a different class of use cases.
>
> AsterixDB's data storage and indexing facilities are similar to those
> of HBase, but AsterixDB differs in being a much more complete and
> queryable BDMS (not just a key-value style store).
>
> AsterixDB's target use cases are not in-memory processing or
> iterative algorithm support, making AsterixDB complementary to the
> Apache Spark platform. (Spark interoperability is on our longer-term
> to-do wishlist.)
>
>
> Homogeneous Developers
>
> As mentioned before the current community is already organizationally
> and geographically distributed - and we would like to increase the
> heterogeneity.
>
>
> Reliance on Salaried Developers
>
> Of the initial committers only 3 are full-time UCI staff. The other
> committers are a mix of students, alumni who continue to contribute
> to the effort, and individuals working with permission part-time (or
> in spare time) on this project.
>
>
> A Excessive Fascination with the Apache Brand
>
> We believe in the processes, systems, and framework Apache has put in
> place. Apache is also known to foster a great community around their
> projects and provide exposure. While brand is important, our
> fascination with it is not excessive. We believe that the ASF is the
> right home for AsterixDB and that having AsterixDB inside of the ASF
> will lead to a better long-term outcome for the Big Data community.
>
>
> Documentation
>
> Documentation and publications related to AsterixDB can be found at
> http://asterixdb.ics.uci.edu/.
>
>
> Initial Source
>
> The current source resides on Google Code:
> https://code.google.com/p/asterixdb/ (query language and upper system
> layers) and https://code.google.com/p/hyracks/ (dataflow runtime
> system and storage management libraries).
>
>
> External Dependencies
>
> AsterixDB depends on a number of Apache projects:
>
> - Ant
> - Avro
> - ApacheDB JDO
> - Commons
> - Derby
> - Hadoop
> - Hive
> - HTTPComponents
> - Jakarta ORO
> - Maven
> - Tomcat
> - Thrift
> - Velocity
> - Wicket
> - Xerces
>
> and other open source projects (organized by license):
>
> -- ASL 2.0:
>  - Jackson
>  - Google Guava
>  - Google Guice
>  - JSON-simple
>  - BoneCP
>  - Microsoft Azure SDK
>  - Netty
>  - Rome
>  - JetS3t
>  - Groovy
>  - Jettison
>  - Plexus
>  - Datanucleus (JDO)
>  - Jetty
>  - Twitter4J
>  - Snappy-java
>
> -- BSD:
>  - Antlr
>  - ObjectWeb ASM
>  - Protobuf
>  - JSCH
>  - JavaCC
>  - Paranamer
>  - JLine
>  - Stax
>  - StringTemplate
>  - xmlEnc
>
> -- MIT
>  - AppAssembler
>  - SimpleLog4J
>
> -- CDDL 1.0
>  - Java Activation Framework
>  - Java Transactions
>  - Java Servlet API
>  - Grizzly
>  - gmbal
>  - Glassfish
>
> -- CDDL 1.1
>  - Jersey
>  - JAXB Reference Implementation
>
> -- JSON License
>  - JSON
>
> -- EPL 1.0
>  - JUnit
>
> -- JDOM License
>  - JDOM
>
> -- Public Domain
>  - xz
>  - AOPAlliance
>
> As all dependencies are managed using Apache Maven, none of the
> external libraries need to be packaged in a source distribution.
>
>
> Required Resources
>
> Developer and user mailing lists
>
> private@asterixdb.incubator.apache.org (with moderated subscriptions)
> commits@asterixdb.incubator.apache.org
> dev@asterixdb.incubator.apache.org
> users@asterixdb.incubator.apache.org
>
>
> A git repository
>
> https://git-wip-us.apache.org/repos/asf/incubator-asterixdb.git
>
>
> A JIRA issue tracker
>
> https://issues.apache.org/jira/browse/ASTERIXDB
>
>
> Initial Committers
>
> The following is a list of the planned initial Apache committers (the
> active subset of the committers for the current repository on Google
> Code).
>
> Abdullah Alamoudi (bamousaa@gmail.com)
> Cameron Samak (eufery@gmail.com)
> Chen Li (chenli@gmail.com)
> Ian Maxon (imaxon@uci.edu)
> Ildar Absalyamov (ildar.absalyamov@gmail.com)
> Jianfeng Jia (jianfeng.jia@gmail.com)
> Keren Ouaknine (kereno@gmail.com)
> Markus Dreseler (apache@dreseler.de)
> Mike Carey (dtabass@apache.org)
> Murtadha Hubail (hubailmor@gmail.com)
> Pouria Pirzadeh (pouria.pirzadeh@gmail.com)
> Preston Carman (prestonc@apache.org)
> Raman Grover (RamanGrover29@gmail.com)
> Sattam Alsubaiee (salsubaiee@gmail.com)
> Steven Jacobs (sjaco002@apache.org)
> Taewoo Kim (wangsaeu@gmail.com)
> Till Westmann (tillw@apache.org)
> Vinayak Borkar (vinayakb@apache.org)
> Yingyi Bu (buyingyi@gmail.com)
> Young-Seok Kim (kisskys@gmail.com)
> Zach Heilbron (zheilbron@gmail.com)
>
>
> Affiliations
>
> UC Irvine
> - Mike Carey
> - Chen Li
> - Ian Maxon
> - Yingyi Bu
> - Raman Grover
> - Pouria Pirzadeh
> - Young-Seok Kim
> - Cameron Samak
> - Taewoo Kim
> - Jianfeng Jia
> - Murtadha Hubail
> - Markus Dreseler
>
> UC Riverside
> - Ildar Absalyamov
> - Preston Carman
> - Steven Jacobs
>
> Hebrew University
> - Keren Ouaknine
>
> Oracle
> - Till Westmann
>
> X15 Software
> - Vinayak Borkar
> - Zach Heilbron
>
> KACST Saudi Arabia
> - Sattam Alsubaiee
>
> Saudi Aramco
> - Abdullah Alamoudi
>
> Carey, Li, and Maxon are full-time UCI staff, with the remaining UCI
> (UC Irvine) and UCR (UC Riverside) affiliates being students. The
> non-UC committers are a mix of alumni who continue to contribute to
> the effort and individuals working with permission part-time (or in
> spare time) on this project.
>
>
> Sponsors
>
> Champion
>
> Chris Mattmann (NASA/JPL)
>
> Nominated Mentors
>
> TBD
>
> Sponsoring Entity
>
> The Apache Incubator
>
>
>
>
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattmann@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>
>
>


Re: [PROPOSAL] Apache AsterixDB Incubator

Posted by Mike Carey <dt...@gmail.com>.

Excellent; thanks, Jochen!!
Cheers,
Mike

On 1/19/15 11:44 PM, Jochen Wiedmann wrote:
> Hi, Chris,
>
> I am interested in the proposal and (following up to my involvement
> with VXQuery in the past) would like to offer myself as a mentor.
>
> Jochen

Re: [PROPOSAL] Apache AsterixDB Incubator

Posted by Jochen Wiedmann <jo...@gmail.com>.

Hi, Chris,

I am interested in the proposal and (following up to my involvement
with VXQuery in the past) would like to offer myself as a mentor.

Jochen


On Thu, Jan 15, 2015 at 3:21 AM, Mattmann, Chris A (3980)
<ch...@jpl.nasa.gov> wrote:
> Hi Folks,
>
> I am pleased to bring forth the Apache AsterixDB proposal to the
> Apache Incubator as Champion, working in collaboration with the
> team. Please find the wiki proposal here:
>
> https://wiki.apache.org/incubator/AsterixDBProposal
>
>
> Full text of the proposal is below. Please discuss and enjoy. I’ll
> leave the discussion open for a week, and then look to call a VOTE
> hopefully end of next week if all is well.
>
> Cheers!
> Chris Mattmann
>
> =============================================================
> Apache AsterixDB Proposal
>
> Abstract
>
> Apache AsterixDB is a scalable big data management system (BDMS) that
> provides storage, management, and query capabilities for large
> collections of semi-structured data.
>
> Proposal
>
> AsterixDB is a big data management system (BDMS) whose feature set
> makes it well-suited to needs such as web data warehousing and social
> data storage and analysis. Feature-wise, AsterixDB has:
>
> * A NoSQL style data model (ADM) based on extending JSON with object
>   database concepts.
> * An expressive and declarative query language (AQL) for querying
>   semi-structured data.
> * A runtime query execution engine, Hyracks, for partitioned-parallel
>   execution of query plans.
> * Partitioned LSM-based data storage and indexing for efficient
>   ingestion of newly arriving data.
> * Support for querying and indexing external data (e.g., in HDFS) as
>   well as data stored within AsterixDB.
> * A rich set of primitive data types, including support for spatial,
>   temporal, and textual data.
> * Indexing options that include B+ trees, R trees, and inverted
>   keyword index support.
> * Basic transactional (concurrency and recovery) capabilities akin to
>   those of a NoSQL store.
>
>
> Background and Rationale
>
> In the world of relational databases, the need to tackle data volumes
> that exceed the capabilities of a single server led to the
> development of “shared-nothing” parallel database systems several
> decades ago. These systems spread data over a cluster based on a
> partitioning strategy, such as hash partitioning, and queries are
> processed by employing partitioned-parallel divide-and-conquer
> techniques. Since these systems are fronted by a high-level,
> declarative language (SQL), their users are shielded from the
> complexities of parallel programming. Parallel database systems have
> been an extremely successful application of parallel computing, and
> quite a number of commercial products exist today.
>
> In the distributed systems world, the Web brought a need to index and
> query its huge content. SQL and relational databases were not the
> answer, though shared-nothing clusters again emerged as the hardware
> platform of choice. Google developed the Google File System (GFS) and
> the MapReduce programming model to allow programmers to store and process
> Big Data by writing a few user-defined functions. The MapReduce
> framework applies these functions in parallel to data instances in
> distributed files (map) and to sorted groups of instances sharing a
> common key (reduce) -- not unlike the partitioned parallelism in
> parallel database systems. Apache's Hadoop MapReduce platform is the
> most prominent implementation of this paradigm for the rest of the
> Big Data community. On top of Hadoop and HDFS sit declarative
> languages like Pig and Hive that each compile down to Hadoop
> MapReduce jobs.
>
> The big Web companies were also challenged by extreme user bases
> (100s of millions of users) and needed fast simple lookups and
> updates to very large keyed data sets like user profiles. SQL
> databases were deemed either too expensive or not scalable, so the
> “NoSQL movement” was born. The ASF now has HBase and Cassandra, two
> popular key-value stores, in this space. MongoDB and Couchbase are
> other open source alternatives (document stores).
>
> It is evident from the rapidly growing popularity of "NoSQL" stores,
> as well as the strong demand for Big Data analytics engines today,
> that there is a strong (and growing!) need to store, process, *and*
> query large volumes of semi-structured data in many application
> areas. Until very recently, developers have had to "choose" between
> using big data analytics engines like Apache Hive or Apache Spark,
> which can do complex query processing and analysis over HDFS-resident
> files, and flexible but low-function data stores like MongoDB or
> Apache HBase. (The Apache Phoenix project,
> http://phoenix.apache.org/, is a recent SQL-over-HBase effort that
> aims to bridge between these choices.)
>
> AsterixDB is a highly scalable data management system that can store,
> index, and manage semi-structured data, e.g., much like MongoDB, but
> it also supports a full-power query language with the expressiveness
> of SQL (and more). Unlike analytics engines like Hive or Spark, it
> stores and manages data, so AsterixDB can exploit its knowledge of
> data partitioning and the availability of indexes to avoid always
> scanning data set(s) to process queries. Somewhat surprisingly, there
> is no open source parallel database system (relational or otherwise)
> available to developers today -- AsterixDB aims to fill this need.
> Since Apache is where the majority of today's most important Big
> Data technologies live, the ASF seems like the obvious home for a
> system like AsterixDB.
>
> Current Status
>
> The current version of AsterixDB was co-developed by a team of
> faculty, staff, and students at UC Irvine and UC Riverside. The
> project was initiated as a large NSF-sponsored project in 2009, the
> goal of which was to combine the best ideas from the parallel
> database world, the then new Hadoop world, and the semi-structured
> (e.g., XML/JSON) data world in order to create a next-generation
> BDMS. A first informal open source release was made four years later,
> in June of 2013, under the Apache Software License 2.0.
>
>
> Meritocracy
>
> The current developers are familiar with meritocratic open source
> development at Apache. Apache was chosen specifically because we want
> to encourage this style of development for the project.
>
>
> Community
>
> While AsterixDB started as a university project, it has developed
> into a community. A number of the initial committers started
> contributing in academia and continue to actively participate and
> contribute after graduation. We seek to further grow the developer
> and user communities. One ongoing way of broadening the community is
> through academic collaborations (currently with IIT Mumbai in India
> and TU Berlin in Germany). During incubation we will also explicitly
> seek increased industrial participation.
>
> Some indicators of the effort's development community and history
> can be found at:
> https://www.openhub.net/p/asterixdb/contributors?query=&sort=commits_12_mo
> https://www.openhub.net/p/hyracks/contributors?query=&sort=commits_12_mo
>
>
> Core Developers
>
> The core developers of the project are diverse, although initially UC
> Irvine-heavy (roughly 50%) due to the project's origins at UCI. The
> other 50% are from other academic institutions (UC Riverside and the
> Hebrew University in Jerusalem) and companies (Couchbase, Facebook,
> IBM, KACST Saudi Arabia, Oracle, Saudi Aramco, X15 Software).
>
>
> Alignment
>
> Apache is, by far, the most natural home for taking the AsterixDB
> project forward. A large fraction of today's top Big Data
> technologies have their homes in Apache, including Hadoop, YARN, Pig,
> Hive, Spark, Flink, HBase, Cassandra and others. AsterixDB fills a
> significant gap -- the parallel data management system gap -- that
> exists in the Big Data open source world. It is well-aligned with a
> number of the Apache projects, e.g., it has strong support for
> accessing and indexing external data in HDFS, and it uses YARN as an
> answer to basic cluster resource management. AsterixDB also seeks to
> achieve an Apache-style development model; it is seeking a broader
> community of contributors and users in order to achieve its full
> potential and value to the Big Data community.
>
> There are also a number of related Apache projects and dependencies
> that will be mentioned below in the Relationships with Other Apache
> Products section.
>
>
> Known Risks
>
> Orphaned products
>
> Given the current level of intellectual investment in AsterixDB, the
> risk of the project being abandoned is very small. The UCI/UCR
> faculty team leads are highly incentivized to continue development
> since the database groups at UC Irvine and UC Riverside are both
> reliant on AsterixDB as a platform for long-term graduate research
> projects. UC San Diego is also beginning to contribute to the code
> base, and a collaboration involving public health applications is
> forming with UCLA. The work on AsterixDB is managed via mailing list
> discussions, supplemented by weekly project status meetings that are
> summarized on the mailing list. Typical attendance (local plus
> Skype-in) at the weekly status meetings is about 20 active
> contributors.
>
>
> Inexperience with Open Source
>
> AsterixDB and Hyracks were developed entirely as open source under
> the ASL 2.0. The source code repositories, issue tracker, and mailing
> lists are available on Google Code, and discussions and decisions
> happen on the mailing lists (which is necessary due to the geographic
> distribution of the current developers).
>
> Also, a few of the initial committers have contributed to Apache
> projects. Vinayak Borkar is a committer on the Apache Helix and
> Apache VXQuery projects. Till Westmann is the VP VXQuery at the ASF
> and an IPMC member. Preston Carman and Steven Jacobs are committers
> on the Apache VXQuery project.
>
>
> Relationships with Other Apache Products
>
> Apache VXQuery is based on the Hyracks data-parallel runtime, which
> is also included in the AsterixDB code base.
>
> AsterixDB is closely related to Apache Hadoop. Included in AsterixDB
> is support for accessing external data in HDFS (and Hive formats),
> and resource management and system administration features are in the
> process of being migrated to YARN.
>
> AsterixDB's AQL query facilities offer query power comparable to
> that of Apache Pig and Apache Hive for big data analytics. AsterixDB
> differs in that it stores and indexes data, so it can answer small
> and medium queries quickly without large HDFS data scans, thereby
> targeting a different class of use cases.
>
> AsterixDB's data storage and indexing facilities are similar to those
> of HBase, but AsterixDB differs in being a much more complete and
> queryable BDMS (not just a key-value style store).
>
> AsterixDB's target use cases are not in-memory processing or
> iterative algorithm support, making AsterixDB complementary to the
> Apache Spark platform. (Spark interoperability is on our longer-term
> to-do wishlist.)
>
>
> Homogeneous Developers
>
> As mentioned before, the current community is already
> organizationally and geographically distributed, and we would like to
> increase its heterogeneity further.
>
>
> Reliance on Salaried Developers
>
> Of the initial committers, only 3 are full-time UCI staff. The other
> committers are a mix of students, alumni who continue to contribute
> to the effort, and individuals working with permission part-time (or
> in spare time) on this project.
>
>
> An Excessive Fascination with the Apache Brand
>
> We believe in the processes, systems, and framework Apache has put in
> place. Apache is also known to foster a great community around its
> projects and provide exposure. While brand is important, our
> fascination with it is not excessive. We believe that the ASF is the
> right home for AsterixDB and that having AsterixDB inside of the ASF
> will lead to a better long-term outcome for the Big Data community.
>
>
> Documentation
>
> Documentation and publications related to AsterixDB can be found at
> http://asterixdb.ics.uci.edu/.
>
>
> Initial Source
>
> Current source resides on Google Code:
> https://code.google.com/p/asterixdb/ (query language and upper system
> layers) and https://code.google.com/p/hyracks/ (dataflow runtime
> system and storage management libraries).
>
>
> External Dependencies
>
> AsterixDB depends on a number of Apache projects:
>
> - Ant
> - Avro
> - ApacheDB JDO
> - Commons
> - Derby
> - Hadoop
> - Hive
> - HttpComponents
> - Jakarta ORO
> - Maven
> - Tomcat
> - Thrift
> - Velocity
> - Wicket
> - Xerces
>
> and other open source projects (organized by license):
>
> -- ASL 2.0:
>  - Jackson
>  - Google Guava
>  - Google Guice
>  - JSON-simple
>  - BoneCP
>  - Microsoft Azure SDK
>  - Netty
>  - Rome
>  - JetS3t
>  - Groovy
>  - Jettison
>  - Plexus
>  - DataNucleus (JDO)
>  - Jetty
>  - Twitter4J
>  - Snappy-java
>
> -- BSD:
>  - Antlr
>  - ObjectWeb ASM
>  - Protobuf
>  - JSCH
>  - JavaCC
>  - Paranamer
>  - JLine
>  - Stax
>  - StringTemplate
>  - xmlEnc
>
> -- MIT:
>  - AppAssembler
>  - SimpleLog4J
>
> -- CDDL 1.0:
>  - Java Activation Framework
>  - Java Transactions
>  - Java Servlet API
>  - Grizzly
>  - gmbal
>  - Glassfish
>
> -- CDDL 1.1:
>  - Jersey
>  - JAXB Reference Implementation
>
> -- JSON License:
>  - JSON
>
> -- EPL 1.0:
>  - JUnit
>
> -- JDOM License:
>  - JDOM
>
> -- Public Domain:
>  - xz
>  - AOPAlliance
>
> As all dependencies are managed using Apache Maven, none of the
> external libraries need to be packaged in a source distribution.
>
>
> Required Resources
>
> Developer and user mailing lists
>
> private@asterixdb.incubator.apache.org (with moderated subscriptions)
> commits@asterixdb.incubator.apache.org
> dev@asterixdb.incubator.apache.org
> users@asterixdb.incubator.apache.org
>
>
> A git repository
>
> https://git-wip-us.apache.org/repos/asf/incubator-asterixdb.git
>
>
> A JIRA issue tracker
>
> https://issues.apache.org/jira/browse/ASTERIXDB
>
>
> Initial Committers
>
> The following is a list of the planned initial Apache committers (the
> active subset of the committers for the current repository at Google
> Code).
>
> Abdullah Alamoudi (bamousaa@gmail.com)
> Cameron Samak (eufery@gmail.com)
> Chen Li (chenli@gmail.com)
> Ian Maxon (imaxon@uci.edu)
> Ildar Absalyamov (ildar.absalyamov@gmail.com)
> Jianfeng Jia (jianfeng.jia@gmail.com)
> Keren Ouaknine (kereno@gmail.com)
> Markus Dreseler (apache@dreseler.de)
> Mike Carey (dtabass@apache.org)
> Murtadha Hubail (hubailmor@gmail.com)
> Pouria Pirzadeh (pouria.pirzadeh@gmail.com)
> Preston Carman (prestonc@apache.org)
> Raman Grover (RamanGrover29@gmail.com)
> Sattam Alsubaiee (salsubaiee@gmail.com)
> Steven Jacobs (sjaco002@apache.org)
> Taewoo Kim (wangsaeu@gmail.com)
> Till Westmann (tillw@apache.org)
> Vinayak Borkar (vinayakb@apache.org)
> Yingyi Bu (buyingyi@gmail.com)
> Young-Seok Kim (kisskys@gmail.com)
> Zach Heilbron (zheilbron@gmail.com)
>
>
> Affiliations
>
> UC Irvine
> - Mike Carey
> - Chen Li
> - Ian Maxon
> - Yingyi Bu
> - Raman Grover
> - Pouria Pirzadeh
> - Young-Seok Kim
> - Cameron Samak
> - Taewoo Kim
> - Jianfeng Jia
> - Murtadha Hubail
> - Markus Dreseler
>
> UC Riverside
> - Ildar Absalyamov
> - Preston Carman
> - Steven Jacobs
>
> Hebrew University
> - Keren Ouaknine
>
> Oracle
> - Till Westmann
>
> X15 Software
> - Vinayak Borkar
> - Zach Heilbron
>
> KACST Saudi Arabia
> - Sattam Alsubaiee
>
> Saudi Aramco
> - Abdullah Alamoudi
>
> Carey, Li, and Maxon are full-time UCI staff, with the remaining UCI
> (UC Irvine) and UCR (UC Riverside) affiliates being students. The
> non-UC committers are a mix of alumni who continue to contribute to
> the effort and individuals working with permission part-time (or in
> spare time) on this project.
>
>
> Sponsors
>
> Champion
>
> Chris Mattmann (NASA/JPL)
>
> Nominated Mentors
>
> TBD
>
> Sponsoring Entity
>
> The Apache Incubator
>
>
>
>
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattmann@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>
>
>



-- 
Our time is just a point along a line that runs forever with no end.
(Al Stewart, Lord Grenville)
