You are viewing a plain text version of this content. The canonical link for it is here.
Posted to general@incubator.apache.org by Ted Dunning <td...@apache.org> on 2012/08/02 23:40:34 UTC

[PROPOSAL] Drill for the Apache Incubator

Abstract
========
Drill is a distributed system for interactive analysis of large-scale
datasets, inspired by Google’s Dremel (
http://research.google.com/pubs/pub36632.html).

Proposal
========
Drill is a distributed system for interactive analysis of large-scale
datasets. Drill is similar to Google’s Dremel, with the additional
flexibility needed to support a broader range of query languages, data
formats and data sources. It is designed to efficiently process nested
data. It is a design goal to scale to 10,000 servers or more and to be able
to process petabyes of data and trillions of records in seconds.

Background
==========
Many organizations have the need to run data-intensive applications,
including batch processing, stream processing and interactive analysis. In
recent years open source systems have emerged to address the need for
scalable batch processing (Apache Hadoop) and stream processing (Storm,
Apache S4). In 2010 Google published a paper called “Dremel: Interactive
Analysis of Web-Scale Datasets,” describing a scalable system used
internally for interactive analysis of nested data. No open source project
has successfully replicated the capabilities of Dremel.

Rationale
=========
There is a strong need in the market for low-latency interactive analysis
of large-scale datasets, including nested data (eg, JSON, Avro, Protocol
Buffers). This need was identified by Google and addressed internally with
a system called Dremel.

In recent years open source systems have emerged to address the need for
scalable batch processing (Apache Hadoop) and stream processing (Storm,
Apache S4). Apache Hadoop, originally inspired by Google’s internal
MapReduce system, is used by thousands of organizations processing
large-scale datasets. Apache Hadoop is designed to achieve very high
throughput, but is not designed to achieve the sub-second latency needed
for interactive data analysis and exploration. Drill, inspired by Google’s
internal Dremel system, is intended to address this need.

It is worth noting that, as explained by Google in the original paper,
Dremel complements MapReduce-based computing. Dremel is not intended as a
replacement for MapReduce and is often used in conjunction with it to
analyze outputs of MapReduce pipelines or rapidly prototype larger
computations. Indeed, Dremel and MapReduce are both used by thousands of
Google employees.

Like Dremel, Drill supports a nested data model with data encoded in a
number of formats such as JSON, Avro or Protocol Buffers. In many
organizations nested data is the standard, so supporting a nested data
model eliminates the need to normalize the data. With that said, flat data
formats, such as CSV files, are naturally supported as a special case of
nested data.

The Drill architecture consists of four key components/layers:
* Query languages: This layer is responsible for parsing the user’s query
and constructing an execution plan.  The initial goal is to support the
SQL-like language used by Dremel and Google BigQuery (
https://developers.google.com/bigquery/docs/query-reference), which we call
DrQL. However, Drill is designed to support other languages and programming
models, such as the Mongo Query Language (
http://www.mongodb.org/display/DOCS/Mongo+Query+Language), Cascading (
http://www.cascading.org/) or Plume (https://github.com/tdunning/Plume).
* Low-latency distributed execution engine: This layer is responsible for
executing the physical plan. It provides the scalability and fault
tolerance needed to efficiently query petabytes of data on 10,000 servers.
Drill’s execution engine is based on research in distributed execution
engines (eg, Dremel, Dryad, Hyracks, CIEL, Stratosphere) and columnar
storage, and can be extended with additional operators and connectors.
* Nested data formats: This layer is responsible for supporting various
data formats. The initial goal is to support the column-based format used
by Dremel. Drill is designed to support schema-based formats such as
Protocol Buffers/Dremel, Avro/AVRO-806/Trevni and CSV, and schema-less
formats such as JSON, BSON or YAML. In addition, it is designed to support
column-based formats such as Dremel, AVRO-806/Trevni and RCFile, and
row-based formats such as Protocol Buffers, Avro, JSON, BSON and CSV. A
particular distinction with Drill is that the execution engine is flexible
enough to support column-based processing as well as row-based processing.
This is important because column-based processing can be much more
efficient when the data is stored in a column-based format, but many large
data assets are stored in a row-based format that would require conversion
before use.
* Scalable data sources: This layer is responsible for supporting various
data sources. The initial focus is to leverage Hadoop as a data source.

It is worth noting that no open source project has successfully replicated
the capabilities of Dremel, nor have any taken on the broader goals of
flexibility (eg, pluggable query languages, data formats, data sources and
execution engine operators/connectors) that are part of Drill.

Initial Goals
=============
The initial goals for this project are to specify the detailed requirements
and architecture, and then develop the initial implementation including the
execution engine and DrQL.
Like Apache Hadoop, which was built to support multiple storage systems
(through the FileSystem API) and file formats (through the
InputFormat/OutputFormat APIs), Drill will be built to support multiple
query languages, data formats and data sources. The initial implementation
of Drill will support the DrQL and a column-based format similar to Dremel.

Current Status
==============
Significant work has been completed to identify the initial requirements
and define the overall system architecture. The next step is to implement
the four components described in the Rationale section, and we intend to do
that development as an Apache project.

Meritocracy
===========
We plan to invest in supporting a meritocracy. We will discuss the
requirements in an open forum. Several companies have already expressed
interest in this project, and we intend to invite additional developers to
participate. We will encourage and monitor community participation so that
privileges can be extended to those that contribute. Also, Drill has an
extensible/pluggable architecture that encourages developers to contribute
various extensions, such as query languages, data formats, data sources and
execution engine operators and connectors. While some companies will surely
develop commercial extensions, we also anticipate that some companies and
individuals will want to contribute such extensions back to the project,
and we look forward to fostering a rich ecosystem of extensions.

Community
=========
The need for a system for interactive analysis of large datasets in the
open source is tremendous, so there is a potential for a very large
community. We believe that Drill’s extensible architecture will further
encourage community participation. Also, related Apache projects (eg,
Hadoop) have very large and active communities, and we expect that over
time Drill will also attract a large community.

Core Developers
===============
The developers on the initial committers list include experienced
distributed systems engineers:
* Tomer Shiran has experience developing distributed execution engines. He
developed Parallel DataSeries, a data-parallel version of the open source
DataSeries system (http://tesla.hpl.hp.com/opensource/). He is also the
author of Applying Idealized Lower-bound Runtime Models to Understand
Inefficiencies in Data-intensive Computing (SIGMETRICS 2011). Tomer worked
as a software developer and researcher at IBM Research, Microsoft and HP
Labs, and is now at MapR Technologies. He has been active in the Hadoop
community since 2009.
* Jason Frantz was at Clustrix, where he designed and developed the first
scale-out SQL database based on MySQL. Jason developed the distributed
query optimizer that powered Clustrix. He is now a software engineer and
architect at MapR Technologies.
* Ted Dunning is a PMC member for Apache ZooKeeper and Apache Mahout, and
has a history of over 30 years of contributions to open source. He is now
at MapR Technologies. Ted has been very active in the Hadoop community
since the project’s early days.
* MC Srivas is the co-founder and CTO of MapR Technologies. While at Google
he worked on Google’s scalable search infrastructure. MC Srivas has been
active in the Hadoop community since 2009.
* Chris Wensel is the founder and CEO of Concurrent. Prior to founding
Concurrent, he developed Cascading, an Apache-licensed open source
application framework enabling Java developers to quickly and easily
develop robust Data Analytics and Data Management applications on Apache
Hadoop. Chris has been involved in the Hadoop community since the project's
early days.
* Keys Botzum was at IBM, where he worked on security and distributed
systems, and is currently at MapR Technologies.
* Gera Shegalov was at Oracle, where he worked on networking, storage and
database kernels, and is currently at MapR Technologies.
* Ryan Rawson is the VP Engineering of Drawn to Scale where he developed
Spire, a real-time operational database for Hadoop. He is also a committer
and PMC member for Apache HBase, and has a long history of contributions to
open source. Ryan has been involved in the Hadoop community since the
project's early days.

We realize that additional employer diversity is needed, and we will work
aggressively to recruit developers from additional companies.

Alignment
=========
The initial committers strongly believe that a system for interactive
analysis of large-scale datasets will gain broader adoption as an open
source, community driven project, where the community can contribute not
only to the core components, but also to a growing collection of query
languages and optimizers, data formats, data formats, and execution engine
operators and connectors. Drill will integrate closely with Apache Hadoop.
First, the data will live in Hadoop. That is, Drill will support Hadoop
FileSystem implementations and HBase. Second, Hadoop-related data formats
will be supported (eg, Apache Avro, RCFile). Third, MapReduce-based tools
will be provided to produce column-based formats. Fourth, Drill tables can
be registered in HCatalog. Finally, Hive is being considered as the basis
of the DrQL implementation.

Known Risks
===========

Orphaned Products
=================
The contributors are leading vendors in this space, with significant open
source experience, so the risk of being orphaned is relatively low. The
project could be at risk if vendors decided to change their strategies in
the market. In such an event, the current committers plan to continue
working on the project on their own time, though the progress will likely
be slower. We plan to mitigate this risk by recruiting additional
committers.

Inexperience with Open Source
=============================
The initial committers include veteran Apache members (committers and PMC
members) and other developers who have varying degrees of experience with
open source projects. All have been involved with source code that has been
released under an open source license, and several also have experience
developing code with an open source development process.

Homogenous Developers
=====================
The initial committers are employed by a number of companies, including
MapR Technologies, Concurrent and Drawn to Scale. We are committed to
recruiting additional committers from other companies.

Reliance on Salaried Developers
===============================
It is expected that Drill development will occur on both salaried time and
on volunteer time, after hours. The majority of initial committers are paid
by their employer to contribute to this project. However, they are all
passionate about the project, and we are confident that the project will
continue even if no salaried developers contribute to the project. We are
committed to recruiting additional committers including non-salaried
developers.

Relationships with Other Apache Products
========================================
As mentioned in the Alignment section, Drill is closely integrated with
Hadoop, Avro, Hive and HBase in a numerous ways. For example, Drill data
lives inside a Hadoop environment (Drill operates on in situ data). We look
forward to collaborating with those communities, as well as other Apache
communities.

An Excessive Fascination with the Apache Brand
==============================================
Drill solves a real problem that many organizations struggle with, and has
been proven within Google to be of significant value. The architecture is
based on academic and industry research. Our rationale for developing Drill
as an Apache project is detailed in the Rationale section. We believe that
the Apache brand and community process will help us attract more
contributors to this project, and help establish ubiquitous APIs. In
addition, establishing consensus among users and developers of a
Dremel-like tool is a key requirement for success of the project.

Documentation
=============
Drill is inspired by Google’s Dremel. Google has published a paper
highlighting Dremel’s innovative nested column-based data format and
execution engine: http://research.google.com/pubs/pub36632.html

High-level slides have been published by MapR: TODO

Initial Source
==============
There is no initial source code. All source code will be developed within
the Apache Incubator.

Cryptography
============
Drill will eventually support encryption on the wire. This is not one of
the initial goals, and we do not expect Drill to be a controlled export
item due to the use of encryption.

Required Resources
==================

Mailing List
============
* drill-private
* drill-dev
* drill-user

Subversion Directory
====================
Git is the preferred source control system: git://git.apache.org/drill

Issue Tracking
==============
JIRA Drill (DRILL)

Initial Committers
==================
* Tomer Shiran (tshiran at maprtech dot com)
* Ted Dunning (tdunning at apache dot org)
* Jason Frantz (jfrantz at maprtech dot com)
* MC Srivas (mcsrivas at maprtech dot com)
* Chris Wensel (chris and concurrentinc dot com)
* Keys Botzum (kbotzum at maprtech dot com)
* Gera Shegalov (gshegalov at maprtech dot com)
* Ryan Rawson (ryan at drawntoscale dot com)

Affiliations
============
The initial committers are employees of MapR Technologies, Drawn to Scale
and Concurrent. The nominated mentors are employees of MapR Technologies,
Lucid Imagination and Nokia.

Sponsors
========

Champion
========
Ted Dunning (tdunning at apache dot org)

Nominated Mentors
=================
* Ted Dunning (tdunning at apache dot org) – Chief Application Architect at
MapR Technologies, Committer for Lucene, Mahout and ZooKeeper.
* Grant Ingersoll (grant at lucidimagination dot com) – Chief Scientist at
Lucid Imagination, Committer for Lucene, Mahout and other projects.
* Isabel Drost (Isabel at apache dot org) – Software Developer at Nokia
Gate 5 GmbH, Committer for Lucene, Mahout and other projects.

Sponsoring Entity
=================
Incubator

Re: [PROPOSAL] Drill for the Apache Incubator

Posted by Ted Dunning <te...@gmail.com>.
The mailing list request is in infra's hands.

One of the better sources of information about Dremel is the BigQuery
documentation.  That says that the right side of a join must be < 8MB and
that the only outer join available is a left out join.

What Drill does is somewhat of a different question.

On Thu, Aug 16, 2012 at 12:18 AM, Tomer Shiran <ts...@maprtech.com> wrote:

> Yes, we plan to support joins.
>
> We are in the process of setting up the mailing lists.
>
> On Thu, Aug 16, 2012 at 12:09 AM, karthik tunga <karthik.tunga@gmail.com
> >wrote:
>
> > The proposal looks great. I was wondering what operations will drill
> > support ?
> > For example the dremel paper doesn't talk about joins, will drill support
> > joins ?
> >
> > Sorry if I missed it, is there a dev mailing list I could subscribe to ?
> >
> > Cheers,
> > Karthik
> >
> > On 13 August 2012 23:55, Bernd Fondermann <bernd.fondermann@gmail.com
> > >wrote:
> >
> > > great proposal and a very promising mentor lineup.
> > >
> > > Have fun,
> > >
> > >   Bernd
> > >
> > > On Thu, Aug 2, 2012 at 11:40 PM, Ted Dunning <td...@apache.org>
> > wrote:
> > > > Abstract
> > > > ========
> > > > Drill is a distributed system for interactive analysis of large-scale
> > > > datasets, inspired by Google’s Dremel (
> > > > http://research.google.com/pubs/pub36632.html).
> > > >
> > > > Proposal
> > > > ========
> > > > Drill is a distributed system for interactive analysis of large-scale
> > > > datasets. Drill is similar to Google’s Dremel, with the additional
> > > > flexibility needed to support a broader range of query languages,
> data
> > > > formats and data sources. It is designed to efficiently process
> nested
> > > > data. It is a design goal to scale to 10,000 servers or more and to
> be
> > > able
> > > > to process petabyes of data and trillions of records in seconds.
> > > >
> > > > Background
> > > > ==========
> > > > Many organizations have the need to run data-intensive applications,
> > > > including batch processing, stream processing and interactive
> analysis.
> > > In
> > > > recent years open source systems have emerged to address the need for
> > > > scalable batch processing (Apache Hadoop) and stream processing
> (Storm,
> > > > Apache S4). In 2010 Google published a paper called “Dremel:
> > Interactive
> > > > Analysis of Web-Scale Datasets,” describing a scalable system used
> > > > internally for interactive analysis of nested data. No open source
> > > project
> > > > has successfully replicated the capabilities of Dremel.
> > > >
> > > > Rationale
> > > > =========
> > > > There is a strong need in the market for low-latency interactive
> > analysis
> > > > of large-scale datasets, including nested data (eg, JSON, Avro,
> > Protocol
> > > > Buffers). This need was identified by Google and addressed internally
> > > with
> > > > a system called Dremel.
> > > >
> > > > In recent years open source systems have emerged to address the need
> > for
> > > > scalable batch processing (Apache Hadoop) and stream processing
> (Storm,
> > > > Apache S4). Apache Hadoop, originally inspired by Google’s internal
> > > > MapReduce system, is used by thousands of organizations processing
> > > > large-scale datasets. Apache Hadoop is designed to achieve very high
> > > > throughput, but is not designed to achieve the sub-second latency
> > needed
> > > > for interactive data analysis and exploration. Drill, inspired by
> > > Google’s
> > > > internal Dremel system, is intended to address this need.
> > > >
> > > > It is worth noting that, as explained by Google in the original
> paper,
> > > > Dremel complements MapReduce-based computing. Dremel is not intended
> > as a
> > > > replacement for MapReduce and is often used in conjunction with it to
> > > > analyze outputs of MapReduce pipelines or rapidly prototype larger
> > > > computations. Indeed, Dremel and MapReduce are both used by thousands
> > of
> > > > Google employees.
> > > >
> > > > Like Dremel, Drill supports a nested data model with data encoded in
> a
> > > > number of formats such as JSON, Avro or Protocol Buffers. In many
> > > > organizations nested data is the standard, so supporting a nested
> data
> > > > model eliminates the need to normalize the data. With that said, flat
> > > data
> > > > formats, such as CSV files, are naturally supported as a special case
> > of
> > > > nested data.
> > > >
> > > > The Drill architecture consists of four key components/layers:
> > > > * Query languages: This layer is responsible for parsing the user’s
> > query
> > > > and constructing an execution plan.  The initial goal is to support
> the
> > > > SQL-like language used by Dremel and Google BigQuery (
> > > > https://developers.google.com/bigquery/docs/query-reference), which
> we
> > > call
> > > > DrQL. However, Drill is designed to support other languages and
> > > programming
> > > > models, such as the Mongo Query Language (
> > > > http://www.mongodb.org/display/DOCS/Mongo+Query+Language),
> Cascading (
> > > > http://www.cascading.org/) or Plume (
> https://github.com/tdunning/Plume
> > ).
> > > > * Low-latency distributed execution engine: This layer is responsible
> > for
> > > > executing the physical plan. It provides the scalability and fault
> > > > tolerance needed to efficiently query petabytes of data on 10,000
> > > servers.
> > > > Drill’s execution engine is based on research in distributed
> execution
> > > > engines (eg, Dremel, Dryad, Hyracks, CIEL, Stratosphere) and columnar
> > > > storage, and can be extended with additional operators and
> connectors.
> > > > * Nested data formats: This layer is responsible for supporting
> various
> > > > data formats. The initial goal is to support the column-based format
> > used
> > > > by Dremel. Drill is designed to support schema-based formats such as
> > > > Protocol Buffers/Dremel, Avro/AVRO-806/Trevni and CSV, and
> schema-less
> > > > formats such as JSON, BSON or YAML. In addition, it is designed to
> > > support
> > > > column-based formats such as Dremel, AVRO-806/Trevni and RCFile, and
> > > > row-based formats such as Protocol Buffers, Avro, JSON, BSON and
> CSV. A
> > > > particular distinction with Drill is that the execution engine is
> > > flexible
> > > > enough to support column-based processing as well as row-based
> > > processing.
> > > > This is important because column-based processing can be much more
> > > > efficient when the data is stored in a column-based format, but many
> > > large
> > > > data assets are stored in a row-based format that would require
> > > conversion
> > > > before use.
> > > > * Scalable data sources: This layer is responsible for supporting
> > various
> > > > data sources. The initial focus is to leverage Hadoop as a data
> source.
> > > >
> > > > It is worth noting that no open source project has successfully
> > > replicated
> > > > the capabilities of Dremel, nor have any taken on the broader goals
> of
> > > > flexibility (eg, pluggable query languages, data formats, data
> sources
> > > and
> > > > execution engine operators/connectors) that are part of Drill.
> > > >
> > > > Initial Goals
> > > > =============
> > > > The initial goals for this project are to specify the detailed
> > > requirements
> > > > and architecture, and then develop the initial implementation
> including
> > > the
> > > > execution engine and DrQL.
> > > > Like Apache Hadoop, which was built to support multiple storage
> systems
> > > > (through the FileSystem API) and file formats (through the
> > > > InputFormat/OutputFormat APIs), Drill will be built to support
> multiple
> > > > query languages, data formats and data sources. The initial
> > > implementation
> > > > of Drill will support the DrQL and a column-based format similar to
> > > Dremel.
> > > >
> > > > Current Status
> > > > ==============
> > > > Significant work has been completed to identify the initial
> > requirements
> > > > and define the overall system architecture. The next step is to
> > implement
> > > > the four components described in the Rationale section, and we intend
> > to
> > > do
> > > > that development as an Apache project.
> > > >
> > > > Meritocracy
> > > > ===========
> > > > We plan to invest in supporting a meritocracy. We will discuss the
> > > > requirements in an open forum. Several companies have already
> expressed
> > > > interest in this project, and we intend to invite additional
> developers
> > > to
> > > > participate. We will encourage and monitor community participation so
> > > that
> > > > privileges can be extended to those that contribute. Also, Drill has
> an
> > > > extensible/pluggable architecture that encourages developers to
> > > contribute
> > > > various extensions, such as query languages, data formats, data
> sources
> > > and
> > > > execution engine operators and connectors. While some companies will
> > > surely
> > > > develop commercial extensions, we also anticipate that some companies
> > and
> > > > individuals will want to contribute such extensions back to the
> > project,
> > > > and we look forward to fostering a rich ecosystem of extensions.
> > > >
> > > > Community
> > > > =========
> > > > The need for a system for interactive analysis of large datasets in
> the
> > > > open source is tremendous, so there is a potential for a very large
> > > > community. We believe that Drill’s extensible architecture will
> further
> > > > encourage community participation. Also, related Apache projects (eg,
> > > > Hadoop) have very large and active communities, and we expect that
> over
> > > > time Drill will also attract a large community.
> > > >
> > > > Core Developers
> > > > ===============
> > > > The developers on the initial committers list include experienced
> > > > distributed systems engineers:
> > > > * Tomer Shiran has experience developing distributed execution
> engines.
> > > He
> > > > developed Parallel DataSeries, a data-parallel version of the open
> > source
> > > > DataSeries system (http://tesla.hpl.hp.com/opensource/). He is also
> > the
> > > > author of Applying Idealized Lower-bound Runtime Models to Understand
> > > > Inefficiencies in Data-intensive Computing (SIGMETRICS 2011). Tomer
> > > worked
> > > > as a software developer and researcher at IBM Research, Microsoft and
> > HP
> > > > Labs, and is now at MapR Technologies. He has been active in the
> Hadoop
> > > > community since 2009.
> > > > * Jason Frantz was at Clustrix, where he designed and developed the
> > first
> > > > scale-out SQL database based on MySQL. Jason developed the
> distributed
> > > > query optimizer that powered Clustrix. He is now a software engineer
> > and
> > > > architect at MapR Technologies.
> > > > * Ted Dunning is a PMC member for Apache ZooKeeper and Apache Mahout,
> > and
> > > > has a history of over 30 years of contributions to open source. He is
> > now
> > > > at MapR Technologies. Ted has been very active in the Hadoop
> community
> > > > since the project’s early days.
> > > > * MC Srivas is the co-founder and CTO of MapR Technologies. While at
> > > Google
> > > > he worked on Google’s scalable search infrastructure. MC Srivas has
> > been
> > > > active in the Hadoop community since 2009.
> > > > * Chris Wensel is the founder and CEO of Concurrent. Prior to
> founding
> > > > Concurrent, he developed Cascading, an Apache-licensed open source
> > > > application framework enabling Java developers to quickly and easily
> > > > develop robust Data Analytics and Data Management applications on
> > Apache
> > > > Hadoop. Chris has been involved in the Hadoop community since the
> > > project's
> > > > early days.
> > > > * Keys Botzum was at IBM, where he worked on security and distributed
> > > > systems, and is currently at MapR Technologies.
> > > > * Gera Shegalov was at Oracle, where he worked on networking, storage
> > and
> > > > database kernels, and is currently at MapR Technologies.
> > > > * Ryan Rawson is the VP Engineering of Drawn to Scale where he
> > developed
> > > > Spire, a real-time operational database for Hadoop. He is also a
> > > committer
> > > > and PMC member for Apache HBase, and has a long history of
> > contributions
> > > to
> > > > open source. Ryan has been involved in the Hadoop community since the
> > > > project's early days.
> > > >
> > > > We realize that additional employer diversity is needed, and we will
> > work
> > > > aggressively to recruit developers from additional companies.
> > > >
> > > > Alignment
> > > > =========
> > > > The initial committers strongly believe that a system for interactive
> > > > analysis of large-scale datasets will gain broader adoption as an
> open
> > > > source, community driven project, where the community can contribute
> > not
> > > > only to the core components, but also to a growing collection of
> query
> > > > languages and optimizers, data formats, data formats, and execution
> > > engine
> > > > operators and connectors. Drill will integrate closely with Apache
> > > Hadoop.
> > > > First, the data will live in Hadoop. That is, Drill will support
> Hadoop
> > > > FileSystem implementations and HBase. Second, Hadoop-related data
> > formats
> > > > will be supported (eg, Apache Avro, RCFile). Third, MapReduce-based
> > tools
> > > > will be provided to produce column-based formats. Fourth, Drill
> tables
> > > can
> > > > be registered in HCatalog. Finally, Hive is being considered as the
> > basis
> > > > of the DrQL implementation.
> > > >
> > > > Known Risks
> > > > ===========
> > > >
> > > > Orphaned Products
> > > > =================
> > > > The contributors are leading vendors in this space, with significant
> > open
> > > > source experience, so the risk of being orphaned is relatively low.
> The
> > > > project could be at risk if vendors decided to change their
> strategies
> > in
> > > > the market. In such an event, the current committers plan to continue
> > > > working on the project on their own time, though the progress will
> > likely
> > > > be slower. We plan to mitigate this risk by recruiting additional
> > > > committers.
> > > >
> > > > Inexperience with Open Source
> > > > =============================
> > > > The initial committers include veteran Apache members (committers and
> > PMC
> > > > members) and other developers who have varying degrees of experience
> > with
> > > > open source projects. All have been involved with source code that
> has
> > > been
> > > > released under an open source license, and several also have
> experience
> > > > developing code with an open source development process.
> > > >
> > > > Homogenous Developers
> > > > =====================
> > > > The initial committers are employed by a number of companies,
> including
> > > > MapR Technologies, Concurrent and Drawn to Scale. We are committed to
> > > > recruiting additional committers from other companies.
> > > >
> > > > Reliance on Salaried Developers
> > > > ===============================
> > > > It is expected that Drill development will occur on both salaried
> time
> > > and
> > > > on volunteer time, after hours. The majority of initial committers
> are
> > > paid
> > > > by their employer to contribute to this project. However, they are
> all
> > > > passionate about the project, and we are confident that the project
> > will
> > > > continue even if no salaried developers contribute to the project. We
> > are
> > > > committed to recruiting additional committers including non-salaried
> > > > developers.
> > > >
> > > > Relationships with Other Apache Products
> > > > ========================================
> > > > As mentioned in the Alignment section, Drill is closely integrated
> with
> > > > Hadoop, Avro, Hive and HBase in a numerous ways. For example, Drill
> > data
> > > > lives inside a Hadoop environment (Drill operates on in situ data).
> We
> > > look
> > > > forward to collaborating with those communities, as well as other
> > Apache
> > > > communities.
> > > >
> > > > An Excessive Fascination with the Apache Brand
> > > > ==============================================
> > > > Drill solves a real problem that many organizations struggle with,
> and
> > > has
> > > > been proven within Google to be of significant value. The
> architecture
> > is
> > > > based on academic and industry research. Our rationale for developing
> > > Drill
> > > > as an Apache project is detailed in the Rationale section. We believe
> > > that
> > > > the Apache brand and community process will help us attract more
> > > > contributors to this project, and help establish ubiquitous APIs. In
> > > > addition, establishing consensus among users and developers of a
> > > > Dremel-like tool is a key requirement for success of the project.
> > > >
> > > > Documentation
> > > > =============
> > > > Drill is inspired by Google’s Dremel. Google has published a paper
> > > > highlighting Dremel’s innovative nested column-based data format and
> > > > execution engine: http://research.google.com/pubs/pub36632.html
> > > >
> > > > High-level slides have been published by MapR: TODO
> > > >
> > > > Initial Source
> > > > ==============
> > > > There is no initial source code. All source code will be developed
> > within
> > > > the Apache Incubator.
> > > >
> > > > Cryptography
> > > > ============
> > > > Drill will eventually support encryption on the wire. This is not one
> > of
> > > > the initial goals, and we do not expect Drill to be a controlled
> export
> > > > item due to the use of encryption.
> > > >
> > > > Required Resources
> > > > ==================
> > > >
> > > > Mailing List
> > > > ============
> > > > * drill-private
> > > > * drill-dev
> > > > * drill-user
> > > >
> > > > Subversion Directory
> > > > ====================
> > > > Git is the preferred source control system: git://
> git.apache.org/drill
> > > >
> > > > Issue Tracking
> > > > ==============
> > > > JIRA Drill (DRILL)
> > > >
> > > > Initial Committers
> > > > ==================
> > > > * Tomer Shiran (tshiran at maprtech dot com)
> > > > * Ted Dunning (tdunning at apache dot org)
> > > > * Jason Frantz (jfrantz at maprtech dot com)
> > > > * MC Srivas (mcsrivas at maprtech dot com)
> > > > * Chris Wensel (chris and concurrentinc dot com)
> > > > * Keys Botzum (kbotzum at maprtech dot com)
> > > > * Gera Shegalov (gshegalov at maprtech dot com)
> > > > * Ryan Rawson (ryan at drawntoscale dot com)
> > > >
> > > > Affiliations
> > > > ============
> > > > The initial committers are employees of MapR Technologies, Drawn to
> > Scale
> > > > and Concurrent. The nominated mentors are employees of MapR
> > Technologies,
> > > > Lucid Imagination and Nokia.
> > > >
> > > > Sponsors
> > > > ========
> > > >
> > > > Champion
> > > > ========
> > > > Ted Dunning (tdunning at apache dot org)
> > > >
> > > > Nominated Mentors
> > > > =================
> > > > * Ted Dunning (tdunning at apache dot org) – Chief Application
> > Architect
> > > at
> > > > MapR Technologies, Committer for Lucene, Mahout and ZooKeeper.
> > > > * Grant Ingersoll (grant at lucidimagination dot com) – Chief
> Scientist
> > > at
> > > > Lucid Imagination, Committer for Lucene, Mahout and other projects.
> > > > * Isabel Drost (Isabel at apache dot org) – Software Developer at
> Nokia
> > > > Gate 5 GmbH, Committer for Lucene, Mahout and other projects.
> > > >
> > > > Sponsoring Entity
> > > > =================
> > > > Incubator
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
> > > For additional commands, e-mail: general-help@incubator.apache.org
> > >
> > >
> >
>
>
>
> --
> Tomer Shiran
> Director of Product Management | MapR Technologies | 650-804-8657
>

Re: [PROPOSAL] Drill for the Apache Incubator

Posted by Tomer Shiran <ts...@maprtech.com>.
Yes, we plan to support joins.

We are in the process of setting up the mailing lists.

On Thu, Aug 16, 2012 at 12:09 AM, karthik tunga <ka...@gmail.com>wrote:

> The proposal looks great. I was wondering what operations will drill
> support ?
> For example the dremel paper doesn't talk about joins, will drill support
> joins ?
>
> Sorry if I missed it, is there a dev mailing list I could subscribe to ?
>
> Cheers,
> Karthik
>
> On 13 August 2012 23:55, Bernd Fondermann <bernd.fondermann@gmail.com
> >wrote:
>
> > great proposal and a very promising mentor lineup.
> >
> > Have fun,
> >
> >   Bernd
> >
> > On Thu, Aug 2, 2012 at 11:40 PM, Ted Dunning <td...@apache.org>
> wrote:
> > > Abstract
> > > ========
> > > Drill is a distributed system for interactive analysis of large-scale
> > > datasets, inspired by Google’s Dremel (
> > > http://research.google.com/pubs/pub36632.html).
> > >
> > > Proposal
> > > ========
> > > Drill is a distributed system for interactive analysis of large-scale
> > > datasets. Drill is similar to Google’s Dremel, with the additional
> > > flexibility needed to support a broader range of query languages, data
> > > formats and data sources. It is designed to efficiently process nested
> > > data. It is a design goal to scale to 10,000 servers or more and to be
> > able
> > > to process petabyes of data and trillions of records in seconds.
> > >
> > > Background
> > > ==========
> > > Many organizations have the need to run data-intensive applications,
> > > including batch processing, stream processing and interactive analysis.
> > In
> > > recent years open source systems have emerged to address the need for
> > > scalable batch processing (Apache Hadoop) and stream processing (Storm,
> > > Apache S4). In 2010 Google published a paper called “Dremel:
> Interactive
> > > Analysis of Web-Scale Datasets,” describing a scalable system used
> > > internally for interactive analysis of nested data. No open source
> > project
> > > has successfully replicated the capabilities of Dremel.
> > >
> > > Rationale
> > > =========
> > > There is a strong need in the market for low-latency interactive
> analysis
> > > of large-scale datasets, including nested data (eg, JSON, Avro,
> Protocol
> > > Buffers). This need was identified by Google and addressed internally
> > with
> > > a system called Dremel.
> > >
> > > In recent years open source systems have emerged to address the need
> for
> > > scalable batch processing (Apache Hadoop) and stream processing (Storm,
> > > Apache S4). Apache Hadoop, originally inspired by Google’s internal
> > > MapReduce system, is used by thousands of organizations processing
> > > large-scale datasets. Apache Hadoop is designed to achieve very high
> > > throughput, but is not designed to achieve the sub-second latency
> needed
> > > for interactive data analysis and exploration. Drill, inspired by
> > Google’s
> > > internal Dremel system, is intended to address this need.
> > >
> > > It is worth noting that, as explained by Google in the original paper,
> > > Dremel complements MapReduce-based computing. Dremel is not intended
> as a
> > > replacement for MapReduce and is often used in conjunction with it to
> > > analyze outputs of MapReduce pipelines or rapidly prototype larger
> > > computations. Indeed, Dremel and MapReduce are both used by thousands
> of
> > > Google employees.
> > >
> > > Like Dremel, Drill supports a nested data model with data encoded in a
> > > number of formats such as JSON, Avro or Protocol Buffers. In many
> > > organizations nested data is the standard, so supporting a nested data
> > > model eliminates the need to normalize the data. With that said, flat
> > data
> > > formats, such as CSV files, are naturally supported as a special case
> of
> > > nested data.
> > >
> > > The Drill architecture consists of four key components/layers:
> > > * Query languages: This layer is responsible for parsing the user’s
> query
> > > and constructing an execution plan.  The initial goal is to support the
> > > SQL-like language used by Dremel and Google BigQuery (
> > > https://developers.google.com/bigquery/docs/query-reference), which we
> > call
> > > DrQL. However, Drill is designed to support other languages and
> > programming
> > > models, such as the Mongo Query Language (
> > > http://www.mongodb.org/display/DOCS/Mongo+Query+Language), Cascading (
> > > http://www.cascading.org/) or Plume (https://github.com/tdunning/Plume
> ).
> > > * Low-latency distributed execution engine: This layer is responsible
> for
> > > executing the physical plan. It provides the scalability and fault
> > > tolerance needed to efficiently query petabytes of data on 10,000
> > servers.
> > > Drill’s execution engine is based on research in distributed execution
> > > engines (eg, Dremel, Dryad, Hyracks, CIEL, Stratosphere) and columnar
> > > storage, and can be extended with additional operators and connectors.
> > > * Nested data formats: This layer is responsible for supporting various
> > > data formats. The initial goal is to support the column-based format
> used
> > > by Dremel. Drill is designed to support schema-based formats such as
> > > Protocol Buffers/Dremel, Avro/AVRO-806/Trevni and CSV, and schema-less
> > > formats such as JSON, BSON or YAML. In addition, it is designed to
> > support
> > > column-based formats such as Dremel, AVRO-806/Trevni and RCFile, and
> > > row-based formats such as Protocol Buffers, Avro, JSON, BSON and CSV. A
> > > particular distinction with Drill is that the execution engine is
> > flexible
> > > enough to support column-based processing as well as row-based
> > processing.
> > > This is important because column-based processing can be much more
> > > efficient when the data is stored in a column-based format, but many
> > large
> > > data assets are stored in a row-based format that would require
> > conversion
> > > before use.
> > > * Scalable data sources: This layer is responsible for supporting
> various
> > > data sources. The initial focus is to leverage Hadoop as a data source.
> > >
> > > It is worth noting that no open source project has successfully
> > replicated
> > > the capabilities of Dremel, nor have any taken on the broader goals of
> > > flexibility (eg, pluggable query languages, data formats, data sources
> > and
> > > execution engine operators/connectors) that are part of Drill.
> > >
> > > Initial Goals
> > > =============
> > > The initial goals for this project are to specify the detailed
> > requirements
> > > and architecture, and then develop the initial implementation including
> > the
> > > execution engine and DrQL.
> > > Like Apache Hadoop, which was built to support multiple storage systems
> > > (through the FileSystem API) and file formats (through the
> > > InputFormat/OutputFormat APIs), Drill will be built to support multiple
> > > query languages, data formats and data sources. The initial
> > implementation
> > > of Drill will support the DrQL and a column-based format similar to
> > Dremel.
> > >
> > > Current Status
> > > ==============
> > > Significant work has been completed to identify the initial
> requirements
> > > and define the overall system architecture. The next step is to
> implement
> > > the four components described in the Rationale section, and we intend
> to
> > do
> > > that development as an Apache project.
> > >
> > > Meritocracy
> > > ===========
> > > We plan to invest in supporting a meritocracy. We will discuss the
> > > requirements in an open forum. Several companies have already expressed
> > > interest in this project, and we intend to invite additional developers
> > to
> > > participate. We will encourage and monitor community participation so
> > that
> > > privileges can be extended to those that contribute. Also, Drill has an
> > > extensible/pluggable architecture that encourages developers to
> > contribute
> > > various extensions, such as query languages, data formats, data sources
> > and
> > > execution engine operators and connectors. While some companies will
> > surely
> > > develop commercial extensions, we also anticipate that some companies
> and
> > > individuals will want to contribute such extensions back to the
> project,
> > > and we look forward to fostering a rich ecosystem of extensions.
> > >
> > > Community
> > > =========
> > > The need for a system for interactive analysis of large datasets in the
> > > open source is tremendous, so there is a potential for a very large
> > > community. We believe that Drill’s extensible architecture will further
> > > encourage community participation. Also, related Apache projects (eg,
> > > Hadoop) have very large and active communities, and we expect that over
> > > time Drill will also attract a large community.
> > >
> > > Core Developers
> > > ===============
> > > The developers on the initial committers list include experienced
> > > distributed systems engineers:
> > > * Tomer Shiran has experience developing distributed execution engines.
> > He
> > > developed Parallel DataSeries, a data-parallel version of the open
> source
> > > DataSeries system (http://tesla.hpl.hp.com/opensource/). He is also
> the
> > > author of Applying Idealized Lower-bound Runtime Models to Understand
> > > Inefficiencies in Data-intensive Computing (SIGMETRICS 2011). Tomer
> > worked
> > > as a software developer and researcher at IBM Research, Microsoft and
> HP
> > > Labs, and is now at MapR Technologies. He has been active in the Hadoop
> > > community since 2009.
> > > * Jason Frantz was at Clustrix, where he designed and developed the
> first
> > > scale-out SQL database based on MySQL. Jason developed the distributed
> > > query optimizer that powered Clustrix. He is now a software engineer
> and
> > > architect at MapR Technologies.
> > > * Ted Dunning is a PMC member for Apache ZooKeeper and Apache Mahout,
> and
> > > has a history of over 30 years of contributions to open source. He is
> now
> > > at MapR Technologies. Ted has been very active in the Hadoop community
> > > since the project’s early days.
> > > * MC Srivas is the co-founder and CTO of MapR Technologies. While at
> > Google
> > > he worked on Google’s scalable search infrastructure. MC Srivas has
> been
> > > active in the Hadoop community since 2009.
> > > * Chris Wensel is the founder and CEO of Concurrent. Prior to founding
> > > Concurrent, he developed Cascading, an Apache-licensed open source
> > > application framework enabling Java developers to quickly and easily
> > > develop robust Data Analytics and Data Management applications on
> Apache
> > > Hadoop. Chris has been involved in the Hadoop community since the
> > project's
> > > early days.
> > > * Keys Botzum was at IBM, where he worked on security and distributed
> > > systems, and is currently at MapR Technologies.
> > > * Gera Shegalov was at Oracle, where he worked on networking, storage
> and
> > > database kernels, and is currently at MapR Technologies.
> > > * Ryan Rawson is the VP Engineering of Drawn to Scale where he
> developed
> > > Spire, a real-time operational database for Hadoop. He is also a
> > committer
> > > and PMC member for Apache HBase, and has a long history of
> contributions
> > to
> > > open source. Ryan has been involved in the Hadoop community since the
> > > project's early days.
> > >
> > > We realize that additional employer diversity is needed, and we will
> work
> > > aggressively to recruit developers from additional companies.
> > >
> > > Alignment
> > > =========
> > > The initial committers strongly believe that a system for interactive
> > > analysis of large-scale datasets will gain broader adoption as an open
> > > source, community driven project, where the community can contribute
> not
> > > only to the core components, but also to a growing collection of query
> > > languages and optimizers, data formats, data formats, and execution
> > engine
> > > operators and connectors. Drill will integrate closely with Apache
> > Hadoop.
> > > First, the data will live in Hadoop. That is, Drill will support Hadoop
> > > FileSystem implementations and HBase. Second, Hadoop-related data
> formats
> > > will be supported (eg, Apache Avro, RCFile). Third, MapReduce-based
> tools
> > > will be provided to produce column-based formats. Fourth, Drill tables
> > can
> > > be registered in HCatalog. Finally, Hive is being considered as the
> basis
> > > of the DrQL implementation.
> > >
> > > Known Risks
> > > ===========
> > >
> > > Orphaned Products
> > > =================
> > > The contributors are leading vendors in this space, with significant
> open
> > > source experience, so the risk of being orphaned is relatively low. The
> > > project could be at risk if vendors decided to change their strategies
> in
> > > the market. In such an event, the current committers plan to continue
> > > working on the project on their own time, though the progress will
> likely
> > > be slower. We plan to mitigate this risk by recruiting additional
> > > committers.
> > >
> > > Inexperience with Open Source
> > > =============================
> > > The initial committers include veteran Apache members (committers and
> PMC
> > > members) and other developers who have varying degrees of experience
> with
> > > open source projects. All have been involved with source code that has
> > been
> > > released under an open source license, and several also have experience
> > > developing code with an open source development process.
> > >
> > > Homogenous Developers
> > > =====================
> > > The initial committers are employed by a number of companies, including
> > > MapR Technologies, Concurrent and Drawn to Scale. We are committed to
> > > recruiting additional committers from other companies.
> > >
> > > Reliance on Salaried Developers
> > > ===============================
> > > It is expected that Drill development will occur on both salaried time
> > and
> > > on volunteer time, after hours. The majority of initial committers are
> > paid
> > > by their employer to contribute to this project. However, they are all
> > > passionate about the project, and we are confident that the project
> will
> > > continue even if no salaried developers contribute to the project. We
> are
> > > committed to recruiting additional committers including non-salaried
> > > developers.
> > >
> > > Relationships with Other Apache Products
> > > ========================================
> > > As mentioned in the Alignment section, Drill is closely integrated with
> > > Hadoop, Avro, Hive and HBase in a numerous ways. For example, Drill
> data
> > > lives inside a Hadoop environment (Drill operates on in situ data). We
> > look
> > > forward to collaborating with those communities, as well as other
> Apache
> > > communities.
> > >
> > > An Excessive Fascination with the Apache Brand
> > > ==============================================
> > > Drill solves a real problem that many organizations struggle with, and
> > has
> > > been proven within Google to be of significant value. The architecture
> is
> > > based on academic and industry research. Our rationale for developing
> > Drill
> > > as an Apache project is detailed in the Rationale section. We believe
> > that
> > > the Apache brand and community process will help us attract more
> > > contributors to this project, and help establish ubiquitous APIs. In
> > > addition, establishing consensus among users and developers of a
> > > Dremel-like tool is a key requirement for success of the project.
> > >
> > > Documentation
> > > =============
> > > Drill is inspired by Google’s Dremel. Google has published a paper
> > > highlighting Dremel’s innovative nested column-based data format and
> > > execution engine: http://research.google.com/pubs/pub36632.html
> > >
> > > High-level slides have been published by MapR: TODO
> > >
> > > Initial Source
> > > ==============
> > > There is no initial source code. All source code will be developed
> within
> > > the Apache Incubator.
> > >
> > > Cryptography
> > > ============
> > > Drill will eventually support encryption on the wire. This is not one
> of
> > > the initial goals, and we do not expect Drill to be a controlled export
> > > item due to the use of encryption.
> > >
> > > Required Resources
> > > ==================
> > >
> > > Mailing List
> > > ============
> > > * drill-private
> > > * drill-dev
> > > * drill-user
> > >
> > > Subversion Directory
> > > ====================
> > > Git is the preferred source control system: git://git.apache.org/drill
> > >
> > > Issue Tracking
> > > ==============
> > > JIRA Drill (DRILL)
> > >
> > > Initial Committers
> > > ==================
> > > * Tomer Shiran (tshiran at maprtech dot com)
> > > * Ted Dunning (tdunning at apache dot org)
> > > * Jason Frantz (jfrantz at maprtech dot com)
> > > * MC Srivas (mcsrivas at maprtech dot com)
> > > * Chris Wensel (chris and concurrentinc dot com)
> > > * Keys Botzum (kbotzum at maprtech dot com)
> > > * Gera Shegalov (gshegalov at maprtech dot com)
> > > * Ryan Rawson (ryan at drawntoscale dot com)
> > >
> > > Affiliations
> > > ============
> > > The initial committers are employees of MapR Technologies, Drawn to
> Scale
> > > and Concurrent. The nominated mentors are employees of MapR
> Technologies,
> > > Lucid Imagination and Nokia.
> > >
> > > Sponsors
> > > ========
> > >
> > > Champion
> > > ========
> > > Ted Dunning (tdunning at apache dot org)
> > >
> > > Nominated Mentors
> > > =================
> > > * Ted Dunning (tdunning at apache dot org) – Chief Application
> Architect
> > at
> > > MapR Technologies, Committer for Lucene, Mahout and ZooKeeper.
> > > * Grant Ingersoll (grant at lucidimagination dot com) – Chief Scientist
> > at
> > > Lucid Imagination, Committer for Lucene, Mahout and other projects.
> > > * Isabel Drost (Isabel at apache dot org) – Software Developer at Nokia
> > > Gate 5 GmbH, Committer for Lucene, Mahout and other projects.
> > >
> > > Sponsoring Entity
> > > =================
> > > Incubator
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
> > For additional commands, e-mail: general-help@incubator.apache.org
> >
> >
>



-- 
Tomer Shiran
Director of Product Management | MapR Technologies | 650-804-8657

Re: [PROPOSAL] Drill for the Apache Incubator

Posted by karthik tunga <ka...@gmail.com>.
The proposal looks great. I was wondering what operations will drill
support ?
For example the dremel paper doesn't talk about joins, will drill support
joins ?

Sorry if I missed it, is there a dev mailing list I could subscribe to ?

Cheers,
Karthik

On 13 August 2012 23:55, Bernd Fondermann <be...@gmail.com>wrote:

> great proposal and a very promising mentor lineup.
>
> Have fun,
>
>   Bernd
>
> On Thu, Aug 2, 2012 at 11:40 PM, Ted Dunning <td...@apache.org> wrote:
> > Abstract
> > ========
> > Drill is a distributed system for interactive analysis of large-scale
> > datasets, inspired by Google’s Dremel (
> > http://research.google.com/pubs/pub36632.html).
> >
> > Proposal
> > ========
> > Drill is a distributed system for interactive analysis of large-scale
> > datasets. Drill is similar to Google’s Dremel, with the additional
> > flexibility needed to support a broader range of query languages, data
> > formats and data sources. It is designed to efficiently process nested
> > data. It is a design goal to scale to 10,000 servers or more and to be
> able
> > to process petabyes of data and trillions of records in seconds.
> >
> > Background
> > ==========
> > Many organizations have the need to run data-intensive applications,
> > including batch processing, stream processing and interactive analysis.
> In
> > recent years open source systems have emerged to address the need for
> > scalable batch processing (Apache Hadoop) and stream processing (Storm,
> > Apache S4). In 2010 Google published a paper called “Dremel: Interactive
> > Analysis of Web-Scale Datasets,” describing a scalable system used
> > internally for interactive analysis of nested data. No open source
> project
> > has successfully replicated the capabilities of Dremel.
> >
> > Rationale
> > =========
> > There is a strong need in the market for low-latency interactive analysis
> > of large-scale datasets, including nested data (eg, JSON, Avro, Protocol
> > Buffers). This need was identified by Google and addressed internally
> with
> > a system called Dremel.
> >
> > In recent years open source systems have emerged to address the need for
> > scalable batch processing (Apache Hadoop) and stream processing (Storm,
> > Apache S4). Apache Hadoop, originally inspired by Google’s internal
> > MapReduce system, is used by thousands of organizations processing
> > large-scale datasets. Apache Hadoop is designed to achieve very high
> > throughput, but is not designed to achieve the sub-second latency needed
> > for interactive data analysis and exploration. Drill, inspired by
> Google’s
> > internal Dremel system, is intended to address this need.
> >
> > It is worth noting that, as explained by Google in the original paper,
> > Dremel complements MapReduce-based computing. Dremel is not intended as a
> > replacement for MapReduce and is often used in conjunction with it to
> > analyze outputs of MapReduce pipelines or rapidly prototype larger
> > computations. Indeed, Dremel and MapReduce are both used by thousands of
> > Google employees.
> >
> > Like Dremel, Drill supports a nested data model with data encoded in a
> > number of formats such as JSON, Avro or Protocol Buffers. In many
> > organizations nested data is the standard, so supporting a nested data
> > model eliminates the need to normalize the data. With that said, flat
> data
> > formats, such as CSV files, are naturally supported as a special case of
> > nested data.
> >
> > The Drill architecture consists of four key components/layers:
> > * Query languages: This layer is responsible for parsing the user’s query
> > and constructing an execution plan.  The initial goal is to support the
> > SQL-like language used by Dremel and Google BigQuery (
> > https://developers.google.com/bigquery/docs/query-reference), which we
> call
> > DrQL. However, Drill is designed to support other languages and
> programming
> > models, such as the Mongo Query Language (
> > http://www.mongodb.org/display/DOCS/Mongo+Query+Language), Cascading (
> > http://www.cascading.org/) or Plume (https://github.com/tdunning/Plume).
> > * Low-latency distributed execution engine: This layer is responsible for
> > executing the physical plan. It provides the scalability and fault
> > tolerance needed to efficiently query petabytes of data on 10,000
> servers.
> > Drill’s execution engine is based on research in distributed execution
> > engines (eg, Dremel, Dryad, Hyracks, CIEL, Stratosphere) and columnar
> > storage, and can be extended with additional operators and connectors.
> > * Nested data formats: This layer is responsible for supporting various
> > data formats. The initial goal is to support the column-based format used
> > by Dremel. Drill is designed to support schema-based formats such as
> > Protocol Buffers/Dremel, Avro/AVRO-806/Trevni and CSV, and schema-less
> > formats such as JSON, BSON or YAML. In addition, it is designed to
> support
> > column-based formats such as Dremel, AVRO-806/Trevni and RCFile, and
> > row-based formats such as Protocol Buffers, Avro, JSON, BSON and CSV. A
> > particular distinction with Drill is that the execution engine is
> flexible
> > enough to support column-based processing as well as row-based
> processing.
> > This is important because column-based processing can be much more
> > efficient when the data is stored in a column-based format, but many
> large
> > data assets are stored in a row-based format that would require
> conversion
> > before use.
> > * Scalable data sources: This layer is responsible for supporting various
> > data sources. The initial focus is to leverage Hadoop as a data source.
> >
> > It is worth noting that no open source project has successfully
> replicated
> > the capabilities of Dremel, nor have any taken on the broader goals of
> > flexibility (eg, pluggable query languages, data formats, data sources
> and
> > execution engine operators/connectors) that are part of Drill.
> >
> > Initial Goals
> > =============
> > The initial goals for this project are to specify the detailed
> requirements
> > and architecture, and then develop the initial implementation including
> the
> > execution engine and DrQL.
> > Like Apache Hadoop, which was built to support multiple storage systems
> > (through the FileSystem API) and file formats (through the
> > InputFormat/OutputFormat APIs), Drill will be built to support multiple
> > query languages, data formats and data sources. The initial
> implementation
> > of Drill will support the DrQL and a column-based format similar to
> Dremel.
> >
> > Current Status
> > ==============
> > Significant work has been completed to identify the initial requirements
> > and define the overall system architecture. The next step is to implement
> > the four components described in the Rationale section, and we intend to
> do
> > that development as an Apache project.
> >
> > Meritocracy
> > ===========
> > We plan to invest in supporting a meritocracy. We will discuss the
> > requirements in an open forum. Several companies have already expressed
> > interest in this project, and we intend to invite additional developers
> to
> > participate. We will encourage and monitor community participation so
> that
> > privileges can be extended to those that contribute. Also, Drill has an
> > extensible/pluggable architecture that encourages developers to
> contribute
> > various extensions, such as query languages, data formats, data sources
> and
> > execution engine operators and connectors. While some companies will
> surely
> > develop commercial extensions, we also anticipate that some companies and
> > individuals will want to contribute such extensions back to the project,
> > and we look forward to fostering a rich ecosystem of extensions.
> >
> > Community
> > =========
> > The need for a system for interactive analysis of large datasets in the
> > open source is tremendous, so there is a potential for a very large
> > community. We believe that Drill’s extensible architecture will further
> > encourage community participation. Also, related Apache projects (eg,
> > Hadoop) have very large and active communities, and we expect that over
> > time Drill will also attract a large community.
> >
> > Core Developers
> > ===============
> > The developers on the initial committers list include experienced
> > distributed systems engineers:
> > * Tomer Shiran has experience developing distributed execution engines.
> He
> > developed Parallel DataSeries, a data-parallel version of the open source
> > DataSeries system (http://tesla.hpl.hp.com/opensource/). He is also the
> > author of Applying Idealized Lower-bound Runtime Models to Understand
> > Inefficiencies in Data-intensive Computing (SIGMETRICS 2011). Tomer
> worked
> > as a software developer and researcher at IBM Research, Microsoft and HP
> > Labs, and is now at MapR Technologies. He has been active in the Hadoop
> > community since 2009.
> > * Jason Frantz was at Clustrix, where he designed and developed the first
> > scale-out SQL database based on MySQL. Jason developed the distributed
> > query optimizer that powered Clustrix. He is now a software engineer and
> > architect at MapR Technologies.
> > * Ted Dunning is a PMC member for Apache ZooKeeper and Apache Mahout, and
> > has a history of over 30 years of contributions to open source. He is now
> > at MapR Technologies. Ted has been very active in the Hadoop community
> > since the project’s early days.
> > * MC Srivas is the co-founder and CTO of MapR Technologies. While at
> Google
> > he worked on Google’s scalable search infrastructure. MC Srivas has been
> > active in the Hadoop community since 2009.
> > * Chris Wensel is the founder and CEO of Concurrent. Prior to founding
> > Concurrent, he developed Cascading, an Apache-licensed open source
> > application framework enabling Java developers to quickly and easily
> > develop robust Data Analytics and Data Management applications on Apache
> > Hadoop. Chris has been involved in the Hadoop community since the
> project's
> > early days.
> > * Keys Botzum was at IBM, where he worked on security and distributed
> > systems, and is currently at MapR Technologies.
> > * Gera Shegalov was at Oracle, where he worked on networking, storage and
> > database kernels, and is currently at MapR Technologies.
> > * Ryan Rawson is the VP Engineering of Drawn to Scale where he developed
> > Spire, a real-time operational database for Hadoop. He is also a
> committer
> > and PMC member for Apache HBase, and has a long history of contributions
> to
> > open source. Ryan has been involved in the Hadoop community since the
> > project's early days.
> >
> > We realize that additional employer diversity is needed, and we will work
> > aggressively to recruit developers from additional companies.
> >
> > Alignment
> > =========
> > The initial committers strongly believe that a system for interactive
> > analysis of large-scale datasets will gain broader adoption as an open
> > source, community driven project, where the community can contribute not
> > only to the core components, but also to a growing collection of query
> > languages and optimizers, data formats, data formats, and execution
> engine
> > operators and connectors. Drill will integrate closely with Apache
> Hadoop.
> > First, the data will live in Hadoop. That is, Drill will support Hadoop
> > FileSystem implementations and HBase. Second, Hadoop-related data formats
> > will be supported (eg, Apache Avro, RCFile). Third, MapReduce-based tools
> > will be provided to produce column-based formats. Fourth, Drill tables
> can
> > be registered in HCatalog. Finally, Hive is being considered as the basis
> > of the DrQL implementation.
> >
> > Known Risks
> > ===========
> >
> > Orphaned Products
> > =================
> > The contributors are leading vendors in this space, with significant open
> > source experience, so the risk of being orphaned is relatively low. The
> > project could be at risk if vendors decided to change their strategies in
> > the market. In such an event, the current committers plan to continue
> > working on the project on their own time, though the progress will likely
> > be slower. We plan to mitigate this risk by recruiting additional
> > committers.
> >
> > Inexperience with Open Source
> > =============================
> > The initial committers include veteran Apache members (committers and PMC
> > members) and other developers who have varying degrees of experience with
> > open source projects. All have been involved with source code that has
> been
> > released under an open source license, and several also have experience
> > developing code with an open source development process.
> >
> > Homogenous Developers
> > =====================
> > The initial committers are employed by a number of companies, including
> > MapR Technologies, Concurrent and Drawn to Scale. We are committed to
> > recruiting additional committers from other companies.
> >
> > Reliance on Salaried Developers
> > ===============================
> > It is expected that Drill development will occur on both salaried time
> and
> > on volunteer time, after hours. The majority of initial committers are
> paid
> > by their employer to contribute to this project. However, they are all
> > passionate about the project, and we are confident that the project will
> > continue even if no salaried developers contribute to the project. We are
> > committed to recruiting additional committers including non-salaried
> > developers.
> >
> > Relationships with Other Apache Products
> > ========================================
> > As mentioned in the Alignment section, Drill is closely integrated with
> > Hadoop, Avro, Hive and HBase in a numerous ways. For example, Drill data
> > lives inside a Hadoop environment (Drill operates on in situ data). We
> look
> > forward to collaborating with those communities, as well as other Apache
> > communities.
> >
> > An Excessive Fascination with the Apache Brand
> > ==============================================
> > Drill solves a real problem that many organizations struggle with, and
> has
> > been proven within Google to be of significant value. The architecture is
> > based on academic and industry research. Our rationale for developing
> Drill
> > as an Apache project is detailed in the Rationale section. We believe
> that
> > the Apache brand and community process will help us attract more
> > contributors to this project, and help establish ubiquitous APIs. In
> > addition, establishing consensus among users and developers of a
> > Dremel-like tool is a key requirement for success of the project.
> >
> > Documentation
> > =============
> > Drill is inspired by Google’s Dremel. Google has published a paper
> > highlighting Dremel’s innovative nested column-based data format and
> > execution engine: http://research.google.com/pubs/pub36632.html
> >
> > High-level slides have been published by MapR: TODO
> >
> > Initial Source
> > ==============
> > There is no initial source code. All source code will be developed within
> > the Apache Incubator.
> >
> > Cryptography
> > ============
> > Drill will eventually support encryption on the wire. This is not one of
> > the initial goals, and we do not expect Drill to be a controlled export
> > item due to the use of encryption.
> >
> > Required Resources
> > ==================
> >
> > Mailing List
> > ============
> > * drill-private
> > * drill-dev
> > * drill-user
> >
> > Subversion Directory
> > ====================
> > Git is the preferred source control system: git://git.apache.org/drill
> >
> > Issue Tracking
> > ==============
> > JIRA Drill (DRILL)
> >
> > Initial Committers
> > ==================
> > * Tomer Shiran (tshiran at maprtech dot com)
> > * Ted Dunning (tdunning at apache dot org)
> > * Jason Frantz (jfrantz at maprtech dot com)
> > * MC Srivas (mcsrivas at maprtech dot com)
> > * Chris Wensel (chris and concurrentinc dot com)
> > * Keys Botzum (kbotzum at maprtech dot com)
> > * Gera Shegalov (gshegalov at maprtech dot com)
> > * Ryan Rawson (ryan at drawntoscale dot com)
> >
> > Affiliations
> > ============
> > The initial committers are employees of MapR Technologies, Drawn to Scale
> > and Concurrent. The nominated mentors are employees of MapR Technologies,
> > Lucid Imagination and Nokia.
> >
> > Sponsors
> > ========
> >
> > Champion
> > ========
> > Ted Dunning (tdunning at apache dot org)
> >
> > Nominated Mentors
> > =================
> > * Ted Dunning (tdunning at apache dot org) – Chief Application Architect
> at
> > MapR Technologies, Committer for Lucene, Mahout and ZooKeeper.
> > * Grant Ingersoll (grant at lucidimagination dot com) – Chief Scientist
> at
> > Lucid Imagination, Committer for Lucene, Mahout and other projects.
> > * Isabel Drost (Isabel at apache dot org) – Software Developer at Nokia
> > Gate 5 GmbH, Committer for Lucene, Mahout and other projects.
> >
> > Sponsoring Entity
> > =================
> > Incubator
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
> For additional commands, e-mail: general-help@incubator.apache.org
>
>

Re: [PROPOSAL] Drill for the Apache Incubator

Posted by Bernd Fondermann <be...@gmail.com>.
great proposal and a very promising mentor lineup.

Have fun,

  Bernd

On Thu, Aug 2, 2012 at 11:40 PM, Ted Dunning <td...@apache.org> wrote:
> Abstract
> ========
> Drill is a distributed system for interactive analysis of large-scale
> datasets, inspired by Google’s Dremel (
> http://research.google.com/pubs/pub36632.html).
>
> Proposal
> ========
> Drill is a distributed system for interactive analysis of large-scale
> datasets. Drill is similar to Google’s Dremel, with the additional
> flexibility needed to support a broader range of query languages, data
> formats and data sources. It is designed to efficiently process nested
> data. It is a design goal to scale to 10,000 servers or more and to be able
> to process petabyes of data and trillions of records in seconds.
>
> Background
> ==========
> Many organizations have the need to run data-intensive applications,
> including batch processing, stream processing and interactive analysis. In
> recent years open source systems have emerged to address the need for
> scalable batch processing (Apache Hadoop) and stream processing (Storm,
> Apache S4). In 2010 Google published a paper called “Dremel: Interactive
> Analysis of Web-Scale Datasets,” describing a scalable system used
> internally for interactive analysis of nested data. No open source project
> has successfully replicated the capabilities of Dremel.
>
> Rationale
> =========
> There is a strong need in the market for low-latency interactive analysis
> of large-scale datasets, including nested data (eg, JSON, Avro, Protocol
> Buffers). This need was identified by Google and addressed internally with
> a system called Dremel.
>
> In recent years open source systems have emerged to address the need for
> scalable batch processing (Apache Hadoop) and stream processing (Storm,
> Apache S4). Apache Hadoop, originally inspired by Google’s internal
> MapReduce system, is used by thousands of organizations processing
> large-scale datasets. Apache Hadoop is designed to achieve very high
> throughput, but is not designed to achieve the sub-second latency needed
> for interactive data analysis and exploration. Drill, inspired by Google’s
> internal Dremel system, is intended to address this need.
>
> It is worth noting that, as explained by Google in the original paper,
> Dremel complements MapReduce-based computing. Dremel is not intended as a
> replacement for MapReduce and is often used in conjunction with it to
> analyze outputs of MapReduce pipelines or rapidly prototype larger
> computations. Indeed, Dremel and MapReduce are both used by thousands of
> Google employees.
>
> Like Dremel, Drill supports a nested data model with data encoded in a
> number of formats such as JSON, Avro or Protocol Buffers. In many
> organizations nested data is the standard, so supporting a nested data
> model eliminates the need to normalize the data. With that said, flat data
> formats, such as CSV files, are naturally supported as a special case of
> nested data.
>
> The Drill architecture consists of four key components/layers:
> * Query languages: This layer is responsible for parsing the user’s query
> and constructing an execution plan.  The initial goal is to support the
> SQL-like language used by Dremel and Google BigQuery (
> https://developers.google.com/bigquery/docs/query-reference), which we call
> DrQL. However, Drill is designed to support other languages and programming
> models, such as the Mongo Query Language (
> http://www.mongodb.org/display/DOCS/Mongo+Query+Language), Cascading (
> http://www.cascading.org/) or Plume (https://github.com/tdunning/Plume).
> * Low-latency distributed execution engine: This layer is responsible for
> executing the physical plan. It provides the scalability and fault
> tolerance needed to efficiently query petabytes of data on 10,000 servers.
> Drill’s execution engine is based on research in distributed execution
> engines (eg, Dremel, Dryad, Hyracks, CIEL, Stratosphere) and columnar
> storage, and can be extended with additional operators and connectors.
> * Nested data formats: This layer is responsible for supporting various
> data formats. The initial goal is to support the column-based format used
> by Dremel. Drill is designed to support schema-based formats such as
> Protocol Buffers/Dremel, Avro/AVRO-806/Trevni and CSV, and schema-less
> formats such as JSON, BSON or YAML. In addition, it is designed to support
> column-based formats such as Dremel, AVRO-806/Trevni and RCFile, and
> row-based formats such as Protocol Buffers, Avro, JSON, BSON and CSV. A
> particular distinction with Drill is that the execution engine is flexible
> enough to support column-based processing as well as row-based processing.
> This is important because column-based processing can be much more
> efficient when the data is stored in a column-based format, but many large
> data assets are stored in a row-based format that would require conversion
> before use.
> * Scalable data sources: This layer is responsible for supporting various
> data sources. The initial focus is to leverage Hadoop as a data source.
>
> It is worth noting that no open source project has successfully replicated
> the capabilities of Dremel, nor have any taken on the broader goals of
> flexibility (eg, pluggable query languages, data formats, data sources and
> execution engine operators/connectors) that are part of Drill.
>
> Initial Goals
> =============
> The initial goals for this project are to specify the detailed requirements
> and architecture, and then develop the initial implementation including the
> execution engine and DrQL.
> Like Apache Hadoop, which was built to support multiple storage systems
> (through the FileSystem API) and file formats (through the
> InputFormat/OutputFormat APIs), Drill will be built to support multiple
> query languages, data formats and data sources. The initial implementation
> of Drill will support the DrQL and a column-based format similar to Dremel.
>
> Current Status
> ==============
> Significant work has been completed to identify the initial requirements
> and define the overall system architecture. The next step is to implement
> the four components described in the Rationale section, and we intend to do
> that development as an Apache project.
>
> Meritocracy
> ===========
> We plan to invest in supporting a meritocracy. We will discuss the
> requirements in an open forum. Several companies have already expressed
> interest in this project, and we intend to invite additional developers to
> participate. We will encourage and monitor community participation so that
> privileges can be extended to those that contribute. Also, Drill has an
> extensible/pluggable architecture that encourages developers to contribute
> various extensions, such as query languages, data formats, data sources and
> execution engine operators and connectors. While some companies will surely
> develop commercial extensions, we also anticipate that some companies and
> individuals will want to contribute such extensions back to the project,
> and we look forward to fostering a rich ecosystem of extensions.
>
> Community
> =========
> The need for a system for interactive analysis of large datasets in the
> open source is tremendous, so there is a potential for a very large
> community. We believe that Drill’s extensible architecture will further
> encourage community participation. Also, related Apache projects (eg,
> Hadoop) have very large and active communities, and we expect that over
> time Drill will also attract a large community.
>
> Core Developers
> ===============
> The developers on the initial committers list include experienced
> distributed systems engineers:
> * Tomer Shiran has experience developing distributed execution engines. He
> developed Parallel DataSeries, a data-parallel version of the open source
> DataSeries system (http://tesla.hpl.hp.com/opensource/). He is also the
> author of Applying Idealized Lower-bound Runtime Models to Understand
> Inefficiencies in Data-intensive Computing (SIGMETRICS 2011). Tomer worked
> as a software developer and researcher at IBM Research, Microsoft and HP
> Labs, and is now at MapR Technologies. He has been active in the Hadoop
> community since 2009.
> * Jason Frantz was at Clustrix, where he designed and developed the first
> scale-out SQL database based on MySQL. Jason developed the distributed
> query optimizer that powered Clustrix. He is now a software engineer and
> architect at MapR Technologies.
> * Ted Dunning is a PMC member for Apache ZooKeeper and Apache Mahout, and
> has a history of over 30 years of contributions to open source. He is now
> at MapR Technologies. Ted has been very active in the Hadoop community
> since the project’s early days.
> * MC Srivas is the co-founder and CTO of MapR Technologies. While at Google
> he worked on Google’s scalable search infrastructure. MC Srivas has been
> active in the Hadoop community since 2009.
> * Chris Wensel is the founder and CEO of Concurrent. Prior to founding
> Concurrent, he developed Cascading, an Apache-licensed open source
> application framework enabling Java developers to quickly and easily
> develop robust Data Analytics and Data Management applications on Apache
> Hadoop. Chris has been involved in the Hadoop community since the project's
> early days.
> * Keys Botzum was at IBM, where he worked on security and distributed
> systems, and is currently at MapR Technologies.
> * Gera Shegalov was at Oracle, where he worked on networking, storage and
> database kernels, and is currently at MapR Technologies.
> * Ryan Rawson is the VP Engineering of Drawn to Scale where he developed
> Spire, a real-time operational database for Hadoop. He is also a committer
> and PMC member for Apache HBase, and has a long history of contributions to
> open source. Ryan has been involved in the Hadoop community since the
> project's early days.
>
> We realize that additional employer diversity is needed, and we will work
> aggressively to recruit developers from additional companies.
>
> Alignment
> =========
> The initial committers strongly believe that a system for interactive
> analysis of large-scale datasets will gain broader adoption as an open
> source, community driven project, where the community can contribute not
> only to the core components, but also to a growing collection of query
> languages and optimizers, data formats, data formats, and execution engine
> operators and connectors. Drill will integrate closely with Apache Hadoop.
> First, the data will live in Hadoop. That is, Drill will support Hadoop
> FileSystem implementations and HBase. Second, Hadoop-related data formats
> will be supported (eg, Apache Avro, RCFile). Third, MapReduce-based tools
> will be provided to produce column-based formats. Fourth, Drill tables can
> be registered in HCatalog. Finally, Hive is being considered as the basis
> of the DrQL implementation.
>
> Known Risks
> ===========
>
> Orphaned Products
> =================
> The contributors are leading vendors in this space, with significant open
> source experience, so the risk of being orphaned is relatively low. The
> project could be at risk if vendors decided to change their strategies in
> the market. In such an event, the current committers plan to continue
> working on the project on their own time, though the progress will likely
> be slower. We plan to mitigate this risk by recruiting additional
> committers.
>
> Inexperience with Open Source
> =============================
> The initial committers include veteran Apache members (committers and PMC
> members) and other developers who have varying degrees of experience with
> open source projects. All have been involved with source code that has been
> released under an open source license, and several also have experience
> developing code with an open source development process.
>
> Homogenous Developers
> =====================
> The initial committers are employed by a number of companies, including
> MapR Technologies, Concurrent and Drawn to Scale. We are committed to
> recruiting additional committers from other companies.
>
> Reliance on Salaried Developers
> ===============================
> It is expected that Drill development will occur on both salaried time and
> on volunteer time, after hours. The majority of initial committers are paid
> by their employer to contribute to this project. However, they are all
> passionate about the project, and we are confident that the project will
> continue even if no salaried developers contribute to the project. We are
> committed to recruiting additional committers including non-salaried
> developers.
>
> Relationships with Other Apache Products
> ========================================
> As mentioned in the Alignment section, Drill is closely integrated with
> Hadoop, Avro, Hive and HBase in a numerous ways. For example, Drill data
> lives inside a Hadoop environment (Drill operates on in situ data). We look
> forward to collaborating with those communities, as well as other Apache
> communities.
>
> An Excessive Fascination with the Apache Brand
> ==============================================
> Drill solves a real problem that many organizations struggle with, and has
> been proven within Google to be of significant value. The architecture is
> based on academic and industry research. Our rationale for developing Drill
> as an Apache project is detailed in the Rationale section. We believe that
> the Apache brand and community process will help us attract more
> contributors to this project, and help establish ubiquitous APIs. In
> addition, establishing consensus among users and developers of a
> Dremel-like tool is a key requirement for success of the project.
>
> Documentation
> =============
> Drill is inspired by Google’s Dremel. Google has published a paper
> highlighting Dremel’s innovative nested column-based data format and
> execution engine: http://research.google.com/pubs/pub36632.html
>
> High-level slides have been published by MapR: TODO
>
> Initial Source
> ==============
> There is no initial source code. All source code will be developed within
> the Apache Incubator.
>
> Cryptography
> ============
> Drill will eventually support encryption on the wire. This is not one of
> the initial goals, and we do not expect Drill to be a controlled export
> item due to the use of encryption.
>
> Required Resources
> ==================
>
> Mailing List
> ============
> * drill-private
> * drill-dev
> * drill-user
>
> Subversion Directory
> ====================
> Git is the preferred source control system: git://git.apache.org/drill
>
> Issue Tracking
> ==============
> JIRA Drill (DRILL)
>
> Initial Committers
> ==================
> * Tomer Shiran (tshiran at maprtech dot com)
> * Ted Dunning (tdunning at apache dot org)
> * Jason Frantz (jfrantz at maprtech dot com)
> * MC Srivas (mcsrivas at maprtech dot com)
> * Chris Wensel (chris and concurrentinc dot com)
> * Keys Botzum (kbotzum at maprtech dot com)
> * Gera Shegalov (gshegalov at maprtech dot com)
> * Ryan Rawson (ryan at drawntoscale dot com)
>
> Affiliations
> ============
> The initial committers are employees of MapR Technologies, Drawn to Scale
> and Concurrent. The nominated mentors are employees of MapR Technologies,
> Lucid Imagination and Nokia.
>
> Sponsors
> ========
>
> Champion
> ========
> Ted Dunning (tdunning at apache dot org)
>
> Nominated Mentors
> =================
> * Ted Dunning (tdunning at apache dot org) – Chief Application Architect at
> MapR Technologies, Committer for Lucene, Mahout and ZooKeeper.
> * Grant Ingersoll (grant at lucidimagination dot com) – Chief Scientist at
> Lucid Imagination, Committer for Lucene, Mahout and other projects.
> * Isabel Drost (Isabel at apache dot org) – Software Developer at Nokia
> Gate 5 GmbH, Committer for Lucene, Mahout and other projects.
>
> Sponsoring Entity
> =================
> Incubator

---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
For additional commands, e-mail: general-help@incubator.apache.org