Posted to general@incubator.apache.org by Jean-Baptiste Onofré <jb...@nanthrax.net> on 2016/05/25 20:24:06 UTC

[VOTE] Accept CarbonData into the Apache Incubator

Hi all,

following the discussion thread, I'm now calling a vote to accept 
CarbonData into the Incubator.

[ ] +1 Accept CarbonData into the Apache Incubator
[ ] +0 Abstain
[ ] -1 Do not accept CarbonData into the Apache Incubator, because ...

This vote is open for 72 hours.

The proposal follows; you can also read it on the wiki page:
https://wiki.apache.org/incubator/CarbonDataProposal

Thanks !
Regards
JB

= Apache CarbonData =

== Abstract ==

Apache CarbonData is a new Apache Hadoop native file format for faster
interactive queries. It uses advanced columnar storage, indexing,
compression, and encoding techniques to improve computing efficiency,
which in turn helps speed up queries by an order of magnitude over
petabytes of data.

CarbonData GitHub address: https://github.com/HuaweiBigData/carbondata

== Background ==

Huawei is an ICT solution provider committed to enhancing customer
experiences for telecom carriers, enterprises, and consumers on big
data. To satisfy the following customer requirements, we created a new
Hadoop native file format:

  * Support interactive OLAP-style queries over big data in seconds.
  * Support fast queries on individual records, which require touching all fields.
  * Load data quickly and support incremental loads at intervals of minutes.
  * Support HDFS so that customers can leverage existing Hadoop clusters.
  * Support time-based data retention.

Based on these requirements, we investigated existing file formats in
the Hadoop ecosystem, but we could not find a solution that satisfies
all of the requirements at the same time, so we started designing
CarbonData.

== Rationale ==

CarbonData contains multiple modules, which fall into two categories:

  1. CarbonData File Format: the core implementation of the file format, including columnar storage, index, dictionary, encoding and compression, and the API for reading and writing.
  2. CarbonData integration with big data processing frameworks such as Apache Spark and Apache Hive. Apache Beam integration is also planned, to abstract the execution runtime.

=== CarbonData File Format ===

The CarbonData file format is a columnar store in HDFS. It has many of
the features of a modern columnar format, such as splittability, a
compression schema, and complex data types. In addition, CarbonData has
the following unique features:

==== Indexing ====

To support fast interactive queries, CarbonData uses indexing to reduce
I/O scans. A CarbonData file stores the index along with the data: the
index is not kept separately but is contained in the CarbonData file
itself. In the current implementation, CarbonData supports three types
of indexing:

1. Multi-dimensional Key (B+ Tree index)
  Data blocks are written to disk in sequence, and within each data
block the column blocks are written in sequence. Finally, the metadata
block for the file is written, with the byte positions of each block in
the file, the Min-Max statistics index, and the start and end
multi-dimensional key (MDK) of each data block. Since all data in the
file is in sorted order, the start and end MDK of each data block can
be used to construct a B+Tree, and the file can be logically
represented as a B+Tree with the data blocks as leaf nodes (on disk)
and the remaining non-leaf nodes in memory.
2. Inverted index
  Inverted indexes are widely used in search engines. This index helps
the processing/query engine filter inside one HDFS block. Furthermore,
combining bitmaps with the inverted index at query time makes it
possible to accelerate operations such as count distinct.
3. MinMax index
  A MinMax index is created for all columns so that the processing/query
engine can skip scans that are not required.
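
As a rough illustration of how the start/end MDK and MinMax metadata
described above could let a reader skip blocks, here is a minimal Python
sketch. It is not CarbonData code: the BlockMeta class and the
candidate_blocks and prune_by_minmax helpers are hypothetical names, the
sample keys and statistics are made up, and binary search over a sorted
list stands in for the in-memory B+Tree.

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass


@dataclass
class BlockMeta:
    """Hypothetical per-block entry in the file's metadata block."""
    offset: int       # byte position of the data block in the file
    start_mdk: tuple  # smallest multi-dimensional key in the block
    end_mdk: tuple    # largest multi-dimensional key in the block
    min_max: dict     # column name -> (min, max) within the block


def candidate_blocks(blocks, lo_key, hi_key):
    """Return the blocks whose key range overlaps [lo_key, hi_key].

    Assumes blocks are sorted by key and do not overlap, so binary
    search plays the role of walking the non-leaf B+Tree nodes.
    """
    ends = [b.end_mdk for b in blocks]
    starts = [b.start_mdk for b in blocks]
    first = bisect_left(ends, lo_key)    # first block ending at or after lo_key
    last = bisect_right(starts, hi_key)  # one past the last block starting at or before hi_key
    return blocks[first:last]


def prune_by_minmax(blocks, column, value):
    """Keep only blocks whose (min, max) for `column` can contain `value`."""
    kept = []
    for b in blocks:
        lo, hi = b.min_max[column]
        if lo <= value <= hi:
            kept.append(b)
    return kept


# Made-up metadata for three data blocks, keyed by (day, country).
blocks = [
    BlockMeta(0, ("2016-05-01", "CN"), ("2016-05-10", "US"), {"bytes": (10, 500)}),
    BlockMeta(4096, ("2016-05-11", "CN"), ("2016-05-20", "US"), {"bytes": (600, 900)}),
    BlockMeta(8192, ("2016-05-21", "CN"), ("2016-05-31", "US"), {"bytes": (50, 700)}),
]

# Key range filter first, then a filter on a non-key column.
by_key = candidate_blocks(blocks, ("2016-05-12", "CN"), ("2016-05-25", "US"))
by_stats = prune_by_minmax(by_key, "bytes", 800)
print([b.offset for b in by_key])    # blocks that must be considered for the key range
print([b.offset for b in by_stats])  # blocks left after MinMax pruning
```

In this sketch only the metadata is consulted; no data block is read
until both checks have narrowed the candidates.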

==== Global Dictionary ====

Besides reducing I/O, CarbonData accelerates computation by using a
global dictionary, which enables processing/query engines to perform
all processing on encoded data without having to convert the data (late
materialization). We have observed dramatic performance improvements in
OLAP analytic scenarios where a table contains many string columns. The
data is converted back to user-readable form just before the
processing/query engine returns results to the user.
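
The idea of processing on encoded data can be sketched in a few lines of
Python. This is an illustration only, not the CarbonData dictionary
format: build_dictionary and encode are hypothetical helpers, and the
sample column values are made up.

```python
def build_dictionary(values):
    """Assign each distinct string a small integer code (illustrative only)."""
    codes = {}
    for v in values:
        if v not in codes:
            codes[v] = len(codes)
    return codes


def encode(values, codes):
    """Replace each string by its dictionary code."""
    return [codes[v] for v in values]


city = ["Shenzhen", "Paris", "Shenzhen", "Bangalore", "Paris", "Shenzhen"]
sales = [10, 20, 30, 40, 50, 60]

codes = build_dictionary(city)
encoded_city = encode(city, codes)

# The group-by and the aggregation run entirely on integer codes.
totals = {}
for code, amount in zip(encoded_city, sales):
    totals[code] = totals.get(code, 0) + amount

# Late materialization: decode only the final result rows.
decode = {code: value for value, code in codes.items()}
result = {decode[code]: total for code, total in totals.items()}
print(result)
```

Only three strings are decoded at the end, regardless of how many rows
were aggregated.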

==== Column Group ====

Sometimes users want to process or query multiple columns of one table
together, for example when scanning for an individual record in a
troubleshooting scenario. In this case, a row format is more efficient
than a columnar format, since the workload touches all columns. To
accelerate this, CarbonData supports storing a group of columns in row
format, so that the data in a column group is stored together and can
be retrieved quickly.
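
A small Python sketch of the difference, under the assumption that three
of four columns are placed in one column group. The table contents and
the fetch_columnar and fetch_with_group helpers are hypothetical and do
not reflect the on-disk layout of CarbonData.

```python
rows = [
    ("u1", "2016-05-25", "dropped_call", 3),
    ("u2", "2016-05-25", "ok", 0),
    ("u3", "2016-05-26", "timeout", 7),
]
columns = ("user", "day", "status", "retries")

# Pure columnar layout: one array per column.
columnar = {name: [row[i] for row in rows] for i, name in enumerate(columns)}

# Assumed column group: (day, status, retries) stored together row by
# row, while `user` stays a separate column.
column_group = [row[1:] for row in rows]


def fetch_columnar(row_id):
    """Rebuild one record by touching every column array (4 lookups)."""
    return tuple(columnar[name][row_id] for name in columns)


def fetch_with_group(row_id):
    """Rebuild one record from `user` plus one read of the group (2 lookups)."""
    return (columnar["user"][row_id],) + column_group[row_id]


print(fetch_columnar(2))
print(fetch_with_group(2))
```

Both functions return the same record; the grouped layout simply needs
fewer separate reads when the whole record is wanted.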

==== Optimized for multiple use cases ====

CarbonData indices and dictionaries are highly configurable. To optimize
storage for different use cases, users can configure what to index and
tune the format before loading data into CarbonData.

For example:

|| Use Case || Supporting Features ||
|| Interactive OLAP query || Columnar format, Multi-dimensional Key (B+ Tree index), Minmax index, Inverted index ||
|| High throughput scan || Global dictionary, Minmax index ||
|| Low latency point query || Multi-dimensional Key (B+ Tree index), Partitioning ||
|| Individual record query || Column group, Global dictionary ||

=== BigData Processing Framework Integration ===

  * CarbonData provides InputFormat/OutputFormat interfaces for reading and writing CarbonData files, and also provides an abstract API that data processing frameworks can use to process data stored in the CarbonData format.
  * CarbonData provides deep integration with Apache Spark, including predicate pushdown, column pruning, and aggregation pushdown, so users can use Spark SQL to connect to and query CarbonData.
  * CarbonData can integrate with various big data query/processing frameworks in the Hadoop ecosystem, such as Apache Spark and Apache Hive.

Example: 
https://github.com/HuaweiBigData/carbondata/blob/master/examples/src/main/scala/org/carbondata/examples/CarbonExample.scala
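
To make the terms predicate pushdown and column pruning from the list
above concrete, here is a framework-free Python sketch. It does not use
the CarbonData or Spark APIs: the scan function and the in-memory table
are hypothetical, and only illustrate a reader that applies the filter
and the projection itself instead of returning every column of every
row.

```python
# Made-up columnar table held in memory.
table = {
    "country": ["CN", "FR", "CN", "IN"],
    "device": ["phone", "tablet", "phone", "phone"],
    "bytes": [120, 80, 300, 50],
}


def scan(table, projection, predicate_column, predicate):
    """Hypothetical reader with the filter and projection pushed down.

    The predicate is evaluated on a single column first; only the
    projected columns are then read, and only for matching rows.
    """
    matching = [i for i, v in enumerate(table[predicate_column]) if predicate(v)]
    return [tuple(table[c][i] for c in projection) for i in matching]


# Roughly what an engine could request for:
#   SELECT device, bytes FROM t WHERE country = 'CN'
rows = scan(table, ["device", "bytes"], "country", lambda v: v == "CN")
print(rows)
```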

== Initial Goals ==

Our initial goals are to bring CarbonData into the ASF, transition 
internal engineering processes into the open, and foster a collaborative 
development model according to the "Apache Way".

== Current Status ==

CarbonData is production ready and already provides a large set of features.
The current license is already Apache 2.0.

== Meritocracy ==

We intend to radically expand the initial developer and user community 
by running the project in accordance with the "Apache Way". Users and 
new contributors will be treated with respect and welcomed. By 
participating in the community and providing quality patches/support 
that move the project forward, they will earn merit. They also will be 
encouraged to provide non-code contributions (documentation, events, 
community management, etc.) and will gain merit for doing so. Those with 
a proven support and quality track record will be encouraged to become 
committers.

== Community ==

If CarbonData is accepted for incubation, the primary initial goal is to
build a large community. We firmly believe that CarbonData will become a
key project for big data columnar platforms, and so we are betting on a
large community of users and developers.

== Known Risks ==

Development has been sponsored mostly by one company. For the project
to fully transition to the Apache Way governance model, development must
shift towards a meritocracy-centric model of growing a community of
contributors, balanced with the need for extreme stability and core
implementation coherency.

== Orphaned products ==

Huawei is fully committed to CarbonData. Moreover, Huawei has a vested
interest in making CarbonData succeed by driving its close integration
with sister ASF projects. We expect this to further reduce the risk of
orphaning the product.

== Inexperience with Open Source ==

Huawei has been developing and using open source software for a long
time. Additionally, several ASF veterans have agreed to mentor the project
and are listed in this proposal. The project will rely on their guidance 
and collective wisdom to quickly transition the entire team of initial 
committers towards practicing the Apache Way.

== Reliance on Salaried Developers ==

Most of the contributors are paid to work in the big data space. While
they might wander from their current employers, they are unlikely to
venture far from their core expertise and thus will continue to be
engaged with the project regardless of their current employers.

== An Excessive Fascination with the Apache Brand ==

While we intend to leverage the Apache ‘branding’ when talking to other
projects as a testament to our project’s ‘neutrality’, we have no plans
to use the Apache brand in press releases or to post billboards
advertising the acceptance of CarbonData into the Apache Incubator.

== Initial Source ==

https://github.com/HuaweiBigData/carbondata.git

== External Dependencies ==

All external dependencies are licensed under the Apache 2.0 license or
an Apache-compatible license. As we grow the CarbonData community, we
will configure our build process to require and validate that all
contributions and dependencies are licensed under the Apache 2.0
license or an Apache-compatible license.

  * Apache Spark
  * Apache Hadoop
  * Apache Maven
  * Apache Commons
  * Apache Log4j
  * Apache Thrift
  * Apache Zookeeper
  * Scala
  * Snappy
  * Kettle (Pentaho)
  * Eigenbase
  * Fastutil
  * GSON
  * Jmockit
  * Junit

== Required Resources ==

=== Mailing lists ===

  * private@carbondata.incubator.apache.org (moderated subscriptions)
  * commits@carbondata.incubator.apache.org
  * dev@carbondata.incubator.apache.org
  * issues@carbondata.incubator.apache.org

=== Git Repository ===

  * https://git-wip-us.apache.org/repos/asf/incubator-carbondata.git

=== Issue Tracking ===

  * JIRA Project CarbonData (CarbonData)

=== Initial Committers ===

  * Liang Chenliang
  * Jean-Baptiste Onofré
  * Henry Saputra
  * Uma Maheswara Rao G
  * Jenny MA
  * Jacky Likun
  * Vimal Das Kammath
  * Jarray Qiuheng

=== Affiliations ===

  * Huawei: Liang Chenliang
  * Talend: Jean-Baptiste Onofré
  * eBay: Henry Saputra
  * Intel: Uma Maheswara Rao G

=== Sponsors ===

=== Champion ===

  * Jean-Baptiste Onofré - Apache Member

=== Mentors ===

  * Henry Saputra (eBay)
  * Jean-Baptiste Onofré (Talend)
  * Uma Maheswara Rao G (Intel)

=== Sponsoring Entity ===

The Apache Incubator

---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
For additional commands, e-mail: general-help@incubator.apache.org


Re: [VOTE] Accept CarbonData into the Apache Incubator

Posted by Julian Hyde <jh...@apache.org>.
+1 

Julian



Re: [VOTE] Accept CarbonData into the Apache Incubator

Posted by Sergio Fernández <wi...@apache.org>.
+1 (binding)

On Wed, May 25, 2016 at 10:24 PM, Jean-Baptiste Onofré <jb...@nanthrax.net>
wrote:

> Hi all,
>
> following the discussion thread, I'm now calling a vote to accept
> CarbonData into the Incubator.
>
> ​[ ] +1 Accept CarbonData into the Apache Incubator
> [ ] +0 Abstain
> [ ] -1 Do not accept CarbonData into the Apache Incubator, because ...
>
> This vote is open for 72 hours.
>
> The proposal follows, you can also access the wiki page:
> https://wiki.apache.org/incubator/CarbonDataProposal
>
> Thanks !
> Regards
> JB
>
> = Apache CarbonData =
>
> == Abstract ==
>
> Apache CarbonData is a new Apache Hadoop native file format for faster
> interactive
> query using advanced columnar storage, index, compression and encoding
> techniques
> to improve computing efficiency, in turn it will help speedup queries an
> order of
> magnitude faster over PetaBytes of data.
>
> CarbonData github address: https://github.com/HuaweiBigData/carbondata
>
> == Background ==
>
> Huawei is an ICT solution provider, we are committed to enhancing customer
> experiences for telecom carriers, enterprises, and consumers on big data,
> In order to satisfy the following customer requirements, we created a new
> Hadoop native file format:
>
>  * Support interactive OLAP-style query over big data in seconds.
>  * Support fast query on individual record which require touching all
> fields.
>  * Fast data loading speed and support incremental load in period of
> minutes.
>  * Support HDFS so that customer can leverage existing Hadoop cluster.
>  * Support time based data retention.
>
> Based on these requirements, we investigated existing file formats in the
> Hadoop eco-system, but we could not find a suitable solution that
> satisfying requirements all at the same time, so we start designing
> CarbonData.
>
> == Rationale ==
>
> CarbonData contains multiple modules, which are classified into two
> categories:
>
>  1. CarbonData File Format: which contains core implementation for file
> format such as columnar,index,dictionary,encoding+compression,API for
> reading/writing etc.
>  2. CarbonData integration with big data processing framework such as
> Apache Spark, Apache Hive etc. Apache Beam is also planned to abstract the
> execution runtime.
>
> === CarbonData File Format ===
>
> CarbonData file format is a columnar store in HDFS, it has many features
> that a modern columnar format has, such as splittable, compression schema
> ,complex data type etc. And CarbonData has following unique features:
>
> ==== Indexing ====
>
> In order to support fast interactive query, CarbonData leverage indexing
> technology to reduce I/O scans. CarbonData files stores data along with
> index, the index is not stored separately but the CarbonData file itself
> contains the index. In current implementation, CarbonData supports 3 types
> of indexing:
>
> 1. Multi-dimensional Key (B+ Tree index)
>  The Data block are written in sequence to the disk and within each data
> blocks each column block is written in sequence. Finally, the metadata
> block for the file is written with information about byte positions of each
> block in the file, Min-Max statistics index and the start and end MDK of
> each data block. Since, the entire data in the file is in sorted order, the
> start and end MDK of each data block can be used to construct a B+Tree and
> the file can be logically  represented as a B+Tree with the data blocks as
> leaf nodes (on disk) and the remaining non-leaf nodes in memory.
> 2. Inverted index
>  Inverted index is widely used in search engine. By using this index, it
> helps processing/query engine to do filtering inside one HDFS block.
> Furthermore, query acceleration for count distinct like operation is made
> possible when combining bitmap and inverted index in query time.
> 3. MinMax index
>  For all columns, minmax index is created so that processing/query engine
> can skip scan that is not required.
>
> ==== Global Dictionary ====
>
> Besides I/O reduction, CarbonData accelerates computation by using global
> dictionary, which enables processing/query engines to perform all
> processing on encoded data without having to convert the data (Late
> Materialization). We have observed dramatic performance improvement for
> OLAP analytic scenario where table contains many columns in string data
> type. The data is converted back to the user readable form just before
> processing/query engine returning results to user.
>
> ==== Column Group ====
>
> Sometimes users want to perform processing/query on multi-columns in one
> table, for example, performing scan for individual record in
> troubleshooting scenario. In this case, row format is more efficient than
> columnar format since all columns will be touched by the workload. To
> accelerate this, CarbonData supports storing a group of column in row
> format, so data in column group is stored together and enable fast
> retrieval.
>
> ==== Optimized for multiple use cases ====
>
> CarbonData indices and dictionary is highly configurable. To make storage
> optimized for different use cases, user can configure what to index, so
> user can decide and tune the format before loading data into CarbonData.
>
> For example
>
> || Use Case || Supporting Features ||
> || Interactive OLAP query || Columnar format, Multi-dimensional Key (B+
> Tree index), Minmax index, Inverted index ||
> || High throughput scan || Global dictionary, Minmax index ||
> || Low latency point query || Multi-dimensional Key (B+ Tree index),
> Partitioning ||
> || Individual record query || Column group, Global dictionary ||
>
> === BigData Processing Framework Integration ===
>
>  * CarbonData provides InputFormat/OutputFormat interfaces for
> Reading/Writing data from the CarbonData files and at the same time
> provides abstract API for processing data stored as Carbondata format with
> data processing framework.
>  * CarbonData provides deep integration with Apache Spark including
> predicate push down, column pruning, aggregation push down etc. So users
> can use Spark SQL to connect and query from CarbonData.
>  * CarbonData can integrate with various big data Query/Processing
> framework on Hadoop eco-system such as Apache Spark,Apache Hive etc.
>
> Example:
> https://github.com/HuaweiBigData/carbondata/blob/master/examples/src/main/scala/org/carbondata/examples/CarbonExample.scala
>
> == Initial Goals ==
>
> Our initial goals are to bring CarbonData into the ASF, transition
> internal engineering processes into the open, and foster a collaborative
> development model according to the "Apache Way".
>
> == Current Status ==
>
> CarbonData is production ready and already provide a large set of features.
> The current license is already Apache 2.0.
>
> == Meritocracy ==
>
> We intend to radically expand the initial developer and user community by
> running the project in accordance with the "Apache Way". Users and new
> contributors will be treated with respect and welcomed. By participating in
> the community and providing quality patches/support that move the project
> forward, they will earn merit. They also will be encouraged to provide
> non-code contributions (documentation, events, community management, etc.)
> and will gain merit for doing so. Those with a proven support and quality
> track record will be encouraged to become committers.
>
> == Community ==
>
> If CarbonData is accepted for incubation, the primary initial goal is to
> build a large community. We really trust that CarbonData will become a key
> project for big data column-like platforms, and so, we bet on a large
> community of users and developers.
>
> == Known Risks ==
>
> Development has been sponsored mostly by a one company.For the project to
> fully transition to the Apache Way governance model, development must shift
> towards the meritocracy-centric model of growing a community of
> contributors balanced with the needs for extreme stability and core
> implementation coherency.
>
> == Orphaned products ==
>
> Huawei is fully committed CarbonData. Moreover, Huawei has a vested
> interest in making CarbonData succeed by driving its close integration with
> sister ASF projects. We expect this to further reduces the risk of
> orphaning the product.
>
> == Inexperience with Open Source ==
>
> Huawei has been developing and using open source software since a long
> time. Additionally, several ASF veterans agreed to mentor the project and
> are listed in this proposal. The project will rely on their guidance and
> collective wisdom to quickly transition the entire team of initial
> committers towards practicing the Apache Way.
>
> == Reliance on Salaried Developers ==
>
> Most of the contributors are paid to work in big data space. While they
> might wander from their current employers, they are unlikely to venture far
> from their core expertises and thus will continue to be engaged with the
> project regardless of their current employers.
>
> == An Excessive Fascination with the Apache Brand ==
>
> While we intend to leverage the Apache ‘branding’ when talking to other
> projects as testament of our project’s ‘neutrality’, we have no plans for
> making use of Apache brand in press releases nor posting billboards
> advertising acceptance of CarbonData into Apache Incubator.
>
> == Initial Source ==
>
> https://github.com/HuaweiBigData/carbondata.git
>
> == External Dependencies ==
>
> All external dependencies are licensed under the Apache 2.0 license or an
> Apache-compatible license. As we grow the CarbonData community, we will
> configure our build process to require and validate that all contributions
> and dependencies are licensed under the Apache 2.0 license or under
> an Apache-compatible license.
>
>  * Apache Spark
>  * Apache Hadoop
>  * Apache Maven
>  * Apache Commons
>  * Apache Log4j
>  * Apache Thrift
>  * Apache ZooKeeper
>  * Scala
>  * Snappy
>  * Kettle (Pentaho)
>  * Eigenbase
>  * Fastutil
>  * GSON
>  * JMockit
>  * JUnit
>
> == Required Resources ==
>
> === Mailing lists ===
>
>  * private@carbondata.incubator.apache.org (moderated subscriptions)
>  * commits@carbondata.incubator.apache.org
>  * dev@carbondata.incubator.apache.org
>  * issues@carbondata.incubator.apache.org
>
> === Git Repository ===
>
>  * https://git-wip-us.apache.org/repos/asf/incubator-carbondata.git
>
> === Issue Tracking ===
>
>  * JIRA Project CarbonData (CarbonData)
>
> === Initial Committers ===
>
>  * Liang Chenliang
>  * Jean-Baptiste Onofré
>  * Henry Saputra
>  * Uma Maheswara Rao G
>  * Jenny MA
>  * Jacky Likun
>  * Vimal Das Kammath
>  * Jarray Qiuheng
>
> === Affiliations ===
>
>  * Huawei: Liang Chenliang
>  * Talend: Jean-Baptiste Onofré
>  * eBay: Henry Saputra
>  * Intel: Uma Maheswara Rao G
>
> === Sponsors ===
>
> === Champion ===
>
>  * Jean-Baptiste Onofré - Apache Member
>
> === Mentors ===
>
>  * Henry Saputra (eBay)
>  * Jean-Baptiste Onofré (Talend)
>  * Uma Maheswara Rao G (Intel)
>
> === Sponsoring Entity ===
>
> The Apache Incubator
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
> For additional commands, e-mail: general-help@incubator.apache.org
>
>


-- 
Sergio Fernández
Partner Technology Manager
Redlink GmbH
m: +43 6602747925
e: sergio.fernandez@redlink.co
w: http://redlink.co

Re: [VOTE] Accept CarbonData into the Apache Incubator

Posted by Madhawa Kasun Gunasekara <ma...@gmail.com>.
+1

Thanks,
Madhawa

On Fri, May 27, 2016 at 11:16 AM, Jean-Baptiste Onofré <jb...@nanthrax.net>
wrote:

> Hi Jim,
>
> good point. Let me try to explain this "gap" regarding my discussion with
> the team:
>
> 1. Some people have been involved mostly in architecture and design more
> than directly in code. That's why they are part of the initial committer list,
> whereas they didn't really provide "visible" code on GitHub.
>
> 2. Some people are no longer involved in the project. That's why they don't
> appear on the initial committer list.
>
> Regards
> JB
>
>
> On 05/26/2016 05:45 PM, Jim Jagielski wrote:
>
>> I am trying to align the list of initial committers with
>> the list of current/active contributors, according to
>> Github, and I am seeing people proposed who have not
>> contributed anything and people NOT proposed who seem
>> to be kinda active...
>>
>> Sooo..... -0
>>
> --
> Jean-Baptiste Onofré
> jbonofre@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com

Re: [VOTE] Accept CarbonData into the Apache Incubator

Posted by Amol Kekre <am...@datatorrent.com>.
+1 (non-binding)

Thks
Amol

On Fri, May 27, 2016 at 5:53 AM, Jim Jagielski <ji...@jagunet.com> wrote:

> Thx for the feedback...
>
> I change my vote to +1 (binding)
> > On May 27, 2016, at 1:46 AM, Jean-Baptiste Onofré <jb...@nanthrax.net>
> wrote:
> >
> > Hi Jim,
> >
> > good point. Let me try to explain this "gap" regarding my discussion
> with the team:
> >
> > 1. Some people have been involved mostly in architecture and design more
> than directly in code. That's why they are part of the initial committer list,
> whereas they didn't really provide "visible" code on GitHub.
> >
> > 2. Some people are no longer involved in the project. That's why they
> don't appear on the initial committer list.
> >
> > Regards
> > JB
> >
> > On 05/26/2016 05:45 PM, Jim Jagielski wrote:
> >> I am trying to align the list of initial committers with
> >> the list of current/active contributors, according to
> >> Github, and I am seeing people proposed who have not
> >> contributed anything and people NOT proposed who seem
> >> to be kinda active...
> >>
> >> Sooo..... -0
> >>
> >
> > --
> > Jean-Baptiste Onofré
> > jbonofre@apache.org
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com

Re: [VOTE] Accept CarbonData into the Apache Incubator

Posted by Jim Jagielski <ji...@jaguNET.com>.
Thx for the feedback...

I change my vote to +1 (binding)
> On May 27, 2016, at 1:46 AM, Jean-Baptiste Onofré <jb...@nanthrax.net> wrote:
> 
> Hi Jim,
> 
> good point. Let me try to explain this "gap" regarding my discussion with the team:
> 
> 1. Some people have been involved mostly in architecture and design more than directly in code. That's why they are part of the initial committer list, whereas they didn't really provide "visible" code on GitHub.
> 
> 2. Some people are no longer involved in the project. That's why they don't appear on the initial committer list.
> 
> Regards
> JB
> 
> On 05/26/2016 05:45 PM, Jim Jagielski wrote:
>> I am trying to align the list of initial committers with
>> the list of current/active contributors, according to
>> Github, and I am seeing people proposed who have not
>> contributed anything and people NOT proposed who seem
>> to be kinda active...
>> 
>> Sooo..... -0
>> 
>>> On May 25, 2016, at 4:24 PM, Jean-Baptiste Onofré <jb...@nanthrax.net> wrote:
>>> 
>>> Hi all,
>>> 
>>> following the discussion thread, I'm now calling a vote to accept CarbonData into the Incubator.
>>> 
>>> [ ] +1 Accept CarbonData into the Apache Incubator
>>> [ ] +0 Abstain
>>> [ ] -1 Do not accept CarbonData into the Apache Incubator, because ...
>>> 
>>> This vote is open for 72 hours.
>>> 
>>> The proposal follows, you can also access the wiki page:
>>> https://wiki.apache.org/incubator/CarbonDataProposal
>>> 
>>> Thanks !
>>> Regards
>>> JB
>>> 
>>> = Apache CarbonData =
>>> 
>>> == Abstract ==
>>> 
>>> Apache CarbonData is a new Apache Hadoop native file format for faster interactive
>>> queries. It uses advanced columnar storage, indexing, compression and encoding techniques
>>> to improve computing efficiency, which in turn helps speed up queries by an order of
>>> magnitude over petabytes of data.
>>> 
>>> CarbonData GitHub address: https://github.com/HuaweiBigData/carbondata
>>> 
>>> == Background ==
>>> 
>>> Huawei is an ICT solution provider committed to enhancing customer experiences for telecom carriers, enterprises, and consumers on big data. To satisfy the following customer requirements, we created a new Hadoop native file format:
>>> 
>>> * Support interactive OLAP-style queries over big data in seconds.
>>> * Support fast queries on individual records, which require touching all fields.
>>> * Load data quickly and support incremental loads at intervals of minutes.
>>> * Support HDFS so that customers can leverage their existing Hadoop clusters.
>>> * Support time-based data retention.
>>> 
>>> Based on these requirements, we investigated existing file formats in the Hadoop ecosystem, but we could not find a suitable solution that satisfied all requirements at the same time, so we started designing CarbonData.
>>> 
>>> == Rationale ==
>>> 
>>> CarbonData contains multiple modules, which are classified into two categories:
>>> 
>>> 1. CarbonData File Format: which contains core implementation for file format such as columnar,index,dictionary,encoding+compression,API for reading/writing etc.
>>> 2. CarbonData integration with big data processing framework such as Apache Spark, Apache Hive etc. Apache Beam is also planned to abstract the execution runtime.
>>> 
>>> === CarbonData File Format ===
>>> 
>>> CarbonData file format is a columnar store in HDFS, it has many features that a modern columnar format has, such as splittable, compression schema ,complex data type etc. And CarbonData has following unique features:
>>> 
>>> ==== Indexing ====
>>> 
>>> In order to support fast interactive query, CarbonData leverages indexing technology to reduce I/O scans. CarbonData files store data along with the index: the index is not stored separately, but is contained in the CarbonData file itself. In the current implementation, CarbonData supports 3 types of indexing:
>>> 
>>> 1. Multi-dimensional Key (B+ Tree index)
>>> The Data block are written in sequence to the disk and within each data blocks each column block is written in sequence. Finally, the metadata block for the file is written with information about byte positions of each block in the file, Min-Max statistics index and the start and end MDK of each data block. Since, the entire data in the file is in sorted order, the start and end MDK of each data block can be used to construct a B+Tree and the file can be logically  represented as a B+Tree with the data blocks as leaf nodes (on disk) and the remaining non-leaf nodes in memory.
>>> 2. Inverted index
>>> Inverted index is widely used in search engine. By using this index, it helps processing/query engine to do filtering inside one HDFS block. Furthermore, query acceleration for count distinct like operation is made possible when combining bitmap and inverted index in query time.
>>> 3. MinMax index
>>> For all columns, minmax index is created so that processing/query engine can skip scan that is not required.
>>> 
>>> ==== Global Dictionary ====
>>> 
>>> Besides I/O reduction, CarbonData accelerates computation by using global dictionary, which enables processing/query engines to perform all processing on encoded data without having to convert the data (Late Materialization). We have observed dramatic performance improvement for OLAP analytic scenario where table contains many columns in string data type. The data is converted back to the user readable form just before processing/query engine returning results to user.
>>> 
>>> ==== Column Group ====
>>> 
>>> Sometimes users want to perform processing/query on multi-columns in one table, for example, performing scan for individual record in troubleshooting scenario. In this case, row format is more efficient than columnar format since all columns will be touched by the workload. To accelerate this, CarbonData supports storing a group of column in row format, so data in column group is stored together and enable fast retrieval.
>>> 
>>> ==== Optimized for multiple use cases ====
>>> 
>>> CarbonData indices and dictionary are highly configurable. To optimize storage for different use cases, users can configure what to index, so they can decide on and tune the format before loading data into CarbonData.
>>> 
>>> For example
>>> 
>>> || Use Case || Supporting Features ||
>>> || Interactive OLAP query || Columnar format, Multi-dimensional Key (B+ Tree index), Minmax index, Inverted index ||
>>> || High throughput scan || Global dictionary, Minmax index ||
>>> || Low latency point query || Multi-dimensional Key (B+ Tree index), Partitioning ||
>>> || Individual record query || Column group, Global dictionary ||
>>> 
>>> === BigData Processing Framework Integration ===
>>> 
>>> * CarbonData provides InputFormat/OutputFormat interfaces for Reading/Writing data from the CarbonData files and at the same time provides abstract API for processing data stored as Carbondata format with data processing framework.
>>> * CarbonData provides deep integration with Apache Spark including predicate push down, column pruning, aggregation push down etc. So users can use Spark SQL to connect and query from CarbonData.
>>> * CarbonData can integrate with various big data Query/Processing framework on Hadoop eco-system such as Apache Spark,Apache Hive etc.
>>> 
>>> Example: https://github.com/HuaweiBigData/carbondata/blob/master/examples/src/main/scala/org/carbondata/examples/CarbonExample.scala
>>> 
>>> == Initial Goals ==
>>> 
>>> Our initial goals are to bring CarbonData into the ASF, transition internal engineering processes into the open, and foster a collaborative development model according to the "Apache Way".
>>> 
>>> == Current Status ==
>>> 
>>> CarbonData is production ready and already provides a large set of features.
>>> The current license is already Apache 2.0.
>>> 
>>> == Meritocracy ==
>>> 
>>> We intend to radically expand the initial developer and user community by running the project in accordance with the "Apache Way". Users and new contributors will be treated with respect and welcomed. By participating in the community and providing quality patches/support that move the project forward, they will earn merit. They also will be encouraged to provide non-code contributions (documentation, events, community management, etc.) and will gain merit for doing so. Those with a proven support and quality track record will be encouraged to become committers.
>>> 
>>> == Community ==
>>> 
>>> If CarbonData is accepted for incubation, the primary initial goal is to build a large community. We really trust that CarbonData will become a key project for big data column-like platforms, and so, we bet on a large community of users and developers.
>>> 
>>> == Known Risks ==
>>> 
>>> Development has been sponsored mostly by one company. For the project to fully transition to the Apache Way governance model, development must shift towards the meritocracy-centric model of growing a community of contributors balanced with the needs for extreme stability and core implementation coherency.
>>> 
>>> == Orphaned products ==
>>> 
>>> Huawei is fully committed to CarbonData. Moreover, Huawei has a vested interest in making CarbonData succeed by driving its close integration with sister ASF projects. We expect this to further reduce the risk of orphaning the product.
>>> 
>>> == Inexperience with Open Source ==
>>> 
>>> Huawei has been developing and using open source software for a long time. Additionally, several ASF veterans agreed to mentor the project and are listed in this proposal. The project will rely on their guidance and collective wisdom to quickly transition the entire team of initial committers towards practicing the Apache Way.
>>> 
>>> == Reliance on Salaried Developers ==
>>> 
>>> Most of the contributors are paid to work in big data space. While they might wander from their current employers, they are unlikely to venture far from their core expertises and thus will continue to be engaged with the project regardless of their current employers.
>>> 
>>> == An Excessive Fascination with the Apache Brand ==
>>> 
>>> While we intend to leverage the Apache ‘branding’ when talking to other projects as testament of our project’s ‘neutrality’, we have no plans for making use of Apache brand in press releases nor posting billboards advertising acceptance of CarbonData into Apache Incubator.
>>> 
>>> == Initial Source ==
>>> 
>>> https://github.com/HuaweiBigData/carbondata.git
>>> 
>>> == External Dependencies ==
>>> 
>>> All external dependencies are licensed under an Apache 2.0 license or
>>> Apache-compatible license. As we grow the Carbondata community we will
>>> configure our build process to require and validate all contributions
>>> and dependencies are licensed under the Apache 2.0 license or are under
>>> an Apache-compatible license.
>>> 
>>> * Apache Spark
>>> * Apache Hadoop
>>> * Apache Maven
>>> * Apache Commons
>>> * Apache Log4j
>>> * Apache Thrift
>>> * Apache Zookeeper
>>> * Scala
>>> * Snappy
>>> * Kettle (Pentaho)
>>> * Eigenbase
>>> * Fastutil
>>> * GSON
>>> * Jmockit
>>> * Junit
>>> 
>>> == Required Resources ==
>>> 
>>> === Mailing lists ===
>>> 
>>> * private@carbondata.incubator.apache.org (moderated subscriptions)
>>> * commits@carbondata.incubator.apache.org
>>> * dev@carbondata.incubator.apache.org
>>> * issues@carbondata.incubator.apache.org
>>> 
>>> === Git Repository ===
>>> 
>>> * https://git-wip-us.apache.org/repos/asf/incubator-carbondata.git
>>> 
>>> === Issue Tracking ===
>>> 
>>> * JIRA Project CarbonData (CarbonData)
>>> 
>>> === Initial Committers ===
>>> 
>>> * Liang Chenliang
>>> * Jean-Baptiste Onofré
>>> * Henry Saputra
>>> * Uma Maheswara Rao G
>>> * Jenny MA
>>> * Jacky Likun
>>> * Vimal Das Kammath
>>> * Jarray Qiuheng
>>> 
>>> === Affiliations ===
>>> 
>>> * Huawei: Liang Chenliang
>>> * Talend: Jean-Baptiste Onofré
>>> * Ebay: Henry Saputra
>>> * Intel: Uma Maheswara Rao G
>>> 
>>> === Sponsors ===
>>> 
>>> === Champion ===
>>> 
>>> * Jean-Baptiste Onofré - Apache Member
>>> 
>>> === Mentors ===
>>> 
>>> * Henry Saputra (eBay)
>>> * Jean-Baptiste Onofré (Talend)
>>> * Uma Maheswara Rao G (Intel)
>>> 
>>> === Sponsoring Entity ===
>>> 
>>> The Apache Incubator
>>> 
>>> 
>> 
>> 
>> 
> 
> -- 
> Jean-Baptiste Onofré
> jbonofre@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
> 
> 




Re: [VOTE] Accept CarbonData into the Apache Incubator

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
Hi Jim,

good point. Let me try to explain this "gap", based on my discussion
with the team:

1. Some people have been involved mostly in architecture and design rather
than directly in code. That's why they are part of the initial committer
list, even though they didn't really provide "visible" code on GitHub.

2. Some people are no longer involved in the project. That's why they
don't appear on the initial committer list.

Regards
JB

On 05/26/2016 05:45 PM, Jim Jagielski wrote:
> I am trying to align the list of initial committers with
> the list of current/active contributors, according to
> Github, and I am seeing people proposed who have not
> contributed anything and people NOT proposed who seem
> to be kinda active...
>
> Sooo..... -0
>
>> On May 25, 2016, at 4:24 PM, Jean-Baptiste Onofré <jb...@nanthrax.net> wrote:
>>
>> Hi all,
>>
>> following the discussion thread, I'm now calling a vote to accept CarbonData into the Incubator.
>>
>> [ ] +1 Accept CarbonData into the Apache Incubator
>> [ ] +0 Abstain
>> [ ] -1 Do not accept CarbonData into the Apache Incubator, because ...
>>
>> This vote is open for 72 hours.
>>
>> The proposal follows, you can also access the wiki page:
>> https://wiki.apache.org/incubator/CarbonDataProposal
>>
>> Thanks !
>> Regards
>> JB
>>
>> = Apache CarbonData =
>>
>> == Abstract ==
>>
>> Apache CarbonData is a new Apache Hadoop native file format for faster interactive
>> query using advanced columnar storage, index, compression and encoding techniques
>> to improve computing efficiency, in turn it will help speedup queries an order of
>> magnitude faster over PetaBytes of data.
>>
>> CarbonData github address: https://github.com/HuaweiBigData/carbondata
>>
>> == Background ==
>>
>> Huawei is an ICT solution provider, we are committed to enhancing customer experiences for telecom carriers, enterprises, and consumers on big data, In order to satisfy the following customer requirements, we created a new Hadoop native file format:
>>
>> * Support interactive OLAP-style query over big data in seconds.
>> * Support fast query on individual record which require touching all fields.
>> * Fast data loading speed and support incremental load in period of minutes.
>> * Support HDFS so that customer can leverage existing Hadoop cluster.
>> * Support time based data retention.
>>
>> Based on these requirements, we investigated existing file formats in the Hadoop eco-system, but we could not find a suitable solution that satisfies all of the requirements at the same time, so we started designing CarbonData.
>>
>> == Rationale ==
>>
>> CarbonData contains multiple modules, which are classified into two categories:
>>
>> 1. CarbonData File Format: which contains core implementation for file format such as columnar,index,dictionary,encoding+compression,API for reading/writing etc.
>> 2. CarbonData integration with big data processing framework such as Apache Spark, Apache Hive etc. Apache Beam is also planned to abstract the execution runtime.
>>
>> === CarbonData File Format ===
>>
>> CarbonData file format is a columnar store in HDFS, it has many features that a modern columnar format has, such as splittable, compression schema ,complex data type etc. And CarbonData has following unique features:
>>
>> ==== Indexing ====
>>
>> In order to support fast interactive query, CarbonData leverages indexing technology to reduce I/O scans. CarbonData files store data along with the index: the index is not stored separately, but is contained in the CarbonData file itself. In the current implementation, CarbonData supports 3 types of indexing:
>>
>> 1. Multi-dimensional Key (B+ Tree index)
>> The Data block are written in sequence to the disk and within each data blocks each column block is written in sequence. Finally, the metadata block for the file is written with information about byte positions of each block in the file, Min-Max statistics index and the start and end MDK of each data block. Since, the entire data in the file is in sorted order, the start and end MDK of each data block can be used to construct a B+Tree and the file can be logically  represented as a B+Tree with the data blocks as leaf nodes (on disk) and the remaining non-leaf nodes in memory.
>> 2. Inverted index
>> Inverted index is widely used in search engine. By using this index, it helps processing/query engine to do filtering inside one HDFS block. Furthermore, query acceleration for count distinct like operation is made possible when combining bitmap and inverted index in query time.
>> 3. MinMax index
>> For all columns, minmax index is created so that processing/query engine can skip scan that is not required.
>>
>> ==== Global Dictionary ====
>>
>> Besides I/O reduction, CarbonData accelerates computation by using global dictionary, which enables processing/query engines to perform all processing on encoded data without having to convert the data (Late Materialization). We have observed dramatic performance improvement for OLAP analytic scenario where table contains many columns in string data type. The data is converted back to the user readable form just before processing/query engine returning results to user.
>>
>> ==== Column Group ====
>>
>> Sometimes users want to perform processing/query on multi-columns in one table, for example, performing scan for individual record in troubleshooting scenario. In this case, row format is more efficient than columnar format since all columns will be touched by the workload. To accelerate this, CarbonData supports storing a group of column in row format, so data in column group is stored together and enable fast retrieval.
>>
>> ==== Optimized for multiple use cases ====
>>
>> CarbonData indices and dictionary are highly configurable. To optimize storage for different use cases, users can configure what to index, so they can decide on and tune the format before loading data into CarbonData.
>>
>> For example
>>
>> || Use Case || Supporting Features ||
>> || Interactive OLAP query || Columnar format, Multi-dimensional Key (B+ Tree index), Minmax index, Inverted index ||
>> || High throughput scan || Global dictionary, Minmax index ||
>> || Low latency point query || Multi-dimensional Key (B+ Tree index), Partitioning ||
>> || Individual record query || Column group, Global dictionary ||
>>
>> === BigData Processing Framework Integration ===
>>
>> * CarbonData provides InputFormat/OutputFormat interfaces for Reading/Writing data from the CarbonData files and at the same time provides abstract API for processing data stored as Carbondata format with data processing framework.
>> * CarbonData provides deep integration with Apache Spark including predicate push down, column pruning, aggregation push down etc. So users can use Spark SQL to connect and query from CarbonData.
>> * CarbonData can integrate with various big data Query/Processing framework on Hadoop eco-system such as Apache Spark,Apache Hive etc.
>>
>> Example: https://github.com/HuaweiBigData/carbondata/blob/master/examples/src/main/scala/org/carbondata/examples/CarbonExample.scala
>>
>> == Initial Goals ==
>>
>> Our initial goals are to bring CarbonData into the ASF, transition internal engineering processes into the open, and foster a collaborative development model according to the "Apache Way".
>>
>> == Current Status ==
>>
>> CarbonData is production ready and already provides a large set of features.
>> The current license is already Apache 2.0.
>>
>> == Meritocracy ==
>>
>> We intend to radically expand the initial developer and user community by running the project in accordance with the "Apache Way". Users and new contributors will be treated with respect and welcomed. By participating in the community and providing quality patches/support that move the project forward, they will earn merit. They also will be encouraged to provide non-code contributions (documentation, events, community management, etc.) and will gain merit for doing so. Those with a proven support and quality track record will be encouraged to become committers.
>>
>> == Community ==
>>
>> If CarbonData is accepted for incubation, the primary initial goal is to build a large community. We really trust that CarbonData will become a key project for big data column-like platforms, and so, we bet on a large community of users and developers.
>>
>> == Known Risks ==
>>
>> Development has been sponsored mostly by one company. For the project to fully transition to the Apache Way governance model, development must shift towards the meritocracy-centric model of growing a community of contributors balanced with the needs for extreme stability and core implementation coherency.
>>
>> == Orphaned products ==
>>
>> Huawei is fully committed to CarbonData. Moreover, Huawei has a vested interest in making CarbonData succeed by driving its close integration with sister ASF projects. We expect this to further reduce the risk of orphaning the product.
>>
>> == Inexperience with Open Source ==
>>
>> Huawei has been developing and using open source software for a long time. Additionally, several ASF veterans agreed to mentor the project and are listed in this proposal. The project will rely on their guidance and collective wisdom to quickly transition the entire team of initial committers towards practicing the Apache Way.
>>
>> == Reliance on Salaried Developers ==
>>
>> Most of the contributors are paid to work in big data space. While they might wander from their current employers, they are unlikely to venture far from their core expertises and thus will continue to be engaged with the project regardless of their current employers.
>>
>> == An Excessive Fascination with the Apache Brand ==
>>
>> While we intend to leverage the Apache ‘branding’ when talking to other projects as testament of our project’s ‘neutrality’, we have no plans for making use of Apache brand in press releases nor posting billboards advertising acceptance of CarbonData into Apache Incubator.
>>
>> == Initial Source ==
>>
>> https://github.com/HuaweiBigData/carbondata.git
>>
>> == External Dependencies ==
>>
>> All external dependencies are licensed under an Apache 2.0 license or
>> Apache-compatible license. As we grow the Carbondata community we will
>> configure our build process to require and validate all contributions
>> and dependencies are licensed under the Apache 2.0 license or are under
>> an Apache-compatible license.
>>
>> * Apache Spark
>> * Apache Hadoop
>> * Apache Maven
>> * Apache Commons
>> * Apache Log4j
>> * Apache Thrift
>> * Apache Zookeeper
>> * Scala
>> * Snappy
>> * Kettle (Pentaho)
>> * Eigenbase
>> * Fastutil
>> * GSON
>> * Jmockit
>> * Junit
>>
>> == Required Resources ==
>>
>> === Mailing lists ===
>>
>> * private@carbondata.incubator.apache.org (moderated subscriptions)
>> * commits@carbondata.incubator.apache.org
>> * dev@carbondata.incubator.apache.org
>> * issues@carbondata.incubator.apache.org
>>
>> === Git Repository ===
>>
>> * https://git-wip-us.apache.org/repos/asf/incubator-carbondata.git
>>
>> === Issue Tracking ===
>>
>> * JIRA Project CarbonData (CarbonData)
>>
>> === Initial Committers ===
>>
>> * Liang Chenliang
>> * Jean-Baptiste Onofré
>> * Henry Saputra
>> * Uma Maheswara Rao G
>> * Jenny MA
>> * Jacky Likun
>> * Vimal Das Kammath
>> * Jarray Qiuheng
>>
>> === Affiliations ===
>>
>> * Huawei: Liang Chenliang
>> * Talend: Jean-Baptiste Onofré
>> * Ebay: Henry Saputra
>> * Intel: Uma Maheswara Rao G
>>
>> === Sponsors ===
>>
>> === Champion ===
>>
>> * Jean-Baptiste Onofré - Apache Member
>>
>> === Mentors ===
>>
>> * Henry Saputra (eBay)
>> * Jean-Baptiste Onofré (Talend)
>> * Uma Maheswara Rao G (Intel)
>>
>> === Sponsoring Entity ===
>>
>> The Apache Incubator
>>
>>
>
>
>

-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com



Re: [VOTE] Accept CarbonData into the Apache Incubator

Posted by Jim Jagielski <ji...@jaguNET.com>.
I am trying to align the list of initial committers with
the list of current/active contributors, according to
Github, and I am seeing people proposed who have not
contributed anything and people NOT proposed who seem
to be kinda active...

Sooo..... -0

> On May 25, 2016, at 4:24 PM, Jean-Baptiste Onofré <jb...@nanthrax.net> wrote:
> 
> Hi all,
> 
> following the discussion thread, I'm now calling a vote to accept CarbonData into the Incubator.
> 
> [ ] +1 Accept CarbonData into the Apache Incubator
> [ ] +0 Abstain
> [ ] -1 Do not accept CarbonData into the Apache Incubator, because ...
> 
> This vote is open for 72 hours.
> 
> The proposal follows, you can also access the wiki page:
> https://wiki.apache.org/incubator/CarbonDataProposal
> 
> Thanks !
> Regards
> JB
> 
> = Apache CarbonData =
> 
> == Abstract ==
> 
> Apache CarbonData is a new Apache Hadoop native file format for faster interactive
> query using advanced columnar storage, index, compression and encoding techniques
> to improve computing efficiency, in turn it will help speedup queries an order of
> magnitude faster over PetaBytes of data.
> 
> CarbonData github address: https://github.com/HuaweiBigData/carbondata
> 
> == Background ==
> 
> Huawei is an ICT solution provider, we are committed to enhancing customer experiences for telecom carriers, enterprises, and consumers on big data, In order to satisfy the following customer requirements, we created a new Hadoop native file format:
> 
> * Support interactive OLAP-style query over big data in seconds.
> * Support fast query on individual record which require touching all fields.
> * Fast data loading speed and support incremental load in period of minutes.
> * Support HDFS so that customer can leverage existing Hadoop cluster.
> * Support time based data retention.
> 
> Based on these requirements, we investigated existing file formats in the Hadoop eco-system, but we could not find a suitable solution that satisfies all of the requirements at the same time, so we started designing CarbonData.
> 
> == Rationale ==
> 
> CarbonData contains multiple modules, which are classified into two categories:
> 
> 1. CarbonData File Format: which contains core implementation for file format such as columnar,index,dictionary,encoding+compression,API for reading/writing etc.
> 2. CarbonData integration with big data processing framework such as Apache Spark, Apache Hive etc. Apache Beam is also planned to abstract the execution runtime.
> 
> === CarbonData File Format ===
> 
> CarbonData file format is a columnar store in HDFS, it has many features that a modern columnar format has, such as splittable, compression schema ,complex data type etc. And CarbonData has following unique features:
> 
> ==== Indexing ====
> 
> In order to support fast interactive query, CarbonData leverages indexing technology to reduce I/O scans. CarbonData files store data along with the index: the index is not stored separately, but is contained in the CarbonData file itself. In the current implementation, CarbonData supports 3 types of indexing:
> 
> 1. Multi-dimensional Key (B+ Tree index)
> The Data block are written in sequence to the disk and within each data blocks each column block is written in sequence. Finally, the metadata block for the file is written with information about byte positions of each block in the file, Min-Max statistics index and the start and end MDK of each data block. Since, the entire data in the file is in sorted order, the start and end MDK of each data block can be used to construct a B+Tree and the file can be logically  represented as a B+Tree with the data blocks as leaf nodes (on disk) and the remaining non-leaf nodes in memory.
> 2. Inverted index
> Inverted index is widely used in search engine. By using this index, it helps processing/query engine to do filtering inside one HDFS block. Furthermore, query acceleration for count distinct like operation is made possible when combining bitmap and inverted index in query time.
> 3. MinMax index
> For all columns, minmax index is created so that processing/query engine can skip scan that is not required.
> 
> ==== Global Dictionary ====
> 
> Besides I/O reduction, CarbonData accelerates computation by using global dictionary, which enables processing/query engines to perform all processing on encoded data without having to convert the data (Late Materialization). We have observed dramatic performance improvement for OLAP analytic scenario where table contains many columns in string data type. The data is converted back to the user readable form just before processing/query engine returning results to user.
> 
> ==== Column Group ====
> 
> Sometimes users want to perform processing/query on multi-columns in one table, for example, performing scan for individual record in troubleshooting scenario. In this case, row format is more efficient than columnar format since all columns will be touched by the workload. To accelerate this, CarbonData supports storing a group of column in row format, so data in column group is stored together and enable fast retrieval.
> 
> ==== Optimized for multiple use cases ====
> 
> CarbonData indices and dictionary are highly configurable. To optimize storage for different use cases, users can configure what to index, so they can decide on and tune the format before loading data into CarbonData.
> 
> For example
> 
> || Use Case || Supporting Features ||
> || Interactive OLAP query || Columnar format, Multi-dimensional Key (B+ Tree index), Minmax index, Inverted index ||
> || High throughput scan || Global dictionary, Minmax index ||
> || Low latency point query || Multi-dimensional Key (B+ Tree index), Partitioning ||
> || Individual record query || Column group, Global dictionary ||
> 
> === BigData Processing Framework Integration ===
> 
> * CarbonData provides InputFormat/OutputFormat interfaces for Reading/Writing data from the CarbonData files and at the same time provides abstract API for processing data stored as Carbondata format with data processing framework.
> * CarbonData provides deep integration with Apache Spark including predicate push down, column pruning, aggregation push down etc. So users can use Spark SQL to connect and query from CarbonData.
> * CarbonData can integrate with various big data Query/Processing framework on Hadoop eco-system such as Apache Spark,Apache Hive etc.
> 
> Example: https://github.com/HuaweiBigData/carbondata/blob/master/examples/src/main/scala/org/carbondata/examples/CarbonExample.scala
> 
> == Initial Goals ==
> 
> Our initial goals are to bring CarbonData into the ASF, transition internal engineering processes into the open, and foster a collaborative development model according to the "Apache Way".
> 
> == Current Status ==
> 
> CarbonData is production ready and already provides a large set of features.
> The current license is already Apache 2.0.
> 
> == Meritocracy ==
> 
> We intend to radically expand the initial developer and user community by running the project in accordance with the "Apache Way". Users and new contributors will be treated with respect and welcomed. By participating in the community and providing quality patches/support that move the project forward, they will earn merit. They also will be encouraged to provide non-code contributions (documentation, events, community management, etc.) and will gain merit for doing so. Those with a proven support and quality track record will be encouraged to become committers.
> 
> == Community ==
> 
> If CarbonData is accepted for incubation, the primary initial goal is to build a large community. We really trust that CarbonData will become a key project for big data column-like platforms, and so, we bet on a large community of users and developers.
> 
> == Known Risks ==
> 
> Development has been sponsored mostly by one company. For the project to fully transition to the Apache Way governance model, development must shift towards the meritocracy-centric model of growing a community of contributors balanced with the needs for extreme stability and core implementation coherency.
> 
> == Orphaned products ==
> 
> Huawei is fully committed to CarbonData. Moreover, Huawei has a vested interest in making CarbonData succeed by driving its close integration with sister ASF projects. We expect this to further reduce the risk of orphaning the product.
> 
> == Inexperience with Open Source ==
> 
> Huawei has been developing and using open source software for a long time. Additionally, several ASF veterans agreed to mentor the project and are listed in this proposal. The project will rely on their guidance and collective wisdom to quickly transition the entire team of initial committers towards practicing the Apache Way.
> 
> == Reliance on Salaried Developers ==
> 
> Most of the contributors are paid to work in big data space. While they might wander from their current employers, they are unlikely to venture far from their core expertises and thus will continue to be engaged with the project regardless of their current employers.
> 
> == An Excessive Fascination with the Apache Brand ==
> 
> While we intend to leverage the Apache ‘branding’ when talking to other projects as testament of our project’s ‘neutrality’, we have no plans for making use of Apache brand in press releases nor posting billboards advertising acceptance of CarbonData into Apache Incubator.
> 
> == Initial Source ==
> 
> https://github.com/HuaweiBigData/carbondata.git
> 
> == External Dependencies ==
> 
> All external dependencies are licensed under an Apache 2.0 license or
> Apache-compatible license. As we grow the Carbondata community we will
> configure our build process to require and validate all contributions
> and dependencies are licensed under the Apache 2.0 license or are under
> an Apache-compatible license.
> 
> * Apache Spark
> * Apache Hadoop
> * Apache Maven
> * Apache Commons
> * Apache Log4j
> * Apache Thrift
> * Apache Zookeeper
> * Scala
> * Snappy
> * Kettle (Pentaho)
> * Eigenbase
> * Fastutil
> * GSON
> * Jmockit
> * Junit
> 
> == Required Resources ==
> 
> === Mailing lists ===
> 
> * private@carbondata.incubator.apache.org (moderated subscriptions)
> * commits@carbondata.incubator.apache.org
> * dev@carbondata.incubator.apache.org
> * issues@carbondata.incubator.apache.org
> 
> === Git Repository ===
> 
> * https://git-wip-us.apache.org/repos/asf/incubator-carbondata.git
> 
> === Issue Tracking ===
> 
> * JIRA Project CarbonData (CarbonData)
> 
> === Initial Committers ===
> 
> * Liang Chenliang
> * Jean-Baptiste Onofré
> * Henry Saputra
> * Uma Maheswara Rao G
> * Jenny MA
> * Jacky Likun
> * Vimal Das Kammath
> * Jarray Qiuheng
> 
> === Affiliations ===
> 
> * Huawei: Liang Chenliang
> * Talend: Jean-Baptiste Onofré
> * Ebay: Henry Saputra
> * Intel: Uma Maheswara Rao G
> 
> === Sponsors ===
> 
> === Champion ===
> 
> * Jean-Baptiste Onofré - Apache Member
> 
> === Mentors ===
> 
> * Henry Saputra (eBay)
> * Jean-Baptiste Onofré (Talend)
> * Uma Maheswara Rao G (Intel)
> 
> === Sponsoring Entity ===
> 
> The Apache Incubator
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
> For additional commands, e-mail: general-help@incubator.apache.org
> 



Re: [VOTE] Accept CarbonData into the Apache Incubator

Posted by David E Jones <de...@dejc.com>.
+1

-David (jonesde@a.o)



Re: [VOTE] Accept CarbonData into the Apache Incubator

Posted by Jake Farrell <jf...@apache.org>.
+1 (binding)

-Jake


[RESULT][VOTE] Accept CarbonData into the Apache Incubator

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
Hi,

I am closing this vote, which received only +1 votes: welcome to Apache
CarbonData in the Incubator!

I will request creation of the resources.

Thanks all for your votes.

Regards
JB

On 05/25/2016 10:24 PM, Jean-Baptiste Onofré wrote:
> Hi all,
>
> following the discussion thread, I'm now calling a vote to accept
> CarbonData into the Incubator.
>
> \u200b[ ] +1 Accept CarbonData into the Apache Incubator
> [ ] +0 Abstain
> [ ] -1 Do not accept CarbonData into the Apache Incubator, because ...
>
> This vote is open for 72 hours.
>
> The proposal follows, you can also access the wiki page:
> https://wiki.apache.org/incubator/CarbonDataProposal
>
> Thanks !
> Regards
> JB
>
> = Apache CarbonData =
>
> == Abstract ==
>
> Apache CarbonData is a new Apache Hadoop native file format for faster
> interactive
> query using advanced columnar storage, index, compression and encoding
> techniques
> to improve computing efficiency, in turn it will help speedup queries an
> order of
> magnitude faster over PetaBytes of data.
>
> CarbonData github address: https://github.com/HuaweiBigData/carbondata
>
> == Background ==
>
> Huawei is an ICT solution provider, we are committed to enhancing
> customer experiences for telecom carriers, enterprises, and consumers on
> big data, In order to satisfy the following customer requirements, we
> created a new Hadoop native file format:
>
>   * Support interactive OLAP-style query over big data in seconds.
>   * Support fast query on individual record which require touching all
> fields.
>   * Fast data loading speed and support incremental load in period of
> minutes.
>   * Support HDFS so that customer can leverage existing Hadoop cluster.
>   * Support time based data retention.
>
> Based on these requirements, we investigated existing file formats in
> the Hadoop eco-system, but we could not find a suitable solution that
> satisfying requirements all at the same time, so we start designing
> CarbonData.
>
> == Rationale ==
>
> CarbonData contains multiple modules, which are classified into two
> categories:
>
>   1. CarbonData File Format: which contains core implementation for file
> format such as columnar,index,dictionary,encoding+compression,API for
> reading/writing etc.
>   2. CarbonData integration with big data processing framework such as
> Apache Spark, Apache Hive etc. Apache Beam is also planned to abstract
> the execution runtime.
>
> === CarbonData File Format ===
>
> CarbonData file format is a columnar store in HDFS, it has many features
> that a modern columnar format has, such as splittable, compression
> schema ,complex data type etc. And CarbonData has following unique
> features:
>
> ==== Indexing ====
>
> In order to support fast interactive query, CarbonData leverage indexing
> technology to reduce I/O scans. CarbonData files stores data along with
> index, the index is not stored separately but the CarbonData file itself
> contains the index. In current implementation, CarbonData supports 3
> types of indexing:
>
> 1. Multi-dimensional Key (B+ Tree index)
>   The Data block are written in sequence to the disk and within each
> data blocks each column block is written in sequence. Finally, the
> metadata block for the file is written with information about byte
> positions of each block in the file, Min-Max statistics index and the
> start and end MDK of each data block. Since, the entire data in the file
> is in sorted order, the start and end MDK of each data block can be used
> to construct a B+Tree and the file can be logically  represented as a
> B+Tree with the data blocks as leaf nodes (on disk) and the remaining
> non-leaf nodes in memory.
> 2. Inverted index
>   Inverted index is widely used in search engine. By using this index,
> it helps processing/query engine to do filtering inside one HDFS block.
> Furthermore, query acceleration for count distinct like operation is
> made possible when combining bitmap and inverted index in query time.
> 3. MinMax index
>   For all columns, minmax index is created so that processing/query
> engine can skip scan that is not required.
>
> ==== Global Dictionary ====
>
> Besides I/O reduction, CarbonData accelerates computation by using
> global dictionary, which enables processing/query engines to perform all
> processing on encoded data without having to convert the data (Late
> Materialization). We have observed dramatic performance improvement for
> OLAP analytic scenario where table contains many columns in string data
> type. The data is converted back to the user readable form just before
> processing/query engine returning results to user.
>
> ==== Column Group ====
>
> Sometimes users want to perform processing/query on multi-columns in one
> table, for example, performing scan for individual record in
> troubleshooting scenario. In this case, row format is more efficient
> than columnar format since all columns will be touched by the workload.
> To accelerate this, CarbonData supports storing a group of column in row
> format, so data in column group is stored together and enable fast
> retrieval.
>

-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com



RE: [VOTE] Accept CarbonData into the Apache Incubator

Posted by "Zheng, Kai" <ka...@intel.com>.
+1 (non-binding)

Regards,
Kai

-----Original Message-----
From: Gangumalla, Uma [mailto:uma.gangumalla@intel.com] 
Sent: Friday, May 27, 2016 1:10 AM
To: general@incubator.apache.org
Subject: Re: [VOTE] Accept CarbonData into the Apache Incubator

+1 (binding)

Regards,
Uma

On 5/25/16, 1:24 PM, "Jean-Baptiste Onofré" <jb...@nanthrax.net> wrote:

>Hi all,
>
>following the discussion thread, I'm now calling a vote to accept 
>CarbonData into the Incubator.
>
>[ ] +1 Accept CarbonData into the Apache Incubator
>[ ] +0 Abstain
>[ ] -1 Do not accept CarbonData into the Apache Incubator, because ...
>
>This vote is open for 72 hours.
>
>The proposal follows, you can also access the wiki page:
>https://wiki.apache.org/incubator/CarbonDataProposal
>
>Thanks !
>Regards
>JB



Re: [VOTE] Accept CarbonData into the Apache Incubator

Posted by "Gangumalla, Uma" <um...@intel.com>.
+1 (binding)

Regards,
Uma

On 5/25/16, 1:24 PM, "Jean-Baptiste Onofré" <jb...@nanthrax.net> wrote:

>Hi all,
>
>following the discussion thread, I'm now calling a vote to accept
>CarbonData into the Incubator.
>
>[ ] +1 Accept CarbonData into the Apache Incubator
>[ ] +0 Abstain
>[ ] -1 Do not accept CarbonData into the Apache Incubator, because ...
>
>This vote is open for 72 hours.
>
>The proposal follows, you can also access the wiki page:
>https://wiki.apache.org/incubator/CarbonDataProposal
>
>Thanks !
>Regards
>JB


Re: [VOTE] Accept CarbonData into the Apache Incubator

Posted by Henry Saputra <he...@gmail.com>.
+1 (binding)

On Wednesday, May 25, 2016, Jean-Baptiste Onofré <jb...@nanthrax.net> wrote:

> Hi all,
>
> following the discussion thread, I'm now calling a vote to accept
> CarbonData into the Incubator.
>
> [ ] +1 Accept CarbonData into the Apache Incubator
> [ ] +0 Abstain
> [ ] -1 Do not accept CarbonData into the Apache Incubator, because ...
>
> This vote is open for 72 hours.
>
> The proposal follows, you can also access the wiki page:
> https://wiki.apache.org/incubator/CarbonDataProposal
>
> Thanks !
> Regards
> JB
>
> = Apache CarbonData =
>
> == Abstract ==
>
> Apache CarbonData is a new Apache Hadoop native file format for faster
> interactive
> query using advanced columnar storage, index, compression and encoding
> techniques
> to improve computing efficiency, in turn it will help speedup queries an
> order of
> magnitude faster over PetaBytes of data.
>
> CarbonData github address: https://github.com/HuaweiBigData/carbondata
>
> == Background ==
>
> Huawei is an ICT solution provider, we are committed to enhancing customer
> experiences for telecom carriers, enterprises, and consumers on big data,
> In order to satisfy the following customer requirements, we created a new
> Hadoop native file format:
>
>  * Support interactive OLAP-style query over big data in seconds.
>  * Support fast query on individual record which require touching all
> fields.
>  * Fast data loading speed and support incremental load in period of
> minutes.
>  * Support HDFS so that customer can leverage existing Hadoop cluster.
>  * Support time based data retention.
>
> Based on these requirements, we investigated existing file formats in the
> Hadoop eco-system, but we could not find a suitable solution that
> satisfying requirements all at the same time, so we start designing
> CarbonData.
>
> == Rationale ==
>
> CarbonData contains multiple modules, which are classified into two
> categories:
>
>  1. CarbonData File Format: which contains core implementation for file
> format such as columnar,index,dictionary,encoding+compression,API for
> reading/writing etc.
>  2. CarbonData integration with big data processing framework such as
> Apache Spark, Apache Hive etc. Apache Beam is also planned to abstract the
> execution runtime.
>
> === CarbonData File Format ===
>
> CarbonData file format is a columnar store in HDFS, it has many features
> that a modern columnar format has, such as splittable, compression schema
> ,complex data type etc. And CarbonData has following unique features:
>
> ==== Indexing ====
>
> In order to support fast interactive query, CarbonData leverage indexing
> technology to reduce I/O scans. CarbonData files stores data along with
> index, the index is not stored separately but the CarbonData file itself
> contains the index. In current implementation, CarbonData supports 3 types
> of indexing:
>
> 1. Multi-dimensional Key (B+ Tree index)
>  The Data block are written in sequence to the disk and within each data
> blocks each column block is written in sequence. Finally, the metadata
> block for the file is written with information about byte positions of each
> block in the file, Min-Max statistics index and the start and end MDK of
> each data block. Since, the entire data in the file is in sorted order, the
> start and end MDK of each data block can be used to construct a B+Tree and
> the file can be logically  represented as a B+Tree with the data blocks as
> leaf nodes (on disk) and the remaining non-leaf nodes in memory.
> 2. Inverted index
>  Inverted index is widely used in search engine. By using this index, it
> helps processing/query engine to do filtering inside one HDFS block.
> Furthermore, query acceleration for count distinct like operation is made
> possible when combining bitmap and inverted index in query time.
> 3. MinMax index
>  For all columns, minmax index is created so that processing/query engine
> can skip scan that is not required.
>
> ==== Global Dictionary ====
>
> Besides I/O reduction, CarbonData accelerates computation by using a
> global dictionary, which enables processing/query engines to perform all
> processing on encoded data without having to convert the data (Late
> Materialization). We have observed dramatic performance improvements in
> OLAP analytic scenarios where tables contain many columns of string data
> type. The data is converted back to user-readable form just before the
> processing/query engine returns results to the user.
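
As a rough illustration of the idea above, the following sketch
dictionary-encodes a string column, aggregates on the integer codes, and
decodes only the final result. The function name and the sample data are
made up for the example; CarbonData's actual dictionary format is not
shown here.

```python
from collections import Counter
from typing import Dict, List


def build_dictionary(values: List[str]) -> Dict[str, int]:
    """Assign each distinct string an integer code (sorted for stable codes)."""
    return {value: code for code, value in enumerate(sorted(set(values)))}


cities = ["Paris", "Shenzhen", "Paris", "Bangalore", "Shenzhen", "Paris"]

dictionary = build_dictionary(cities)
decode = {code: value for value, code in dictionary.items()}

# The column is stored and processed as small integers.
encoded = [dictionary[value] for value in cities]

# The group-by count runs entirely on the encoded data.
counts_by_code = Counter(encoded)

# Late materialization: strings are restored only for the final result.
result = {decode[code]: n for code, n in counts_by_code.items()}
```

The aggregation compares and hashes integers rather than strings, which is
where the benefit for string-heavy tables would come from.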
>
> ==== Column Group ====
>
> Sometimes users want to run processing/queries over multiple columns in
> one table, for example, scanning for an individual record in a
> troubleshooting scenario. In this case, row format is more efficient than
> columnar format since the workload touches all columns. To accelerate
> this, CarbonData supports storing a group of columns in row format, so
> the data in a column group is stored together, which enables fast
> retrieval.
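
A toy comparison of the two layouts, assuming a three-column table; the
column names and both fetch functions are hypothetical and only meant to
show why a column group needs fewer lookups per record.

```python
from typing import Dict, List, Tuple

# Purely columnar layout: one list per column.
columnar: Dict[str, List] = {
    "id": [1, 2, 3],
    "host": ["a", "b", "c"],
    "status": [200, 500, 200],
}

# Layout with a column group: host and status are stored together per row.
ids: List[int] = [1, 2, 3]
host_status_group: List[Tuple[str, int]] = [("a", 200), ("b", 500), ("c", 200)]


def fetch_record_columnar(row: int) -> Tuple[int, str, int]:
    """Rebuild one record by visiting every column separately."""
    return (columnar["id"][row], columnar["host"][row], columnar["status"][row])


def fetch_record_grouped(row: int) -> Tuple[int, str, int]:
    """Rebuild one record with a single read for the whole column group."""
    host, status = host_status_group[row]
    return (ids[row], host, status)
```

Both functions return the same record; the grouped layout reads the
host/status pair in one access instead of one access per column.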
>
> ==== Optimized for multiple use cases ====
>
> CarbonData indices and dictionary are highly configurable. To optimize
> storage for different use cases, users can configure what to index, and
> so can decide on and tune the format before loading data into CarbonData.
>
> For example
>
> || Use Case || Supporting Features ||
> || Interactive OLAP query || Columnar format, Multi-dimensional Key (B+ Tree index), Minmax index, Inverted index ||
> || High throughput scan || Global dictionary, Minmax index ||
> || Low latency point query || Multi-dimensional Key (B+ Tree index), Partitioning ||
> || Individual record query || Column group, Global dictionary ||
>
> === BigData Processing Framework Integration ===
>
>  * CarbonData provides InputFormat/OutputFormat interfaces for reading
> and writing data in CarbonData files, and also provides an abstract API
> for processing data stored in CarbonData format with data processing
> frameworks.
>  * CarbonData provides deep integration with Apache Spark, including
> predicate push down, column pruning, aggregation push down etc., so users
> can use Spark SQL to connect to and query CarbonData.
>  * CarbonData can integrate with various big data query/processing
> frameworks in the Hadoop eco-system such as Apache Spark and Apache Hive.
>
> Example:
> https://github.com/HuaweiBigData/carbondata/blob/master/examples/src/main/scala/org/carbondata/examples/CarbonExample.scala
>
> == Initial Goals ==
>
> Our initial goals are to bring CarbonData into the ASF, transition
> internal engineering processes into the open, and foster a collaborative
> development model according to the "Apache Way".
>
> == Current Status ==
>
> CarbonData is production ready and already provides a large set of features.
> The current license is already Apache 2.0.
>
> == Meritocracy ==
>
> We intend to radically expand the initial developer and user community by
> running the project in accordance with the "Apache Way". Users and new
> contributors will be treated with respect and welcomed. By participating in
> the community and providing quality patches/support that move the project
> forward, they will earn merit. They also will be encouraged to provide
> non-code contributions (documentation, events, community management, etc.)
> and will gain merit for doing so. Those with a proven support and quality
> track record will be encouraged to become committers.
>
> == Community ==
>
> If CarbonData is accepted for incubation, the primary initial goal is to
> build a large community. We really trust that CarbonData will become a key
> project for big data column-like platforms, and so, we bet on a large
> community of users and developers.
>
> == Known Risks ==
>
> Development has been sponsored mostly by one company. For the project to
> fully transition to the Apache Way governance model, development must shift
> towards the meritocracy-centric model of growing a community of
> contributors balanced with the needs for extreme stability and core
> implementation coherency.
>
> == Orphaned products ==
>
> Huawei is fully committed to CarbonData. Moreover, Huawei has a vested
> interest in making CarbonData succeed by driving its close integration with
> sister ASF projects. We expect this to further reduce the risk of
> orphaning the product.
>
> == Inexperience with Open Source ==
>
> Huawei has been developing and using open source software for a long
> time. Additionally, several ASF veterans have agreed to mentor the project and
> are listed in this proposal. The project will rely on their guidance and
> collective wisdom to quickly transition the entire team of initial
> committers towards practicing the Apache Way.
>
> == Reliance on Salaried Developers ==
>
> Most of the contributors are paid to work in the big data space. While they
> might wander from their current employers, they are unlikely to venture far
> from their core expertise and thus will continue to be engaged with the
> project regardless of their current employers.
>
> == An Excessive Fascination with the Apache Brand ==
>
> While we intend to leverage the Apache ‘branding’ when talking to other
> projects as a testament to our project’s ‘neutrality’, we have no plans for
> using the Apache brand in press releases or posting billboards
> advertising acceptance of CarbonData into the Apache Incubator.
>
> == Initial Source ==
>
> https://github.com/HuaweiBigData/carbondata.git
>
> == External Dependencies ==
>
> All external dependencies are licensed under an Apache 2.0 license or
> Apache-compatible license. As we grow the CarbonData community we will
> configure our build process to require and validate that all contributions
> and dependencies are licensed under the Apache 2.0 license or are under
> an Apache-compatible license.
>
>  * Apache Spark
>  * Apache Hadoop
>  * Apache Maven
>  * Apache Commons
>  * Apache Log4j
>  * Apache Thrift
>  * Apache Zookeeper
>  * Scala
>  * Snappy
>  * Kettle (Pentaho)
>  * Eigenbase
>  * Fastutil
>  * GSON
>  * Jmockit
>  * Junit
>
> == Required Resources ==
>
> === Mailing lists ===
>
>  * private@carbondata.incubator.apache.org (moderated subscriptions)
>  * commits@carbondata.incubator.apache.org
>  * dev@carbondata.incubator.apache.org
>  * issues@carbondata.incubator.apache.org
>
> === Git Repository ===
>
>  * https://git-wip-us.apache.org/repos/asf/incubator-carbondata.git
>
> === Issue Tracking ===
>
>  * JIRA Project CarbonData (CarbonData)
>
> === Initial Committers ===
>
>  * Liang Chenliang
>  * Jean-Baptiste Onofré
>  * Henry Saputra
>  * Uma Maheswara Rao G
>  * Jenny MA
>  * Jacky Likun
>  * Vimal Das Kammath
>  * Jarray Qiuheng
>
> === Affiliations ===
>
>  * Huawei: Liang Chenliang
>  * Talend: Jean-Baptiste Onofré
>  * eBay: Henry Saputra
>  * Intel: Uma Maheswara Rao G
>
> === Sponsors ===
>
> === Champion ===
>
>  * Jean-Baptiste Onofré - Apache Member
>
> === Mentors ===
>
>  * Henry Saputra (eBay)
>  * Jean-Baptiste Onofré (Talend)
>  * Uma Maheswara Rao G (Intel)
>
> === Sponsoring Entity ===
>
> The Apache Incubator
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
> For additional commands, e-mail: general-help@incubator.apache.org
>
>

Re: [VOTE] Accept CarbonData into the Apache Incubator

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
My own +1 (binding) ;)

Regards
JB


-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com



Re: [VOTE] Accept CarbonData into the Apache Incubator

Posted by Luke Han <lu...@gmail.com>.
+1 (binding)


Best Regards!
---------------------

Luke Han

On Wed, May 25, 2016 at 9:44 PM, Wang, Gang1 <ga...@intel.com> wrote:

> +1 (non-binding)
>
> Best Regards
> +Gary.
>
> -----Original Message-----
> From: Cheng, Hao [mailto:hao.cheng@intel.com]
> Sent: Wednesday, May 25, 2016 7:09 PM
> To: general@incubator.apache.org
> Subject: RE: [VOTE] Accept CarbonData into the Apache Incubator
>
> +1
>
> -----Original Message-----
> From: Jacques Nadeau [mailto:jacques@apache.org]
> Sent: Thursday, May 26, 2016 8:26 AM
> To: general@incubator.apache.org
> Subject: Re: [VOTE] Accept CarbonData into the Apache Incubator
>
> +1 (binding)
>
> On Wed, May 25, 2016 at 4:04 PM, John D. Ament <jo...@apache.org>
> wrote:
>
> > +1
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
> For additional commands, e-mail: general-help@incubator.apache.org
>

RE: [VOTE] Accept CarbonData into the Apache Incubator

Posted by "Wang, Gang1" <ga...@intel.com>.
+1 (non-binding)

Best Regards
+Gary.

-----Original Message-----
From: Cheng, Hao [mailto:hao.cheng@intel.com] 
Sent: Wednesday, May 25, 2016 7:09 PM
To: general@incubator.apache.org
Subject: RE: [VOTE] Accept CarbonData into the Apache Incubator

+1

RE: [VOTE] Accept CarbonData into the Apache Incubator

Posted by "Cheng, Hao" <ha...@intel.com>.
+1

-----Original Message-----
From: Jacques Nadeau [mailto:jacques@apache.org] 
Sent: Thursday, May 26, 2016 8:26 AM
To: general@incubator.apache.org
Subject: Re: [VOTE] Accept CarbonData into the Apache Incubator

+1 (binding)

Re: [VOTE] Accept CarbonData into the Apache Incubator

Posted by Jacques Nadeau <ja...@apache.org>.
+1 (binding)

On Wed, May 25, 2016 at 4:04 PM, John D. Ament <jo...@apache.org>
wrote:

> +1
> >
> > == Known Risks ==
> >
> > Development has been sponsored mostly by a one company.For the project
> > to fully transition to the Apache Way governance model, development must
> > shift towards the meritocracy-centric model of growing a community of
> > contributors balanced with the needs for extreme stability and core
> > implementation coherency.
> >
> > == Orphaned products ==
> >
> > Huawei is fully committed CarbonData. Moreover, Huawei has a vested
> > interest in making CarbonData succeed by driving its close integration
> > with sister ASF projects. We expect this to further reduces the risk of
> > orphaning the product.
> >
> > == Inexperience with Open Source ==
> >
> > Huawei has been developing and using open source software since a long
> > time. Additionally, several ASF veterans agreed to mentor the project
> > and are listed in this proposal. The project will rely on their guidance
> > and collective wisdom to quickly transition the entire team of initial
> > committers towards practicing the Apache Way.
> >
> > == Reliance on Salaried Developers ==
> >
> > Most of the contributors are paid to work in big data space. While they
> > might wander from their current employers, they are unlikely to venture
> > far from their core expertises and thus will continue to be engaged with
> > the project regardless of their current employers.
> >
> > == An Excessive Fascination with the Apache Brand ==
> >
> > While we intend to leverage the Apache ‘branding’ when talking to other
> > projects as testament of our project’s ‘neutrality’, we have no plans
> > for making use of Apache brand in press releases nor posting billboards
> > advertising acceptance of CarbonData into Apache Incubator.
> >
> > == Initial Source ==
> >
> > https://github.com/HuaweiBigData/carbondata.git
> >
> > == External Dependencies ==
> >
> > All external dependencies are licensed under an Apache 2.0 license or
> > Apache-compatible license. As we grow the Carbondata community we will
> > configure our build process to require and validate all contributions
> > and dependencies are licensed under the Apache 2.0 license or are under
> > an Apache-compatible license.
> >
> >   * Apache Spark
> >   * Apache Hadoop
> >   * Apache Maven
> >   * Apache Commons
> >   * Apache Log4j
> >   * Apache Thrift
> >   * Apache Zookeeper
> >   * Scala
> >   * Snappy
> >   * Kettle (Pentaho)
> >   * Eigenbase
> >   * Fastutil
> >   * GSON
> >   * Jmockit
> >   * Junit
> >
> > == Required Resources ==
> >
> > === Mailing lists ===
> >
> >   * private@carbondata.incubator.apache.org (moderated subscriptions)
> >   * commits@carbondata.incubator.apache.org
> >   * dev@carbondata.incubator.apache.org
> >   * issues@carbondata.incubator.apache.org
> >
> > === Git Repository ===
> >
> >   * https://git-wip-us.apache.org/repos/asf/incubator-carbondata.git
> >
> > === Issue Tracking ===
> >
> >   * JIRA Project CarbonData (CarbonData)
> >
> > === Initial Committers ===
> >
> >   * Liang Chenliang
> >   * Jean-Baptiste Onofré
> >   * Henry Saputra
> >   * Uma Maheswara Rao G
> >   * Jenny MA
> >   * Jacky Likun
> >   * Vimal Das Kammath
> >   * Jarray Qiuheng
> >
> > === Affiliations ===
> >
> >   * Huawei: Liang Chenliang
> >   * Talend: Jean-Baptiste Onofré
> >   * Ebay: Henry Saputra
> >   * Intel: Uma Maheswara Rao G
> >
> > === Sponsors ===
> >
> > === Champion ===
> >
> >   * Jean-Baptiste Onofré - Apache Member
> >
> > === Mentors ===
> >
> >   * Henry Saputra (eBay)
> >   * Jean-Baptiste Onofré (Talend)
> >   * Uma Maheswara Rao G (Intel)
> >
> > === Sponsoring Entity ===
> >
> > The Apache Incubator
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
> > For additional commands, e-mail: general-help@incubator.apache.org
> >
> >
>

Re: [VOTE] Accept CarbonData into the Apache Incubator

Posted by "John D. Ament" <jo...@apache.org>.
+1

On Wed, May 25, 2016 at 4:41 PM Jean-Baptiste Onofré <jb...@nanthrax.net>
wrote:

> Hi all,
>
> following the discussion thread, I'm now calling a vote to accept
> CarbonData into the Incubator.
>
> [ ] +1 Accept CarbonData into the Apache Incubator
> [ ] +0 Abstain
> [ ] -1 Do not accept CarbonData into the Apache Incubator, because ...
>
> This vote is open for 72 hours.
>
> The proposal follows, you can also access the wiki page:
> https://wiki.apache.org/incubator/CarbonDataProposal
>
> Thanks !
> Regards
> JB