Posted to cvs@incubator.apache.org by Apache Wiki <wi...@apache.org> on 2016/05/24 13:55:19 UTC

[Incubator Wiki] Trivial Update of "CarbonDataProposal" by NickBurch

The "CarbonDataProposal" page has been changed by NickBurch:
https://wiki.apache.org/incubator/CarbonDataProposal?action=diff&rev1=5&rev2=6

Comment:
Fix some markup

  
  == Abstract ==
  
- Apache CarbonData is a new Apache Hadoop native file format for faster interactive
+ Apache !CarbonData is a new Apache Hadoop native file format for faster interactive
  queries, using advanced columnar storage, indexing, compression and encoding techniques
  to improve computing efficiency. In turn, this helps speed up queries by an order of
  magnitude over petabytes of data.
  
- CarbonData github address: https://github.com/HuaweiBigData/carbondata
+ !CarbonData github address: https://github.com/HuaweiBigData/carbondata
  
  == Background ==
  
@@ -25, +25 @@

  
  == Rationale ==
  
- CarbonData contains multiple modules, which are classified into two categories: 
+ !CarbonData contains multiple modules, which are classified into two categories: 
  
-  1. CarbonData File Format: contains the core implementation of the file format, such as columnar storage, index, dictionary, encoding and compression, and the API for reading/writing.
+  1. !CarbonData File Format: contains the core implementation of the file format, such as columnar storage, index, dictionary, encoding and compression, and the API for reading/writing.
-  2. CarbonData integration with big data processing frameworks such as Apache Spark and Apache Hive. Apache Beam integration is also planned, to abstract the execution runtime.
+  2. !CarbonData integration with big data processing frameworks such as Apache Spark and Apache Hive. Apache Beam integration is also planned, to abstract the execution runtime.
  
  === CarbonData File Format ===
  
- CarbonData file format is a columnar store in HDFS. It has many features of a modern columnar format, such as being splittable, a compression schema, and complex data types. CarbonData also has the following unique features:
+ !CarbonData file format is a columnar store in HDFS. It has many features of a modern columnar format, such as being splittable, a compression schema, and complex data types. CarbonData also has the following unique features:
  
  ==== Indexing ====
  
- In order to support fast interactive queries, CarbonData leverages indexing technology to reduce I/O scans. CarbonData files store data along with the index: the index is not stored separately, but inside the CarbonData file itself. In the current implementation, CarbonData supports 3 types of indexing:
+ In order to support fast interactive queries, !CarbonData leverages indexing technology to reduce I/O scans. CarbonData files store data along with the index: the index is not stored separately, but inside the CarbonData file itself. In the current implementation, !CarbonData supports 3 types of indexing:
  
  1. Multi-dimensional Key (B+ Tree index)
   The data blocks are written in sequence to the disk, and within each data block each column block is written in sequence. Finally, the metadata block for the file is written with information about the byte positions of each block in the file, the Min-Max statistics index, and the start and end MDK of each data block. Since the entire data in the file is in sorted order, the start and end MDK of each data block can be used to construct a B+Tree, and the file can be logically represented as a B+Tree with the data blocks as leaf nodes (on disk) and the remaining non-leaf nodes in memory.
  2. Inverted index
   Inverted indexes are widely used in search engines. This index helps the processing/query engine do filtering inside one HDFS block. Furthermore, combining bitmaps with the inverted index at query time makes it possible to accelerate count-distinct-like operations.
- 3. MinMax index
+ 3. !MinMax index
   For all columns, a min-max index is created so that the processing/query engine can skip scans that are not required.
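
The block-skipping idea behind the min-max index can be sketched as follows. This is a minimal illustration in Python, not CarbonData's actual implementation: the `Block` class, `build_blocks` and `scan_equals` are hypothetical names, and the data is a single sorted integer column rather than a real multi-dimensional key.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    """Hypothetical data block carrying min/max statistics for one column."""
    values: List[int]
    min_value: int
    max_value: int


def build_blocks(sorted_values: List[int], block_size: int) -> List[Block]:
    """Split sorted values into fixed-size blocks and record min/max per block."""
    blocks = []
    for start in range(0, len(sorted_values), block_size):
        chunk = sorted_values[start:start + block_size]
        blocks.append(Block(chunk, min(chunk), max(chunk)))
    return blocks


def scan_equals(blocks: List[Block], target: int) -> Tuple[List[int], int]:
    """Return the matching values and the number of blocks actually scanned."""
    scanned = 0
    hits: List[int] = []
    for block in blocks:
        # Skip the block if the target cannot lie inside its [min, max] range.
        if target < block.min_value or target > block.max_value:
            continue
        scanned += 1
        hits.extend(v for v in block.values if v == target)
    return hits, scanned
```

With 100 sorted values in blocks of 10, an equality filter only has to read the one block whose range covers the target; the other nine are skipped using the statistics alone.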
  
  ==== Global Dictionary ====
  
- Besides I/O reduction, CarbonData accelerates computation by using a global dictionary, which enables processing/query engines to perform all processing on encoded data without having to convert the data (Late Materialization). We have observed dramatic performance improvements for OLAP analytic scenarios where tables contain many columns of string data type. The data is converted back to user-readable form just before the processing/query engine returns results to the user.
+ Besides I/O reduction, !CarbonData accelerates computation by using a global dictionary, which enables processing/query engines to perform all processing on encoded data without having to convert the data (Late Materialization). We have observed dramatic performance improvements for OLAP analytic scenarios where tables contain many columns of string data type. The data is converted back to user-readable form just before the processing/query engine returns results to the user.
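
As a rough illustration of dictionary encoding with late materialization, the sketch below groups and counts on integer keys and only maps back to strings at the end. It is written in Python under simplifying assumptions: the function names are hypothetical, the dictionary lives in memory, and it does not reflect CarbonData's actual dictionary format or API.

```python
from collections import Counter
from typing import Dict, List


def build_dictionary(values: List[str]) -> Dict[str, int]:
    """Assign a small integer key to each distinct string, in first-seen order."""
    dictionary: Dict[str, int] = {}
    for value in values:
        if value not in dictionary:
            dictionary[value] = len(dictionary)
    return dictionary


def encode(values: List[str], dictionary: Dict[str, int]) -> List[int]:
    """Replace every string with its integer key."""
    return [dictionary[value] for value in values]


def count_by_key(encoded: List[int]) -> Counter:
    """Aggregate on the encoded integers only; no strings are touched here."""
    return Counter(encoded)


def decode_result(counts: Counter, dictionary: Dict[str, int]) -> Dict[str, int]:
    """Late materialization: map keys back to strings only for the final result."""
    reverse = {key: value for value, key in dictionary.items()}
    return {reverse[key]: n for key, n in counts.items()}
```

For example, a column of city names is encoded once, the group-by count runs entirely on integers, and `decode_result` converts the handful of result keys back to readable strings.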
  
  ==== Column Group ====
  
@@ -77, +77 @@

  
  Our initial goals are to bring CarbonData into the ASF, transition internal engineering processes into the open, and foster a collaborative development model according to the "Apache Way".
  
- == Current Status == 
+ == Current Status ==
  
- CarbonData is production ready and already provides a large set of features.
+ !CarbonData is production ready and already provides a large set of features.
  The current license is already Apache 2.0.
  
  == Meritocracy ==
@@ -88, +88 @@

  
  == Community ==
  
- If CarbonData is accepted for incubation, the primary initial goal is to build a large community. We really trust that CarbonData will become a key project for big data column-like platforms, and so, we bet on a large community of users and developers.
+ If CarbonData is accepted for incubation, the primary initial goal is to build a large community. We really trust that !CarbonData will become a key project for big data column-like platforms, and so, we bet on a large community of users and developers.
  
- == Known Risks == 
+ == Known Risks ==
  
  Development has been sponsored mostly by one company. For the project to fully transition to the Apache Way governance model, development must shift towards the meritocracy-centric model of growing a community of contributors balanced with the needs for extreme stability and core implementation coherency.
  
  == Orphaned products ==
  
- Huawei is fully committed to CarbonData. Moreover, Huawei has a vested interest in making CarbonData succeed by driving its close integration with sister ASF projects. We expect this to further reduce the risk of orphaning the product.
+ Huawei is fully committed to CarbonData. Moreover, Huawei has a vested interest in making !CarbonData succeed by driving its close integration with sister ASF projects. We expect this to further reduce the risk of orphaning the product.
  
  == Inexperience with Open Source ==
  
@@ -151, +151 @@

  
   * https://git-wip-us.apache.org/repos/asf/incubator-carbondata.git
  
- === Issue Tracking === 
+ === Issue Tracking ===
  
   * JIRA Project CarbonData (CarbonData)
  
- === Initial Committers === 
+ === Initial Committers ===
  
   * Liang Chenliang 
   * Jean-Baptiste Onofré
