You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@kylin.apache.org by li...@apache.org on 2015/09/03 03:31:18 UTC
[03/50] [abbrv] incubator-kylin git commit: Add blog “fast cubing”

Add blog “fast cubing”


Project: http://git-wip-us.apache.org/repos/asf/incubator-kylin/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-kylin/commit/265a528d
Tree: http://git-wip-us.apache.org/repos/asf/incubator-kylin/tree/265a528d
Diff: http://git-wip-us.apache.org/repos/asf/incubator-kylin/diff/265a528d

Branch: refs/heads/0.7
Commit: 265a528d6ed5ce29b558fc3554753cf53049316a
Parents: 39bd19a
Author: shaofengshi <sh...@apache.org>
Authored: Mon Aug 17 14:38:04 2015 +0800
Committer: shaofengshi <sh...@apache.org>
Committed: Mon Aug 17 14:38:34 2015 +0800

----------------------------------------------------------------------
 website/_dev/new_metadata.md                   |   2 +-
 website/_dev/test_minicluster.md               |   2 +-
 website/_dev/web_tech.md                       |   2 +-
 website/_docs/howto/howto_jdbc.md              |   2 +-
 website/_docs/howto/howto_use_restapi_in_js.md |   2 +-
 website/_docs/install/advance_settings.md      |   2 +-
 website/_posts/blog/2015-08-15-fast-cubing.md  | 118 ++++++++++++++++++++
 website/images/blog/by-layer-cubing.png        | Bin 0 -> 130733 bytes
 website/images/blog/by-segment-cubing.png      | Bin 0 -> 80771 bytes
 website/images/blog/cube-spanning-tree.png     | Bin 0 -> 27581 bytes
 10 files changed, 124 insertions(+), 6 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-kylin/blob/265a528d/website/_dev/new_metadata.md
----------------------------------------------------------------------
diff --git a/website/_dev/new_metadata.md b/website/_dev/new_metadata.md
index 851ab85..d52aa61 100644
--- a/website/_dev/new_metadata.md
+++ b/website/_dev/new_metadata.md
@@ -1,6 +1,6 @@
 ---
 layout: dev
-title:  "New meta data model structure"
+title:  "New Metadata Model"
 categories: development
 permalink: /development/new_metadata.html
 ---

http://git-wip-us.apache.org/repos/asf/incubator-kylin/blob/265a528d/website/_dev/test_minicluster.md
----------------------------------------------------------------------
diff --git a/website/_dev/test_minicluster.md b/website/_dev/test_minicluster.md
index 44d998d..f0b21ef 100644
--- a/website/_dev/test_minicluster.md
+++ b/website/_dev/test_minicluster.md
@@ -1,6 +1,6 @@
 ---
 layout: dev
-title:  "Run Kylin test case with HBase Mini Cluster"
+title:  "Run Kylin Test Case with Mini Cluster"
 categories: development
 permalink: /development/test_minicluster.html
 ---

http://git-wip-us.apache.org/repos/asf/incubator-kylin/blob/265a528d/website/_dev/web_tech.md
----------------------------------------------------------------------
diff --git a/website/_dev/web_tech.md b/website/_dev/web_tech.md
index 90ffe76..0603eda 100644
--- a/website/_dev/web_tech.md
+++ b/website/_dev/web_tech.md
@@ -1,6 +1,6 @@
 ---
 layout: dev
-title:  "Kylin web front end summary"
+title:  "Kylin Web Summary"
 categories: development
 permalink: /development/web_tech.html
 ---

http://git-wip-us.apache.org/repos/asf/incubator-kylin/blob/265a528d/website/_docs/howto/howto_jdbc.md
----------------------------------------------------------------------
diff --git a/website/_docs/howto/howto_jdbc.md b/website/_docs/howto/howto_jdbc.md
index df79153..45a2f4e 100644
--- a/website/_docs/howto/howto_jdbc.md
+++ b/website/_docs/howto/howto_jdbc.md
@@ -1,6 +1,6 @@
 ---
 layout: docs
-title:  How to use kylin remote jdbc driver
+title:  How to Use kylin Remote JDBC Driver
 categories: howto
 permalink: /docs/howto/howto_jdbc.html
 version: v0.7.2

http://git-wip-us.apache.org/repos/asf/incubator-kylin/blob/265a528d/website/_docs/howto/howto_use_restapi_in_js.md
----------------------------------------------------------------------
diff --git a/website/_docs/howto/howto_use_restapi_in_js.md b/website/_docs/howto/howto_use_restapi_in_js.md
index 1b2c78b..54b2c3e 100644
--- a/website/_docs/howto/howto_use_restapi_in_js.md
+++ b/website/_docs/howto/howto_use_restapi_in_js.md
@@ -1,6 +1,6 @@
 ---
 layout: docs
-title:  How to use Restful API in Javascript
+title:  How to Use Restful API in Javascript
 categories: howto
 permalink: /docs/howto/howto_use_restapi_in_js.html
 version: v0.8

http://git-wip-us.apache.org/repos/asf/incubator-kylin/blob/265a528d/website/_docs/install/advance_settings.md
----------------------------------------------------------------------
diff --git a/website/_docs/install/advance_settings.md b/website/_docs/install/advance_settings.md
index 43b0bea..3630954 100644
--- a/website/_docs/install/advance_settings.md
+++ b/website/_docs/install/advance_settings.md
@@ -1,6 +1,6 @@
 ---
 layout: docs
-title:  "Advance settings of Kylin environment"
+title:  "Advance Settings of Kylin Environment"
 categories: install
 permalink: /docs/install/advance_settings.html
 version: v0.7.2

http://git-wip-us.apache.org/repos/asf/incubator-kylin/blob/265a528d/website/_posts/blog/2015-08-15-fast-cubing.md
----------------------------------------------------------------------
diff --git a/website/_posts/blog/2015-08-15-fast-cubing.md b/website/_posts/blog/2015-08-15-fast-cubing.md
new file mode 100644
index 0000000..2d4ae81
--- /dev/null
+++ b/website/_posts/blog/2015-08-15-fast-cubing.md
@@ -0,0 +1,118 @@
+---
+layout: post-blog
+title:  "Fast Cubing Algorithm in Apache Kylin: Concept"
+date:   2015-08-15 12:28:00
+author: Shaofeng Shi
+categories: blog
+---
+
+This article is the first post of a series document that introducing the fast cubing algorithm in Apache Kylin; The new algorithm will be released in the near future (v1.1 or later); Now it is under internal testing and pilot; All the source code has been published in Kylin's Git repository <https://github.com/apache/incubator-kylin/> on 0.8 branch. 
+
+
+## By Layer Cubing Algorithm
+Before introduce the fast cubing, I'd like to make a brief introduction on the as-is cubing algorithm, which is called "by layer cubing"; 
+
+As its name indicates, a full cube is calculated by layer: N-dimension, N-1 dimension, N-2 dimension, ... until 0 dimension; Each layer's calculation is based on its parent layer (except the first, which base on source data); So this algorithm need N rounds of MapReduce running in sequence;
+
+In the MapReduce, the key is the composite of the dimensions, the value is the composite of the measures; When the mapper reads a key-value pair, it calculates its possible child cuboids; for each child cuboid, remove 1 dimension from the key, and then output the new key and value to the reducer; 
+
+The reducer gets the values grouped by key; it aggregates the measures, and then output to HDFS; one layer's MR is finished;
+
+When all layers are finished, the cube is calculated;
+
+Figure 1 describes the flow:
+
+![]( /images/blog/by-layer-cubing.png)
+Figure 1: By layer cubing
+
+
+**Advantage**
+
+* This algorithm is clear as it mostly leverage the capability of Hadoop; The sorting, grouping and shuffling are all taken care by Hadoop, developer just focus on the cubing logic;
+* Benefiting from Hadoop's maturity, this algorithm is very stable; In our experience there is seldom case that the mapper or reducer could fail; Even if your Hadoop cluster is small or is busy, it can finish in the end;
+
+
+**Disadvantage**
+
+* If the cube has many dimensions, it need the same number of MR jobs; As each Hadoop job scheduling need extra resource, the overhead cost to Hadoop is considerable; 
+* This algorithm causes too much shuffling to Hadoop; The mapper doesn't do aggregation, all the records that having same dimension values in next layer will be omitted to Hadoop, and then sent to reducer for aggregation;
+* Many writes to HDFS: each layer's cubing need write its output to HDFS for next layer MR to consume; In the end, Kylin need another round MR to convert these output files to HBase HFile for bulk load; These jobs generates many intermediate files in HDFS;
+* All in all: the performance is not good, especially when the cube has many dimensions (>10); 
+
+
+
+## Fast Cubing Algorithm
+The fast cubing algorithm is also called "by segment cubing"; The core idea is, each mapper calculates the feed data block into a small cube segment (with all cuboids), and then output all key/values to reducer; The reducer aggregates them into one big cube segment, finishing the cubing; Figure 2 illustrates the flow;
+
+
+![]( /images/blog/by-segment-cubing.png)
+Figure 2: By segment cubing
+
+### Pre-aggregation in Mapper
+Compared with the origin algorithm, the fast cubing has two main differences:
+
+* The mapper will do pre-aggregation, this will reduce the number of records that the mapper output to Hadoop, and also reduce the number that reducer need to aggregate;
+* One round MR can calculate all cuboids; 
+
+Let take an example: a cube has 4 dimensions: A, B, C, D; Each mapper has 1 million source records to process; The column cardinality in the mapper is Car(A), Car(B), Car(C) and Car(D); 
+
+* When aggregate the source records to base cuboid (1111), with the old "by layer" algorithm, the mapper will output 1 million records to Hadoop; With the fast cubing algorithm, after the pre-aggregation, it only outputs the number of [distinct A, B, C, D] records to Hadoop, which is certainly smaller than source data; In a normal case, it can be 1/10 to 1/1000 of the source records size;
+
+* When aggregate from parent to a child cuboid, say from base cuboid (1111) to 3-dimension cuboid 0111, the dimension A will be aggregated; We assume the dimension A is independent with other dimensions, after aggregation, the cuboid 0111's size will be about 1/Card(A) of the base cuboid; So the output will be reduced to 1/Card(A) of the original one in this step; 
+
+Totally, assume the average cardinality of the dimensions is Card(N), the records that written from mapper to reducer can be reduced to 1/Card(N) of origin size; The less output to Hadoop, the less I/O and computing, the better performance;
+
+### Cuboid spanning tree
+Inner the mapper, there is another change in the cuboid spanning tree visiting order; In origin cubing, Kylin calculates the cuboids with Broad First Search order; In the fast cubing, it uses Depth First Search order, to reduce the cuboids that need be cached in memory; 
+
+Let's look at the figure 3, which is a full spanning tree of a 4-dimension cube: before 0-Dimenson cuboid "\*" be aggregated, cuboid "ABCD", "BCD", "CD" and "D" will be cached in memory; As "\*" has no child, it will be outputted once be calculated, then comes "D"; After "C" be outputted, "CD" will be outputted as all of its children has finished; The memory for a cuboid can be released once it be outputted; The base cuboid ABCD will be the last one to finish; 
+
+![]( /images/blog/cube-spanning-tree.png)
+Figure 3: Cuboid spanning tree
+
+With the DFS visiting order, the output from a mapper is almost fully sorted, as the cuboid ID is at the beginning position of row key, and inner a cuboid the rows are already sorted:
+
+```
+0000
+0001[D0]
+0001[D1]
+....
+0010[C0]
+0010[C1]
+....
+0011[C0][D0]
+0011[C0][D1]
+....
+....
+1111[A0][B0][C0][D0]
+....
+```
+Since the outputs from mapper are already sorted, Hadoop's shuffle and sort would be faster;  
+
+Besides, mapper's pre-aggregation happens in memory, this avoids unnecessary disk and network I/O, and the overhead to Hadoop is considerable reduced; 
+
+###OutOfMemory error
+During the development phase, we encountered the OutOfMemory error in mappers; this could happen when: 
+
+	a) The mapper's JVM heap size is small;
+	b) "Distinct count" measure is used (HyperLogLog is space consuming) 
+	c) The spanning tree is too deep (too many dimensions);
+
+We realized that Kylin couldn't assume the mappers always have enough memory; The cubing algorithm need be self-adapting to various situations; A lot of efforts were put on optimizing the memory use and spilling data to disk when proactively detects an OutOfMemory error; The result is promising, the OOM error is rare to occur now;
+
+Here let's do a summary on the fast cubing algorithm;
+
+**Advantage**
+
+* It is faster than the old method; can reduce 30% to 50% overall building time from our comparison test; 
+* It produces less work load on Hadoop, and leaves less intermediate files on HDFS;
+* The cubing code can be easily reused by other cubing engines like Streaming and Spark;
+
+**Disadvantage**
+
+* The algorithm is a little complicated; This adds the maintenance effort;
+* Although the algorithm can spill data to disk automatically, it still wish the mapper has enough memory to gain best performance; User need more knowledge to tune the cubing; (this can be discussed in detail later)
+
+## Other Enhancements in Fast Cubing
+
+Together with the fast cubing algorithm, we also introduced several other enhancements in the cube build flow; I's like to elaborate in next articles, please keep posted here; If you have interest on Apache Kylin, welcome to visit our home page <http://kylin.incubator.apache.org/> and subscribe our development mailing list at <de...@kylin.incubator.apache.org>
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/incubator-kylin/blob/265a528d/website/images/blog/by-layer-cubing.png
----------------------------------------------------------------------
diff --git a/website/images/blog/by-layer-cubing.png b/website/images/blog/by-layer-cubing.png
new file mode 100644
index 0000000..b5238d2
Binary files /dev/null and b/website/images/blog/by-layer-cubing.png differ

http://git-wip-us.apache.org/repos/asf/incubator-kylin/blob/265a528d/website/images/blog/by-segment-cubing.png
----------------------------------------------------------------------
diff --git a/website/images/blog/by-segment-cubing.png b/website/images/blog/by-segment-cubing.png
new file mode 100644
index 0000000..02fbec3
Binary files /dev/null and b/website/images/blog/by-segment-cubing.png differ

http://git-wip-us.apache.org/repos/asf/incubator-kylin/blob/265a528d/website/images/blog/cube-spanning-tree.png
----------------------------------------------------------------------
diff --git a/website/images/blog/cube-spanning-tree.png b/website/images/blog/cube-spanning-tree.png
new file mode 100644
index 0000000..71c464c
Binary files /dev/null and b/website/images/blog/cube-spanning-tree.png differ