Posted to commits@kylin.apache.org by li...@apache.org on 2017/02/26 03:22:03 UTC

kylin git commit: add post for spark cubing

Repository: kylin
Updated Branches:
  refs/heads/document a7e622b53 -> f7adfdeb3


add post for spark cubing

Signed-off-by: Yang Li <li...@apache.org>


Project: http://git-wip-us.apache.org/repos/asf/kylin/repo
Commit: http://git-wip-us.apache.org/repos/asf/kylin/commit/f7adfdeb
Tree: http://git-wip-us.apache.org/repos/asf/kylin/tree/f7adfdeb
Diff: http://git-wip-us.apache.org/repos/asf/kylin/diff/f7adfdeb

Branch: refs/heads/document
Commit: f7adfdeb30d63acdb3c077da184fce5b855cc6e2
Parents: a7e622b
Author: shaofengshi <sh...@apache.org>
Authored: Sat Feb 25 23:36:26 2017 +0800
Committer: Yang Li <li...@apache.org>
Committed: Sun Feb 26 11:19:15 2017 +0800

----------------------------------------------------------------------
 .../blog/2017-02-23-by-layer-spark-cubing.md    |  58 +++++++++++++++++++
 website/images/blog/spark-cubing-layer.png      | Bin 0 -> 532778 bytes
 website/images/blog/spark-dag.png               | Bin 0 -> 126570 bytes
 website/images/blog/spark-mr-layer.png          | Bin 0 -> 550454 bytes
 website/images/blog/spark-mr-performance.png    | Bin 0 -> 98432 bytes
 5 files changed, 58 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/kylin/blob/f7adfdeb/website/_posts/blog/2017-02-23-by-layer-spark-cubing.md
----------------------------------------------------------------------
diff --git a/website/_posts/blog/2017-02-23-by-layer-spark-cubing.md b/website/_posts/blog/2017-02-23-by-layer-spark-cubing.md
new file mode 100644
index 0000000..6525917
--- /dev/null
+++ b/website/_posts/blog/2017-02-23-by-layer-spark-cubing.md
@@ -0,0 +1,58 @@
+---
+layout: post-blog
+title:  By-layer Spark Cubing
+date:   2017-02-23 17:30:00
+author: Shaofeng Shi
+categories: blog
+---
+
+Before v2.0, Apache Kylin used Hadoop MapReduce as the framework to build Cubes over huge datasets. The MapReduce framework is simple and stable, and it fulfills Kylin's needs well except for performance. To get better performance, we introduced the "fast cubing" algorithm in Kylin v1.5, which tries to do as many aggregations as possible on the map side, in memory, so as to avoid disk and network I/O; but not all data models can benefit from it, and it still runs on MapReduce, which means on-disk sorting and shuffling. 
+
+Now Spark comes. Apache Spark is an open-source cluster-computing framework that provides programmers with an application programming interface centered on a data structure called the RDD. It runs in memory on the cluster, which makes repeated access to the same data much faster. Spark provides flexible and fancy APIs, and you are not tied to Hadoop's two-stage MapReduce paradigm.
+
+Before introducing how to calculate a Cube with Spark, let's see how Kylin does it with MapReduce. Figure 1 illustrates how a 4-dimension Cube is calculated with the classic "by-layer" algorithm: the first MapReduce round aggregates the base (4-D) cuboid from the source data; the second round aggregates on the base cuboid to get the 3-D cuboids; and so on. With N+1 MapReduce rounds, all layers' cuboids get calculated. 
+
+![MapReduce Cubing by Layer](/images/blog/spark-mr-layer.png)
+
+The "by-layer" cubing divides a big task into a series of steps, and each step is based on the previous step's output, so it can reuse the previous computation and avoid recalculating from the very beginning when there is a failure in between. These characteristics make it a reliable algorithm. When moving to Spark, we decided to keep this algorithm; that's why we call this feature "By-layer Spark Cubing". 
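To make the by-layer idea concrete, here is a minimal plain-Python sketch (not Kylin's actual implementation, which runs as distributed jobs; the toy table and SUM-only measure are made up for illustration): each layer's cuboids are aggregated from a parent cuboid in the previous layer by dropping one dimension.

```python
from collections import defaultdict

# Toy source table with 4 dimensions (A, B, C, D) and one SUM measure.
DIMS = ("A", "B", "C", "D")
rows = [(("a1", "b1", "c1", "d1"), 2),
        (("a1", "b1", "c1", "d2"), 3),
        (("a2", "b1", "c2", "d1"), 5)]

def aggregate(records, keep_dims):
    """Group by the kept dimensions and SUM the measure."""
    out = defaultdict(int)
    for key, measure in records:
        out[tuple((d, v) for d, v in key if d in keep_dims)] += measure
    return list(out.items())

# Layer 0 of the algorithm: the base (4-D) cuboid from the source data.
base = aggregate([(tuple(zip(DIMS, vals)), m) for vals, m in rows], set(DIMS))

# Each later layer is computed only from the previous layer's output:
# an (N-1)-dim child cuboid is aggregated from an N-dim parent cuboid
# by dropping one dimension.
layers = {4: {frozenset(DIMS): base}}
for n in range(3, -1, -1):
    layers[n] = {}
    for parent_dims, parent in layers[n + 1].items():
        for drop in parent_dims:
            child = parent_dims - {drop}
            if child not in layers[n]:
                layers[n][child] = aggregate(parent, child)

# A 4-dimension cube yields 2^4 = 16 cuboids across 5 layers.
print(sum(len(layer) for layer in layers.values()))  # prints 16
```

Note that this reuse is only valid for measures that can be computed from partial aggregates (e.g. SUM, MIN, MAX, COUNT); that is the same property the MapReduce by-layer algorithm relies on.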
+
+
+As we know, the RDD (Resilient Distributed Dataset) is a basic concept in Spark. A collection of N-dimension cuboids can be well described as an RDD, so an N-dimension Cube will have N+1 RDDs. These RDDs have a parent/child relationship, as the parent can be used to generate the children. With the parent RDD cached in memory, generating the child RDD can be much more efficient than reading from disk. Figure 2 describes this process.
+
+![Spark Cubing by Layer](/images/blog/spark-cubing-layer.png)
+
+Figure 3 is the DAG of cubing in Spark; it illustrates the process in detail. In "Stage 5", Kylin uses a HiveContext to read the intermediate Hive table and then performs a "map" operation, which is a one-to-one map, to encode the original values into key-value bytes. On completion, Kylin gets an intermediate encoded RDD. In "Stage 6", the intermediate RDD is aggregated with a "reduceByKey" operation to get RDD-1, which is the base cuboid. Next, a "flatMap" (a one-to-many map) is performed on RDD-1, because the base cuboid has N child cuboids; and so on, until all levels' RDDs are calculated. These RDDs are persisted to the distributed file system on completion, but kept cached in memory for the next level's calculation; once the children are generated, the parent is removed from the cache.
+
+![DAG of Spark Cubing](/images/blog/spark-dag.png)
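The map / reduceByKey / flatMap chain above can be sketched with plain-Python stand-ins for the Spark operations. This illustrates only the dataflow, not Kylin's actual code (which uses the Spark API); the two-dimension key and the comma-joined string encoding are invented for the example.

```python
def reduce_by_key(pairs, fn):
    """Plain-Python stand-in for Spark's reduceByKey."""
    out = {}
    for k, v in pairs:
        out[k] = fn(out[k], v) if k in out else v
    return list(out.items())

def flat_map(pairs, fn):
    """Stand-in for Spark's flatMap: each record expands to many."""
    return [item for record in pairs for item in fn(record)]

# "Stage 5": a one-to-one map encodes each source row into a
# (rowkey bytes, measure) pair. The comma-joined string is a toy
# stand-in for Kylin's dictionary-encoded rowkey.
source = [("a1,b1", 2), ("a1,b1", 3), ("a2,b1", 5)]
encoded = [(key.encode(), m) for key, m in source]

# "Stage 6": reduceByKey yields RDD-1, the base (here 2-D) cuboid.
rdd1 = reduce_by_key(encoded, lambda a, b: a + b)

# flatMap: each base-cuboid record spawns one record per child cuboid
# (drop one dimension from the key); reduceByKey then aggregates them.
def children(record):
    key, measure = record
    dims = key.decode().split(",")
    for i in range(len(dims)):
        child_key = ",".join(dims[:i] + dims[i + 1:]).encode()
        yield (i, child_key), measure  # tag with the dropped dimension

rdd2 = reduce_by_key(flat_map(rdd1, children), lambda a, b: a + b)
```

In Spark itself, `rdd.persist()` plays the caching role described above and `rdd.unpersist()` the removal once the children have been generated.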
+
+We did a test to see how much performance improvement can be gained from Spark:
+
+Environment
+
+* 4 nodes Hadoop cluster; each node has 28 GB RAM and 12 cores;
+* YARN has 48GB RAM and 30 cores in total;
+* CDH 5.8, Apache Kylin 2.0 beta.
+
+Spark
+
+* Spark 1.6.3 on YARN
+* 6 executors, each with 4 cores and 4GB + 1GB (overhead) memory
+
+Test Data
+
+* Airline data, total 160 million rows
+* Cube: 10 dimensions, 5 measures (SUM)
+
+Test Scenarios
+
+* Build the cube at different source data scales: 3 million, 50 million and 160 million source rows; compare the build time of MapReduce (by layer) and Spark. No compression enabled.
+The time covers only the cube-building step, not data preparation and subsequent steps.
+
+![Spark vs MR performance](/images/blog/spark-mr-performance.png)
+
+Spark is faster than MR in all 3 scenarios, and overall it can cut the cubing time roughly in half.
+
+
+Now you can download a 2.0.0 beta build from Kylin's download page, and then follow this [post](https://kylin.apache.org/blog/2017/02/25/v2.0.0-beta-ready/) to build a cube with the Spark engine. If you have any comments or input, please discuss in the community.
+
+

http://git-wip-us.apache.org/repos/asf/kylin/blob/f7adfdeb/website/images/blog/spark-cubing-layer.png
----------------------------------------------------------------------
diff --git a/website/images/blog/spark-cubing-layer.png b/website/images/blog/spark-cubing-layer.png
new file mode 100644
index 0000000..55880f8
Binary files /dev/null and b/website/images/blog/spark-cubing-layer.png differ

http://git-wip-us.apache.org/repos/asf/kylin/blob/f7adfdeb/website/images/blog/spark-dag.png
----------------------------------------------------------------------
diff --git a/website/images/blog/spark-dag.png b/website/images/blog/spark-dag.png
new file mode 100644
index 0000000..f06b636
Binary files /dev/null and b/website/images/blog/spark-dag.png differ

http://git-wip-us.apache.org/repos/asf/kylin/blob/f7adfdeb/website/images/blog/spark-mr-layer.png
----------------------------------------------------------------------
diff --git a/website/images/blog/spark-mr-layer.png b/website/images/blog/spark-mr-layer.png
new file mode 100644
index 0000000..bc62604
Binary files /dev/null and b/website/images/blog/spark-mr-layer.png differ

http://git-wip-us.apache.org/repos/asf/kylin/blob/f7adfdeb/website/images/blog/spark-mr-performance.png
----------------------------------------------------------------------
diff --git a/website/images/blog/spark-mr-performance.png b/website/images/blog/spark-mr-performance.png
new file mode 100644
index 0000000..e17d160
Binary files /dev/null and b/website/images/blog/spark-mr-performance.png differ