Posted to commits@kylin.apache.org by sh...@apache.org on 2017/03/29 09:07:49 UTC

[4/4] kylin git commit: revise alberto's tutorial

revise alberto's tutorial


Project: http://git-wip-us.apache.org/repos/asf/kylin/repo
Commit: http://git-wip-us.apache.org/repos/asf/kylin/commit/a34f34ea
Tree: http://git-wip-us.apache.org/repos/asf/kylin/tree/a34f34ea
Diff: http://git-wip-us.apache.org/repos/asf/kylin/diff/a34f34ea

Branch: refs/heads/document
Commit: a34f34ea6e3cf26e6bb1de36fcedaba35d2c71e5
Parents: 210a249
Author: shaofengshi <sh...@apache.org>
Authored: Wed Mar 29 17:07:38 2017 +0800
Committer: shaofengshi <sh...@apache.org>
Committed: Wed Mar 29 17:07:38 2017 +0800

----------------------------------------------------------------------
 website/_docs20/index.md                        |  2 +-
 .../_docs20/tutorial/cube_build_performance.md  | 55 ++++++--------------
 2 files changed, 18 insertions(+), 39 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/kylin/blob/a34f34ea/website/_docs20/index.md
----------------------------------------------------------------------
diff --git a/website/_docs20/index.md b/website/_docs20/index.md
index b20e0a1..ada6b67 100644
--- a/website/_docs20/index.md
+++ b/website/_docs20/index.md
@@ -34,7 +34,7 @@ Tutorial
 5. [SQL reference: by Apache Calcite](http://calcite.apache.org/docs/reference.html)
 6. [Build Cube with Streaming Data](tutorial/cube_streaming.html)
 7. [Build Cube with Spark Engine (beta)](tutorial/cube_spark.html)
-8. [Kylin Cube build tuning step by step](tutorial/cube_build_performance.html)
+8. [Cube Build Tuning Step by Step](tutorial/cube_build_performance.html)
 
 
 

http://git-wip-us.apache.org/repos/asf/kylin/blob/a34f34ea/website/_docs20/tutorial/cube_build_performance.md
----------------------------------------------------------------------
diff --git a/website/_docs20/tutorial/cube_build_performance.md b/website/_docs20/tutorial/cube_build_performance.md
index bec63dc..d1375fe 100755
--- a/website/_docs20/tutorial/cube_build_performance.md
+++ b/website/_docs20/tutorial/cube_build_performance.md
@@ -1,32 +1,26 @@
 ---
 layout: docs20
-title: Kylin Cube build tuning step by step
+title: Cube Build Tuning Step by Step
 categories: tutorial
 permalink: /docs20/tutorial/cube_build_performance.html
 ---
- *This tutorial is an example step by step about how to optimize build of cube*
+ *This tutorial is a step-by-step example of how to optimize a cube build.* 
  
- *Thanks to ShaoFeng Shi for help*
-
-
-
-
-Try to optimize a very simple Cube, with 1 Dim and 1 Fact table (Date Dimension)
+In this scenario we're trying to optimize a very simple Cube, with 1 fact and 1 lookup table (Date Dimension). Before doing any real tuning, please get an overall understanding of the Cube build process from [Optimize Cube Build](/docs20/howto/howto_optimize_build.html).
 
 ![]( /images/tutorial/2.0/cube_build_performance/01.png)
 
-
 The baseline is:
 
 * One measure: Balance, always calculating Max, Min and Count
 * All Dim_date (10 items) will be used as dimensions 
 * Input is a Hive CSV external table 
-* Output is a Cube in HBase with out compression 
+* Output is a Cube in HBase without compression 
 
 With this configuration, the results are: 13 minutes to build a cube of 20 MB (Cube_01).
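As a rough illustration (not from the original tutorial; the table and column names fact_balance, dim_date, id_date, balance and year are assumed), the kind of aggregate query this cube is designed to serve looks like:

{% highlight sql %}
-- Illustrative only: aggregate the Balance measure over a Dim_date attribute
SELECT d.year,
       MAX(f.balance) AS max_balance,
       MIN(f.balance) AS min_balance,
       COUNT(*)       AS row_count
FROM fact_balance f
JOIN dim_date d ON f.id_date = d.id_date
GROUP BY d.year;
{% endhighlight %}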
 
-### Cube_02: Reduce cardinality
-To make the first improvement, use Joint and Hierarchy on Dimensions to reduce the cardinality.
+### Cube_02: Reduce combinations
+To make the first improvement, use Joint and Hierarchy on Dimensions to reduce the combinations (number of cuboids).
 
 Put together all the ID and Text attributes of Month, Week, Weekday and Quarter using a Joint Dimension
 
@@ -37,28 +31,25 @@ Define Id_date and Year as a Hierarchy Dimension
 
 This reduces the size down to 0.72 MB and the build time to 5 minutes.
 
-
-
 Per [KYLIN-2149](https://issues.apache.org/jira/browse/KYLIN-2149), ideally these Hierarchies could also be defined:
 * Id_weekday > Id_date
 * Id_Month > Id_date
 * Id_Quarter > Id_date
 * Id_week > Id_date
 
-But for now, it isn\u2019t possible to use Joint and hierarchy together in one Dim   :(
+But for now, it is not possible to use Joint and Hierarchy together for one dimension.
 
 
-### Cube_03: Compress output Cube
+### Cube_03: Compress output
 To make the next improvement, compress HBase Cube with Snappy:
 
 ![alt text](/images/tutorial/2.0/cube_build_performance/03.png)
 
-Another option is to Now we can try compress HBase Cube with Gzip:
+Another option is Gzip:
 
 ![alt text](/images/tutorial/2.0/cube_build_performance/04.png)
 
 
-
 The results of compression output are:
 
 ![alt text](/images/tutorial/2.0/cube_build_performance/05.png)
@@ -78,18 +69,17 @@ Group detailed times by concepts :
 
 67% of the time is used to build / process the flat table and around 30% to build the cube.
 
-A lot of time is used in the first steps! 
+A lot of time is used in the first steps.
 
 This time distribution is typical for a cube with few measures and few dimensions (or a very optimized one).
 
 
-
 Try using ORC format and compression (Snappy) on the Hive input table:
 
 ![]( /images/tutorial/2.0/cube_build_performance/08.png)
 
 
-The time in the first three stree steps (Flat Table) has been improved by half  :)
+The time in the first three steps (Flat Table) has been improved by half.
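As a reference, converting the CSV input table to ORC with Snappy compression can be done directly in Hive; this is only a sketch, and the table names (fact_balance_csv, fact_balance_orc) are illustrative:

{% highlight sql %}
-- Create an ORC copy of the external CSV table, compressed with Snappy
CREATE TABLE fact_balance_orc
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "SNAPPY")
AS
SELECT * FROM fact_balance_csv;
{% endhighlight %}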
 
 Other columnar formats can be tested:
 
@@ -99,7 +89,7 @@ Other columnar formats can be tested:
 * ORC
 * ORC compressed with Snappy
 
-But the results are worse than when using Sequence file \u2026
+But the results are worse than when using Sequence file.
 
 See comments about this here: [Shaofengshi in MailList](http://apache-kylin.74782.x6.nabble.com/Kylin-Performance-td6713.html#a6767)
 
@@ -107,14 +97,13 @@ The second strep is to redistribute Flat Hive table:
 
 ![]( /images/tutorial/2.0/cube_build_performance/20.png)
 
-
 It is a simple row count, so two approximations can be made:
* If it doesn't need to be exact, the rows of the fact table can be counted → this can be performed in parallel with Step 1 (and 99% of the time it will be accurate)
 
 ![]( /images/tutorial/2.0/cube_build_performance/21.png)
 
 
-* See comments about this from Shaofengshi in MailList . In the future versions (Kylin 2265 v2.0), this steps will be implemented using Hive table statistics.
+* In future versions (KYLIN-2165, v2.0), this step will be implemented using Hive table statistics.
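Until then, a hedged sketch of how these statistics can be gathered manually in Hive (the table name is illustrative):

{% highlight sql %}
-- Compute table-level statistics so the row count is stored in the metastore
ANALYZE TABLE fact_balance COMPUTE STATISTICS;
-- The row count then appears as the numRows table property
DESCRIBE FORMATTED fact_balance;
{% endhighlight %}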
 
 
 
@@ -140,6 +129,7 @@ WHERE (ID_DATE >= '2016-12-08' AND ID_DATE < '2016-12-23')
 {% endhighlight %}
 
 The problem here is that Hive is only using 1 mapper to create the Flat Table. It is important to change this behavior. The solution is to partition the DIM and FACT tables on the same column:
+
* Option 1: Use id_date as a partition column on the Hive table. This has a big problem: the Hive metastore is meant for a few hundred partitions, not thousands (in [HIVE-9452](https://issues.apache.org/jira/browse/HIVE-9452) there is an idea to solve this, but it isn't finished yet)
* Option 2: Generate a new column for this purpose, such as Monthslot (a sketch of this option follows below).
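A minimal sketch of Option 2 in HiveQL, assuming illustrative table and column names (fact_balance, fact_balance_part, id_date, balance); the lookup table would be partitioned on the same Monthslot column in the same way:

{% highlight sql %}
-- Allow dynamic partitioning for the INSERT below
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Fact table partitioned on a derived Monthslot column (e.g. '2016-12')
CREATE TABLE fact_balance_part (id_date STRING, balance DOUBLE)
PARTITIONED BY (monthslot STRING)
STORED AS ORC;

-- Derive the partition value from the date while loading
INSERT OVERWRITE TABLE fact_balance_part PARTITION (monthslot)
SELECT id_date, balance, substr(id_date, 1, 7) AS monthslot
FROM fact_balance;
{% endhighlight %}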
 
@@ -217,12 +207,11 @@ How can the performance of Map \u2013 Reduce be improved? The easy way is to increa
 * yarn.nodemanager.resource.memory-mb = 15 GB
 * yarn.scheduler.maximum-allocation-mb = 8 GB
 * yarn.nodemanager.resource.cpu-vcores = 8 cores
-With this config our max theoreticaleorical grade of parallelismelist is 8. However, t , but this has a problem: \u201cTimed out after 3600 secs\u201d
+With this config, our maximum theoretical degree of parallelism is 8. However, this has a problem: “Timed out after 3600 secs”
 
 ![]( /images/tutorial/2.0/cube_build_performance/26.png)
 
 
-
 The parameter mapreduce.task.timeout (1 hour by default) defines the maximum time that the Application Master (AM) will wait without an ACK from a YARN container. Once this time passes, the AM kills the container and retries the same task 4 times (with the same result).
 
 Where is the problem? The problem is that 4 mappers were started, but each mapper needed more than 4 GB to finish.
@@ -243,7 +232,7 @@ During a normal \u201cBuild Cube\u201d step you will see similars messages on YARN log
 ![]( /images/tutorial/2.0/cube_build_performance/28.png)
 
 
-If you don\u2019t see this periodically, perhaps you have a bottleneck in the memory
+If you don't see this periodically, perhaps you have a memory bottleneck.
 
 
 
@@ -263,8 +252,7 @@ In our case we define 3 Aggregations Groups:
 
 ![]( /images/tutorial/2.0/cube_build_performance/31.png)
 
-	
-	
+
 
 Compare without / with AGGs:
 
@@ -276,12 +264,3 @@ Now it uses 3% more of time to build the cube and 0.6% of space, but queries by
 
 
 
-
-
-**__For any suggestions, feel free to contact me__**
-
-**__Thanks, Alberto__**
-
-
-
-