Posted to commits@kylin.apache.org by li...@apache.org on 2017/02/03 01:00:48 UTC
svn commit: r1781490 - in /kylin/site: docs16/howto/howto_optimize_build.html feed.xml

Author: lidong
Date: Fri Feb  3 01:00:48 2017
New Revision: 1781490

URL: http://svn.apache.org/viewvc?rev=1781490&view=rev
Log:
minor update on doc

Modified:
    kylin/site/docs16/howto/howto_optimize_build.html
    kylin/site/feed.xml

Modified: kylin/site/docs16/howto/howto_optimize_build.html
URL: http://svn.apache.org/viewvc/kylin/site/docs16/howto/howto_optimize_build.html?rev=1781490&r1=1781489&r2=1781490&view=diff
==============================================================================
--- kylin/site/docs16/howto/howto_optimize_build.html (original)
+++ kylin/site/docs16/howto/howto_optimize_build.html Fri Feb  3 01:00:48 2017
@@ -2246,7 +2246,7 @@ INSERT OVERWRITE TABLE kylin_intermediat
 
 <p>Secondly, Kylin runs an <em>“INSERT OVERWRITE TABLE … DISTRIBUTE BY”</em> HiveQL statement to distribute the rows among a specified number of reducers.</p>
 
-<p>In most cases, Kylin asks Hive to distributes the rows among reducers, then get files very closed in size. The distribute clause is “DISTRIBUTE BY RAND()”.</p>
+<p>In most cases, Kylin asks Hive to randomly distribute the rows among reducers, so that the output files are very close in size. The distribute clause is “DISTRIBUTE BY RAND()”.</p>
 
 <p>If your Cube has specified a “shard by” dimension (in the Cube’s “Advanced setting” page) that is a high-cardinality column (like “USER_ID”), Kylin will ask Hive to redistribute data by that column’s value. Rows that have the same value in this column then go to the same file. This is much better than “by random”, because the data is not only redistributed but also pre-categorized at no additional cost, which benefits the subsequent Cube build process. In a typical scenario, this optimization can cut build time by 40%. In this case the distribute clause will be “DISTRIBUTE BY USER_ID”:</p>
 
@@ -2270,11 +2270,15 @@ INSERT OVERWRITE TABLE kylin_intermediat
 
 <h2 id="build-base-cuboid">Build Base Cuboid</h2>
 
-<p>This step is building the base cuboid from the intermediate table. The mapper number is equals to the reducer number of step 2; The reducer number is estimate with the cube statistics: by default use 1 reducer every 500MB output; If you observed the reducer number is small, you can set “kylin.job.mapreduce.default.reduce.input.mb” in kylin.properteis to a smaller value to get more resources, e.g: <code class="highlighter-rouge">kylin.job.mapreduce.default.reduce.input.mb=200</code></p>
+<p>This step builds the base cuboid from the intermediate table; it is the first MR round of the “by-layer” cubing algorithm. The number of mappers equals the number of reducers in step 2. The number of reducers is estimated from the cube statistics: by default, one reducer is used per 500 MB of output. If you observe that the number of reducers is small, you can set “kylin.job.mapreduce.default.reduce.input.mb” in kylin.properties to a smaller value to get more resources, e.g.: <code class="highlighter-rouge">kylin.job.mapreduce.default.reduce.input.mb=200</code></p>
 
 <h2 id="build-n-dimension-cuboid">Build N-Dimension Cuboid</h2>
 
-<p>These steps are the “by-layer” cubing process, each step uses the output of previous step as the input. It is similar as the “build base cuboid” step. Usually from the N-D to (N/2)-D the building is slow, because it is the cuboid explosion process: N-D has 1 Cuboid, (N-1)-D has N cuboids, (N-2)-D has N*(N-1) cuboids, etc. After (N/2)-D step, the building gets faster gradually.</p>
+<p>These steps are the “by-layer” cubing process: each step uses the output of the previous step as its input, then cuts off one dimension and aggregates to get one child cuboid. For example, from cuboid ABCD, cutting off A gives BCD, cutting off B gives ACD, etc.</p>
+
+<p>Some cuboids can be aggregated from more than one parent cuboid; in this case, Kylin selects the parent cuboid with the smallest id. For example, AB can be generated from ABC (id: 1110) and ABD (id: 1101), so ABD is used because its id is smaller than that of ABC. Based on this, if D’s cardinality is small, the aggregation will be cost-efficient. So, when you design the Cube rowkey sequence, remember to put low-cardinality dimensions at the tail. This benefits not only the Cube build but also Cube queries, as post-aggregation follows the same rule.</p>
+
+<p>Usually the build is slow from N-D to (N/2)-D, because this is the cuboid explosion phase: N-D has 1 cuboid, (N-1)-D has N cuboids, (N-2)-D has N*(N-1)/2 cuboids, etc. After the (N/2)-D step, the build gradually gets faster.</p>
 
 <h2 id="build-cube">Build Cube</h2>
 
@@ -2298,7 +2302,7 @@ INSERT OVERWRITE TABLE kylin_intermediat
 
 <h2 id="convert-cuboid-data-to-hfile">Convert Cuboid Data to HFile</h2>
 
-<p>This step starts a MR job to convert the Cuboid files (sequence file format) into HBase’s HFile format. Kylin calculates the HBase region number with the Cube statistics, by default 1 region per 5GB. The more regions got, the more reducers would be utilized. If you observe the reducer’s number is small and perforamnce is poor, you can set the following parameters in “conf/kylin.properties” to smaller, as follows:</p>
+<p>This step starts an MR job to convert the Cuboid files (sequence file format) into HBase’s HFile format. Kylin calculates the number of HBase regions from the Cube statistics, by default 1 region per 5GB. The more regions there are, the more reducers are utilized. If you observe that the number of reducers is small and performance is poor, you can set the following parameters in “conf/kylin.properties” to smaller values, as follows:</p>
 
 <div class="highlighter-rouge"><pre class="highlight"><code>kylin.hbase.region.cut=2
 kylin.hbase.hfile.size.gb=1
@@ -2309,7 +2313,7 @@ kylin.hbase.hfile.size.gb=1
 
 <h2 id="load-hfile-to-hbase-table">Load HFile to HBase Table</h2>
 
-<p>This step uses HBase API to load the HFiles to region servers, it is lightweight and fast.</p>
+<p>This step uses the HBase API to load the HFile to the region servers; it is lightweight and fast.</p>
 
 <h2 id="update-cube-info">Update Cube Info</h2>
 
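Note on the "Build N-Dimension Cuboid" hunk above: the parent-selection rule it describes can be sketched in Python as follows. This is only an illustration, not Kylin's actual implementation. It assumes cuboid ids are bitmasks over the rowkey order (A is the highest bit, so ABC = 1110 and ABD = 1101, as in the example in the text); the function names are hypothetical. The per-layer counts use the binomial coefficient C(N, k), the number of cuboids that have k of the N dimensions.

```python
from math import comb


def candidate_parents(child: int, n_dims: int) -> list[int]:
    """Return every cuboid id that has exactly one more dimension than `child`.

    Assumption: a cuboid id is a bitmask over the rowkey order, so adding a
    dimension means setting one bit that is currently 0.
    """
    return [child | (1 << bit) for bit in range(n_dims) if not child & (1 << bit)]


def select_parent(child: int, n_dims: int) -> int:
    """Pick the parent with the smallest id, as the text describes."""
    return min(candidate_parents(child, n_dims))


def cuboids_per_layer(n_dims: int) -> list[int]:
    """Number of cuboids in each layer, from N-D down to 0-D."""
    return [comb(n_dims, k) for k in range(n_dims, -1, -1)]


AB = 0b1100  # dimensions A and B out of ABCD
print(format(select_parent(AB, 4), "04b"))  # 1101, i.e. ABD
print(cuboids_per_layer(4))                 # [1, 4, 6, 4, 1]
```

Because the smallest id wins, the preferred parent is the one whose extra dimension sits closest to the tail of the rowkey, which is why a low-cardinality dimension in that position keeps the aggregation cheap.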

Modified: kylin/site/feed.xml
URL: http://svn.apache.org/viewvc/kylin/site/feed.xml?rev=1781490&r1=1781489&r2=1781490&view=diff
==============================================================================
--- kylin/site/feed.xml (original)
+++ kylin/site/feed.xml Fri Feb  3 01:00:48 2017
@@ -19,8 +19,8 @@
     <description>Apache Kylin Home</description>
     <link>http://kylin.apache.org/</link>
     <atom:link href="http://kylin.apache.org/feed.xml" rel="self" type="application/rss+xml"/>
-    <pubDate>Wed, 25 Jan 2017 00:29:40 -0800</pubDate>
-    <lastBuildDate>Wed, 25 Jan 2017 00:29:40 -0800</lastBuildDate>
+    <pubDate>Thu, 02 Feb 2017 17:00:08 -0800</pubDate>
+    <lastBuildDate>Thu, 02 Feb 2017 17:00:08 -0800</lastBuildDate>
     <generator>Jekyll v2.5.3</generator>
     
       <item>