Posted to commits@kylin.apache.org by li...@apache.org on 2015/09/03 03:31:30 UTC

[15/50] [abbrv] incubator-kylin git commit: update doc: add how to optmize cubes page

update doc: add how to optmize cubes page


Project: http://git-wip-us.apache.org/repos/asf/incubator-kylin/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-kylin/commit/8e347017
Tree: http://git-wip-us.apache.org/repos/asf/incubator-kylin/tree/8e347017
Diff: http://git-wip-us.apache.org/repos/asf/incubator-kylin/diff/8e347017

Branch: refs/heads/0.7
Commit: 8e34701709d5d00df72de5d1fafe06f00bee6398
Parents: 83de8e0
Author: honma <ho...@ebay.com>
Authored: Fri Aug 21 15:16:18 2015 +0800
Committer: honma <ho...@ebay.com>
Committed: Fri Aug 21 15:16:18 2015 +0800

----------------------------------------------------------------------
 website/README.md                           |   2 +-
 website/_data/docs.yml                      |   1 +
 website/_docs/howto/howto_backup_hbase.md   |   2 +-
 website/_docs/howto/howto_optimize_cubes.md | 214 +++++++++++++++++++++++
 website/_docs/index.md                      |  33 ++--
 website/_docs/install/advance_settings.md   |   8 -
 website/_docs/install/kylin_cluster.md      |  18 +-
 7 files changed, 252 insertions(+), 26 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-kylin/blob/8e347017/website/README.md
----------------------------------------------------------------------
diff --git a/website/README.md b/website/README.md
index f977b5b..873ea58 100644
--- a/website/README.md
+++ b/website/README.md
@@ -27,7 +27,7 @@ This directory contains the source code for the Apache Kylin (incubating) websit
 2. ___layouts__: Page layout template
 3. ___includes__: Page template like header, footer...
 2. ___data__: Jekyll collections, docs.yml is for Docs menu generation
-3. ___docs__: Docuemtation folder
+3. ___docs__: Documentation folder
 4. ___posts__: Blog folder
 5. __download__: Download folder, including released source code package, binary package, ODBC Driver and development version.
 6. __cn__: Chinese version 

http://git-wip-us.apache.org/repos/asf/incubator-kylin/blob/8e347017/website/_data/docs.yml
----------------------------------------------------------------------
diff --git a/website/_data/docs.yml b/website/_data/docs.yml
index f747579..a1ac243 100644
--- a/website/_data/docs.yml
+++ b/website/_data/docs.yml
@@ -47,6 +47,7 @@
   - howto/howto_build_cube_with_restapi
   - howto/howto_use_restapi_in_js
   - howto/howto_use_restapi
+  - howto/howto_optimize_cubes
   - howto/howto_backup_metadata
   - howto/howto_backup_hbase
   - howto/howto_jdbc

http://git-wip-us.apache.org/repos/asf/incubator-kylin/blob/8e347017/website/_docs/howto/howto_backup_hbase.md
----------------------------------------------------------------------
diff --git a/website/_docs/howto/howto_backup_hbase.md b/website/_docs/howto/howto_backup_hbase.md
index 1ffc5a5..5c54ca5 100644
--- a/website/_docs/howto/howto_backup_hbase.md
+++ b/website/_docs/howto/howto_backup_hbase.md
@@ -1,6 +1,6 @@
 ---
 layout: docs
-title:  How to Backup HBase Tables
+title:  How to Clean/Backup HBase Tables
 categories: howto
 permalink: /docs/howto/howto_backup_hbase.html
 version: v0.7.2

http://git-wip-us.apache.org/repos/asf/incubator-kylin/blob/8e347017/website/_docs/howto/howto_optimize_cubes.md
----------------------------------------------------------------------
diff --git a/website/_docs/howto/howto_optimize_cubes.md b/website/_docs/howto/howto_optimize_cubes.md
new file mode 100644
index 0000000..50b42d4
--- /dev/null
+++ b/website/_docs/howto/howto_optimize_cubes.md
@@ -0,0 +1,214 @@
+---
+layout: docs
+title:  How to optimize cubes
+categories: howto
+permalink: /docs/howto/howto_optimize_cubes.html
+version: v0.7.2
+since: v0.7.1
+---
+
+## Hierarchies:
+
+In theory, N dimensions yield 2^N dimension combinations. However, for some groups of dimensions there is no need to create so many combinations. For example, suppose you have three dimensions: continent, country, city (in a hierarchy, the "bigger" dimension comes first). For drill-down analysis you only need the following three group-by combinations:
+
+group by continent
+group by continent, country
+group by continent, country, city
+
+In such cases the combination count is reduced from 2^3=8 to 3, which is a great optimization. The same goes for the YEAR,QUARTER,MONTH,DATE case.
+
+If we denote the hierarchy dimensions as H1,H2,H3, typical scenarios would be:
+
+
+A. Hierarchies on lookup table
+
+
+<table>
+  <tr>
+    <td align="center">Fact table</td>
+    <td align="center">(joins)</td>
+    <td align="center">Lookup Table</td>
+  </tr>
+  <tr>
+    <td>column1,column2,,,,,, FK</td>
+    <td></td>
+    <td>PK,,H1,H2,H3,,,,</td>
+  </tr>
+</table>
+
+---
+
+B. Hierarchies on fact table
+
+
+<table>
+  <tr>
+    <td align="center">Fact table</td>
+  </tr>
+  <tr>
+    <td>column1,column2,,,H1,H2,H3,,,,,,, </td>
+  </tr>
+</table>
+
+---
+
+
+There is a special case of scenario A, where the PK of the lookup table happens to be part of the hierarchy. For example, we have a calendar lookup table where cal_dt is the primary key:
+
+A*. Hierarchies on lookup table over its primary key
+
+
+<table>
+  <tr>
+    <td align="center">Lookup Table(Calendar)</td>
+  </tr>
+  <tr>
+    <td>cal_dt(PK), week_beg_dt, month_beg_dt, quarter_beg_dt,,,</td>
+  </tr>
+</table>
+
+---
+
+
+For cases like A*, what you need is another optimization called "Derived Columns".
+
+## Derived Columns:
+
+A derived column is used when one or more dimensions can be deduced from another. The deduced dimensions must be on the lookup table and are called "derived"; the dimension they are deduced from is usually the corresponding FK and is called the "host column".
+
+For example, suppose we have a lookup table that we join to the fact table with "where DimA = DimX". Notice that in Kylin, if you choose the FK as a dimension, the corresponding PK becomes automatically queryable without any extra cost. The secret is that since FK and PK are always identical, Kylin can apply filters/group-bys on the FK first and transparently replace them with the PK. This means that if we want DimA(FK), DimX(PK), DimB and DimC in our cube, we can safely choose DimA, DimB and DimC only.
+
+<table>
+  <tr>
+    <td align="center">Fact table</td>
+    <td align="center">(joins)</td>
+    <td align="center">Lookup Table</td>
+  </tr>
+  <tr>
+    <td>column1,column2,,,,,, DimA(FK) </td>
+    <td></td>
+    <td>DimX(PK),,DimB, DimC</td>
+  </tr>
+</table>
+
+---
+
+
+Let's say that DimA(the dimension representing FK/PK) has a special mapping to DimB:
+
+
+<table>
+  <tr>
+    <th>dimA</th>
+    <th>dimB</th>
+    <th>dimC</th>
+  </tr>
+  <tr>
+    <td>1</td>
+    <td>a</td>
+    <td>?</td>
+  </tr>
+  <tr>
+    <td>2</td>
+    <td>b</td>
+    <td>?</td>
+  </tr>
+  <tr>
+    <td>3</td>
+    <td>c</td>
+    <td>?</td>
+  </tr>
+  <tr>
+    <td>4</td>
+    <td>a</td>
+    <td>?</td>
+  </tr>
+</table>
+
+
+In this case, given a value of DimA, the value of DimB is determined, so we say DimB can be derived from DimA. When we build a cube that contains both DimA and DimB, we simply include DimA and mark DimB as derived. The derived column (DimB) does not participate in cuboid generation:
+
+original combinations:
+ABC,AB,AC,BC,A,B,C
+
+combinations when deriving B from A:
+AC,A,C
+
+At runtime, a query like "select count(*) from fact_table inner join lookup1 group by lookup1.dimB" expects a cuboid containing DimB to answer it. However, DimB appears in none of the cuboids because of the derived-column optimization. In this case, we modify the execution plan to group by DimA (its host column) first, which gives an intermediate answer like:
+
+
+<table>
+  <tr>
+    <th>DimA</th>
+    <th>count(*)</th>
+  </tr>
+  <tr>
+    <td>1</td>
+    <td>1</td>
+  </tr>
+  <tr>
+    <td>2</td>
+    <td>1</td>
+  </tr>
+  <tr>
+    <td>3</td>
+    <td>1</td>
+  </tr>
+  <tr>
+    <td>4</td>
+    <td>1</td>
+  </tr>
+</table>
+
+
+Afterwards, Kylin replaces the DimA values with DimB values (since both are in the lookup table, Kylin can load the whole lookup table into memory and build a mapping between them), and the intermediate result becomes:
+
+
+<table>
+  <tr>
+    <th>DimB</th>
+    <th>count(*)</th>
+  </tr>
+  <tr>
+    <td>a</td>
+    <td>1</td>
+  </tr>
+  <tr>
+    <td>b</td>
+    <td>1</td>
+  </tr>
+  <tr>
+    <td>c</td>
+    <td>1</td>
+  </tr>
+  <tr>
+    <td>a</td>
+    <td>1</td>
+  </tr>
+</table>
+
+
+After this, the runtime SQL engine (Calcite) further aggregates the intermediate result to:
+
+
+<table>
+  <tr>
+    <th>DimB</th>
+    <th>count(*)</th>
+  </tr>
+  <tr>
+    <td>a</td>
+    <td>2</td>
+  </tr>
+  <tr>
+    <td>b</td>
+    <td>1</td>
+  </tr>
+  <tr>
+    <td>c</td>
+    <td>1</td>
+  </tr>
+</table>
+
+
+This step happens at query runtime; this is what is meant by "at the cost of extra runtime aggregation".
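
The walk-through in howto_optimize_cubes.md above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Kylin's implementation: the function names `all_combinations`, `hierarchy_combinations` and `group_by_derived` are hypothetical, and the lookup mapping is the four-row DimA/DimB example from the page. Note that the page's 2^3=8 presumably counts the empty grouping as well, while its explicit list (ABC,AB,AC,BC,A,B,C) shows the 7 non-empty ones; the sketch enumerates the non-empty ones.

```python
from collections import Counter
from itertools import combinations


def all_combinations(dims):
    """Every non-empty subset of dims, largest first (no optimization)."""
    return [c for r in range(len(dims), 0, -1) for c in combinations(dims, r)]


def hierarchy_combinations(dims):
    """Only the prefixes of a hierarchy, with the 'bigger' dimension first."""
    return [tuple(dims[:i]) for i in range(1, len(dims) + 1)]


# Lookup-table mapping from the example: DimA (host column) -> DimB (derived).
LOOKUP_A_TO_B = {1: "a", 2: "b", 3: "c", 4: "a"}


def group_by_derived(counts_by_host, host_to_derived):
    """Replace host values with derived values, then aggregate once more.

    counts_by_host is the intermediate result of grouping by the host column.
    The second aggregation is the extra work done at query time.
    """
    result = Counter()
    for host_value, count in counts_by_host.items():
        result[host_to_derived[host_value]] += count
    return dict(result)


if __name__ == "__main__":
    # Hierarchy: 3 prefixes instead of every subset of the 3 dimensions.
    print(hierarchy_combinations(["continent", "country", "city"]))
    # Derived column: B is dropped from cuboid generation, leaving AC, A, C.
    print(["".join(c) for c in all_combinations(["A", "C"])])
    # Query on DimB: group by DimA first, map to DimB, aggregate again.
    print(group_by_derived({1: 1, 2: 1, 3: 1, 4: 1}, LOOKUP_A_TO_B))
```

The second aggregation in `group_by_derived` is what merges the two DimA rows that both map to "a"; skipping it would return duplicate DimB rows, as in the page's second intermediate table.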

http://git-wip-us.apache.org/repos/asf/incubator-kylin/blob/8e347017/website/_docs/index.md
----------------------------------------------------------------------
diff --git a/website/_docs/index.md b/website/_docs/index.md
index d536d5d..0c62b63 100644
--- a/website/_docs/index.md
+++ b/website/_docs/index.md
@@ -11,39 +11,46 @@ Welcome to Apache Kylin
 
 Apache Kylin is an open source Distributed Analytics Engine, contributed by eBay Inc., provides SQL interface and multi-dimensional analysis (OLAP) on Hadoop supporting extremely large datasets.
 
-Installation 
+Installation & Setup
 ------------  
 
-Please follow  installation & tutorial to start with Kylin.
+Please follow installation & tutorial in the navigation panel.
 
 Advanced Topics
 -------  
 
-### Connectivity
+#### Connectivity
+
 1.[How to use kylin remote jdbc driver](howto/howto_jdbc.html)
 
 2.[SQL Reference](http://calcite.incubator.apache.org/)
 
-### REST API
-1.[Kylin Restful API List](/development/rest_api.html)
+---
+
+#### REST API
 
-2.[Build Cube with Restful API](/development/build_api.html)
+1.[Kylin Restful API List](howto/howto_use_restapi.html)
 
-3.[How to consume Kylin REST API in javascript](/development/javascript_api.html)
+2.[Build Cube with Restful API](howto/howto_build_cube_with_restapi.html)
 
-### Operations
-1.[Kylin Metadata Store](/development/new_metadata.html)
+3.[How to consume Kylin REST API in javascript](howto/howto_use_restapi_in_js.html)
+
+---
 
-2.[Export Kylin HBase data](howto/howto_backup.html)
+#### Operations
+
+1.[Check Kylin Metadata Store](howto/howto_backup_metadata.html)
+
+2.[Clean/Export Kylin HBase data](howto/howto_backup.html)
 
 3.[Advanced settings of Kylin environment](install/advance_settings.html)
 
-### Test
-1.[Run Kylin test case with HBase Mini Cluster](/development/test_minicluster.html)
+---
 
+#### Technical Details
 
-### Technical Details
 1.[New meta data model structure](/development/new_metadata.html)
 
+---
 
 

http://git-wip-us.apache.org/repos/asf/incubator-kylin/blob/8e347017/website/_docs/install/advance_settings.md
----------------------------------------------------------------------
diff --git a/website/_docs/install/advance_settings.md b/website/_docs/install/advance_settings.md
index f618d37..e9f2a7a 100644
--- a/website/_docs/install/advance_settings.md
+++ b/website/_docs/install/advance_settings.md
@@ -34,12 +34,4 @@ create 'lzoTable', {NAME => 'colFam',COMPRESSION => 'LZO'}
 You'll need to stop Kylin first by running `./kylin.sh stop`, and then modify $KYLIN_HOME/conf/kylin_job_conf.xml by uncommenting some configuration entries related to LZO compression. 
 After this, you need to run `./kylin.sh start` to start Kylin again. Now Kylin will use LZO to compress MR outputs and hbase tables.
 
-## Kylin Server modes
 
-Kylin instances are stateless,  the runtime state is saved in its "Metadata Store" in hbase (kylin.metadata.url config in conf/kylin.properties). For load balance considerations it is possible to start multiple Kylin instances sharing the same metadata store (thus sharing the same state on table schemas, job status, cube status, etc.)
-
-Each of the kylin instances has a kylin.server.mode entry in conf/kylin.properties specifying the runtime mode, it has three options: 1. "job" for running job engine only 2. "query" for running query engine only and 3 "all" for running both. Notice that only one server can run the job engine("all" mode or "job" mode), the others must all be "query" mode.
-
-A typical scenario is depicted in the following chart:
-
-![]( /images/install/kylin_server_modes.png)

http://git-wip-us.apache.org/repos/asf/incubator-kylin/blob/8e347017/website/_docs/install/kylin_cluster.md
----------------------------------------------------------------------
diff --git a/website/_docs/install/kylin_cluster.md b/website/_docs/install/kylin_cluster.md
index 6cb832f..8dfd094 100644
--- a/website/_docs/install/kylin_cluster.md
+++ b/website/_docs/install/kylin_cluster.md
@@ -7,12 +7,24 @@ version: v0.7.2
 since: v0.7.1
 ---
 
-### Multiple Kylin REST servers
+
+### Kylin Server modes
+
+Kylin instances are stateless; the runtime state is saved in the "Metadata Store" in HBase (the kylin.metadata.url config in conf/kylin.properties). For load balancing it is possible to start multiple Kylin instances sharing the same metadata store (thus sharing the same state on table schemas, job status, cube status, etc.).
+
+Each Kylin instance has a kylin.server.mode entry in conf/kylin.properties specifying the runtime mode. It has three options: 1. "job" for running the job engine only, 2. "query" for running the query engine only, and 3. "all" for running both. Notice that only one server can run the job engine ("all" mode or "job" mode); the others must all be in "query" mode.
+
+A typical scenario is depicted in the following chart:
+
+![]( /images/install/kylin_server_modes.png)
+
+### Setting up Multiple Kylin REST servers
 
 If you are running Kylin in a cluster or you have multiple Kylin REST server instances, please make sure you have the following property correctly configured in ${KYLIN_HOME}/conf/kylin.properties
 
 1. kylin.rest.servers 
-	List of web servers in use, this enables one web server instance to sync up with other servers.
+	List of web servers in use, this enables one web server instance to sync up with other servers. For example: kylin.rest.servers=sandbox1:7070,sandbox2:7070
   
 2. kylin.server.mode
-	Make sure there is only one instance whose "kylin.server.mode" is set to "all" if there are multiple instances.
\ No newline at end of file
+	Make sure there is only one instance whose "kylin.server.mode" is set to "all" if there are multiple instances.
+	
\ No newline at end of file
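
The server-mode rule in the kylin_cluster.md change above (at most one instance running the job engine, all others in "query" mode) could be checked before rollout with a small script like the one below. This is a hypothetical helper, not part of Kylin: `check_server_modes` and the way the modes are passed in are assumptions, and the host names are borrowed from the `kylin.rest.servers` example.

```python
# The three values of kylin.server.mode described in the doc.
VALID_MODES = {"job", "query", "all"}


def check_server_modes(modes_by_host):
    """Return a list of problems found in a planned cluster layout.

    modes_by_host maps a host name to its intended kylin.server.mode value.
    An empty list means no violation of the single-job-engine rule was found.
    """
    problems = []
    for host, mode in modes_by_host.items():
        if mode not in VALID_MODES:
            problems.append(f"{host}: unknown mode {mode!r}")
    # "job" and "all" both start the job engine, so they count together.
    job_hosts = [h for h, m in modes_by_host.items() if m in ("job", "all")]
    if len(job_hosts) > 1:
        problems.append("more than one job engine: " + ", ".join(sorted(job_hosts)))
    return problems


if __name__ == "__main__":
    print(check_server_modes({"sandbox1": "all", "sandbox2": "query"}))
    print(check_server_modes({"sandbox1": "all", "sandbox2": "job"}))
```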