Posted to commits@kylin.apache.org by su...@apache.org on 2016/06/10 12:50:23 UTC

[1/2] kylin git commit: Add document of Standalone HBase Cluster

Repository: kylin
Updated Branches:
  refs/heads/document 5834ded3d -> 9a204458d


Add document of Standalone HBase Cluster


Project: http://git-wip-us.apache.org/repos/asf/kylin/repo
Commit: http://git-wip-us.apache.org/repos/asf/kylin/commit/43eaece9
Tree: http://git-wip-us.apache.org/repos/asf/kylin/tree/43eaece9
Diff: http://git-wip-us.apache.org/repos/asf/kylin/diff/43eaece9

Branch: refs/heads/document
Commit: 43eaece94fab3e2b018e35b27324d9a0b58fa834
Parents: 5834ded
Author: sunyerui <su...@gmail.com>
Authored: Fri Jun 10 18:01:52 2016 +0800
Committer: sunyerui <su...@gmail.com>
Committed: Fri Jun 10 18:01:52 2016 +0800

----------------------------------------------------------------------
 .../blog/2016-06-10-standalone-hbase-cluster.md | 35 ++++++++++++++++++++
 1 file changed, 35 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/kylin/blob/43eaece9/website/_posts/blog/2016-06-10-standalone-hbase-cluster.md
----------------------------------------------------------------------
diff --git a/website/_posts/blog/2016-06-10-standalone-hbase-cluster.md b/website/_posts/blog/2016-06-10-standalone-hbase-cluster.md
new file mode 100644
index 0000000..35e8604
--- /dev/null
+++ b/website/_posts/blog/2016-06-10-standalone-hbase-cluster.md
@@ -0,0 +1,35 @@
+---
+layout: post-blog
+title:  Deploy Apache Kylin with Standalone HBase Cluster
+date:   2016-06-10 17:30:00
+author: Yerui Sun
+categories: blog
+---
+
+## Introduction
+
+Apache Kylin mainly uses HBase to store cube data, so the performance of the HBase cluster directly affects Kylin's query performance. In a common deployment, HBase shares one HDFS cluster with MR/Hive, which limits the resources available to HBase and lets MR jobs degrade HBase performance. A standalone HBase cluster resolves these problems, and Apache Kylin now supports this deployment mode.
+
+## Environment Requirements
+To enable standalone HBase cluster support, first check the basic environment:
+ - Deploy the main cluster and the HBase cluster, and make sure both work normally
+ - Make sure the Kylin server can access both clusters using the hdfs shell with fully qualified paths
+ - Make sure the Kylin server can submit MR jobs to the main cluster and can use the hive shell to access the data warehouse; the hadoop and hive configurations must point to the main cluster
+ - Make sure the Kylin server can access the HBase cluster using the hbase shell; the hbase configuration must point to the HBase cluster
+ - Make sure jobs on the main cluster can access the HBase cluster directly
+ 
+## Configurations
+Set `kylin.hbase.cluster.fs` in kylin.properties to the Namenode address of the HBase cluster, for example `hdfs://hbase-cluster-nn01.example.com:8020`.
+
+Note that the value must be consistent with the Namenode address of the root directory (the `hbase.rootdir` property) on the HBase Master node, so that bulk load into HBase works.
+
+## Using NN HA
+HDFS Namenode HA significantly improves cluster availability, and the HBase cluster may have it enabled. Apache Kylin doesn't fully support HA yet, so here is the workaround:
+ - Add all `dfs.nameservices` related configs of the HBase cluster to `hadoop/etc/hadoop/hdfs-site.xml` on the Kylin server, so that the Kylin server can access the HBase cluster using the hdfs shell with the nameservice path
+ - Add all `dfs.nameservices` related configs of both clusters to `kylin_job_conf.xml`, so that MR jobs can access the HBase cluster with the nameservice path
+
+## Troubleshooting
+ - UnknownHostException occurs during cube building
+   This usually happens with an HBase HA nameservice config; please refer to the section "Using NN HA" above
+ - HFile bulk loading gets stuck for a long time
+   Check the regionserver log; there should be many error entries with a WrongFS exception. Make sure the Namenode address in `kylin.properties/kylin.hbase.cluster.fs` and in `hbase-site.xml/hbase.rootdir` on the HBase master node are the same
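
The WrongFS check described in the troubleshooting notes can be sketched as a small comparison of the two URIs. This is only an illustration in Python, not part of Kylin: the function name `same_filesystem` is hypothetical, the hostnames are placeholders, and it assumes the HBase root directory is the value of `hbase.rootdir` in `hbase-site.xml`.

```python
from urllib.parse import urlparse


def same_filesystem(kylin_cluster_fs: str, hbase_rootdir: str) -> bool:
    """Return True when both URIs share the same scheme and Namenode authority.

    Hypothetical helper: only the scheme and host:port are compared,
    the path part (for example /hbase) is ignored.
    """
    a, b = urlparse(kylin_cluster_fs), urlparse(hbase_rootdir)
    return (a.scheme, a.netloc.lower()) == (b.scheme, b.netloc.lower())


# Placeholder values, not real defaults.
print(same_filesystem("hdfs://hbase-cluster-nn01.example.com:8020",
                      "hdfs://hbase-cluster-nn01.example.com:8020/hbase"))
print(same_filesystem("hdfs://main-cluster-nn01.example.com:8020",
                      "hdfs://hbase-cluster-nn01.example.com:8020/hbase"))
```

A mismatch here corresponds to the situation where `kylin.hbase.cluster.fs` points at a different Namenode than the one HBase itself uses, which is the case the WrongFS note warns about.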

[2/2] kylin git commit: Add document for count distinct usage

Posted by su...@apache.org.

Add document for count distinct usage


Project: http://git-wip-us.apache.org/repos/asf/kylin/repo
Commit: http://git-wip-us.apache.org/repos/asf/kylin/commit/9a204458
Tree: http://git-wip-us.apache.org/repos/asf/kylin/tree/9a204458
Diff: http://git-wip-us.apache.org/repos/asf/kylin/diff/9a204458

Branch: refs/heads/document
Commit: 9a204458dc503845d9f644347f968ffc862fc99b
Parents: 43eaece
Author: sunyerui <su...@gmail.com>
Authored: Fri Jun 10 19:01:41 2016 +0800
Committer: sunyerui <su...@gmail.com>
Committed: Fri Jun 10 20:49:52 2016 +0800

----------------------------------------------------------------------
 .../blog/2016-06-10-count-distinct-in-kylin.md  | 96 ++++++++++++++++++++
 1 file changed, 96 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/kylin/blob/9a204458/website/_posts/blog/2016-06-10-count-distinct-in-kylin.md
----------------------------------------------------------------------
diff --git a/website/_posts/blog/2016-06-10-count-distinct-in-kylin.md b/website/_posts/blog/2016-06-10-count-distinct-in-kylin.md
new file mode 100644
index 0000000..dbf0862
--- /dev/null
+++ b/website/_posts/blog/2016-06-10-count-distinct-in-kylin.md
@@ -0,0 +1,96 @@
+---
+layout: post-blog
+title:  Use Count Distinct in Apache Kylin
+date:   2016-06-10 18:30:00
+author: Yerui Sun 
+categories: blog
+---
+
+Since v.1.5.3
+
+## Background
+Count distinct is a common measure in OLAP analysis, typically used for metrics such as UV (unique visitors). Apache Kylin offers two kinds of count distinct, approximate and precise, which differ in resource usage and performance.
+
+## Approximate Count Distinct
+Apache Kylin implements approximate count distinct using the HyperLogLog algorithm and offers several precision levels, with error rates from 9.75% to 1.22%.
+The measure result has a theoretical upper size limit of 2^N bytes. For the maximum precision N=16, the upper limit is 64KB and the maximum error rate is 1.22%.
+The advantages of this implementation are fast calculation and low storage cost, but it can't be used when precise results are required.
+
+## Precise Count Distinct
+Apache Kylin also implements precise count distinct based on bitmaps. For data of type tinyint (byte), smallint (short) and int, the value is projected into the bitmap directly. For data of type long, string and others, the value is encoded as a string into a dictionary, and the dictionary id is projected into the bitmap.
+The measure result is the serialized bitmap, not just the count value. This ensures that the result is always correct under any roll-up, even across segments.
+The advantage of this implementation is a precise result without error, but it needs more storage. A single result may be hundreds of MB in size when the distinct count is in the millions.
+
+## Global Dictionary
+By default, Apache Kylin encodes values into a dictionary at the segment level. That means the same value may be encoded into different ids in different segments, so the result of precise count distinct across segments may be incorrect.
+To resolve this problem, we introduced the Global Dictionary, which guarantees that the same value is always encoded into the same id in different segments. Meanwhile, the dictionary capacity has expanded dramatically, supporting up to 2G values in one dictionary. It can also be used to replace the default dictionary, which is limited to 5M values.
+The current version has no UI for the global dictionary yet, so the cube desc JSON should be modified to enable it:
+
+```
+"dictionaries": [
+    {
+      "column": "SUCPAY_USERID",
+      "reuse": "USER_ID",
+      "builder": "org.apache.kylin.dict.GlobalDictionaryBuilder"
+    }
+]
+```
+
+`column` is the column to be encoded, and `builder` specifies the dictionary builder; only `org.apache.kylin.dict.GlobalDictionaryBuilder` is available for now.
+`reuse` is used to optimize the dictionaries of multiple columns that are based on one dataset; please refer to the next section, 'Example', for more details.
+The global dictionary can't be used for dimension encoding for now. That means if a column is used both as a dimension and as a count distinct measure in one cube, the dimension must use an encoding other than dict.
+
+## Example
+Here is some example data:
+| DT           | USER\_ID | FLAG1 | FLAG2 | USER\_ID\_FLAG1 | USER\_ID\_FLAG2 |
+| :----------: | :------: | :---: | :---: | :-------------: | :-------------: |
+| 2016-06-08   | AAA      | 1     | 1     | AAA             | AAA             |
+| 2016-06-08   | BBB      | 1     | 1     | BBB             | BBB             |
+| 2016-06-08   | CCC      | 0     | 1     | NULL            | CCC             |
+| 2016-06-09   | AAA      | 0     | 1     | NULL            | AAA             |
+| 2016-06-09   | CCC      | 1     | 0     | CCC             | NULL            |
+| 2016-06-10   | BBB      | 0     | 1     | NULL            | BBB             |
+
+There are basic columns `DT`, `USER_ID`, `FLAG1`, `FLAG2`, and conditional columns `USER_ID_FLAG1=if(FLAG1=1,USER_ID,null)`, `USER_ID_FLAG2=if(FLAG2=1,USER_ID,null)`. Suppose the cube is built by day and has 3 segments.
+
+Without the global dictionary, precise count distinct within a segment is correct, but the result of a roll-up across segments is wrong. Here's an example:
+
+```
+select count(distinct user_id_flag1) from table where dt in ('2016-06-08', '2016-06-09')
+```
+The result is 2 instead of the correct 3. The reason is that the dict in the 2016-06-08 segment is AAA=>1, BBB=>2, and the dict in the 2016-06-09 segment is CCC=>1, so CCC collides with AAA when the bitmaps are merged.
+With the global dictionary configured as below, the dict becomes AAA=>1, BBB=>2, CCC=>3, which produces the correct result.
+```
+"dictionaries": [
+    {
+      "column": "USER_ID_FLAG1",
+      "builder": "org.apache.kylin.dict.GlobalDictionaryBuilder"
+    }
+]
+```
+
+In fact, the data of USER_ID_FLAG1 and USER_ID_FLAG2 are both subsets of the USER_ID dataset, which makes dictionary reuse possible. Just encode the USER_ID dataset, and configure USER_ID_FLAG1 and USER_ID_FLAG2 to reuse the USER_ID dict:
+```
+"dictionaries": [
+    {
+      "column": "USER_ID",
+      "builder": "org.apache.kylin.dict.GlobalDictionaryBuilder"
+    },
+    {
+      "column": "USER_ID_FLAG1",
+      "reuse": "USER_ID",
+      "builder": "org.apache.kylin.dict.GlobalDictionaryBuilder"
+    },
+    {
+      "column": "USER_ID_FLAG2",
+      "reuse": "USER_ID",
+      "builder": "org.apache.kylin.dict.GlobalDictionaryBuilder"
+    }
+]
+```
+
+## Conclusions
+Here are some basic principles for deciding which kind of count distinct to use:
+ - If a result with some error rate is acceptable, the approximate way is always the better choice
+ - If you need a precise result, the only way is precise count distinct
+ - If you don't need roll-up across segments, or the column data type is tinyint/smallint/int, or the number of distinct values is less than 5M, just use the default dictionary; otherwise configure the global dictionary, and consider the reuse column optimization
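
The roll-up problem from the Example section can be reproduced with a small simulation. This is a minimal sketch in Python, not Kylin's implementation: plain sets stand in for bitmaps, ids are assigned in insertion order (an assumption; the real dictionary builders may assign ids differently), and the helper names `build_dict` and `rollup_count` are hypothetical.

```python
# Non-null USER_ID_FLAG1 values per daily segment, taken from the example table.
segments = {
    "2016-06-08": ["AAA", "BBB"],
    "2016-06-09": ["CCC"],
}


def build_dict(values, dictionary=None):
    """Assign ids 1, 2, 3, ... to unseen values, keeping ids already present."""
    dictionary = {} if dictionary is None else dictionary
    for value in values:
        dictionary.setdefault(value, len(dictionary) + 1)
    return dictionary


def rollup_count(bitmaps):
    """Union the per-segment id sets (stand-ins for bitmaps) and count the ids."""
    return len(set().union(*bitmaps))


# Segment-level dictionaries: every segment starts numbering from 1 again.
per_segment = [set(build_dict(vals).values()) for vals in segments.values()]

# Global dictionary: one dictionary shared by all segments.
global_dict = {}
shared = []
for vals in segments.values():
    build_dict(vals, global_dict)
    shared.append({global_dict[v] for v in vals})

segment_level_count = rollup_count(per_segment)
global_count = rollup_count(shared)
print(segment_level_count)  # 2, because CCC=>1 collides with AAA=>1
print(global_count)         # 3, because every value has a unique id
```

With segment-level ids the merged set is {1, 2}, which matches the wrong result of 2 described above; with the shared dictionary the merged set is {1, 2, 3}.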