Posted to commits@kylin.apache.org by sh...@apache.org on 2016/08/08 15:00:34 UTC

kylin git commit: minor update on yerui's blog

Repository: kylin
Updated Branches:
  refs/heads/document f7deeb11b -> 99e65c0b0


minor update on yerui's blog


Project: http://git-wip-us.apache.org/repos/asf/kylin/repo
Commit: http://git-wip-us.apache.org/repos/asf/kylin/commit/99e65c0b
Tree: http://git-wip-us.apache.org/repos/asf/kylin/tree/99e65c0b
Diff: http://git-wip-us.apache.org/repos/asf/kylin/diff/99e65c0b

Branch: refs/heads/document
Commit: 99e65c0b03979d66fef3554d2051fcb165524164
Parents: f7deeb1
Author: shaofengshi <sh...@apache.org>
Authored: Mon Aug 8 23:00:26 2016 +0800
Committer: shaofengshi <sh...@apache.org>
Committed: Mon Aug 8 23:00:26 2016 +0800

----------------------------------------------------------------------
 .../blog/2016-08-01-count-distinct-in-kylin.md  | 22 +++++++++++---------
 1 file changed, 12 insertions(+), 10 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/kylin/blob/99e65c0b/website/_posts/blog/2016-08-01-count-distinct-in-kylin.md
----------------------------------------------------------------------
diff --git a/website/_posts/blog/2016-08-01-count-distinct-in-kylin.md b/website/_posts/blog/2016-08-01-count-distinct-in-kylin.md
index 242b0d1..105cd2e 100644
--- a/website/_posts/blog/2016-08-01-count-distinct-in-kylin.md
+++ b/website/_posts/blog/2016-08-01-count-distinct-in-kylin.md
@@ -18,19 +18,21 @@ This implementation's pros is fast caculating and storage resource saving, but c
 
 ## Precisely Count Distinct
Apache Kylin also implements precise count distinct based on bitmap. For data of type tinyint (byte), smallint (short) and int, the value is projected into the bitmap directly. For data of type long, string and others, the value is encoded as a String into a dictionary, and the dictionary id is projected into the bitmap.
-The result of measure is the serialized data of bitmap, not just the count value. This makes sure that the rusult is always right with any roll-up, even across segments.
-This implementation's pros is precesily result, without error, but needs more storage resources. One result size maybe hundreds of MB, when the count distinct value over millions.
+The result of the measure is the serialized bitmap, not just the count value. This ensures that the result is always correct under any roll-up, even across segments.
+The advantage of this implementation is a precise result with no error, but it needs more storage resources. A single result might be hundreds of MB when the count distinct value goes over millions.
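To make the mechanism concrete, here is a minimal, self-contained Java sketch of the idea (a toy illustration only, not Kylin's actual classes; the class name, method names and sample values are made up): each segment keeps a bitmap of dictionary IDs, and rolling up across segments is just a bitwise OR of those bitmaps.

```
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

public class BitmapCountDistinctDemo {

    // Stand-in for a global dictionary: the same value always maps to the same id.
    static final Map<String, Integer> DICT = new HashMap<>();

    static int encode(String value) {
        return DICT.computeIfAbsent(value, v -> DICT.size());
    }

    // Build the per-segment bitmap measure: project each dictionary id into the bitmap.
    static BitSet buildSegment(String... userIds) {
        BitSet bitmap = new BitSet();
        for (String id : userIds) {
            bitmap.set(encode(id));
        }
        return bitmap;
    }

    public static void main(String[] args) {
        BitSet seg1 = buildSegment("1001", "1002", "1003"); // e.g. one day's segment
        BitSet seg2 = buildSegment("1002", "1003", "1004"); // e.g. the next day's segment

        System.out.println(seg1.cardinality()); // 3 -- exact within one segment

        // Roll-up across segments: OR the per-segment bitmaps, then count.
        BitSet merged = (BitSet) seg1.clone();
        merged.or(seg2);
        System.out.println(merged.cardinality()); // 4 -- still exact after roll-up
    }
}
```

Because the measure stores the bitmap itself rather than a pre-computed count, segments built at different times can be merged this way at query time without losing precision.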
 
 ## Global Dictionary
-Apache Kylin encode value into dictionay at the segment level by default. That means one same value in different segments maybe encoded into different id, which means the result of precisely count distinct maybe not correct.
-We introduced Global Dictionary with ensurance that one same value always encode into same id in different segments, to resolve this problem. Meanwhile, the capacity of dict has expanded dramatically, upper to support 2G values in one dict. It can also be used to replace default dictionary which has 5M values limitation.
-Current version has no UI for global dictionary yet, and the cube desc json shoule be modified to enable it:
+Apache Kylin encodes values into a dictionary at the segment level by default. That means the same value in different segments may be encoded into different IDs, so the result of count distinct would be incorrect.
+
+In v1.5.3 we introduce the "Global Dictionary", which ensures that a given value is always encoded into the same ID across different segments. Meanwhile, the dictionary capacity has been expanded dramatically, supporting up to 2 billion values in one dictionary. It can also be used to replace the default dictionary, which has a 5 million value limitation.
+
+The current version (v1.5.3) has no GUI for defining a global dictionary yet; you need to manually edit the cube desc JSON like this:
 
 ```
 "dictionaries": [
     {
           "column": "SUCPAY_USERID",
-	  "reuse": "USER_ID",
+	 	   "reuse": "USER_ID",
           "builder": "org.apache.kylin.dict.GlobalDictionaryBuilder"
     }
 ]
@@ -38,7 +40,7 @@ Current version has no UI for global dictionary yet, and the cube desc json shou
 
 The `column` means the column to be encoded, and the `builder` specifies the dictionary builder; only `org.apache.kylin.dict.GlobalDictionaryBuilder` is available for now.
 The `reuse` is used to optimize the dictionaries of multiple columns that are based on one dataset; please refer to the next section 'Example' for more details.
-The global dictionay can't be used for dimensiion encoding for now, that means if one column is used for dimension and count distinct measure in one cube, the dimension encoding should be others but not dict.
+The global dictionary cannot be used for dimension encoding for now; that means if one column is used for both a dimension and a count distinct measure in one cube, its dimension encoding should be something other than dict.
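For instance, with the example schema in the next section, both condition columns could reuse the global dictionary of `USER_ID` (a hypothetical snippet following the same format as above, not taken from a real cube desc):

```
"dictionaries": [
    {
          "column": "USER_ID_FLAG1",
          "reuse": "USER_ID",
          "builder": "org.apache.kylin.dict.GlobalDictionaryBuilder"
    },
    {
          "column": "USER_ID_FLAG2",
          "reuse": "USER_ID",
          "builder": "org.apache.kylin.dict.GlobalDictionaryBuilder"
    }
]
```

Since the values of both condition columns are a subset of `USER_ID`, reusing its dictionary avoids building two more large dictionaries and keeps the IDs consistent across all three columns.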
 
 ## Example
 Here's some example data:
@@ -53,7 +55,7 @@ Here's some example data:
 
 There are basic columns `DT`, `USER_ID`, `FLAG1`, `FLAG2`, and condition columns `USER_ID_FLAG1=if(FLAG1=1,USER_ID,null)`, `USER_ID_FLAG2=if(FLAG2=1,USER_ID,null)`. Suppose the cube is built by day and has 3 segments.
 
-Without the global dictionay, the precisely count distinct in semgent is correct, but the roll-up acrros segments result is wrong. Here's an example:
+Without the global dictionary, the precise count distinct within a segment is correct, but the roll-up across segments will be wrong. Here's an example:
 
 ```
 select count(distinct user_id_flag1) from table where dt in ('2016-06-08', '2016-06-09')
@@ -99,5 +101,5 @@ kylin.job.mr.config.override.mapreduce.map.memory.mb=8500
 ## Conclusions
 Here are some basic principles to decide which kind of count distinct should be used:
  - If a result with some error rate is acceptable, the approximate way is always a better choice
- - If you need precisely result, the only way is precisely count distinct
- - If you don't need roll-up across segments, or the column data type is tinyint/smallint/int, or the values count is less than 5M, just use default dictionary; otherwise the global dictionary should be configured, and consider the reuse column optimization
+ - If you need a precise result, the only way is precise count distinct
+ - If you don't need roll-up across segments (like a non-partitioned cube), or the column data type is tinyint/smallint/int, or the value count is less than 5 million, just use the default dictionary; otherwise the global dictionary should be configured, and also consider the "reuse" column optimization