You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@griffin.apache.org by li...@apache.org on 2018/04/26 08:13:03 UTC

[40/50] [abbrv] incubator-griffin git commit: update dsl-guide and set coalesce as null string by default

update dsl-guide and set coalesce as null string by default


Project: http://git-wip-us.apache.org/repos/asf/incubator-griffin/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-griffin/commit/e9229154
Tree: http://git-wip-us.apache.org/repos/asf/incubator-griffin/tree/e9229154
Diff: http://git-wip-us.apache.org/repos/asf/incubator-griffin/diff/e9229154

Branch: refs/heads/griffin-0.2.0-incubating-rc4
Commit: e9229154832f9802bdffe49039df3f70becf1bc6
Parents: 0ba7e07
Author: Lionel Liu <bh...@163.com>
Authored: Tue Apr 24 17:56:27 2018 +0800
Committer: Lionel Liu <bh...@163.com>
Committed: Tue Apr 24 17:56:27 2018 +0800

----------------------------------------------------------------------
 griffin-doc/measure/dsl-guide.md                | 26 +++++++++++++++++---
 .../measure/rule/dsl/expr/SelectExpr.scala      |  2 +-
 2 files changed, 23 insertions(+), 5 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-griffin/blob/e9229154/griffin-doc/measure/dsl-guide.md
----------------------------------------------------------------------
diff --git a/griffin-doc/measure/dsl-guide.md b/griffin-doc/measure/dsl-guide.md
index 0fc8059..7370274 100644
--- a/griffin-doc/measure/dsl-guide.md
+++ b/griffin-doc/measure/dsl-guide.md
@@ -125,6 +125,10 @@ Profiling rule expression in Griffin DSL is a sql-like expression, with select c
 Uniqueness rule expression in Griffin DSL is a list of selection expressions separated by comma, indicates the columns to check if is unique.  
 	e.g. `name, age`, `name, (age + 1) as next_age`
 
+### Distinctness Rule
+Distinctness rule expression in Griffin DSL is a list of selection expressions separated by comma, indicates the columns to check if is distinct.
+    e.g. `name, age`, `name, (age + 1) as next_age`
+
 ### Timeliness Rule
 Timeliness rule expression in Griffin DSL is a list of selection expressions separated by comma, indicates the input time and output time (calculate time as default if not set).  
 	e.g. `ts`, `ts, end_ts`
@@ -149,12 +153,12 @@ For profiling, the request is always the aggregation function of data, the rule
 For example, the dsl rule is `source.cntry, source.id.count(), source.age.max() group by source.cntry`, which represents the profiling requests. After the translation, the sql rule is as below:  
 - **profiling sql rule**: `SELECT source.cntry, count(source.id), max(source.age) FROM source GROUP BY source.cntry`, save as table `profiling`.  
 
-After the translation, the metrics will be persisted in table `profiling`.  
+After the translation, the metrics will be persisted in table `profiling`.
 
 ### Uniqueness
-For uniqueness, or called duplicate, is to find out the duplicate items of data, and rollup the items count group by duplicate times.  
-For example, the dsl rule is `name, age`, which represents the duplicate requests, in this case, source and target are the same data set. After the translation, the sql rule is as below:  
-- **get distinct items from source**: `SELECT name, age FROM source`, save as table `src`.  
+For uniqueness, or called duplicate, is to find out the duplicate items of data, and rollup the items count group by duplicate times.
+For example, the dsl rule is `name, age`, which represents the duplicate requests, in this case, source and target are the same data set. After the translation, the sql rule is as below:
+- **get distinct items from source**: `SELECT name, age FROM source`, save as table `src`.
 - **get all items from target**: `SELECT name, age FROM target`, save as table `tgt`.
 - **join two tables**: `SELECT src.name, src.age FROM tgt RIGHT JOIN src ON coalesce(src.name, '') = coalesce(tgt.name, '') AND coalesce(src.age, '') = coalesce(tgt.age, '')`, save as table `joined`.
 - **get items duplication**: `SELECT name, age, (count(*) - 1) AS dup FROM joined GROUP BY name, age`, save as table `grouped`.
@@ -166,6 +170,20 @@ For example, the dsl rule is `name, age`, which represents the duplicate request
 
 After the translation, the metrics will be persisted in table `dup_metric`.
 
+### Distinctness
+For distinctness, is to find out the duplicate items of data, the same as uniqueness in batch mode, but with some differences in streaming mode.
+In most time, you need distinctness other than uniqueness.
+For example, the dsl rule is `name, age`, which represents the distinct requests, in this case, source and target are the same data set. After the translation, the sql rule is as below:
+- **total count of source**: `SELECT COUNT(*) AS total FROM source`, save as table `total_count`.
+- **group by fields**: `SELECT name, age, (COUNT(*) - 1) AS dup, TRUE AS dist FROM source GROUP BY name, age`, save as table `dup_count`.
+- **distinct metric**: `SELECT COUNT(*) AS dist_count FROM dup_count WHERE dist`, save as table `distinct_metric`.
+- **source join distinct metric**: `SELECT source.*, dup_count.dup AS dup, dup_count.dist AS dist FROM source LEFT JOIN dup_count ON source.name = dup_count.name AND source.age = dup_count.age`, save as table `dist_joined`.
+- **add row number**: `SELECT *, ROW_NUMBER() OVER (DISTRIBUTE BY name, age SORT BY dist) row_num FROM dist_joined`, save as table `row_numbered`.
+- **duplicate records**: `SELECT name, age, dup FROM row_numbered WHERE NOT dist OR row_num > 1`, save as table `dup_records`.
+- **duplicate metric**: `SELECT name, age, dup, COUNT(*) AS num FROM dup_records GROUP BY name, age, dup`, save as table `dup_metric`.
+
+After the translation, the metrics will be persisted in table `distinct_metric` and `dup_metric`.
+
 ### Timeliness
 For timeliness, is to measure the latency of each item, and get the statistics of the latencies.  
 For example, the dsl rule is `ts, out_ts`, the first column means the input time of item, the second column means the output time of item, if not set, `__tmst` will be the default output time column. After the translation, the sql rule is as below:  

http://git-wip-us.apache.org/repos/asf/incubator-griffin/blob/e9229154/measure/src/main/scala/org/apache/griffin/measure/rule/dsl/expr/SelectExpr.scala
----------------------------------------------------------------------
diff --git a/measure/src/main/scala/org/apache/griffin/measure/rule/dsl/expr/SelectExpr.scala b/measure/src/main/scala/org/apache/griffin/measure/rule/dsl/expr/SelectExpr.scala
index d6e350b..0c3c451 100644
--- a/measure/src/main/scala/org/apache/griffin/measure/rule/dsl/expr/SelectExpr.scala
+++ b/measure/src/main/scala/org/apache/griffin/measure/rule/dsl/expr/SelectExpr.scala
@@ -115,7 +115,7 @@ case class SelectionExpr(head: HeadExpr, selectors: Seq[SelectExpr], aliasOpt: O
     selectors.lastOption match {
       case None => desc
       case Some(sel: FunctionSelectExpr) => desc
-      case _ => s"coalesce(${desc}, 'null')"
+      case _ => s"coalesce(${desc}, '')"
     }
   }
   def alias: Option[String] = {