Posted to commits@griffin.apache.org by gu...@apache.org on 2018/09/18 07:56:22 UTC
[1/2] incubator-griffin-site git commit: profiling
Repository: incubator-griffin-site
Updated Branches:
refs/heads/master 3891ad018 -> 077729686
profiling
Project: http://git-wip-us.apache.org/repos/asf/incubator-griffin-site/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-griffin-site/commit/ce45b1dd
Tree: http://git-wip-us.apache.org/repos/asf/incubator-griffin-site/tree/ce45b1dd
Diff: http://git-wip-us.apache.org/repos/asf/incubator-griffin-site/diff/ce45b1dd
Branch: refs/heads/master
Commit: ce45b1dd3fbf6f9cf18c5995ab5d6cfcaa37827e
Parents: 78070b8
Author: Lionel Liu <bh...@163.com>
Authored: Tue Sep 18 15:33:57 2018 +0800
Committer: Lionel Liu <bh...@163.com>
Committed: Tue Sep 18 15:33:57 2018 +0800
----------------------------------------------------------------------
profiling.md | 165 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 165 insertions(+)
----------------------------------------------------------------------
http://git-wip-us.apache.org/repos/asf/incubator-griffin-site/blob/ce45b1dd/profiling.md
----------------------------------------------------------------------
diff --git a/profiling.md b/profiling.md
index e7a473e..8422a17 100644
--- a/profiling.md
+++ b/profiling.md
@@ -3,3 +3,168 @@ layout: doc
title: "Profiling Use Case"
permalink: /docs/profiling.html
---
+## User Story
+Say we have one data set (demo_src), partitioned by hour, and we want to know what the data looks like for each hour.
+
+For simplicity, suppose the data set has the following schema:
+```
+id bigint
+age int
+desc string
+dt string
+hour string
+```
+Both dt and hour are partition columns.
+
+Each day has one daily partition dt (like 20180912).
+
+Within each day there are 24 hourly partitions (like 00, 01, 02, ..., 23).
+
+## Environment Preparation
+You need to prepare the environment for the Apache Griffin measure module, including the following software:
+- JDK (1.8+)
+- Hadoop (2.6.0+)
+- Spark (2.2.1+)
+- Hive (2.2.0)
+
+## Build Griffin Measure Module
+1. Download the Griffin source package [here](https://www.apache.org/dist/incubator/griffin/0.3.0-incubating).
+2. Unzip the source package.
+ ```
+ unzip griffin-0.3.0-incubating-source-release.zip
+ cd griffin-0.3.0-incubating-source-release
+ ```
+3. Build Griffin jars.
+ ```
+ mvn clean install
+ ```
+
+ Move the built griffin measure jar to your work path.
+
+ ```
+ mv measure/target/measure-0.3.0-incubating.jar <work path>/griffin-measure.jar
+ ```
+
+## Data Preparation
+
+For our quick start, we will create a Hive table demo_src.
+```
+--create hive tables here. hql script
+--Note: replace hdfs location with your own path
+CREATE EXTERNAL TABLE `demo_src`(
+ `id` bigint,
+ `age` int,
+ `desc` string)
+PARTITIONED BY (
+ `dt` string,
+ `hour` string)
+ROW FORMAT DELIMITED
+ FIELDS TERMINATED BY '|'
+LOCATION
+ 'hdfs:///griffin/data/batch/demo_src';
+```
+The data could look like this:
+```
+1|18|student
+2|23|engineer
+3|42|cook
+...
+```
+You can download [demo data](/data/batch) and execute `./gen_demo_data.sh` to get the data source file.
+Then load the data into the Hive table for every hour.
+```
+LOAD DATA LOCAL INPATH 'demo_src' INTO TABLE demo_src PARTITION (dt='20180912',hour='09');
+```
+Or you can just execute `./gen-hive-data.sh` in the downloaded directory above, to generate and load data into the tables hourly.
+
+## Define data quality measure
+
+#### Griffin env configuration
+The environment config file: env.json
+```
+{
+ "spark": {
+ "log.level": "WARN"
+ },
+ "sinks": [
+ {
+ "type": "console"
+ },
+ {
+ "type": "hdfs",
+ "config": {
+ "path": "hdfs:///griffin/persist"
+ }
+ },
+ {
+ "type": "elasticsearch",
+ "config": {
+ "method": "post",
+ "api": "http://es:9200/griffin/accuracy"
+ }
+ }
+ ]
+}
+```
+
+#### Define griffin data quality
+The DQ config file: dq.json
+
+```
+{
+ "name": "batch_prof",
+ "process.type": "batch",
+ "data.sources": [
+ {
+ "name": "src",
+ "baseline": true,
+ "connectors": [
+ {
+ "type": "hive",
+ "version": "1.2",
+ "config": {
+ "database": "default",
+ "table.name": "demo_src"
+ }
+ }
+ ]
+ }
+ ],
+ "evaluate.rule": {
+ "rules": [
+ {
+ "dsl.type": "griffin-dsl",
+ "dq.type": "profiling",
+ "out.dataframe.name": "prof",
+ "rule": "src.id.count() AS id_count, src.age.max() AS age_max, src.desc.length().max() AS desc_length_max",
+ "out": [
+ {
+ "type": "metric",
+ "name": "prof"
+ }
+ ]
+ }
+ ]
+ },
+ "sinks": ["CONSOLE", "HDFS"]
+}
+```
+
+## Measure data quality
+Submit the measure job to Spark, with config file paths as parameters.
+
+```
+spark-submit --class org.apache.griffin.measure.Application --master yarn --deploy-mode client --queue default \
+--driver-memory 1g --executor-memory 1g --num-executors 2 \
+<path>/griffin-measure.jar \
+<path>/env.json <path>/dq.json
+```
+
+## Report data quality metrics
+The calculation log is printed to the console while the job runs, and the result metrics are printed after the job finishes. The metrics are also saved in HDFS: `hdfs:///griffin/persist/<job name>/<timestamp>/_METRICS`.
+
+## Refine Data Quality report
+Depending on your business needs, you might need to refine your data quality measure further until you are satisfied.
+
+## More Details
+For more details about Griffin measures, you can visit our documents on [GitHub](https://github.com/apache/incubator-griffin/tree/master/griffin-doc).
\ No newline at end of file
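
The profiling rule in dq.json above asks for three metrics: `id_count`, `age_max` and `desc_length_max`. As a rough illustration of what those metrics mean, here is a minimal plain-Python sketch that computes them over the three sample rows shown in the Data Preparation section. It is not Griffin code and does not use Spark or Hive; the helper names `parse_rows` and `profile` are hypothetical, and it assumes every row has a non-null id, so the count is simply the number of rows.

```python
# Illustration only: plain-Python equivalents of the three metrics named
# in the profiling rule. This is not how Griffin computes them internally.

# Sample rows copied from the Data Preparation section ('|'-delimited).
SAMPLE_ROWS = """\
1|18|student
2|23|engineer
3|42|cook
"""


def parse_rows(text):
    """Split '|'-delimited lines into (id, age, desc) tuples."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        id_, age, desc = line.split("|", 2)
        rows.append((int(id_), int(age), desc))
    return rows


def profile(rows):
    """Return the count of ids, the max age and the max desc length."""
    return {
        # Assumption: no null ids, so counting rows equals counting ids.
        "id_count": len(rows),
        "age_max": max((age for _, age, _ in rows), default=None),
        "desc_length_max": max((len(desc) for _, _, desc in rows), default=None),
    }


metrics = profile(parse_rows(SAMPLE_ROWS))
print(metrics)
```

Running the same calculation by hand on one hourly partition's data file gives a quick sanity check against the metrics the measure job prints to the console.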
[2/2] incubator-griffin-site git commit: Merge branch 'profiling' of
https://github.com/bhlx3lyx7/incubator-griffin-site
Posted by gu...@apache.org.
Merge branch 'profiling' of https://github.com/bhlx3lyx7/incubator-griffin-site
Project: http://git-wip-us.apache.org/repos/asf/incubator-griffin-site/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-griffin-site/commit/07772968
Tree: http://git-wip-us.apache.org/repos/asf/incubator-griffin-site/tree/07772968
Diff: http://git-wip-us.apache.org/repos/asf/incubator-griffin-site/diff/07772968
Branch: refs/heads/master
Commit: 0777296868773f3456019df24829827a90b46fde
Parents: 3891ad0 ce45b1d
Author: William Guo <gu...@apache.org>
Authored: Tue Sep 18 15:54:30 2018 +0800
Committer: William Guo <gu...@apache.org>
Committed: Tue Sep 18 15:54:30 2018 +0800
----------------------------------------------------------------------
profiling.md | 165 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 165 insertions(+)
----------------------------------------------------------------------