Posted to commits@kylin.apache.org by ni...@apache.org on 2019/08/01 13:50:20 UTC

[kylin] 01/02: Add mr-hive dict & health check cli part

This is an automated email from the ASF dual-hosted git repository.

nic pushed a commit to branch document
in repository https://gitbox.apache.org/repos/asf/kylin.git

commit cb957a59487aa62cc2e8a9e3af15eaf883f6ed7c
Author: XiaoxiangYu <hi...@126.com>
AuthorDate: Tue Jul 23 19:42:55 2019 +0800

    Add mr-hive dict & health check cli part
---
 website/_data/docs30.yml                           |   2 +
 .../_docs30/howto/howto_use_health_check_cli.md    | 118 +++++++++++++++++++++
 website/_docs30/howto/howto_use_mr_hive_dict.md    |  31 ++++++
 website/_docs30/tutorial/real_time_olap.md         |   6 +-
 4 files changed, 153 insertions(+), 4 deletions(-)

diff --git a/website/_data/docs30.yml b/website/_data/docs30.yml
index aaeefb6..4f9e47b 100644
--- a/website/_data/docs30.yml
+++ b/website/_data/docs30.yml
@@ -83,3 +83,5 @@
   - howto/howto_update_coprocessor
   - howto/howto_install_ranger_kylin_plugin
   - howto/howto_enable_zookeeper_acl
+  - howto/howto_use_health_check_cli
+  - howto/howto_use_hive_mr_dict
diff --git a/website/_docs30/howto/howto_use_health_check_cli.md b/website/_docs30/howto/howto_use_health_check_cli.md
new file mode 100644
index 0000000..c40a32f
--- /dev/null
+++ b/website/_docs30/howto/howto_use_health_check_cli.md
@@ -0,0 +1,118 @@
+---
+layout: docs30
+title:  Kylin Health Check (NEW)
+categories: howto
+permalink: /docs30/howto/howto_use_health_check_cli.html
+---
+
+## Get started
+In Kylin 3.0, we add a health check job which helps to detect whether your Kylin instance is in a good state. It reduces the manual work of Kylin administrators: if you have hundreds of cubes and thousands of building jobs every day, this feature helps you quickly find failed jobs, segments whose files or HBase tables are lost, and cubes with too high an expansion rate.
+
+Enable this feature by adding the following to *kylin.properties* (the example below uses 126.com as the mail provider):
+{% highlight Groff markup %}
+kylin.job.notification-enabled=true
+kylin.job.notification-mail-enable-starttls=true
+kylin.job.notification-mail-host=smtp.126.com
+kylin.job.notification-mail-username=hahaha@126.com
+kylin.job.notification-mail-password=hahaha
+kylin.job.notification-mail-sender=hahaha@126.com
+kylin.job.notification-admin-emails=hahaha@kyligence.io,hahaha@126.com
+{% endhighlight %} 
+After starting the Kylin process, execute the following command and check that the report email is received. In a production environment it should be scheduled by crontab or a similar tool; a sample crontab entry follows below.
+{% highlight Groff markup %}
+sh bin/kylin.sh org.apache.kylin.tool.KylinHealthCheckJob
+{% endhighlight %} 
+You will then receive the report email in your mailbox.
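+
+As a sketch, a crontab entry running the check every morning at 06:00 could look like this (the installation path /usr/local/kylin and the schedule are assumptions; adjust them to your environment):
+{% highlight Groff markup %}
+# Run the Kylin health check daily at 06:00 and append the output to a log file
+0 6 * * * cd /usr/local/kylin && sh bin/kylin.sh org.apache.kylin.tool.KylinHealthCheckJob >> logs/health-check.log 2>&1
+{% endhighlight %}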
+
+## Details of each health check step
+
+### Checking metadata
+This step records the path of every entry which the Kylin process failed to load from the metadata store (ResourceStore). Such failures may be a signal of an unhealthy metadata store.
+
+Any errors found will be reported via email as follows.
+{% highlight Groff markup %}
+Error loading CubeDesc at ${PATH} ...
+Error loading DataModelDesc at ${PATH} ...
+{% endhighlight %}
+
+### Checking missing HDFS paths of segments
+This step visits every segment and checks whether its files exist in HDFS.
+
+Any errors found will be reported via email as follows.
+{% highlight Groff markup %}
+Project: ${PROJECT} cube: ${CUBE} segment: ${SEGMENT} cube id data: ${SEGMENT_PATH} don't exist and need to rebuild it
+{% endhighlight %}
+
+### Checking HBase Tables of segments
+This step checks whether the HTable belonging to each segment exists and is in the *Enabled* state; you may need to rebuild or re-enable the affected segments if any are found.
+
+Any errors found will be reported via email as follows.
+{% highlight Groff markup %}
+HBase table: {TABLE_NAME} not exist for segment: {SEGMENT}, project: {PROJECT}
+{% endhighlight %}
+
+### Checking holes of Cubes
+This step checks each cube for segment holes; the missing segments need to be rebuilt if any are found.
+
+Any errors found will be reported via email as follows.
+{% highlight Groff markup %}
+{COUNT_HOLE} holes in cube: {CUBE_NAME}, project: {PROJECT_NAME}
+{% endhighlight %}
+
+### Checking too many segments of Cubes
+This step checks for cubes which have too many segments; those segments need to be merged.
+
+Any errors found will be reported via email as follows.
+{% highlight Groff markup %}
+Too many segments: {COUNT_OF_SEGMENT} for cube: {CUBE_NAME}, project: {PROJECT_NAME}, please merge the segments
+{% endhighlight %}
+
+The threshold is decided by `kylin.tool.health-check.warning-segment-num`; the default value is `-1`, which means this check is skipped.
+
+### Checking out-of-date Cubes
+This step finds cubes which have not been built for a long time; maybe you don't really need them anymore.
+
+Any errors found will be reported via email as follows.
+{% highlight Groff markup %}
+Ready Cube: {CUBE_NAME} in project: {PROJECT_NAME} is not built more then {DAYS} days, maybe it can be disabled
+Disabled Cube: {CUBE_NAME} in project: {PROJECT_NAME} is not built more then {DAYS} days, maybe it can be deleted
+{% endhighlight %}
+
+The threshold is decided by `kylin.tool.health-check.stale-cube-threshold-days`; the default value is `100`.
+
+### Checking data expansion rate
+This step checks for cubes with a too-high expansion rate; you may consider optimizing them if any are found.
+
+Any errors found will be reported via stdout as follows.
+{% highlight Groff markup %}
+Cube: {CUBE_NAME} in project: {PROJECT_NAME} with too large expansion rate: {RATE}, cube data size: {SIZE}G
+{% endhighlight %}
+
+The expansion rate warning threshold is decided by `kylin.tool.health-check.warning-cube-expansion-rate`.
+The cube-size warning threshold is decided by `kylin.tool.health-check.expansion-check.min-cube-size-gb`.
+
+### Checking cube configuration
+
+This step checks whether each cube has the auto-merge and retention configuration set.
+
+Any errors found will be reported via stdout as follows.
+{% highlight Groff markup %}
+Cube: {CUBE_NAME} in project: {PROJECT_NAME} with no auto merge params
+Cube: {CUBE_NAME} in project: {PROJECT_NAME} with no retention params
+{% endhighlight %} 
+
+### Cleaning up stopped jobs
+
+Stopped and error jobs which have not been repaired in time will trigger an alarm.
+
+{% highlight Groff markup %}
+Should discard job: {}, which in ERROR/STOPPED state for {} days
+{% endhighlight %} 
+
+The duration is set by `kylin.tool.health-check.stale-job-threshold-days`; the default is `30`.
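+
+For reference, the thresholds described above can be tuned together in *kylin.properties*; the values in this sketch are illustrative, not recommendations:
+{% highlight Groff markup %}
+# Warn when a cube has more than 20 segments (-1 skips the check)
+kylin.tool.health-check.warning-segment-num=20
+# Warn when a cube has not been built for more than 100 days
+kylin.tool.health-check.stale-cube-threshold-days=100
+# Warn when the expansion rate exceeds this value for cubes above the size floor
+kylin.tool.health-check.warning-cube-expansion-rate=5
+kylin.tool.health-check.expansion-check.min-cube-size-gb=100
+# Alarm about ERROR/STOPPED jobs older than 30 days
+kylin.tool.health-check.stale-job-threshold-days=30
+{% endhighlight %}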
+
+
+----
+
+For the details of the health check, please read the code of *org.apache.kylin.rest.job.KylinHealthCheckJob* in the GitHub repo.
+If you have more suggestions or want to add more check rules, please submit a PR to the master branch.
diff --git a/website/_docs30/howto/howto_use_mr_hive_dict.md b/website/_docs30/howto/howto_use_mr_hive_dict.md
new file mode 100644
index 0000000..1091428
--- /dev/null
+++ b/website/_docs30/howto/howto_use_mr_hive_dict.md
@@ -0,0 +1,31 @@
+---
+layout: docs30
+title:  Use Hive to build global dictionary
+categories: howto
+permalink: /docs30/howto/howto_use_hive_mr_dict.html
+---
+
+### Global Dictionary 
+The count distinct measure is very important for many scenarios, such as PageView statistics; Kylin has supported count distinct since 1.5.3 (http://kylin.apache.org/blog/2016/08/01/count-distinct-in-kylin/).
+Apache Kylin implements precise count distinct based on bitmaps, and uses a global dictionary to encode string values into integers.
+Currently the global dictionary has to be built in a single process/JVM, which may take a lot of time and memory for ultra-high-cardinality (UHC) columns. With this feature (KYLIN-3841), we use Hive, a distributed SQL engine, to build the global dictionary.
+
+This helps to:
+1. Reduce memory pressure on the Kylin process; MapReduce is used to build the dictionary instead
+2. Make the global dictionary reusable
+3. Make the global dictionary readable; you may use it outside Kylin, which can be useful in many scenarios (see the query sketch after the step list below)
+
+When enabled, this feature adds three steps to the build job:
+1. Global Dict Mr/Hive extract dict_val from Data
+2. Global Dict Mr/Hive build dict_val
+3. Global Dict Mr/Hive replace dict_val to Data
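+
+Benefit 3 above means the finished dictionary is a plain Hive table that can be queried directly. A hypothetical lookup might look like the following; the table name follows the database/suffix settings described in the Configuration section, and the dict_key/dict_val column names are assumptions based on the step names above, so check your actual table schema:
+{% highlight Groff markup %}
+-- Hypothetical: find the integer code assigned to one raw value
+SELECT dict_key, dict_val
+FROM kylin_dict_db.USER_ACTION_global_dict
+WHERE dict_key = 'some_user_id';
+{% endhighlight %}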
+
+### How to use
+Suppose you have a count distinct (bitmap) measure on UHC columns; say the column names are PV_ID and USER_ID, and the table name is USER_ACTION. You may then add the cube-level configuration `kylin.dictionary.mr-hive.columns=USER_ACTION_PV_ID,USER_ACTION_USER_ID` to enable this feature.
+Be aware that the values in the flat Hive table will be replaced with encoded integers, which may cause some queries to fail.
+
+### Configuration
+- `kylin.dictionary.mr-hive.columns` specifies which columns should use the Hive-MR dict.
+- `kylin.dictionary.mr-hive.database` specifies which database the Hive-MR dict table is located in.
+- `kylin.hive.union.style` Sometimes the SQL used to build the global dict table may have syntax problems; this can be fixed by setting this entry to *UNION ALL*.
+- `kylin.dictionary.mr-hive.table.suffix` specifies the suffix of the global dict table (a combined sketch follows this list).
\ No newline at end of file
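+
+As a sketch, these entries can be combined in a cube-level override; all values below are illustrative, not defaults:
+{% highlight Groff markup %}
+# Columns to encode with the Hive-MR global dictionary
+kylin.dictionary.mr-hive.columns=USER_ACTION_PV_ID,USER_ACTION_USER_ID
+# Hive database holding the global dict table (hypothetical name)
+kylin.dictionary.mr-hive.database=kylin_dict_db
+# Suffix of the global dict table
+kylin.dictionary.mr-hive.table.suffix=_global_dict
+# Switch the generated SQL to UNION ALL if you hit syntax problems
+kylin.hive.union.style=UNION ALL
+{% endhighlight %}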
diff --git a/website/_docs30/tutorial/real_time_olap.md b/website/_docs30/tutorial/real_time_olap.md
index 90faec9..cd9de8e 100644
--- a/website/_docs30/tutorial/real_time_olap.md
+++ b/website/_docs30/tutorial/real_time_olap.md
@@ -32,10 +32,8 @@ The detail can be found at [Deep Dive into Real-time OLAP](http://kylin.apache.o
 - MapReduce [**distributed computation**]
 - HDFS [**distributed storage**]
 
-
 ![image](/images/RealtimeOlap/realtime-olap-architecture.png)
 
-
 ### Streaming Coordinator
 Streaming coordinator works as the master node of the streaming receiver cluster. Its main responsibilities include assigning/unassigning specific topic partitions to specific replica sets, pausing or resuming consuming behavior, and collecting metrics such as the consume rate (messages per second).
 When `kylin.server.mode` is set to `all` or `stream_coordinator`, that process is a streaming coordinator (candidate). The coordinator only manages metadata and won't process incoming messages. 
@@ -62,7 +60,7 @@ A replica set is a group of streaming receivers. Replica set is the minimum unit
 ## Prepare environment
 
 ### Install Kafka 
-Don’t use HDP 2.2.4’s build-in Kafka as it is too old, stop it first if it is running. Please download Kafka 1.0 binary package from Kafka project page, and then uncompress it under a folder like /usr/local/.
+Don’t use HDP’s built-in Kafka as it is too old; stop it first if it is running. Please download the Kafka 1.0 binary package from the Kafka project page, then uncompress it under a folder like /usr/local/.
 
 {% highlight Groff markup %}
 tar -zxvf kafka_2.12-1.0.2.tgz
@@ -101,7 +99,7 @@ bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 -
 Created topic "kylin_streaming_topic".
 {% endhighlight %}
 
-Put sample data to this topic, you can write a python script to do that. Please do not send multi-level json such as `{"name":"Kitty", "location": {"state":"NY", "country":"US"}}` because Receiver cannot parse it currently.
+Put sample data into this topic; you can write a Python script to do that.
 
 {% highlight Groff markup %}
 python user_action.py --max-uid 2000 --max-vid 2000 --msg-sec 100 --enable-hour-power false | bin/kafka-console-producer.sh --broker-list localhost:9092 --topic kylin_streaming_topic