Posted to dev@griffin.apache.org by 李立威 <41...@qq.com> on 2019/03/04 13:25:48 UTC

Apache Griffin inquiry

Hello, I recently deployed Griffin (version 0.4.0) on my own machine, and I have a few questions:
(1) The Griffin documentation says that you can define a custom measure and set an alert threshold, but looking through the Griffin source code and its UI, I could not find anywhere to set an alert threshold, nor a service that sends alert emails, nor any parameters for configuring a mail server. Perhaps I overlooked a detail; could you please explain how to configure alert thresholds for data quality monitoring, and how to change the address of the outgoing mail server? Thanks.


(2) I use Profiling to monitor a field of a table, and the resulting metrics are displayed as a table rather than a chart. Can this be configured as a chart?


(3) After reading the source code, my understanding of Griffin's logic is: first create a Measure and a Job; creating the Job registers a scheduled task, which periodically submits a request to Spark (via Livy) according to the configured schedule. When Spark finishes, the data quality results are stored in HDFS and ES; whenever a user queries metrics in the Griffin web UI, the corresponding results are pulled from ES. Is this understanding correct?


(4) Do you have a good solution for task dependencies? For example: I have a Hive table that is updated once a day, and each update creates a new time partition (partition field dt = yesterday's date, e.g. dt=2019-03-03). Only after the table's processing job has finished updating the data should the corresponding Griffin data quality job start and verify that the data in the new partition meets the requirements. If the Griffin job does not depend on the Hive processing job, it will keep running on schedule; as long as the Hive job has not run yet, yesterday's partition will contain no data, so alert emails will keep arriving, producing many unnecessary "false alarms". How do you handle this problem?


Please help answer the questions above when you have time. Thank you.


One of the Apache open-source enthusiasts
2019.03.04

Re: Apache Griffin inquiry

Posted by Nick Sokolov <ch...@gmail.com>.
For 1) and 2), alerting on metrics and profiling graphs: Grafana works
pretty well in my experience.

The problem in 4) can be solved in several ways:
 - "predicates": it is possible to configure a data source with predicate
logic that is checked before the job starts. There are a few gotchas,
however. A predicate only allows "skipping" an execution when it returns
false: for a once-a-day job, if the data is not available, the job simply
won't run that day at all. Another problem is that only the "file.exist"
predicate is supported out of the box, and 0.4.0 does not allow providing
custom ones. But there is an open PR
<https://github.com/apache/griffin/pull/484> adding the ability to write
custom predicates (for example, a predicate checking whether a Hive table
has changed since the last run).
 - trigger the job execution from the upstream job itself, either using
GRIFFIN-229 (yet to be merged) or by running griffin-measure via
spark-submit explicitly.
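As a rough sketch of the gating idea behind the second option (all paths
and names here are hypothetical, not part of Griffin's API): an upstream
scheduler can derive yesterday's dt partition and check for a completion
marker before triggering the measure job, which avoids the false alarms
described in question 4.

```python
from datetime import date, timedelta

def partition_spec(run_day: date) -> str:
    """Return the Hive partition spec for yesterday's data, e.g. dt=2019-03-03."""
    return "dt=" + (run_day - timedelta(days=1)).isoformat()

def done_file_path(warehouse_root: str, table: str, run_day: date) -> str:
    """Path of a hypothetical _SUCCESS marker written by the upstream Hive job."""
    return f"{warehouse_root}/{table}/{partition_spec(run_day)}/_SUCCESS"

def should_trigger(marker_exists: bool) -> bool:
    # Start the Griffin measure job only once the upstream partition is
    # complete; otherwise skip this cycle instead of raising a false alarm.
    return marker_exists
```

The existence check itself would be an `hdfs dfs -test -e` call or an HDFS
client lookup on the returned path; only the date arithmetic is shown here.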

On Thu, Mar 7, 2019 at 5:58 AM William Guo <gu...@apache.org> wrote:


Re: Apache Griffin inquiry

Posted by William Guo <gu...@apache.org>.
hi team,

We have translated the original email into English so the community can
understand it.

=====
Hello, I recently deployed Griffin (version 0.4.0) on my own machine, and I
have a few questions:
(1) The Griffin documentation says that you can define a custom measure and
set an alert threshold, but looking through the Griffin source code and its
UI, I could not find anywhere to set an alert threshold, nor a service that
sends alert emails, nor any parameters for configuring a mail server.
Perhaps I overlooked a detail; could you please explain how to configure
alert thresholds for data quality monitoring, and how to change the address
of the outgoing mail server? Thanks.


(2) I use Profiling to monitor a field of a table, and the resulting
metrics are displayed as a table rather than a chart. Can this be
configured as a chart?


(3) After reading the source code, my understanding of Griffin's logic is:
first create a Measure and a Job; creating the Job registers a scheduled
task, which periodically submits a request to Spark (via Livy) according to
the configured schedule. When Spark finishes, the data quality results are
stored in HDFS and ES; whenever a user queries metrics in the Griffin web
UI, the corresponding results are pulled from ES. Is this understanding
correct?


(4) Do you have a good solution for task dependencies? For example: I have
a Hive table that is updated once a day, and each update creates a new time
partition (partition field dt = yesterday's date, e.g. dt=2019-03-03). Only
after the table's processing job has finished updating the data should the
corresponding Griffin data quality job start and verify that the data in
the new partition meets the requirements. If the Griffin job does not
depend on the Hive processing job, it will keep running on schedule; as
long as the Hive job has not run yet, yesterday's partition will contain no
data, so alert emails will keep arriving, producing many unnecessary "false
alarms". How do you handle this problem?


Please help answer the questions above when you have time. Thank you.


One of the Apache open-source enthusiasts
2019.03.04



On Tue, Mar 5, 2019 at 9:57 AM 大鹏 <18...@163.com> wrote:

Re: Apache Griffin inquiry

Posted by 大鹏 <18...@163.com>.
On the questions I know about (by question number):
(1) Alerting currently needs to be implemented together with ES; ES has
alerting plugins for this.
(2) Custom charts are not supported at the moment; you can only develop the
charts you need yourself.
(3) Your understanding is correct.


Hope this helps.
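Since the metrics land in ES (see questions 1 and 3), an alerting plugin or
a small script can poll them directly. A minimal sketch of building such a
query follows; the `name` and `tmst` field names are assumptions about the
shape of Griffin's metric documents, not a documented schema.

```python
import json

def metrics_query(measure_name: str, size: int = 100) -> str:
    """Build an ES search body for the latest metrics of one measure."""
    body = {
        "query": {"term": {"name": measure_name}},  # match one measure by name
        "sort": [{"tmst": {"order": "desc"}}],      # newest results first
        "size": size,
    }
    return json.dumps(body)
```

The returned JSON would be POSTed to the metric index's `_search` endpoint;
an alerting rule can then compare the newest hits against a threshold.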


On 2019-03-05 07:53, 李立威 <41...@qq.com> wrote: