Posted to dev@griffin.apache.org by Jeff Zemerick <jz...@apache.org> on 2019/02/06 13:41:02 UTC

publish data to new Kafka topic

Hi Griffin devs,

Continuing my email thread from users@ and to better clarify it, I have a
Kafka topic with JSON data on it. I would like to perform quality checks on
this data, and I would like for data that meets the quality checks to be
published to a separate Kafka topic, while data that fails one or more
quality checks is left on the original Kafka topic. Is something like this
possible with Griffin? Please let me know if my use-case is not clear.

Thanks,
Jeff

Re: publish data to new Kafka topic

Posted by Jeff Zemerick <jz...@apache.org>.
Thanks! I will take a look at that.

Jeff


RE: publish data to new Kafka topic

Posted by "Lionel, Liu" <bh...@163.com>.
Correct, Griffin calculates data quality metrics, but sometimes people also care about the “failed” data. There is a way to sink that data into HDFS, like the mismatched data output in accuracy: https://github.com/apache/griffin/blob/master/measure/src/main/scala/org/apache/griffin/measure/step/builder/dsl/transform/AccuracyExpr2DQSteps.scala#L87
Functionally, Griffin supports a “spark-sql” rule to output the filtered data you want; just configure the rule like this:
{
  "dsl.type": "spark-sql",
  "out.dataframe.name": "failed",
  "rule": "select * from source where age > 100",
  "out": [
    {
      "type": "record"
    }
  ]
}
The failed data would be written into the configured output HDFS directory in env.json.
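As an aside, the filtering behavior of that rule is plain SQL. The sketch below uses Python's built-in sqlite3 as a stand-in for Spark SQL, only to illustrate which records the rule above would capture as "failed" (the table name `source`, the `age` column, and the sample rows mirror the rule and are otherwise made up):

```python
import sqlite3

# In-memory table standing in for the "source" dataframe.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO source VALUES (?, ?)",
    [("alice", 34), ("bob", 130), ("carol", 28), ("dave", 101)],
)

# Same predicate as the "rule" above: rows with age > 100
# are the ones collected into the "failed" dataframe.
failed = conn.execute("SELECT * FROM source WHERE age > 100").fetchall()
print(failed)  # [('bob', 130), ('dave', 101)]
```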
Furthermore, if you want the failed data output to a Kafka topic, that is a new sink type Griffin does not currently support; you would need to implement a new sink type to sink the records:
https://github.com/apache/griffin/tree/master/measure/src/main/scala/org/apache/griffin/measure/sink
https://github.com/apache/griffin/blob/master/measure/src/main/scala/org/apache/griffin/measure/configuration/enums/SinkType.scala


Thanks
Lionel, Liu




Re: publish data to new Kafka topic

Posted by Jeff Zemerick <jz...@apache.org>.
Yes, a "data filter" describes it well. I think what would work is a Boolean property on a rule indicating that, when the rule fails, the offending data should be filtered out (by redirecting it to a separate Kafka topic). Since Griffin is focused on data quality measurement, that type of functionality might be out of scope for Griffin.
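For what it's worth, the routing described above could be sketched roughly as follows, in plain Python with the Kafka produce/consume calls left out. The quality check shown is only a placeholder predicate, and all names here are hypothetical, not part of Griffin:

```python
import json

def passes_quality_checks(record: dict) -> bool:
    # Placeholder rule: require a numeric "age" field no greater than 100.
    age = record.get("age")
    return isinstance(age, (int, float)) and age <= 100

def route(messages: list) -> tuple:
    """Partition raw JSON messages into (passed, failed).

    In a real pipeline, `passed` would be produced to the clean
    Kafka topic, while `failed` would stay behind on the original
    topic (or go to a dead-letter topic).
    """
    passed, failed = [], []
    for raw in messages:
        record = json.loads(raw)
        (passed if passes_quality_checks(record) else failed).append(record)
    return passed, failed

passed, failed = route(['{"age": 30}', '{"age": 130}', '{"name": "x"}'])
print(len(passed), len(failed))  # 1 2
```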

Thanks,
Jeff



RE: publish data to new Kafka topic

Posted by "Lionel, Liu" <bh...@163.com>.
Hi Jeff,

Seems like you’re looking for a data filter. Originally, Griffin calculates data quality such as accuracy and profiling, and the output of Griffin would be the data quality metrics.
In your case, what kind of data quality do you want to check? How do you define the success or failure of your data?

Thanks
Lionel, Liu
