Posted to user@flink.apache.org by aj <aj...@gmail.com> on 2020/06/06 06:23:23 UTC

Data Quality Library in Flink

Hello All,

I want to do some data quality analysis on streaming data, for example:

1. Fill rate of a particular column
2. How many events go to the error queue because schema validation
failed?
3. Different statistical measures of a column
4. Alert if a particular threshold is breached (for example, if the fill
rate is less than 90% for a column)

Is there any library on top of Flink for data quality? From what I have
found, there is such a library on top of Spark:
https://github.com/awslabs/deequ

It provides everything I am looking for.

-- 
Thanks & Regards,
Anuj Jain




Re: Data Quality Library in Flink

Posted by aj <aj...@gmail.com>.
Thanks, Andrey, I will check it out.

On Mon, Jun 8, 2020 at 8:10 PM Andrey Zagrebin <az...@apache.org> wrote:

> Hi Anuj,
>
> I am not familiar in depth with data quality measurement methods or with
> deequ <https://github.com/awslabs/deequ>.
> What you describe looks like monitoring some data metrics.
> Maybe other community users are aware of a better solution.
> Meanwhile, I would recommend implementing the checks and failures as
> separate operators and side outputs (for streaming) [1], if you have not
> done so yet.
> Then you could also use Flink metrics to aggregate and monitor the data
> [2].
> Metrics systems usually allow you to define alerts on metrics, as in
> Prometheus [3], [4].
>
> Best,
> Andrey
>
> [1]
> https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/stream/side_output.html
> [2]
> https://ci.apache.org/projects/flink/flink-docs-release-1.10/monitoring/metrics.html
> [3]
> https://ci.apache.org/projects/flink/flink-docs-release-1.10/monitoring/metrics.html#prometheus-orgapacheflinkmetricsprometheusprometheusreporter
> [4] https://prometheus.io/docs/alerting/overview/


Re: Data Quality Library in Flink

Posted by Andrey Zagrebin <az...@apache.org>.
Hi Anuj,

I am not familiar in depth with data quality measurement methods or with
deequ <https://github.com/awslabs/deequ>.
What you describe looks like monitoring some data metrics.
Maybe other community users are aware of a better solution.
Meanwhile, I would recommend implementing the checks and failures as
separate operators and side outputs (for streaming) [1], if you have not
done so yet.
Then you could also use Flink metrics to aggregate and monitor the data [2].
Metrics systems usually allow you to define alerts on metrics, as in
Prometheus [3], [4].

Best,
Andrey

[1]
https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/stream/side_output.html
[2]
https://ci.apache.org/projects/flink/flink-docs-release-1.10/monitoring/metrics.html
[3]
https://ci.apache.org/projects/flink/flink-docs-release-1.10/monitoring/metrics.html#prometheus-orgapacheflinkmetricsprometheusprometheusreporter
[4] https://prometheus.io/docs/alerting/overview/
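
The check-and-alert logic discussed in this thread (fill rate of a column,
alert when it drops below a threshold such as 90%) can be sketched in plain
Java, independent of any Flink API. This is only an illustration under
stated assumptions: the class and method names (FillRateMonitor, observe,
thresholdBreached) are hypothetical and belong to neither Flink nor deequ,
and null or blank values are assumed to count as "not filled". In an actual
job, this logic would presumably live inside an operator, the rejected
records would go to a side output [1] instead of a list, and the counts
would be exposed through Flink metrics [2] so that the alert itself can be
defined in the metrics system [3], [4].

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: tracks the fill rate of one column and flags threshold breaches. */
final class FillRateMonitor {
    private final double threshold; // e.g. 0.9 means "at least 90% filled"
    private long total;
    private long filled;
    // Stand-in for a side output: ids of records whose column was not filled.
    private final List<String> rejected = new ArrayList<>();

    FillRateMonitor(double threshold) {
        this.threshold = threshold;
    }

    /** Counts one record; null or blank values are treated as "not filled" (an assumption). */
    void observe(String recordId, String columnValue) {
        total++;
        if (columnValue != null && !columnValue.trim().isEmpty()) {
            filled++;
        } else {
            rejected.add(recordId);
        }
    }

    /** Fraction of observed records with a filled value; 1.0 before any record is seen. */
    double fillRate() {
        return total == 0 ? 1.0 : (double) filled / total;
    }

    /** True when the current fill rate is below the configured threshold. */
    boolean thresholdBreached() {
        return fillRate() < threshold;
    }

    List<String> rejected() {
        return rejected;
    }
}

final class FillRateDemo {
    public static void main(String[] args) {
        FillRateMonitor monitor = new FillRateMonitor(0.9);
        monitor.observe("e1", "alice");
        monitor.observe("e2", null);   // missing value, counted as not filled
        monitor.observe("e3", "bob");
        monitor.observe("e4", "carol");

        System.out.println("fill rate = " + monitor.fillRate());
        System.out.println("alert     = " + monitor.thresholdBreached());
        System.out.println("rejected  = " + monitor.rejected());
    }
}
```

Keeping the counting separate from the alert decision mirrors the
suggestion above: the operator only produces counts, and the threshold can
then be changed in the alerting system without redeploying the job.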
