Posted to user@flink.apache.org by vtygoss <vt...@126.com> on 2022/05/20 08:02:39 UTC

accuracy validation of streaming pipeline

Hi community!


I'm working on migrating from a full-data pipeline (with Spark) to an incremental-data pipeline (with Flink CDC), and I ran into a problem with accuracy validation between the Flink-based and Spark-based pipelines.


For bounded data, it's simple to validate whether the two result sets are consistent.
But for unbounded data and event-driven applications, how can we make sure the data stream produced is correct, especially when there are retract functions with high impact, e.g. row_number? A small sketch of the kind of query I mean follows below.
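
To make the row_number case concrete, here is a minimal sketch of such a Top-N query (the schema and the datagen source are made up purely for illustration):

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

// Illustration of a retracting operator: a Top-N query over ROW_NUMBER().
// Each time a new row displaces one of the current top 3 per category,
// Flink emits retraction (-U/+U) messages downstream, so the sink's
// content keeps being revised. Schema and source are placeholders.
public class TopNRetraction {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
            EnvironmentSettings.newInstance().inStreamingMode().build());

        tEnv.executeSql(
            "CREATE TABLE products (" +
            "  category STRING," +
            "  name     STRING," +
            "  sales    BIGINT" +
            ") WITH (" +
            "  'connector' = 'datagen'," +
            "  'rows-per-second' = '5'" +
            ")");

        // The changelog printed here contains -U/+U rows: these are the
        // retractions that make comparing against a batch result tricky.
        tEnv.executeSql(
            "SELECT * FROM (" +
            "  SELECT *, ROW_NUMBER() OVER (" +
            "    PARTITION BY category ORDER BY sales DESC) AS rn" +
            "  FROM products" +
            ") WHERE rn <= 3").print();
    }
}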


Is there any document for this problem? Thanks for any suggestions or replies.


Best Regards!

Re: accuracy validation of streaming pipeline

Posted by Leonard Xu <xb...@gmail.com>.
Hi, vtygoss

> I'm working on migrating from a full-data pipeline (with Spark) to an incremental-data pipeline (with Flink CDC), and I ran into a problem with accuracy validation between the Flink-based and Spark-based pipelines.

Glad to hear that!



> For bounded data, it's simple to validate whether the two result sets are consistent.
> But for unbounded data and event-driven applications, how can we make sure the data stream produced is correct, especially when there are retract functions with high impact, e.g. row_number?
> 
> Is there any document for this problem? Thanks for any suggestions or replies.

From my understanding, the validation feature belongs to the data quality scope; it's usually provided by the platform, e.g. a Data Integration Platform. As the underlying pipeline engine/tool, Flink CDC should expose more metrics and data quality checking abilities, but we don't offer them yet; these enhancements are on our roadmap. Currently, you can use the Flink source/sink operators' metrics as a rough validation, and you can also compare the record counts in your source database and sink system multiple times for a more accurate validation.
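
For example, such a count comparison could be as simple as the following sketch over JDBC (the connection URLs, credentials and table name are placeholders, not from any real setup):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Rough count-based validation: sample the row count on both sides
// several times and compare. All connection details and the table
// name below are placeholders; JDBC drivers must be on the classpath.
public class CountValidation {

    private static long count(String url, String user, String pass, String table) throws Exception {
        try (Connection conn = DriverManager.getConnection(url, user, pass);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM " + table)) {
            rs.next();
            return rs.getLong(1);
        }
    }

    public static void main(String[] args) throws Exception {
        for (int i = 0; i < 3; i++) {
            long src = count("jdbc:mysql://source-host:3306/db", "user", "pass", "orders");
            long snk = count("jdbc:postgresql://sink-host:5432/db", "user", "pass", "orders");
            System.out.printf("run %d: source=%d sink=%d diff=%d%n", i, src, snk, src - snk);
            Thread.sleep(60_000); // counts only converge while the source is quiet, so repeat
        }
    }
}

Since the pipeline is unbounded, the two counts only agree while the source is quiet or the sink lag is bounded, which is why sampling several times matters.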

Best,
Leonard


Re: accuracy validation of streaming pipeline

Posted by Shengkai Fang <fs...@gmail.com>.
It's a good question. Let me ping @Leonard to share more thoughts.

Best,
Shengkai

vtygoss <vt...@126.com> wrote on Fri, May 20, 2022 at 16:04:

> Hi community!
>
>
> I'm working on migrating from a full-data pipeline (with Spark) to an
> incremental-data pipeline (with Flink CDC), and I ran into a problem with
> accuracy validation between the Flink-based and Spark-based pipelines.
>
>
> For bounded data, it's simple to validate whether the two result sets are
> consistent.
>
> But for unbounded data and event-driven applications, how can we make sure
> the data stream produced is correct, especially when there are retract
> functions with high impact, e.g. row_number?
>
>
> Is there any document for this problem? Thanks for any suggestions
> or replies.
>
>
> Best Regards!
>

Re: accuracy validation of streaming pipeline

Posted by Shengkai Fang <fs...@gmail.com>.
Hi, all.

From my understanding, validating the accuracy of the sync pipeline requires
snapshotting the source and sink at some points. It is just like having a
checkpoint that contains all the data at some time for both sink and source.
Then we can compare the contents of the two snapshots and find the
differences.

The main problem is how we can snapshot the data in the source/sink, or
provide some meaningful metrics to compare, at those points.
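
One way to approximate such a snapshot comparison is an order-independent fingerprint (count plus a sum of row hashes) computed on both sides. A minimal sketch, assuming both systems speak JDBC and support CRC32/CONCAT_WS (MySQL built-ins; other systems need an equivalent expression); connection details and the column list are placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Compare an order-independent fingerprint (count + sum of row hashes)
// of the same table in source and sink, instead of the full data.
// Connection details and the column list are placeholders.
public class SnapshotChecksum {

    private static final String FINGERPRINT_SQL =
        "SELECT COUNT(*) AS cnt, "
      + "COALESCE(SUM(CRC32(CONCAT_WS('|', id, status, amount))), 0) AS sum_crc "
      + "FROM orders";

    private static long[] fingerprint(String url, String user, String pass) throws Exception {
        try (Connection c = DriverManager.getConnection(url, user, pass);
             Statement s = c.createStatement();
             ResultSet rs = s.executeQuery(FINGERPRINT_SQL)) {
            rs.next();
            return new long[] { rs.getLong("cnt"), rs.getLong("sum_crc") };
        }
    }

    public static void main(String[] args) throws Exception {
        long[] src = fingerprint("jdbc:mysql://source:3306/db", "user", "pass");
        long[] snk = fingerprint("jdbc:mysql://sink:3306/db", "user", "pass");
        System.out.println("count:  source=" + src[0] + " sink=" + snk[0]);
        System.out.println("sumcrc: source=" + src[1] + " sink=" + snk[1]);
        System.out.println(src[0] == snk[0] && src[1] == snk[1]
            ? "snapshots match" : "snapshots differ");
    }
}

The hard part remains the one above: taking both fingerprints at a consistent point while data keeps flowing.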

Best,
Shengkai

Xuyang <xy...@163.com> wrote on Tue, May 24, 2022 at 21:32:

> I think for unbounded data, we can only check the result at one point in
> time; that is the work that Watermark[1] does. What about tagging one point
> in time and validating the data accuracy at that moment?
>
> [1]
> https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/dev/table/sql/create/#watermark
>
> On 2022-05-20 16:02:39, "vtygoss" <vt...@126.com> wrote:
>
> Hi community!
>
>
> I'm working on migrating from a full-data pipeline (with Spark) to an
> incremental-data pipeline (with Flink CDC), and I ran into a problem with
> accuracy validation between the Flink-based and Spark-based pipelines.
>
>
> For bounded data, it's simple to validate whether the two result sets are
> consistent.
>
> But for unbounded data and event-driven applications, how can we make sure
> the data stream produced is correct, especially when there are retract
> functions with high impact, e.g. row_number?
>
>
> Is there any document for this problem? Thanks for any suggestions
> or replies.
>
>
> Best Regards!
>
>

Re: accuracy validation of streaming pipeline

Posted by Xuyang <xy...@163.com>.
I think for unbounded data, we can only check the result at one point in time; that is the work that Watermark[1] does. What about tagging one point in time and validating the data accuracy at that moment? A minimal sketch of the idea follows below.

[1] https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/dev/table/sql/create/#watermark
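
A minimal sketch of that idea in Flink SQL (the table schema and the Kafka connector options below are made up for illustration):

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

// Sketch of point-in-time validation via an event-time watermark:
// once the watermark passes a tagged instant T, all windows ending at
// or before T are final and can be compared with the batch result.
// The schema and the Kafka connector options are placeholders.
public class WatermarkValidation {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
            EnvironmentSettings.newInstance().inStreamingMode().build());

        // WATERMARK marks order_time as event time with 5s allowed lateness.
        tEnv.executeSql(
            "CREATE TABLE orders (" +
            "  order_id   STRING," +
            "  amount     DECIMAL(10, 2)," +
            "  order_time TIMESTAMP(3)," +
            "  WATERMARK FOR order_time AS order_time - INTERVAL '5' SECOND" +
            ") WITH (" +
            "  'connector' = 'kafka'," +
            "  'topic' = 'orders'," +
            "  'properties.bootstrap.servers' = 'localhost:9092'," +
            "  'format' = 'json'" +
            ")");

        // A tumbling window's row is only emitted after the watermark passes
        // the window end, so each emitted row is a final, checkable value.
        tEnv.executeSql(
            "SELECT window_start, window_end, COUNT(*) AS cnt, SUM(amount) AS total " +
            "FROM TABLE(TUMBLE(TABLE orders, DESCRIPTOR(order_time), INTERVAL '1' MINUTES)) " +
            "GROUP BY window_start, window_end").print();
    }
}

Once the watermark passes the tagged instant, every window ending before it is final and can be diffed against the batch result for the same time range.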



On 2022-05-20 16:02:39, "vtygoss" <vt...@126.com> wrote:

Hi community!


I'm working on migrating from a full-data pipeline (with Spark) to an incremental-data pipeline (with Flink CDC), and I ran into a problem with accuracy validation between the Flink-based and Spark-based pipelines.


For bounded data, it's simple to validate whether the two result sets are consistent.

But for unbounded data and event-driven applications, how can we make sure the data stream produced is correct, especially when there are retract functions with high impact, e.g. row_number?


Is there any document for this problem? Thanks for any suggestions or replies.


Best Regards!