Posted to user@flink.apache.org by John Tipper <jo...@hotmail.com> on 2022/06/08 12:06:17 UTC

How to handle deletion of items using PyFlink SQL?

Hi all,

I have some reference data that is periodically emitted by a crawler mechanism into an upstream Kinesis data stream, and those rows are used to populate a sink table (I am using Flink 1.13 PyFlink SQL within AWS Kinesis Data Analytics).  What is the best pattern for handling deletion of upstream data, such that the downstream table remains in sync with the upstream source?

For example, at t=1, rows R1, R2, R3 are processed from the stream, resulting in a DB with 3 rows.  At some point between t=1 and t=2, the resource corresponding to R2 was deleted, such that at t=2, when the next crawl was carried out, only rows R1 and R3 were emitted into the stream.  How should I process the stream of events so that when I have finished processing the events from t=2, my downstream table also has just rows R1 and R3?
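
For concreteness, the shape of the pipeline I mean is something like the following (the stream name, region and schema here are illustrative only; the crawl-run id column is just one way each snapshot might be identified):

    from pyflink.datastream import StreamExecutionEnvironment
    from pyflink.table import StreamTableEnvironment

    env = StreamExecutionEnvironment.get_execution_environment()
    t_env = StreamTableEnvironment.create(env)

    # One row per crawled resource per crawl run.
    t_env.execute_sql("""
        CREATE TABLE crawl_events (
            snapshot_id  STRING,  -- id of the crawl run that emitted the row
            resource_key STRING,  -- unique key of the reference resource
            payload      STRING
        ) WITH (
            'connector' = 'kinesis',
            'stream' = 'crawler-output',
            'aws.region' = 'eu-west-1',
            'scan.stream.initpos' = 'TRIM_HORIZON',
            'format' = 'json'
        )
    """)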

Many thanks,

John

Re: How to handle deletion of items using PyFlink SQL?

Posted by Xuyang <xy...@163.com>.
Hi, John.
What about using a UDTF [1]?
In your UDTF, keep the set (or map) of all resources seen in the previous crawl; call it s1. When t=2 arrives, collect the new crawl's resources as s2. The deleted data you want is then 's1' - 's2'.
So just loop over that difference inside the UDTF's 'eval' function and emit a RowData for each deleted key, with RowKind 'DELETE' [2]. The 'DELETE' kind tells the downstream that this value has been deleted.
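
As a rough illustration (untested; all names are made up), here is a minimal PyFlink sketch of that idea. It assumes each crawled row carries the id of the crawl run that produced it, that one crawl's rows arrive contiguously, and that the function runs with parallelism 1. Also, as far as I know, rows produced by a Python table function are treated as insert-only, so this sketch marks deletions with a string flag for the sink to act on, rather than attaching a real RowKind:

    from pyflink.table import DataTypes
    from pyflink.table.udf import udtf, TableFunction

    class SnapshotDiff(TableFunction):
        # Holds the key set of the last completed crawl (s1) and the keys
        # accumulated so far in the ongoing crawl (s2). This state lives only
        # in the Python instance: it is not checkpointed and is only correct
        # with parallelism 1.
        def open(self, function_context):
            self.current_snapshot_id = None
            self.previous_keys = set()   # s1: keys of the last completed crawl
            self.current_keys = set()    # s2: keys seen in the ongoing crawl

        def eval(self, snapshot_id, resource_key):
            if snapshot_id != self.current_snapshot_id:
                # A new crawl has started, so the previous one is complete:
                # anything in s1 that never re-appeared in s2 was deleted.
                for gone in self.previous_keys - self.current_keys:
                    yield gone, 'DELETE'
                self.previous_keys = self.current_keys
                self.current_keys = set()
                self.current_snapshot_id = snapshot_id
            self.current_keys.add(resource_key)
            yield resource_key, 'UPSERT'

    snapshot_diff = udtf(SnapshotDiff(),
                         result_types=[DataTypes.STRING(), DataTypes.STRING()])
    # t_env is your StreamTableEnvironment from the job setup.
    t_env.create_temporary_function("snapshot_diff", snapshot_diff)

It could then be applied with something like:

    SELECT T.k, T.op
    FROM crawl_events,
         LATERAL TABLE(snapshot_diff(snapshot_id, resource_key)) AS T(k, op)

One inherent caveat: without an explicit end-of-crawl marker, the deletions of crawl N are only discovered when the first row of crawl N+1 arrives, so deletions reach the downstream table one crawl late.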

I hope this helps you.

[1] https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/functions/udfs/#table-functions
[2] https://github.com/apache/flink/blob/44f73c496ed1514ea453615b77bee0486b8998db/flink-core/src/main/java/org/apache/flink/types/RowKind.java#L52

--

    Best!
    Xuyang