Posted to dev@flink.apache.org by "Kenyore (Jira)" <ji...@apache.org> on 2022/02/24 09:54:00 UTC

[jira] [Created] (FLINK-26348) Maybe ChangelogNormalize should ignore unused columns when deduplicate

Kenyore created FLINK-26348:
-------------------------------

             Summary: Maybe ChangelogNormalize should ignore unused columns when deduplicate
                 Key: FLINK-26348
                 URL: https://issues.apache.org/jira/browse/FLINK-26348
             Project: Flink
          Issue Type: Improvement
    Affects Versions: 1.13.2
            Reporter: Kenyore


In my case I have the tables below:
 * sku(size:1K+)
 * custom_product(size:10B+)
 * order(size:100M+)

And my SQL is like:
{code:sql}
SELECT o.code,o.created,s.sku_name,p.product_name FROM order o 
    INNER JOIN custom_product p ON o.p_id=p.id
    INNER JOIN sku s ON s.id=p.s_id
{code}

Table sku also has some other columns.
The problem is that when one of these other columns (such as description) changes in any row of table sku, Flink may produce millions of update rows that are useless downstream, because the query only reads column sku_name while the change affected column description.

These useless update rows put unnecessary pressure on the downstream operators.
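Until ChangelogNormalize itself ignores unused columns, one possible workaround sketch is to narrow the sku source to only the columns the query actually needs via a subquery. This is only a sketch: whether it suppresses the redundant updates depends on whether the planner pushes the projection below ChangelogNormalize in the Flink version at hand.

{code:sql}
-- Hypothetical workaround sketch: project sku down to the columns the join
-- actually uses before joining. Updates that touch only unreferenced
-- columns such as description could then be dropped earlier, IF the
-- planner pushes this projection below ChangelogNormalize (not guaranteed
-- on every Flink version).
SELECT o.code, o.created, s.sku_name, p.product_name FROM order o
    INNER JOIN custom_product p ON o.p_id = p.id
    INNER JOIN (SELECT id, sku_name FROM sku) s ON s.id = p.s_id
{code}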

I think it would be a significant improvement for Flink to ignore unused columns when ChangelogNormalize deduplicates. Thanks.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)