Posted to dev@doris.apache.org by di wu <67...@qq.com.INVALID> on 2022/05/29 09:43:26 UTC

Re: Flink MySQL CDC to Doris: data imported into Doris multiple times with multi-schema joins

Hi, thanks for your question.
1. I tested this: with Flink CDC importing into a Doris aggregate-model table, data is written and updated correctly even in the join and group-by case.
2. flink-doris-connector 1.0.3 writes via batched Stream Load, controlled by parameters such as sink.batch.size.
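
For reference, the batching behavior mentioned above is tuned through the sink options of the connector's Flink SQL DDL. A minimal sketch (the table name, fenodes address, credentials, and the concrete option values are placeholders, not taken from this thread):

```sql
CREATE TABLE doris_sink (
    tenant_code  STRING,
    mo_num       INT,
    update_time  TIMESTAMP(3),
    update_count INT
) WITH (
    'connector'           = 'doris',
    'fenodes'             = 'fe-host:8030',
    'table.identifier'    = 'db.example_agg',
    'username'            = 'root',
    'password'            = '',
    -- flush a Stream Load batch after this many rows...
    'sink.batch.size'     = '10000',
    -- ...or after this interval, whichever comes first
    'sink.batch.interval' = '10s'
);
```

Larger batch sizes and intervals mean fewer, bigger Stream Load requests to Doris; they do not change how often upstream operators emit updates.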



Thanks && Regards,
di.wu

------------------ Original Message ------------------
From: "dev" <JoshXie90@outlook.com>;
Sent: Friday, May 27, 2022, 10:34 AM
To: "dev@doris.apache.org" <dev@doris.apache.org>;

Subject: Flink MySQL CDC to Doris: data imported into Doris multiple times with multi-schema joins



Good morning, everyone:

Recently, while using MySQL Flink CDC to Doris, I found a problem: when Flink computes joins, group by, and other operators across multiple schemas, the Doris Flink connector imports data into Doris multiple times at irregular intervals, which makes aggregate-model (SUM) computations inaccurate. (Flink 1.13.6, Doris 1.0.0-rc03, matching connector versions.)
As the example below shows, when the Flink job involves joins and group by over multiple schemas (Flink shows roughly 20 operators), observing the data while the job runs reveals that update_time keeps changing and update_count keeps accumulating via SUM; the insert operation was executed 96 times before mo_num in this row was computed correctly.
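
For context, a multi-way join followed by GROUP BY in Flink SQL produces an updating (retract) stream, so each upstream change can cause the affected aggregate row to be re-emitted to the sink. A minimal sketch of that kind of query (all table and column names here are hypothetical, for illustration only):

```sql
-- Hypothetical example: an updating result produced by join + group by.
-- Every change to either source table re-emits the affected aggregate row,
-- and each re-emission becomes another write to the Doris sink.
INSERT INTO doris_sink
SELECT
    o.tenant_code,
    COUNT(DISTINCT o.mo_num) AS mo_num,
    MAX(o.update_time)       AS update_time
FROM orders o
JOIN tenants t ON o.tenant_code = t.tenant_code
GROUP BY o.tenant_code;
```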
Doris 表结构示例:
{
    tenant_code   AGGREGATE KEY,
    mo_num        REPLACE,
    update_time   MAX,
    update_count  SUM DEFAULT 1
}
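
Spelled out as Doris DDL, the schema above would look roughly like this (the column types, bucket count, and distribution clause are assumptions, not taken from the original message):

```sql
-- Sketch of the aggregate-model table; value columns carry an
-- aggregation type, key columns are listed in AGGREGATE KEY.
CREATE TABLE example_agg (
    tenant_code   VARCHAR(32),
    mo_num        INT      REPLACE,
    update_time   DATETIME MAX,
    update_count  INT      SUM DEFAULT "1"
) ENGINE = OLAP
AGGREGATE KEY (tenant_code)
DISTRIBUTED BY HASH (tenant_code) BUCKETS 8;
```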
Data example:

code    | mo_num | ... (other business columns) | update_time         | update_count
000032  | 123    | other business data          | 2022-05-19 19:00:28 | 96


Questions:

1. Is this behavior normal? The expectation is that a row is inserted once after its computation finishes, rather than inserted multiple times; as it stands, can aggregation models such as SUM not be used?

2. If this scenario is unavoidable, how can it be worked around? Could the developers explain it from the perspective of the import mechanism?