Posted to issues@flink.apache.org by "Jark Wu (Jira)" <ji...@apache.org> on 2020/04/18 14:12:00 UTC
[jira] [Closed] (FLINK-17228) Streaming sql with nested GROUP BY got wrong results
[ https://issues.apache.org/jira/browse/FLINK-17228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jark Wu closed FLINK-17228.
---------------------------
Resolution: Not A Bug
This question was also asked on the mailing list. I will copy the answer here:
This behavior is expected; it is not a bug. Benchao gave a good explanation of the reason, and I will add some further detail.
In Flink SQL, an update operation (such as uv changing from 100 to 101) is split into two separate messages: one is -[key, 100], the other is +[key, 101].
Once these two messages arrive at the downstream aggregation, it also emits two result messages (assuming the previous SUM(uv) is 500):
first [key, 400], then [key, 501]. The first of these is a transient intermediate value, which is the smaller result the sink sees.
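The message flow described above can be sketched as a small simulation. This is only an illustrative model in plain Python, not Flink code: the function name `downstream_sum` and the `(is_accumulate, key, value)` tuple layout are assumptions made for this example.

```python
from typing import Iterable, List, Tuple

# A changelog message: (is_accumulate, key, value).
# False models the "-" message, True models the "+" message.
Message = Tuple[bool, str, int]


def downstream_sum(initial: int, messages: Iterable[Message]) -> List[Tuple[str, int]]:
    """Apply each message to a running SUM and emit a result after every message."""
    total = initial
    emitted = []
    for is_accumulate, key, value in messages:
        total += value if is_accumulate else -value
        emitted.append((key, total))
    return emitted


# The update uv: 100 -> 101 arrives as two separate messages.
update = [(False, "key", 100), (True, "key", 101)]
results = downstream_sum(500, update)
print(results)  # [('key', 400), ('key', 501)]
```

Because a result is emitted after every input message, the intermediate 400 reaches the sink before the final 501.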
However, this problem has been largely addressed since 1.9, provided you enable the mini-batch optimization [1], because mini-batch tries its best to
accumulate the separate + and - messages within a single mini-batch. You can upgrade and give it a try.
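To illustrate why buffering helps, here is a minimal sketch of the idea in plain Python. It is not Flink's implementation: the function name `minibatch_sum` and the message layout are assumptions for this example. In Flink itself the feature is controlled by the `table.exec.mini-batch.enabled`, `table.exec.mini-batch.allow-latency` and `table.exec.mini-batch.size` options; see the documentation for your version for how to set them.

```python
from typing import Dict, Iterable, List, Tuple

# A changelog message: (is_accumulate, key, value).
Message = Tuple[bool, str, int]


def minibatch_sum(initial: Dict[str, int], batch: Iterable[Message]) -> List[Tuple[str, int]]:
    """Fold every message of one batch into the state, then emit once per touched key."""
    totals = dict(initial)
    touched: List[str] = []
    for is_accumulate, key, value in batch:
        totals[key] = totals.get(key, 0) + (value if is_accumulate else -value)
        if key not in touched:
            touched.append(key)
    # Only the final value per key leaves the operator.
    return [(key, totals[key]) for key in touched]


# Both halves of the update uv: 100 -> 101 fall into the same batch.
batch = [(False, "key", 100), (True, "key", 101)]
folded = minibatch_sum({"key": 500}, batch)
print(folded)  # [('key', 501)]
```

If the - and + messages happen to land in different batches, the intermediate value can still be emitted, which is why the optimization reduces the effect rather than removing it entirely.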
> Streaming sql with nested GROUP BY got wrong results
> ----------------------------------------------------
>
> Key: FLINK-17228
> URL: https://issues.apache.org/jira/browse/FLINK-17228
> Project: Flink
> Issue Type: Bug
> Components: Table SQL / API, Table SQL / Runtime
> Affects Versions: 1.7.2
> Environment: Flink 1.7.2
> Parallelism is 1
> Reporter: Xingxing Di
> Priority: Blocker
>
> We are facing a special scenario, *we want to know if this feature is supported*:
> First count distinct deviceid for the A and B dimensions, then sum up for just the A dimension.
> Here is the SQL:
> {code:java}
> SELECT dt, SUM(a.uv) AS uv
> FROM (
>     SELECT dt, pvareaid, COUNT(DISTINCT cuid) AS uv
>     FROM streaming_log_event
>     WHERE action IN ('action1')
>         AND pvareaid NOT IN ('pv1', 'pv2')
>         AND pvareaid IS NOT NULL
>     GROUP BY dt, pvareaid
> ) a
> GROUP BY dt;{code}
> The problem is that the data emitted to the sink was wrong: the sink periodically received a smaller result ({color:#ff0000}86{color}). Here is the log:
> {code:java}
> 2020-04-17 22:28:38,727 INFO groupBy xx -> to: Tuple2 -> Sink: Unnamed (1/1) (GeneralRedisSinkFunction.invoke:169) - receive data(false,0,86,20200417)
> 2020-04-17 22:28:38,727 INFO groupBy xx -> to: Tuple2 -> Sink: Unnamed (1/1) (GeneralRedisSinkFunction.invoke:169) - receive data(true,0,130,20200417)
> 2020-04-17 22:28:39,327 INFO groupBy xx -> to: Tuple2 -> Sink: Unnamed (1/1) (GeneralRedisSinkFunction.invoke:169) - receive data(false,0,130,20200417)
> 2020-04-17 22:28:39,327 INFO groupBy xx -> to: Tuple2 -> Sink: Unnamed (1/1) (GeneralRedisSinkFunction.invoke:169) - receive data(true,0,86,20200417)
> 2020-04-17 22:28:39,327 INFO groupBy xx -> to: Tuple2 -> Sink: Unnamed (1/1) (GeneralRedisSinkFunction.invoke:169) - receive data(false,0,86,20200417)
> 2020-04-17 22:28:39,328 INFO groupBy xx -> to: Tuple2 -> Sink: Unnamed (1/1) (GeneralRedisSinkFunction.invoke:169) - receive data(true,0,131,20200417)
> {code}
>