You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user-zh@flink.apache.org by 熊云昆 <xi...@163.com> on 2020/10/19 00:03:46 UTC
回复:flink sql count distinct结果会变小
看你的数据量不大,可以考虑输出user_id详情出来排查看看到底少了哪个
| |
熊云昆
|
|
邮箱:xiongyunkun@163.com
|
签名由 网易邮箱大师 定制
在2020年10月17日 16:24,867127831 写道:
在flink sql中计算网站的dau,对日期字段进行groupby后算count distinct user_id的值,同时开启table.optimizer.distinct-agg.split.enabled=true参数
job启动之后,查询mysql结果表,发现下一秒的结果有时候会比前一秒的结果小,但是总体趋势是增长的,为什么会出现这种现象呢?
查询时间 查询结果
2020-10-10 19:00:00 100
2020-10-10 19:00:02 98
2020-10-10 19:00:04 102
2020-10-10 19:00:06 108
2020-10-10 19:00:08 106
2020-10-10 19:00:10 110
sql如下:
create table jdbc_sink(
date_str varchar ,
dau bigint,
PRIMARY KEY (date_str) NOT ENFORCED
) with (
'connector' = 'jdbc',
'url' = 'jdbc:mysql://xxx',
'table-name' = 'xxx',
'driver' = 'com.mysql.jdbc.Driver',
'username' = 'xxx',
'password' = 'xxx'
);
CREATE TABLE action_log_source (
user_id varchar,
event_time TIMESTAMP(3),
WATERMARK FOR event_time AS event_time - INTERVAL '10' SECOND
) with (
...
);
INSERT INTO
jdbc_sink
SELECT
day_str as date_str,
COUNT(DISTINCT user_id) AS dau
FROM (
select
user_id as user_id,
date_format(event_time, 'yyyy-MM-dd') as day_str
from action_log_source
)
GROUP BY day_str
Re:回复:flink sql count distinct结果会变小
Posted by Michael Ran <gr...@163.com>.
我以前出现回撤的时候 出现过变小,也就是中途出现过失败等,后面又恢复了
在 2020-10-19 08:03:46,"熊云昆" <xi...@163.com> 写道:
>看你的数据量不大,可以考虑输出user_id详情出来排查看看到底少了哪个
>
>
>
>| |
>熊云昆
>|
>|
>邮箱:xiongyunkun@163.com
>|
>
>签名由 网易邮箱大师 定制
>
>在2020年10月17日 16:24,867127831 写道:
>在flink sql中计算网站的dau,对日期字段进行groupby后算count distinct user_id的值,同时开启table.optimizer.distinct-agg.split.enabled=true参数
>job启动之后,查询mysql结果表,发现下一秒的结果有时候会比前一秒的结果小,但是总体趋势是增长的,为什么会出现这种现象呢?
>
>
> 查询时间 查询结果
>2020-10-10 19:00:00 100
>2020-10-10 19:00:02 98
>2020-10-10 19:00:04 102
>2020-10-10 19:00:06 108
>2020-10-10 19:00:08 106
>2020-10-10 19:00:10 110
>
>
>sql如下:
>create table jdbc_sink(
> date_str varchar ,
> dau bigint,
> PRIMARY KEY (date_str) NOT ENFORCED
>) with (
> 'connector' = 'jdbc',
> 'url' = 'jdbc:mysql://xxx',
> 'table-name' = 'xxx',
> 'driver' = 'com.mysql.jdbc.Driver',
> 'username' = 'xxx',
> 'password' = 'xxx'
>);
>
>
>CREATE TABLE action_log_source (
> user_id varchar,
> event_time TIMESTAMP(3),
> WATERMARK FOR event_time AS event_time - INTERVAL '10' SECOND
>) with (
> ...
>);
>
>
>INSERT INTO
> jdbc_sink
>SELECT
> day_str as date_str,
> COUNT(DISTINCT user_id) AS dau
>FROM (
> select
> user_id as user_id,
> date_format(event_time, 'yyyy-MM-dd') as day_str
> from action_log_source
>)
>GROUP BY day_str
回复:flink sql count distinct结果会变小
Posted by 867127831 <86...@qq.com>.
这个数据量我是为了说明现象构造的数据来演示一下,不是真实的量,真实的数量据在千万级别的。
------------------ 原始邮件 ------------------
发件人: "user-zh" <xiongyunkun@163.com>;
发送时间: 2020年10月19日(星期一) 上午8:03
收件人: "867127831"<867127831@qq.com>;
抄送: "user-zh"<user-zh@flink.apache.org>;
主题: 回复:flink sql count distinct结果会变小
看你的数据量不大,可以考虑输出user_id详情出来排查看看到底少了哪个
| |
熊云昆
|
|
邮箱:xiongyunkun@163.com
|
签名由 网易邮箱大师 定制
在2020年10月17日 16:24,867127831 写道:
在flink sql中计算网站的dau,对日期字段进行groupby后算count distinct user_id的值,同时开启table.optimizer.distinct-agg.split.enabled=true参数
job启动之后,查询mysql结果表,发现下一秒的结果有时候会比前一秒的结果小,但是总体趋势是增长的,为什么会出现这种现象呢?
&nbsp; &nbsp; &nbsp; &nbsp; 查询时间&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 查询结果
2020-10-10 19:00:00&nbsp; &nbsp;100
2020-10-10 19:00:02&nbsp; &nbsp;98
2020-10-10 19:00:04&nbsp; &nbsp;102
2020-10-10 19:00:06&nbsp; &nbsp;108
2020-10-10 19:00:08&nbsp; &nbsp;106
2020-10-10 19:00:10&nbsp; &nbsp;110
sql如下:
create table jdbc_sink(
&nbsp; &nbsp; date_str varchar ,
&nbsp; &nbsp; dau bigint,
&nbsp; &nbsp; PRIMARY KEY (date_str) NOT ENFORCED
) with (
&nbsp; 'connector' = 'jdbc',
&nbsp; 'url' = 'jdbc:mysql://xxx',
&nbsp; 'table-name' = 'xxx',
&nbsp; 'driver' = 'com.mysql.jdbc.Driver',
&nbsp; 'username' = 'xxx',
&nbsp; 'password' = 'xxx'
);
CREATE TABLE action_log_source (
&nbsp; user_id varchar,
&nbsp; event_time TIMESTAMP(3),
&nbsp; WATERMARK FOR event_time AS event_time - INTERVAL '10' SECOND
) with (
&nbsp; ...
);
INSERT INTO
&nbsp; jdbc_sink
SELECT
&nbsp; day_str as date_str,
&nbsp; COUNT(DISTINCT user_id) AS dau
FROM (
&nbsp; select&nbsp;
&nbsp; &nbsp; &nbsp; user_id as user_id,
&nbsp; &nbsp; &nbsp; date_format(event_time, 'yyyy-MM-dd') as day_str
&nbsp; from action_log_source
)
GROUP BY day_str