You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user-zh@flink.apache.org by "casel.chen" <ca...@126.com> on 2021/08/11 09:06:31 UTC

flink sql聚合后collect收集数据问题

源表三个字段  name, color, ts
按时间窗口聚合后想根据name group by取colors数组


create table source_table (
   name  STRING,
   color  STRING,
   ts TIMESTAMP,
   WATERMARK ts for ts
) 


create table sink_table (
  name  STRING,
   colors  ARRAY<STRING>
)


1. 请问这个select语句要怎么写?
select name, collect(color) as colors from source_table group by tumble(ts, interval '5' seconds)
这里collect(color)返回的是multiset类型,怎样转成Array类型呢?


2. 如果array元素很多,我只想取其中N个,该怎么写flink sql? 

3, 若取出现次数最多的前N个,又该怎么写flink sql?
select name, collect(color) as colors from (
select name, color from (
  select *, ROW_NUMBER() OVER (PARTITION BY name ORDER BY color_cnt desc) AS row_num from (
select name, color, count(*) as color_cnt group by name, color, tumble(ts, interval '5' seconds)
  ) 
) where row_num < 5
);
是这样写么?

Re: flink sql聚合后collect收集数据问题

Posted by Caizhi Weng <ts...@gmail.com>.
Hi!

1 & 2. multiset 不能转成 array。可以考虑使用 listagg + split_index + limit
语句达成需要的效果。当然更方便的还是写一个 UDF。

3. window top-n 可以使用 1.13 新引入的 window tvf:
https://ci.apache.org/projects/flink/flink-docs-release-1.13/zh/docs/dev/table/sql/queries/window-topn/

casel.chen <ca...@126.com> 于2021年8月11日周三 下午5:06写道:

> 源表三个字段  name, color, ts
> 按时间窗口聚合后想根据name group by取colors数组
>
>
> create table source_table (
>    name  STRING,
>    color  STRING,
>    ts TIMESTAMP,
>    WATERMARK ts for ts
> )
>
>
> create table sink_table (
>   name  STRING,
>    colors  ARRAY<STRING>
> )
>
>
> 1. 请问这个select语句要怎么写?
> select name, collect(color) as colors from source_table group by
> tumble(ts, interval '5' seconds)
> 这里collect(color)返回的是multiset类型,怎样转成Array类型呢?
>
>
> 2. 如果array元素很多,我只想取其中N个,该怎么写flink sql?
>
> 3, 若取出现次数最多的前N个,又该怎么写flink sql?
> select name, collect(color) as colors from (
> select name, color from (
>   select *, ROW_NUMBER() OVER (PARTITION BY name ORDER BY color_cnt desc)
> AS row_num from (
> select name, color, count(*) as color_cnt group by name, color, tumble(ts,
> interval '5' seconds)
>   )
> ) where row_num < 5
> );
> 是这样写么?