You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@kylin.apache.org by Johnson <it...@163.com> on 2019/11/15 10:14:06 UTC

kylin多个子查询嵌套导致查询速度严重变慢(几十秒)

由于前端需要将多个指标合并展示，故查询kylin时，一个sql可能会嵌套多个多个子查询。生产环境发现，子查询严重影响查询速度。大家知道什么优化方法吗？
测试如下：
1.计算活跃设备
select count(distinct deviceid) dad from KYLIN_VIEW.KYLIN_VIEW_T_DWA_ACT_XXX_DEVICE_ACTIVE
where par_dt>='2019-06-01' and par_dt<='2019-11-15'
group by par_dt
2.计算活跃用户
select g.par_dt,count(distinct g.userid) "activeAccount" from KYLIN_VIEW.KYLIN_VIEW_T_DWA_ACT_XXX_USER_MULTIDIM_ACTIVE g
where g.par_dt>='2019-06-01' and g.par_dt<='2019-11-01'
group by g.par_dt
分别执行两个sql耗时基本在0.5s左右。


3.采用子查询：
select g.par_dt,count(distinct g.userid) "activeAccount",a.dad "activeDevice"
from KYLIN_VIEW.KYLIN_VIEW_T_DWA_ACT_XXX_USER_MULTIDIM_ACTIVE g
left join(
select par_dt,count(distinct deviceid) dad from KYLIN_VIEW.KYLIN_VIEW_T_DWA_ACT_XXX_DEVICE_ACTIVE
where par_dt>='2019-06-01' and par_dt<='2019-11-01'
group by par_dt
) a
on g.par_dt = a.par_dt
where g.par_dt>='2019-06-01' and g.par_dt<='2019-11-01'
group by g.par_dt,a.dad
查询耗时多达1.5s，当子查询增多时，查询耗时高达几十秒。如下图，对于此结果不是很理解，因为子查询完全可以并行去查，然后子查询的结果（数据量已经超级小了）在kylin server端聚合，应该很快啊，为什么会这么慢？

Re:Re: kylin多个子查询嵌套导致查询速度严重变慢(几十秒)

Posted by Johnson <it...@163.com>.


非常感谢您的回复，您的意思是所有模型都做成大宽表的形式（但是这样join的功能就感觉形同虚设）。但是实际业务场景，不可避免多个cube的子查询结果进行join。
我的疑惑是为什么这些子查询不能并行的去执行？串行的话确实很慢。


再次感谢您的回复！





在 2019-11-30 18:19:19，"ShaoFeng Shi" <sh...@apache.org> 写道：

Hi Johnson,


This is a good observation. For a single cube's query, Kylin can dispatch that to HBase for in parallel computing, which usually is sub-seconds. When there are multiple sub-queries
together, Kylin will execute these sub-queries in sequence. After getting the result back to the Kylin query node (in memory),  it will execute the final round computing like joining, filtering, etc.
Depends on the data size, the time it takes varies. That is what you observed.
 
A workaround is, you can define a new model/cube with the tables together. Then no need to use sub-queries.


Best regards,


Shaofeng Shi 史少锋
Apache Kylin PMC
Email: shaofengshi@apache.org


Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
Join Kylin user mail group: user-subscribe@kylin.apache.org
Join Kylin dev mail group: dev-subscribe@kylin.apache.org









Johnson <it...@163.com> 于2019年11月15日周五 下午6:14写道：

由于前端需要将多个指标合并展示，故查询kylin时，一个sql可能会嵌套多个多个子查询。生产环境发现，子查询严重影响查询速度。大家知道什么优化方法吗？
测试如下：
1.计算活跃设备
select count(distinct deviceid) dad from KYLIN_VIEW.KYLIN_VIEW_T_DWA_ACT_XXX_DEVICE_ACTIVE
where par_dt>='2019-06-01' and par_dt<='2019-11-15'
group by par_dt
2.计算活跃用户
select g.par_dt,count(distinct g.userid) "activeAccount" from KYLIN_VIEW.KYLIN_VIEW_T_DWA_ACT_XXX_USER_MULTIDIM_ACTIVE g
where g.par_dt>='2019-06-01' and g.par_dt<='2019-11-01'
group by g.par_dt
分别执行两个sql耗时基本在0.5s左右。


3.采用子查询：
select g.par_dt,count(distinct g.userid) "activeAccount",a.dad "activeDevice"
from KYLIN_VIEW.KYLIN_VIEW_T_DWA_ACT_XXX_USER_MULTIDIM_ACTIVE g
left join(
select par_dt,count(distinct deviceid) dad from KYLIN_VIEW.KYLIN_VIEW_T_DWA_ACT_XXX_DEVICE_ACTIVE
where par_dt>='2019-06-01' and par_dt<='2019-11-01'
group by par_dt
) a
on g.par_dt = a.par_dt
where g.par_dt>='2019-06-01' and g.par_dt<='2019-11-01'
group by g.par_dt,a.dad
查询耗时多达1.5s，当子查询增多时，查询耗时高达几十秒。如下图，对于此结果不是很理解，因为子查询完全可以并行去查，然后子查询的结果（数据量已经超级小了）在kylin server端聚合，应该很快啊，为什么会这么慢？

Re: kylin多个子查询嵌套导致查询速度严重变慢(几十秒)

Posted by ShaoFeng Shi <sh...@apache.org>.

Hi Johnson,

This is a good observation. For a single cube's query, Kylin can dispatch
that to HBase for in parallel computing, which usually is sub-seconds. When
there are multiple sub-queries
together, Kylin will execute these sub-queries in sequence. After getting
the result back to the Kylin query node (in memory),  it will execute the
final round computing like joining, filtering, etc.
Depends on the data size, the time it takes varies. That is what you
observed.

A workaround is, you can define a new model/cube with the tables together.
Then no need to use sub-queries.

Best regards,

Shaofeng Shi 史少锋
Apache Kylin PMC
Email: shaofengshi@apache.org

Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
Join Kylin user mail group: user-subscribe@kylin.apache.org
Join Kylin dev mail group: dev-subscribe@kylin.apache.org




Johnson <it...@163.com> 于2019年11月15日周五 下午6:14写道：

> 由于前端需要将多个指标合并展示，故查询kylin时，一个sql可能会嵌套多个多个子查询。生产环境发现，子查询严重影响查询速度。大家知道什么优化方法吗？
> 测试如下：
> 1.计算活跃设备
> select count(distinct deviceid) dad from
> KYLIN_VIEW.KYLIN_VIEW_T_DWA_ACT_XXX_DEVICE_ACTIVE
> where par_dt>='2019-06-01' and par_dt<='2019-11-15'
> group by par_dt
> 2.计算活跃用户
> select g.par_dt,count(distinct g.userid) "activeAccount" from
> KYLIN_VIEW.KYLIN_VIEW_T_DWA_ACT_XXX_USER_MULTIDIM_ACTIVE g
> where g.par_dt>='2019-06-01' and g.par_dt<='2019-11-01'
> group by g.par_dt
> 分别执行两个sql耗时基本在0.5s左右。
>
> 3.采用子查询：
> select g.par_dt,count(distinct g.userid) "activeAccount",a.dad
> "activeDevice"
> from KYLIN_VIEW.KYLIN_VIEW_T_DWA_ACT_XXX_USER_MULTIDIM_ACTIVE g
> left join(
> select par_dt,count(distinct deviceid) dad from
> KYLIN_VIEW.KYLIN_VIEW_T_DWA_ACT_XXX_DEVICE_ACTIVE
> where par_dt>='2019-06-01' and par_dt<='2019-11-01'
> group by par_dt
> ) a
> on g.par_dt = a.par_dt
> where g.par_dt>='2019-06-01' and g.par_dt<='2019-11-01'
> group by g.par_dt,a.dad
> 查询耗时多达1.5s，*当子查询增多时，查询耗时高达几十秒*。如下图，对于此结果不是很理解，因为子查询完全可以并行去查，然后子查询的结果（数据量已经超级小了）在kylin
> server端聚合，应该很快啊，为什么会这么慢？
>
>
>
>