You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@kylin.apache.org by Luke Han <lu...@gmail.com> on 2015/01/28 14:48:20 UTC
Fwd: performance gap between cubes made from fact table in parquet and fact table in text

Hi Shon,
    If these two hive tables contains same data and structure, the final 
cube should have same size and records. Could you please double check the 
cube size again?
     And please paste these two cube's json if possible for further 
analysis.
     

     Please subscribe and reply to Apache Kylin mailing list for further 
information:
     dev@kylin.incubator.apache.org
      
      Thanks.
Luke

在 2015年1月28日星期三 UTC+8上午10:37:07，林澍荣写道：
>
> Hi, Kylin team,
>      I have created two cubes, one from a fact table in text format, and 
> the other from the same fact table but in parquet format.
>      but when I executed the same query against the two fact tables, 
>  their speeds were significantly different. The following are my queries:
>
> Query 1 :
>
> select count(*) as usernum
> from (select M.phnum,M.total 
>          from (select phnum,sum(bytedn) as total
>                   from fact_table
>                   where lac_num='55109' and ci_num='60996'
>                   and cast(vstime as timestamp) between cast('2014-06-24 
> 08:00:00' as timestamp) and cast('2014-06-24 09:00:00' as timestamp)
>                   group by phnum)
>          as M
>          where M.total > 100) 
> as N
>
> Executing Time :
>
> the text fact table : 8.48s
> the parquet fact table : 47.99s
>
> Query 2 :
>
> select lac_num,ci_num,sum(bytedn)/cast((15*60) as float) as avg_bytedn
> from fact_table
> where lac_num='55288' and ci_num='2113'
> and cast(vstime as timestamp) between cast('2014-06-24 08:00:00' as 
> timestamp) and cast('2014-06-24 08:15:00' as timestamp)
> group by lac_num,ci_num
>
> Executing Time :
>
> the text fact table :0.16s
> the parquet fact table : 0.19s
>
> Query 3 :
>
> select lac_num,ci_num,phnum,sum(bytedn) as sum_bytedn
> from zhejiang_userflow_20140624_080000
> where lac_num='22513' and ci_num='31201'
> and phnum='8613588924366' 
> and cast(vstime as timestamp) between cast('2014-06-24 08:00:00' as 
> timestamp) and cast('2014-06-24 09:00:00' as timestamp)
> group by phnum,lac_num,ci_num
>
> Executing Time :
>
> the text fact table :0.11s
> the parquet fact table : 0.27s
>
> As you can see, the queries against the text fact table always spent less 
> time, especially in Query 1.
> I have checked the data of the cubes in HBase by using scan cmd, and it 
> seems that the two cubes are the same.
> I build the two cubes in the same way. Here is the cube info:
>
> dimensions: phnum,lac_num,ci_num,vstime (all dims are not mandatory)
> measures: _count_,sum(bytedn)
>  
> So what is the reason of the performance gap?
>
> Any feedbacks are appreciated.
>
> Thanks,
> Shon  
>