You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@kylin.apache.org by Mars J <xu...@gmail.com> on 2016/09/02 11:17:58 UTC
how to optimize kylin query
Hi ,
How can I optimize my kylin query ?
Cluster env. : Our cluster has 5 nodes for 1 master and 4 slaves, the
hadoop cluster and hbase cluster is on the same node. The memory is 32G /
node.
Cube Design :
1. tables : A fact table (200,000,000 records in hive) left join a
account dim table (200w records in hive) and left join a branch dim table.
2. dimensions& measures : There are 4 dimensions include 2 columns
of fact table and 1 column of account dim and the other column of branch
dim. and just add 1 measure —— sum()
3. aggregation group : agg1. include column idate, and set this to
mandatory, agg2 include other 3 dimensions.
Query SQL like this:
select a.acct_no,sum(amt) from f left join acct_dim a on
f.acct_no=a.acct_no [ where idate>=... and idate<...] group by a.acct_no .
we may change acct_no to bran_code ,or add the idate optionally, when
group by acct_no ,it costs about 5s.
if we add select columns and join more table like 'left join
acct_dim ... left join branch_dim....' it may take more seconds.
and when we query some detail data like some column in fact table
, it'll query time out.
Our query is every flexible, can we optimize something to query this
group by sql in 1s and can query the detail data correctly?
Thanks .
Re: how to optimize kylin query
Posted by Mars J <xu...@gmail.com>.
No matter no limits for returned (default 50000) or limit 100, it just 1
seconds difference some times the query time is the same. it is also time
out under the time out sql I addressed in my question.
I have attach the kylin.log . U can grep the SQL statement and Error.
There are 2 problems , one is for aggregation query like sum... group by ,
it costs 5 seconds. the other one is to lead hbase regionserver shutdown
and throw some errors.
Thanks so much.
2016-09-03 15:20 GMT+08:00 hongbin ma <ma...@apache.org>:
> how many results are being returned for the slow queries? can you attach
> kylin.log under KYLIN_HOME/logs?
>
> On Fri, Sep 2, 2016 at 7:17 PM, Mars J <xu...@gmail.com> wrote:
>
>> Hi ,
>> How can I optimize my kylin query ?
>>
>> Cluster env. : Our cluster has 5 nodes for 1 master and 4 slaves, the
>> hadoop cluster and hbase cluster is on the same node. The memory is 32G /
>> node.
>>
>> Cube Design :
>> 1. tables : A fact table (200,000,000 records in hive) left join a
>> account dim table (200w records in hive) and left join a branch dim table.
>> 2. dimensions& measures : There are 4 dimensions include 2 columns
>> of fact table and 1 column of account dim and the other column of branch
>> dim. and just add 1 measure —— sum()
>> 3. aggregation group : agg1. include column idate, and set this to
>> mandatory, agg2 include other 3 dimensions.
>>
>> Query SQL like this:
>> select a.acct_no,sum(amt) from f left join acct_dim a on
>> f.acct_no=a.acct_no [ where idate>=... and idate<...] group by a.acct_no .
>> we may change acct_no to bran_code ,or add the idate optionally, when
>> group by acct_no ,it costs about 5s.
>> if we add select columns and join more table like 'left join
>> acct_dim ... left join branch_dim....' it may take more seconds.
>> and when we query some detail data like some column in fact
>> table , it'll query time out.
>>
>> Our query is every flexible, can we optimize something to query
>> this group by sql in 1s and can query the detail data correctly?
>>
>> Thanks .
>>
>>
>>
>
>
>
> --
> Regards,
>
> *Bin Mahone | 马洪宾*
>
Re: how to optimize kylin query
Posted by hongbin ma <ma...@apache.org>.
how many results are being returned for the slow queries? can you attach
kylin.log under KYLIN_HOME/logs?
On Fri, Sep 2, 2016 at 7:17 PM, Mars J <xu...@gmail.com> wrote:
> Hi ,
> How can I optimize my kylin query ?
>
> Cluster env. : Our cluster has 5 nodes for 1 master and 4 slaves, the
> hadoop cluster and hbase cluster is on the same node. The memory is 32G /
> node.
>
> Cube Design :
> 1. tables : A fact table (200,000,000 records in hive) left join a
> account dim table (200w records in hive) and left join a branch dim table.
> 2. dimensions& measures : There are 4 dimensions include 2 columns
> of fact table and 1 column of account dim and the other column of branch
> dim. and just add 1 measure —— sum()
> 3. aggregation group : agg1. include column idate, and set this to
> mandatory, agg2 include other 3 dimensions.
>
> Query SQL like this:
> select a.acct_no,sum(amt) from f left join acct_dim a on
> f.acct_no=a.acct_no [ where idate>=... and idate<...] group by a.acct_no .
> we may change acct_no to bran_code ,or add the idate optionally, when
> group by acct_no ,it costs about 5s.
> if we add select columns and join more table like 'left join
> acct_dim ... left join branch_dim....' it may take more seconds.
> and when we query some detail data like some column in fact table
> , it'll query time out.
>
> Our query is every flexible, can we optimize something to query
> this group by sql in 1s and can query the detail data correctly?
>
> Thanks .
>
>
>
--
Regards,
*Bin Mahone | 马洪宾*