You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@kylin.apache.org by Mars J <xu...@gmail.com> on 2016/09/02 11:17:58 UTC

how to optimize kylin query

Hi ,
    How can I optimize my kylin query ?

    Cluster env. : Our cluster has 5 nodes for 1 master and 4 slaves, the
hadoop cluster and hbase cluster is on the same node. The memory is 32G /
node.

    Cube Design :
       1. tables : A fact table (200,000,000 records in hive) left join a
account dim table (200w records in hive) and left join a branch dim table.
       2. dimensions& measures : There are 4 dimensions include 2 columns
of fact table and 1 column of account dim and the other column of branch
dim.  and just add 1 measure —— sum()
       3. aggregation group : agg1. include column idate, and set this to
mandatory, agg2 include other 3 dimensions.

     Query SQL like this:
         select a.acct_no,sum(amt) from f left join acct_dim a on
f.acct_no=a.acct_no [ where idate>=... and idate<...] group by a.acct_no .
  we may change acct_no to bran_code ,or add the idate optionally, when
group by acct_no ,it costs about 5s.
         if we add select columns and join more table like 'left join
acct_dim ... left join branch_dim....' it may take more seconds.
         and when we query some detail data like some column in fact table
, it'll query time out.

       Our query is every flexible, can we optimize something to query this
group by sql in 1s and can query the detail data correctly?

   Thanks .

Re: how to optimize kylin query

Posted by Mars J <xu...@gmail.com>.

No matter no limits for returned (default 50000) or limit 100, it just 1
seconds difference some times the query time is the same. it is also time
out under the time out sql I addressed in my question.

I have attach the kylin.log .  U can grep the SQL statement and Error.
There are 2 problems , one is for aggregation query like sum... group by ,
it costs 5 seconds. the other one is to lead hbase regionserver shutdown
 and throw some errors.

Thanks so much.

2016-09-03 15:20 GMT+08:00 hongbin ma <ma...@apache.org>:

> how many results are being returned for the slow queries? can you attach
> kylin.log under KYLIN_HOME/logs?
>
> On Fri, Sep 2, 2016 at 7:17 PM, Mars J <xu...@gmail.com> wrote:
>
>> Hi ,
>>     How can I optimize my kylin query ?
>>
>>     Cluster env. : Our cluster has 5 nodes for 1 master and 4 slaves, the
>> hadoop cluster and hbase cluster is on the same node. The memory is 32G /
>> node.
>>
>>     Cube Design :
>>        1. tables : A fact table (200,000,000 records in hive) left join a
>> account dim table (200w records in hive) and left join a branch dim table.
>>        2. dimensions& measures : There are 4 dimensions include 2 columns
>> of fact table and 1 column of account dim and the other column of branch
>> dim.  and just add 1 measure —— sum()
>>        3. aggregation group : agg1. include column idate, and set this to
>> mandatory, agg2 include other 3 dimensions.
>>
>>      Query SQL like this:
>>          select a.acct_no,sum(amt) from f left join acct_dim a on
>> f.acct_no=a.acct_no [ where idate>=... and idate<...] group by a.acct_no .
>>   we may change acct_no to bran_code ,or add the idate optionally, when
>> group by acct_no ,it costs about 5s.
>>          if we add select columns and join more table like 'left join
>> acct_dim ... left join branch_dim....' it may take more seconds.
>>          and when we query some detail data like some column in fact
>> table , it'll query time out.
>>
>>        Our query is every flexible, can we optimize something to query
>> this group by sql in 1s and can query the detail data correctly?
>>
>>    Thanks .
>>
>>
>>
>
>
>
> --
> Regards,
>
> *Bin Mahone | 马洪宾*
>

Re: how to optimize kylin query

Posted by hongbin ma <ma...@apache.org>.

how many results are being returned for the slow queries? can you attach
kylin.log under KYLIN_HOME/logs?

On Fri, Sep 2, 2016 at 7:17 PM, Mars J <xu...@gmail.com> wrote:

> Hi ,
>     How can I optimize my kylin query ?
>
>     Cluster env. : Our cluster has 5 nodes for 1 master and 4 slaves, the
> hadoop cluster and hbase cluster is on the same node. The memory is 32G /
> node.
>
>     Cube Design :
>        1. tables : A fact table (200,000,000 records in hive) left join a
> account dim table (200w records in hive) and left join a branch dim table.
>        2. dimensions& measures : There are 4 dimensions include 2 columns
> of fact table and 1 column of account dim and the other column of branch
> dim.  and just add 1 measure —— sum()
>        3. aggregation group : agg1. include column idate, and set this to
> mandatory, agg2 include other 3 dimensions.
>
>      Query SQL like this:
>          select a.acct_no,sum(amt) from f left join acct_dim a on
> f.acct_no=a.acct_no [ where idate>=... and idate<...] group by a.acct_no .
>   we may change acct_no to bran_code ,or add the idate optionally, when
> group by acct_no ,it costs about 5s.
>          if we add select columns and join more table like 'left join
> acct_dim ... left join branch_dim....' it may take more seconds.
>          and when we query some detail data like some column in fact table
> , it'll query time out.
>
>        Our query is every flexible, can we optimize something to query
> this group by sql in 1s and can query the detail data correctly?
>
>    Thanks .
>
>
>



-- 
Regards,

*Bin Mahone | 马洪宾*