Posted to user@kylin.apache.org by Roberto Tardío Olmos <ro...@stratebi.com> on 2017/04/24 11:14:18 UTC

Aggregation Groups and High Cardinality Dimension

Hi Kylin Community,

I have some doubts about how aggregation groups work and about good 
practices for high-cardinality dimensions (HCD):

1. I have created a cube with two aggregation groups. In the first Agg 
Group I have included the Time and Company dimensions. In the second AGG 
Group I have included the Customer dimension, which has a cardinality of 
about 1 million rows. This dimension is used less frequently than the 
dimensions in the first agg group, and filtering (on some IDs) is always 
applied to it.

After the cube build, I can execute queries that combine dimensions of 
the two aggregation groups if I need to. However, the query latency is 
considerably worse than when I define the three dimensions together in 
the same Agg Group. I guess this is because the aggregation occurs 
during execution, since the combinations are not precalculated as they 
are when the three dimensions are in the same AGG group.
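
To make the combination count concrete, here is a toy Python sketch 
(not Kylin's actual implementation, and with made-up dimension names) 
of the cuboids each design precomputes. The cross-group combination 
Time+Customer only exists in the single-group design, which would 
explain why those queries are fast there:

```python
from itertools import combinations

def cuboids(dims):
    """All non-empty dimension combinations (cuboids) for one agg group."""
    return {frozenset(c) for r in range(1, len(dims) + 1)
            for c in combinations(dims, r)}

# One group with all three dimensions: every combination is precomputed.
one_group = cuboids(["time", "company", "customer"])

# Two groups: cross-group combinations (e.g. time+customer) are missing,
# so such a query must aggregate from the base cuboid at run time.
two_groups = cuboids(["time", "company"]) | cuboids(["customer"])

print(len(one_group))   # 7 cuboids
print(len(two_groups))  # 4 cuboids
```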

How are the two aggregation groups combined at query execution? I 
suppose that the FK reference of any fact involved in the query result 
is stored in and known by both AGG groups. I would like to know in more 
detail how this works.

2. If I have an HCD dimension that is rarely used in queries, and a 
Customer ID filter is always applied to it to get data only for some 
customers, is it a good practice to define it in a separate AGG group?

Regards,

-- 
*Roberto Tardío Olmos*
/Senior Big Data & Business Intelligence Consultant/

Avenida de Brasil, 17, Planta 16.

28020 Madrid

Fijo: 91.788.34.10


Re: Aggregation Groups and High Cardinality Dimension

Posted by Roberto Tardío Olmos <ro...@stratebi.com>.
Thank you very much Billy, now I understand how aggregation works in 
queries that involve different agg groups.

As you suggest, TopN looks like a very good solution for implementing 
HCD when you need TopN measures. As I read in the link you sent, this 
is because Top N aggregates are precalculated and stored with the fact 
table's base cuboid, avoiding combinations with the HCD dimension. I am 
going to try it.
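
As I understand it, the idea is to keep a small top-N summary inside 
each cuboid cell instead of making Customer a cuboid dimension. Here is 
a much-simplified exact sketch in Python (Kylin's real TopN measure 
uses the approximate SpaceSaving algorithm described in that blog post, 
and the field names here are made up):

```python
import heapq
from collections import defaultdict

def topn_per_cell(rows, group_dims, n=2):
    # Accumulate income per customer inside each cuboid cell, then keep
    # only the top-n customers; the HCD never becomes a dimension itself.
    totals = defaultdict(lambda: defaultdict(float))
    for row in rows:
        cell = tuple(row[d] for d in group_dims)
        totals[cell][row["customer"]] += row["income"]
    return {cell: heapq.nlargest(n, per_cust.items(), key=lambda kv: kv[1])
            for cell, per_cust in totals.items()}

rows = [
    {"month": "2017-04", "company": "A", "customer": "c1", "income": 10.0},
    {"month": "2017-04", "company": "A", "customer": "c2", "income": 30.0},
    {"month": "2017-04", "company": "A", "customer": "c1", "income": 5.0},
]
print(topn_per_cell(rows, ["month", "company"]))
# {('2017-04', 'A'): [('c2', 30.0), ('c1', 15.0)]}
```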

However, in addition to the Top N Sales Income measure, I also need to 
retrieve the Sum of Sales Income for one or more customer IDs (usually 
just one). Moreover, this sum measure is used less than the other 
measures. Because of that, I thought that defining it in a different 
Agg Group could be a solution. This way I can query the HCD dimension 
if I need to, achieving a trade-off between cube size and query 
latency. Moreover, I have more dimensions than the 3 I described in my 
previous email (11 /Normal/ dimensions + 10 /Derived/ dimensions) and 
my cluster is not too big (3 nodes with 21 GB RAM and 4 vcores each), 
so this is the only way I can build this cube in a reasonable time 
(about 6 hours).

However, after your email, I think I can achieve the same goal without 
including the Customer dimension in any Agg Group. I thought it was 
mandatory to include all dimensions in at least one Agg Group, but I 
have checked that it is possible not to include a defined dimension in 
any agg group.

Do you think this is a good strategy for this use case (a trade-off 
between cube build time, cube size and query latency)?

Moreover, in order to optimize the cube build process 
(https://kylin.apache.org/docs16/howto/howto_optimize_build.html) I 
read that it is recommended to define the Rowkey of the HCD dimension 
as "Shard By". What do you think about this strategy for our case study?

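My understanding of "Shard By" (a toy illustration, not Kylin's real 
HBase sharding code; the shard count and ID format are invented) is 
that rows are distributed by a hash of the chosen column, so a scan 
filtered on one customer ID touches a single shard instead of all of 
them:

```python
import zlib

NUM_SHARDS = 10  # hypothetical shard count

def shard_of(customer_id: str) -> int:
    # Stable hash: the same customer ID always maps to the same shard,
    # so a query filtered on that customer scans only one shard.
    return zlib.crc32(customer_id.encode()) % NUM_SHARDS

# A large ID population spreads across the shards, while each single
# customer stays pinned to exactly one of them.
ids = [f"customer-{i}" for i in range(1000)]
per_shard = [sum(1 for i in ids if shard_of(i) == s)
             for s in range(NUM_SHARDS)]
print(per_shard)
```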
Kind Regards,

On 24/04/2017 at 15:18, Billy Liu wrote:
> Thanks Roberto for the questions. These are quite good questions.
>
> For the AGG Group in your case, if none of the aggs can answer your 
> query, the query will be executed against the base cuboid, which 
> includes all dimensions. That also explains why performance degrades 
> so much, since the query relies on post-processing at run time.
>
> TopN measure is a good way to handle the HCD case. 
> http://kylin.apache.org/blog/2016/03/19/approximate-topn-measure/
> But if the filter condition has no ordering rule, TopN cannot be 
> used. The AGG group will not help either.


Re: Aggregation Groups and High Cardinality Dimension

Posted by Billy Liu <bi...@apache.org>.
Thanks Roberto for the questions. These are quite good questions.

For the AGG Group in your case, if none of the aggs can answer your query,
the query will be executed against the base cuboid, which includes all
dimensions. That also explains why performance degrades so much, since the
query relies on post-processing at run time.
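
A toy sketch of that fallback logic (not Kylin's actual query planner; 
the dimension names are invented): serve the query from the smallest 
precomputed cuboid that covers all queried dimensions, and only drop to 
the base cuboid when no agg group has a match:

```python
def pick_cuboid(query_dims, precomputed, all_dims):
    # Choose the smallest precomputed cuboid covering every queried
    # dimension; otherwise fall back to the base cuboid (all dimensions),
    # which forces aggregation at run time.
    query = set(query_dims)
    candidates = [c for c in precomputed if query <= set(c)]
    return min(candidates, key=len) if candidates else tuple(sorted(all_dims))

# Cuboids from two agg groups: {time, company} and {customer}.
precomputed = [("time",), ("company",), ("time", "company"), ("customer",)]
all_dims = ["time", "company", "customer"]

print(pick_cuboid(["time"], precomputed, all_dims))
# ('time',)
print(pick_cuboid(["time", "customer"], precomputed, all_dims))
# ('company', 'customer', 'time')  <- falls back to the base cuboid
```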

TopN measure is a good way to handle the HCD case.
http://kylin.apache.org/blog/2016/03/19/approximate-topn-measure/
But if the filter condition has no ordering rule, TopN cannot be used.
The AGG group will not help either.

2017-04-24 19:14 GMT+08:00 Roberto Tardío Olmos <roberto.tardio@stratebi.com>: