You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@kylin.apache.org by Dayue Gao <da...@163.com> on 2015/07/28 05:21:48 UTC

Question about derived dimension

Hi Kylin Developers,

I'm confused about when to use derived dimension and how it will affects query performance. So far I've known that derived dimension will not be included in row key and value, then how kylin achieves to implement query grouping by these dimensions? The "Design Cube in Kylin" document says "Derived dimension reduces combination from 2N to 2 at the cost of extra runtime aggregation." I find it hard to understand.

Also, it's not clear to me what is the requirement for using derived dimension. In "Add Dimension" step, there is a tip saying "Using Derived for One-One relationship between columns, like ID and Name". Does it mean derived dimension should be unique key of the lookup table? If that's the case, the applicable scenario seams limited.

I'm aware that there's a JIRA https://issues.apache.org/jira/browse/KYLIN-887 <https://issues.apache.org/jira/browse/KYLIN-887> for it, but I can't wait to know the answer :-)

Regards,
Dayue

Re: Question about derived dimension

Posted by Luke Han <lu...@gmail.com>.

Lookup table's dictionary will be in memory, but not for each time, it
should be shared for all queries.

Kylin supports high cardinality dimension so far (not perfect), will have
one enhancement to support such thing.
*Please refer to other mail thread about this.*

Thanks.


Best Regards!
---------------------

Luke Han

On Tue, Jul 28, 2015 at 6:28 PM, Dayue Gao <da...@163.com> wrote:

> Excellent explanation! Thanks you Hongbin!
>
> > Afterwards, Kylin will replace DimA values with DimB values(since both of
> > them are dimensions on lookup table, Kylin can load the whole lookup
> table
> > into memory and build a mapping for them)
>
> A further question, does it means Kylin will load all needed lookup tables
> for such queries each time? If that's the case, I think I should avoid
> using derived dimension on big dimension like users, right?
>
> Regards,
> Dayue
>
>
> > 在 2015年7月28日，下午5:52，hongbin ma <ma...@apache.org> 写道：
> >
> > derived column is used when one dimension can be deduced from another.
> *(They
> > must be dimension on lookup table)*
> > for example, suppose we have a lookup table:
> >
> > dimA    dimB  dimC
> > 1           a        ?
> > 2           b        ?
> > 3           c        ?
> > 4           a        ?
> >
> > in this case, given a value in DimA, the value of DimB is determined, so
> we
> > say dimB can be derived from DimA. When we build a cube that contains
> both
> > DimA and DimB, we simple include DimA, and marking DimB as derived.
> Derived
> > column(DimB) does not participant in cuboid spanning, thus reducing the
> > possible combinations:
> >
> > original combinations:
> > ABC,AB,AC,BC,A,B,C
> >
> > combinations when driving B from A:
> > AC,A,C
> >
> > at Runtime, in case queries like "select count(*) from fact_table inner
> > join looup1 group by looup1 .dimB", it is expecting cuboid containing
> DimB
> > to answer the query. However, DimB will appear in none of the cuboids due
> > to derived optimization. In this case, we modify the execution plan to
> make
> > it group by  DimA(its deriving dimension) first, we'll get intermediate
> > answer like:
> >
> > DimA  count(*)
> > 1          1
> > 2          1
> > 3          1
> > 4          1
> >
> > Afterwards, Kylin will replace DimA values with DimB values(since both of
> > them are dimensions on lookup table, Kylin can load the whole lookup
> table
> > into memory and build a mapping for them), and the intermediate result
> > becomes:
> >
> > DimB  count(*)
> > a          1
> > b          1
> > c          1
> > a          1
> >
> > After this, the runtime SQL engine(calcite) will further aggregate the
> > intermediate result to:
> >
> > DimB  count(*)
> > a          2
> > b          1
> > c          1
> >
> > this step happens at query runtime, this is what it means "at the cost of
> > extra runtime aggregation"
> >
> > On Tue, Jul 28, 2015 at 11:21 AM, Dayue Gao <dayue_gao@163.com <mailto:
> dayue_gao@163.com>> wrote:
> >
> >> Hi Kylin Developers,
> >>
> >> I'm confused about when to use derived dimension and how it will affects
> >> query performance. So far I've known that derived dimension will not be
> >> included in row key and value, then how kylin achieves to implement
> query
> >> grouping by these dimensions? The "Design Cube in Kylin" document says
> >> "Derived dimension reduces combination from 2N to 2 at the cost of extra
> >> runtime aggregation." I find it hard to understand.
> >>
> >> Also, it's not clear to me what is the requirement for using derived
> >> dimension. In "Add Dimension" step, there is a tip saying "Using Derived
> >> for One-One relationship between columns, like ID and Name". Does it
> mean
> >> derived dimension should be unique key of the lookup table? If that's
> the
> >> case, the applicable scenario seams limited.
> >>
> >> I'm aware that there's a JIRA
> >> https://issues.apache.org/jira/browse/KYLIN-887 <
> >> https://issues.apache.org/jira/browse/KYLIN-887 <
> https://issues.apache.org/jira/browse/KYLIN-887>> for it, but I can't wait
> >> to know the answer :-)
> >>
> >> Regards,
> >> Dayue
> >
> >
> >
> >
> > --
> > Regards,
> >
> > *Bin Mahone | 马洪宾*
> > Apache Kylin: http://kylin.io <http://kylin.io/>
> > Github: https://github.com/binmahone <https://github.com/binmahone>
>

Re: Question about derived dimension

Posted by Dayue Gao <da...@163.com>.

Excellent explanation! Thanks you Hongbin!

> Afterwards, Kylin will replace DimA values with DimB values(since both of
> them are dimensions on lookup table, Kylin can load the whole lookup table
> into memory and build a mapping for them)

A further question, does it means Kylin will load all needed lookup tables for such queries each time? If that's the case, I think I should avoid using derived dimension on big dimension like users, right?

Regards,
Dayue


> 在 2015年7月28日，下午5:52，hongbin ma <ma...@apache.org> 写道：
> 
> derived column is used when one dimension can be deduced from another. *(They
> must be dimension on lookup table)*
> for example, suppose we have a lookup table:
> 
> dimA    dimB  dimC
> 1           a        ?
> 2           b        ?
> 3           c        ?
> 4           a        ?
> 
> in this case, given a value in DimA, the value of DimB is determined, so we
> say dimB can be derived from DimA. When we build a cube that contains both
> DimA and DimB, we simple include DimA, and marking DimB as derived. Derived
> column(DimB) does not participant in cuboid spanning, thus reducing the
> possible combinations:
> 
> original combinations:
> ABC,AB,AC,BC,A,B,C
> 
> combinations when driving B from A:
> AC,A,C
> 
> at Runtime, in case queries like "select count(*) from fact_table inner
> join looup1 group by looup1 .dimB", it is expecting cuboid containing DimB
> to answer the query. However, DimB will appear in none of the cuboids due
> to derived optimization. In this case, we modify the execution plan to make
> it group by  DimA(its deriving dimension) first, we'll get intermediate
> answer like:
> 
> DimA  count(*)
> 1          1
> 2          1
> 3          1
> 4          1
> 
> Afterwards, Kylin will replace DimA values with DimB values(since both of
> them are dimensions on lookup table, Kylin can load the whole lookup table
> into memory and build a mapping for them), and the intermediate result
> becomes:
> 
> DimB  count(*)
> a          1
> b          1
> c          1
> a          1
> 
> After this, the runtime SQL engine(calcite) will further aggregate the
> intermediate result to:
> 
> DimB  count(*)
> a          2
> b          1
> c          1
> 
> this step happens at query runtime, this is what it means "at the cost of
> extra runtime aggregation"
> 
> On Tue, Jul 28, 2015 at 11:21 AM, Dayue Gao <dayue_gao@163.com <ma...@163.com>> wrote:
> 
>> Hi Kylin Developers,
>> 
>> I'm confused about when to use derived dimension and how it will affects
>> query performance. So far I've known that derived dimension will not be
>> included in row key and value, then how kylin achieves to implement query
>> grouping by these dimensions? The "Design Cube in Kylin" document says
>> "Derived dimension reduces combination from 2N to 2 at the cost of extra
>> runtime aggregation." I find it hard to understand.
>> 
>> Also, it's not clear to me what is the requirement for using derived
>> dimension. In "Add Dimension" step, there is a tip saying "Using Derived
>> for One-One relationship between columns, like ID and Name". Does it mean
>> derived dimension should be unique key of the lookup table? If that's the
>> case, the applicable scenario seams limited.
>> 
>> I'm aware that there's a JIRA
>> https://issues.apache.org/jira/browse/KYLIN-887 <
>> https://issues.apache.org/jira/browse/KYLIN-887 <https://issues.apache.org/jira/browse/KYLIN-887>> for it, but I can't wait
>> to know the answer :-)
>> 
>> Regards,
>> Dayue
> 
> 
> 
> 
> -- 
> Regards,
> 
> *Bin Mahone | 马洪宾*
> Apache Kylin: http://kylin.io <http://kylin.io/>
> Github: https://github.com/binmahone <https://github.com/binmahone>

Re: Question about derived dimension

Posted by hongbin ma <ma...@apache.org>.

derived column is used when one dimension can be deduced from another. *(They
must be dimension on lookup table)*
for example, suppose we have a lookup table:

dimA    dimB  dimC
1           a        ?
2           b        ?
3           c        ?
4           a        ?

in this case, given a value in DimA, the value of DimB is determined, so we
say dimB can be derived from DimA. When we build a cube that contains both
DimA and DimB, we simple include DimA, and marking DimB as derived. Derived
column(DimB) does not participant in cuboid spanning, thus reducing the
possible combinations:

original combinations:
ABC,AB,AC,BC,A,B,C

combinations when driving B from A:
AC,A,C

at Runtime, in case queries like "select count(*) from fact_table inner
join looup1 group by looup1 .dimB", it is expecting cuboid containing DimB
to answer the query. However, DimB will appear in none of the cuboids due
to derived optimization. In this case, we modify the execution plan to make
it group by  DimA(its deriving dimension) first, we'll get intermediate
answer like:

DimA  count(*)
1          1
2          1
3          1
4          1

Afterwards, Kylin will replace DimA values with DimB values(since both of
them are dimensions on lookup table, Kylin can load the whole lookup table
into memory and build a mapping for them), and the intermediate result
becomes:

DimB  count(*)
a          1
b          1
c          1
a          1

After this, the runtime SQL engine(calcite) will further aggregate the
intermediate result to:

DimB  count(*)
a          2
b          1
c          1

this step happens at query runtime, this is what it means "at the cost of
extra runtime aggregation"

On Tue, Jul 28, 2015 at 11:21 AM, Dayue Gao <da...@163.com> wrote:

> Hi Kylin Developers,
>
> I'm confused about when to use derived dimension and how it will affects
> query performance. So far I've known that derived dimension will not be
> included in row key and value, then how kylin achieves to implement query
> grouping by these dimensions? The "Design Cube in Kylin" document says
> "Derived dimension reduces combination from 2N to 2 at the cost of extra
> runtime aggregation." I find it hard to understand.
>
> Also, it's not clear to me what is the requirement for using derived
> dimension. In "Add Dimension" step, there is a tip saying "Using Derived
> for One-One relationship between columns, like ID and Name". Does it mean
> derived dimension should be unique key of the lookup table? If that's the
> case, the applicable scenario seams limited.
>
> I'm aware that there's a JIRA
> https://issues.apache.org/jira/browse/KYLIN-887 <
> https://issues.apache.org/jira/browse/KYLIN-887> for it, but I can't wait
> to know the answer :-)
>
> Regards,
> Dayue

-- 
Regards,

*Bin Mahone | 马洪宾*
Apache Kylin: http://kylin.io
Github: https://github.com/binmahone