Posted to dev@kylin.apache.org by Julian Hyde <jh...@apache.org> on 2015/06/19 21:03:51 UTC

Hierarchies

I’d like to ask a provocative question: Why does Kylin have hierarchies?

There may be some good reasons, but having thought for a long time about OLAP architectures I have come to the conclusion that hierarchies can be more trouble than they are worth. I regret that I made them so central to Mondrian’s architecture; they are a part of the MDX language, so Mondrian had to have them in some form, but more of the system should have been built using attributes. Since Kylin is SQL-based, it doesn’t need hierarchies at all.

In OLAP, hierarchies are really useful in the presentation layer: a hierarchy is a drill path. If a user has just expanded attribute A (e.g. Year) then they are very likely to want to expand attribute B (e.g. Month) or C (e.g. Week). So, hierarchies improve the user’s experience.

In the engine and storage layer there are some concepts similar to hierarchies:
- functional dependencies (i.e. for a given value of X, column Y always has the same value),
- highly correlated columns (e.g. for a given value of zipcode, state almost always has the same value), and
- columns that are frequently aggregated together (e.g. a query rarely has “group by productName” but more often has “group by manufacturer, brand, productName”).
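
To make the first two concepts concrete, here is a minimal sketch of one way to measure them from data. The function name `dependency_strength`, the scoring rule, and the sample rows are all assumptions made for illustration; this is not how Kylin or Calcite measures correlation.

```python
from collections import Counter, defaultdict

def dependency_strength(rows, x, y):
    """Fraction of rows whose y value is the most common y for their x value.

    1.0 means y is functionally dependent on x in this sample; a value
    close to 1.0 suggests the pair is highly correlated.
    """
    y_counts_by_x = defaultdict(Counter)
    for row in rows:
        y_counts_by_x[row[x]][row[y]] += 1
    total = sum(sum(c.values()) for c in y_counts_by_x.values())
    if total == 0:
        return 1.0  # an empty sample contradicts nothing
    agreeing = sum(max(c.values()) for c in y_counts_by_x.values())
    return agreeing / total

# Made-up sample rows; one zipcode is recorded under two states.
rows = [
    {"zipcode": "94105", "state": "CA", "nation": "US"},
    {"zipcode": "94105", "state": "CA", "nation": "US"},
    {"zipcode": "10001", "state": "NY", "nation": "US"},
    {"zipcode": "42223", "state": "KY", "nation": "US"},
    {"zipcode": "42223", "state": "KY", "nation": "US"},
    {"zipcode": "42223", "state": "TN", "nation": "US"},
]

# state -> nation holds for every row: a functional dependency.
print(dependency_strength(rows, "state", "nation"))
# zipcode -> state holds for 5 of 6 rows: highly correlated only.
print(dependency_strength(rows, "zipcode", "state"))
```

A score of exactly 1.0 on a sample does not prove a dependency in the full data set, so a real profiler would need a threshold and a large enough sample.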

These allow the kinds of storage optimization that hierarchies allow in Kylin, but they can be inferred without human intervention*, are more general, and are less restrictive. For example, when choosing the set of cuboids you would tend to include highly correlated columns (if you have just built a cuboid using zipcode, there is a high benefit and low incremental cost to adding state and nation to it, because state is highly correlated and nation is functionally dependent). This gives the same outcome as having an explicit (nation, state, zipcode) hierarchy.
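
The "low incremental cost" argument can be illustrated by counting the rows a cuboid would hold, which is the number of distinct value combinations of its columns. The sketch below uses made-up data and a hypothetical helper `cuboid_size`; it does not reflect Kylin's actual cuboid planner.

```python
def cuboid_size(rows, columns):
    """Number of distinct value combinations, i.e. rows in the cuboid."""
    return len({tuple(row[c] for c in columns) for row in rows})

FIELDS = ("zipcode", "state", "nation", "weekday")
# Made-up fact rows; weekday is unrelated to the geography columns.
rows = [dict(zip(FIELDS, values)) for values in [
    ("94105", "CA", "US", "Mon"),
    ("94105", "CA", "US", "Tue"),
    ("10001", "NY", "US", "Mon"),
    ("42223", "KY", "US", "Mon"),
    ("42223", "KY", "US", "Tue"),
    ("42223", "TN", "US", "Wed"),
]]

print(cuboid_size(rows, ["zipcode"]))                     # 3
print(cuboid_size(rows, ["zipcode", "state"]))            # 4
print(cuboid_size(rows, ["zipcode", "state", "nation"]))  # 4
print(cuboid_size(rows, ["zipcode", "weekday"]))          # 6
```

Adding the correlated state column and the dependent nation column grows the cuboid from 3 to 4 rows, while adding the unrelated weekday column doubles it.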

So, I am not claiming that hierarchies are not useful; I am claiming that they are not essential. If you were to remove explicit support for hierarchies and replace them with fuzzier concepts like highly correlated columns you might find that the system becomes radically simpler at its core.

Forgive me for being provocative. I want to challenge assumptions. If the architecture is working fine, feel free to disregard. But if you are seeing signs of architectural strain, this might be an opportunity to simplify.

Julian

* Functional dependencies can be inferred from the underlying star schema. Calcite’s aggregate designer discovers highly correlated columns with no human intervention, just by profiling the data; and columns that are frequently aggregated together could be discovered by looking at query logs. Kylin could do something similar.
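
As a rough sketch of the query-log idea in the footnote, the snippet below counts how often pairs of columns appear in the same GROUP BY. It assumes the log has already been parsed into lists of group-by columns; the `query_log` contents and the function `co_aggregation_counts` are hypothetical, and this is not a description of Calcite's aggregate designer.

```python
from collections import Counter
from itertools import combinations

def co_aggregation_counts(group_by_sets):
    """Count how often each pair of columns appears in the same GROUP BY."""
    pair_counts = Counter()
    for columns in group_by_sets:
        # Sort so that each pair has one canonical ordering.
        for pair in combinations(sorted(set(columns)), 2):
            pair_counts[pair] += 1
    return pair_counts

# Hypothetical, already-parsed query log: one list of columns per query.
query_log = [
    ["manufacturer", "brand", "productName"],
    ["manufacturer", "brand"],
    ["manufacturer", "brand", "productName"],
    ["year", "month"],
    ["productName"],
]

counts = co_aggregation_counts(query_log)
for pair, n in counts.most_common(3):
    print(pair, n)
```

Pairs with high counts are candidates for sharing a cuboid; how to turn the counts into a cuboid selection is left open here.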

Re: Hierarchies

Posted by Li Yang <li...@apache.org>.
The question is excellent! "Hierarchy" is not a precise term for how it
is used in Kylin.

Hierarchy in Kylin is merely a concept for defining a partial cube. Kylin
does not require a parent-child relationship among the columns that define
a "Hierarchy"[1]. For example, it's good practice to define Month->Week as
a hierarchy in Kylin, because the two are highly correlated. However,
Month->Week (many-to-many) does not maintain a parent-child relationship,
and thus is not a hierarchy in classic OLAP. Exposing the term "Hierarchy"
to Kylin users can be misleading and can cause them to miss optimization
tweaks like this.
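
The Month->Week point can be checked with a small calendar sketch. It assumes ISO week numbering, which the thread does not specify, and the helper name `months_by_iso_week` is made up for illustration.

```python
from collections import defaultdict
from datetime import date, timedelta

def months_by_iso_week(year):
    """Map each ISO (year, week) to the set of calendar months it touches."""
    months = defaultdict(set)
    day = date(year, 1, 1)
    while day.year == year:
        iso = day.isocalendar()  # (ISO year, ISO week, ISO weekday)
        months[(iso[0], iso[1])].add(day.month)
        day += timedelta(days=1)
    return months

weeks = months_by_iso_week(2015)
split_weeks = [week for week, m in weeks.items() if len(m) > 1]
print(len(weeks), "weeks,", len(split_weeks), "of them span two months")
```

Most weeks fall inside a single month, so the two columns are highly correlated, but the weeks that straddle a month boundary mean a week has no single parent month.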

I agree with Julian that a more abstract concept like "highly
correlated" or "often grouped together" should replace Hierarchy in the
Kylin core. Automatically discovering "highly correlated" columns and
becoming self-adapting to query patterns will be the ultimate goal along
this line.

Pre-aggregation is Kylin's core competency compared to other analysis
engines. We shall move our focus back to this capability once streaming
cubing reaches its milestone.

Cheers
Yang

[1] https://en.wikipedia.org/wiki/OLAP_cube#Hierarchy


On Sun, Jun 21, 2015 at 11:42 PM, Julian Hyde <jh...@gmail.com>
wrote:

> Either would be fine. Ideally Kylin would optimize storage automatically
> but, pragmatically, it’s reasonable to allow the user to supply hints.
> Hierarchies are a natural way for the user to supply hints but it seems to
> me that they shouldn’t exist in the deeper parts of the system.
>
> Julian

Re: Hierarchies

Posted by Julian Hyde <jh...@gmail.com>.
Either would be fine. Ideally Kylin would optimize storage automatically but, pragmatically, it’s reasonable to allow the user to supply hints. Hierarchies are a natural way for the user to supply hints but it seems to me that they shouldn’t exist in the deeper parts of the system.

Julian


> On Jun 21, 2015, at 8:11 AM, Adunuthula, Seshu <sa...@ebay.com> wrote:
> 
> Julian,
> 
> Inferring implicit hierarchies from highly correlated columns sounds
> like an intriguing idea. Are you thinking that Kylin would automatically
> infer that a set of columns is correlated and allow for storage
> optimization, or more of a lazy specification of the hierarchies at the
> time of cuboid definition?
> 
> I wanted to hear Yang's thoughts on this.
> 
> Regards
> Seshu


Re: Hierarchies

Posted by "Adunuthula, Seshu" <sa...@ebay.com>.
Julian,

Inferring implicit hierarchies from highly correlated columns sounds
like an intriguing idea. Are you thinking that Kylin would automatically
infer that a set of columns is correlated and allow for storage
optimization, or more of a lazy specification of the hierarchies at the
time of cuboid definition?

I wanted to hear Yang's thoughts on this.

Regards
Seshu
