You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@kylin.apache.org by dong wang <el...@gmail.com> on 2015/06/17 08:50:28 UTC

will we support derived dimensions from fact table itself

usually, we may have 2 common cases:
1, only have a fact table which has already been joined with lookup tables,
take it as a wide table

2, we have star schema data model, and when defining the cube, we define
the join with the fact table and lookup tables,

currently,

to the 1st case, we cannot create derived dimensions based on the fact
table which can be created only from the lookup tables,  for example, we
have fact table: region_id, region_name, province_id, province_name,
city_id, city_name, M1, M2, ...Mn, we want to create a hierarchy dimension:
region_id->province_id->city_id, and a derived dimenison: (region_name,
province_name, city_name) as well, if we can achieve it, there will be no
complicated joins at all, and also, the join will take much time in the
first step when building the cube~ thus, will we support such demension
design?

to the 2nd case, even though we make the join, however, if the column names
are changed for some reasons, we will have to build the whole cube again
instead of only update the dimension data?

Re: will we support derived dimensions from fact table itself

Posted by "Shi, Shaofeng" <sh...@ebay.com>.

The design of lookup_table1 is problematic; you couldn't join it with
table like fact_table2, fact_table3 only on region_id; let’s see an
example:

lookup_table1 (assume it has two records with region_id R1):
region_id, region_name, province_id, province_name, city_id, city_name
R1, EAST, P1, ANHUI, C1, HEFEI

R1, EAST, P2, JIANGSU, C2, NANJING

fact_table2 (assume it only has 1 record, region_id is R1):
region_id, year_id, M1, M2,..., Mn
R1, 2015, m1, m2…


Join fact_table2 and lookup_table1 on region_id, you will get two records :
R1, EAST, 2015, m1, m2…

R1, EAST, 2015, m1, m2…


This is wrong, as the measures are repeated;

A better design might be, having lookup table for each layer:

Region:
region_id, region_name

Province:
province_id, province_name

City:
city_id, city_name

Then all the names can be derived from the id on fact table;

Just some my cents, let me know if I’m wrong...

On 6/17/15, 9:39 PM, "dong wang" <el...@gmail.com> wrote:

>Hi shaofeng, hongbin,
>
>for case 1:  there should be some weakness, suppose that we have a "STAR"
>model schema:
>
>lookup_table1:
>region_id, region_name, province_id, province_name, city_id, city_name
>
>fact_table1:
>region_id, province_id, city_id, M1, M2, Mn
>
>fact_table2:
>region_id, year_id, M1, M2,..., Mn
>
>fact_table3:
>region_id, province_id, quarter_id, M1, M2, Mn
>
>as mentioned, only the "fact_table1" can successfully join with
>"lookup_table1",  "fact_table2" and "fact_table3" cannot share
>"lookup_table1",  is it right?
>
>
>
>for case 2:
>sorry for the misunderstanding, what I mean actually is that if the VALUE
>of the city_name for some rows is changed, we will have to totally built
>the cube again, is it right?  can we scan all the id and values in the
>fact
>table and build a hashmap, and before kylin returns the data to the user
>end, we only do the JOIN with the LITTLE amount of the qualified query
>data
>with the hashmap,  and the reason why I think we should support similar
>feature is that for a small hadoop cluster and HUGE amount of data, it is
>VERY time-consuming to do the HIVE join when the first cube building step!
>
>2015-06-17 17:27 GMT+08:00 Shi, Shaofeng <sh...@ebay.com>:
>
>> Hi dong, ³Derived² is only for lookup table (and the lookup table need
>> ensure the uniqueness of PK, so with one PK we can exactly find one
>>value
>> for the derived column);
>>
>> For case 1, Kylin will not support derive from fact table; The problem
>>is,
>> how can we know/ensure 1 region_id only has only 1 region_name mapped if
>> you repeats region_id and region_name in each row of fact table? If
>>there
>> are more than 1 mapping, which one should be taken?
>>
>> If a table column name be changed, the cubes need be updated to use the
>> new name, and a re-build is needed; So you need avoid to change
>> table/column name; (I guess you¹re not mean this, is it?)
>>
>> On 6/17/15, 2:50 PM, "dong wang" <el...@gmail.com> wrote:
>>
>> >usually, we may have 2 common cases:
>> >1, only have a fact table which has already been joined with lookup
>> >tables,
>> >take it as a wide table
>> >
>> >2, we have star schema data model, and when defining the cube, we
>>define
>> >the join with the fact table and lookup tables,
>> >
>> >currently,
>> >
>> >to the 1st case, we cannot create derived dimensions based on the fact
>> >table which can be created only from the lookup tables,  for example,
>>we
>> >have fact table: region_id, region_name, province_id, province_name,
>> >city_id, city_name, M1, M2, ...Mn, we want to create a hierarchy
>> >dimension:
>> >region_id->province_id->city_id, and a derived dimenison: (region_name,
>> >province_name, city_name) as well, if we can achieve it, there will be
>>no
>> >complicated joins at all, and also, the join will take much time in the
>> >first step when building the cube~ thus, will we support such demension
>> >design?
>> >
>> >to the 2nd case, even though we make the join, however, if the column
>> >names
>> >are changed for some reasons, we will have to build the whole cube
>>again
>> >instead of only update the dimension data?
>>
>>

Re: will we support derived dimensions from fact table itself

Posted by dong wang <el...@gmail.com>.

Hi shaofeng, hongbin,

for case 1:  there should be some weakness, suppose that we have a "STAR"
model schema:

lookup_table1:
region_id, region_name, province_id, province_name, city_id, city_name

fact_table1:
region_id, province_id, city_id, M1, M2, Mn

fact_table2:
region_id, year_id, M1, M2,..., Mn

fact_table3:
region_id, province_id, quarter_id, M1, M2, Mn

as mentioned, only the "fact_table1" can successfully join with
"lookup_table1",  "fact_table2" and "fact_table3" cannot share
"lookup_table1",  is it right?



for case 2:
sorry for the misunderstanding, what I mean actually is that if the VALUE
of the city_name for some rows is changed, we will have to totally built
the cube again, is it right?  can we scan all the id and values in the fact
table and build a hashmap, and before kylin returns the data to the user
end, we only do the JOIN with the LITTLE amount of the qualified query data
with the hashmap,  and the reason why I think we should support similar
feature is that for a small hadoop cluster and HUGE amount of data, it is
VERY time-consuming to do the HIVE join when the first cube building step!

2015-06-17 17:27 GMT+08:00 Shi, Shaofeng <sh...@ebay.com>:

> Hi dong, ³Derived² is only for lookup table (and the lookup table need
> ensure the uniqueness of PK, so with one PK we can exactly find one value
> for the derived column);
>
> For case 1, Kylin will not support derive from fact table; The problem is,
> how can we know/ensure 1 region_id only has only 1 region_name mapped if
> you repeats region_id and region_name in each row of fact table? If there
> are more than 1 mapping, which one should be taken?
>
> If a table column name be changed, the cubes need be updated to use the
> new name, and a re-build is needed; So you need avoid to change
> table/column name; (I guess you¹re not mean this, is it?)
>
> On 6/17/15, 2:50 PM, "dong wang" <el...@gmail.com> wrote:
>
> >usually, we may have 2 common cases:
> >1, only have a fact table which has already been joined with lookup
> >tables,
> >take it as a wide table
> >
> >2, we have star schema data model, and when defining the cube, we define
> >the join with the fact table and lookup tables,
> >
> >currently,
> >
> >to the 1st case, we cannot create derived dimensions based on the fact
> >table which can be created only from the lookup tables,  for example, we
> >have fact table: region_id, region_name, province_id, province_name,
> >city_id, city_name, M1, M2, ...Mn, we want to create a hierarchy
> >dimension:
> >region_id->province_id->city_id, and a derived dimenison: (region_name,
> >province_name, city_name) as well, if we can achieve it, there will be no
> >complicated joins at all, and also, the join will take much time in the
> >first step when building the cube~ thus, will we support such demension
> >design?
> >
> >to the 2nd case, even though we make the join, however, if the column
> >names
> >are changed for some reasons, we will have to build the whole cube again
> >instead of only update the dimension data?
>
>

Re: will we support derived dimensions from fact table itself

Posted by "Shi, Shaofeng" <sh...@ebay.com>.

Hi dong, ³Derived² is only for lookup table (and the lookup table need
ensure the uniqueness of PK, so with one PK we can exactly find one value
for the derived column);

For case 1, Kylin will not support derive from fact table; The problem is,
how can we know/ensure 1 region_id only has only 1 region_name mapped if
you repeats region_id and region_name in each row of fact table? If there
are more than 1 mapping, which one should be taken?

If a table column name be changed, the cubes need be updated to use the
new name, and a re-build is needed; So you need avoid to change
table/column name; (I guess you¹re not mean this, is it?)

On 6/17/15, 2:50 PM, "dong wang" <el...@gmail.com> wrote:

>usually, we may have 2 common cases:
>1, only have a fact table which has already been joined with lookup
>tables,
>take it as a wide table
>
>2, we have star schema data model, and when defining the cube, we define
>the join with the fact table and lookup tables,
>
>currently,
>
>to the 1st case, we cannot create derived dimensions based on the fact
>table which can be created only from the lookup tables,  for example, we
>have fact table: region_id, region_name, province_id, province_name,
>city_id, city_name, M1, M2, ...Mn, we want to create a hierarchy
>dimension:
>region_id->province_id->city_id, and a derived dimenison: (region_name,
>province_name, city_name) as well, if we can achieve it, there will be no
>complicated joins at all, and also, the join will take much time in the
>first step when building the cube~ thus, will we support such demension
>design?
>
>to the 2nd case, even though we make the join, however, if the column
>names
>are changed for some reasons, we will have to build the whole cube again
>instead of only update the dimension data?

Re: will we support derived dimensions from fact table itself

Posted by hongbin ma <ma...@apache.org>.

for case 1:

do you mean that you want to derive region_name from region_id, city_name
from city_id?

for case 2:

theoretically it can be hacked by modifying metadata store only, however,
it can be very complex and error prone

On Wed, Jun 17, 2015 at 2:50 PM, dong wang <el...@gmail.com> wrote:

> usually, we may have 2 common cases:
> 1, only have a fact table which has already been joined with lookup tables,
> take it as a wide table
>
> 2, we have star schema data model, and when defining the cube, we define
> the join with the fact table and lookup tables,
>
> currently,
>
> to the 1st case, we cannot create derived dimensions based on the fact
> table which can be created only from the lookup tables,  for example, we
> have fact table: region_id, region_name, province_id, province_name,
> city_id, city_name, M1, M2, ...Mn, we want to create a hierarchy dimension:
> region_id->province_id->city_id, and a derived dimenison: (region_name,
> province_name, city_name) as well, if we can achieve it, there will be no
> complicated joins at all, and also, the join will take much time in the
> first step when building the cube~ thus, will we support such demension
> design?
>
> to the 2nd case, even though we make the join, however, if the column names
> are changed for some reasons, we will have to build the whole cube again
> instead of only update the dimension data?
>



-- 
Regards,

*Bin Mahone | 马洪宾*
Apache Kylin: http://kylin.io
Github: https://github.com/binmahone