You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@kylin.apache.org by Alberto Ramón <a....@gmail.com> on 2016/12/08 23:10:40 UTC

Use derived or Joint

Typical case 1:

*IDDate*

*Month_ID*

*Month_Txt*

*DayWeek_ID*

*DayWeek_Txt*

*Year*

2016-03-01

3

March

2

Wendesday

2016

2016-03-02

3

March

3

Thursday

2016

2016-03-02

3

March

4

Friday

2016

IDDate is PK of Dim table and Unique


SOL 1: Uses Hierarchy and Derived from non PK column


*Month_ID*

Hierarchy 2

Normal 1

*Month_Txt*


Derived 1

*DayWeek_ID*

Hierarchy 3

Normal 2

*DayWeek_Txt*


Derived 2

*Year*

Hierarchy 1

Normal 3

Year > Month > Day

Text are derived from ID (in month and Week)

PB1: KYLIN-444 <https://issues.apache.org/jira/browse/KYLIN-444>

PB2: I don't know how create Derived column from non PK with actual UI (Kylin
– 1313 <https://issues.apache.org/jira/browse/KYLIN-1313> v1.5.2 Kylin 1786
<https://issues.apache.org/jira/browse/KYLIN-1786>, v1.5.3)



SOL 2:

*Month_ID*

Hierarchy 2

Join 1

*Month_Txt*


Join 1

*DayWeek_ID*

Hierarchy 3

Join 2

*DayWeek_Txt*


Join 2

*Year*

Hierarchy 1

Normal 3


SOL 2 is this the best solution ??



Typical case 2:

I see the same scenario a lot of times (derived columns with 1:1 Relation)

Product_ID *(PK)*

Product_TXT

TypeProduct_ID

TypeProduct_TXT

Country_TXT

Country_ID

Optimize queries by product / category / country, are mandatory

Perhaps,

Country (lower cardinality) its a good candidate to Join

I don't want put Product_TXT as Join, because is a longgggg text, and can
be affect Row_Key of HBase, but I need Queries like ... where product_TXT =
""iRobot Roomba 650 Robotic Vacuum Cleaner

suggestions ?

Re: Use derived or Joint

Posted by Alberto Ramón <a....@gmail.com>.

Thanks, your notes about Hierarchy are very good and important

2016-12-12 2:49 GMT+01:00 ShaoFeng Shi <sh...@apache.org>:

> " define "year" - "IDDate" as a hierarchy", which is the benefit ?
> --> The combination "year" + "IDDate" has the same line number as the
> combination "IDDate"; so aggregate from the former to the latter will not
> aggregate much; Then we can prune the later with the "hierarchy" to reduce
> the cube size;
>
> Nowadays, in derived columns, the Host column, is always the PK of table ?
> --> Yes
>
> 2016-12-10 20:25 GMT+08:00 Alberto Ramón <a....@gmail.com>:
>
>> thanks for you clear explanation !!
>>
>>
>> The only point that I can't understand is
>> " define "year" - "IDDate" as a hierarchy", which is the benefit ?
>>       [image: Imágenes integradas 1]
>> Where:
>>
>>    -
>>
>>    IDData is PK of Dim table, Unique & Identity
>>    - Year is a Normal Dim --> I will have precalculated by years
>>
>>
>> Nowadays, in derived columns, the Host column, is always the PK of table ?
>>
>>
>>
>>
>> 2016-12-09 15:25 GMT+01:00 ShaoFeng Shi <sh...@apache.org>:
>>
>>> Hi Albert, I think you're raising a good question; Many users face such
>>> questions when using Kylin in their cases. Let me try to share some my
>>> cents.
>>>
>>> "Derived" or "Joint" ?
>>> These are two independent means in Kylin (they're not conflict). Using
>>> which depends on how these dimensions being used I think;
>>>
>>> Take the "IDDate" case you mentioned as an example; If most of you
>>> queries are aggregated at the PK/FK level (which is date), and user just
>>> want to in passing other fields like its "MonthTxt", "DayWeekTxt", defining
>>> them as "derived" will be very good.
>>>
>>> But if you also want to aggregate also at "MonthTxt" or "DayWeekTxt"
>>> level, defining them as "Derived" might not be good; Because Kylin need
>>> translates the condition of "MonthTxt" into a set of PK values ("IDDate"),
>>> and then query from Cube with these values, because the cube only
>>> pre-aggreated at "IDDate"; This will slow down the query; (Ofcause if your
>>> dataset is small it still be acceptable)
>>>
>>> Besides, defining "Year" - "DayWeek_ID" - "Month_ID" as hierarchy is not
>>> suggested, because they are not a hierarchy relationship, but in parallel
>>> here;  ("March" is not a child of "2016", it appears in every year)
>>>
>>> "Joint" can be used in two typical cases:
>>> 1) combine multiple ultra low cardinality dimensions
>>> 2) combine dimensions which has 1:1 (like "Month_ID" "Month_Txt") or
>>> close to 1:1 relationship (like "USER_ID", "USER_EMAIL")
>>>
>>> For case 1, I might design the cube in this way (assume you have the
>>> need to group by year, month, dayweek):
>>> 1) define all them as normal dimension
>>> 2) define "year" - "IDDate" as a hierarchy
>>> 3) define "Month_ID" + "Month_Txt" as a joint, because they're 1:1
>>> 4) define "DayWeek_ID" + "DayWeek_Txt" as a joint, because they're 1:1
>>>
>>>
>>> For case 2, I have the same suggestion as above.
>>>
>>> 2016-12-09 7:10 GMT+08:00 Alberto Ramón <a....@gmail.com>:
>>>
>>>> Typical case 1:
>>>>
>>>> *IDDate*
>>>>
>>>> *Month_ID*
>>>>
>>>> *Month_Txt*
>>>>
>>>> *DayWeek_ID*
>>>>
>>>> *DayWeek_Txt*
>>>>
>>>> *Year*
>>>>
>>>> 2016-03-01
>>>>
>>>> 3
>>>>
>>>> March
>>>>
>>>> 2
>>>>
>>>> Wendesday
>>>>
>>>> 2016
>>>>
>>>> 2016-03-02
>>>>
>>>> 3
>>>>
>>>> March
>>>>
>>>> 3
>>>>
>>>> Thursday
>>>>
>>>> 2016
>>>>
>>>> 2016-03-02
>>>>
>>>> 3
>>>>
>>>> March
>>>>
>>>> 4
>>>>
>>>> Friday
>>>>
>>>> 2016
>>>>
>>>> IDDate is PK of Dim table and Unique
>>>>
>>>>
>>>> SOL 1: Uses Hierarchy and Derived from non PK column
>>>>
>>>>
>>>> *Month_ID*
>>>>
>>>> Hierarchy 2
>>>>
>>>> Normal 1
>>>>
>>>> *Month_Txt*
>>>>
>>>>
>>>> Derived 1
>>>>
>>>> *DayWeek_ID*
>>>>
>>>> Hierarchy 3
>>>>
>>>> Normal 2
>>>>
>>>> *DayWeek_Txt*
>>>>
>>>>
>>>> Derived 2
>>>>
>>>> *Year*
>>>>
>>>> Hierarchy 1
>>>>
>>>> Normal 3
>>>>
>>>> Year > Month > Day
>>>>
>>>> Text are derived from ID (in month and Week)
>>>>
>>>> PB1: KYLIN-444 <https://issues.apache.org/jira/browse/KYLIN-444>
>>>>
>>>> PB2: I don't know how create Derived column from non PK with actual UI (Kylin
>>>> – 1313 <https://issues.apache.org/jira/browse/KYLIN-1313> v1.5.2 Kylin
>>>> 1786 <https://issues.apache.org/jira/browse/KYLIN-1786>, v1.5.3)
>>>>
>>>>
>>>>
>>>> SOL 2:
>>>>
>>>> *Month_ID*
>>>>
>>>> Hierarchy 2
>>>>
>>>> Join 1
>>>>
>>>> *Month_Txt*
>>>>
>>>>
>>>> Join 1
>>>>
>>>> *DayWeek_ID*
>>>>
>>>> Hierarchy 3
>>>>
>>>> Join 2
>>>>
>>>> *DayWeek_Txt*
>>>>
>>>>
>>>> Join 2
>>>>
>>>> *Year*
>>>>
>>>> Hierarchy 1
>>>>
>>>> Normal 3
>>>>
>>>>
>>>> SOL 2 is this the best solution ??
>>>>
>>>>
>>>>
>>>> Typical case 2:
>>>>
>>>> I see the same scenario a lot of times (derived columns with 1:1
>>>> Relation)
>>>>
>>>> Product_ID *(PK)*
>>>>
>>>> Product_TXT
>>>>
>>>> TypeProduct_ID
>>>>
>>>> TypeProduct_TXT
>>>>
>>>> Country_TXT
>>>>
>>>> Country_ID
>>>>
>>>> Optimize queries by product / category / country, are mandatory
>>>>
>>>> Perhaps,
>>>>
>>>> Country (lower cardinality) its a good candidate to Join
>>>>
>>>> I don't want put Product_TXT as Join, because is a longgggg text, and
>>>> can be affect Row_Key of HBase, but I need Queries like ... where
>>>> product_TXT = ""iRobot Roomba 650 Robotic Vacuum Cleaner
>>>>
>>>> suggestions ?
>>>>
>>>
>>>
>>>
>>> --
>>> Best regards,
>>>
>>> Shaofeng Shi 史少锋
>>>
>>>
>>
>
>
> --
> Best regards,
>
> Shaofeng Shi 史少锋
>
>

Re: Use derived or Joint

Posted by ShaoFeng Shi <sh...@apache.org>.

" define "year" - "IDDate" as a hierarchy", which is the benefit ?
--> The combination "year" + "IDDate" has the same line number as the
combination "IDDate"; so aggregate from the former to the latter will not
aggregate much; Then we can prune the later with the "hierarchy" to reduce
the cube size;

Nowadays, in derived columns, the Host column, is always the PK of table ?
--> Yes

2016-12-10 20:25 GMT+08:00 Alberto Ramón <a....@gmail.com>:

> thanks for you clear explanation !!
>
>
> The only point that I can't understand is
> " define "year" - "IDDate" as a hierarchy", which is the benefit ?
>       [image: Imágenes integradas 1]
> Where:
>
>    -
>
>    IDData is PK of Dim table, Unique & Identity
>    - Year is a Normal Dim --> I will have precalculated by years
>
>
> Nowadays, in derived columns, the Host column, is always the PK of table ?
>
>
>
>
> 2016-12-09 15:25 GMT+01:00 ShaoFeng Shi <sh...@apache.org>:
>
>> Hi Albert, I think you're raising a good question; Many users face such
>> questions when using Kylin in their cases. Let me try to share some my
>> cents.
>>
>> "Derived" or "Joint" ?
>> These are two independent means in Kylin (they're not conflict). Using
>> which depends on how these dimensions being used I think;
>>
>> Take the "IDDate" case you mentioned as an example; If most of you
>> queries are aggregated at the PK/FK level (which is date), and user just
>> want to in passing other fields like its "MonthTxt", "DayWeekTxt", defining
>> them as "derived" will be very good.
>>
>> But if you also want to aggregate also at "MonthTxt" or "DayWeekTxt"
>> level, defining them as "Derived" might not be good; Because Kylin need
>> translates the condition of "MonthTxt" into a set of PK values ("IDDate"),
>> and then query from Cube with these values, because the cube only
>> pre-aggreated at "IDDate"; This will slow down the query; (Ofcause if your
>> dataset is small it still be acceptable)
>>
>> Besides, defining "Year" - "DayWeek_ID" - "Month_ID" as hierarchy is not
>> suggested, because they are not a hierarchy relationship, but in parallel
>> here;  ("March" is not a child of "2016", it appears in every year)
>>
>> "Joint" can be used in two typical cases:
>> 1) combine multiple ultra low cardinality dimensions
>> 2) combine dimensions which has 1:1 (like "Month_ID" "Month_Txt") or
>> close to 1:1 relationship (like "USER_ID", "USER_EMAIL")
>>
>> For case 1, I might design the cube in this way (assume you have the need
>> to group by year, month, dayweek):
>> 1) define all them as normal dimension
>> 2) define "year" - "IDDate" as a hierarchy
>> 3) define "Month_ID" + "Month_Txt" as a joint, because they're 1:1
>> 4) define "DayWeek_ID" + "DayWeek_Txt" as a joint, because they're 1:1
>>
>>
>> For case 2, I have the same suggestion as above.
>>
>> 2016-12-09 7:10 GMT+08:00 Alberto Ramón <a....@gmail.com>:
>>
>>> Typical case 1:
>>>
>>> *IDDate*
>>>
>>> *Month_ID*
>>>
>>> *Month_Txt*
>>>
>>> *DayWeek_ID*
>>>
>>> *DayWeek_Txt*
>>>
>>> *Year*
>>>
>>> 2016-03-01
>>>
>>> 3
>>>
>>> March
>>>
>>> 2
>>>
>>> Wendesday
>>>
>>> 2016
>>>
>>> 2016-03-02
>>>
>>> 3
>>>
>>> March
>>>
>>> 3
>>>
>>> Thursday
>>>
>>> 2016
>>>
>>> 2016-03-02
>>>
>>> 3
>>>
>>> March
>>>
>>> 4
>>>
>>> Friday
>>>
>>> 2016
>>>
>>> IDDate is PK of Dim table and Unique
>>>
>>>
>>> SOL 1: Uses Hierarchy and Derived from non PK column
>>>
>>>
>>> *Month_ID*
>>>
>>> Hierarchy 2
>>>
>>> Normal 1
>>>
>>> *Month_Txt*
>>>
>>>
>>> Derived 1
>>>
>>> *DayWeek_ID*
>>>
>>> Hierarchy 3
>>>
>>> Normal 2
>>>
>>> *DayWeek_Txt*
>>>
>>>
>>> Derived 2
>>>
>>> *Year*
>>>
>>> Hierarchy 1
>>>
>>> Normal 3
>>>
>>> Year > Month > Day
>>>
>>> Text are derived from ID (in month and Week)
>>>
>>> PB1: KYLIN-444 <https://issues.apache.org/jira/browse/KYLIN-444>
>>>
>>> PB2: I don't know how create Derived column from non PK with actual UI (Kylin
>>> – 1313 <https://issues.apache.org/jira/browse/KYLIN-1313> v1.5.2 Kylin
>>> 1786 <https://issues.apache.org/jira/browse/KYLIN-1786>, v1.5.3)
>>>
>>>
>>>
>>> SOL 2:
>>>
>>> *Month_ID*
>>>
>>> Hierarchy 2
>>>
>>> Join 1
>>>
>>> *Month_Txt*
>>>
>>>
>>> Join 1
>>>
>>> *DayWeek_ID*
>>>
>>> Hierarchy 3
>>>
>>> Join 2
>>>
>>> *DayWeek_Txt*
>>>
>>>
>>> Join 2
>>>
>>> *Year*
>>>
>>> Hierarchy 1
>>>
>>> Normal 3
>>>
>>>
>>> SOL 2 is this the best solution ??
>>>
>>>
>>>
>>> Typical case 2:
>>>
>>> I see the same scenario a lot of times (derived columns with 1:1
>>> Relation)
>>>
>>> Product_ID *(PK)*
>>>
>>> Product_TXT
>>>
>>> TypeProduct_ID
>>>
>>> TypeProduct_TXT
>>>
>>> Country_TXT
>>>
>>> Country_ID
>>>
>>> Optimize queries by product / category / country, are mandatory
>>>
>>> Perhaps,
>>>
>>> Country (lower cardinality) its a good candidate to Join
>>>
>>> I don't want put Product_TXT as Join, because is a longgggg text, and
>>> can be affect Row_Key of HBase, but I need Queries like ... where
>>> product_TXT = ""iRobot Roomba 650 Robotic Vacuum Cleaner
>>>
>>> suggestions ?
>>>
>>
>>
>>
>> --
>> Best regards,
>>
>> Shaofeng Shi 史少锋
>>
>>
>


-- 
Best regards,

Shaofeng Shi 史少锋

Re: Use derived or Joint

Posted by Alberto Ramón <a....@gmail.com>.

thanks for you clear explanation !!


The only point that I can't understand is
" define "year" - "IDDate" as a hierarchy", which is the benefit ?
      [image: Imágenes integradas 1]
Where:

   -

   IDData is PK of Dim table, Unique & Identity
   - Year is a Normal Dim --> I will have precalculated by years


Nowadays, in derived columns, the Host column, is always the PK of table ?




2016-12-09 15:25 GMT+01:00 ShaoFeng Shi <sh...@apache.org>:

> Hi Albert, I think you're raising a good question; Many users face such
> questions when using Kylin in their cases. Let me try to share some my
> cents.
>
> "Derived" or "Joint" ?
> These are two independent means in Kylin (they're not conflict). Using
> which depends on how these dimensions being used I think;
>
> Take the "IDDate" case you mentioned as an example; If most of you queries
> are aggregated at the PK/FK level (which is date), and user just want to in
> passing other fields like its "MonthTxt", "DayWeekTxt", defining them as
> "derived" will be very good.
>
> But if you also want to aggregate also at "MonthTxt" or "DayWeekTxt"
> level, defining them as "Derived" might not be good; Because Kylin need
> translates the condition of "MonthTxt" into a set of PK values ("IDDate"),
> and then query from Cube with these values, because the cube only
> pre-aggreated at "IDDate"; This will slow down the query; (Ofcause if your
> dataset is small it still be acceptable)
>
> Besides, defining "Year" - "DayWeek_ID" - "Month_ID" as hierarchy is not
> suggested, because they are not a hierarchy relationship, but in parallel
> here;  ("March" is not a child of "2016", it appears in every year)
>
> "Joint" can be used in two typical cases:
> 1) combine multiple ultra low cardinality dimensions
> 2) combine dimensions which has 1:1 (like "Month_ID" "Month_Txt") or close
> to 1:1 relationship (like "USER_ID", "USER_EMAIL")
>
> For case 1, I might design the cube in this way (assume you have the need
> to group by year, month, dayweek):
> 1) define all them as normal dimension
> 2) define "year" - "IDDate" as a hierarchy
> 3) define "Month_ID" + "Month_Txt" as a joint, because they're 1:1
> 4) define "DayWeek_ID" + "DayWeek_Txt" as a joint, because they're 1:1
>
>
> For case 2, I have the same suggestion as above.
>
> 2016-12-09 7:10 GMT+08:00 Alberto Ramón <a....@gmail.com>:
>
>> Typical case 1:
>>
>> *IDDate*
>>
>> *Month_ID*
>>
>> *Month_Txt*
>>
>> *DayWeek_ID*
>>
>> *DayWeek_Txt*
>>
>> *Year*
>>
>> 2016-03-01
>>
>> 3
>>
>> March
>>
>> 2
>>
>> Wendesday
>>
>> 2016
>>
>> 2016-03-02
>>
>> 3
>>
>> March
>>
>> 3
>>
>> Thursday
>>
>> 2016
>>
>> 2016-03-02
>>
>> 3
>>
>> March
>>
>> 4
>>
>> Friday
>>
>> 2016
>>
>> IDDate is PK of Dim table and Unique
>>
>>
>> SOL 1: Uses Hierarchy and Derived from non PK column
>>
>>
>> *Month_ID*
>>
>> Hierarchy 2
>>
>> Normal 1
>>
>> *Month_Txt*
>>
>>
>> Derived 1
>>
>> *DayWeek_ID*
>>
>> Hierarchy 3
>>
>> Normal 2
>>
>> *DayWeek_Txt*
>>
>>
>> Derived 2
>>
>> *Year*
>>
>> Hierarchy 1
>>
>> Normal 3
>>
>> Year > Month > Day
>>
>> Text are derived from ID (in month and Week)
>>
>> PB1: KYLIN-444 <https://issues.apache.org/jira/browse/KYLIN-444>
>>
>> PB2: I don't know how create Derived column from non PK with actual UI (Kylin
>> – 1313 <https://issues.apache.org/jira/browse/KYLIN-1313> v1.5.2 Kylin
>> 1786 <https://issues.apache.org/jira/browse/KYLIN-1786>, v1.5.3)
>>
>>
>>
>> SOL 2:
>>
>> *Month_ID*
>>
>> Hierarchy 2
>>
>> Join 1
>>
>> *Month_Txt*
>>
>>
>> Join 1
>>
>> *DayWeek_ID*
>>
>> Hierarchy 3
>>
>> Join 2
>>
>> *DayWeek_Txt*
>>
>>
>> Join 2
>>
>> *Year*
>>
>> Hierarchy 1
>>
>> Normal 3
>>
>>
>> SOL 2 is this the best solution ??
>>
>>
>>
>> Typical case 2:
>>
>> I see the same scenario a lot of times (derived columns with 1:1 Relation)
>>
>> Product_ID *(PK)*
>>
>> Product_TXT
>>
>> TypeProduct_ID
>>
>> TypeProduct_TXT
>>
>> Country_TXT
>>
>> Country_ID
>>
>> Optimize queries by product / category / country, are mandatory
>>
>> Perhaps,
>>
>> Country (lower cardinality) its a good candidate to Join
>>
>> I don't want put Product_TXT as Join, because is a longgggg text, and can
>> be affect Row_Key of HBase, but I need Queries like ... where product_TXT =
>> ""iRobot Roomba 650 Robotic Vacuum Cleaner
>>
>> suggestions ?
>>
>
>
>
> --
> Best regards,
>
> Shaofeng Shi 史少锋
>
>

Re: Use derived or Joint

Posted by ShaoFeng Shi <sh...@apache.org>.

Hi Albert, I think you're raising a good question; Many users face such
questions when using Kylin in their cases. Let me try to share some my
cents.

"Derived" or "Joint" ?
These are two independent means in Kylin (they're not conflict). Using
which depends on how these dimensions being used I think;

Take the "IDDate" case you mentioned as an example; If most of you queries
are aggregated at the PK/FK level (which is date), and user just want to in
passing other fields like its "MonthTxt", "DayWeekTxt", defining them as
"derived" will be very good.

But if you also want to aggregate also at "MonthTxt" or "DayWeekTxt" level,
defining them as "Derived" might not be good; Because Kylin need translates
the condition of "MonthTxt" into a set of PK values ("IDDate"), and then
query from Cube with these values, because the cube only pre-aggreated at
"IDDate"; This will slow down the query; (Ofcause if your dataset is small
it still be acceptable)

Besides, defining "Year" - "DayWeek_ID" - "Month_ID" as hierarchy is not
suggested, because they are not a hierarchy relationship, but in parallel
here;  ("March" is not a child of "2016", it appears in every year)

"Joint" can be used in two typical cases:
1) combine multiple ultra low cardinality dimensions
2) combine dimensions which has 1:1 (like "Month_ID" "Month_Txt") or close
to 1:1 relationship (like "USER_ID", "USER_EMAIL")

For case 1, I might design the cube in this way (assume you have the need
to group by year, month, dayweek):
1) define all them as normal dimension
2) define "year" - "IDDate" as a hierarchy
3) define "Month_ID" + "Month_Txt" as a joint, because they're 1:1
4) define "DayWeek_ID" + "DayWeek_Txt" as a joint, because they're 1:1


For case 2, I have the same suggestion as above.

2016-12-09 7:10 GMT+08:00 Alberto Ramón <a....@gmail.com>:

> Typical case 1:
>
> *IDDate*
>
> *Month_ID*
>
> *Month_Txt*
>
> *DayWeek_ID*
>
> *DayWeek_Txt*
>
> *Year*
>
> 2016-03-01
>
> 3
>
> March
>
> 2
>
> Wendesday
>
> 2016
>
> 2016-03-02
>
> 3
>
> March
>
> 3
>
> Thursday
>
> 2016
>
> 2016-03-02
>
> 3
>
> March
>
> 4
>
> Friday
>
> 2016
>
> IDDate is PK of Dim table and Unique
>
>
> SOL 1: Uses Hierarchy and Derived from non PK column
>
>
> *Month_ID*
>
> Hierarchy 2
>
> Normal 1
>
> *Month_Txt*
>
>
> Derived 1
>
> *DayWeek_ID*
>
> Hierarchy 3
>
> Normal 2
>
> *DayWeek_Txt*
>
>
> Derived 2
>
> *Year*
>
> Hierarchy 1
>
> Normal 3
>
> Year > Month > Day
>
> Text are derived from ID (in month and Week)
>
> PB1: KYLIN-444 <https://issues.apache.org/jira/browse/KYLIN-444>
>
> PB2: I don't know how create Derived column from non PK with actual UI (Kylin
> – 1313 <https://issues.apache.org/jira/browse/KYLIN-1313> v1.5.2 Kylin
> 1786 <https://issues.apache.org/jira/browse/KYLIN-1786>, v1.5.3)
>
>
>
> SOL 2:
>
> *Month_ID*
>
> Hierarchy 2
>
> Join 1
>
> *Month_Txt*
>
>
> Join 1
>
> *DayWeek_ID*
>
> Hierarchy 3
>
> Join 2
>
> *DayWeek_Txt*
>
>
> Join 2
>
> *Year*
>
> Hierarchy 1
>
> Normal 3
>
>
> SOL 2 is this the best solution ??
>
>
>
> Typical case 2:
>
> I see the same scenario a lot of times (derived columns with 1:1 Relation)
>
> Product_ID *(PK)*
>
> Product_TXT
>
> TypeProduct_ID
>
> TypeProduct_TXT
>
> Country_TXT
>
> Country_ID
>
> Optimize queries by product / category / country, are mandatory
>
> Perhaps,
>
> Country (lower cardinality) its a good candidate to Join
>
> I don't want put Product_TXT as Join, because is a longgggg text, and can
> be affect Row_Key of HBase, but I need Queries like ... where product_TXT =
> ""iRobot Roomba 650 Robotic Vacuum Cleaner
>
> suggestions ?
>



-- 
Best regards,

Shaofeng Shi 史少锋

Re: Use derived or Joint

Posted by ShaoFeng Shi <sh...@apache.org>.

Hi Albert, I think you're raising a good question; Many users face such
questions when using Kylin in their cases. Let me try to share some my
cents.

"Derived" or "Joint" ?
These are two independent means in Kylin (they're not conflict). Using
which depends on how these dimensions being used I think;

Take the "IDDate" case you mentioned as an example; If most of you queries
are aggregated at the PK/FK level (which is date), and user just want to in
passing other fields like its "MonthTxt", "DayWeekTxt", defining them as
"derived" will be very good.

But if you also want to aggregate also at "MonthTxt" or "DayWeekTxt" level,
defining them as "Derived" might not be good; Because Kylin need translates
the condition of "MonthTxt" into a set of PK values ("IDDate"), and then
query from Cube with these values, because the cube only pre-aggreated at
"IDDate"; This will slow down the query; (Ofcause if your dataset is small
it still be acceptable)

Besides, defining "Year" - "DayWeek_ID" - "Month_ID" as hierarchy is not
suggested, because they are not a hierarchy relationship, but in parallel
here;  ("March" is not a child of "2016", it appears in every year)

"Joint" can be used in two typical cases:
1) combine multiple ultra low cardinality dimensions
2) combine dimensions which has 1:1 (like "Month_ID" "Month_Txt") or close
to 1:1 relationship (like "USER_ID", "USER_EMAIL")

For case 1, I might design the cube in this way (assume you have the need
to group by year, month, dayweek):
1) define all them as normal dimension
2) define "year" - "IDDate" as a hierarchy
3) define "Month_ID" + "Month_Txt" as a joint, because they're 1:1
4) define "DayWeek_ID" + "DayWeek_Txt" as a joint, because they're 1:1


For case 2, I have the same suggestion as above.

2016-12-09 7:10 GMT+08:00 Alberto Ramón <a....@gmail.com>:

> Typical case 1:
>
> *IDDate*
>
> *Month_ID*
>
> *Month_Txt*
>
> *DayWeek_ID*
>
> *DayWeek_Txt*
>
> *Year*
>
> 2016-03-01
>
> 3
>
> March
>
> 2
>
> Wendesday
>
> 2016
>
> 2016-03-02
>
> 3
>
> March
>
> 3
>
> Thursday
>
> 2016
>
> 2016-03-02
>
> 3
>
> March
>
> 4
>
> Friday
>
> 2016
>
> IDDate is PK of Dim table and Unique
>
>
> SOL 1: Uses Hierarchy and Derived from non PK column
>
>
> *Month_ID*
>
> Hierarchy 2
>
> Normal 1
>
> *Month_Txt*
>
>
> Derived 1
>
> *DayWeek_ID*
>
> Hierarchy 3
>
> Normal 2
>
> *DayWeek_Txt*
>
>
> Derived 2
>
> *Year*
>
> Hierarchy 1
>
> Normal 3
>
> Year > Month > Day
>
> Text are derived from ID (in month and Week)
>
> PB1: KYLIN-444 <https://issues.apache.org/jira/browse/KYLIN-444>
>
> PB2: I don't know how create Derived column from non PK with actual UI (Kylin
> – 1313 <https://issues.apache.org/jira/browse/KYLIN-1313> v1.5.2 Kylin
> 1786 <https://issues.apache.org/jira/browse/KYLIN-1786>, v1.5.3)
>
>
>
> SOL 2:
>
> *Month_ID*
>
> Hierarchy 2
>
> Join 1
>
> *Month_Txt*
>
>
> Join 1
>
> *DayWeek_ID*
>
> Hierarchy 3
>
> Join 2
>
> *DayWeek_Txt*
>
>
> Join 2
>
> *Year*
>
> Hierarchy 1
>
> Normal 3
>
>
> SOL 2 is this the best solution ??
>
>
>
> Typical case 2:
>
> I see the same scenario a lot of times (derived columns with 1:1 Relation)
>
> Product_ID *(PK)*
>
> Product_TXT
>
> TypeProduct_ID
>
> TypeProduct_TXT
>
> Country_TXT
>
> Country_ID
>
> Optimize queries by product / category / country, are mandatory
>
> Perhaps,
>
> Country (lower cardinality) its a good candidate to Join
>
> I don't want put Product_TXT as Join, because is a longgggg text, and can
> be affect Row_Key of HBase, but I need Queries like ... where product_TXT =
> ""iRobot Roomba 650 Robotic Vacuum Cleaner
>
> suggestions ?
>



-- 
Best regards,

Shaofeng Shi 史少锋