Posted to dev@kylin.apache.org by ShaoFeng Shi <sh...@apache.org> on 2017/01/25 08:46:52 UTC

New document: "How to optimize cube build"

Hello,

A new document is added for the practices of cube build. Any suggestions
or comments are welcome; we can update the doc later based on feedback.

Here is the link:
https://kylin.apache.org/docs16/howto/howto_optimize_build.html

-- 
Best regards,

Shaofeng Shi 史少锋

Re: New document: "How to optimize cube build"

Posted by ShaoFeng Shi <sh...@gmail.com>.
correct.

On Mon, Feb 13, 2017 at 3:52 PM +0800, "Ajay Chitre" <ch...@gmail.com> wrote:

In this case, if a user runs a query with a WHERE clause that has 2 dimensions from the "aggregation group" and 2 dimensions from the "other 5 dimensions", Kylin will compute the results from the base cuboid, correct? Or would it error out?

I can test it myself, but I'm being lazy :-) Looking for a quick answer from the experts. Thanks for your help.

Re: New document: "How to optimize cube build"

Posted by Ajay Chitre <ch...@gmail.com>.
In this case, if a user runs a query with a WHERE clause that has 2
dimensions from the "aggregation group" and 2 dimensions from the "other 5
dimensions", Kylin will compute the results from the base cuboid, correct?
Or would it error out?

I can test it myself, but I'm being lazy :-) Looking for a quick answer
from the experts. Thanks for your help.


Re: New document: "How to optimize cube build"

Posted by ShaoFeng Shi <sh...@apache.org>.
Ajay,

There is no such setting, but the "aggregation group" offers something
similar. Say the cube has 15 dimensions in total, but the aggregation
group picks only 10 of them; Kylin will then build 1 (base cuboid) +
2^10 - 1 (combinations of the 10 dimensions) cuboids. This way, the
other 5 dimensions appear only in the base cuboid.
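As a rough illustration (plain Python arithmetic, not Kylin internals; the 15/10 split matches the example above), the pruning effect of such an aggregation group can be counted like this:

```python
def cuboid_count_full(n_dims: int) -> int:
    """A full cube: one cuboid per non-empty subset of dimensions."""
    return 2 ** n_dims - 1

def cuboid_count_agg_group(group_dims: int) -> int:
    """Base cuboid (all dimensions) plus every non-empty
    combination of the dimensions inside the aggregation group."""
    return 1 + (2 ** group_dims - 1)

print(cuboid_count_full(15))       # 32767 for a full 15-dim cube
print(cuboid_count_agg_group(10))  # 1024 with only 10 dims in the group
```

With the 10-dimension group, the build drops from 32,767 cuboids to 1,024; queries touching any of the other 5 dimensions must then be answered from the base cuboid.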


-- 
Best regards,

Shaofeng Shi 史少锋

Re: New document: "How to optimize cube build"

Posted by Ajay Chitre <ch...@gmail.com>.
My question was a general one, not about any specific issue I am
encountering :-)

I understand that we can prune by using hierarchy dimensions,
aggregation groups, etc. But what if these kinds of pruning are not
possible?

Let's say I have 15 dimensions (and I can't prune any). Would Kylin
build all 32,767 cuboids, or is there a property to say "if the number
of dimensions is over X, stop building more cuboids; serve from the
base"? (Knowing this will slow down the queries.)

Please let me know. Thanks.



Re: New document: "How to optimize cube build"

Posted by ShaoFeng Shi <sh...@gmail.com>.
Ajay, thanks for your feedback.
For question 1, the code has been merged into the master branch; the next release will be 2.0, and a beta release will be published soon.
For question 2, yes, your understanding is correct: an N-dimension full cube has 2^N - 1 cuboids. But if you adopt hierarchy or joint dimensions, or separate the dimensions into multiple aggregation groups, it becomes a "partial" cube in which some cuboids are pruned.
If a query uses dimensions across aggregation groups, only the base cuboid can fulfill it; Kylin has to do post-aggregation from the base cuboid, so performance is degraded. Please check whether this is the case on your side.
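The routing behavior described here can be sketched as a simplified model (illustrative Python, not Kylin's actual query planner; the dimension names and group layout are made up): a query is served by a group-local cuboid only when one aggregation group covers all of its dimensions; otherwise only the base cuboid fits, and post-aggregation is needed.

```python
def serving_cuboid(query_dims, agg_groups, all_dims):
    """Pick the cuboid that can answer the query: the exact cuboid inside
    an aggregation group if one group covers every query dimension,
    otherwise the base cuboid (all dimensions, post-aggregation needed)."""
    wanted = set(query_dims)
    for group in agg_groups:
        if wanted <= set(group):
            return wanted              # a group-local cuboid exists
    return set(all_dims)               # cross-group: fall back to base

all_dims = [f"d{i}" for i in range(15)]
agg_groups = [all_dims[:10]]           # 10 of the 15 dims in one group

in_group = serving_cuboid(["d1", "d2"], agg_groups, all_dims)
cross = serving_cuboid(["d1", "d12"], agg_groups, all_dims)
print(len(in_group), len(cross))  # 2 vs 15: cross-group goes to base
```

In this model a cross-group query never errors out; it just scans the base cuboid, which is why such queries are slower rather than failing.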

Re: New document: "How to optimize cube build"

Posted by Ajay Chitre <ch...@gmail.com>.
Thanks for writing this document. It's very helpful. I have the
following questions:

1) The doc says... "Kylin will build dictionaries in memory (in next
version this will be moved to MR)".

Which version can we expect this in? For large cubes this process takes
a long time on the local machine. We really need to move this to the
Hadoop cluster. In fact, it would be great to have an option to run
this under Spark :-)

2) About the "Build N-Dimension Cuboid" step.

Does Kylin build ALL cuboids? My understanding is:

Total number of cuboids = 2^(number of dimensions) - 1

Correct?

So if there are 7 dimensions, there will be 127 cuboids, right? Does
Kylin create ALL of them?

I was under the impression that, after some point, Kylin would just
compute measures from the base cuboid instead of building all of them.
Please explain.

Thanks.
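The count in question 2 can be checked directly (a toy sketch in Python; the dimension names are invented): a full cube materializes one cuboid per non-empty subset of the dimensions, hence 2^N - 1.

```python
from itertools import combinations

def cuboid_count(n_dims: int) -> int:
    """One cuboid per non-empty subset of the dimensions."""
    return 2 ** n_dims - 1

# Enumerate the subsets of a tiny cube to see where 2^N - 1 comes from.
dims = ["date", "city", "product"]
subsets = [c for r in range(1, len(dims) + 1)
           for c in combinations(dims, r)]

assert len(subsets) == cuboid_count(3) == 7
print(cuboid_count(7))  # 127, as in the 7-dimension example
```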




Re: New document: "How to optimize cube build"

Posted by Li Yang <li...@apache.org>.
Feel free to update the document with different opinions. :-)


Re: New document: "How to optimize cube build"

Posted by ShaoFeng Shi <sh...@apache.org>.
Hi Alberto,

Thanks for your comments! In many cases the data is imported to Hadoop
in T+1 mode. Especially when each day's data is tens of GB, it is
reasonable to partition the Hive table by date. The question is whether
it is worth keeping a long history in Hive; usually users only keep a
couple of months' data there. If the partition number exceeds the
threshold in Hive, they can easily remove the oldest partitions or move
them to another table. I think that is a common practice with Hive, and
it is very good to know that Hive 2.0 will address this.



-- 
Best regards,

Shaofeng Shi 史少锋


Re: New document: "How to optimize cube build"

Posted by Alberto Ramón <a....@gmail.com>.
Be careful about partitioning by "FLIGHTDATE".

From https://github.com/albertoRamon/Kylin/tree/master/KylinPerformance

*"Option 1: Use id_date as partition column on Hive table. This has a big
problem: the Hive metastore is meant for a few hundred partitions, not
thousands (HIVE-9452 is an idea to solve this, but it isn't in progress)"*

Hive 2.0 will include a preview (for testing only) that addresses this.
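The scale gap is easy to quantify (back-of-the-envelope Python; the date range is hypothetical): daily partitioning crosses "a few hundred" partitions in well under a year of history.

```python
from datetime import date

def daily_partition_count(start: date, end: date) -> int:
    """One Hive partition per day, endpoints inclusive."""
    return (end - start).days + 1

# Five years of daily flight data, as in the FLIGHTDATE example.
n = daily_partition_count(date(2012, 1, 1), date(2016, 12, 31))
print(n)  # 1827: far beyond "a few hundred" partitions
```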

