Posted to user@kylin.apache.org by Li Yang <li...@apache.org> on 2016/07/05 13:07:03 UTC

Re: Dimension table 300MB Limit

For question 1, the answer is yes. Putting all clusters in one joint still
allows querying any one of them. It is just not optimal if you only want to
select date & clusterX, because under the hood all clusters are always
selected together even though only one of them is needed.

For question 2, it's also yes.
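
For reference, such a joint lives in the "aggregation_groups" section of the
cube descriptor JSON. A minimal sketch for question 2, assuming the dimension
names DATE, CUSTOMER and CLUSTER1 ... CLUSTER10 from your example (field names
as in the v1.5 cube descriptor):

    "aggregation_groups": [
      {
        "includes": ["DATE", "CUSTOMER", "CLUSTER1", "CLUSTER2", "CLUSTER3",
                     "CLUSTER4", "CLUSTER5", "CLUSTER6", "CLUSTER7",
                     "CLUSTER8", "CLUSTER9", "CLUSTER10"],
        "select_rule": {
          "hierarchy_dims": [],
          "mandatory_dims": [],
          "joint_dims": [
            ["DATE", "CLUSTER1", "CLUSTER2", "CLUSTER3", "CLUSTER4",
             "CLUSTER5", "CLUSTER6", "CLUSTER7", "CLUSTER8", "CLUSTER9",
             "CLUSTER10"]
          ]
        }
      }
    ]

Your "SELECT date, cluster1, COUNT(*) FROM fact GROUP BY date, cluster1" then
still answers correctly; it just reads the joint cuboid, which also carries
the other nine clusters.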

Sorry for the late reply. Very busy recently.

Yang

On Wed, Jun 29, 2016 at 6:18 AM, Richard Calaba (Fishbowl) <
rcalaba@fishbowl.com> wrote:

> Hi Li Yang,
>
>
>
> Can we get a better example of how to configure the JSON to define the
> “extended” measure? And some description of what exactly it does and what
> the impact is on cube build and query …
>
>
>
> The joint dimensions might be a good idea as well, and they are not limited
> to lookup tables / the fact table, right? … let’s use this scenario:
>
>
>
>     --- Dimensions: date, customer, cluster1, cluster2, …, cluster10, plus
> measures …
>
>
>
>    Question 1: if I define the joint dimensions {cluster1, cluster2, …,
> cluster10}, can I still run a correct SQL query: SELECT date, cluster1,
> COUNT(*) FROM fact GROUP BY date, cluster1 ?
>
> Meaning I am not specifying a WHERE filter on, nor selecting, the values of
> cluster2/3/…/10 anywhere …. But I might have a 2nd query doing the same
> grouping and COUNT(*) logic, just for cluster2 … or 3 … or 10 …
>
>
>
>   Question 2: as the clusterX values depend on the date dimension, the date
> will always be in the query -> should I then define the joint dimensions
> {date, cluster1, cluster2, …, cluster10} ?
>
>
>
>    Question 3: if the answer to Question 1 is that this is not a correct
> joint dimension definition, but the answer to Question 2 is that date should
> be part of the joint dimension definition when the value of clusterX depends
> on the date dimension … then I conclude that I can optimize the cuboid
> count by specifying 10 joint dimension groups:
>
>
>
>                 {date, cluster1}
>
>                 {date, cluster2}
>
>                 ….
>
>                 {date, cluster10}
>
>
>
>                                 Right ???
>
>
>
> Please help us understand these advanced topics … (I have read
> https://kylin.apache.org/blog/2016/02/18/new-aggregation-group/ ), which
> states only this:
>
> ·         *Joint rules.* This is a newly introduced rule. If two or more
> dimensions are “joint”, then any valid cuboid will either contain none of
> these dimensions, or contain them all. In other words, these dimensions
> will always be “together”. This is useful when the cube designer is sure
> some of the dimensions will always be queried together. It is also a
> nuclear weapon for combination pruning on less-likely-to-use dimensions.
> Suppose you have 20 dimensions: the first 10 are frequently used and the
> latter 10 are less likely to be used. By marking the latter 10 dimensions
> as “joint”, you’re effectively reducing the cuboid count from 2^20
> to 2^11. Actually this is pretty much what the old “aggregation group”
> mechanism was for. If you were using it prior to Kylin v1.5, our metadata
> upgrade tool will automatically translate it to joint semantics.
> By flexibly using the new aggregation group you can in theory control
> whatever cuboid to compute/skip. This can significantly reduce the
> computation and storage overhead, especially when the cube is serving a
> fixed dashboard, which will repeatedly issue SQL queries that only require
> some specific cuboids. In extreme cases you can configure each AGG to
> contain only one cuboid, and a handful of AGGs will constitute the cuboid
> whitelist that you’ll need.
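>
> (If I read this correctly, since mandatory dimensions appear in every cuboid
> of their group, an AGG whose included dimensions are all mandatory would be
> pinned to exactly one cuboid – something like this, with the placeholder
> names from above:
>
> "aggregation_groups": [
>   {
>     "includes": ["DATE", "CLUSTER1"],
>     "select_rule": {
>       "hierarchy_dims": [],
>       "mandatory_dims": ["DATE", "CLUSTER1"],
>       "joint_dims": []
>     }
>   }
> ]
>
> – contributing only the [DATE, CLUSTER1] cuboid. Is that the intended
> usage?)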
>
> Thank you, Richard.
>
>
>
> *From:* Li Yang [mailto:liyang@apache.org]
> *Sent:* Tuesday, June 28, 2016 7:16 AM
>
> *To:* user@kylin.apache.org
> *Subject:* Re: Dimension table 300MB Limit
>
>
>
> There are options to treat columns on the fact table without triggering the
> dimension explosion (just like derived dimensions). One is the "joint"
> dimensions introduced in v1.5. Another is the "extended" measure. The
> related documentation needs to catch up, however.
>
> Yang
>
>
>
> On Tue, Jun 28, 2016 at 10:09 AM, Arun Khetarpal <ak...@gmail.com>
> wrote:
>
> I agree with Ric - forcing the dimension values back into the fact table may
> be a step back.
>
> I propose to open a Jira to track this issue (and possibly work on this) -
> Thoughts/Suggestions?
>
>
>
> Regards,
>
> Arun
>
>
>
>
>
>
>
> On 28 June 2016 at 06:51, Richard Calaba (Fishbowl) <rc...@fishbowl.com>
> wrote:
>
> Did a little search in bin/*.sh and found setenv.sh, so I tried setting the
> KYLIN_JVM_SETTINGS environment variable to -Xms1024M -Xmx16g – this resolved
> the ‘sudden’ death of the Kylin server after increasing
> kylin.table.snapshot.max_mb
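>
> For anyone hitting the same thing, the change in bin/setenv.sh amounts to
> something like the line below (16g is simply what worked on our node – tune
> it to your own memory; leave the other default flags in setenv.sh as they
> are):
>
> # bin/setenv.sh
> export KYLIN_JVM_SETTINGS="-Xms1024M -Xmx16g"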
>
>
>
> So far all looks good, fingers crossed :)
>
>
>
> Ric.
>
>
>
> *From:* Richard Calaba (Fishbowl) [mailto:rcalaba@fishbowl.com]
> *Sent:* Monday, June 27, 2016 5:48 PM
> *To:* user@kylin.apache.org
> *Cc:* 'Richard Calaba (Fishbowl)' <rc...@fishbowl.com>
>
>
> *Subject:* RE: Dimension table 300MB Limit
>
>
>
> I am facing errors in kylin.log complaining about less than 100MB of memory
> available -> then the Kylin server dies silently. The issue is caused by a
> high-cardinality dimension which requires an approx. 700MB data snapshot. I
> have increased the parameter kylin.table.snapshot.max_mb to 750MB – with
> this setting, Build Step 4 no longer complains about the snapshot exceeding
> 300MB (the exception java.lang.IllegalStateException: Table snapshot should
> be no greater than 300 MB is gone), but the server dies after a while. There
> is plenty of free memory on the node where Kylin runs (more than 20GB free),
> so it seems to be a problem with Kylin's total memory limit. I didn’t find a
> way to increase the Kylin memory limit so that the big snapshot won’t kill
> the Kylin server …. How to do that?
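>
> For reference, the snapshot limit itself was this one-line change (value in
> MB; the default is 300):
>
> # conf/kylin.properties
> kylin.table.snapshot.max_mb=750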
>
>
>
> It is urgent! :)
>
>
>
> Thanx, ric
>
>
>
> *From:* Richard Calaba (Fishbowl) [mailto:rcalaba@fishbowl.com
> <rc...@fishbowl.com>]
> *Sent:* Monday, June 27, 2016 5:23 PM
> *To:* 'user@kylin.apache.org' <us...@kylin.apache.org>
> *Subject:* RE: Dimension table 300MB Limit
>
>
>
> I have 2 scenarios:
>
>
>
> 1)      time-dependent attributes of customer – here it might be an option
> to put those on the fact table, as the values are derived from date and ID
> -> but I need those dimensions to be “derived” from the fact table (2 fields
> – date and id – define the value). I have 10 fields like that in the lookup
> table, so bringing those in as independent (normal) dimensions would
> increase the build time by 2^10 times, right … ???
>
>
>
> 2)      the 2nd scenario is similar – a lot of attributes of customer (which
> is the high-cardinality dimension – approx. 10 million customers) to be used
> as derived dimensions
>
>
>
> Forcing the high-cardinality dimensions into the fact table is in my opinion
> a step back – we are denormalizing the star schema ….
>
>
>
> Ric.
>
>
>
> *From:* Li Yang [mailto:liyang@apache.org <li...@apache.org>]
> *Sent:* Monday, June 27, 2016 3:45 PM
> *To:* user@kylin.apache.org
> *Subject:* Re: Dimension table 300MB Limit
>
>
>
> Such big dimensions are better off as part of the fact table (rather than on
> a lookup table). The simplest way is to create a Hive view joining the old
> fact table and the customer table, then assign the view as the new fact
> table.
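>
> A sketch of such a view, with hypothetical table and column names (a fact
> table "fact" joined to "customer" on customer_id; swap in your real
> columns):
>
> -- Hive: denormalize the customer attributes into a view, then point
> -- the Kylin model at fact_with_customer instead of fact.
> CREATE VIEW fact_with_customer AS
> SELECT f.*, c.segment, c.region
> FROM fact f
> JOIN customer c ON f.customer_id = c.id;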
>
>
>
> On Tue, Jun 28, 2016 at 5:26 AM, Richard Calaba (Fishbowl) <
> rcalaba@fishbowl.com> wrote:
>
> We have the same issue, though our size is just 700MB …. So we are
> interested in the background info and in workarounds other than setting a
> higher snapshot limit … if any?
>
>
>
> Ric.
>
>
>
> *From:* Arun Khetarpal [mailto:akhetarp@gmail.com]
> *Sent:* Monday, June 27, 2016 11:55 AM
> *To:* user@kylin.apache.org
> *Subject:* Dimension table 300MB Limit
>
>
>
> Hi,
>
>
>
> We are evaluating Kylin as an analytical engine for OLAP. We are facing OOM
> issues when dealing with large dimensions, ~70GB (customer data) [we set
> kylin.table.snapshot.max_mb to a high limit].
>
>
>
> I guess having a Dictionary this big in memory will not be a solution. Is
> there any suggested workaround for the same?
>
>
>
> Has any work been done by the community to get around this?
>
>
>
> Regards,
>
> Arun
>