Posted to dev@kylin.apache.org by hongbin ma <ma...@apache.org> on 2015/09/01 13:07:52 UTC

Re: Lookup Table Enumerator high memory

>     for 1) it seems that only the resource path, table desc, etc. are
> kept in memory, while a new lookupstringtable is created per query/request
> and holds onto its data for the lifetime of the request. So once the
> request is done, it should be garbage-collectable?

/table holds only the Hive table's schema. The lookup table content is
cached in SnapshotManager, and so far nothing evicts it. So if you have
a lot of large lookup tables, this will be a problem.
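
To illustrate what an eviction-based cache for snapshots could look like, here is a minimal sketch of a size-bounded LRU cache built on java.util.LinkedHashMap. It is not Kylin's SnapshotManager: the class name LruSnapshotCache, the capacity, and the resource paths are all hypothetical, chosen only for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: a size-bounded LRU cache keyed by snapshot resource path.
// This is NOT Kylin's SnapshotManager; names and capacity are illustrative only.
class LruSnapshotCache<V> {
    private final Map<String, V> cache;

    LruSnapshotCache(final int maxEntries) {
        // accessOrder=true keeps entries ordered from least to most recently used.
        this.cache = new LinkedHashMap<String, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, V> eldest) {
                // Called after each insertion; returning true drops the LRU entry.
                return size() > maxEntries;
            }
        };
    }

    synchronized V get(String resourcePath) { return cache.get(resourcePath); }

    synchronized void put(String resourcePath, V snapshot) { cache.put(resourcePath, snapshot); }

    synchronized boolean contains(String resourcePath) { return cache.containsKey(resourcePath); }

    synchronized int size() { return cache.size(); }

    public static void main(String[] args) {
        LruSnapshotCache<String> c = new LruSnapshotCache<>(2);
        c.put("/table_snapshot/a", "A");
        c.put("/table_snapshot/b", "B");
        c.get("/table_snapshot/a");       // touch a, so b becomes the eldest
        c.put("/table_snapshot/c", "C");  // exceeds capacity, evicts b
        System.out.println(c.contains("/table_snapshot/b")); // false
    }
}
```

A cache like this only bounds the number of entries, not their size in bytes, and a later reply in this thread cautions that eviction may simply cause snapshots to be swapped in and out whenever queries touch the lookup tables.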


> 3) Also, for the derived filter translator, is there a way to modify
> 'IN_THRESHOLD' via a config file?

Are you facing performance issues with a lot of IN clauses? If so, please
take a look at https://issues.apache.org/jira/browse/KYLIN-740; the patch
will be merged into the next release.

On Mon, Aug 31, 2015 at 9:54 PM, Abhilash L L <ab...@infoworks.io> wrote:

> Sorry for the confusion,
>
>     for 1) it seems that only the resource path, table desc, etc. are
> kept in memory, while a new lookupstringtable is created per query/request
> and holds onto its data for the lifetime of the request. So once the
> request is done, it should be garbage-collectable?
>
> 3) Also, for the derived filter translator, is there a way to modify
> 'IN_THRESHOLD' via a config file?
>
> Regards,
> Abhilash
>
> On Mon, Aug 31, 2015 at 7:05 PM, Abhilash L L <ab...@infoworks.io>
> wrote:
>
> > Hello,
> >
> >     We started noticing that the Kylin Tomcat server is taking a lot of
> > RAM. It even hit a limit of 10 GB.
> >
> >     After spending some time going over the code, it seems that the
> > cube enumerator is not storing anything in memory, but the lookup table
> > enumerator seems to load all records and keep them in memory.
> >
> >     1) What happens when there are a lot of projects defined and we end up
> > with tons of lookup tables across them? Do they get swapped out
> > automatically? I am not able to track down where eviction happens. The
> > snapshot manager has a 'removeSnapshot', but its intent seems different to
> > me.
> >
> >     2) How do we handle really high-cardinality dimensions? E.g., if I have
> > sales as a fact and customers as a dimension, there will be millions of
> > customers. A store is a good candidate to keep in memory, but customers
> > are not. What's the recommended setting when creating the cube to handle
> > such a case?
> >
> > Regards,
> > Abhilash
> >
>



-- 
Regards,

*Bin Mahone | 马洪宾*
Apache Kylin: http://kylin.io
Github: https://github.com/binmahone

Re: Lookup Table Enumerator high memory

Posted by ShaoFeng Shi <sh...@apache.org>.
There are a couple of options for this:

1) Use a Hive view to narrow your wide lookup table, selecting only the
columns you need, and then use this view as the lookup table in the cube.
Kylin supports views as lookup tables from 1.5.x onward.

2) Kylin 2.0 supports using a big table as a lookup. When you create the
model, there is a "not take snapshot" option; with it, Kylin will not load
the table into memory.

2017-07-07 9:04 GMT+08:00 flycshi <fl...@gmail.com>:

> In DictionaryGeneratorCLI, the processSegment method collects the lookup
> tables by dimension column:
>         // snapshot
>         Set<String> toSnapshot = Sets.newHashSet();
>         Set<TableRef> toCheckLookup = Sets.newHashSet();
>         for (DimensionDesc dim : cubeSeg.getCubeDesc().getDimensions()) {
>             TableRef table = dim.getTableRef();
>             // the check in question: any dimension on a lookup table qualifies
>             if (cubeSeg.getModel().isLookupTable(table)) {
>                 toSnapshot.add(table.getTableIdentity());
>                 toCheckLookup.add(table);
>             }
>         }
>
> When a lookup table is large, this step easily fails to load it.
>
> Could this check take more conditions into account? For example, if a
> column is not a derived dimension, then even though it belongs to a
> lookup table, that lookup table would not be loaded into memory.
>
> Otherwise, every lookup table is loaded into memory, which makes big
> lookup tables completely unusable in Kylin.
>
> Looking forward to your reply, thanks.
>



-- 
Best regards,

Shaofeng Shi 史少锋

Re: Lookup Table Enumerator high memory

Posted by flycshi <fl...@gmail.com>.
In DictionaryGeneratorCLI, the processSegment method collects the lookup
tables by dimension column:
        // snapshot
        Set<String> toSnapshot = Sets.newHashSet();
        Set<TableRef> toCheckLookup = Sets.newHashSet();
        for (DimensionDesc dim : cubeSeg.getCubeDesc().getDimensions()) {
            TableRef table = dim.getTableRef();
            // the check in question: any dimension on a lookup table qualifies
            if (cubeSeg.getModel().isLookupTable(table)) {
                toSnapshot.add(table.getTableIdentity());
                toCheckLookup.add(table);
            }
        }

When a lookup table is large, this step easily fails to load it.

Could this check take more conditions into account? For example, if a
column is not a derived dimension, then even though it belongs to a
lookup table, that lookup table would not be loaded into memory.

Otherwise, every lookup table is loaded into memory, which makes big
lookup tables completely unusable in Kylin.
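
The proposed extra condition can be sketched in isolation as follows. This is a self-contained illustration, not a patch against Kylin: Dim is a hypothetical stand-in for DimensionDesc, the table names are made up, and whether Kylin can safely skip the snapshot for non-derived dimensions is not established in this thread.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class SnapshotSelection {
    // Snapshot a lookup table only if at least one derived dimension needs it.
    static Set<String> tablesToSnapshot(List<Dim> dims) {
        Set<String> toSnapshot = new LinkedHashSet<>();
        for (Dim dim : dims) {
            // The extra "derived" condition is the proposed change.
            if (dim.onLookupTable && dim.derived) {
                toSnapshot.add(dim.table);
            }
        }
        return toSnapshot;
    }

    public static void main(String[] args) {
        List<Dim> dims = new ArrayList<>();
        dims.add(new Dim("DEFAULT.STORE", true, true));      // derived, on lookup
        dims.add(new Dim("DEFAULT.CUSTOMER", true, false));  // normal, on lookup
        dims.add(new Dim("DEFAULT.SALES", false, false));    // on the fact table
        System.out.println(tablesToSnapshot(dims)); // [DEFAULT.STORE]
    }
}

// Hypothetical stand-in for a cube dimension; NOT Kylin's DimensionDesc.
class Dim {
    final String table;
    final boolean onLookupTable;
    final boolean derived;

    Dim(String table, boolean onLookupTable, boolean derived) {
        this.table = table;
        this.onLookupTable = onLookupTable;
        this.derived = derived;
    }
}
```

With this condition the large CUSTOMER table in the example would no longer be snapshotted, because no derived dimension depends on it.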

Looking forward to your reply, thanks.


--
View this message in context: http://apache-kylin.74782.x6.nabble.com/Lookup-Table-Enumerator-high-memory-tp1397p8384.html
Sent from the Apache Kylin mailing list archive at Nabble.com.

Re: Lookup Table Enumerator high memory

Posted by Abhilash L L <ab...@infoworks.io>.
Thanks for the clarification.

We were wondering the same thing. For a given cuboid, query performance
will be very sensitive to the order of columns in the row key, similar
to indexes in an RDBMS.

Regards,
Abhilash

On Thu, Sep 3, 2015 at 7:21 PM, Shi, Shaofeng <sh...@ebay.com> wrote:

> Hi Abhilash,
>
> "Mandatory" is a property of a row key column; you can see the option in
> the "Advanced" step. If a column is set to "Mandatory=true", it is moved
> to the head of the row key, and that column is never aggregated away when
> the cube is calculated, so every cuboid includes it. This avoids
> unnecessary calculation and storage. If your query has a WHERE condition
> on that mandatory column, query performance will be very good.
>
> Let me give an example. Assume I have a fact table with the following
> dimensions: date, seller, country.
>
> Of these, date and country are low-cardinality columns and seller is a
> high-cardinality column. As almost all my queries specify a seller, I set
> "seller" as mandatory in the row key; this column is then moved to the
> head of the row key and is not aggregated. The HBase row keys will look
> like:
>
> seller1,cal_dt,country —>
> seller2,cal_dt,country —>
> seller3,cal_dt,country —>
> …
> sellerN,cal_dt,country —>
>
> seller1,cal_dt —>
> seller2,cal_dt —>
> seller3,cal_dt —>
> ...
> sellerN,cal_dt —>
>
> seller1,country —>
> seller2,country —>
> seller3,country —>
>
> ...
> sellerN,country —>
>
>
> As the seller's cardinality is high, the HBase scan range for a given
> seller value is very small, so query performance is good.
>
> If you have SQL queries that do not specify "seller", this cube may not
> provide the same response time. In that case we suggest creating another
> cube without the seller dimension. Multiple cubes can coexist in one
> project, and Kylin will pick the most appropriate cube to serve each
> query.

Re: Lookup Table Enumerator high memory

Posted by "Shi, Shaofeng" <sh...@ebay.com>.
Hi Abhilash,

"Mandatory" is a property of a row key column; you can see the option in
the "Advanced" step. If a column is set to "Mandatory=true", it is moved
to the head of the row key, and that column is never aggregated away when
the cube is calculated, so every cuboid includes it. This avoids
unnecessary calculation and storage. If your query has a WHERE condition
on that mandatory column, query performance will be very good.

Let me give an example. Assume I have a fact table with the following
dimensions: date, seller, country.

Of these, date and country are low-cardinality columns and seller is a
high-cardinality column. As almost all my queries specify a seller, I set
"seller" as mandatory in the row key; this column is then moved to the
head of the row key and is not aggregated. The HBase row keys will look
like:

seller1,cal_dt,country ->
seller2,cal_dt,country ->
seller3,cal_dt,country ->
...
sellerN,cal_dt,country ->

seller1,cal_dt ->
seller2,cal_dt ->
seller3,cal_dt ->
...
sellerN,cal_dt ->

seller1,country ->
seller2,country ->
seller3,country ->
...
sellerN,country ->


As the seller's cardinality is high, the HBase scan range for a given
seller value is very small, so query performance is good.
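
The effect of a leading high-cardinality column on the scan range can be sketched with a sorted map. This is a simplified illustration, not Kylin or HBase code: a java.util.TreeMap stands in for HBase's sorted row keys, the keys are plain comma-separated strings (real Kylin row keys are encoded differently), and the class name and data are hypothetical.

```java
import java.util.Locale;
import java.util.SortedMap;
import java.util.TreeMap;

class RowKeyPrefixScan {
    // Returns the rows whose key starts with the prefix, via a sorted range scan.
    static SortedMap<String, Long> scanPrefix(TreeMap<String, Long> rows, String prefix) {
        // '\uffff' sorts after any character used in these keys, so the range
        // [prefix, prefix + '\uffff') covers the keys that start with the prefix.
        return rows.subMap(prefix, prefix + '\uffff');
    }

    public static void main(String[] args) {
        TreeMap<String, Long> rows = new TreeMap<>();
        for (int s = 1; s <= 1000; s++) {
            String seller = String.format(Locale.ROOT, "seller%04d", s);
            rows.put(seller + ",2015-09-01,US", 10L);
            rows.put(seller + ",2015-09-01,CN", 20L);
            rows.put(seller + ",2015-09-02,US", 30L);
        }
        // With seller at the head of the key, one seller's rows are contiguous.
        SortedMap<String, Long> hit = scanPrefix(rows, "seller0042,");
        System.out.println(hit.size() + " of " + rows.size() + " rows scanned");
        // prints "3 of 3000 rows scanned"
    }
}
```

If seller were not the leading column, the same rows would be scattered across the key space and a prefix range could not isolate them.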

If you have SQL queries that do not specify "seller", this cube may not
provide the same response time. In that case we suggest creating another
cube without the seller dimension. Multiple cubes can coexist in one
project, and Kylin will pick the most appropriate cube to serve each
query.



On 9/2/15, 7:41 PM, "Abhilash L L" <ab...@infoworks.io> wrote:

>Thanks for the explanations, Hongbin and Li,
>
>   We seem to have a decent understanding of hierarchical and derived
>dimensions.
>
>   For hierarchical dimensions, the columns that are part of the hierarchy
>also add extra levels to the cuboids. They become part of the row key, and
>cubing happens on those columns as well.
>
>   For derived dimensions, the query is rewritten to use the join key, and
>then the in-memory lookup table is used to rewrite the HBase response into
>values of the derived dimension.
>
>   However, there is something called a 'Normal' dimension (only one column
>at a time), and we are trying to see how it works during query
>resolution. Is this the mandatory dimension? And since the UI allows only
>one column per 'Normal' dimension, do we have to create one for each column?
>
>
> Also, a good write-up about the types of dimensions and when to use each
>type would be really helpful for users who do not want to get into the code
>to figure things out. Otherwise the requests for clarification might keep
>coming up. Just a thought.
>
>Regards,
>Abhilash
>
>On Wed, Sep 2, 2015 at 2:57 PM, Li Yang <li...@apache.org> wrote:
>
>> Kylin assumes lookup tables are small (<100MB) and thus fit in memory.
>> In your model, if order or customer go beyond millions of rows, then they
>> have to be on the fact table. As Hongbin mentioned, an easy way is to use
>> a Hive view.
>>
>> For analyzing ultra-high-cardinality columns (like millions of
>> customers), we see two common use cases.
>>
>> 1. TopN analysis. Returning millions of records is not useful at all;
>> returning the TopN biggest customers makes much more sense.
>> KYLIN-943 <https://issues.apache.org/jira/browse/KYLIN-943> is a new
>> feature under development that aims to answer TopN queries in under
>> a second.
>>
>> 2. Focused analysis. Looking at a specific customer (e.g. where
>> customer=A). Such a query can be very fast if you create a cube with
>> customer as a Mandatory dimension.
>> Cheers
>> Yang
>>
>> On Tue, Sep 1, 2015 at 11:23 PM, hongbin ma <ma...@apache.org>
>>wrote:
>>
>> > Kylin handles star schemas well, but may encounter issues like OOM in
>> > your case.
>> > How many large lookup tables do you have?
>> > I'm not sure an eviction policy will help, because any time a SQL query
>> > involves the lookup table, the lookup table snapshot will have to be
>> > loaded again (so the snapshots keep swapping in and out).
>> >
>> > One way to solve the problem is to join your tables into a flattened
>> > table using a Hive view, providing Kylin with a single big fact table.
>> > And please avoid using a dictionary on high-cardinality columns.
>> >
>> > On Tue, Sep 1, 2015 at 11:16 PM, Abhilash L L <ab...@infoworks.io>
>> > wrote:
>> >
>> > > Thanks for replying, Hongbin,
>> > >
>> > >      for 1) we are trying to add some sort of eviction-based cache
>> > > instead of a map. However, we are still trying to figure out what to
>> > > do for 3).
>> > >
>> > >     What is the general advice? The case here is that I have order
>> > > details as a fact, and order and customer as dimensions. Each of
>> > > these will run into many millions of rows. Also, the foreign key is
>> > > not a long/bigint; it's a string that is a combination of our custom
>> > > columns. As we understand it, making it a dictionary will not work.
>> > > Please suggest what approach we should take.
>> > >
>> > > Regards,
>> > > Abhilash
>> > >


Re: Lookup Table Enumerator high memory

Posted by Abhilash L L <ab...@infoworks.io>.
For the suggestion of 'flattening' the customer into the order table (fact
table), we need a few clarifications/suggestions.

Let's say we have flattened the customers into the fact table.

To get the number of customers, we can create a normal dimension on the
'customer id' in the fact table. Customer id becomes part of the row key.

How do we get attributes like 'name' or 'age' for a given customer id?

   Adding a dummy measure on every customer column doesn't make sense when
queried without customer in the group by. It also leads to a lot of
duplicate data in HBase. If we try to query without a group by, we get the
error '<colname> does not exist in row key desc'.

   We can't do anything similar to a 'derived dimension' on a fact table, as
that's only possible on lookup tables. It would also create a snapshot, etc.


Regards,
Abhilash

On Wed, Sep 2, 2015 at 7:48 PM, Abhilash L L <ab...@infoworks.io> wrote:

> Hi Luke,
>
>  I was mainly referring to the behaviour of the hierarchical, derived and
> Normal types within Kylin. For derived and normal especially, the effect of
> using them is not very apparent, since there are no changes in the row
> key design etc.
>
> Regards,
> Abhilash
>
> On Wed, Sep 2, 2015 at 7:32 PM, Luke Han <lu...@gmail.com> wrote:
>
>> Cube, Hierarchy Dimension and Measure are very common in the DW/BI area;
>> we suppose the "cube modeler" has experience with them :)
>>
>> But of course, we should enhance Kylin's terminology page:
>> http://kylin.incubator.apache.org/docs/gettingstarted/terminology.html
>>
>> Meanwhile, I would like to recommend this one for reference:
>> http://www.kimballgroup.com/2008/10/maintaining-dimension-hierarchies/
>>
>> Hope these help a little :)
>>
>> Thanks.
>>
>>
>>
>> Best Regards!
>> ---------------------
>>
>> Luke Han
>>
>> On Wed, Sep 2, 2015 at 7:41 PM, Abhilash L L <ab...@infoworks.io>
>> wrote:
>>
>> > Thanks for explanations Hongbin and Li,
>> >
>> >    We seem to have a decent understanding of hierarchical and derived
>> > dimensions.
>> >
>> >    For hierarchical, the columns part of the hierarchy also participate
>> in
>> > adding an extra level to cubiods. They become part of rowkey as well and
>> > cubing happens on those columns as well.
>> >
>> >    For dervied, the query is rewritten to use the join key and then the
>> in
>> > memory look up table is used to rewrite the hbase response to values
>> with
>> > the derived dimension.
>> >
>> >    However there is something called a 'Normal' dimension (only one
>> column
>> > at a time), which we are trying to see how it works during query
>> > resolution. Is this the mandatory dimension ? But since the UI allows
>> only
>> > column per 'Normal' dimension do we have to create one for each column ?
>> >
>> >
>> >  Also, a good write up about the types of dimensions and when to use
>> each
>> > type will be really helpful for users, who do not want get into the
>> code to
>> > figure out stuff. The clarification seeking requests might keep coming
>> up
>> > as well. Just a thought.
>> >
>> >
>> > Regards,
>> > Abhilash
>> >
>> > On Wed, Sep 2, 2015 at 2:57 PM, Li Yang <li...@apache.org> wrote:
>> >
>> > > Kylin assumes lookup table to be small (<100MB), thus can fit in
>> memory.
>> > > In your model, if order or customer go beyond millions, then they
>> have to
>> > > be on the fact table.  Like Hongbin mentioned, an easy way is to use a
>> > hive
>> > > view.
>> > >
>> > > About analyzing ultra-high cardinality columns (like millions of
>> > > customers), we see two common use cases.
>> > >
>> > > 1. TopN analysis.  Returning a millions records is not useful at all,
>> > > instread, returning the TopN big customer makes much better sense.
>> > > KYLIN-943 <https://issues.apache.org/jira/browse/KYLIN-943> is a new
>> > > feature under development that aims to respond to TopN queries in
>> > > subsecond.
>> > >
>> > > 2. Focused analysis.  Looking at a specific customer (e.g. where
>> > > customer=A).  Such query can be very fast by creating a cube with
>> > customer
>> > > as a Mandatory dimension.
>> > >
>> > > Cheers
>> > > Yang
>> > >
>> > > On Tue, Sep 1, 2015 at 11:23 PM, hongbin ma <ma...@apache.org>
>> > wrote:
>> > >
>> > > > ​Kylin handles star schema well, but my encounter issues like OOM on
>> > your
>> > > > case.
>> > > > How many large lookup tables do you have?
>> > > > I'm not sure if a evict policy will help because anytime a SQL
>> involves
>> > > the
>> > > > lookup table, the lookup table snapshot will have to be loaded
>> again(so
>> > > the
>> > > > snapshots are swapping-in-swapping-out)
>> > > >
>> > > > One way to solve the problem is to join your tables into a flatten
>> > table
>> > > > using Hive view, providing Kylin with single big fact table. And
>> please
>> > > > notice avoid using dictionary on high cardinality columns.
>> > > >
>> > > > On Tue, Sep 1, 2015 at 11:16 PM, Abhilash L L <
>> abhilash@infoworks.io>
>> > > > wrote:
>> > > >
>> > > > > Thanks for replying Hongbin,
>> > > > >
>> > > > >      for 1) we are trying to add some sort of evitction based
>> cache
>> > > > instead
>> > > > > of a map. However, we still are trying to figure out what to do
>> for
>> > 3).
>> > > > >
>> > > > >     What is the general advice ? The case here is ..  I have order
>> > > > details
>> > > > > as a fact and order as a dimension and also customer. Now each of
>> > these
>> > > > > will run into many millions.  Also, the f-key is not a
>> long/bigint,
>> > > its a
>> > > > > string which is a combination of our custom columns. Making it a
>> > > > dictionary
>> > > > > will not work as we understand. Please suggest what should be the
>> > > > approach
>> > > > > taken
>> > > > >
>> > > > > Regards,
>> > > > > Abhilash
>> > > > >

Re: Lookup Table Enumerator high memory

Posted by Abhilash L L <ab...@infoworks.io>.
Hi Luke,

 I was mainly referring to the behaviour of the hierarchical, derived and
normal dimension types within Kylin.  For derived and normal especially, the
effect of using them is not very apparent, particularly since there are no
changes in the row key design etc.

Regards,
Abhilash


Re: Lookup Table Enumerator high memory

Posted by Luke Han <lu...@gmail.com>.
Cube, Hierarchy Dimension and Measure are very common concepts in the DW/BI
area, and we suppose the "cube modeler" has experience with them :)

But of course, we should enhance Kylin's terminology page:
http://kylin.incubator.apache.org/docs/gettingstarted/terminology.html

Meanwhile, I would like to recommend this article for reference:
http://www.kimballgroup.com/2008/10/maintaining-dimension-hierarchies/

Hope these help a little :)

Thanks.



Best Regards!
---------------------

Luke Han


Re: Lookup Table Enumerator high memory

Posted by Abhilash L L <ab...@infoworks.io>.
Thanks for the explanations, Hongbin and Li,

   We seem to have a decent understanding of hierarchical and derived
dimensions.

   For hierarchical, the columns that are part of the hierarchy also
participate in adding an extra level to cuboids. They become part of the
rowkey as well, and cubing happens on those columns too.

   For derived, the query is rewritten to use the join key, and then the
in-memory lookup table is used to rewrite the HBase response to values with
the derived dimension.
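
To make the derived-dimension flow described above concrete, here is a minimal sketch in plain Java. It is not Kylin code: the class name, method names and sample data are all hypothetical, and plain maps stand in for the lookup-table snapshot and for the rows returned from HBase. It only illustrates the two steps as understood in this thread: translate a filter on the derived column into join-key values, then map the stored rows back to the derived value.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Illustrative only; the names here are assumptions, not Kylin classes. */
class DerivedDimensionSketch {

    /** Step 1: turn a filter on the derived column into the matching join-key values. */
    static Set<String> joinKeysFor(Map<String, String> lookup, String derivedValue) {
        Set<String> keys = new TreeSet<>();
        for (Map.Entry<String, String> row : lookup.entrySet()) {
            if (row.getValue().equals(derivedValue)) {
                keys.add(row.getKey());
            }
        }
        return keys;
    }

    /** Step 2: re-aggregate rows stored by join key under their derived value. */
    static Map<String, Long> aggregateByDerived(Map<String, Long> rowsByJoinKey,
                                                Map<String, String> lookup,
                                                Set<String> joinKeys) {
        Map<String, Long> result = new TreeMap<>();
        for (String key : joinKeys) {
            Long measure = rowsByJoinKey.get(key);
            if (measure != null) {
                result.merge(lookup.get(key), measure, Long::sum);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical lookup table: store id (join key) -> city (derived column).
        Map<String, String> lookup = new HashMap<>();
        lookup.put("S1", "Bangalore");
        lookup.put("S2", "Bangalore");
        lookup.put("S3", "Pune");

        // Hypothetical pre-aggregated rows, keyed by the join key only.
        Map<String, Long> rows = new HashMap<>();
        rows.put("S1", 100L);
        rows.put("S2", 50L);
        rows.put("S3", 70L);

        Set<String> keys = joinKeysFor(lookup, "Bangalore");
        Map<String, Long> result = aggregateByDerived(rows, lookup, keys);
        System.out.println(keys + " -> " + result); // prints [S1, S2] -> {Bangalore=150}
    }
}
```

In this sketch the whole lookup table has to be held in memory for both steps, which is why the size of the lookup table matters so much in the rest of this thread.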

   However, there is something called a 'Normal' dimension (only one column
at a time), and we are trying to see how it works during query
resolution. Is this the mandatory dimension? And since the UI allows only one
column per 'Normal' dimension, do we have to create one for each column?


 Also, a good write-up about the types of dimensions and when to use each
type would be really helpful for users who do not want to get into the code
to figure things out. Otherwise, requests for clarification like this one
might keep coming up. Just a thought.


Regards,
Abhilash


Re: Lookup Table Enumerator high memory

Posted by Li Yang <li...@apache.org>.
Kylin assumes lookup tables are small (<100MB) and thus fit in memory.
In your model, if order or customer grows beyond millions of rows, then they
have to be on the fact table.  As Hongbin mentioned, an easy way is to use a
Hive view.

About analyzing ultra-high cardinality columns (like millions of
customers), we see two common use cases.

1. TopN analysis.  Returning millions of records is not useful at all;
instead, returning the TopN biggest customers makes much better sense.
KYLIN-943 <https://issues.apache.org/jira/browse/KYLIN-943> is a new
feature under development that aims to answer TopN queries in under a second.

2. Focused analysis.  Looking at a specific customer (e.g. where
customer=A).  Such a query can be very fast if you create a cube with customer
as a Mandatory dimension.
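
As a rough illustration of the TopN idea in case 1, the sketch below keeps only the N largest customers from a set of per-customer totals, using a bounded min-heap. This is a generic technique and not the KYLIN-943 implementation; the class name, method name and sample figures are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

/** Illustrative only; not related to how KYLIN-943 is implemented. */
class TopNSketch {

    /** Returns the keys of the n largest totals, largest first. */
    static List<String> topN(Map<String, Long> totals, int n) {
        // Min-heap on the total: the smallest of the current top-N sits at the head,
        // so it is the one dropped when a bigger entry arrives.
        PriorityQueue<Map.Entry<String, Long>> heap = new PriorityQueue<>(
                Comparator.comparingLong((Map.Entry<String, Long> e) -> e.getValue()));
        for (Map.Entry<String, Long> entry : totals.entrySet()) {
            heap.offer(entry);
            if (heap.size() > n) {
                heap.poll(); // never hold more than n entries
            }
        }
        List<String> result = new ArrayList<>();
        while (!heap.isEmpty()) {
            result.add(heap.poll().getKey()); // comes out smallest first
        }
        Collections.reverse(result);
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical sales totals per customer.
        Map<String, Long> totals = new HashMap<>();
        totals.put("A", 500L);
        totals.put("B", 120L);
        totals.put("C", 900L);
        totals.put("D", 30L);
        System.out.println(topN(totals, 2)); // prints [C, A]
    }
}
```

The heap never grows beyond N entries, so the memory needed depends on N rather than on the number of customers.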

Cheers
Yang


Re: Lookup Table Enumerator high memory

Posted by hongbin ma <ma...@apache.org>.
Kylin handles star schema well, but may encounter issues like OOM in your
case.
How many large lookup tables do you have?
I'm not sure an eviction policy will help, because any time a SQL query
involves the lookup table, the lookup table snapshot will have to be loaded
again (so the
snapshots are swapping-in-swapping-out)
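
The swapping concern can be shown with a small, self-contained sketch. It is not Kylin's SnapshotManager: the class below is a hypothetical LRU cache built on java.util.LinkedHashMap, and the loader function merely stands in for reading a snapshot from storage. When more lookup tables are queried in rotation than the cache can hold, every access turns into a reload.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical LRU cache for lookup-table snapshots; an assumption, not Kylin code. */
class LruSnapshotCache<K, V> {
    private final Function<K, V> loader; // stands in for loading a snapshot from storage
    private final Map<K, V> cache;
    private int loadCount = 0;           // how many times a snapshot had to be (re)loaded

    LruSnapshotCache(final int maxEntries, Function<K, V> loader) {
        this.loader = loader;
        // accessOrder=true keeps entries ordered from least to most recently used.
        this.cache = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxEntries; // evict the least recently used entry
            }
        };
    }

    synchronized V get(K key) {
        V value = cache.get(key);
        if (value == null) {
            value = loader.apply(key);
            loadCount++;
            cache.put(key, value);
        }
        return value;
    }

    synchronized int loadCount() {
        return loadCount;
    }

    synchronized int size() {
        return cache.size();
    }

    public static void main(String[] args) {
        LruSnapshotCache<String, String> cache =
                new LruSnapshotCache<>(2, table -> "snapshot of " + table);
        // Three tables queried in rotation with room for only two: every access misses.
        String[] queries = {"CUSTOMER", "ORDERS", "STORE", "CUSTOMER", "ORDERS", "STORE"};
        for (String table : queries) {
            cache.get(table);
        }
        System.out.println("loads=" + cache.loadCount() + ", cached=" + cache.size());
        // prints loads=6, cached=2
    }
}
```

With a capacity of three or more the same access pattern would need only three loads, so an eviction policy bounds memory but only pays off when the frequently queried tables fit within the limit.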

One way to solve the problem is to join your tables into a flattened table
using a Hive view, providing Kylin with a single big fact table. Also,
please avoid using a dictionary on high-cardinality columns.

On Tue, Sep 1, 2015 at 11:16 PM, Abhilash L L <ab...@infoworks.io> wrote:

> Thanks for replying Hongbin,
>
>      for 1) we are trying to add some sort of eviction based cache instead
> of a map. However, we still are trying to figure out what to do for 3).
>
>     What is the general advice ? The case here is ..  I have order details
> as a fact and order as a dimension and also customer. Now each of these
> will run into many millions.  Also, the f-key is not a long/bigint, its a
> string which is a combination of our custom columns. Making it a dictionary
> will not work as we understand. Please suggest what should be the approach
> taken
>
> Regards,
> Abhilash
>
> On Tue, Sep 1, 2015 at 4:37 PM, hongbin ma <ma...@apache.org> wrote:
>
> >     for 1) ..  seems like only the resource path / table desc etc is only
> > kept in memory while a new lookupstringtable is created per query/request
> > which holds onto data for the lifetime of the request.  So once the
> request
> > is done, it should be garbage collectable ?
> >
> > /table is just for the hive table's schema, the look up table content is
> > cached in SnapshotManager and it will not be evicted so far. So if you
> have
> > a lot of large lookup tables this will be a problem
> >
> >
> > 3) Also the derived filter translator, is there a way to modify the '
> > IN_THRESHOLD'  via config file ?
> >
> > Are you facing performance issue with a lot of IN clauses? if so , please
> > take a look at https://issues.apache.org/jira/browse/KYLIN-740, the
> patch
> > will be merged into next release
> >
> > On Mon, Aug 31, 2015 at 9:54 PM, Abhilash L L <ab...@infoworks.io>
> > wrote:
> >
> > > Sorry for the confusion,
> > >
> > >     for 1) ..  seems like only the resource path / table desc etc is
> only
> > > kept in memory while a new lookupstringtable is created per
> query/request
> > > which holds onto data for the lifetime of the request.  So once the
> > request
> > > is done, it should be garbage collectable ?
> > >
> > >
> > > 3) Also the derived filter translator, is there a way to modify the '
> > > IN_THRESHOLD'  via config file ?
> > >
> > >
> > >
> > >
> > >
> > > Regards,
> > > Abhilash
> > >
> > > On Mon, Aug 31, 2015 at 7:05 PM, Abhilash L L <ab...@infoworks.io>
> > > wrote:
> > >
> > > > Hello,
> > > >
> > > >     We started noticing that Kylin tomcat server is taking a lot of
> > ram.
> > > > It even hit a limit of 10GB.
> > > >
> > > >     After spending some time by going over the code, it seems like
> the
> > > > cube enumerator is not storing anything in memory. But the Lookup
> table
> > > > enumerator seems to be loading all records and storing it in memory.
> > > >
> > > >     1) What happens when there are lot of projects defined and we end
> > up
> > > > with tons of look up tables across them. Does it get swapped out
> > > > automatically ?  I am not able to track where eviction is happening.
> > The
> > > > snapshot manager has a 'removeSnapshot' but its intent seems
> different
> > to
> > > > me.
> > > >
> > > >     2) How do we handle really higher cardinality dimension. Eg: If I
> > > have
> > > > sales as a fact and customers as a dimension, there will be millions
> of
> > > > customers. However a store is good candidate to keep in memory but
> not
> > > > customers. Whats the recommended setting while creating the cube to
> > > handle
> > > > such a case
> > > >
> > > > Regards,
> > > > Abhilash
> > > >
> > >
> >
> >
> >
> > --
> > Regards,
> >
> > *Bin Mahone | 马洪宾*
> > Apache Kylin: http://kylin.io
> > Github: https://github.com/binmahone
> >
>



-- 
Regards,

*Bin Mahone | 马洪宾*
Apache Kylin: http://kylin.io
Github: https://github.com/binmahone

Re: Lookup Table Enumerator high memory

Posted by Abhilash L L <ab...@infoworks.io>.
Thanks for replying Hongbin,

     for 1) we are trying to add some sort of eviction-based cache instead
of a map. However, we are still trying to figure out what to do for 3).
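
To make the idea in 1) concrete, here is a minimal sketch of an
eviction-based cache, assuming a simple least-recently-used policy with a
fixed entry count. The class name LruSnapshotCache and the maxEntries
parameter are hypothetical and exist only for illustration; this is not
Kylin's SnapshotManager code, and a real change would probably need to bound
memory by snapshot size rather than by entry count.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not taken from Kylin's SnapshotManager.
// A LinkedHashMap in access order evicts the least recently used entry
// once the configured capacity is exceeded.
class LruSnapshotCache<K, V> extends LinkedHashMap<K, V> {
    private static final long serialVersionUID = 1L;

    // Assumed limit: maximum number of snapshots kept in memory.
    private final int maxEntries;

    LruSnapshotCache(int maxEntries) {
        // accessOrder = true: get() and put() move an entry to the "newest" end.
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called by LinkedHashMap after each insertion; returning true
        // drops the least recently used entry.
        return size() > maxEntries;
    }
}
```

This class is not thread-safe on its own, so it would need to be wrapped
(for example with Collections.synchronizedMap) before being shared across
query threads. An evicted snapshot also has to be reloaded the next time a
query touches that lookup table.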

    What is the general advice? The case here is: I have order details as
a fact, with order and customer as dimensions. Each of these will run into
many millions of rows. Also, the foreign key is not a long/bigint; it is a
string that combines several of our custom columns. As we understand it,
making it a dictionary will not work. Please suggest what approach we
should take.

Regards,
Abhilash

On Tue, Sep 1, 2015 at 4:37 PM, hongbin ma <ma...@apache.org> wrote:

>     for 1) ..  seems like only the resource path / table desc etc is only
> kept in memory while a new lookupstringtable is created per query/request
> which holds onto data for the lifetime of the request.  So once the request
> is done, it should be garbage collectable ?
>
> /table is just for the hive table's schema, the look up table content is
> cached in SnapshotManager and it will not be evicted so far. So if you have
> a lot of large lookup tables this will be a problem
>
>
> 3) Also the derived filter translator, is there a way to modify the '
> IN_THRESHOLD'  via config file ?
>
> Are you facing performance issue with a lot of IN clauses? if so , please
> take a look at https://issues.apache.org/jira/browse/KYLIN-740, the patch
> will be merged into next release
>
> On Mon, Aug 31, 2015 at 9:54 PM, Abhilash L L <ab...@infoworks.io>
> wrote:
>
> > Sorry for the confusion,
> >
> >     for 1) ..  seems like only the resource path / table desc etc is only
> > kept in memory while a new lookupstringtable is created per query/request
> > which holds onto data for the lifetime of the request.  So once the
> request
> > is done, it should be garbage collectable ?
> >
> >
> > 3) Also the derived filter translator, is there a way to modify the '
> > IN_THRESHOLD'  via config file ?
> >
> >
> >
> >
> >
> > Regards,
> > Abhilash
> >
> > On Mon, Aug 31, 2015 at 7:05 PM, Abhilash L L <ab...@infoworks.io>
> > wrote:
> >
> > > Hello,
> > >
> > >     We started noticing that Kylin tomcat server is taking a lot of
> ram.
> > > It even hit a limit of 10GB.
> > >
> > >     After spending some time by going over the code, it seems like the
> > > cube enumerator is not storing anything in memory. But the Lookup table
> > > enumerator seems to be loading all records and storing it in memory.
> > >
> > >     1) What happens when there are lot of projects defined and we end
> up
> > > with tons of look up tables across them. Does it get swapped out
> > > automatically ?  I am not able to track where eviction is happening.
> The
> > > snapshot manager has a 'removeSnapshot' but its intent seems different
> to
> > > me.
> > >
> > >     2) How do we handle really higher cardinality dimension. Eg: If I
> > have
> > > sales as a fact and customers as a dimension, there will be millions of
> > > customers. However a store is good candidate to keep in memory but not
> > > customers. Whats the recommended setting while creating the cube to
> > handle
> > > such a case
> > >
> > > Regards,
> > > Abhilash
> > >
> >
>
>
>
> --
> Regards,
>
> *Bin Mahone | 马洪宾*
> Apache Kylin: http://kylin.io
> Github: https://github.com/binmahone
>