You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@kylin.apache.org by Jason Hale <ja...@koddi.com> on 2016/08/01 23:20:09 UTC

Kylin Cube Performance

I'm setting up a test case for a portion of our dataset, to evaluate Kylin
and I'm not seeing the performance that I would expect.

The cube building process is taking about 5-6 hours with  ~69,000,000
records and 10 dimensions. I'm not sure if that's the expected build time,
but the other problem is the query performance after building the cube.

All queries were tested with a very simple query (e.g. SELECT SUM(clicks)
FROM reporting GROUP BY search_type)

Grouping by 1 or 2 dimensions gives me very responsive queries (under 2
seconds), but adding more dimensions drastically increases the query time
(over 1 minute and it times out through hbase). I would expect these
queries to have all similar query times since they should query the
respective cuboid, so I'm not sure why the performance would suffer. I
didn't set up any special rules for the cube, but during the build process
it showed all the N-dimension cubes and the log simply said 'skipped'.

Is there something I'm missing in the configuration?

I have a HDP cluster with 3 nodes and 1 client node on which Kylin is
installed. Do I need to adjust the hadoop configuration. I'm using most of
the default HDP settings.

What more information can I provide?

Re: Kylin Cube Performance

Posted by Jason Hale <ja...@koddi.com>.
Ah, that makes sense. Thank you, hongbin.

On Thu, Aug 4, 2016 at 11:04 AM, hongbin ma <ma...@apache.org> wrote:

> If you have a limit of 100, kylin is SUPPOSED to be far more efficient.
> However there's currently a issue here that might cause limit clause being
> overlooked (https://issues.apache.org/jira/browse/KYLIN-1936)
>
> I'm working on fixing KYLIN-1936, it will be fixed in 1.5.4.
>
> On Thu, Aug 4, 2016 at 11:36 PM, Jason Hale <ja...@koddi.com> wrote:
>
> > True, but even if there's a limit of 100, it still has to scan all
> records?
> > Perhaps I'm just used to how Postgres handles that as it only scans the
> > necessary records, not the entire set if it's limited. I can rethink the
> > way I approach it if that's the case.
> >
> > On Thu, Aug 4, 2016 at 10:31 AM, hongbin ma <ma...@apache.org>
> wrote:
> >
> > > Hi Jason
> > >
> > > As Shaofeng explained it's not reasonable to expect sub-second latency
> if
> > > you're returning tens of millions of records. You data model is quite
> > > simple and you don't have costly measure like distinct count etc., so
> > kylin
> > > should be performant on normal OLAP queries.
> > >
> > > Another advise: if the cardinality of the mandatory dimensions
> (CHILD_ID
> > > and SITE_ID) are very high, you might isolate such dimensions into a
> > > separate "aggregation group", so that 1. queries not touching these
> > > dimensions can be performant 2. calculate less cuboids. Please refer to
> > > http://kylin.apache.org/blog/2016/02/18/new-aggregation-group/
> > >
> > > On Thu, Aug 4, 2016 at 11:11 PM, ShaoFeng Shi <sh...@apache.org>
> > > wrote:
> > >
> > > > The log is pretty clear; the cuboid is exact match, but the scan
> count
> > is
> > > > massive:
> > > >
> > > > Visiting hbase table KYLIN_RIK9O18H07: cuboid exact match, from 992
> to
> > > 992
> > > > Total scan count: 12306477
> > > >
> > > > Please add where condition to narrow down the result set as much as
> > > > possible; It doesn't make sense for an OLAP query to return millions
> of
> > > > records;
> > > >
> > > > 2016-08-04 13:05 GMT+08:00 Jason Hale <ja...@koddi.com>:
> > > >
> > > > > Sure, see kylin.log below:
> > > > >
> > > > > 2016-08-04 00:47:35,839 INFO  [http-bio-7070-exec-7]
> > > > > controller.QueryController:175 : The original query:  SELECT
> > > SUM(clicks)
> > > > > FROM hpa_reporting2 GROUP BY site_id, child_id, search_type,
> > hotel_id,
> > > > > report_date
> > > > > 2016-08-04 00:47:35,839 INFO  [http-bio-7070-exec-7]
> > > > > service.QueryService:266 : The corrected query: SELECT SUM(clicks)
> > FROM
> > > > > hpa_reporting2 GROUP BY site_id, child_id, search_type, hotel_id,
> > > > > report_date
> > > > > LIMIT 50000
> > > > > 2016-08-04 00:47:35,908 INFO  [http-bio-7070-exec-7]
> > > > routing.QueryRouter:48
> > > > > : The project manager's reference is
> > > > > org.apache.kylin.metadata.project.ProjectManager@3a3735a5
> > > > > 2016-08-04 00:47:35,909 INFO  [http-bio-7070-exec-7]
> > > > routing.QueryRouter:60
> > > > > : Find candidates by table DEFAULT.HPA_REPORTING2 and
> > > project=KODDI_DEV :
> > > > > org.apache.kylin.query.routing.Candidate@51ed1b3b
> > > > > 2016-08-04 00:47:35,909 INFO  [http-bio-7070-exec-7]
> > > > routing.QueryRouter:49
> > > > > : Applying rule: class
> > > > > org.apache.kylin.query.routing.rules.
> RemoveUncapableRealizationsRule,
> > > > > realizations before: [hpa_reporting2_cube_clone(CUBE)],
> realizations
> > > > > after:
> > > > > [hpa_reporting2_cube_clone(CUBE)]
> > > > > 2016-08-04 00:47:35,910 INFO  [http-bio-7070-exec-7]
> > > > routing.QueryRouter:49
> > > > > : Applying rule: class
> > > > > org.apache.kylin.query.routing.rules.RealizationSortRule,
> > realizations
> > > > > before: [hpa_reporting2_cube_clone(CUBE)], realizations after:
> > > > > [hpa_reporting2_cube_clone(CUBE)]
> > > > > 2016-08-04 00:47:35,910 INFO  [http-bio-7070-exec-7]
> > > > routing.QueryRouter:72
> > > > > : The realizations remaining: [hpa_reporting2_cube_clone(CUBE)]
> And
> > the
> > > > > final chosen one is the first one
> > > > > 2016-08-04 00:47:35,975 DEBUG [http-bio-7070-exec-7]
> > > > > enumerator.OLAPEnumerator:107 : query storage...
> > > > > 2016-08-04 00:47:35,976 INFO  [http-bio-7070-exec-7]
> > > > > v2.CubeStorageQuery:239 : exactAggregation is true
> > > > > 2016-08-04 00:47:35,976 INFO  [http-bio-7070-exec-7]
> > > > > v2.CubeStorageQuery:357 : Enable limit 50000
> > > > > 2016-08-04 00:47:35,977 DEBUG [http-bio-7070-exec-7]
> > > > > v2.CubeHBaseEndpointRPC:257 : New scanner for current segment
> > > > > hpa_reporting2_cube_clone[19700101000000_20160828000000] will use
> > > > > SCAN_FILTER_AGGR_CHECKMEM as endpoint's behavior
> > > > > 2016-08-04 00:47:35,979 DEBUG [http-bio-7070-exec-7]
> > > > > v2.CubeHBaseEndpointRPC:313 : Serialized scanRequestBytes 836
> bytes,
> > > > > rawScanBytesString 56 bytes
> > > > > 2016-08-04 00:47:35,979 INFO  [http-bio-7070-exec-7]
> > > > > v2.CubeHBaseEndpointRPC:315 : The scan 31b2dd4c for segment
> > > > > hpa_reporting2_cube_clone[19700101000000_20160828000000] is as
> below
> > > with
> > > > > 1
> > > > > separate raw scans, shard part of start/end key is set to 0
> > > > > 2016-08-04 00:47:35,980 INFO  [http-bio-7070-exec-7]
> > > v2.CubeHBaseRPC:271
> > > > :
> > > > > Visiting hbase table KYLIN_RIK9O18H07: cuboid exact match, from 992
> > to
> > > > 992
> > > > > Start:
> > > > > \x00\x00\x00\x00\x00\x00\x00\x00\x03\xE0\x00\x00\x00\x00\
> > > > > x00\x00\x00\x00\x00
> > > > > (\x00\x00\x00\x00\x00\x00\x00\x00\x03\xE0\x00\x00\x00\x00\
> > > > > x00\x00\x00\x00\x00)
> > > > > Stop:
> > > > >  \x00\x00\x00\x00\x00\x00\x00\x00\x03\xE0\xFF\xFF\xFF\xFF\
> > > > > xFF\xFF\xFF\xFF\xFF\x00
> > > > > (\x00\x00\x00\x00\x00\x00\x00\x00\x03\xE0\xFF\xFF\xFF\xFF\
> > > > > xFF\xFF\xFF\xFF\xFF\x00),
> > > > > No Fuzzy Key
> > > > > 2016-08-04 00:47:35,981 DEBUG [http-bio-7070-exec-7]
> > > > > v2.CubeHBaseEndpointRPC:320 : Submitting rpc to 1 shards starting
> > from
> > > > > shard 2, scan range count 1
> > > > > 2016-08-04 00:47:35,981 INFO  [http-bio-7070-exec-7]
> > > > > v2.CubeHBaseEndpointRPC:103 : Timeout for ExpectedSizeIterator is:
> > > 99000
> > > > > 2016-08-04 00:47:35,981 DEBUG [http-bio-7070-exec-7]
> > > > > enumerator.OLAPEnumerator:127 : return TupleIterator...
> > > > > 2016-08-04 00:47:52,773 INFO  [pool-6-thread-1]
> > > > v2.CubeHBaseEndpointRPC:351
> > > > > : <sub-thread for GTScanRequest 31b2dd4c> Endpoint RPC returned
> from
> > > > HTable
> > > > > KYLIN_RIK9O18H07 Shard
> > > > > \x4B\x59\x4C\x49\x4E\x5F\x52\x49\x4B\x39\x4F\x31\x38\x48\
> > > > > x30\x37\x2C\x00\x02\x2C\x31\x34\x37\x30\x31\x35\x35\x33\
> > > > > x31\x34\x39\x33\x37\x2E\x61\x33\x61\x35\x34\x37\x39\x61\
> > > > > x32\x63\x37\x61\x61\x64\x30\x36\x33\x66\x30\x33\x64\x63\
> > > > > x34\x65\x31\x30\x36\x33\x61\x33\x61\x37\x2E
> > > > > on host: ip-10-0-0-157.ec2.internal.Total scanned row: 12306477.
> > Total
> > > > > filtered/aggred row: 0. Time elapsed in EP: 16562(ms). Server CPU
> > > usage:
> > > > > 0.24348086721950246, server physical mem left: 7.195234304E9,
> server
> > > swap
> > > > > mem left:0.0.Etc message: start latency: 15@1,agg done@13760
> > ,compress
> > > > > done@16562,server stats done@16562,
> > > > > debugGitTag:cf4d2940b67d622eacd2ac9a913b221091a35c2e;.Normal
> > Complete:
> > > > > true.
> > > > > 2016-08-04 00:47:54,068 DEBUG [pool-6-thread-1]
> > > util.CompressionUtils:67
> > > > :
> > > > > Original: 46465726 bytes. Decompressed: 150553629 bytes. Time: 1294
> > > > > 2016-08-04 00:48:29,303 INFO  [pool-4-thread-1]
> > > > > threadpool.DefaultScheduler:106 : Job Fetcher: 0 running, 0 actual
> > > > > running,
> > > > > 0 ready, 12 others
> > > > > 2016-08-04 00:48:31,990 INFO  [http-bio-7070-exec-7]
> > > > > service.QueryService:399 : Scan count for each storageContext:
> > > 12306477,
> > > > > 2016-08-04 00:48:31,991 INFO  [http-bio-7070-exec-7]
> > > > > controller.QueryController:197 : Stats of SQL response:
> isException:
> > > > false,
> > > > > duration: 56152, total scan count 12306477
> > > > > 2016-08-04 00:48:32,000 WARN  [http-bio-7070-exec-7]
> > > > > sizeof.ObjectGraphWalker:209 : The configured limit of 1,000 object
> > > > > references was reached while attempting to calculate the size of
> the
> > > > object
> > > > > graph. Severe performance degradation could occur if the sizing
> > > operation
> > > > > continues. This can be avoided by setting the CacheManger or Cache
> > > > > <sizeOfPolicy> elements maxDepthExceededBehavior to "abort" or
> adding
> > > > stop
> > > > > points with @IgnoreSizeOf annotations. If performance degradation
> is
> > > NOT
> > > > an
> > > > > issue at the configured limit, raise the limit value using the
> > > > CacheManager
> > > > > or Cache <sizeOfPolicy> elements maxDepth attribute. For more
> > > > information,
> > > > > see the Ehcache configuration documentation.
> > > > > 2016-08-04 00:48:32,091 INFO  [http-bio-7070-exec-7]
> > > > > service.QueryService:250 :
> > > > > ==========================[QUERY]===============================
> > > > > SQL: SELECT SUM(clicks) FROM hpa_reporting2 GROUP BY site_id,
> > child_id,
> > > > > search_type, hotel_id, report_date
> > > > > User: ADMIN
> > > > > Success: true
> > > > > Duration: 56.152
> > > > > Project: koddi_dev
> > > > > Realization Names: [hpa_reporting2_cube_clone]
> > > > > Cuboid Ids: [992]
> > > > > Total scan count: 12306477
> > > > > Result row count: 50000
> > > > > Accept Partial: true
> > > > > Is Partial Result: false
> > > > > Hit Exception Cache: false
> > > > > Storage cache used: false
> > > > > Message: null
> > > > > ==========================[QUERY]===============================
> > > > >
> > > > > On Wed, Aug 3, 2016 at 8:38 PM, ShaoFeng Shi <
> shaofengshi@apache.org
> > >
> > > > > wrote:
> > > > >
> > > > > > Hi Jason, could you please provide the full log since sending
> query
> > > to
> > > > > and
> > > > > > getting result back? The key information is which cuboid is used
> > for
> > > > the
> > > > > > query, cuboid exact match or fuzzy match, how many records be
> > scanned
> > > > and
> > > > > > how long it tooks; Thanks.
> > > > > >
> > > > > > 2016-08-03 23:19 GMT+08:00 Jason Hale <ja...@koddi.com>:
> > > > > >
> > > > > > > Yes, it would have to do post-aggregation in that case, but the
> > > > strange
> > > > > > > thing is that query was running fast (about 1 second), while
> > > queries
> > > > > with
> > > > > > > more dimensions, such as "SELECT SUM(clicks) FROM reporting
> GROUP
> > > BY
> > > > > > > site_id, child_id, report_date, hotel_id". This query will take
> > > about
> > > > > 106
> > > > > > > seconds, but it shouldn't need to do any post-aggregation so I
> > > would
> > > > > > think
> > > > > > > it should return much quicker than that from the respective
> > cuboid.
> > > > > > >
> > > > > > > Here's the explain plan:
> > > > > > > OLAPToEnumerableConverter
> > > > > > > OLAPProjectRel(EXPR$0=[$4])
> > > > > > > OLAPAggregateRel(group=[{0, 1, 2, 3}], EXPR$0=[SUM($4)])
> > > > > > > OLAPProjectRel(SITE_ID=[$9], CHILD_ID=[$3], REPORT_DATE=[$0],
> > > > > > > HOTEL_ID=[$2], CLICKS=[$10])
> > > > > > > OLAPTableScan(table=[[DEFAULT, HPA_REPORTING2]], fields=[[0, 1,
> > 2,
> > > 3,
> > > > > 4,
> > > > > > 5,
> > > > > > > 6, 7, 8, 9, 10, 11]])
> > > > > > >
> > > > > > > On Tue, Aug 2, 2016 at 7:46 PM, ShaoFeng Shi <
> > > shaofengshi@apache.org
> > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > > > In the cube definition, you defined "SITE_ID", "CHILD_ID" as
> > > > > > "Mandatory"
> > > > > > > > dimension, which means they will not be aggregated in cube
> > build
> > > > > phase
> > > > > > > for
> > > > > > > > all combinations.
> > > > > > > >
> > > > > > > > So when you run a query like  "SELECT SUM(clicks) FROM
> > reporting
> > > > > GROUP
> > > > > > BY
> > > > > > > > search_type", Kylin will use the combination  "SITE_ID" +
> > > > "CHILD_ID"
> > > > > +
> > > > > > > > "SEARCH_TYPE" to serve, there will be post-aggregation in
> > > runtime;
> > > > > The
> > > > > > > > performance is much depent on the cardinality of "SITE_ID"
> and
> > > > > > > "CHILD_ID".
> > > > > > > >
> > > > > > > >
> > > > > > > > 2016-08-02 23:08 GMT+08:00 Jason Hale <ja...@koddi.com>:
> > > > > > > >
> > > > > > > > > I've looked over the optimization options before, but did
> not
> > > > > notice
> > > > > > > the
> > > > > > > > > rowkey ordering. I can try this and see if this helps me.
> > This
> > > is
> > > > > the
> > > > > > > > only
> > > > > > > > > thing I see that I can attempt to optimize further in the
> > > design,
> > > > > but
> > > > > > > > I'll
> > > > > > > > > provide my cube design below. I only have one measure to
> keep
> > > it
> > > > > > > simple:
> > > > > > > > >
> > > > > > > > > {
> > > > > > > > >   "uuid": "4090b854-8f0c-4288-bd73-fc50238a6030",
> > > > > > > > >   "version": "1.5.2",
> > > > > > > > >   "name": "hpa_reporting2_cube",
> > > > > > > > >   "description": "",
> > > > > > > > >   "dimensions": [
> > > > > > > > >     {
> > > > > > > > >       "name": "DEFAULT.HPA_REPORTING2.REPORT_DATE",
> > > > > > > > >       "table": "DEFAULT.HPA_REPORTING2",
> > > > > > > > >       "column": "REPORT_DATE",
> > > > > > > > >       "derived": null
> > > > > > > > >     },
> > > > > > > > >     {
> > > > > > > > >       "name": "DEFAULT.HPA_REPORTING2.SEARCH_TYPE",
> > > > > > > > >       "table": "DEFAULT.HPA_REPORTING2",
> > > > > > > > >       "column": "SEARCH_TYPE",
> > > > > > > > >       "derived": null
> > > > > > > > >     },
> > > > > > > > >     {
> > > > > > > > >       "name": "DEFAULT.HPA_REPORTING2.HOTEL_ID",
> > > > > > > > >       "table": "DEFAULT.HPA_REPORTING2",
> > > > > > > > >       "column": "HOTEL_ID",
> > > > > > > > >       "derived": null
> > > > > > > > >     },
> > > > > > > > >     {
> > > > > > > > >       "name": "DEFAULT.HPA_REPORTING2.CHILD_ID",
> > > > > > > > >       "table": "DEFAULT.HPA_REPORTING2",
> > > > > > > > >       "column": "CHILD_ID",
> > > > > > > > >       "derived": null
> > > > > > > > >     },
> > > > > > > > >     {
> > > > > > > > >       "name": "DEFAULT.HPA_REPORTING2.COUNTRY",
> > > > > > > > >       "table": "DEFAULT.HPA_REPORTING2",
> > > > > > > > >       "column": "COUNTRY",
> > > > > > > > >       "derived": null
> > > > > > > > >     },
> > > > > > > > >     {
> > > > > > > > >       "name": "DEFAULT.HPA_REPORTING2.DEVICE_TYPE",
> > > > > > > > >       "table": "DEFAULT.HPA_REPORTING2",
> > > > > > > > >       "column": "DEVICE_TYPE",
> > > > > > > > >       "derived": null
> > > > > > > > >     },
> > > > > > > > >     {
> > > > > > > > >       "name": "DEFAULT.HPA_REPORTING2.STAY_LENGTH",
> > > > > > > > >       "table": "DEFAULT.HPA_REPORTING2",
> > > > > > > > >       "column": "STAY_LENGTH",
> > > > > > > > >       "derived": null
> > > > > > > > >     },
> > > > > > > > >     {
> > > > > > > > >       "name": "DEFAULT.HPA_REPORTING2.TRUE_RANK_AG",
> > > > > > > > >       "table": "DEFAULT.HPA_REPORTING2",
> > > > > > > > >       "column": "TRUE_RANK_AG",
> > > > > > > > >       "derived": null
> > > > > > > > >     },
> > > > > > > > >     {
> > > > > > > > >       "name": "DEFAULT.HPA_REPORTING2.ROOM_BUNDLE",
> > > > > > > > >       "table": "DEFAULT.HPA_REPORTING2",
> > > > > > > > >       "column": "ROOM_BUNDLE",
> > > > > > > > >       "derived": null
> > > > > > > > >     },
> > > > > > > > >     {
> > > > > > > > >       "name": "DEFAULT.HPA_REPORTING2.SITE_ID",
> > > > > > > > >       "table": "DEFAULT.HPA_REPORTING2",
> > > > > > > > >       "column": "SITE_ID",
> > > > > > > > >       "derived": null
> > > > > > > > >     }
> > > > > > > > >   ],
> > > > > > > > >   "measures": [
> > > > > > > > >     {
> > > > > > > > >       "name": "_COUNT_",
> > > > > > > > >       "function": {
> > > > > > > > >         "expression": "COUNT",
> > > > > > > > >         "parameter": {
> > > > > > > > >           "type": "constant",
> > > > > > > > >           "value": "1",
> > > > > > > > >           "next_parameter": null
> > > > > > > > >         },
> > > > > > > > >         "returntype": "bigint"
> > > > > > > > >       },
> > > > > > > > >       "dependent_measure_ref": null
> > > > > > > > >     },
> > > > > > > > >     {
> > > > > > > > >       "name": "CLICKS",
> > > > > > > > >       "function": {
> > > > > > > > >         "expression": "SUM",
> > > > > > > > >         "parameter": {
> > > > > > > > >           "type": "column",
> > > > > > > > >           "value": "CLICKS",
> > > > > > > > >           "next_parameter": null
> > > > > > > > >         },
> > > > > > > > >         "returntype": "decimal"
> > > > > > > > >       },
> > > > > > > > >       "dependent_measure_ref": null
> > > > > > > > >     }
> > > > > > > > >   ],
> > > > > > > > >   "rowkey": {
> > > > > > > > >     "rowkey_columns": [
> > > > > > > > >       {
> > > > > > > > >         "column": "REPORT_DATE",
> > > > > > > > >         "encoding": "dict",
> > > > > > > > >         "isShardBy": false
> > > > > > > > >       },
> > > > > > > > >       {
> > > > > > > > >         "column": "SEARCH_TYPE",
> > > > > > > > >         "encoding": "dict",
> > > > > > > > >         "isShardBy": false
> > > > > > > > >       },
> > > > > > > > >       {
> > > > > > > > >         "column": "HOTEL_ID",
> > > > > > > > >         "encoding": "dict",
> > > > > > > > >         "isShardBy": false
> > > > > > > > >       },
> > > > > > > > >       {
> > > > > > > > >         "column": "CHILD_ID",
> > > > > > > > >         "encoding": "dict",
> > > > > > > > >         "isShardBy": false
> > > > > > > > >       },
> > > > > > > > >       {
> > > > > > > > >         "column": "COUNTRY",
> > > > > > > > >         "encoding": "dict",
> > > > > > > > >         "isShardBy": false
> > > > > > > > >       },
> > > > > > > > >       {
> > > > > > > > >         "column": "DEVICE_TYPE",
> > > > > > > > >         "encoding": "dict",
> > > > > > > > >         "isShardBy": false
> > > > > > > > >       },
> > > > > > > > >       {
> > > > > > > > >         "column": "STAY_LENGTH",
> > > > > > > > >         "encoding": "dict",
> > > > > > > > >         "isShardBy": false
> > > > > > > > >       },
> > > > > > > > >       {
> > > > > > > > >         "column": "TRUE_RANK_AG",
> > > > > > > > >         "encoding": "dict",
> > > > > > > > >         "isShardBy": false
> > > > > > > > >       },
> > > > > > > > >       {
> > > > > > > > >         "column": "ROOM_BUNDLE",
> > > > > > > > >         "encoding": "dict",
> > > > > > > > >         "isShardBy": false
> > > > > > > > >       },
> > > > > > > > >       {
> > > > > > > > >         "column": "SITE_ID",
> > > > > > > > >         "encoding": "dict",
> > > > > > > > >         "isShardBy": false
> > > > > > > > >       }
> > > > > > > > >     ]
> > > > > > > > >   },
> > > > > > > > >   "signature": "KixlKWxevr6mO+UlSaR5ig==",
> > > > > > > > >   "last_modified": 1470070273935,
> > > > > > > > >   "model_name": "hpa_reporting_model2",
> > > > > > > > >   "null_string": null,
> > > > > > > > >   "hbase_mapping": {
> > > > > > > > >     "column_family": [
> > > > > > > > >       {
> > > > > > > > >         "name": "F1",
> > > > > > > > >         "columns": [
> > > > > > > > >           {
> > > > > > > > >             "qualifier": "M",
> > > > > > > > >             "measure_refs": [
> > > > > > > > >               "_COUNT_",
> > > > > > > > >               "CLICKS"
> > > > > > > > >             ]
> > > > > > > > >           }
> > > > > > > > >         ]
> > > > > > > > >       }
> > > > > > > > >     ]
> > > > > > > > >   },
> > > > > > > > >   "aggregation_groups": [
> > > > > > > > >     {
> > > > > > > > >       "includes": [
> > > > > > > > >         "REPORT_DATE",
> > > > > > > > >         "SEARCH_TYPE",
> > > > > > > > >         "HOTEL_ID",
> > > > > > > > >         "CHILD_ID",
> > > > > > > > >         "COUNTRY",
> > > > > > > > >         "DEVICE_TYPE",
> > > > > > > > >         "STAY_LENGTH",
> > > > > > > > >         "TRUE_RANK_AG",
> > > > > > > > >         "ROOM_BUNDLE",
> > > > > > > > >         "SITE_ID"
> > > > > > > > >       ],
> > > > > > > > >       "select_rule": {
> > > > > > > > >         "hierarchy_dims": [],
> > > > > > > > >         "mandatory_dims": [
> > > > > > > > >           "SITE_ID",
> > > > > > > > >           "CHILD_ID"
> > > > > > > > >         ],
> > > > > > > > >         "joint_dims": [
> > > > > > > > >           [
> > > > > > > > >             "ROOM_BUNDLE",
> > > > > > > > >             "TRUE_RANK_AG"
> > > > > > > > >           ]
> > > > > > > > >         ]
> > > > > > > > >       }
> > > > > > > > >     }
> > > > > > > > >   ],
> > > > > > > > >   "notify_list": [],
> > > > > > > > >   "status_need_notify": [
> > > > > > > > >     "ERROR",
> > > > > > > > >     "DISCARDED",
> > > > > > > > >     "SUCCEED"
> > > > > > > > >   ],
> > > > > > > > >   "partition_date_start": 0,
> > > > > > > > >   "partition_date_end": 3153600000000,
> > > > > > > > >   "auto_merge_time_ranges": [
> > > > > > > > >     604800000,
> > > > > > > > >     2419200000
> > > > > > > > >   ],
> > > > > > > > >   "retention_range": 0,
> > > > > > > > >   "engine_type": 2,
> > > > > > > > >   "storage_type": 2,
> > > > > > > > >   "override_kylin_properties": {}
> > > > > > > > > }
> > > > > > > > >
> > > > > > > > > On Mon, Aug 1, 2016 at 8:02 PM, ShaoFeng Shi <
> > > > > shaofengshi@apache.org
> > > > > > >
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi Jason,
> > > > > > > > > >
> > > > > > > > > > As Yiming mentioned, the cube design matters for the
> > > > performance
> > > > > of
> > > > > > > > both
> > > > > > > > > > build and query; please check "Optimize Cube" in the
> > document
> > > > web
> > > > > > > page
> > > > > > > > > and
> > > > > > > > > > do optimizaiton as much as possible;
> > > > > > > > > >
> > > > > > > > > > Besides, the cluster's capacity and Hadoop configuration
> is
> > > > also
> > > > > an
> > > > > > > > > > important factor; Try to identify the bottleneck and then
> > > > > optimize
> > > > > > or
> > > > > > > > add
> > > > > > > > > > capacity.
> > > > > > > > > >
> > > > > > > > > > From 1.5 Kylin ships with two cubing algorithm; the steps
> > > > "Build
> > > > > > > > > > N-Dimension Cuboid" are the legacy "Layered" cubing
> > > algorithm;
> > > > > They
> > > > > > > > will
> > > > > > > > > be
> > > > > > > > > > skipped when Kylin selects to use the new "Fast" cubing
> > > > > algorithm,
> > > > > > > > which
> > > > > > > > > is
> > > > > > > > > > the "Build Cube" step after them. Please click the hadoop
> > > link
> > > > in
> > > > > > > that
> > > > > > > > > step
> > > > > > > > > > to inspect the MR job's statistics;
> > > > > > > > > >
> > > > > > > > > > Hope this helps to some extend;
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > 2016-08-02 8:44 GMT+08:00 Yiming Liu <
> > > liuyiming.vip@gmail.com
> > > > >:
> > > > > > > > > >
> > > > > > > > > > > Hi Jason,
> > > > > > > > > > >
> > > > > > > > > > > Cube design is the performance key for Kylin, not only
> > > query,
> > > > > but
> > > > > > > > also
> > > > > > > > > > cube
> > > > > > > > > > > building process. How to select dimensions, how to
> define
> > > the
> > > > > > > > > > relationship
> > > > > > > > > > > between dimensions, how to select encode method, how to
> > > > define
> > > > > > > > measure,
> > > > > > > > > > > even how to choose the Hbase key order will have a
> > > > significant
> > > > > > > impact
> > > > > > > > > on
> > > > > > > > > > > performance.  There are quite a few wonderful documents
> > > > > > introducing
> > > > > > > > how
> > > > > > > > > > to
> > > > > > > > > > > do this, http://kylin.apache.org/docs15/ .
> > > > > > > > > > >
> > > > > > > > > > > One more thing, if you could share your cube design,
> you
> > > > would
> > > > > > get
> > > > > > > > help
> > > > > > > > > > > easier here.
> > > > > > > > > > >
> > > > > > > > > > > 2016-08-02 7:20 GMT+08:00 Jason Hale <jason@koddi.com
> >:
> > > > > > > > > > >
> > > > > > > > > > > > I'm setting up a test case for a portion of our
> > dataset,
> > > to
> > > > > > > > evaluate
> > > > > > > > > > > Kylin
> > > > > > > > > > > > and I'm not seeing the performance that I would
> expect.
> > > > > > > > > > > >
> > > > > > > > > > > > The cube building process is taking about 5-6 hours
> > with
> > > > > > > > ~69,000,000
> > > > > > > > > > > > records and 10 dimensions. I'm not sure if that's the
> > > > > expected
> > > > > > > > build
> > > > > > > > > > > time,
> > > > > > > > > > > > but the other problem is the query performance after
> > > > building
> > > > > > the
> > > > > > > > > cube.
> > > > > > > > > > > >
> > > > > > > > > > > > All queries were tested with a very simple query
> (e.g.
> > > > SELECT
> > > > > > > > > > SUM(clicks)
> > > > > > > > > > > > FROM reporting GROUP BY search_type)
> > > > > > > > > > > >
> > > > > > > > > > > > Grouping by 1 or 2 dimensions gives me very
> responsive
> > > > > queries
> > > > > > > > > (under 2
> > > > > > > > > > > > seconds), but adding more dimensions drastically
> > > increases
> > > > > the
> > > > > > > > query
> > > > > > > > > > time
> > > > > > > > > > > > (over 1 minute and it times out through hbase). I
> would
> > > > > expect
> > > > > > > > these
> > > > > > > > > > > > queries to have all similar query times since they
> > should
> > > > > query
> > > > > > > the
> > > > > > > > > > > > respective cuboid, so I'm not sure why the
> performance
> > > > would
> > > > > > > > suffer.
> > > > > > > > > I
> > > > > > > > > > > > didn't set up any special rules for the cube, but
> > during
> > > > the
> > > > > > > build
> > > > > > > > > > > process
> > > > > > > > > > > > it showed all the N-dimension cubes and the log
> simply
> > > said
> > > > > > > > > 'skipped'.
> > > > > > > > > > > >
> > > > > > > > > > > > Is there something I'm missing in the configuration?
> > > > > > > > > > > >
> > > > > > > > > > > > I have a HDP cluster with 3 nodes and 1 client node
> on
> > > > which
> > > > > > > Kylin
> > > > > > > > is
> > > > > > > > > > > > installed. Do I need to adjust the hadoop
> > configuration.
> > > > I'm
> > > > > > > using
> > > > > > > > > most
> > > > > > > > > > > of
> > > > > > > > > > > > the default HDP settings.
> > > > > > > > > > > >
> > > > > > > > > > > > What more information can I provide?
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > --
> > > > > > > > > > > With Warm regards
> > > > > > > > > > >
> > > > > > > > > > > Yiming Liu (刘一鸣)
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > Best regards,
> > > > > > > > > >
> > > > > > > > > > Shaofeng Shi
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > Best regards,
> > > > > > > >
> > > > > > > > Shaofeng Shi
> > > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Best regards,
> > > > > >
> > > > > > Shaofeng Shi
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Best regards,
> > > >
> > > > Shaofeng Shi
> > > >
> > >
> > >
> > >
> > > --
> > > Regards,
> > >
> > > *Bin Mahone | 马洪宾*
> > >
> >
>
>
>
> --
> Regards,
>
> *Bin Mahone | 马洪宾*
>

Re: Kylin Cube Performance

Posted by hongbin ma <ma...@apache.org>.
If you have a limit of 100, kylin is SUPPOSED to be far more efficient.
However there's currently a issue here that might cause limit clause being
overlooked (https://issues.apache.org/jira/browse/KYLIN-1936)

I'm working on fixing KYLIN-1936, it will be fixed in 1.5.4.

On Thu, Aug 4, 2016 at 11:36 PM, Jason Hale <ja...@koddi.com> wrote:

> True, but even if there's a limit of 100, it still has to scan all records?
> Perhaps I'm just used to how Postgres handles that as it only scans the
> necessary records, not the entire set if it's limited. I can rethink the
> way I approach it if that's the case.
>
> On Thu, Aug 4, 2016 at 10:31 AM, hongbin ma <ma...@apache.org> wrote:
>
> > Hi Jason
> >
> > As Shaofeng explained it's not reasonable to expect sub-second latency if
> > you're returning tens of millions of records. You data model is quite
> > simple and you don't have costly measure like distinct count etc., so
> kylin
> > should be performant on normal OLAP queries.
> >
> > Another advise: if the cardinality of the mandatory dimensions (CHILD_ID
> > and SITE_ID) are very high, you might isolate such dimensions into a
> > separate "aggregation group", so that 1. queries not touching these
> > dimensions can be performant 2. calculate less cuboids. Please refer to
> > http://kylin.apache.org/blog/2016/02/18/new-aggregation-group/
> >
> > On Thu, Aug 4, 2016 at 11:11 PM, ShaoFeng Shi <sh...@apache.org>
> > wrote:
> >
> > > The log is pretty clear; the cuboid is exact match, but the scan count
> is
> > > massive:
> > >
> > > Visiting hbase table KYLIN_RIK9O18H07: cuboid exact match, from 992 to
> > 992
> > > Total scan count: 12306477
> > >
> > > Please add where condition to narrow down the result set as much as
> > > possible; It doesn't make sense for an OLAP query to return millions of
> > > records;
> > >
> > > 2016-08-04 13:05 GMT+08:00 Jason Hale <ja...@koddi.com>:
> > >
> > > > Sure, see kylin.log below:
> > > >
> > > > 2016-08-04 00:47:35,839 INFO  [http-bio-7070-exec-7]
> > > > controller.QueryController:175 : The original query:  SELECT
> > SUM(clicks)
> > > > FROM hpa_reporting2 GROUP BY site_id, child_id, search_type,
> hotel_id,
> > > > report_date
> > > > 2016-08-04 00:47:35,839 INFO  [http-bio-7070-exec-7]
> > > > service.QueryService:266 : The corrected query: SELECT SUM(clicks)
> FROM
> > > > hpa_reporting2 GROUP BY site_id, child_id, search_type, hotel_id,
> > > > report_date
> > > > LIMIT 50000
> > > > 2016-08-04 00:47:35,908 INFO  [http-bio-7070-exec-7]
> > > routing.QueryRouter:48
> > > > : The project manager's reference is
> > > > org.apache.kylin.metadata.project.ProjectManager@3a3735a5
> > > > 2016-08-04 00:47:35,909 INFO  [http-bio-7070-exec-7]
> > > routing.QueryRouter:60
> > > > : Find candidates by table DEFAULT.HPA_REPORTING2 and
> > project=KODDI_DEV :
> > > > org.apache.kylin.query.routing.Candidate@51ed1b3b
> > > > 2016-08-04 00:47:35,909 INFO  [http-bio-7070-exec-7]
> > > routing.QueryRouter:49
> > > > : Applying rule: class
> > > > org.apache.kylin.query.routing.rules.RemoveUncapableRealizationsRule,
> > > > realizations before: [hpa_reporting2_cube_clone(CUBE)], realizations
> > > > after:
> > > > [hpa_reporting2_cube_clone(CUBE)]
> > > > 2016-08-04 00:47:35,910 INFO  [http-bio-7070-exec-7]
> > > routing.QueryRouter:49
> > > > : Applying rule: class
> > > > org.apache.kylin.query.routing.rules.RealizationSortRule,
> realizations
> > > > before: [hpa_reporting2_cube_clone(CUBE)], realizations after:
> > > > [hpa_reporting2_cube_clone(CUBE)]
> > > > 2016-08-04 00:47:35,910 INFO  [http-bio-7070-exec-7]
> > > routing.QueryRouter:72
> > > > : The realizations remaining: [hpa_reporting2_cube_clone(CUBE)] And
> the
> > > > final chosen one is the first one
> > > > 2016-08-04 00:47:35,975 DEBUG [http-bio-7070-exec-7]
> > > > enumerator.OLAPEnumerator:107 : query storage...
> > > > 2016-08-04 00:47:35,976 INFO  [http-bio-7070-exec-7]
> > > > v2.CubeStorageQuery:239 : exactAggregation is true
> > > > 2016-08-04 00:47:35,976 INFO  [http-bio-7070-exec-7]
> > > > v2.CubeStorageQuery:357 : Enable limit 50000
> > > > 2016-08-04 00:47:35,977 DEBUG [http-bio-7070-exec-7]
> > > > v2.CubeHBaseEndpointRPC:257 : New scanner for current segment
> > > > hpa_reporting2_cube_clone[19700101000000_20160828000000] will use
> > > > SCAN_FILTER_AGGR_CHECKMEM as endpoint's behavior
> > > > 2016-08-04 00:47:35,979 DEBUG [http-bio-7070-exec-7]
> > > > v2.CubeHBaseEndpointRPC:313 : Serialized scanRequestBytes 836 bytes,
> > > > rawScanBytesString 56 bytes
> > > > 2016-08-04 00:47:35,979 INFO  [http-bio-7070-exec-7]
> > > > v2.CubeHBaseEndpointRPC:315 : The scan 31b2dd4c for segment
> > > > hpa_reporting2_cube_clone[19700101000000_20160828000000] is as below
> > with
> > > > 1
> > > > separate raw scans, shard part of start/end key is set to 0
> > > > 2016-08-04 00:47:35,980 INFO  [http-bio-7070-exec-7]
> > v2.CubeHBaseRPC:271
> > > :
> > > > Visiting hbase table KYLIN_RIK9O18H07: cuboid exact match, from 992
> to
> > > 992
> > > > Start:
> > > > \x00\x00\x00\x00\x00\x00\x00\x00\x03\xE0\x00\x00\x00\x00\
> > > > x00\x00\x00\x00\x00
> > > > (\x00\x00\x00\x00\x00\x00\x00\x00\x03\xE0\x00\x00\x00\x00\
> > > > x00\x00\x00\x00\x00)
> > > > Stop:
> > > >  \x00\x00\x00\x00\x00\x00\x00\x00\x03\xE0\xFF\xFF\xFF\xFF\
> > > > xFF\xFF\xFF\xFF\xFF\x00
> > > > (\x00\x00\x00\x00\x00\x00\x00\x00\x03\xE0\xFF\xFF\xFF\xFF\
> > > > xFF\xFF\xFF\xFF\xFF\x00),
> > > > No Fuzzy Key
> > > > 2016-08-04 00:47:35,981 DEBUG [http-bio-7070-exec-7]
> > > > v2.CubeHBaseEndpointRPC:320 : Submitting rpc to 1 shards starting
> from
> > > > shard 2, scan range count 1
> > > > 2016-08-04 00:47:35,981 INFO  [http-bio-7070-exec-7]
> > > > v2.CubeHBaseEndpointRPC:103 : Timeout for ExpectedSizeIterator is:
> > 99000
> > > > 2016-08-04 00:47:35,981 DEBUG [http-bio-7070-exec-7]
> > > > enumerator.OLAPEnumerator:127 : return TupleIterator...
> > > > 2016-08-04 00:47:52,773 INFO  [pool-6-thread-1]
> > > v2.CubeHBaseEndpointRPC:351
> > > > : <sub-thread for GTScanRequest 31b2dd4c> Endpoint RPC returned from
> > > HTable
> > > > KYLIN_RIK9O18H07 Shard
> > > > \x4B\x59\x4C\x49\x4E\x5F\x52\x49\x4B\x39\x4F\x31\x38\x48\
> > > > x30\x37\x2C\x00\x02\x2C\x31\x34\x37\x30\x31\x35\x35\x33\
> > > > x31\x34\x39\x33\x37\x2E\x61\x33\x61\x35\x34\x37\x39\x61\
> > > > x32\x63\x37\x61\x61\x64\x30\x36\x33\x66\x30\x33\x64\x63\
> > > > x34\x65\x31\x30\x36\x33\x61\x33\x61\x37\x2E
> > > > on host: ip-10-0-0-157.ec2.internal.Total scanned row: 12306477.
> Total
> > > > filtered/aggred row: 0. Time elapsed in EP: 16562(ms). Server CPU
> > usage:
> > > > 0.24348086721950246, server physical mem left: 7.195234304E9, server
> > swap
> > > > mem left:0.0.Etc message: start latency: 15@1,agg done@13760
> ,compress
> > > > done@16562,server stats done@16562,
> > > > debugGitTag:cf4d2940b67d622eacd2ac9a913b221091a35c2e;.Normal
> Complete:
> > > > true.
> > > > 2016-08-04 00:47:54,068 DEBUG [pool-6-thread-1]
> > util.CompressionUtils:67
> > > :
> > > > Original: 46465726 bytes. Decompressed: 150553629 bytes. Time: 1294
> > > > 2016-08-04 00:48:29,303 INFO  [pool-4-thread-1]
> > > > threadpool.DefaultScheduler:106 : Job Fetcher: 0 running, 0 actual
> > > > running,
> > > > 0 ready, 12 others
> > > > 2016-08-04 00:48:31,990 INFO  [http-bio-7070-exec-7]
> > > > service.QueryService:399 : Scan count for each storageContext:
> > 12306477,
> > > > 2016-08-04 00:48:31,991 INFO  [http-bio-7070-exec-7]
> > > > controller.QueryController:197 : Stats of SQL response: isException:
> > > false,
> > > > duration: 56152, total scan count 12306477
> > > > 2016-08-04 00:48:32,000 WARN  [http-bio-7070-exec-7]
> > > > sizeof.ObjectGraphWalker:209 : The configured limit of 1,000 object
> > > > references was reached while attempting to calculate the size of the
> > > object
> > > > graph. Severe performance degradation could occur if the sizing
> > operation
> > > > continues. This can be avoided by setting the CacheManger or Cache
> > > > <sizeOfPolicy> elements maxDepthExceededBehavior to "abort" or adding
> > > stop
> > > > points with @IgnoreSizeOf annotations. If performance degradation is
> > NOT
> > > an
> > > > issue at the configured limit, raise the limit value using the
> > > CacheManager
> > > > or Cache <sizeOfPolicy> elements maxDepth attribute. For more
> > > information,
> > > > see the Ehcache configuration documentation.
> > > > 2016-08-04 00:48:32,091 INFO  [http-bio-7070-exec-7]
> > > > service.QueryService:250 :
> > > > ==========================[QUERY]===============================
> > > > SQL: SELECT SUM(clicks) FROM hpa_reporting2 GROUP BY site_id,
> child_id,
> > > > search_type, hotel_id, report_date
> > > > User: ADMIN
> > > > Success: true
> > > > Duration: 56.152
> > > > Project: koddi_dev
> > > > Realization Names: [hpa_reporting2_cube_clone]
> > > > Cuboid Ids: [992]
> > > > Total scan count: 12306477
> > > > Result row count: 50000
> > > > Accept Partial: true
> > > > Is Partial Result: false
> > > > Hit Exception Cache: false
> > > > Storage cache used: false
> > > > Message: null
> > > > ==========================[QUERY]===============================
> > > >
> > > > On Wed, Aug 3, 2016 at 8:38 PM, ShaoFeng Shi <shaofengshi@apache.org
> >
> > > > wrote:
> > > >
> > > > > Hi Jason, could you please provide the full log since sending query
> > to
> > > > and
> > > > > getting result back? The key information is which cuboid is used
> for
> > > the
> > > > > query, cuboid exact match or fuzzy match, how many records be
> scanned
> > > and
> > > > > how long it tooks; Thanks.
> > > > >
> > > > > 2016-08-03 23:19 GMT+08:00 Jason Hale <ja...@koddi.com>:
> > > > >
> > > > > > Yes, it would have to do post-aggregation in that case, but the
> > > strange
> > > > > > thing is that query was running fast (about 1 second), while
> > queries
> > > > with
> > > > > > more dimensions, such as "SELECT SUM(clicks) FROM reporting GROUP
> > BY
> > > > > > site_id, child_id, report_date, hotel_id". This query will take
> > about
> > > > 106
> > > > > > seconds, but it shouldn't need to do any post-aggregation so I
> > would
> > > > > think
> > > > > > it should return much quicker than that from the respective
> cuboid.
> > > > > >
> > > > > > Here's the explain plan:
> > > > > > OLAPToEnumerableConverter
> > > > > > OLAPProjectRel(EXPR$0=[$4])
> > > > > > OLAPAggregateRel(group=[{0, 1, 2, 3}], EXPR$0=[SUM($4)])
> > > > > > OLAPProjectRel(SITE_ID=[$9], CHILD_ID=[$3], REPORT_DATE=[$0],
> > > > > > HOTEL_ID=[$2], CLICKS=[$10])
> > > > > > OLAPTableScan(table=[[DEFAULT, HPA_REPORTING2]], fields=[[0, 1,
> 2,
> > 3,
> > > > 4,
> > > > > 5,
> > > > > > 6, 7, 8, 9, 10, 11]])
> > > > > >
> > > > > > On Tue, Aug 2, 2016 at 7:46 PM, ShaoFeng Shi <
> > shaofengshi@apache.org
> > > >
> > > > > > wrote:
> > > > > >
> > > > > > > In the cube definition, you defined "SITE_ID", "CHILD_ID" as
> > > > > "Mandatory"
> > > > > > > dimension, which means they will not be aggregated in cube
> build
> > > > phase
> > > > > > for
> > > > > > > all combinations.
> > > > > > >
> > > > > > > So when you run a query like  "SELECT SUM(clicks) FROM
> reporting
> > > > GROUP
> > > > > BY
> > > > > > > search_type", Kylin will use the combination  "SITE_ID" +
> > > "CHILD_ID"
> > > > +
> > > > > > > "SEARCH_TYPE" to serve, there will be post-aggregation in
> > runtime;
> > > > The
> > > > > > > performance is much depent on the cardinality of "SITE_ID" and
> > > > > > "CHILD_ID".
> > > > > > >
> > > > > > >
> > > > > > > 2016-08-02 23:08 GMT+08:00 Jason Hale <ja...@koddi.com>:
> > > > > > >
> > > > > > > > I've looked over the optimization options before, but did not
> > > > notice
> > > > > > the
> > > > > > > > rowkey ordering. I can try this and see if this helps me.
> This
> > is
> > > > the
> > > > > > > only
> > > > > > > > thing I see that I can attempt to optimize further in the
> > design,
> > > > but
> > > > > > > I'll
> > > > > > > > provide my cube design below. I only have one measure to keep
> > it
> > > > > > simple:
> > > > > > > >
> > > > > > > > {
> > > > > > > >   "uuid": "4090b854-8f0c-4288-bd73-fc50238a6030",
> > > > > > > >   "version": "1.5.2",
> > > > > > > >   "name": "hpa_reporting2_cube",
> > > > > > > >   "description": "",
> > > > > > > >   "dimensions": [
> > > > > > > >     {
> > > > > > > >       "name": "DEFAULT.HPA_REPORTING2.REPORT_DATE",
> > > > > > > >       "table": "DEFAULT.HPA_REPORTING2",
> > > > > > > >       "column": "REPORT_DATE",
> > > > > > > >       "derived": null
> > > > > > > >     },
> > > > > > > >     {
> > > > > > > >       "name": "DEFAULT.HPA_REPORTING2.SEARCH_TYPE",
> > > > > > > >       "table": "DEFAULT.HPA_REPORTING2",
> > > > > > > >       "column": "SEARCH_TYPE",
> > > > > > > >       "derived": null
> > > > > > > >     },
> > > > > > > >     {
> > > > > > > >       "name": "DEFAULT.HPA_REPORTING2.HOTEL_ID",
> > > > > > > >       "table": "DEFAULT.HPA_REPORTING2",
> > > > > > > >       "column": "HOTEL_ID",
> > > > > > > >       "derived": null
> > > > > > > >     },
> > > > > > > >     {
> > > > > > > >       "name": "DEFAULT.HPA_REPORTING2.CHILD_ID",
> > > > > > > >       "table": "DEFAULT.HPA_REPORTING2",
> > > > > > > >       "column": "CHILD_ID",
> > > > > > > >       "derived": null
> > > > > > > >     },
> > > > > > > >     {
> > > > > > > >       "name": "DEFAULT.HPA_REPORTING2.COUNTRY",
> > > > > > > >       "table": "DEFAULT.HPA_REPORTING2",
> > > > > > > >       "column": "COUNTRY",
> > > > > > > >       "derived": null
> > > > > > > >     },
> > > > > > > >     {
> > > > > > > >       "name": "DEFAULT.HPA_REPORTING2.DEVICE_TYPE",
> > > > > > > >       "table": "DEFAULT.HPA_REPORTING2",
> > > > > > > >       "column": "DEVICE_TYPE",
> > > > > > > >       "derived": null
> > > > > > > >     },
> > > > > > > >     {
> > > > > > > >       "name": "DEFAULT.HPA_REPORTING2.STAY_LENGTH",
> > > > > > > >       "table": "DEFAULT.HPA_REPORTING2",
> > > > > > > >       "column": "STAY_LENGTH",
> > > > > > > >       "derived": null
> > > > > > > >     },
> > > > > > > >     {
> > > > > > > >       "name": "DEFAULT.HPA_REPORTING2.TRUE_RANK_AG",
> > > > > > > >       "table": "DEFAULT.HPA_REPORTING2",
> > > > > > > >       "column": "TRUE_RANK_AG",
> > > > > > > >       "derived": null
> > > > > > > >     },
> > > > > > > >     {
> > > > > > > >       "name": "DEFAULT.HPA_REPORTING2.ROOM_BUNDLE",
> > > > > > > >       "table": "DEFAULT.HPA_REPORTING2",
> > > > > > > >       "column": "ROOM_BUNDLE",
> > > > > > > >       "derived": null
> > > > > > > >     },
> > > > > > > >     {
> > > > > > > >       "name": "DEFAULT.HPA_REPORTING2.SITE_ID",
> > > > > > > >       "table": "DEFAULT.HPA_REPORTING2",
> > > > > > > >       "column": "SITE_ID",
> > > > > > > >       "derived": null
> > > > > > > >     }
> > > > > > > >   ],
> > > > > > > >   "measures": [
> > > > > > > >     {
> > > > > > > >       "name": "_COUNT_",
> > > > > > > >       "function": {
> > > > > > > >         "expression": "COUNT",
> > > > > > > >         "parameter": {
> > > > > > > >           "type": "constant",
> > > > > > > >           "value": "1",
> > > > > > > >           "next_parameter": null
> > > > > > > >         },
> > > > > > > >         "returntype": "bigint"
> > > > > > > >       },
> > > > > > > >       "dependent_measure_ref": null
> > > > > > > >     },
> > > > > > > >     {
> > > > > > > >       "name": "CLICKS",
> > > > > > > >       "function": {
> > > > > > > >         "expression": "SUM",
> > > > > > > >         "parameter": {
> > > > > > > >           "type": "column",
> > > > > > > >           "value": "CLICKS",
> > > > > > > >           "next_parameter": null
> > > > > > > >         },
> > > > > > > >         "returntype": "decimal"
> > > > > > > >       },
> > > > > > > >       "dependent_measure_ref": null
> > > > > > > >     }
> > > > > > > >   ],
> > > > > > > >   "rowkey": {
> > > > > > > >     "rowkey_columns": [
> > > > > > > >       {
> > > > > > > >         "column": "REPORT_DATE",
> > > > > > > >         "encoding": "dict",
> > > > > > > >         "isShardBy": false
> > > > > > > >       },
> > > > > > > >       {
> > > > > > > >         "column": "SEARCH_TYPE",
> > > > > > > >         "encoding": "dict",
> > > > > > > >         "isShardBy": false
> > > > > > > >       },
> > > > > > > >       {
> > > > > > > >         "column": "HOTEL_ID",
> > > > > > > >         "encoding": "dict",
> > > > > > > >         "isShardBy": false
> > > > > > > >       },
> > > > > > > >       {
> > > > > > > >         "column": "CHILD_ID",
> > > > > > > >         "encoding": "dict",
> > > > > > > >         "isShardBy": false
> > > > > > > >       },
> > > > > > > >       {
> > > > > > > >         "column": "COUNTRY",
> > > > > > > >         "encoding": "dict",
> > > > > > > >         "isShardBy": false
> > > > > > > >       },
> > > > > > > >       {
> > > > > > > >         "column": "DEVICE_TYPE",
> > > > > > > >         "encoding": "dict",
> > > > > > > >         "isShardBy": false
> > > > > > > >       },
> > > > > > > >       {
> > > > > > > >         "column": "STAY_LENGTH",
> > > > > > > >         "encoding": "dict",
> > > > > > > >         "isShardBy": false
> > > > > > > >       },
> > > > > > > >       {
> > > > > > > >         "column": "TRUE_RANK_AG",
> > > > > > > >         "encoding": "dict",
> > > > > > > >         "isShardBy": false
> > > > > > > >       },
> > > > > > > >       {
> > > > > > > >         "column": "ROOM_BUNDLE",
> > > > > > > >         "encoding": "dict",
> > > > > > > >         "isShardBy": false
> > > > > > > >       },
> > > > > > > >       {
> > > > > > > >         "column": "SITE_ID",
> > > > > > > >         "encoding": "dict",
> > > > > > > >         "isShardBy": false
> > > > > > > >       }
> > > > > > > >     ]
> > > > > > > >   },
> > > > > > > >   "signature": "KixlKWxevr6mO+UlSaR5ig==",
> > > > > > > >   "last_modified": 1470070273935,
> > > > > > > >   "model_name": "hpa_reporting_model2",
> > > > > > > >   "null_string": null,
> > > > > > > >   "hbase_mapping": {
> > > > > > > >     "column_family": [
> > > > > > > >       {
> > > > > > > >         "name": "F1",
> > > > > > > >         "columns": [
> > > > > > > >           {
> > > > > > > >             "qualifier": "M",
> > > > > > > >             "measure_refs": [
> > > > > > > >               "_COUNT_",
> > > > > > > >               "CLICKS"
> > > > > > > >             ]
> > > > > > > >           }
> > > > > > > >         ]
> > > > > > > >       }
> > > > > > > >     ]
> > > > > > > >   },
> > > > > > > >   "aggregation_groups": [
> > > > > > > >     {
> > > > > > > >       "includes": [
> > > > > > > >         "REPORT_DATE",
> > > > > > > >         "SEARCH_TYPE",
> > > > > > > >         "HOTEL_ID",
> > > > > > > >         "CHILD_ID",
> > > > > > > >         "COUNTRY",
> > > > > > > >         "DEVICE_TYPE",
> > > > > > > >         "STAY_LENGTH",
> > > > > > > >         "TRUE_RANK_AG",
> > > > > > > >         "ROOM_BUNDLE",
> > > > > > > >         "SITE_ID"
> > > > > > > >       ],
> > > > > > > >       "select_rule": {
> > > > > > > >         "hierarchy_dims": [],
> > > > > > > >         "mandatory_dims": [
> > > > > > > >           "SITE_ID",
> > > > > > > >           "CHILD_ID"
> > > > > > > >         ],
> > > > > > > >         "joint_dims": [
> > > > > > > >           [
> > > > > > > >             "ROOM_BUNDLE",
> > > > > > > >             "TRUE_RANK_AG"
> > > > > > > >           ]
> > > > > > > >         ]
> > > > > > > >       }
> > > > > > > >     }
> > > > > > > >   ],
> > > > > > > >   "notify_list": [],
> > > > > > > >   "status_need_notify": [
> > > > > > > >     "ERROR",
> > > > > > > >     "DISCARDED",
> > > > > > > >     "SUCCEED"
> > > > > > > >   ],
> > > > > > > >   "partition_date_start": 0,
> > > > > > > >   "partition_date_end": 3153600000000,
> > > > > > > >   "auto_merge_time_ranges": [
> > > > > > > >     604800000,
> > > > > > > >     2419200000
> > > > > > > >   ],
> > > > > > > >   "retention_range": 0,
> > > > > > > >   "engine_type": 2,
> > > > > > > >   "storage_type": 2,
> > > > > > > >   "override_kylin_properties": {}
> > > > > > > > }
> > > > > > > >
> > > > > > > > On Mon, Aug 1, 2016 at 8:02 PM, ShaoFeng Shi <
> > > > shaofengshi@apache.org
> > > > > >
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi Jason,
> > > > > > > > >
> > > > > > > > > As Yiming mentioned, the cube design matters for the
> > > performance
> > > > of
> > > > > > > both
> > > > > > > > > build and query; please check "Optimize Cube" in the
> document
> > > web
> > > > > > page
> > > > > > > > and
> > > > > > > > > do optimizaiton as much as possible;
> > > > > > > > >
> > > > > > > > > Besides, the cluster's capacity and Hadoop configuration is
> > > also
> > > > an
> > > > > > > > > important factor; Try to identify the bottleneck and then
> > > > optimize
> > > > > or
> > > > > > > add
> > > > > > > > > capacity.
> > > > > > > > >
> > > > > > > > > From 1.5 Kylin ships with two cubing algorithm; the steps
> > > "Build
> > > > > > > > > N-Dimension Cuboid" are the legacy "Layered" cubing
> > algorithm;
> > > > They
> > > > > > > will
> > > > > > > > be
> > > > > > > > > skipped when Kylin selects to use the new "Fast" cubing
> > > > algorithm,
> > > > > > > which
> > > > > > > > is
> > > > > > > > > the "Build Cube" step after them. Please click the hadoop
> > link
> > > in
> > > > > > that
> > > > > > > > step
> > > > > > > > > to inspect the MR job's statistics;
> > > > > > > > >
> > > > > > > > > Hope this helps to some extend;
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > 2016-08-02 8:44 GMT+08:00 Yiming Liu <
> > liuyiming.vip@gmail.com
> > > >:
> > > > > > > > >
> > > > > > > > > > Hi Jason,
> > > > > > > > > >
> > > > > > > > > > Cube design is the performance key for Kylin, not only
> > query,
> > > > but
> > > > > > > also
> > > > > > > > > cube
> > > > > > > > > > building process. How to select dimensions, how to define
> > the
> > > > > > > > > relationship
> > > > > > > > > > between dimensions, how to select encode method, how to
> > > define
> > > > > > > measure,
> > > > > > > > > > even how to choose the Hbase key order will have a
> > > significant
> > > > > > impact
> > > > > > > > on
> > > > > > > > > > performance.  There are quite a few wonderful documents
> > > > > introducing
> > > > > > > how
> > > > > > > > > to
> > > > > > > > > > do this, http://kylin.apache.org/docs15/ .
> > > > > > > > > >
> > > > > > > > > > One more thing, if you could share your cube design, you
> > > would
> > > > > get
> > > > > > > help
> > > > > > > > > > easier here.
> > > > > > > > > >
> > > > > > > > > > 2016-08-02 7:20 GMT+08:00 Jason Hale <ja...@koddi.com>:
> > > > > > > > > >
> > > > > > > > > > > I'm setting up a test case for a portion of our
> dataset,
> > to
> > > > > > > evaluate
> > > > > > > > > > Kylin
> > > > > > > > > > > and I'm not seeing the performance that I would expect.
> > > > > > > > > > >
> > > > > > > > > > > The cube building process is taking about 5-6 hours
> with
> > > > > > > ~69,000,000
> > > > > > > > > > > records and 10 dimensions. I'm not sure if that's the
> > > > expected
> > > > > > > build
> > > > > > > > > > time,
> > > > > > > > > > > but the other problem is the query performance after
> > > building
> > > > > the
> > > > > > > > cube.
> > > > > > > > > > >
> > > > > > > > > > > All queries were tested with a very simple query (e.g.
> > > SELECT
> > > > > > > > > SUM(clicks)
> > > > > > > > > > > FROM reporting GROUP BY search_type)
> > > > > > > > > > >
> > > > > > > > > > > Grouping by 1 or 2 dimensions gives me very responsive
> > > > queries
> > > > > > > > (under 2
> > > > > > > > > > > seconds), but adding more dimensions drastically
> > increases
> > > > the
> > > > > > > query
> > > > > > > > > time
> > > > > > > > > > > (over 1 minute and it times out through hbase). I would
> > > > expect
> > > > > > > these
> > > > > > > > > > > queries to have all similar query times since they
> should
> > > > query
> > > > > > the
> > > > > > > > > > > respective cuboid, so I'm not sure why the performance
> > > would
> > > > > > > suffer.
> > > > > > > > I
> > > > > > > > > > > didn't set up any special rules for the cube, but
> during
> > > the
> > > > > > build
> > > > > > > > > > process
> > > > > > > > > > > it showed all the N-dimension cubes and the log simply
> > said
> > > > > > > > 'skipped'.
> > > > > > > > > > >
> > > > > > > > > > > Is there something I'm missing in the configuration?
> > > > > > > > > > >
> > > > > > > > > > > I have a HDP cluster with 3 nodes and 1 client node on
> > > which
> > > > > > Kylin
> > > > > > > is
> > > > > > > > > > > installed. Do I need to adjust the hadoop
> configuration.
> > > I'm
> > > > > > using
> > > > > > > > most
> > > > > > > > > > of
> > > > > > > > > > > the default HDP settings.
> > > > > > > > > > >
> > > > > > > > > > > What more information can I provide?
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > With Warm regards
> > > > > > > > > >
> > > > > > > > > > Yiming Liu (刘一鸣)
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Best regards,
> > > > > > > > >
> > > > > > > > > Shaofeng Shi
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Best regards,
> > > > > > >
> > > > > > > Shaofeng Shi
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Best regards,
> > > > >
> > > > > Shaofeng Shi
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Best regards,
> > >
> > > Shaofeng Shi
> > >
> >
> >
> >
> > --
> > Regards,
> >
> > *Bin Mahone | 马洪宾*
> >
>



-- 
Regards,

*Bin Mahone | 马洪宾*

Re: Kylin Cube Performance

Posted by Jason Hale <ja...@koddi.com>.
True, but even if there's a limit of 100, it still has to scan all records?
Perhaps I'm just used to how Postgres handles that as it only scans the
necessary records, not the entire set if it's limited. I can rethink the
way I approach it if that's the case.

On Thu, Aug 4, 2016 at 10:31 AM, hongbin ma <ma...@apache.org> wrote:

> Hi Jason
>
> As Shaofeng explained it's not reasonable to expect sub-second latency if
> you're returning tens of millions of records. You data model is quite
> simple and you don't have costly measure like distinct count etc., so kylin
> should be performant on normal OLAP queries.
>
> Another advise: if the cardinality of the mandatory dimensions (CHILD_ID
> and SITE_ID) are very high, you might isolate such dimensions into a
> separate "aggregation group", so that 1. queries not touching these
> dimensions can be performant 2. calculate less cuboids. Please refer to
> http://kylin.apache.org/blog/2016/02/18/new-aggregation-group/
>
> On Thu, Aug 4, 2016 at 11:11 PM, ShaoFeng Shi <sh...@apache.org>
> wrote:
>
> > The log is pretty clear; the cuboid is exact match, but the scan count is
> > massive:
> >
> > Visiting hbase table KYLIN_RIK9O18H07: cuboid exact match, from 992 to
> 992
> > Total scan count: 12306477
> >
> > Please add where condition to narrow down the result set as much as
> > possible; It doesn't make sense for an OLAP query to return millions of
> > records;
> >
> > 2016-08-04 13:05 GMT+08:00 Jason Hale <ja...@koddi.com>:
> >
> > > Sure, see kylin.log below:
> > >
> > > 2016-08-04 00:47:35,839 INFO  [http-bio-7070-exec-7]
> > > controller.QueryController:175 : The original query:  SELECT
> SUM(clicks)
> > > FROM hpa_reporting2 GROUP BY site_id, child_id, search_type, hotel_id,
> > > report_date
> > > 2016-08-04 00:47:35,839 INFO  [http-bio-7070-exec-7]
> > > service.QueryService:266 : The corrected query: SELECT SUM(clicks) FROM
> > > hpa_reporting2 GROUP BY site_id, child_id, search_type, hotel_id,
> > > report_date
> > > LIMIT 50000
> > > 2016-08-04 00:47:35,908 INFO  [http-bio-7070-exec-7]
> > routing.QueryRouter:48
> > > : The project manager's reference is
> > > org.apache.kylin.metadata.project.ProjectManager@3a3735a5
> > > 2016-08-04 00:47:35,909 INFO  [http-bio-7070-exec-7]
> > routing.QueryRouter:60
> > > : Find candidates by table DEFAULT.HPA_REPORTING2 and
> project=KODDI_DEV :
> > > org.apache.kylin.query.routing.Candidate@51ed1b3b
> > > 2016-08-04 00:47:35,909 INFO  [http-bio-7070-exec-7]
> > routing.QueryRouter:49
> > > : Applying rule: class
> > > org.apache.kylin.query.routing.rules.RemoveUncapableRealizationsRule,
> > > realizations before: [hpa_reporting2_cube_clone(CUBE)], realizations
> > > after:
> > > [hpa_reporting2_cube_clone(CUBE)]
> > > 2016-08-04 00:47:35,910 INFO  [http-bio-7070-exec-7]
> > routing.QueryRouter:49
> > > : Applying rule: class
> > > org.apache.kylin.query.routing.rules.RealizationSortRule, realizations
> > > before: [hpa_reporting2_cube_clone(CUBE)], realizations after:
> > > [hpa_reporting2_cube_clone(CUBE)]
> > > 2016-08-04 00:47:35,910 INFO  [http-bio-7070-exec-7]
> > routing.QueryRouter:72
> > > : The realizations remaining: [hpa_reporting2_cube_clone(CUBE)] And the
> > > final chosen one is the first one
> > > 2016-08-04 00:47:35,975 DEBUG [http-bio-7070-exec-7]
> > > enumerator.OLAPEnumerator:107 : query storage...
> > > 2016-08-04 00:47:35,976 INFO  [http-bio-7070-exec-7]
> > > v2.CubeStorageQuery:239 : exactAggregation is true
> > > 2016-08-04 00:47:35,976 INFO  [http-bio-7070-exec-7]
> > > v2.CubeStorageQuery:357 : Enable limit 50000
> > > 2016-08-04 00:47:35,977 DEBUG [http-bio-7070-exec-7]
> > > v2.CubeHBaseEndpointRPC:257 : New scanner for current segment
> > > hpa_reporting2_cube_clone[19700101000000_20160828000000] will use
> > > SCAN_FILTER_AGGR_CHECKMEM as endpoint's behavior
> > > 2016-08-04 00:47:35,979 DEBUG [http-bio-7070-exec-7]
> > > v2.CubeHBaseEndpointRPC:313 : Serialized scanRequestBytes 836 bytes,
> > > rawScanBytesString 56 bytes
> > > 2016-08-04 00:47:35,979 INFO  [http-bio-7070-exec-7]
> > > v2.CubeHBaseEndpointRPC:315 : The scan 31b2dd4c for segment
> > > hpa_reporting2_cube_clone[19700101000000_20160828000000] is as below
> with
> > > 1
> > > separate raw scans, shard part of start/end key is set to 0
> > > 2016-08-04 00:47:35,980 INFO  [http-bio-7070-exec-7]
> v2.CubeHBaseRPC:271
> > :
> > > Visiting hbase table KYLIN_RIK9O18H07: cuboid exact match, from 992 to
> > 992
> > > Start:
> > > \x00\x00\x00\x00\x00\x00\x00\x00\x03\xE0\x00\x00\x00\x00\
> > > x00\x00\x00\x00\x00
> > > (\x00\x00\x00\x00\x00\x00\x00\x00\x03\xE0\x00\x00\x00\x00\
> > > x00\x00\x00\x00\x00)
> > > Stop:
> > >  \x00\x00\x00\x00\x00\x00\x00\x00\x03\xE0\xFF\xFF\xFF\xFF\
> > > xFF\xFF\xFF\xFF\xFF\x00
> > > (\x00\x00\x00\x00\x00\x00\x00\x00\x03\xE0\xFF\xFF\xFF\xFF\
> > > xFF\xFF\xFF\xFF\xFF\x00),
> > > No Fuzzy Key
> > > 2016-08-04 00:47:35,981 DEBUG [http-bio-7070-exec-7]
> > > v2.CubeHBaseEndpointRPC:320 : Submitting rpc to 1 shards starting from
> > > shard 2, scan range count 1
> > > 2016-08-04 00:47:35,981 INFO  [http-bio-7070-exec-7]
> > > v2.CubeHBaseEndpointRPC:103 : Timeout for ExpectedSizeIterator is:
> 99000
> > > 2016-08-04 00:47:35,981 DEBUG [http-bio-7070-exec-7]
> > > enumerator.OLAPEnumerator:127 : return TupleIterator...
> > > 2016-08-04 00:47:52,773 INFO  [pool-6-thread-1]
> > v2.CubeHBaseEndpointRPC:351
> > > : <sub-thread for GTScanRequest 31b2dd4c> Endpoint RPC returned from
> > HTable
> > > KYLIN_RIK9O18H07 Shard
> > > \x4B\x59\x4C\x49\x4E\x5F\x52\x49\x4B\x39\x4F\x31\x38\x48\
> > > x30\x37\x2C\x00\x02\x2C\x31\x34\x37\x30\x31\x35\x35\x33\
> > > x31\x34\x39\x33\x37\x2E\x61\x33\x61\x35\x34\x37\x39\x61\
> > > x32\x63\x37\x61\x61\x64\x30\x36\x33\x66\x30\x33\x64\x63\
> > > x34\x65\x31\x30\x36\x33\x61\x33\x61\x37\x2E
> > > on host: ip-10-0-0-157.ec2.internal.Total scanned row: 12306477. Total
> > > filtered/aggred row: 0. Time elapsed in EP: 16562(ms). Server CPU
> usage:
> > > 0.24348086721950246, server physical mem left: 7.195234304E9, server
> swap
> > > mem left:0.0.Etc message: start latency: 15@1,agg done@13760,compress
> > > done@16562,server stats done@16562,
> > > debugGitTag:cf4d2940b67d622eacd2ac9a913b221091a35c2e;.Normal Complete:
> > > true.
> > > 2016-08-04 00:47:54,068 DEBUG [pool-6-thread-1]
> util.CompressionUtils:67
> > :
> > > Original: 46465726 bytes. Decompressed: 150553629 bytes. Time: 1294
> > > 2016-08-04 00:48:29,303 INFO  [pool-4-thread-1]
> > > threadpool.DefaultScheduler:106 : Job Fetcher: 0 running, 0 actual
> > > running,
> > > 0 ready, 12 others
> > > 2016-08-04 00:48:31,990 INFO  [http-bio-7070-exec-7]
> > > service.QueryService:399 : Scan count for each storageContext:
> 12306477,
> > > 2016-08-04 00:48:31,991 INFO  [http-bio-7070-exec-7]
> > > controller.QueryController:197 : Stats of SQL response: isException:
> > false,
> > > duration: 56152, total scan count 12306477
> > > 2016-08-04 00:48:32,000 WARN  [http-bio-7070-exec-7]
> > > sizeof.ObjectGraphWalker:209 : The configured limit of 1,000 object
> > > references was reached while attempting to calculate the size of the
> > object
> > > graph. Severe performance degradation could occur if the sizing
> operation
> > > continues. This can be avoided by setting the CacheManger or Cache
> > > <sizeOfPolicy> elements maxDepthExceededBehavior to "abort" or adding
> > stop
> > > points with @IgnoreSizeOf annotations. If performance degradation is
> NOT
> > an
> > > issue at the configured limit, raise the limit value using the
> > CacheManager
> > > or Cache <sizeOfPolicy> elements maxDepth attribute. For more
> > information,
> > > see the Ehcache configuration documentation.
> > > 2016-08-04 00:48:32,091 INFO  [http-bio-7070-exec-7]
> > > service.QueryService:250 :
> > > ==========================[QUERY]===============================
> > > SQL: SELECT SUM(clicks) FROM hpa_reporting2 GROUP BY site_id, child_id,
> > > search_type, hotel_id, report_date
> > > User: ADMIN
> > > Success: true
> > > Duration: 56.152
> > > Project: koddi_dev
> > > Realization Names: [hpa_reporting2_cube_clone]
> > > Cuboid Ids: [992]
> > > Total scan count: 12306477
> > > Result row count: 50000
> > > Accept Partial: true
> > > Is Partial Result: false
> > > Hit Exception Cache: false
> > > Storage cache used: false
> > > Message: null
> > > ==========================[QUERY]===============================
> > >
> > > On Wed, Aug 3, 2016 at 8:38 PM, ShaoFeng Shi <sh...@apache.org>
> > > wrote:
> > >
> > > > Hi Jason, could you please provide the full log since sending query
> to
> > > and
> > > > getting result back? The key information is which cuboid is used for
> > the
> > > > query, cuboid exact match or fuzzy match, how many records be scanned
> > and
> > > > how long it tooks; Thanks.
> > > >
> > > > 2016-08-03 23:19 GMT+08:00 Jason Hale <ja...@koddi.com>:
> > > >
> > > > > Yes, it would have to do post-aggregation in that case, but the
> > strange
> > > > > thing is that query was running fast (about 1 second), while
> queries
> > > with
> > > > > more dimensions, such as "SELECT SUM(clicks) FROM reporting GROUP
> BY
> > > > > site_id, child_id, report_date, hotel_id". This query will take
> about
> > > 106
> > > > > seconds, but it shouldn't need to do any post-aggregation so I
> would
> > > > think
> > > > > it should return much quicker than that from the respective cuboid.
> > > > >
> > > > > Here's the explain plan:
> > > > > OLAPToEnumerableConverter
> > > > > OLAPProjectRel(EXPR$0=[$4])
> > > > > OLAPAggregateRel(group=[{0, 1, 2, 3}], EXPR$0=[SUM($4)])
> > > > > OLAPProjectRel(SITE_ID=[$9], CHILD_ID=[$3], REPORT_DATE=[$0],
> > > > > HOTEL_ID=[$2], CLICKS=[$10])
> > > > > OLAPTableScan(table=[[DEFAULT, HPA_REPORTING2]], fields=[[0, 1, 2,
> 3,
> > > 4,
> > > > 5,
> > > > > 6, 7, 8, 9, 10, 11]])
> > > > >
> > > > > On Tue, Aug 2, 2016 at 7:46 PM, ShaoFeng Shi <
> shaofengshi@apache.org
> > >
> > > > > wrote:
> > > > >
> > > > > > In the cube definition, you defined "SITE_ID", "CHILD_ID" as
> > > > "Mandatory"
> > > > > > dimension, which means they will not be aggregated in cube build
> > > phase
> > > > > for
> > > > > > all combinations.
> > > > > >
> > > > > > So when you run a query like  "SELECT SUM(clicks) FROM reporting
> > > GROUP
> > > > BY
> > > > > > search_type", Kylin will use the combination  "SITE_ID" +
> > "CHILD_ID"
> > > +
> > > > > > "SEARCH_TYPE" to serve, there will be post-aggregation in
> runtime;
> > > The
> > > > > > performance is much depent on the cardinality of "SITE_ID" and
> > > > > "CHILD_ID".
> > > > > >
> > > > > >
> > > > > > 2016-08-02 23:08 GMT+08:00 Jason Hale <ja...@koddi.com>:
> > > > > >
> > > > > > > I've looked over the optimization options before, but did not
> > > notice
> > > > > the
> > > > > > > rowkey ordering. I can try this and see if this helps me. This
> is
> > > the
> > > > > > only
> > > > > > > thing I see that I can attempt to optimize further in the
> design,
> > > but
> > > > > > I'll
> > > > > > > provide my cube design below. I only have one measure to keep
> it
> > > > > simple:
> > > > > > >
> > > > > > > {
> > > > > > >   "uuid": "4090b854-8f0c-4288-bd73-fc50238a6030",
> > > > > > >   "version": "1.5.2",
> > > > > > >   "name": "hpa_reporting2_cube",
> > > > > > >   "description": "",
> > > > > > >   "dimensions": [
> > > > > > >     {
> > > > > > >       "name": "DEFAULT.HPA_REPORTING2.REPORT_DATE",
> > > > > > >       "table": "DEFAULT.HPA_REPORTING2",
> > > > > > >       "column": "REPORT_DATE",
> > > > > > >       "derived": null
> > > > > > >     },
> > > > > > >     {
> > > > > > >       "name": "DEFAULT.HPA_REPORTING2.SEARCH_TYPE",
> > > > > > >       "table": "DEFAULT.HPA_REPORTING2",
> > > > > > >       "column": "SEARCH_TYPE",
> > > > > > >       "derived": null
> > > > > > >     },
> > > > > > >     {
> > > > > > >       "name": "DEFAULT.HPA_REPORTING2.HOTEL_ID",
> > > > > > >       "table": "DEFAULT.HPA_REPORTING2",
> > > > > > >       "column": "HOTEL_ID",
> > > > > > >       "derived": null
> > > > > > >     },
> > > > > > >     {
> > > > > > >       "name": "DEFAULT.HPA_REPORTING2.CHILD_ID",
> > > > > > >       "table": "DEFAULT.HPA_REPORTING2",
> > > > > > >       "column": "CHILD_ID",
> > > > > > >       "derived": null
> > > > > > >     },
> > > > > > >     {
> > > > > > >       "name": "DEFAULT.HPA_REPORTING2.COUNTRY",
> > > > > > >       "table": "DEFAULT.HPA_REPORTING2",
> > > > > > >       "column": "COUNTRY",
> > > > > > >       "derived": null
> > > > > > >     },
> > > > > > >     {
> > > > > > >       "name": "DEFAULT.HPA_REPORTING2.DEVICE_TYPE",
> > > > > > >       "table": "DEFAULT.HPA_REPORTING2",
> > > > > > >       "column": "DEVICE_TYPE",
> > > > > > >       "derived": null
> > > > > > >     },
> > > > > > >     {
> > > > > > >       "name": "DEFAULT.HPA_REPORTING2.STAY_LENGTH",
> > > > > > >       "table": "DEFAULT.HPA_REPORTING2",
> > > > > > >       "column": "STAY_LENGTH",
> > > > > > >       "derived": null
> > > > > > >     },
> > > > > > >     {
> > > > > > >       "name": "DEFAULT.HPA_REPORTING2.TRUE_RANK_AG",
> > > > > > >       "table": "DEFAULT.HPA_REPORTING2",
> > > > > > >       "column": "TRUE_RANK_AG",
> > > > > > >       "derived": null
> > > > > > >     },
> > > > > > >     {
> > > > > > >       "name": "DEFAULT.HPA_REPORTING2.ROOM_BUNDLE",
> > > > > > >       "table": "DEFAULT.HPA_REPORTING2",
> > > > > > >       "column": "ROOM_BUNDLE",
> > > > > > >       "derived": null
> > > > > > >     },
> > > > > > >     {
> > > > > > >       "name": "DEFAULT.HPA_REPORTING2.SITE_ID",
> > > > > > >       "table": "DEFAULT.HPA_REPORTING2",
> > > > > > >       "column": "SITE_ID",
> > > > > > >       "derived": null
> > > > > > >     }
> > > > > > >   ],
> > > > > > >   "measures": [
> > > > > > >     {
> > > > > > >       "name": "_COUNT_",
> > > > > > >       "function": {
> > > > > > >         "expression": "COUNT",
> > > > > > >         "parameter": {
> > > > > > >           "type": "constant",
> > > > > > >           "value": "1",
> > > > > > >           "next_parameter": null
> > > > > > >         },
> > > > > > >         "returntype": "bigint"
> > > > > > >       },
> > > > > > >       "dependent_measure_ref": null
> > > > > > >     },
> > > > > > >     {
> > > > > > >       "name": "CLICKS",
> > > > > > >       "function": {
> > > > > > >         "expression": "SUM",
> > > > > > >         "parameter": {
> > > > > > >           "type": "column",
> > > > > > >           "value": "CLICKS",
> > > > > > >           "next_parameter": null
> > > > > > >         },
> > > > > > >         "returntype": "decimal"
> > > > > > >       },
> > > > > > >       "dependent_measure_ref": null
> > > > > > >     }
> > > > > > >   ],
> > > > > > >   "rowkey": {
> > > > > > >     "rowkey_columns": [
> > > > > > >       {
> > > > > > >         "column": "REPORT_DATE",
> > > > > > >         "encoding": "dict",
> > > > > > >         "isShardBy": false
> > > > > > >       },
> > > > > > >       {
> > > > > > >         "column": "SEARCH_TYPE",
> > > > > > >         "encoding": "dict",
> > > > > > >         "isShardBy": false
> > > > > > >       },
> > > > > > >       {
> > > > > > >         "column": "HOTEL_ID",
> > > > > > >         "encoding": "dict",
> > > > > > >         "isShardBy": false
> > > > > > >       },
> > > > > > >       {
> > > > > > >         "column": "CHILD_ID",
> > > > > > >         "encoding": "dict",
> > > > > > >         "isShardBy": false
> > > > > > >       },
> > > > > > >       {
> > > > > > >         "column": "COUNTRY",
> > > > > > >         "encoding": "dict",
> > > > > > >         "isShardBy": false
> > > > > > >       },
> > > > > > >       {
> > > > > > >         "column": "DEVICE_TYPE",
> > > > > > >         "encoding": "dict",
> > > > > > >         "isShardBy": false
> > > > > > >       },
> > > > > > >       {
> > > > > > >         "column": "STAY_LENGTH",
> > > > > > >         "encoding": "dict",
> > > > > > >         "isShardBy": false
> > > > > > >       },
> > > > > > >       {
> > > > > > >         "column": "TRUE_RANK_AG",
> > > > > > >         "encoding": "dict",
> > > > > > >         "isShardBy": false
> > > > > > >       },
> > > > > > >       {
> > > > > > >         "column": "ROOM_BUNDLE",
> > > > > > >         "encoding": "dict",
> > > > > > >         "isShardBy": false
> > > > > > >       },
> > > > > > >       {
> > > > > > >         "column": "SITE_ID",
> > > > > > >         "encoding": "dict",
> > > > > > >         "isShardBy": false
> > > > > > >       }
> > > > > > >     ]
> > > > > > >   },
> > > > > > >   "signature": "KixlKWxevr6mO+UlSaR5ig==",
> > > > > > >   "last_modified": 1470070273935,
> > > > > > >   "model_name": "hpa_reporting_model2",
> > > > > > >   "null_string": null,
> > > > > > >   "hbase_mapping": {
> > > > > > >     "column_family": [
> > > > > > >       {
> > > > > > >         "name": "F1",
> > > > > > >         "columns": [
> > > > > > >           {
> > > > > > >             "qualifier": "M",
> > > > > > >             "measure_refs": [
> > > > > > >               "_COUNT_",
> > > > > > >               "CLICKS"
> > > > > > >             ]
> > > > > > >           }
> > > > > > >         ]
> > > > > > >       }
> > > > > > >     ]
> > > > > > >   },
> > > > > > >   "aggregation_groups": [
> > > > > > >     {
> > > > > > >       "includes": [
> > > > > > >         "REPORT_DATE",
> > > > > > >         "SEARCH_TYPE",
> > > > > > >         "HOTEL_ID",
> > > > > > >         "CHILD_ID",
> > > > > > >         "COUNTRY",
> > > > > > >         "DEVICE_TYPE",
> > > > > > >         "STAY_LENGTH",
> > > > > > >         "TRUE_RANK_AG",
> > > > > > >         "ROOM_BUNDLE",
> > > > > > >         "SITE_ID"
> > > > > > >       ],
> > > > > > >       "select_rule": {
> > > > > > >         "hierarchy_dims": [],
> > > > > > >         "mandatory_dims": [
> > > > > > >           "SITE_ID",
> > > > > > >           "CHILD_ID"
> > > > > > >         ],
> > > > > > >         "joint_dims": [
> > > > > > >           [
> > > > > > >             "ROOM_BUNDLE",
> > > > > > >             "TRUE_RANK_AG"
> > > > > > >           ]
> > > > > > >         ]
> > > > > > >       }
> > > > > > >     }
> > > > > > >   ],
> > > > > > >   "notify_list": [],
> > > > > > >   "status_need_notify": [
> > > > > > >     "ERROR",
> > > > > > >     "DISCARDED",
> > > > > > >     "SUCCEED"
> > > > > > >   ],
> > > > > > >   "partition_date_start": 0,
> > > > > > >   "partition_date_end": 3153600000000,
> > > > > > >   "auto_merge_time_ranges": [
> > > > > > >     604800000,
> > > > > > >     2419200000
> > > > > > >   ],
> > > > > > >   "retention_range": 0,
> > > > > > >   "engine_type": 2,
> > > > > > >   "storage_type": 2,
> > > > > > >   "override_kylin_properties": {}
> > > > > > > }
> > > > > > >
> > > > > > > On Mon, Aug 1, 2016 at 8:02 PM, ShaoFeng Shi <
> > > shaofengshi@apache.org
> > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi Jason,
> > > > > > > >
> > > > > > > > As Yiming mentioned, the cube design matters for the
> > performance
> > > of
> > > > > > both
> > > > > > > > build and query; please check "Optimize Cube" in the document
> > web
> > > > > page
> > > > > > > and
> > > > > > > > do optimizaiton as much as possible;
> > > > > > > >
> > > > > > > > Besides, the cluster's capacity and Hadoop configuration is
> > also
> > > an
> > > > > > > > important factor; Try to identify the bottleneck and then
> > > optimize
> > > > or
> > > > > > add
> > > > > > > > capacity.
> > > > > > > >
> > > > > > > > From 1.5 Kylin ships with two cubing algorithm; the steps
> > "Build
> > > > > > > > N-Dimension Cuboid" are the legacy "Layered" cubing
> algorithm;
> > > They
> > > > > > will
> > > > > > > be
> > > > > > > > skipped when Kylin selects to use the new "Fast" cubing
> > > algorithm,
> > > > > > which
> > > > > > > is
> > > > > > > > the "Build Cube" step after them. Please click the hadoop
> link
> > in
> > > > > that
> > > > > > > step
> > > > > > > > to inspect the MR job's statistics;
> > > > > > > >
> > > > > > > > Hope this helps to some extend;
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > 2016-08-02 8:44 GMT+08:00 Yiming Liu <
> liuyiming.vip@gmail.com
> > >:
> > > > > > > >
> > > > > > > > > Hi Jason,
> > > > > > > > >
> > > > > > > > > Cube design is the performance key for Kylin, not only
> query,
> > > but
> > > > > > also
> > > > > > > > cube
> > > > > > > > > building process. How to select dimensions, how to define
> the
> > > > > > > > relationship
> > > > > > > > > between dimensions, how to select encode method, how to
> > define
> > > > > > measure,
> > > > > > > > > even how to choose the Hbase key order will have a
> > significant
> > > > > impact
> > > > > > > on
> > > > > > > > > performance.  There are quite a few wonderful documents
> > > > introducing
> > > > > > how
> > > > > > > > to
> > > > > > > > > do this, http://kylin.apache.org/docs15/ .
> > > > > > > > >
> > > > > > > > > One more thing, if you could share your cube design, you
> > would
> > > > get
> > > > > > help
> > > > > > > > > easier here.
> > > > > > > > >
> > > > > > > > > 2016-08-02 7:20 GMT+08:00 Jason Hale <ja...@koddi.com>:
> > > > > > > > >
> > > > > > > > > > I'm setting up a test case for a portion of our dataset,
> to
> > > > > > evaluate
> > > > > > > > > Kylin
> > > > > > > > > > and I'm not seeing the performance that I would expect.
> > > > > > > > > >
> > > > > > > > > > The cube building process is taking about 5-6 hours with
> > > > > > ~69,000,000
> > > > > > > > > > records and 10 dimensions. I'm not sure if that's the
> > > expected
> > > > > > build
> > > > > > > > > time,
> > > > > > > > > > but the other problem is the query performance after
> > building
> > > > the
> > > > > > > cube.
> > > > > > > > > >
> > > > > > > > > > All queries were tested with a very simple query (e.g.
> > SELECT
> > > > > > > > SUM(clicks)
> > > > > > > > > > FROM reporting GROUP BY search_type)
> > > > > > > > > >
> > > > > > > > > > Grouping by 1 or 2 dimensions gives me very responsive
> > > queries
> > > > > > > (under 2
> > > > > > > > > > seconds), but adding more dimensions drastically
> increases
> > > the
> > > > > > query
> > > > > > > > time
> > > > > > > > > > (over 1 minute and it times out through hbase). I would
> > > expect
> > > > > > these
> > > > > > > > > > queries to have all similar query times since they should
> > > query
> > > > > the
> > > > > > > > > > respective cuboid, so I'm not sure why the performance
> > would
> > > > > > suffer.
> > > > > > > I
> > > > > > > > > > didn't set up any special rules for the cube, but during
> > the
> > > > > build
> > > > > > > > > process
> > > > > > > > > > it showed all the N-dimension cubes and the log simply
> said
> > > > > > > 'skipped'.
> > > > > > > > > >
> > > > > > > > > > Is there something I'm missing in the configuration?
> > > > > > > > > >
> > > > > > > > > > I have a HDP cluster with 3 nodes and 1 client node on
> > which
> > > > > Kylin
> > > > > > is
> > > > > > > > > > installed. Do I need to adjust the hadoop configuration.
> > I'm
> > > > > using
> > > > > > > most
> > > > > > > > > of
> > > > > > > > > > the default HDP settings.
> > > > > > > > > >
> > > > > > > > > > What more information can I provide?
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > With Warm regards
> > > > > > > > >
> > > > > > > > > Yiming Liu (刘一鸣)
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > Best regards,
> > > > > > > >
> > > > > > > > Shaofeng Shi
> > > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Best regards,
> > > > > >
> > > > > > Shaofeng Shi
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Best regards,
> > > >
> > > > Shaofeng Shi
> > > >
> > >
> >
> >
> >
> > --
> > Best regards,
> >
> > Shaofeng Shi
> >
>
>
>
> --
> Regards,
>
> *Bin Mahone | 马洪宾*
>

Re: Kylin Cube Performance

Posted by hongbin ma <ma...@apache.org>.
Hi Jason

As Shaofeng explained it's not reasonable to expect sub-second latency if
you're returning tens of millions of records. You data model is quite
simple and you don't have costly measure like distinct count etc., so kylin
should be performant on normal OLAP queries.

Another advise: if the cardinality of the mandatory dimensions (CHILD_ID
and SITE_ID) are very high, you might isolate such dimensions into a
separate "aggregation group", so that 1. queries not touching these
dimensions can be performant 2. calculate less cuboids. Please refer to
http://kylin.apache.org/blog/2016/02/18/new-aggregation-group/

On Thu, Aug 4, 2016 at 11:11 PM, ShaoFeng Shi <sh...@apache.org>
wrote:

> The log is pretty clear; the cuboid is exact match, but the scan count is
> massive:
>
> Visiting hbase table KYLIN_RIK9O18H07: cuboid exact match, from 992 to 992
> Total scan count: 12306477
>
> Please add where condition to narrow down the result set as much as
> possible; It doesn't make sense for an OLAP query to return millions of
> records;
>
> 2016-08-04 13:05 GMT+08:00 Jason Hale <ja...@koddi.com>:
>
> > Sure, see kylin.log below:
> >
> > 2016-08-04 00:47:35,839 INFO  [http-bio-7070-exec-7]
> > controller.QueryController:175 : The original query:  SELECT SUM(clicks)
> > FROM hpa_reporting2 GROUP BY site_id, child_id, search_type, hotel_id,
> > report_date
> > 2016-08-04 00:47:35,839 INFO  [http-bio-7070-exec-7]
> > service.QueryService:266 : The corrected query: SELECT SUM(clicks) FROM
> > hpa_reporting2 GROUP BY site_id, child_id, search_type, hotel_id,
> > report_date
> > LIMIT 50000
> > 2016-08-04 00:47:35,908 INFO  [http-bio-7070-exec-7]
> routing.QueryRouter:48
> > : The project manager's reference is
> > org.apache.kylin.metadata.project.ProjectManager@3a3735a5
> > 2016-08-04 00:47:35,909 INFO  [http-bio-7070-exec-7]
> routing.QueryRouter:60
> > : Find candidates by table DEFAULT.HPA_REPORTING2 and project=KODDI_DEV :
> > org.apache.kylin.query.routing.Candidate@51ed1b3b
> > 2016-08-04 00:47:35,909 INFO  [http-bio-7070-exec-7]
> routing.QueryRouter:49
> > : Applying rule: class
> > org.apache.kylin.query.routing.rules.RemoveUncapableRealizationsRule,
> > realizations before: [hpa_reporting2_cube_clone(CUBE)], realizations
> > after:
> > [hpa_reporting2_cube_clone(CUBE)]
> > 2016-08-04 00:47:35,910 INFO  [http-bio-7070-exec-7]
> routing.QueryRouter:49
> > : Applying rule: class
> > org.apache.kylin.query.routing.rules.RealizationSortRule, realizations
> > before: [hpa_reporting2_cube_clone(CUBE)], realizations after:
> > [hpa_reporting2_cube_clone(CUBE)]
> > 2016-08-04 00:47:35,910 INFO  [http-bio-7070-exec-7]
> routing.QueryRouter:72
> > : The realizations remaining: [hpa_reporting2_cube_clone(CUBE)] And the
> > final chosen one is the first one
> > 2016-08-04 00:47:35,975 DEBUG [http-bio-7070-exec-7]
> > enumerator.OLAPEnumerator:107 : query storage...
> > 2016-08-04 00:47:35,976 INFO  [http-bio-7070-exec-7]
> > v2.CubeStorageQuery:239 : exactAggregation is true
> > 2016-08-04 00:47:35,976 INFO  [http-bio-7070-exec-7]
> > v2.CubeStorageQuery:357 : Enable limit 50000
> > 2016-08-04 00:47:35,977 DEBUG [http-bio-7070-exec-7]
> > v2.CubeHBaseEndpointRPC:257 : New scanner for current segment
> > hpa_reporting2_cube_clone[19700101000000_20160828000000] will use
> > SCAN_FILTER_AGGR_CHECKMEM as endpoint's behavior
> > 2016-08-04 00:47:35,979 DEBUG [http-bio-7070-exec-7]
> > v2.CubeHBaseEndpointRPC:313 : Serialized scanRequestBytes 836 bytes,
> > rawScanBytesString 56 bytes
> > 2016-08-04 00:47:35,979 INFO  [http-bio-7070-exec-7]
> > v2.CubeHBaseEndpointRPC:315 : The scan 31b2dd4c for segment
> > hpa_reporting2_cube_clone[19700101000000_20160828000000] is as below with
> > 1
> > separate raw scans, shard part of start/end key is set to 0
> > 2016-08-04 00:47:35,980 INFO  [http-bio-7070-exec-7] v2.CubeHBaseRPC:271
> :
> > Visiting hbase table KYLIN_RIK9O18H07: cuboid exact match, from 992 to
> 992
> > Start:
> > \x00\x00\x00\x00\x00\x00\x00\x00\x03\xE0\x00\x00\x00\x00\
> > x00\x00\x00\x00\x00
> > (\x00\x00\x00\x00\x00\x00\x00\x00\x03\xE0\x00\x00\x00\x00\
> > x00\x00\x00\x00\x00)
> > Stop:
> >  \x00\x00\x00\x00\x00\x00\x00\x00\x03\xE0\xFF\xFF\xFF\xFF\
> > xFF\xFF\xFF\xFF\xFF\x00
> > (\x00\x00\x00\x00\x00\x00\x00\x00\x03\xE0\xFF\xFF\xFF\xFF\
> > xFF\xFF\xFF\xFF\xFF\x00),
> > No Fuzzy Key
> > 2016-08-04 00:47:35,981 DEBUG [http-bio-7070-exec-7]
> > v2.CubeHBaseEndpointRPC:320 : Submitting rpc to 1 shards starting from
> > shard 2, scan range count 1
> > 2016-08-04 00:47:35,981 INFO  [http-bio-7070-exec-7]
> > v2.CubeHBaseEndpointRPC:103 : Timeout for ExpectedSizeIterator is: 99000
> > 2016-08-04 00:47:35,981 DEBUG [http-bio-7070-exec-7]
> > enumerator.OLAPEnumerator:127 : return TupleIterator...
> > 2016-08-04 00:47:52,773 INFO  [pool-6-thread-1]
> v2.CubeHBaseEndpointRPC:351
> > : <sub-thread for GTScanRequest 31b2dd4c> Endpoint RPC returned from
> HTable
> > KYLIN_RIK9O18H07 Shard
> > \x4B\x59\x4C\x49\x4E\x5F\x52\x49\x4B\x39\x4F\x31\x38\x48\
> > x30\x37\x2C\x00\x02\x2C\x31\x34\x37\x30\x31\x35\x35\x33\
> > x31\x34\x39\x33\x37\x2E\x61\x33\x61\x35\x34\x37\x39\x61\
> > x32\x63\x37\x61\x61\x64\x30\x36\x33\x66\x30\x33\x64\x63\
> > x34\x65\x31\x30\x36\x33\x61\x33\x61\x37\x2E
> > on host: ip-10-0-0-157.ec2.internal.Total scanned row: 12306477. Total
> > filtered/aggred row: 0. Time elapsed in EP: 16562(ms). Server CPU usage:
> > 0.24348086721950246, server physical mem left: 7.195234304E9, server swap
> > mem left:0.0.Etc message: start latency: 15@1,agg done@13760,compress
> > done@16562,server stats done@16562,
> > debugGitTag:cf4d2940b67d622eacd2ac9a913b221091a35c2e;.Normal Complete:
> > true.
> > 2016-08-04 00:47:54,068 DEBUG [pool-6-thread-1] util.CompressionUtils:67
> :
> > Original: 46465726 bytes. Decompressed: 150553629 bytes. Time: 1294
> > 2016-08-04 00:48:29,303 INFO  [pool-4-thread-1]
> > threadpool.DefaultScheduler:106 : Job Fetcher: 0 running, 0 actual
> > running,
> > 0 ready, 12 others
> > 2016-08-04 00:48:31,990 INFO  [http-bio-7070-exec-7]
> > service.QueryService:399 : Scan count for each storageContext: 12306477,
> > 2016-08-04 00:48:31,991 INFO  [http-bio-7070-exec-7]
> > controller.QueryController:197 : Stats of SQL response: isException:
> false,
> > duration: 56152, total scan count 12306477
> > 2016-08-04 00:48:32,000 WARN  [http-bio-7070-exec-7]
> > sizeof.ObjectGraphWalker:209 : The configured limit of 1,000 object
> > references was reached while attempting to calculate the size of the
> object
> > graph. Severe performance degradation could occur if the sizing operation
> > continues. This can be avoided by setting the CacheManger or Cache
> > <sizeOfPolicy> elements maxDepthExceededBehavior to "abort" or adding
> stop
> > points with @IgnoreSizeOf annotations. If performance degradation is NOT
> an
> > issue at the configured limit, raise the limit value using the
> CacheManager
> > or Cache <sizeOfPolicy> elements maxDepth attribute. For more
> information,
> > see the Ehcache configuration documentation.
> > 2016-08-04 00:48:32,091 INFO  [http-bio-7070-exec-7]
> > service.QueryService:250 :
> > ==========================[QUERY]===============================
> > SQL: SELECT SUM(clicks) FROM hpa_reporting2 GROUP BY site_id, child_id,
> > search_type, hotel_id, report_date
> > User: ADMIN
> > Success: true
> > Duration: 56.152
> > Project: koddi_dev
> > Realization Names: [hpa_reporting2_cube_clone]
> > Cuboid Ids: [992]
> > Total scan count: 12306477
> > Result row count: 50000
> > Accept Partial: true
> > Is Partial Result: false
> > Hit Exception Cache: false
> > Storage cache used: false
> > Message: null
> > ==========================[QUERY]===============================
> >
> > On Wed, Aug 3, 2016 at 8:38 PM, ShaoFeng Shi <sh...@apache.org>
> > wrote:
> >
> > > Hi Jason, could you please provide the full log since sending query to
> > and
> > > getting result back? The key information is which cuboid is used for
> the
> > > query, cuboid exact match or fuzzy match, how many records be scanned
> and
> > > how long it tooks; Thanks.
> > >
> > > 2016-08-03 23:19 GMT+08:00 Jason Hale <ja...@koddi.com>:
> > >
> > > > Yes, it would have to do post-aggregation in that case, but the
> strange
> > > > thing is that query was running fast (about 1 second), while queries
> > with
> > > > more dimensions, such as "SELECT SUM(clicks) FROM reporting GROUP BY
> > > > site_id, child_id, report_date, hotel_id". This query will take about
> > 106
> > > > seconds, but it shouldn't need to do any post-aggregation so I would
> > > think
> > > > it should return much quicker than that from the respective cuboid.
> > > >
> > > > Here's the explain plan:
> > > > OLAPToEnumerableConverter
> > > > OLAPProjectRel(EXPR$0=[$4])
> > > > OLAPAggregateRel(group=[{0, 1, 2, 3}], EXPR$0=[SUM($4)])
> > > > OLAPProjectRel(SITE_ID=[$9], CHILD_ID=[$3], REPORT_DATE=[$0],
> > > > HOTEL_ID=[$2], CLICKS=[$10])
> > > > OLAPTableScan(table=[[DEFAULT, HPA_REPORTING2]], fields=[[0, 1, 2, 3,
> > 4,
> > > 5,
> > > > 6, 7, 8, 9, 10, 11]])
> > > >
> > > > On Tue, Aug 2, 2016 at 7:46 PM, ShaoFeng Shi <shaofengshi@apache.org
> >
> > > > wrote:
> > > >
> > > > > In the cube definition, you defined "SITE_ID", "CHILD_ID" as
> > > "Mandatory"
> > > > > dimension, which means they will not be aggregated in cube build
> > phase
> > > > for
> > > > > all combinations.
> > > > >
> > > > > So when you run a query like  "SELECT SUM(clicks) FROM reporting
> > GROUP
> > > BY
> > > > > search_type", Kylin will use the combination  "SITE_ID" +
> "CHILD_ID"
> > +
> > > > > "SEARCH_TYPE" to serve, there will be post-aggregation in runtime;
> > The
> > > > > performance is much depent on the cardinality of "SITE_ID" and
> > > > "CHILD_ID".
> > > > >
> > > > >
> > > > > 2016-08-02 23:08 GMT+08:00 Jason Hale <ja...@koddi.com>:
> > > > >
> > > > > > I've looked over the optimization options before, but did not
> > notice
> > > > the
> > > > > > rowkey ordering. I can try this and see if this helps me. This is
> > the
> > > > > only
> > > > > > thing I see that I can attempt to optimize further in the design,
> > but
> > > > > I'll
> > > > > > provide my cube design below. I only have one measure to keep it
> > > > simple:
> > > > > >
> > > > > > {
> > > > > >   "uuid": "4090b854-8f0c-4288-bd73-fc50238a6030",
> > > > > >   "version": "1.5.2",
> > > > > >   "name": "hpa_reporting2_cube",
> > > > > >   "description": "",
> > > > > >   "dimensions": [
> > > > > >     {
> > > > > >       "name": "DEFAULT.HPA_REPORTING2.REPORT_DATE",
> > > > > >       "table": "DEFAULT.HPA_REPORTING2",
> > > > > >       "column": "REPORT_DATE",
> > > > > >       "derived": null
> > > > > >     },
> > > > > >     {
> > > > > >       "name": "DEFAULT.HPA_REPORTING2.SEARCH_TYPE",
> > > > > >       "table": "DEFAULT.HPA_REPORTING2",
> > > > > >       "column": "SEARCH_TYPE",
> > > > > >       "derived": null
> > > > > >     },
> > > > > >     {
> > > > > >       "name": "DEFAULT.HPA_REPORTING2.HOTEL_ID",
> > > > > >       "table": "DEFAULT.HPA_REPORTING2",
> > > > > >       "column": "HOTEL_ID",
> > > > > >       "derived": null
> > > > > >     },
> > > > > >     {
> > > > > >       "name": "DEFAULT.HPA_REPORTING2.CHILD_ID",
> > > > > >       "table": "DEFAULT.HPA_REPORTING2",
> > > > > >       "column": "CHILD_ID",
> > > > > >       "derived": null
> > > > > >     },
> > > > > >     {
> > > > > >       "name": "DEFAULT.HPA_REPORTING2.COUNTRY",
> > > > > >       "table": "DEFAULT.HPA_REPORTING2",
> > > > > >       "column": "COUNTRY",
> > > > > >       "derived": null
> > > > > >     },
> > > > > >     {
> > > > > >       "name": "DEFAULT.HPA_REPORTING2.DEVICE_TYPE",
> > > > > >       "table": "DEFAULT.HPA_REPORTING2",
> > > > > >       "column": "DEVICE_TYPE",
> > > > > >       "derived": null
> > > > > >     },
> > > > > >     {
> > > > > >       "name": "DEFAULT.HPA_REPORTING2.STAY_LENGTH",
> > > > > >       "table": "DEFAULT.HPA_REPORTING2",
> > > > > >       "column": "STAY_LENGTH",
> > > > > >       "derived": null
> > > > > >     },
> > > > > >     {
> > > > > >       "name": "DEFAULT.HPA_REPORTING2.TRUE_RANK_AG",
> > > > > >       "table": "DEFAULT.HPA_REPORTING2",
> > > > > >       "column": "TRUE_RANK_AG",
> > > > > >       "derived": null
> > > > > >     },
> > > > > >     {
> > > > > >       "name": "DEFAULT.HPA_REPORTING2.ROOM_BUNDLE",
> > > > > >       "table": "DEFAULT.HPA_REPORTING2",
> > > > > >       "column": "ROOM_BUNDLE",
> > > > > >       "derived": null
> > > > > >     },
> > > > > >     {
> > > > > >       "name": "DEFAULT.HPA_REPORTING2.SITE_ID",
> > > > > >       "table": "DEFAULT.HPA_REPORTING2",
> > > > > >       "column": "SITE_ID",
> > > > > >       "derived": null
> > > > > >     }
> > > > > >   ],
> > > > > >   "measures": [
> > > > > >     {
> > > > > >       "name": "_COUNT_",
> > > > > >       "function": {
> > > > > >         "expression": "COUNT",
> > > > > >         "parameter": {
> > > > > >           "type": "constant",
> > > > > >           "value": "1",
> > > > > >           "next_parameter": null
> > > > > >         },
> > > > > >         "returntype": "bigint"
> > > > > >       },
> > > > > >       "dependent_measure_ref": null
> > > > > >     },
> > > > > >     {
> > > > > >       "name": "CLICKS",
> > > > > >       "function": {
> > > > > >         "expression": "SUM",
> > > > > >         "parameter": {
> > > > > >           "type": "column",
> > > > > >           "value": "CLICKS",
> > > > > >           "next_parameter": null
> > > > > >         },
> > > > > >         "returntype": "decimal"
> > > > > >       },
> > > > > >       "dependent_measure_ref": null
> > > > > >     }
> > > > > >   ],
> > > > > >   "rowkey": {
> > > > > >     "rowkey_columns": [
> > > > > >       {
> > > > > >         "column": "REPORT_DATE",
> > > > > >         "encoding": "dict",
> > > > > >         "isShardBy": false
> > > > > >       },
> > > > > >       {
> > > > > >         "column": "SEARCH_TYPE",
> > > > > >         "encoding": "dict",
> > > > > >         "isShardBy": false
> > > > > >       },
> > > > > >       {
> > > > > >         "column": "HOTEL_ID",
> > > > > >         "encoding": "dict",
> > > > > >         "isShardBy": false
> > > > > >       },
> > > > > >       {
> > > > > >         "column": "CHILD_ID",
> > > > > >         "encoding": "dict",
> > > > > >         "isShardBy": false
> > > > > >       },
> > > > > >       {
> > > > > >         "column": "COUNTRY",
> > > > > >         "encoding": "dict",
> > > > > >         "isShardBy": false
> > > > > >       },
> > > > > >       {
> > > > > >         "column": "DEVICE_TYPE",
> > > > > >         "encoding": "dict",
> > > > > >         "isShardBy": false
> > > > > >       },
> > > > > >       {
> > > > > >         "column": "STAY_LENGTH",
> > > > > >         "encoding": "dict",
> > > > > >         "isShardBy": false
> > > > > >       },
> > > > > >       {
> > > > > >         "column": "TRUE_RANK_AG",
> > > > > >         "encoding": "dict",
> > > > > >         "isShardBy": false
> > > > > >       },
> > > > > >       {
> > > > > >         "column": "ROOM_BUNDLE",
> > > > > >         "encoding": "dict",
> > > > > >         "isShardBy": false
> > > > > >       },
> > > > > >       {
> > > > > >         "column": "SITE_ID",
> > > > > >         "encoding": "dict",
> > > > > >         "isShardBy": false
> > > > > >       }
> > > > > >     ]
> > > > > >   },
> > > > > >   "signature": "KixlKWxevr6mO+UlSaR5ig==",
> > > > > >   "last_modified": 1470070273935,
> > > > > >   "model_name": "hpa_reporting_model2",
> > > > > >   "null_string": null,
> > > > > >   "hbase_mapping": {
> > > > > >     "column_family": [
> > > > > >       {
> > > > > >         "name": "F1",
> > > > > >         "columns": [
> > > > > >           {
> > > > > >             "qualifier": "M",
> > > > > >             "measure_refs": [
> > > > > >               "_COUNT_",
> > > > > >               "CLICKS"
> > > > > >             ]
> > > > > >           }
> > > > > >         ]
> > > > > >       }
> > > > > >     ]
> > > > > >   },
> > > > > >   "aggregation_groups": [
> > > > > >     {
> > > > > >       "includes": [
> > > > > >         "REPORT_DATE",
> > > > > >         "SEARCH_TYPE",
> > > > > >         "HOTEL_ID",
> > > > > >         "CHILD_ID",
> > > > > >         "COUNTRY",
> > > > > >         "DEVICE_TYPE",
> > > > > >         "STAY_LENGTH",
> > > > > >         "TRUE_RANK_AG",
> > > > > >         "ROOM_BUNDLE",
> > > > > >         "SITE_ID"
> > > > > >       ],
> > > > > >       "select_rule": {
> > > > > >         "hierarchy_dims": [],
> > > > > >         "mandatory_dims": [
> > > > > >           "SITE_ID",
> > > > > >           "CHILD_ID"
> > > > > >         ],
> > > > > >         "joint_dims": [
> > > > > >           [
> > > > > >             "ROOM_BUNDLE",
> > > > > >             "TRUE_RANK_AG"
> > > > > >           ]
> > > > > >         ]
> > > > > >       }
> > > > > >     }
> > > > > >   ],
> > > > > >   "notify_list": [],
> > > > > >   "status_need_notify": [
> > > > > >     "ERROR",
> > > > > >     "DISCARDED",
> > > > > >     "SUCCEED"
> > > > > >   ],
> > > > > >   "partition_date_start": 0,
> > > > > >   "partition_date_end": 3153600000000,
> > > > > >   "auto_merge_time_ranges": [
> > > > > >     604800000,
> > > > > >     2419200000
> > > > > >   ],
> > > > > >   "retention_range": 0,
> > > > > >   "engine_type": 2,
> > > > > >   "storage_type": 2,
> > > > > >   "override_kylin_properties": {}
> > > > > > }
> > > > > >
> > > > > > On Mon, Aug 1, 2016 at 8:02 PM, ShaoFeng Shi <
> > shaofengshi@apache.org
> > > >
> > > > > > wrote:
> > > > > >
> > > > > > > Hi Jason,
> > > > > > >
> > > > > > > As Yiming mentioned, the cube design matters for the
> performance
> > of
> > > > > both
> > > > > > > build and query; please check "Optimize Cube" in the document
> web
> > > > page
> > > > > > and
> > > > > > > do optimizaiton as much as possible;
> > > > > > >
> > > > > > > Besides, the cluster's capacity and Hadoop configuration is
> also
> > an
> > > > > > > important factor; Try to identify the bottleneck and then
> > optimize
> > > or
> > > > > add
> > > > > > > capacity.
> > > > > > >
> > > > > > > From 1.5 Kylin ships with two cubing algorithm; the steps
> "Build
> > > > > > > N-Dimension Cuboid" are the legacy "Layered" cubing algorithm;
> > They
> > > > > will
> > > > > > be
> > > > > > > skipped when Kylin selects to use the new "Fast" cubing
> > algorithm,
> > > > > which
> > > > > > is
> > > > > > > the "Build Cube" step after them. Please click the hadoop link
> in
> > > > that
> > > > > > step
> > > > > > > to inspect the MR job's statistics;
> > > > > > >
> > > > > > > Hope this helps to some extend;
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > 2016-08-02 8:44 GMT+08:00 Yiming Liu <liuyiming.vip@gmail.com
> >:
> > > > > > >
> > > > > > > > Hi Jason,
> > > > > > > >
> > > > > > > > Cube design is the performance key for Kylin, not only query,
> > but
> > > > > also
> > > > > > > cube
> > > > > > > > building process. How to select dimensions, how to define the
> > > > > > > relationship
> > > > > > > > between dimensions, how to select encode method, how to
> define
> > > > > measure,
> > > > > > > > even how to choose the Hbase key order will have a
> significant
> > > > impact
> > > > > > on
> > > > > > > > performance.  There are quite a few wonderful documents
> > > introducing
> > > > > how
> > > > > > > to
> > > > > > > > do this, http://kylin.apache.org/docs15/ .
> > > > > > > >
> > > > > > > > One more thing, if you could share your cube design, you
> would
> > > get
> > > > > help
> > > > > > > > easier here.
> > > > > > > >
> > > > > > > > 2016-08-02 7:20 GMT+08:00 Jason Hale <ja...@koddi.com>:
> > > > > > > >
> > > > > > > > > I'm setting up a test case for a portion of our dataset, to
> > > > > evaluate
> > > > > > > > Kylin
> > > > > > > > > and I'm not seeing the performance that I would expect.
> > > > > > > > >
> > > > > > > > > The cube building process is taking about 5-6 hours with
> > > > > ~69,000,000
> > > > > > > > > records and 10 dimensions. I'm not sure if that's the
> > expected
> > > > > build
> > > > > > > > time,
> > > > > > > > > but the other problem is the query performance after
> building
> > > the
> > > > > > cube.
> > > > > > > > >
> > > > > > > > > All queries were tested with a very simple query (e.g.
> SELECT
> > > > > > > SUM(clicks)
> > > > > > > > > FROM reporting GROUP BY search_type)
> > > > > > > > >
> > > > > > > > > Grouping by 1 or 2 dimensions gives me very responsive
> > queries
> > > > > > (under 2
> > > > > > > > > seconds), but adding more dimensions drastically increases
> > the
> > > > > query
> > > > > > > time
> > > > > > > > > (over 1 minute and it times out through hbase). I would
> > expect
> > > > > these
> > > > > > > > > queries to have all similar query times since they should
> > query
> > > > the
> > > > > > > > > respective cuboid, so I'm not sure why the performance
> would
> > > > > suffer.
> > > > > > I
> > > > > > > > > didn't set up any special rules for the cube, but during
> the
> > > > build
> > > > > > > > process
> > > > > > > > > it showed all the N-dimension cubes and the log simply said
> > > > > > 'skipped'.
> > > > > > > > >
> > > > > > > > > Is there something I'm missing in the configuration?
> > > > > > > > >
> > > > > > > > > I have a HDP cluster with 3 nodes and 1 client node on
> which
> > > > Kylin
> > > > > is
> > > > > > > > > installed. Do I need to adjust the hadoop configuration.
> I'm
> > > > using
> > > > > > most
> > > > > > > > of
> > > > > > > > > the default HDP settings.
> > > > > > > > >
> > > > > > > > > What more information can I provide?
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > With Warm regards
> > > > > > > >
> > > > > > > > Yiming Liu (刘一鸣)
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Best regards,
> > > > > > >
> > > > > > > Shaofeng Shi
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Best regards,
> > > > >
> > > > > Shaofeng Shi
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Best regards,
> > >
> > > Shaofeng Shi
> > >
> >
>
>
>
> --
> Best regards,
>
> Shaofeng Shi
>



-- 
Regards,

*Bin Mahone | 马洪宾*

Re: Kylin Cube Performance

Posted by ShaoFeng Shi <sh...@apache.org>.
The log is pretty clear; the cuboid is exact match, but the scan count is
massive:

Visiting hbase table KYLIN_RIK9O18H07: cuboid exact match, from 992 to 992
Total scan count: 12306477

Please add where condition to narrow down the result set as much as
possible; It doesn't make sense for an OLAP query to return millions of
records;

2016-08-04 13:05 GMT+08:00 Jason Hale <ja...@koddi.com>:

> Sure, see kylin.log below:
>
> 2016-08-04 00:47:35,839 INFO  [http-bio-7070-exec-7]
> controller.QueryController:175 : The original query:  SELECT SUM(clicks)
> FROM hpa_reporting2 GROUP BY site_id, child_id, search_type, hotel_id,
> report_date
> 2016-08-04 00:47:35,839 INFO  [http-bio-7070-exec-7]
> service.QueryService:266 : The corrected query: SELECT SUM(clicks) FROM
> hpa_reporting2 GROUP BY site_id, child_id, search_type, hotel_id,
> report_date
> LIMIT 50000
> 2016-08-04 00:47:35,908 INFO  [http-bio-7070-exec-7] routing.QueryRouter:48
> : The project manager's reference is
> org.apache.kylin.metadata.project.ProjectManager@3a3735a5
> 2016-08-04 00:47:35,909 INFO  [http-bio-7070-exec-7] routing.QueryRouter:60
> : Find candidates by table DEFAULT.HPA_REPORTING2 and project=KODDI_DEV :
> org.apache.kylin.query.routing.Candidate@51ed1b3b
> 2016-08-04 00:47:35,909 INFO  [http-bio-7070-exec-7] routing.QueryRouter:49
> : Applying rule: class
> org.apache.kylin.query.routing.rules.RemoveUncapableRealizationsRule,
> realizations before: [hpa_reporting2_cube_clone(CUBE)], realizations
> after:
> [hpa_reporting2_cube_clone(CUBE)]
> 2016-08-04 00:47:35,910 INFO  [http-bio-7070-exec-7] routing.QueryRouter:49
> : Applying rule: class
> org.apache.kylin.query.routing.rules.RealizationSortRule, realizations
> before: [hpa_reporting2_cube_clone(CUBE)], realizations after:
> [hpa_reporting2_cube_clone(CUBE)]
> 2016-08-04 00:47:35,910 INFO  [http-bio-7070-exec-7] routing.QueryRouter:72
> : The realizations remaining: [hpa_reporting2_cube_clone(CUBE)] And the
> final chosen one is the first one
> 2016-08-04 00:47:35,975 DEBUG [http-bio-7070-exec-7]
> enumerator.OLAPEnumerator:107 : query storage...
> 2016-08-04 00:47:35,976 INFO  [http-bio-7070-exec-7]
> v2.CubeStorageQuery:239 : exactAggregation is true
> 2016-08-04 00:47:35,976 INFO  [http-bio-7070-exec-7]
> v2.CubeStorageQuery:357 : Enable limit 50000
> 2016-08-04 00:47:35,977 DEBUG [http-bio-7070-exec-7]
> v2.CubeHBaseEndpointRPC:257 : New scanner for current segment
> hpa_reporting2_cube_clone[19700101000000_20160828000000] will use
> SCAN_FILTER_AGGR_CHECKMEM as endpoint's behavior
> 2016-08-04 00:47:35,979 DEBUG [http-bio-7070-exec-7]
> v2.CubeHBaseEndpointRPC:313 : Serialized scanRequestBytes 836 bytes,
> rawScanBytesString 56 bytes
> 2016-08-04 00:47:35,979 INFO  [http-bio-7070-exec-7]
> v2.CubeHBaseEndpointRPC:315 : The scan 31b2dd4c for segment
> hpa_reporting2_cube_clone[19700101000000_20160828000000] is as below with
> 1
> separate raw scans, shard part of start/end key is set to 0
> 2016-08-04 00:47:35,980 INFO  [http-bio-7070-exec-7] v2.CubeHBaseRPC:271 :
> Visiting hbase table KYLIN_RIK9O18H07: cuboid exact match, from 992 to 992
> Start:
> \x00\x00\x00\x00\x00\x00\x00\x00\x03\xE0\x00\x00\x00\x00\
> x00\x00\x00\x00\x00
> (\x00\x00\x00\x00\x00\x00\x00\x00\x03\xE0\x00\x00\x00\x00\
> x00\x00\x00\x00\x00)
> Stop:
>  \x00\x00\x00\x00\x00\x00\x00\x00\x03\xE0\xFF\xFF\xFF\xFF\
> xFF\xFF\xFF\xFF\xFF\x00
> (\x00\x00\x00\x00\x00\x00\x00\x00\x03\xE0\xFF\xFF\xFF\xFF\
> xFF\xFF\xFF\xFF\xFF\x00),
> No Fuzzy Key
> 2016-08-04 00:47:35,981 DEBUG [http-bio-7070-exec-7]
> v2.CubeHBaseEndpointRPC:320 : Submitting rpc to 1 shards starting from
> shard 2, scan range count 1
> 2016-08-04 00:47:35,981 INFO  [http-bio-7070-exec-7]
> v2.CubeHBaseEndpointRPC:103 : Timeout for ExpectedSizeIterator is: 99000
> 2016-08-04 00:47:35,981 DEBUG [http-bio-7070-exec-7]
> enumerator.OLAPEnumerator:127 : return TupleIterator...
> 2016-08-04 00:47:52,773 INFO  [pool-6-thread-1] v2.CubeHBaseEndpointRPC:351
> : <sub-thread for GTScanRequest 31b2dd4c> Endpoint RPC returned from HTable
> KYLIN_RIK9O18H07 Shard
> \x4B\x59\x4C\x49\x4E\x5F\x52\x49\x4B\x39\x4F\x31\x38\x48\
> x30\x37\x2C\x00\x02\x2C\x31\x34\x37\x30\x31\x35\x35\x33\
> x31\x34\x39\x33\x37\x2E\x61\x33\x61\x35\x34\x37\x39\x61\
> x32\x63\x37\x61\x61\x64\x30\x36\x33\x66\x30\x33\x64\x63\
> x34\x65\x31\x30\x36\x33\x61\x33\x61\x37\x2E
> on host: ip-10-0-0-157.ec2.internal.Total scanned row: 12306477. Total
> filtered/aggred row: 0. Time elapsed in EP: 16562(ms). Server CPU usage:
> 0.24348086721950246, server physical mem left: 7.195234304E9, server swap
> mem left:0.0.Etc message: start latency: 15@1,agg done@13760,compress
> done@16562,server stats done@16562,
> debugGitTag:cf4d2940b67d622eacd2ac9a913b221091a35c2e;.Normal Complete:
> true.
> 2016-08-04 00:47:54,068 DEBUG [pool-6-thread-1] util.CompressionUtils:67 :
> Original: 46465726 bytes. Decompressed: 150553629 bytes. Time: 1294
> 2016-08-04 00:48:29,303 INFO  [pool-4-thread-1]
> threadpool.DefaultScheduler:106 : Job Fetcher: 0 running, 0 actual
> running,
> 0 ready, 12 others
> 2016-08-04 00:48:31,990 INFO  [http-bio-7070-exec-7]
> service.QueryService:399 : Scan count for each storageContext: 12306477,
> 2016-08-04 00:48:31,991 INFO  [http-bio-7070-exec-7]
> controller.QueryController:197 : Stats of SQL response: isException: false,
> duration: 56152, total scan count 12306477
> 2016-08-04 00:48:32,000 WARN  [http-bio-7070-exec-7]
> sizeof.ObjectGraphWalker:209 : The configured limit of 1,000 object
> references was reached while attempting to calculate the size of the object
> graph. Severe performance degradation could occur if the sizing operation
> continues. This can be avoided by setting the CacheManger or Cache
> <sizeOfPolicy> elements maxDepthExceededBehavior to "abort" or adding stop
> points with @IgnoreSizeOf annotations. If performance degradation is NOT an
> issue at the configured limit, raise the limit value using the CacheManager
> or Cache <sizeOfPolicy> elements maxDepth attribute. For more information,
> see the Ehcache configuration documentation.
> 2016-08-04 00:48:32,091 INFO  [http-bio-7070-exec-7]
> service.QueryService:250 :
> ==========================[QUERY]===============================
> SQL: SELECT SUM(clicks) FROM hpa_reporting2 GROUP BY site_id, child_id,
> search_type, hotel_id, report_date
> User: ADMIN
> Success: true
> Duration: 56.152
> Project: koddi_dev
> Realization Names: [hpa_reporting2_cube_clone]
> Cuboid Ids: [992]
> Total scan count: 12306477
> Result row count: 50000
> Accept Partial: true
> Is Partial Result: false
> Hit Exception Cache: false
> Storage cache used: false
> Message: null
> ==========================[QUERY]===============================
>
> On Wed, Aug 3, 2016 at 8:38 PM, ShaoFeng Shi <sh...@apache.org>
> wrote:
>
> > Hi Jason, could you please provide the full log since sending query to
> and
> > getting result back? The key information is which cuboid is used for the
> > query, cuboid exact match or fuzzy match, how many records be scanned and
> > how long it tooks; Thanks.
> >
> > 2016-08-03 23:19 GMT+08:00 Jason Hale <ja...@koddi.com>:
> >
> > > Yes, it would have to do post-aggregation in that case, but the strange
> > > thing is that query was running fast (about 1 second), while queries
> with
> > > more dimensions, such as "SELECT SUM(clicks) FROM reporting GROUP BY
> > > site_id, child_id, report_date, hotel_id". This query will take about
> 106
> > > seconds, but it shouldn't need to do any post-aggregation so I would
> > think
> > > it should return much quicker than that from the respective cuboid.
> > >
> > > Here's the explain plan:
> > > OLAPToEnumerableConverter
> > > OLAPProjectRel(EXPR$0=[$4])
> > > OLAPAggregateRel(group=[{0, 1, 2, 3}], EXPR$0=[SUM($4)])
> > > OLAPProjectRel(SITE_ID=[$9], CHILD_ID=[$3], REPORT_DATE=[$0],
> > > HOTEL_ID=[$2], CLICKS=[$10])
> > > OLAPTableScan(table=[[DEFAULT, HPA_REPORTING2]], fields=[[0, 1, 2, 3,
> 4,
> > 5,
> > > 6, 7, 8, 9, 10, 11]])
> > >
> > > On Tue, Aug 2, 2016 at 7:46 PM, ShaoFeng Shi <sh...@apache.org>
> > > wrote:
> > >
> > > > In the cube definition, you defined "SITE_ID", "CHILD_ID" as
> > "Mandatory"
> > > > dimension, which means they will not be aggregated in cube build
> phase
> > > for
> > > > all combinations.
> > > >
> > > > So when you run a query like  "SELECT SUM(clicks) FROM reporting
> GROUP
> > BY
> > > > search_type", Kylin will use the combination  "SITE_ID" + "CHILD_ID"
> +
> > > > "SEARCH_TYPE" to serve, there will be post-aggregation in runtime;
> The
> > > > performance is much depent on the cardinality of "SITE_ID" and
> > > "CHILD_ID".
> > > >
> > > >
> > > > 2016-08-02 23:08 GMT+08:00 Jason Hale <ja...@koddi.com>:
> > > >
> > > > > I've looked over the optimization options before, but did not
> notice
> > > the
> > > > > rowkey ordering. I can try this and see if this helps me. This is
> the
> > > > only
> > > > > thing I see that I can attempt to optimize further in the design,
> but
> > > > I'll
> > > > > provide my cube design below. I only have one measure to keep it
> > > simple:
> > > > >
> > > > > {
> > > > >   "uuid": "4090b854-8f0c-4288-bd73-fc50238a6030",
> > > > >   "version": "1.5.2",
> > > > >   "name": "hpa_reporting2_cube",
> > > > >   "description": "",
> > > > >   "dimensions": [
> > > > >     {
> > > > >       "name": "DEFAULT.HPA_REPORTING2.REPORT_DATE",
> > > > >       "table": "DEFAULT.HPA_REPORTING2",
> > > > >       "column": "REPORT_DATE",
> > > > >       "derived": null
> > > > >     },
> > > > >     {
> > > > >       "name": "DEFAULT.HPA_REPORTING2.SEARCH_TYPE",
> > > > >       "table": "DEFAULT.HPA_REPORTING2",
> > > > >       "column": "SEARCH_TYPE",
> > > > >       "derived": null
> > > > >     },
> > > > >     {
> > > > >       "name": "DEFAULT.HPA_REPORTING2.HOTEL_ID",
> > > > >       "table": "DEFAULT.HPA_REPORTING2",
> > > > >       "column": "HOTEL_ID",
> > > > >       "derived": null
> > > > >     },
> > > > >     {
> > > > >       "name": "DEFAULT.HPA_REPORTING2.CHILD_ID",
> > > > >       "table": "DEFAULT.HPA_REPORTING2",
> > > > >       "column": "CHILD_ID",
> > > > >       "derived": null
> > > > >     },
> > > > >     {
> > > > >       "name": "DEFAULT.HPA_REPORTING2.COUNTRY",
> > > > >       "table": "DEFAULT.HPA_REPORTING2",
> > > > >       "column": "COUNTRY",
> > > > >       "derived": null
> > > > >     },
> > > > >     {
> > > > >       "name": "DEFAULT.HPA_REPORTING2.DEVICE_TYPE",
> > > > >       "table": "DEFAULT.HPA_REPORTING2",
> > > > >       "column": "DEVICE_TYPE",
> > > > >       "derived": null
> > > > >     },
> > > > >     {
> > > > >       "name": "DEFAULT.HPA_REPORTING2.STAY_LENGTH",
> > > > >       "table": "DEFAULT.HPA_REPORTING2",
> > > > >       "column": "STAY_LENGTH",
> > > > >       "derived": null
> > > > >     },
> > > > >     {
> > > > >       "name": "DEFAULT.HPA_REPORTING2.TRUE_RANK_AG",
> > > > >       "table": "DEFAULT.HPA_REPORTING2",
> > > > >       "column": "TRUE_RANK_AG",
> > > > >       "derived": null
> > > > >     },
> > > > >     {
> > > > >       "name": "DEFAULT.HPA_REPORTING2.ROOM_BUNDLE",
> > > > >       "table": "DEFAULT.HPA_REPORTING2",
> > > > >       "column": "ROOM_BUNDLE",
> > > > >       "derived": null
> > > > >     },
> > > > >     {
> > > > >       "name": "DEFAULT.HPA_REPORTING2.SITE_ID",
> > > > >       "table": "DEFAULT.HPA_REPORTING2",
> > > > >       "column": "SITE_ID",
> > > > >       "derived": null
> > > > >     }
> > > > >   ],
> > > > >   "measures": [
> > > > >     {
> > > > >       "name": "_COUNT_",
> > > > >       "function": {
> > > > >         "expression": "COUNT",
> > > > >         "parameter": {
> > > > >           "type": "constant",
> > > > >           "value": "1",
> > > > >           "next_parameter": null
> > > > >         },
> > > > >         "returntype": "bigint"
> > > > >       },
> > > > >       "dependent_measure_ref": null
> > > > >     },
> > > > >     {
> > > > >       "name": "CLICKS",
> > > > >       "function": {
> > > > >         "expression": "SUM",
> > > > >         "parameter": {
> > > > >           "type": "column",
> > > > >           "value": "CLICKS",
> > > > >           "next_parameter": null
> > > > >         },
> > > > >         "returntype": "decimal"
> > > > >       },
> > > > >       "dependent_measure_ref": null
> > > > >     }
> > > > >   ],
> > > > >   "rowkey": {
> > > > >     "rowkey_columns": [
> > > > >       {
> > > > >         "column": "REPORT_DATE",
> > > > >         "encoding": "dict",
> > > > >         "isShardBy": false
> > > > >       },
> > > > >       {
> > > > >         "column": "SEARCH_TYPE",
> > > > >         "encoding": "dict",
> > > > >         "isShardBy": false
> > > > >       },
> > > > >       {
> > > > >         "column": "HOTEL_ID",
> > > > >         "encoding": "dict",
> > > > >         "isShardBy": false
> > > > >       },
> > > > >       {
> > > > >         "column": "CHILD_ID",
> > > > >         "encoding": "dict",
> > > > >         "isShardBy": false
> > > > >       },
> > > > >       {
> > > > >         "column": "COUNTRY",
> > > > >         "encoding": "dict",
> > > > >         "isShardBy": false
> > > > >       },
> > > > >       {
> > > > >         "column": "DEVICE_TYPE",
> > > > >         "encoding": "dict",
> > > > >         "isShardBy": false
> > > > >       },
> > > > >       {
> > > > >         "column": "STAY_LENGTH",
> > > > >         "encoding": "dict",
> > > > >         "isShardBy": false
> > > > >       },
> > > > >       {
> > > > >         "column": "TRUE_RANK_AG",
> > > > >         "encoding": "dict",
> > > > >         "isShardBy": false
> > > > >       },
> > > > >       {
> > > > >         "column": "ROOM_BUNDLE",
> > > > >         "encoding": "dict",
> > > > >         "isShardBy": false
> > > > >       },
> > > > >       {
> > > > >         "column": "SITE_ID",
> > > > >         "encoding": "dict",
> > > > >         "isShardBy": false
> > > > >       }
> > > > >     ]
> > > > >   },
> > > > >   "signature": "KixlKWxevr6mO+UlSaR5ig==",
> > > > >   "last_modified": 1470070273935,
> > > > >   "model_name": "hpa_reporting_model2",
> > > > >   "null_string": null,
> > > > >   "hbase_mapping": {
> > > > >     "column_family": [
> > > > >       {
> > > > >         "name": "F1",
> > > > >         "columns": [
> > > > >           {
> > > > >             "qualifier": "M",
> > > > >             "measure_refs": [
> > > > >               "_COUNT_",
> > > > >               "CLICKS"
> > > > >             ]
> > > > >           }
> > > > >         ]
> > > > >       }
> > > > >     ]
> > > > >   },
> > > > >   "aggregation_groups": [
> > > > >     {
> > > > >       "includes": [
> > > > >         "REPORT_DATE",
> > > > >         "SEARCH_TYPE",
> > > > >         "HOTEL_ID",
> > > > >         "CHILD_ID",
> > > > >         "COUNTRY",
> > > > >         "DEVICE_TYPE",
> > > > >         "STAY_LENGTH",
> > > > >         "TRUE_RANK_AG",
> > > > >         "ROOM_BUNDLE",
> > > > >         "SITE_ID"
> > > > >       ],
> > > > >       "select_rule": {
> > > > >         "hierarchy_dims": [],
> > > > >         "mandatory_dims": [
> > > > >           "SITE_ID",
> > > > >           "CHILD_ID"
> > > > >         ],
> > > > >         "joint_dims": [
> > > > >           [
> > > > >             "ROOM_BUNDLE",
> > > > >             "TRUE_RANK_AG"
> > > > >           ]
> > > > >         ]
> > > > >       }
> > > > >     }
> > > > >   ],
> > > > >   "notify_list": [],
> > > > >   "status_need_notify": [
> > > > >     "ERROR",
> > > > >     "DISCARDED",
> > > > >     "SUCCEED"
> > > > >   ],
> > > > >   "partition_date_start": 0,
> > > > >   "partition_date_end": 3153600000000,
> > > > >   "auto_merge_time_ranges": [
> > > > >     604800000,
> > > > >     2419200000
> > > > >   ],
> > > > >   "retention_range": 0,
> > > > >   "engine_type": 2,
> > > > >   "storage_type": 2,
> > > > >   "override_kylin_properties": {}
> > > > > }
> > > > >
> > > > > On Mon, Aug 1, 2016 at 8:02 PM, ShaoFeng Shi <
> shaofengshi@apache.org
> > >
> > > > > wrote:
> > > > >
> > > > > > Hi Jason,
> > > > > >
> > > > > > As Yiming mentioned, the cube design matters for the performance
> of
> > > > both
> > > > > > build and query; please check "Optimize Cube" in the document web
> > > page
> > > > > and
> > > > > > do optimizaiton as much as possible;
> > > > > >
> > > > > > Besides, the cluster's capacity and Hadoop configuration is also
> an
> > > > > > important factor; Try to identify the bottleneck and then
> optimize
> > or
> > > > add
> > > > > > capacity.
> > > > > >
> > > > > > From 1.5 Kylin ships with two cubing algorithm; the steps "Build
> > > > > > N-Dimension Cuboid" are the legacy "Layered" cubing algorithm;
> They
> > > > will
> > > > > be
> > > > > > skipped when Kylin selects to use the new "Fast" cubing
> algorithm,
> > > > which
> > > > > is
> > > > > > the "Build Cube" step after them. Please click the hadoop link in
> > > that
> > > > > step
> > > > > > to inspect the MR job's statistics;
> > > > > >
> > > > > > Hope this helps to some extend;
> > > > > >
> > > > > >
> > > > > >
> > > > > > 2016-08-02 8:44 GMT+08:00 Yiming Liu <li...@gmail.com>:
> > > > > >
> > > > > > > Hi Jason,
> > > > > > >
> > > > > > > Cube design is the performance key for Kylin, not only query,
> but
> > > > also
> > > > > > cube
> > > > > > > building process. How to select dimensions, how to define the
> > > > > > relationship
> > > > > > > between dimensions, how to select encode method, how to define
> > > > measure,
> > > > > > > even how to choose the Hbase key order will have a significant
> > > impact
> > > > > on
> > > > > > > performance.  There are quite a few wonderful documents
> > introducing
> > > > how
> > > > > > to
> > > > > > > do this, http://kylin.apache.org/docs15/ .
> > > > > > >
> > > > > > > One more thing, if you could share your cube design, you would
> > get
> > > > help
> > > > > > > easier here.
> > > > > > >
> > > > > > > 2016-08-02 7:20 GMT+08:00 Jason Hale <ja...@koddi.com>:
> > > > > > >
> > > > > > > > I'm setting up a test case for a portion of our dataset, to
> > > > evaluate
> > > > > > > Kylin
> > > > > > > > and I'm not seeing the performance that I would expect.
> > > > > > > >
> > > > > > > > The cube building process is taking about 5-6 hours with
> > > > ~69,000,000
> > > > > > > > records and 10 dimensions. I'm not sure if that's the
> expected
> > > > build
> > > > > > > time,
> > > > > > > > but the other problem is the query performance after building
> > the
> > > > > cube.
> > > > > > > >
> > > > > > > > All queries were tested with a very simple query (e.g. SELECT
> > > > > > SUM(clicks)
> > > > > > > > FROM reporting GROUP BY search_type)
> > > > > > > >
> > > > > > > > Grouping by 1 or 2 dimensions gives me very responsive
> queries
> > > > > (under 2
> > > > > > > > seconds), but adding more dimensions drastically increases
> the
> > > > query
> > > > > > time
> > > > > > > > (over 1 minute and it times out through hbase). I would
> expect
> > > > these
> > > > > > > > queries to have all similar query times since they should
> query
> > > the
> > > > > > > > respective cuboid, so I'm not sure why the performance would
> > > > suffer.
> > > > > I
> > > > > > > > didn't set up any special rules for the cube, but during the
> > > build
> > > > > > > process
> > > > > > > > it showed all the N-dimension cubes and the log simply said
> > > > > 'skipped'.
> > > > > > > >
> > > > > > > > Is there something I'm missing in the configuration?
> > > > > > > >
> > > > > > > > I have a HDP cluster with 3 nodes and 1 client node on which
> > > Kylin
> > > > is
> > > > > > > > installed. Do I need to adjust the hadoop configuration. I'm
> > > using
> > > > > most
> > > > > > > of
> > > > > > > > the default HDP settings.
> > > > > > > >
> > > > > > > > What more information can I provide?
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > With Warm regards
> > > > > > >
> > > > > > > Yiming Liu (刘一鸣)
> > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Best regards,
> > > > > >
> > > > > > Shaofeng Shi
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Best regards,
> > > >
> > > > Shaofeng Shi
> > > >
> > >
> >
> >
> >
> > --
> > Best regards,
> >
> > Shaofeng Shi
> >
>



-- 
Best regards,

Shaofeng Shi

Re: Kylin Cube Performance

Posted by Jason Hale <ja...@koddi.com>.
Sure, see kylin.log below:

2016-08-04 00:47:35,839 INFO  [http-bio-7070-exec-7]
controller.QueryController:175 : The original query:  SELECT SUM(clicks)
FROM hpa_reporting2 GROUP BY site_id, child_id, search_type, hotel_id,
report_date
2016-08-04 00:47:35,839 INFO  [http-bio-7070-exec-7]
service.QueryService:266 : The corrected query: SELECT SUM(clicks) FROM
hpa_reporting2 GROUP BY site_id, child_id, search_type, hotel_id,
report_date
LIMIT 50000
2016-08-04 00:47:35,908 INFO  [http-bio-7070-exec-7] routing.QueryRouter:48
: The project manager's reference is
org.apache.kylin.metadata.project.ProjectManager@3a3735a5
2016-08-04 00:47:35,909 INFO  [http-bio-7070-exec-7] routing.QueryRouter:60
: Find candidates by table DEFAULT.HPA_REPORTING2 and project=KODDI_DEV :
org.apache.kylin.query.routing.Candidate@51ed1b3b
2016-08-04 00:47:35,909 INFO  [http-bio-7070-exec-7] routing.QueryRouter:49
: Applying rule: class
org.apache.kylin.query.routing.rules.RemoveUncapableRealizationsRule,
realizations before: [hpa_reporting2_cube_clone(CUBE)], realizations after:
[hpa_reporting2_cube_clone(CUBE)]
2016-08-04 00:47:35,910 INFO  [http-bio-7070-exec-7] routing.QueryRouter:49
: Applying rule: class
org.apache.kylin.query.routing.rules.RealizationSortRule, realizations
before: [hpa_reporting2_cube_clone(CUBE)], realizations after:
[hpa_reporting2_cube_clone(CUBE)]
2016-08-04 00:47:35,910 INFO  [http-bio-7070-exec-7] routing.QueryRouter:72
: The realizations remaining: [hpa_reporting2_cube_clone(CUBE)] And the
final chosen one is the first one
2016-08-04 00:47:35,975 DEBUG [http-bio-7070-exec-7]
enumerator.OLAPEnumerator:107 : query storage...
2016-08-04 00:47:35,976 INFO  [http-bio-7070-exec-7]
v2.CubeStorageQuery:239 : exactAggregation is true
2016-08-04 00:47:35,976 INFO  [http-bio-7070-exec-7]
v2.CubeStorageQuery:357 : Enable limit 50000
2016-08-04 00:47:35,977 DEBUG [http-bio-7070-exec-7]
v2.CubeHBaseEndpointRPC:257 : New scanner for current segment
hpa_reporting2_cube_clone[19700101000000_20160828000000] will use
SCAN_FILTER_AGGR_CHECKMEM as endpoint's behavior
2016-08-04 00:47:35,979 DEBUG [http-bio-7070-exec-7]
v2.CubeHBaseEndpointRPC:313 : Serialized scanRequestBytes 836 bytes,
rawScanBytesString 56 bytes
2016-08-04 00:47:35,979 INFO  [http-bio-7070-exec-7]
v2.CubeHBaseEndpointRPC:315 : The scan 31b2dd4c for segment
hpa_reporting2_cube_clone[19700101000000_20160828000000] is as below with 1
separate raw scans, shard part of start/end key is set to 0
2016-08-04 00:47:35,980 INFO  [http-bio-7070-exec-7] v2.CubeHBaseRPC:271 :
Visiting hbase table KYLIN_RIK9O18H07: cuboid exact match, from 992 to 992
Start:
\x00\x00\x00\x00\x00\x00\x00\x00\x03\xE0\x00\x00\x00\x00\x00\x00\x00\x00\x00
(\x00\x00\x00\x00\x00\x00\x00\x00\x03\xE0\x00\x00\x00\x00\x00\x00\x00\x00\x00)
Stop:
 \x00\x00\x00\x00\x00\x00\x00\x00\x03\xE0\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\x00
(\x00\x00\x00\x00\x00\x00\x00\x00\x03\xE0\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\x00),
No Fuzzy Key
2016-08-04 00:47:35,981 DEBUG [http-bio-7070-exec-7]
v2.CubeHBaseEndpointRPC:320 : Submitting rpc to 1 shards starting from
shard 2, scan range count 1
2016-08-04 00:47:35,981 INFO  [http-bio-7070-exec-7]
v2.CubeHBaseEndpointRPC:103 : Timeout for ExpectedSizeIterator is: 99000
2016-08-04 00:47:35,981 DEBUG [http-bio-7070-exec-7]
enumerator.OLAPEnumerator:127 : return TupleIterator...
2016-08-04 00:47:52,773 INFO  [pool-6-thread-1] v2.CubeHBaseEndpointRPC:351
: <sub-thread for GTScanRequest 31b2dd4c> Endpoint RPC returned from HTable
KYLIN_RIK9O18H07 Shard
\x4B\x59\x4C\x49\x4E\x5F\x52\x49\x4B\x39\x4F\x31\x38\x48\x30\x37\x2C\x00\x02\x2C\x31\x34\x37\x30\x31\x35\x35\x33\x31\x34\x39\x33\x37\x2E\x61\x33\x61\x35\x34\x37\x39\x61\x32\x63\x37\x61\x61\x64\x30\x36\x33\x66\x30\x33\x64\x63\x34\x65\x31\x30\x36\x33\x61\x33\x61\x37\x2E
on host: ip-10-0-0-157.ec2.internal.Total scanned row: 12306477. Total
filtered/aggred row: 0. Time elapsed in EP: 16562(ms). Server CPU usage:
0.24348086721950246, server physical mem left: 7.195234304E9, server swap
mem left:0.0.Etc message: start latency: 15@1,agg done@13760,compress
done@16562,server stats done@16562,
debugGitTag:cf4d2940b67d622eacd2ac9a913b221091a35c2e;.Normal Complete: true.
2016-08-04 00:47:54,068 DEBUG [pool-6-thread-1] util.CompressionUtils:67 :
Original: 46465726 bytes. Decompressed: 150553629 bytes. Time: 1294
2016-08-04 00:48:29,303 INFO  [pool-4-thread-1]
threadpool.DefaultScheduler:106 : Job Fetcher: 0 running, 0 actual running,
0 ready, 12 others
2016-08-04 00:48:31,990 INFO  [http-bio-7070-exec-7]
service.QueryService:399 : Scan count for each storageContext: 12306477,
2016-08-04 00:48:31,991 INFO  [http-bio-7070-exec-7]
controller.QueryController:197 : Stats of SQL response: isException: false,
duration: 56152, total scan count 12306477
2016-08-04 00:48:32,000 WARN  [http-bio-7070-exec-7]
sizeof.ObjectGraphWalker:209 : The configured limit of 1,000 object
references was reached while attempting to calculate the size of the object
graph. Severe performance degradation could occur if the sizing operation
continues. This can be avoided by setting the CacheManger or Cache
<sizeOfPolicy> elements maxDepthExceededBehavior to "abort" or adding stop
points with @IgnoreSizeOf annotations. If performance degradation is NOT an
issue at the configured limit, raise the limit value using the CacheManager
or Cache <sizeOfPolicy> elements maxDepth attribute. For more information,
see the Ehcache configuration documentation.
2016-08-04 00:48:32,091 INFO  [http-bio-7070-exec-7]
service.QueryService:250 :
==========================[QUERY]===============================
SQL: SELECT SUM(clicks) FROM hpa_reporting2 GROUP BY site_id, child_id,
search_type, hotel_id, report_date
User: ADMIN
Success: true
Duration: 56.152
Project: koddi_dev
Realization Names: [hpa_reporting2_cube_clone]
Cuboid Ids: [992]
Total scan count: 12306477
Result row count: 50000
Accept Partial: true
Is Partial Result: false
Hit Exception Cache: false
Storage cache used: false
Message: null
==========================[QUERY]===============================

On Wed, Aug 3, 2016 at 8:38 PM, ShaoFeng Shi <sh...@apache.org> wrote:

> Hi Jason, could you please provide the full log since sending query to and
> getting result back? The key information is which cuboid is used for the
> query, cuboid exact match or fuzzy match, how many records be scanned and
> how long it tooks; Thanks.
>
> 2016-08-03 23:19 GMT+08:00 Jason Hale <ja...@koddi.com>:
>
> > Yes, it would have to do post-aggregation in that case, but the strange
> > thing is that query was running fast (about 1 second), while queries with
> > more dimensions, such as "SELECT SUM(clicks) FROM reporting GROUP BY
> > site_id, child_id, report_date, hotel_id". This query will take about 106
> > seconds, but it shouldn't need to do any post-aggregation so I would
> think
> > it should return much quicker than that from the respective cuboid.
> >
> > Here's the explain plan:
> > OLAPToEnumerableConverter
> > OLAPProjectRel(EXPR$0=[$4])
> > OLAPAggregateRel(group=[{0, 1, 2, 3}], EXPR$0=[SUM($4)])
> > OLAPProjectRel(SITE_ID=[$9], CHILD_ID=[$3], REPORT_DATE=[$0],
> > HOTEL_ID=[$2], CLICKS=[$10])
> > OLAPTableScan(table=[[DEFAULT, HPA_REPORTING2]], fields=[[0, 1, 2, 3, 4,
> 5,
> > 6, 7, 8, 9, 10, 11]])
> >
> > On Tue, Aug 2, 2016 at 7:46 PM, ShaoFeng Shi <sh...@apache.org>
> > wrote:
> >
> > > In the cube definition, you defined "SITE_ID", "CHILD_ID" as
> "Mandatory"
> > > dimension, which means they will not be aggregated in cube build phase
> > for
> > > all combinations.
> > >
> > > So when you run a query like  "SELECT SUM(clicks) FROM reporting GROUP
> BY
> > > search_type", Kylin will use the combination  "SITE_ID" + "CHILD_ID" +
> > > "SEARCH_TYPE" to serve, there will be post-aggregation in runtime; The
> > > performance is much depent on the cardinality of "SITE_ID" and
> > "CHILD_ID".
> > >
> > >
> > > 2016-08-02 23:08 GMT+08:00 Jason Hale <ja...@koddi.com>:
> > >
> > > > I've looked over the optimization options before, but did not notice
> > the
> > > > rowkey ordering. I can try this and see if this helps me. This is the
> > > only
> > > > thing I see that I can attempt to optimize further in the design, but
> > > I'll
> > > > provide my cube design below. I only have one measure to keep it
> > simple:
> > > >
> > > > {
> > > >   "uuid": "4090b854-8f0c-4288-bd73-fc50238a6030",
> > > >   "version": "1.5.2",
> > > >   "name": "hpa_reporting2_cube",
> > > >   "description": "",
> > > >   "dimensions": [
> > > >     {
> > > >       "name": "DEFAULT.HPA_REPORTING2.REPORT_DATE",
> > > >       "table": "DEFAULT.HPA_REPORTING2",
> > > >       "column": "REPORT_DATE",
> > > >       "derived": null
> > > >     },
> > > >     {
> > > >       "name": "DEFAULT.HPA_REPORTING2.SEARCH_TYPE",
> > > >       "table": "DEFAULT.HPA_REPORTING2",
> > > >       "column": "SEARCH_TYPE",
> > > >       "derived": null
> > > >     },
> > > >     {
> > > >       "name": "DEFAULT.HPA_REPORTING2.HOTEL_ID",
> > > >       "table": "DEFAULT.HPA_REPORTING2",
> > > >       "column": "HOTEL_ID",
> > > >       "derived": null
> > > >     },
> > > >     {
> > > >       "name": "DEFAULT.HPA_REPORTING2.CHILD_ID",
> > > >       "table": "DEFAULT.HPA_REPORTING2",
> > > >       "column": "CHILD_ID",
> > > >       "derived": null
> > > >     },
> > > >     {
> > > >       "name": "DEFAULT.HPA_REPORTING2.COUNTRY",
> > > >       "table": "DEFAULT.HPA_REPORTING2",
> > > >       "column": "COUNTRY",
> > > >       "derived": null
> > > >     },
> > > >     {
> > > >       "name": "DEFAULT.HPA_REPORTING2.DEVICE_TYPE",
> > > >       "table": "DEFAULT.HPA_REPORTING2",
> > > >       "column": "DEVICE_TYPE",
> > > >       "derived": null
> > > >     },
> > > >     {
> > > >       "name": "DEFAULT.HPA_REPORTING2.STAY_LENGTH",
> > > >       "table": "DEFAULT.HPA_REPORTING2",
> > > >       "column": "STAY_LENGTH",
> > > >       "derived": null
> > > >     },
> > > >     {
> > > >       "name": "DEFAULT.HPA_REPORTING2.TRUE_RANK_AG",
> > > >       "table": "DEFAULT.HPA_REPORTING2",
> > > >       "column": "TRUE_RANK_AG",
> > > >       "derived": null
> > > >     },
> > > >     {
> > > >       "name": "DEFAULT.HPA_REPORTING2.ROOM_BUNDLE",
> > > >       "table": "DEFAULT.HPA_REPORTING2",
> > > >       "column": "ROOM_BUNDLE",
> > > >       "derived": null
> > > >     },
> > > >     {
> > > >       "name": "DEFAULT.HPA_REPORTING2.SITE_ID",
> > > >       "table": "DEFAULT.HPA_REPORTING2",
> > > >       "column": "SITE_ID",
> > > >       "derived": null
> > > >     }
> > > >   ],
> > > >   "measures": [
> > > >     {
> > > >       "name": "_COUNT_",
> > > >       "function": {
> > > >         "expression": "COUNT",
> > > >         "parameter": {
> > > >           "type": "constant",
> > > >           "value": "1",
> > > >           "next_parameter": null
> > > >         },
> > > >         "returntype": "bigint"
> > > >       },
> > > >       "dependent_measure_ref": null
> > > >     },
> > > >     {
> > > >       "name": "CLICKS",
> > > >       "function": {
> > > >         "expression": "SUM",
> > > >         "parameter": {
> > > >           "type": "column",
> > > >           "value": "CLICKS",
> > > >           "next_parameter": null
> > > >         },
> > > >         "returntype": "decimal"
> > > >       },
> > > >       "dependent_measure_ref": null
> > > >     }
> > > >   ],
> > > >   "rowkey": {
> > > >     "rowkey_columns": [
> > > >       {
> > > >         "column": "REPORT_DATE",
> > > >         "encoding": "dict",
> > > >         "isShardBy": false
> > > >       },
> > > >       {
> > > >         "column": "SEARCH_TYPE",
> > > >         "encoding": "dict",
> > > >         "isShardBy": false
> > > >       },
> > > >       {
> > > >         "column": "HOTEL_ID",
> > > >         "encoding": "dict",
> > > >         "isShardBy": false
> > > >       },
> > > >       {
> > > >         "column": "CHILD_ID",
> > > >         "encoding": "dict",
> > > >         "isShardBy": false
> > > >       },
> > > >       {
> > > >         "column": "COUNTRY",
> > > >         "encoding": "dict",
> > > >         "isShardBy": false
> > > >       },
> > > >       {
> > > >         "column": "DEVICE_TYPE",
> > > >         "encoding": "dict",
> > > >         "isShardBy": false
> > > >       },
> > > >       {
> > > >         "column": "STAY_LENGTH",
> > > >         "encoding": "dict",
> > > >         "isShardBy": false
> > > >       },
> > > >       {
> > > >         "column": "TRUE_RANK_AG",
> > > >         "encoding": "dict",
> > > >         "isShardBy": false
> > > >       },
> > > >       {
> > > >         "column": "ROOM_BUNDLE",
> > > >         "encoding": "dict",
> > > >         "isShardBy": false
> > > >       },
> > > >       {
> > > >         "column": "SITE_ID",
> > > >         "encoding": "dict",
> > > >         "isShardBy": false
> > > >       }
> > > >     ]
> > > >   },
> > > >   "signature": "KixlKWxevr6mO+UlSaR5ig==",
> > > >   "last_modified": 1470070273935,
> > > >   "model_name": "hpa_reporting_model2",
> > > >   "null_string": null,
> > > >   "hbase_mapping": {
> > > >     "column_family": [
> > > >       {
> > > >         "name": "F1",
> > > >         "columns": [
> > > >           {
> > > >             "qualifier": "M",
> > > >             "measure_refs": [
> > > >               "_COUNT_",
> > > >               "CLICKS"
> > > >             ]
> > > >           }
> > > >         ]
> > > >       }
> > > >     ]
> > > >   },
> > > >   "aggregation_groups": [
> > > >     {
> > > >       "includes": [
> > > >         "REPORT_DATE",
> > > >         "SEARCH_TYPE",
> > > >         "HOTEL_ID",
> > > >         "CHILD_ID",
> > > >         "COUNTRY",
> > > >         "DEVICE_TYPE",
> > > >         "STAY_LENGTH",
> > > >         "TRUE_RANK_AG",
> > > >         "ROOM_BUNDLE",
> > > >         "SITE_ID"
> > > >       ],
> > > >       "select_rule": {
> > > >         "hierarchy_dims": [],
> > > >         "mandatory_dims": [
> > > >           "SITE_ID",
> > > >           "CHILD_ID"
> > > >         ],
> > > >         "joint_dims": [
> > > >           [
> > > >             "ROOM_BUNDLE",
> > > >             "TRUE_RANK_AG"
> > > >           ]
> > > >         ]
> > > >       }
> > > >     }
> > > >   ],
> > > >   "notify_list": [],
> > > >   "status_need_notify": [
> > > >     "ERROR",
> > > >     "DISCARDED",
> > > >     "SUCCEED"
> > > >   ],
> > > >   "partition_date_start": 0,
> > > >   "partition_date_end": 3153600000000,
> > > >   "auto_merge_time_ranges": [
> > > >     604800000,
> > > >     2419200000
> > > >   ],
> > > >   "retention_range": 0,
> > > >   "engine_type": 2,
> > > >   "storage_type": 2,
> > > >   "override_kylin_properties": {}
> > > > }
> > > >
> > > > On Mon, Aug 1, 2016 at 8:02 PM, ShaoFeng Shi <shaofengshi@apache.org
> >
> > > > wrote:
> > > >
> > > > > Hi Jason,
> > > > >
> > > > > As Yiming mentioned, the cube design matters for the performance of
> > > both
> > > > > build and query; please check "Optimize Cube" in the document web
> > page
> > > > and
> > > > > do optimizaiton as much as possible;
> > > > >
> > > > > Besides, the cluster's capacity and Hadoop configuration is also an
> > > > > important factor; Try to identify the bottleneck and then optimize
> or
> > > add
> > > > > capacity.
> > > > >
> > > > > From 1.5 Kylin ships with two cubing algorithm; the steps "Build
> > > > > N-Dimension Cuboid" are the legacy "Layered" cubing algorithm; They
> > > will
> > > > be
> > > > > skipped when Kylin selects to use the new "Fast" cubing algorithm,
> > > which
> > > > is
> > > > > the "Build Cube" step after them. Please click the hadoop link in
> > that
> > > > step
> > > > > to inspect the MR job's statistics;
> > > > >
> > > > > Hope this helps to some extend;
> > > > >
> > > > >
> > > > >
> > > > > 2016-08-02 8:44 GMT+08:00 Yiming Liu <li...@gmail.com>:
> > > > >
> > > > > > Hi Jason,
> > > > > >
> > > > > > Cube design is the performance key for Kylin, not only query, but
> > > also
> > > > > cube
> > > > > > building process. How to select dimensions, how to define the
> > > > > relationship
> > > > > > between dimensions, how to select encode method, how to define
> > > measure,
> > > > > > even how to choose the Hbase key order will have a significant
> > impact
> > > > on
> > > > > > performance.  There are quite a few wonderful documents
> introducing
> > > how
> > > > > to
> > > > > > do this, http://kylin.apache.org/docs15/ .
> > > > > >
> > > > > > One more thing, if you could share your cube design, you would
> get
> > > help
> > > > > > easier here.
> > > > > >
> > > > > > 2016-08-02 7:20 GMT+08:00 Jason Hale <ja...@koddi.com>:
> > > > > >
> > > > > > > I'm setting up a test case for a portion of our dataset, to
> > > evaluate
> > > > > > Kylin
> > > > > > > and I'm not seeing the performance that I would expect.
> > > > > > >
> > > > > > > The cube building process is taking about 5-6 hours with
> > > ~69,000,000
> > > > > > > records and 10 dimensions. I'm not sure if that's the expected
> > > build
> > > > > > time,
> > > > > > > but the other problem is the query performance after building
> the
> > > > cube.
> > > > > > >
> > > > > > > All queries were tested with a very simple query (e.g. SELECT
> > > > > SUM(clicks)
> > > > > > > FROM reporting GROUP BY search_type)
> > > > > > >
> > > > > > > Grouping by 1 or 2 dimensions gives me very responsive queries
> > > > (under 2
> > > > > > > seconds), but adding more dimensions drastically increases the
> > > query
> > > > > time
> > > > > > > (over 1 minute and it times out through hbase). I would expect
> > > these
> > > > > > > queries to have all similar query times since they should query
> > the
> > > > > > > respective cuboid, so I'm not sure why the performance would
> > > suffer.
> > > > I
> > > > > > > didn't set up any special rules for the cube, but during the
> > build
> > > > > > process
> > > > > > > it showed all the N-dimension cubes and the log simply said
> > > > 'skipped'.
> > > > > > >
> > > > > > > Is there something I'm missing in the configuration?
> > > > > > >
> > > > > > > I have a HDP cluster with 3 nodes and 1 client node on which
> > Kylin
> > > is
> > > > > > > installed. Do I need to adjust the hadoop configuration. I'm
> > using
> > > > most
> > > > > > of
> > > > > > > the default HDP settings.
> > > > > > >
> > > > > > > What more information can I provide?
> > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > With Warm regards
> > > > > >
> > > > > > Yiming Liu (刘一鸣)
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Best regards,
> > > > >
> > > > > Shaofeng Shi
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Best regards,
> > >
> > > Shaofeng Shi
> > >
> >
>
>
>
> --
> Best regards,
>
> Shaofeng Shi
>

Re: Kylin Cube Performance

Posted by ShaoFeng Shi <sh...@apache.org>.
Hi Jason, could you please provide the full log since sending query to and
getting result back? The key information is which cuboid is used for the
query, cuboid exact match or fuzzy match, how many records be scanned and
how long it tooks; Thanks.

2016-08-03 23:19 GMT+08:00 Jason Hale <ja...@koddi.com>:

> Yes, it would have to do post-aggregation in that case, but the strange
> thing is that query was running fast (about 1 second), while queries with
> more dimensions, such as "SELECT SUM(clicks) FROM reporting GROUP BY
> site_id, child_id, report_date, hotel_id". This query will take about 106
> seconds, but it shouldn't need to do any post-aggregation so I would think
> it should return much quicker than that from the respective cuboid.
>
> Here's the explain plan:
> OLAPToEnumerableConverter
> OLAPProjectRel(EXPR$0=[$4])
> OLAPAggregateRel(group=[{0, 1, 2, 3}], EXPR$0=[SUM($4)])
> OLAPProjectRel(SITE_ID=[$9], CHILD_ID=[$3], REPORT_DATE=[$0],
> HOTEL_ID=[$2], CLICKS=[$10])
> OLAPTableScan(table=[[DEFAULT, HPA_REPORTING2]], fields=[[0, 1, 2, 3, 4, 5,
> 6, 7, 8, 9, 10, 11]])
>
> On Tue, Aug 2, 2016 at 7:46 PM, ShaoFeng Shi <sh...@apache.org>
> wrote:
>
> > In the cube definition, you defined "SITE_ID", "CHILD_ID" as "Mandatory"
> > dimension, which means they will not be aggregated in cube build phase
> for
> > all combinations.
> >
> > So when you run a query like  "SELECT SUM(clicks) FROM reporting GROUP BY
> > search_type", Kylin will use the combination  "SITE_ID" + "CHILD_ID" +
> > "SEARCH_TYPE" to serve, there will be post-aggregation in runtime; The
> > performance is much depent on the cardinality of "SITE_ID" and
> "CHILD_ID".
> >
> >
> > 2016-08-02 23:08 GMT+08:00 Jason Hale <ja...@koddi.com>:
> >
> > > I've looked over the optimization options before, but did not notice
> the
> > > rowkey ordering. I can try this and see if this helps me. This is the
> > only
> > > thing I see that I can attempt to optimize further in the design, but
> > I'll
> > > provide my cube design below. I only have one measure to keep it
> simple:
> > >
> > > {
> > >   "uuid": "4090b854-8f0c-4288-bd73-fc50238a6030",
> > >   "version": "1.5.2",
> > >   "name": "hpa_reporting2_cube",
> > >   "description": "",
> > >   "dimensions": [
> > >     {
> > >       "name": "DEFAULT.HPA_REPORTING2.REPORT_DATE",
> > >       "table": "DEFAULT.HPA_REPORTING2",
> > >       "column": "REPORT_DATE",
> > >       "derived": null
> > >     },
> > >     {
> > >       "name": "DEFAULT.HPA_REPORTING2.SEARCH_TYPE",
> > >       "table": "DEFAULT.HPA_REPORTING2",
> > >       "column": "SEARCH_TYPE",
> > >       "derived": null
> > >     },
> > >     {
> > >       "name": "DEFAULT.HPA_REPORTING2.HOTEL_ID",
> > >       "table": "DEFAULT.HPA_REPORTING2",
> > >       "column": "HOTEL_ID",
> > >       "derived": null
> > >     },
> > >     {
> > >       "name": "DEFAULT.HPA_REPORTING2.CHILD_ID",
> > >       "table": "DEFAULT.HPA_REPORTING2",
> > >       "column": "CHILD_ID",
> > >       "derived": null
> > >     },
> > >     {
> > >       "name": "DEFAULT.HPA_REPORTING2.COUNTRY",
> > >       "table": "DEFAULT.HPA_REPORTING2",
> > >       "column": "COUNTRY",
> > >       "derived": null
> > >     },
> > >     {
> > >       "name": "DEFAULT.HPA_REPORTING2.DEVICE_TYPE",
> > >       "table": "DEFAULT.HPA_REPORTING2",
> > >       "column": "DEVICE_TYPE",
> > >       "derived": null
> > >     },
> > >     {
> > >       "name": "DEFAULT.HPA_REPORTING2.STAY_LENGTH",
> > >       "table": "DEFAULT.HPA_REPORTING2",
> > >       "column": "STAY_LENGTH",
> > >       "derived": null
> > >     },
> > >     {
> > >       "name": "DEFAULT.HPA_REPORTING2.TRUE_RANK_AG",
> > >       "table": "DEFAULT.HPA_REPORTING2",
> > >       "column": "TRUE_RANK_AG",
> > >       "derived": null
> > >     },
> > >     {
> > >       "name": "DEFAULT.HPA_REPORTING2.ROOM_BUNDLE",
> > >       "table": "DEFAULT.HPA_REPORTING2",
> > >       "column": "ROOM_BUNDLE",
> > >       "derived": null
> > >     },
> > >     {
> > >       "name": "DEFAULT.HPA_REPORTING2.SITE_ID",
> > >       "table": "DEFAULT.HPA_REPORTING2",
> > >       "column": "SITE_ID",
> > >       "derived": null
> > >     }
> > >   ],
> > >   "measures": [
> > >     {
> > >       "name": "_COUNT_",
> > >       "function": {
> > >         "expression": "COUNT",
> > >         "parameter": {
> > >           "type": "constant",
> > >           "value": "1",
> > >           "next_parameter": null
> > >         },
> > >         "returntype": "bigint"
> > >       },
> > >       "dependent_measure_ref": null
> > >     },
> > >     {
> > >       "name": "CLICKS",
> > >       "function": {
> > >         "expression": "SUM",
> > >         "parameter": {
> > >           "type": "column",
> > >           "value": "CLICKS",
> > >           "next_parameter": null
> > >         },
> > >         "returntype": "decimal"
> > >       },
> > >       "dependent_measure_ref": null
> > >     }
> > >   ],
> > >   "rowkey": {
> > >     "rowkey_columns": [
> > >       {
> > >         "column": "REPORT_DATE",
> > >         "encoding": "dict",
> > >         "isShardBy": false
> > >       },
> > >       {
> > >         "column": "SEARCH_TYPE",
> > >         "encoding": "dict",
> > >         "isShardBy": false
> > >       },
> > >       {
> > >         "column": "HOTEL_ID",
> > >         "encoding": "dict",
> > >         "isShardBy": false
> > >       },
> > >       {
> > >         "column": "CHILD_ID",
> > >         "encoding": "dict",
> > >         "isShardBy": false
> > >       },
> > >       {
> > >         "column": "COUNTRY",
> > >         "encoding": "dict",
> > >         "isShardBy": false
> > >       },
> > >       {
> > >         "column": "DEVICE_TYPE",
> > >         "encoding": "dict",
> > >         "isShardBy": false
> > >       },
> > >       {
> > >         "column": "STAY_LENGTH",
> > >         "encoding": "dict",
> > >         "isShardBy": false
> > >       },
> > >       {
> > >         "column": "TRUE_RANK_AG",
> > >         "encoding": "dict",
> > >         "isShardBy": false
> > >       },
> > >       {
> > >         "column": "ROOM_BUNDLE",
> > >         "encoding": "dict",
> > >         "isShardBy": false
> > >       },
> > >       {
> > >         "column": "SITE_ID",
> > >         "encoding": "dict",
> > >         "isShardBy": false
> > >       }
> > >     ]
> > >   },
> > >   "signature": "KixlKWxevr6mO+UlSaR5ig==",
> > >   "last_modified": 1470070273935,
> > >   "model_name": "hpa_reporting_model2",
> > >   "null_string": null,
> > >   "hbase_mapping": {
> > >     "column_family": [
> > >       {
> > >         "name": "F1",
> > >         "columns": [
> > >           {
> > >             "qualifier": "M",
> > >             "measure_refs": [
> > >               "_COUNT_",
> > >               "CLICKS"
> > >             ]
> > >           }
> > >         ]
> > >       }
> > >     ]
> > >   },
> > >   "aggregation_groups": [
> > >     {
> > >       "includes": [
> > >         "REPORT_DATE",
> > >         "SEARCH_TYPE",
> > >         "HOTEL_ID",
> > >         "CHILD_ID",
> > >         "COUNTRY",
> > >         "DEVICE_TYPE",
> > >         "STAY_LENGTH",
> > >         "TRUE_RANK_AG",
> > >         "ROOM_BUNDLE",
> > >         "SITE_ID"
> > >       ],
> > >       "select_rule": {
> > >         "hierarchy_dims": [],
> > >         "mandatory_dims": [
> > >           "SITE_ID",
> > >           "CHILD_ID"
> > >         ],
> > >         "joint_dims": [
> > >           [
> > >             "ROOM_BUNDLE",
> > >             "TRUE_RANK_AG"
> > >           ]
> > >         ]
> > >       }
> > >     }
> > >   ],
> > >   "notify_list": [],
> > >   "status_need_notify": [
> > >     "ERROR",
> > >     "DISCARDED",
> > >     "SUCCEED"
> > >   ],
> > >   "partition_date_start": 0,
> > >   "partition_date_end": 3153600000000,
> > >   "auto_merge_time_ranges": [
> > >     604800000,
> > >     2419200000
> > >   ],
> > >   "retention_range": 0,
> > >   "engine_type": 2,
> > >   "storage_type": 2,
> > >   "override_kylin_properties": {}
> > > }
> > >
> > > On Mon, Aug 1, 2016 at 8:02 PM, ShaoFeng Shi <sh...@apache.org>
> > > wrote:
> > >
> > > > Hi Jason,
> > > >
> > > > As Yiming mentioned, the cube design matters for the performance of
> > both
> > > > build and query; please check "Optimize Cube" in the document web
> page
> > > and
> > > > do optimizaiton as much as possible;
> > > >
> > > > Besides, the cluster's capacity and Hadoop configuration is also an
> > > > important factor; Try to identify the bottleneck and then optimize or
> > add
> > > > capacity.
> > > >
> > > > From 1.5 Kylin ships with two cubing algorithm; the steps "Build
> > > > N-Dimension Cuboid" are the legacy "Layered" cubing algorithm; They
> > will
> > > be
> > > > skipped when Kylin selects to use the new "Fast" cubing algorithm,
> > which
> > > is
> > > > the "Build Cube" step after them. Please click the hadoop link in
> that
> > > step
> > > > to inspect the MR job's statistics;
> > > >
> > > > Hope this helps to some extend;
> > > >
> > > >
> > > >
> > > > 2016-08-02 8:44 GMT+08:00 Yiming Liu <li...@gmail.com>:
> > > >
> > > > > Hi Jason,
> > > > >
> > > > > Cube design is the performance key for Kylin, not only query, but
> > also
> > > > cube
> > > > > building process. How to select dimensions, how to define the
> > > > relationship
> > > > > between dimensions, how to select encode method, how to define
> > measure,
> > > > > even how to choose the Hbase key order will have a significant
> impact
> > > on
> > > > > performance.  There are quite a few wonderful documents introducing
> > how
> > > > to
> > > > > do this, http://kylin.apache.org/docs15/ .
> > > > >
> > > > > One more thing, if you could share your cube design, you would get
> > help
> > > > > easier here.
> > > > >
> > > > > 2016-08-02 7:20 GMT+08:00 Jason Hale <ja...@koddi.com>:
> > > > >
> > > > > > I'm setting up a test case for a portion of our dataset, to
> > evaluate
> > > > > Kylin
> > > > > > and I'm not seeing the performance that I would expect.
> > > > > >
> > > > > > The cube building process is taking about 5-6 hours with
> > ~69,000,000
> > > > > > records and 10 dimensions. I'm not sure if that's the expected
> > build
> > > > > time,
> > > > > > but the other problem is the query performance after building the
> > > cube.
> > > > > >
> > > > > > All queries were tested with a very simple query (e.g. SELECT
> > > > SUM(clicks)
> > > > > > FROM reporting GROUP BY search_type)
> > > > > >
> > > > > > Grouping by 1 or 2 dimensions gives me very responsive queries
> > > (under 2
> > > > > > seconds), but adding more dimensions drastically increases the
> > query
> > > > time
> > > > > > (over 1 minute and it times out through hbase). I would expect
> > these
> > > > > > queries to have all similar query times since they should query
> the
> > > > > > respective cuboid, so I'm not sure why the performance would
> > suffer.
> > > I
> > > > > > didn't set up any special rules for the cube, but during the
> build
> > > > > process
> > > > > > it showed all the N-dimension cubes and the log simply said
> > > 'skipped'.
> > > > > >
> > > > > > Is there something I'm missing in the configuration?
> > > > > >
> > > > > > I have a HDP cluster with 3 nodes and 1 client node on which
> Kylin
> > is
> > > > > > installed. Do I need to adjust the hadoop configuration. I'm
> using
> > > most
> > > > > of
> > > > > > the default HDP settings.
> > > > > >
> > > > > > What more information can I provide?
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > With Warm regards
> > > > >
> > > > > Yiming Liu (刘一鸣)
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Best regards,
> > > >
> > > > Shaofeng Shi
> > > >
> > >
> >
> >
> >
> > --
> > Best regards,
> >
> > Shaofeng Shi
> >
>



-- 
Best regards,

Shaofeng Shi

Re: Kylin Cube Performance

Posted by Jason Hale <ja...@koddi.com>.
Yes, it would have to do post-aggregation in that case, but the strange
thing is that query was running fast (about 1 second), while queries with
more dimensions, such as "SELECT SUM(clicks) FROM reporting GROUP BY
site_id, child_id, report_date, hotel_id". This query will take about 106
seconds, but it shouldn't need to do any post-aggregation so I would think
it should return much quicker than that from the respective cuboid.

Here's the explain plan:
OLAPToEnumerableConverter
OLAPProjectRel(EXPR$0=[$4])
OLAPAggregateRel(group=[{0, 1, 2, 3}], EXPR$0=[SUM($4)])
OLAPProjectRel(SITE_ID=[$9], CHILD_ID=[$3], REPORT_DATE=[$0],
HOTEL_ID=[$2], CLICKS=[$10])
OLAPTableScan(table=[[DEFAULT, HPA_REPORTING2]], fields=[[0, 1, 2, 3, 4, 5,
6, 7, 8, 9, 10, 11]])

On Tue, Aug 2, 2016 at 7:46 PM, ShaoFeng Shi <sh...@apache.org> wrote:

> In the cube definition, you defined "SITE_ID", "CHILD_ID" as "Mandatory"
> dimension, which means they will not be aggregated in cube build phase for
> all combinations.
>
> So when you run a query like  "SELECT SUM(clicks) FROM reporting GROUP BY
> search_type", Kylin will use the combination  "SITE_ID" + "CHILD_ID" +
> "SEARCH_TYPE" to serve, there will be post-aggregation in runtime; The
> performance is much depent on the cardinality of "SITE_ID" and "CHILD_ID".
>
>
> 2016-08-02 23:08 GMT+08:00 Jason Hale <ja...@koddi.com>:
>
> > I've looked over the optimization options before, but did not notice the
> > rowkey ordering. I can try this and see if this helps me. This is the
> only
> > thing I see that I can attempt to optimize further in the design, but
> I'll
> > provide my cube design below. I only have one measure to keep it simple:
> >
> > {
> >   "uuid": "4090b854-8f0c-4288-bd73-fc50238a6030",
> >   "version": "1.5.2",
> >   "name": "hpa_reporting2_cube",
> >   "description": "",
> >   "dimensions": [
> >     {
> >       "name": "DEFAULT.HPA_REPORTING2.REPORT_DATE",
> >       "table": "DEFAULT.HPA_REPORTING2",
> >       "column": "REPORT_DATE",
> >       "derived": null
> >     },
> >     {
> >       "name": "DEFAULT.HPA_REPORTING2.SEARCH_TYPE",
> >       "table": "DEFAULT.HPA_REPORTING2",
> >       "column": "SEARCH_TYPE",
> >       "derived": null
> >     },
> >     {
> >       "name": "DEFAULT.HPA_REPORTING2.HOTEL_ID",
> >       "table": "DEFAULT.HPA_REPORTING2",
> >       "column": "HOTEL_ID",
> >       "derived": null
> >     },
> >     {
> >       "name": "DEFAULT.HPA_REPORTING2.CHILD_ID",
> >       "table": "DEFAULT.HPA_REPORTING2",
> >       "column": "CHILD_ID",
> >       "derived": null
> >     },
> >     {
> >       "name": "DEFAULT.HPA_REPORTING2.COUNTRY",
> >       "table": "DEFAULT.HPA_REPORTING2",
> >       "column": "COUNTRY",
> >       "derived": null
> >     },
> >     {
> >       "name": "DEFAULT.HPA_REPORTING2.DEVICE_TYPE",
> >       "table": "DEFAULT.HPA_REPORTING2",
> >       "column": "DEVICE_TYPE",
> >       "derived": null
> >     },
> >     {
> >       "name": "DEFAULT.HPA_REPORTING2.STAY_LENGTH",
> >       "table": "DEFAULT.HPA_REPORTING2",
> >       "column": "STAY_LENGTH",
> >       "derived": null
> >     },
> >     {
> >       "name": "DEFAULT.HPA_REPORTING2.TRUE_RANK_AG",
> >       "table": "DEFAULT.HPA_REPORTING2",
> >       "column": "TRUE_RANK_AG",
> >       "derived": null
> >     },
> >     {
> >       "name": "DEFAULT.HPA_REPORTING2.ROOM_BUNDLE",
> >       "table": "DEFAULT.HPA_REPORTING2",
> >       "column": "ROOM_BUNDLE",
> >       "derived": null
> >     },
> >     {
> >       "name": "DEFAULT.HPA_REPORTING2.SITE_ID",
> >       "table": "DEFAULT.HPA_REPORTING2",
> >       "column": "SITE_ID",
> >       "derived": null
> >     }
> >   ],
> >   "measures": [
> >     {
> >       "name": "_COUNT_",
> >       "function": {
> >         "expression": "COUNT",
> >         "parameter": {
> >           "type": "constant",
> >           "value": "1",
> >           "next_parameter": null
> >         },
> >         "returntype": "bigint"
> >       },
> >       "dependent_measure_ref": null
> >     },
> >     {
> >       "name": "CLICKS",
> >       "function": {
> >         "expression": "SUM",
> >         "parameter": {
> >           "type": "column",
> >           "value": "CLICKS",
> >           "next_parameter": null
> >         },
> >         "returntype": "decimal"
> >       },
> >       "dependent_measure_ref": null
> >     }
> >   ],
> >   "rowkey": {
> >     "rowkey_columns": [
> >       {
> >         "column": "REPORT_DATE",
> >         "encoding": "dict",
> >         "isShardBy": false
> >       },
> >       {
> >         "column": "SEARCH_TYPE",
> >         "encoding": "dict",
> >         "isShardBy": false
> >       },
> >       {
> >         "column": "HOTEL_ID",
> >         "encoding": "dict",
> >         "isShardBy": false
> >       },
> >       {
> >         "column": "CHILD_ID",
> >         "encoding": "dict",
> >         "isShardBy": false
> >       },
> >       {
> >         "column": "COUNTRY",
> >         "encoding": "dict",
> >         "isShardBy": false
> >       },
> >       {
> >         "column": "DEVICE_TYPE",
> >         "encoding": "dict",
> >         "isShardBy": false
> >       },
> >       {
> >         "column": "STAY_LENGTH",
> >         "encoding": "dict",
> >         "isShardBy": false
> >       },
> >       {
> >         "column": "TRUE_RANK_AG",
> >         "encoding": "dict",
> >         "isShardBy": false
> >       },
> >       {
> >         "column": "ROOM_BUNDLE",
> >         "encoding": "dict",
> >         "isShardBy": false
> >       },
> >       {
> >         "column": "SITE_ID",
> >         "encoding": "dict",
> >         "isShardBy": false
> >       }
> >     ]
> >   },
> >   "signature": "KixlKWxevr6mO+UlSaR5ig==",
> >   "last_modified": 1470070273935,
> >   "model_name": "hpa_reporting_model2",
> >   "null_string": null,
> >   "hbase_mapping": {
> >     "column_family": [
> >       {
> >         "name": "F1",
> >         "columns": [
> >           {
> >             "qualifier": "M",
> >             "measure_refs": [
> >               "_COUNT_",
> >               "CLICKS"
> >             ]
> >           }
> >         ]
> >       }
> >     ]
> >   },
> >   "aggregation_groups": [
> >     {
> >       "includes": [
> >         "REPORT_DATE",
> >         "SEARCH_TYPE",
> >         "HOTEL_ID",
> >         "CHILD_ID",
> >         "COUNTRY",
> >         "DEVICE_TYPE",
> >         "STAY_LENGTH",
> >         "TRUE_RANK_AG",
> >         "ROOM_BUNDLE",
> >         "SITE_ID"
> >       ],
> >       "select_rule": {
> >         "hierarchy_dims": [],
> >         "mandatory_dims": [
> >           "SITE_ID",
> >           "CHILD_ID"
> >         ],
> >         "joint_dims": [
> >           [
> >             "ROOM_BUNDLE",
> >             "TRUE_RANK_AG"
> >           ]
> >         ]
> >       }
> >     }
> >   ],
> >   "notify_list": [],
> >   "status_need_notify": [
> >     "ERROR",
> >     "DISCARDED",
> >     "SUCCEED"
> >   ],
> >   "partition_date_start": 0,
> >   "partition_date_end": 3153600000000,
> >   "auto_merge_time_ranges": [
> >     604800000,
> >     2419200000
> >   ],
> >   "retention_range": 0,
> >   "engine_type": 2,
> >   "storage_type": 2,
> >   "override_kylin_properties": {}
> > }
> >
> > On Mon, Aug 1, 2016 at 8:02 PM, ShaoFeng Shi <sh...@apache.org>
> > wrote:
> >
> > > Hi Jason,
> > >
> > > As Yiming mentioned, the cube design matters for the performance of
> both
> > > build and query; please check "Optimize Cube" in the document web page
> > and
> > > do optimizaiton as much as possible;
> > >
> > > Besides, the cluster's capacity and Hadoop configuration is also an
> > > important factor; Try to identify the bottleneck and then optimize or
> add
> > > capacity.
> > >
> > > From 1.5 Kylin ships with two cubing algorithm; the steps "Build
> > > N-Dimension Cuboid" are the legacy "Layered" cubing algorithm; They
> will
> > be
> > > skipped when Kylin selects to use the new "Fast" cubing algorithm,
> which
> > is
> > > the "Build Cube" step after them. Please click the hadoop link in that
> > step
> > > to inspect the MR job's statistics;
> > >
> > > Hope this helps to some extend;
> > >
> > >
> > >
> > > 2016-08-02 8:44 GMT+08:00 Yiming Liu <li...@gmail.com>:
> > >
> > > > Hi Jason,
> > > >
> > > > Cube design is the performance key for Kylin, not only query, but
> also
> > > cube
> > > > building process. How to select dimensions, how to define the
> > > relationship
> > > > between dimensions, how to select encode method, how to define
> measure,
> > > > even how to choose the Hbase key order will have a significant impact
> > on
> > > > performance.  There are quite a few wonderful documents introducing
> how
> > > to
> > > > do this, http://kylin.apache.org/docs15/ .
> > > >
> > > > One more thing, if you could share your cube design, you would get
> help
> > > > easier here.
> > > >
> > > > 2016-08-02 7:20 GMT+08:00 Jason Hale <ja...@koddi.com>:
> > > >
> > > > > I'm setting up a test case for a portion of our dataset, to
> evaluate
> > > > Kylin
> > > > > and I'm not seeing the performance that I would expect.
> > > > >
> > > > > The cube building process is taking about 5-6 hours with
> ~69,000,000
> > > > > records and 10 dimensions. I'm not sure if that's the expected
> build
> > > > time,
> > > > > but the other problem is the query performance after building the
> > cube.
> > > > >
> > > > > All queries were tested with a very simple query (e.g. SELECT
> > > SUM(clicks)
> > > > > FROM reporting GROUP BY search_type)
> > > > >
> > > > > Grouping by 1 or 2 dimensions gives me very responsive queries
> > (under 2
> > > > > seconds), but adding more dimensions drastically increases the
> query
> > > time
> > > > > (over 1 minute and it times out through hbase). I would expect
> these
> > > > > queries to have all similar query times since they should query the
> > > > > respective cuboid, so I'm not sure why the performance would
> suffer.
> > I
> > > > > didn't set up any special rules for the cube, but during the build
> > > > process
> > > > > it showed all the N-dimension cubes and the log simply said
> > 'skipped'.
> > > > >
> > > > > Is there something I'm missing in the configuration?
> > > > >
> > > > > I have a HDP cluster with 3 nodes and 1 client node on which Kylin
> is
> > > > > installed. Do I need to adjust the hadoop configuration. I'm using
> > most
> > > > of
> > > > > the default HDP settings.
> > > > >
> > > > > What more information can I provide?
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > With Warm regards
> > > >
> > > > Yiming Liu (刘一鸣)
> > > >
> > >
> > >
> > >
> > > --
> > > Best regards,
> > >
> > > Shaofeng Shi
> > >
> >
>
>
>
> --
> Best regards,
>
> Shaofeng Shi
>

Re: Kylin Cube Performance

Posted by ShaoFeng Shi <sh...@apache.org>.
In the cube definition, you defined "SITE_ID", "CHILD_ID" as "Mandatory"
dimension, which means they will not be aggregated in cube build phase for
all combinations.

So when you run a query like  "SELECT SUM(clicks) FROM reporting GROUP BY
search_type", Kylin will use the combination  "SITE_ID" + "CHILD_ID" +
"SEARCH_TYPE" to serve, there will be post-aggregation in runtime; The
performance is much depent on the cardinality of "SITE_ID" and "CHILD_ID".


2016-08-02 23:08 GMT+08:00 Jason Hale <ja...@koddi.com>:

> I've looked over the optimization options before, but did not notice the
> rowkey ordering. I can try this and see if this helps me. This is the only
> thing I see that I can attempt to optimize further in the design, but I'll
> provide my cube design below. I only have one measure to keep it simple:
>
> {
>   "uuid": "4090b854-8f0c-4288-bd73-fc50238a6030",
>   "version": "1.5.2",
>   "name": "hpa_reporting2_cube",
>   "description": "",
>   "dimensions": [
>     {
>       "name": "DEFAULT.HPA_REPORTING2.REPORT_DATE",
>       "table": "DEFAULT.HPA_REPORTING2",
>       "column": "REPORT_DATE",
>       "derived": null
>     },
>     {
>       "name": "DEFAULT.HPA_REPORTING2.SEARCH_TYPE",
>       "table": "DEFAULT.HPA_REPORTING2",
>       "column": "SEARCH_TYPE",
>       "derived": null
>     },
>     {
>       "name": "DEFAULT.HPA_REPORTING2.HOTEL_ID",
>       "table": "DEFAULT.HPA_REPORTING2",
>       "column": "HOTEL_ID",
>       "derived": null
>     },
>     {
>       "name": "DEFAULT.HPA_REPORTING2.CHILD_ID",
>       "table": "DEFAULT.HPA_REPORTING2",
>       "column": "CHILD_ID",
>       "derived": null
>     },
>     {
>       "name": "DEFAULT.HPA_REPORTING2.COUNTRY",
>       "table": "DEFAULT.HPA_REPORTING2",
>       "column": "COUNTRY",
>       "derived": null
>     },
>     {
>       "name": "DEFAULT.HPA_REPORTING2.DEVICE_TYPE",
>       "table": "DEFAULT.HPA_REPORTING2",
>       "column": "DEVICE_TYPE",
>       "derived": null
>     },
>     {
>       "name": "DEFAULT.HPA_REPORTING2.STAY_LENGTH",
>       "table": "DEFAULT.HPA_REPORTING2",
>       "column": "STAY_LENGTH",
>       "derived": null
>     },
>     {
>       "name": "DEFAULT.HPA_REPORTING2.TRUE_RANK_AG",
>       "table": "DEFAULT.HPA_REPORTING2",
>       "column": "TRUE_RANK_AG",
>       "derived": null
>     },
>     {
>       "name": "DEFAULT.HPA_REPORTING2.ROOM_BUNDLE",
>       "table": "DEFAULT.HPA_REPORTING2",
>       "column": "ROOM_BUNDLE",
>       "derived": null
>     },
>     {
>       "name": "DEFAULT.HPA_REPORTING2.SITE_ID",
>       "table": "DEFAULT.HPA_REPORTING2",
>       "column": "SITE_ID",
>       "derived": null
>     }
>   ],
>   "measures": [
>     {
>       "name": "_COUNT_",
>       "function": {
>         "expression": "COUNT",
>         "parameter": {
>           "type": "constant",
>           "value": "1",
>           "next_parameter": null
>         },
>         "returntype": "bigint"
>       },
>       "dependent_measure_ref": null
>     },
>     {
>       "name": "CLICKS",
>       "function": {
>         "expression": "SUM",
>         "parameter": {
>           "type": "column",
>           "value": "CLICKS",
>           "next_parameter": null
>         },
>         "returntype": "decimal"
>       },
>       "dependent_measure_ref": null
>     }
>   ],
>   "rowkey": {
>     "rowkey_columns": [
>       {
>         "column": "REPORT_DATE",
>         "encoding": "dict",
>         "isShardBy": false
>       },
>       {
>         "column": "SEARCH_TYPE",
>         "encoding": "dict",
>         "isShardBy": false
>       },
>       {
>         "column": "HOTEL_ID",
>         "encoding": "dict",
>         "isShardBy": false
>       },
>       {
>         "column": "CHILD_ID",
>         "encoding": "dict",
>         "isShardBy": false
>       },
>       {
>         "column": "COUNTRY",
>         "encoding": "dict",
>         "isShardBy": false
>       },
>       {
>         "column": "DEVICE_TYPE",
>         "encoding": "dict",
>         "isShardBy": false
>       },
>       {
>         "column": "STAY_LENGTH",
>         "encoding": "dict",
>         "isShardBy": false
>       },
>       {
>         "column": "TRUE_RANK_AG",
>         "encoding": "dict",
>         "isShardBy": false
>       },
>       {
>         "column": "ROOM_BUNDLE",
>         "encoding": "dict",
>         "isShardBy": false
>       },
>       {
>         "column": "SITE_ID",
>         "encoding": "dict",
>         "isShardBy": false
>       }
>     ]
>   },
>   "signature": "KixlKWxevr6mO+UlSaR5ig==",
>   "last_modified": 1470070273935,
>   "model_name": "hpa_reporting_model2",
>   "null_string": null,
>   "hbase_mapping": {
>     "column_family": [
>       {
>         "name": "F1",
>         "columns": [
>           {
>             "qualifier": "M",
>             "measure_refs": [
>               "_COUNT_",
>               "CLICKS"
>             ]
>           }
>         ]
>       }
>     ]
>   },
>   "aggregation_groups": [
>     {
>       "includes": [
>         "REPORT_DATE",
>         "SEARCH_TYPE",
>         "HOTEL_ID",
>         "CHILD_ID",
>         "COUNTRY",
>         "DEVICE_TYPE",
>         "STAY_LENGTH",
>         "TRUE_RANK_AG",
>         "ROOM_BUNDLE",
>         "SITE_ID"
>       ],
>       "select_rule": {
>         "hierarchy_dims": [],
>         "mandatory_dims": [
>           "SITE_ID",
>           "CHILD_ID"
>         ],
>         "joint_dims": [
>           [
>             "ROOM_BUNDLE",
>             "TRUE_RANK_AG"
>           ]
>         ]
>       }
>     }
>   ],
>   "notify_list": [],
>   "status_need_notify": [
>     "ERROR",
>     "DISCARDED",
>     "SUCCEED"
>   ],
>   "partition_date_start": 0,
>   "partition_date_end": 3153600000000,
>   "auto_merge_time_ranges": [
>     604800000,
>     2419200000
>   ],
>   "retention_range": 0,
>   "engine_type": 2,
>   "storage_type": 2,
>   "override_kylin_properties": {}
> }
>
> On Mon, Aug 1, 2016 at 8:02 PM, ShaoFeng Shi <sh...@apache.org>
> wrote:
>
> > Hi Jason,
> >
> > As Yiming mentioned, the cube design matters for the performance of both
> > build and query; please check "Optimize Cube" in the document web page
> and
> > do optimizaiton as much as possible;
> >
> > Besides, the cluster's capacity and Hadoop configuration is also an
> > important factor; Try to identify the bottleneck and then optimize or add
> > capacity.
> >
> > From 1.5 Kylin ships with two cubing algorithm; the steps "Build
> > N-Dimension Cuboid" are the legacy "Layered" cubing algorithm; They will
> be
> > skipped when Kylin selects to use the new "Fast" cubing algorithm, which
> is
> > the "Build Cube" step after them. Please click the hadoop link in that
> step
> > to inspect the MR job's statistics;
> >
> > Hope this helps to some extend;
> >
> >
> >
> > 2016-08-02 8:44 GMT+08:00 Yiming Liu <li...@gmail.com>:
> >
> > > Hi Jason,
> > >
> > > Cube design is the performance key for Kylin, not only query, but also
> > cube
> > > building process. How to select dimensions, how to define the
> > relationship
> > > between dimensions, how to select encode method, how to define measure,
> > > even how to choose the Hbase key order will have a significant impact
> on
> > > performance.  There are quite a few wonderful documents introducing how
> > to
> > > do this, http://kylin.apache.org/docs15/ .
> > >
> > > One more thing, if you could share your cube design, you would get help
> > > easier here.
> > >
> > > 2016-08-02 7:20 GMT+08:00 Jason Hale <ja...@koddi.com>:
> > >
> > > > I'm setting up a test case for a portion of our dataset, to evaluate
> > > Kylin
> > > > and I'm not seeing the performance that I would expect.
> > > >
> > > > The cube building process is taking about 5-6 hours with  ~69,000,000
> > > > records and 10 dimensions. I'm not sure if that's the expected build
> > > time,
> > > > but the other problem is the query performance after building the
> cube.
> > > >
> > > > All queries were tested with a very simple query (e.g. SELECT
> > SUM(clicks)
> > > > FROM reporting GROUP BY search_type)
> > > >
> > > > Grouping by 1 or 2 dimensions gives me very responsive queries
> (under 2
> > > > seconds), but adding more dimensions drastically increases the query
> > time
> > > > (over 1 minute and it times out through hbase). I would expect these
> > > > queries to have all similar query times since they should query the
> > > > respective cuboid, so I'm not sure why the performance would suffer.
> I
> > > > didn't set up any special rules for the cube, but during the build
> > > process
> > > > it showed all the N-dimension cubes and the log simply said
> 'skipped'.
> > > >
> > > > Is there something I'm missing in the configuration?
> > > >
> > > > I have a HDP cluster with 3 nodes and 1 client node on which Kylin is
> > > > installed. Do I need to adjust the hadoop configuration. I'm using
> most
> > > of
> > > > the default HDP settings.
> > > >
> > > > What more information can I provide?
> > > >
> > >
> > >
> > >
> > > --
> > > With Warm regards
> > >
> > > Yiming Liu (刘一鸣)
> > >
> >
> >
> >
> > --
> > Best regards,
> >
> > Shaofeng Shi
> >
>



-- 
Best regards,

Shaofeng Shi

Re: Kylin Cube Performance

Posted by Jason Hale <ja...@koddi.com>.
I've looked over the optimization options before, but did not notice the
rowkey ordering. I can try this and see if this helps me. This is the only
thing I see that I can attempt to optimize further in the design, but I'll
provide my cube design below. I only have one measure to keep it simple:

{
  "uuid": "4090b854-8f0c-4288-bd73-fc50238a6030",
  "version": "1.5.2",
  "name": "hpa_reporting2_cube",
  "description": "",
  "dimensions": [
    {
      "name": "DEFAULT.HPA_REPORTING2.REPORT_DATE",
      "table": "DEFAULT.HPA_REPORTING2",
      "column": "REPORT_DATE",
      "derived": null
    },
    {
      "name": "DEFAULT.HPA_REPORTING2.SEARCH_TYPE",
      "table": "DEFAULT.HPA_REPORTING2",
      "column": "SEARCH_TYPE",
      "derived": null
    },
    {
      "name": "DEFAULT.HPA_REPORTING2.HOTEL_ID",
      "table": "DEFAULT.HPA_REPORTING2",
      "column": "HOTEL_ID",
      "derived": null
    },
    {
      "name": "DEFAULT.HPA_REPORTING2.CHILD_ID",
      "table": "DEFAULT.HPA_REPORTING2",
      "column": "CHILD_ID",
      "derived": null
    },
    {
      "name": "DEFAULT.HPA_REPORTING2.COUNTRY",
      "table": "DEFAULT.HPA_REPORTING2",
      "column": "COUNTRY",
      "derived": null
    },
    {
      "name": "DEFAULT.HPA_REPORTING2.DEVICE_TYPE",
      "table": "DEFAULT.HPA_REPORTING2",
      "column": "DEVICE_TYPE",
      "derived": null
    },
    {
      "name": "DEFAULT.HPA_REPORTING2.STAY_LENGTH",
      "table": "DEFAULT.HPA_REPORTING2",
      "column": "STAY_LENGTH",
      "derived": null
    },
    {
      "name": "DEFAULT.HPA_REPORTING2.TRUE_RANK_AG",
      "table": "DEFAULT.HPA_REPORTING2",
      "column": "TRUE_RANK_AG",
      "derived": null
    },
    {
      "name": "DEFAULT.HPA_REPORTING2.ROOM_BUNDLE",
      "table": "DEFAULT.HPA_REPORTING2",
      "column": "ROOM_BUNDLE",
      "derived": null
    },
    {
      "name": "DEFAULT.HPA_REPORTING2.SITE_ID",
      "table": "DEFAULT.HPA_REPORTING2",
      "column": "SITE_ID",
      "derived": null
    }
  ],
  "measures": [
    {
      "name": "_COUNT_",
      "function": {
        "expression": "COUNT",
        "parameter": {
          "type": "constant",
          "value": "1",
          "next_parameter": null
        },
        "returntype": "bigint"
      },
      "dependent_measure_ref": null
    },
    {
      "name": "CLICKS",
      "function": {
        "expression": "SUM",
        "parameter": {
          "type": "column",
          "value": "CLICKS",
          "next_parameter": null
        },
        "returntype": "decimal"
      },
      "dependent_measure_ref": null
    }
  ],
  "rowkey": {
    "rowkey_columns": [
      {
        "column": "REPORT_DATE",
        "encoding": "dict",
        "isShardBy": false
      },
      {
        "column": "SEARCH_TYPE",
        "encoding": "dict",
        "isShardBy": false
      },
      {
        "column": "HOTEL_ID",
        "encoding": "dict",
        "isShardBy": false
      },
      {
        "column": "CHILD_ID",
        "encoding": "dict",
        "isShardBy": false
      },
      {
        "column": "COUNTRY",
        "encoding": "dict",
        "isShardBy": false
      },
      {
        "column": "DEVICE_TYPE",
        "encoding": "dict",
        "isShardBy": false
      },
      {
        "column": "STAY_LENGTH",
        "encoding": "dict",
        "isShardBy": false
      },
      {
        "column": "TRUE_RANK_AG",
        "encoding": "dict",
        "isShardBy": false
      },
      {
        "column": "ROOM_BUNDLE",
        "encoding": "dict",
        "isShardBy": false
      },
      {
        "column": "SITE_ID",
        "encoding": "dict",
        "isShardBy": false
      }
    ]
  },
  "signature": "KixlKWxevr6mO+UlSaR5ig==",
  "last_modified": 1470070273935,
  "model_name": "hpa_reporting_model2",
  "null_string": null,
  "hbase_mapping": {
    "column_family": [
      {
        "name": "F1",
        "columns": [
          {
            "qualifier": "M",
            "measure_refs": [
              "_COUNT_",
              "CLICKS"
            ]
          }
        ]
      }
    ]
  },
  "aggregation_groups": [
    {
      "includes": [
        "REPORT_DATE",
        "SEARCH_TYPE",
        "HOTEL_ID",
        "CHILD_ID",
        "COUNTRY",
        "DEVICE_TYPE",
        "STAY_LENGTH",
        "TRUE_RANK_AG",
        "ROOM_BUNDLE",
        "SITE_ID"
      ],
      "select_rule": {
        "hierarchy_dims": [],
        "mandatory_dims": [
          "SITE_ID",
          "CHILD_ID"
        ],
        "joint_dims": [
          [
            "ROOM_BUNDLE",
            "TRUE_RANK_AG"
          ]
        ]
      }
    }
  ],
  "notify_list": [],
  "status_need_notify": [
    "ERROR",
    "DISCARDED",
    "SUCCEED"
  ],
  "partition_date_start": 0,
  "partition_date_end": 3153600000000,
  "auto_merge_time_ranges": [
    604800000,
    2419200000
  ],
  "retention_range": 0,
  "engine_type": 2,
  "storage_type": 2,
  "override_kylin_properties": {}
}

On Mon, Aug 1, 2016 at 8:02 PM, ShaoFeng Shi <sh...@apache.org> wrote:

> Hi Jason,
>
> As Yiming mentioned, the cube design matters for the performance of both
> build and query; please check "Optimize Cube" in the document web page and
> do optimizaiton as much as possible;
>
> Besides, the cluster's capacity and Hadoop configuration is also an
> important factor; Try to identify the bottleneck and then optimize or add
> capacity.
>
> From 1.5 Kylin ships with two cubing algorithm; the steps "Build
> N-Dimension Cuboid" are the legacy "Layered" cubing algorithm; They will be
> skipped when Kylin selects to use the new "Fast" cubing algorithm, which is
> the "Build Cube" step after them. Please click the hadoop link in that step
> to inspect the MR job's statistics;
>
> Hope this helps to some extend;
>
>
>
> 2016-08-02 8:44 GMT+08:00 Yiming Liu <li...@gmail.com>:
>
> > Hi Jason,
> >
> > Cube design is the performance key for Kylin, not only query, but also
> cube
> > building process. How to select dimensions, how to define the
> relationship
> > between dimensions, how to select encode method, how to define measure,
> > even how to choose the Hbase key order will have a significant impact on
> > performance.  There are quite a few wonderful documents introducing how
> to
> > do this, http://kylin.apache.org/docs15/ .
> >
> > One more thing, if you could share your cube design, you would get help
> > easier here.
> >
> > 2016-08-02 7:20 GMT+08:00 Jason Hale <ja...@koddi.com>:
> >
> > > I'm setting up a test case for a portion of our dataset, to evaluate
> > Kylin
> > > and I'm not seeing the performance that I would expect.
> > >
> > > The cube building process is taking about 5-6 hours with  ~69,000,000
> > > records and 10 dimensions. I'm not sure if that's the expected build
> > time,
> > > but the other problem is the query performance after building the cube.
> > >
> > > All queries were tested with a very simple query (e.g. SELECT
> SUM(clicks)
> > > FROM reporting GROUP BY search_type)
> > >
> > > Grouping by 1 or 2 dimensions gives me very responsive queries (under 2
> > > seconds), but adding more dimensions drastically increases the query
> time
> > > (over 1 minute and it times out through hbase). I would expect these
> > > queries to have all similar query times since they should query the
> > > respective cuboid, so I'm not sure why the performance would suffer. I
> > > didn't set up any special rules for the cube, but during the build
> > process
> > > it showed all the N-dimension cubes and the log simply said 'skipped'.
> > >
> > > Is there something I'm missing in the configuration?
> > >
> > > I have a HDP cluster with 3 nodes and 1 client node on which Kylin is
> > > installed. Do I need to adjust the hadoop configuration. I'm using most
> > of
> > > the default HDP settings.
> > >
> > > What more information can I provide?
> > >
> >
> >
> >
> > --
> > With Warm regards
> >
> > Yiming Liu (刘一鸣)
> >
>
>
>
> --
> Best regards,
>
> Shaofeng Shi
>

Re: Kylin Cube Performance

Posted by ShaoFeng Shi <sh...@apache.org>.
Hi Jason,

As Yiming mentioned, the cube design matters for the performance of both
build and query; please check "Optimize Cube" in the document web page and
do optimizaiton as much as possible;

Besides, the cluster's capacity and Hadoop configuration is also an
important factor; Try to identify the bottleneck and then optimize or add
capacity.

From 1.5 Kylin ships with two cubing algorithm; the steps "Build
N-Dimension Cuboid" are the legacy "Layered" cubing algorithm; They will be
skipped when Kylin selects to use the new "Fast" cubing algorithm, which is
the "Build Cube" step after them. Please click the hadoop link in that step
to inspect the MR job's statistics;

Hope this helps to some extend;



2016-08-02 8:44 GMT+08:00 Yiming Liu <li...@gmail.com>:

> Hi Jason,
>
> Cube design is the performance key for Kylin, not only query, but also cube
> building process. How to select dimensions, how to define the relationship
> between dimensions, how to select encode method, how to define measure,
> even how to choose the Hbase key order will have a significant impact on
> performance.  There are quite a few wonderful documents introducing how to
> do this, http://kylin.apache.org/docs15/ .
>
> One more thing, if you could share your cube design, you would get help
> easier here.
>
> 2016-08-02 7:20 GMT+08:00 Jason Hale <ja...@koddi.com>:
>
> > I'm setting up a test case for a portion of our dataset, to evaluate
> Kylin
> > and I'm not seeing the performance that I would expect.
> >
> > The cube building process is taking about 5-6 hours with  ~69,000,000
> > records and 10 dimensions. I'm not sure if that's the expected build
> time,
> > but the other problem is the query performance after building the cube.
> >
> > All queries were tested with a very simple query (e.g. SELECT SUM(clicks)
> > FROM reporting GROUP BY search_type)
> >
> > Grouping by 1 or 2 dimensions gives me very responsive queries (under 2
> > seconds), but adding more dimensions drastically increases the query time
> > (over 1 minute and it times out through hbase). I would expect these
> > queries to have all similar query times since they should query the
> > respective cuboid, so I'm not sure why the performance would suffer. I
> > didn't set up any special rules for the cube, but during the build
> process
> > it showed all the N-dimension cubes and the log simply said 'skipped'.
> >
> > Is there something I'm missing in the configuration?
> >
> > I have a HDP cluster with 3 nodes and 1 client node on which Kylin is
> > installed. Do I need to adjust the hadoop configuration. I'm using most
> of
> > the default HDP settings.
> >
> > What more information can I provide?
> >
>
>
>
> --
> With Warm regards
>
> Yiming Liu (刘一鸣)
>



-- 
Best regards,

Shaofeng Shi

Re: Kylin Cube Performance

Posted by Yiming Liu <li...@gmail.com>.
Hi Jason,

Cube design is the performance key for Kylin, not only query, but also cube
building process. How to select dimensions, how to define the relationship
between dimensions, how to select encode method, how to define measure,
even how to choose the Hbase key order will have a significant impact on
performance.  There are quite a few wonderful documents introducing how to
do this, http://kylin.apache.org/docs15/ .

One more thing, if you could share your cube design, you would get help
easier here.

2016-08-02 7:20 GMT+08:00 Jason Hale <ja...@koddi.com>:

> I'm setting up a test case for a portion of our dataset, to evaluate Kylin
> and I'm not seeing the performance that I would expect.
>
> The cube building process is taking about 5-6 hours with  ~69,000,000
> records and 10 dimensions. I'm not sure if that's the expected build time,
> but the other problem is the query performance after building the cube.
>
> All queries were tested with a very simple query (e.g. SELECT SUM(clicks)
> FROM reporting GROUP BY search_type)
>
> Grouping by 1 or 2 dimensions gives me very responsive queries (under 2
> seconds), but adding more dimensions drastically increases the query time
> (over 1 minute and it times out through hbase). I would expect these
> queries to have all similar query times since they should query the
> respective cuboid, so I'm not sure why the performance would suffer. I
> didn't set up any special rules for the cube, but during the build process
> it showed all the N-dimension cubes and the log simply said 'skipped'.
>
> Is there something I'm missing in the configuration?
>
> I have a HDP cluster with 3 nodes and 1 client node on which Kylin is
> installed. Do I need to adjust the hadoop configuration. I'm using most of
> the default HDP settings.
>
> What more information can I provide?
>



-- 
With Warm regards

Yiming Liu (刘一鸣)