Posted to user@kylin.apache.org by 青椒肉丝 <13...@qq.com> on 2019/08/22 05:47:29 UTC

How to build model and cube for dimension tables that update data frequently

Hi, guys:

     I have a dimension table whose name field changes frequently while its ID field never changes; ID and name correspond one-to-one. I currently put the name column among the cube's derived dimensions, but whenever a name value changes I have to rebuild the cube to see the latest value. Once my cube accumulates two years of data, rebuilding it becomes expensive. Can a Kylin cube query the latest dimension data without a full rebuild? If so, how should I build the model and cube?

David

Re: How to build model and cube for dimension tables that update data frequently

Posted by Xiaoxiang Yu <xi...@kyligence.io>.
Hi,

 I think you should use the "Lookup Refresh" action in the cube management list; here is a screenshot.
  With Lookup Refresh, Kylin can query the latest dimension data without rebuilding the cube. You may have to wait a moment for the snapshot rebuild to succeed; kylin.log will tell you when it has finished.

[screenshot: "Lookup Refresh" action in the cube management list]
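If you prefer to automate this instead of clicking through the UI, recent Kylin versions also expose the lookup snapshot refresh over REST. Below is a minimal Python sketch; the endpoint path (`/cubes/{cube}/refresh_lookup`), the payload field names (`cubeName`, `lookupTableName`), and the host/credentials are assumptions based on Kylin 2.4+ and should be checked against your version's REST API documentation.

```python
import base64

# Hypothetical server address -- replace with your own Kylin instance.
KYLIN_BASE = "http://localhost:7070/kylin/api"

def build_refresh_lookup_request(cube_name, lookup_table,
                                 user="ADMIN", password="KYLIN"):
    """Build URL, headers, and body for Kylin's lookup-snapshot refresh call.

    Kylin 2.4+ is assumed to expose PUT /cubes/{cube}/refresh_lookup for
    rebuilding a lookup table snapshot without rebuilding cube segments;
    verify the exact path and payload against your version's docs.
    """
    # Kylin's REST API uses HTTP Basic authentication.
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    url = f"{KYLIN_BASE}/cubes/{cube_name}/refresh_lookup"
    headers = {
        "Authorization": f"Basic {token}",
        "Content-Type": "application/json",
    }
    # Payload field names are assumptions, not a confirmed schema.
    body = {"cubeName": cube_name, "lookupTableName": lookup_table}
    return url, headers, body

url, headers, body = build_refresh_lookup_request("M1C2", "LACUS.KYLIN_ACCOUNT")
print(url)  # -> http://localhost:7070/kylin/api/cubes/M1C2/refresh_lookup

# To actually submit the job you would do something like:
#   import requests
#   requests.put(url, headers=headers, json=body)
```

Submitting this on a schedule (e.g. from cron after the Hive dimension table is reloaded) would keep the snapshot current without any manual clicks.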

--------
The following is the log of the related job.

2019-08-23 00:24:02,072 INFO  [FetcherRunner 198474142-43] threadpool.FetcherRunner:62 : LookupSnapshotBuildJob{id=780d706d-72ba-81bc-4442-6151a91ebab1, name=Lookup  CUBE - M1C2 -  TABLE - LACUS.KYLIN_ACCOUNT - CST 2019-08-23 00:23:36, state=READY} prepare to schedule and its priority is 30
2019-08-23 00:24:02,073 INFO  [FetcherRunner 198474142-43] threadpool.FetcherRunner:66 : LookupSnapshotBuildJob{id=780d706d-72ba-81bc-4442-6151a91ebab1, name=Lookup  CUBE - M1C2 -  TABLE - LACUS.KYLIN_ACCOUNT - CST 2019-08-23 00:23:36, state=READY} scheduled
2019-08-23 00:24:02,073 INFO  [FetcherRunner 198474142-43] threadpool.DefaultFetcherRunner:85 : Job Fetcher: 0 should running, 1 actual running, 0 stopped, 1 ready, 6 already succeed, 0 error, 0 discarded, 0 others
2019-08-23 00:24:02,073 INFO  [Scheduler 384496282 Job 780d706d-72ba-81bc-4442-6151a91ebab1-166] execution.AbstractExecutable:162 : Executing AbstractExecutable (Lookup  CUBE - M1C2 -  TABLE - LACUS.KYLIN_ACCOUNT - CST 2019-08-23 00:23:36)
2019-08-23 00:24:02,077 INFO  [Scheduler 384496282 Job 780d706d-72ba-81bc-4442-6151a91ebab1-166] execution.ExecutableManager:471 : job id:780d706d-72ba-81bc-4442-6151a91ebab1 from READY to RUNNING
2019-08-23 00:24:02,077 DEBUG [pool-6-thread-1] cachesync.Broadcaster:116 : Servers in the cluster: [localhost:7193]
2019-08-23 00:24:02,077 DEBUG [pool-6-thread-1] cachesync.Broadcaster:126 : Announcing new broadcast to all: BroadcastEvent{entity=execute_output, event=update, cacheKey=780d706d-72ba-81bc-4442-6151a91ebab1}
2019-08-23 00:24:02,079 DEBUG [pool-6-thread-1] cachesync.Broadcaster:116 : Servers in the cluster: [localhost:7193]
2019-08-23 00:24:02,079 DEBUG [pool-6-thread-1] cachesync.Broadcaster:126 : Announcing new broadcast to all: BroadcastEvent{entity=execute_output, event=update, cacheKey=780d706d-72ba-81bc-4442-6151a91ebab1}
2019-08-23 00:24:02,080 INFO  [Scheduler 384496282 Job 780d706d-72ba-81bc-4442-6151a91ebab1-166] execution.AbstractExecutable:162 : Executing AbstractExecutable (Take Snapshot to Metadata Store)
2019-08-23 00:24:02,081 DEBUG [http-bio-7193-exec-9] cachesync.Broadcaster:246 : Broadcasting UPDATE, execute_output, 780d706d-72ba-81bc-4442-6151a91ebab1
2019-08-23 00:24:02,082 DEBUG [http-bio-7193-exec-9] cachesync.Broadcaster:280 : Done broadcasting UPDATE, execute_output, 780d706d-72ba-81bc-4442-6151a91ebab1
2019-08-23 00:24:02,084 INFO  [Scheduler 384496282 Job 780d706d-72ba-81bc-4442-6151a91ebab1-166] execution.ExecutableManager:471 : job id:780d706d-72ba-81bc-4442-6151a91ebab1-00 from READY to RUNNING
2019-08-23 00:24:02,086 DEBUG [http-bio-7193-exec-9] cachesync.Broadcaster:246 : Broadcasting UPDATE, execute_output, 780d706d-72ba-81bc-4442-6151a91ebab1
2019-08-23 00:24:02,086 DEBUG [http-bio-7193-exec-9] cachesync.Broadcaster:280 : Done broadcasting UPDATE, execute_output, 780d706d-72ba-81bc-4442-6151a91ebab1
2019-08-23 00:24:02,189 INFO  [Scheduler 384496282 Job 780d706d-72ba-81bc-4442-6151a91ebab1-166] hive.metastore:385 : Trying to connect to metastore with URI thrift://cdh-master:9083
2019-08-23 00:24:02,191 INFO  [Scheduler 384496282 Job 780d706d-72ba-81bc-4442-6151a91ebab1-166] hive.metastore:430 : Opened a connection to metastore, current connections: 3
2019-08-23 00:24:02,202 INFO  [Scheduler 384496282 Job 780d706d-72ba-81bc-4442-6151a91ebab1-166] hive.metastore:482 : Connected to metastore.
2019-08-23 00:24:02,350 INFO  [Scheduler 384496282 Job 780d706d-72ba-81bc-4442-6151a91ebab1-166] lookup.LookupSnapshotToMetaStoreStep:65 : take snapshot for table:LACUS.KYLIN_ACCOUNT
2019-08-23 00:24:02,382 INFO  [Scheduler 384496282 Job 780d706d-72ba-81bc-4442-6151a91ebab1-166] lookup.SnapshotManager:244 : Loading snapshotTable from /table_snapshot/LACUS.KYLIN_ACCOUNT/7b38cfc3-9e01-f456-a87f-d01403c9ac77.snapshot, with loadData: false
2019-08-23 00:24:02,452 DEBUG [Scheduler 384496282 Job 780d706d-72ba-81bc-4442-6151a91ebab1-166] hbase.HBaseConnection:181 : Using the working dir FS for HBase: hdfs://cdh-master:8020
2019-08-23 00:24:02,706 INFO  [Scheduler 384496282 Job 780d706d-72ba-81bc-4442-6151a91ebab1-166] hive.metastore:385 : Trying to connect to metastore with URI thrift://cdh-master:9083
2019-08-23 00:24:02,707 INFO  [Scheduler 384496282 Job 780d706d-72ba-81bc-4442-6151a91ebab1-166] hive.metastore:430 : Opened a connection to metastore, current connections: 4
2019-08-23 00:24:02,708 INFO  [Scheduler 384496282 Job 780d706d-72ba-81bc-4442-6151a91ebab1-166] hive.metastore:482 : Connected to metastore.
2019-08-23 00:24:03,041 INFO  [Scheduler 384496282 Job 780d706d-72ba-81bc-4442-6151a91ebab1-166] Configuration.deprecation:1174 : mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
2019-08-23 00:24:03,066 INFO  [Scheduler 384496282 Job 780d706d-72ba-81bc-4442-6151a91ebab1-166] mapred.FileInputFormat:249 : Total input paths to process : 1
2019-08-23 00:24:03,099 INFO  [Scheduler 384496282 Job 780d706d-72ba-81bc-4442-6151a91ebab1-166] mapreduce.InternalUtil:155 : Initializing org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe with properties {name=LACUS.KYLIN_ACCOUNT, numFiles=1, field.delim=,, columns.types=bigint,int,int,string,string, serialization.format=,, columns=account_id,account_buyer_level,account_seller_level,account_country,account_contact, rawDataSize=0, numRows=0, serialization.lib=org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, COLUMN_STATS_ACCURATE=true, totalSize=200000, serialization.null.format=\N, transient_lastDdlTime=1561880270}
2019-08-23 00:24:03,548 INFO  [Scheduler 384496282 Job 780d706d-72ba-81bc-4442-6151a91ebab1-166] mapred.FileInputFormat:249 : Total input paths to process : 1
2019-08-23 00:24:03,558 INFO  [Scheduler 384496282 Job 780d706d-72ba-81bc-4442-6151a91ebab1-166] mapreduce.InternalUtil:155 : Initializing org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe with properties {name=LACUS.KYLIN_ACCOUNT, numFiles=1, field.delim=,, columns.types=bigint,int,int,string,string, serialization.format=,, columns=account_id,account_buyer_level,account_seller_level,account_country,account_contact, rawDataSize=0, numRows=0, serialization.lib=org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, COLUMN_STATS_ACCURATE=true, totalSize=200000, serialization.null.format=\N, transient_lastDdlTime=1561880270}
2019-08-23 00:24:03,704 INFO  [Scheduler 384496282 Job 780d706d-72ba-81bc-4442-6151a91ebab1-166] lookup.SnapshotManager:244 : Loading snapshotTable from /table_snapshot/LACUS.KYLIN_ACCOUNT/7b38cfc3-9e01-f456-a87f-d01403c9ac77.snapshot, with loadData: true
2019-08-23 00:24:03,773 DEBUG [Scheduler 384496282 Job 780d706d-72ba-81bc-4442-6151a91ebab1-166] lookup.SnapshotManager:251 : Loaded snapshot at /table_snapshot/LACUS.KYLIN_ACCOUNT/7b38cfc3-9e01-f456-a87f-d01403c9ac77.snapshot
2019-08-23 00:24:03,779 DEBUG [Scheduler 384496282 Job 780d706d-72ba-81bc-4442-6151a91ebab1-166] persistence.HDFSResourceStore:98 : Writing pushdown file /kylin/kylin_verify_timeout/resources/table_snapshot/LACUS.KYLIN_ACCOUNT/b40c9e0d-b758-1898-7b01-27510a29443b.snapshot.temp.-1395423725
2019-08-23 00:24:03,939 DEBUG [Scheduler 384496282 Job 780d706d-72ba-81bc-4442-6151a91ebab1-166] persistence.HDFSResourceStore:117 : Move /kylin/kylin_verify_timeout/resources/table_snapshot/LACUS.KYLIN_ACCOUNT/b40c9e0d-b758-1898-7b01-27510a29443b.snapshot.temp.-1395423725 to /kylin/kylin_verify_timeout/resources/table_snapshot/LACUS.KYLIN_ACCOUNT/b40c9e0d-b758-1898-7b01-27510a29443b.snapshot
2019-08-23 00:24:03,946 DEBUG [Scheduler 384496282 Job 780d706d-72ba-81bc-4442-6151a91ebab1-166] persistence.HDFSResourceStore:65 : Writing marker for big resource /table_snapshot/LACUS.KYLIN_ACCOUNT/b40c9e0d-b758-1898-7b01-27510a29443b.snapshot
2019-08-23 00:24:04,001 INFO  [Scheduler 384496282 Job 780d706d-72ba-81bc-4442-6151a91ebab1-166] lookup.LookupSnapshotToMetaStoreStep:68 : update snapshot path to cube metadata
2019-08-23 00:24:04,002 INFO  [Scheduler 384496282 Job 780d706d-72ba-81bc-4442-6151a91ebab1-166] cube.CubeManager:372 : Updating cube instance 'M1C2'
2019-08-23 00:24:04,003 DEBUG [Scheduler 384496282 Job 780d706d-72ba-81bc-4442-6151a91ebab1-166] cachesync.CachedCrudAssist:198 : Saving CubeInstance at /cube/M1C2.json
2019-08-23 00:24:04,005 DEBUG [pool-6-thread-1] cachesync.Broadcaster:116 : Servers in the cluster: [localhost:7193]
2019-08-23 00:24:04,006 DEBUG [pool-6-thread-1] cachesync.Broadcaster:126 : Announcing new broadcast to all: BroadcastEvent{entity=cube, event=update, cacheKey=M1C2}
2019-08-23 00:24:04,010 DEBUG [http-bio-7193-exec-4] cachesync.Broadcaster:246 : Broadcasting UPDATE, cube, M1C2
2019-08-23 00:24:04,011 DEBUG [http-bio-7193-exec-4] cachesync.Broadcaster:246 : Broadcasting UPDATE, project_data, VerifyTimeout
2019-08-23 00:24:04,012 INFO  [http-bio-7193-exec-4] service.CacheService:123 : cleaning cache for project VerifyTimeout (currently remove all entries)
2019-08-23 00:24:04,012 DEBUG [http-bio-7193-exec-4] cachesync.Broadcaster:280 : Done broadcasting UPDATE, project_data, VerifyTimeout
2019-08-23 00:24:04,012 DEBUG [http-bio-7193-exec-4] cachesync.Broadcaster:280 : Done broadcasting UPDATE, cube, M1C2
2019-08-23 00:24:04,017 INFO  [Scheduler 384496282 Job 780d706d-72ba-81bc-4442-6151a91ebab1-166] execution.ExecutableManager:471 : job id:780d706d-72ba-81bc-4442-6151a91ebab1-00 from RUNNING to SUCCEED
2019-08-23 00:24:04,021 DEBUG [pool-6-thread-1] cachesync.Broadcaster:116 : Servers in the cluster: [localhost:7193]
2019-08-23 00:24:04,021 DEBUG [pool-6-thread-1] cachesync.Broadcaster:126 : Announcing new broadcast to all: BroadcastEvent{entity=execute_output, event=update, cacheKey=780d706d-72ba-81bc-4442-6151a91ebab1}
2019-08-23 00:24:04,024 INFO  [Scheduler 384496282 Job 780d706d-72ba-81bc-4442-6151a91ebab1-166] execution.ExecutableManager:471 : job id:780d706d-72ba-81bc-4442-6151a91ebab1 from RUNNING to SUCCEED
2019-08-23 00:24:04,024 DEBUG [pool-6-thread-1] cachesync.Broadcaster:116 : Servers in the cluster: [localhost:7193]
2019-08-23 00:24:04,024 DEBUG [Scheduler 384496282 Job 780d706d-72ba-81bc-4442-6151a91ebab1-166] execution.AbstractExecutable:332 : no need to send email, user list is empty
2019-08-23 00:24:04,024 DEBUG [pool-6-thread-1] cachesync.Broadcaster:126 : Announcing new broadcast to all: BroadcastEvent{entity=execute_output, event=update, cacheKey=780d706d-72ba-81bc-4442-6151a91ebab1}

----------------
Best wishes,
Xiaoxiang Yu


From: 青椒肉丝 <13...@qq.com>
Reply-To: "user@kylin.apache.org" <us...@kylin.apache.org>
Date: Thursday, August 22, 2019, 13:47
To: user <us...@kylin.apache.org>
Subject: How to build model and cube for dimension tables that update data frequently
