Posted to commits@doris.apache.org by GitBox <gi...@apache.org> on 2020/10/26 05:23:27 UTC

[GitHub] [incubator-doris] wangbo opened a new issue #4788: [Proposal] Use Rollup's Aggregation Key as Rollup's Bucket Key

wangbo opened a new issue #4788:
URL: https://github.com/apache/incubator-doris/issues/4788


   **Why Do So**
   We observe that Doris's query performance is significantly slower than Kylin's when the query contains bitmap computation.
   The reason is that even when a query hits a rollup, Doris still needs to do additional data scanning and computation.
   A Doris rollup uses the base table's distribution key, so rows with the same rollup aggregation key may still be spread across several of the rollup's buckets.
   Bitmap/HLL columns are greatly affected by this situation.
   
   **Solution**
   Use the rollup's aggregation key as the rollup's bucket key so that the data is truly pre-aggregated.
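
To illustrate the point above, here is a minimal Python sketch, not Doris code. The rows, the `bucket_of` toy hash, and the helper names are all hypothetical. It counts how many partial (bucket, event_day) groups a query would have to scan and merge under each bucketing choice.

```python
from collections import defaultdict

# Hypothetical raw rows: (siteid, event_day, user_id). The base table is
# bucketed by siteid; the rollup aggregates user ids by event_day only.
ROWS = [
    (1, "2020-10-01", 101),
    (2, "2020-10-01", 102),
    (3, "2020-10-01", 101),
    (1, "2020-10-02", 103),
    (2, "2020-10-02", 104),
]
NUM_BUCKETS = 3


def bucket_of(value, num_buckets=NUM_BUCKETS):
    """Toy deterministic hash; an assumption, not Doris's real hash function."""
    return sum(str(value).encode()) % num_buckets


def build_rollup(rows, bucket_column):
    """Aggregate user ids per event_day inside each bucket.

    bucket_column is 0 to bucket by siteid (the base distribution key)
    or 1 to bucket by event_day (the rollup's aggregation key).
    """
    buckets = defaultdict(lambda: defaultdict(set))
    for row in rows:
        bucket = bucket_of(row[bucket_column])
        buckets[bucket][row[1]].add(row[2])
    return buckets


def partial_groups(buckets):
    """Number of (bucket, event_day) groups the query has to scan and merge."""
    return sum(len(per_day) for per_day in buckets.values())


by_base_key = build_rollup(ROWS, bucket_column=0)
by_agg_key = build_rollup(ROWS, bucket_column=1)
print(partial_groups(by_base_key), partial_groups(by_agg_key))
```

With this toy data, bucketing by `siteid` leaves 5 partial groups for 2 distinct days, while bucketing by `event_day` leaves exactly 2, one per day, so no cross-bucket merge of the per-day sets is needed.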
   
   **POC**
   env:
   * 1 FE, 3 BE
   * data:
   	* v2 storage format, one replica
   	* six bitmap columns, each with a cardinality of about 50,000,000
   
   Test result
   * test sql: the SQL fully hits a rollup that contains six bitmap columns
   * case 1: data has just been loaded into the Doris BEs and compaction has not finished
   	* rollup uses base key as bucket key:
   		* query time: 14s
   	* rollup uses agg-key as bucket key:
   		* query time: 6s
   * case 2: data is fully compacted
   	* rollup uses base key as bucket key:
   		* first query time (without OS cache and BE's page cache): 1.2s
   		* second query time (hits BE's page cache): 1.0s
   		* scan bytes: 241M
   		* scan rows: 1104
   		* return rows: 1104
   	* rollup uses agg-key as bucket key:
   		* first query time (without OS cache and BE's page cache): 1.2s
   		* second query time (hits BE's page cache): 1.0s
   		* scan bytes: 662M
   		* scan rows: 10079
   		* return rows: 1104
   * case 3: data consistency
   	* the query result is the same whether the rollup uses its own agg-key or the base table's distribution key as the bucket key
   
   So in case 1, using the rollup's agg-key as the bucket key makes the query more than twice as fast (14s vs. 6s) compared with using the base table's distribution key.
   Because the rollup is then truly pre-aggregated, both the data scanned and the computation are reduced.
   
   **Future Work**
   * Stage 1: Make this feature available in production quickly
   Make it a configurable property of the OLAP table, so that users can choose to use the rollup's agg-key as its bucket key in stream load/spark load.
   Users could even set the rollup's bucket number.
   
   * Stage 2: Support Schema Change
   I prefer to support schema change for this feature in a Spark job.
   The reasons are as follows:
   ```Read Write Separation``` is a necessary feature for Doris.
   This feature needs to shuffle data during schema change, which would have a considerable impact on Doris's stability,
    especially for a big table.
    I don't think Doris is good at, or needs to be good at, long-running shuffles.
    So Spark is the best choice.
   
   * Stage 3: Support Colocate Join
       Rollups with different agg-keys need a shuffle join at query time.
   
   * Stage 4: Support Materialized View
       Needs further research.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] wangbo commented on issue #4788: [Proposal] Use Rollup's Aggregation Key as Rollup's Bucket Key

Posted by GitBox <gi...@apache.org>.
wangbo commented on issue #4788:
URL: https://github.com/apache/incubator-doris/issues/4788#issuecomment-718710533


   >Can we configure the replica number of the rollup table, or is it the same as the base table's?
   
   It is the same as the base table's.




[GitHub] [incubator-doris] wangbo commented on issue #4788: [Proposal] Use Rollup's Aggregation Key as Rollup's Bucket Key

Posted by GitBox <gi...@apache.org>.
wangbo commented on issue #4788:
URL: https://github.com/apache/incubator-doris/issues/4788#issuecomment-718326842


   >By default the rollup table has the same replica number as the base table. Is this also the case in your design,
   or will you add a config for the rollup table's replica number?
   
   @Yao-MR  I don't understand what you mean.




[GitHub] [incubator-doris] Yao-MR commented on issue #4788: [Proposal] Use Rollup's Aggregation Key as Rollup's Bucket Key

Posted by GitBox <gi...@apache.org>.
Yao-MR commented on issue #4788:
URL: https://github.com/apache/incubator-doris/issues/4788#issuecomment-716385648


   Why not leave the choice to the user and let them set the distribution key of the rollup table, instead of always using the aggregation key? That way
   the user can control where the data is placed. Using the aggregation key may improve performance in bitmap union scenarios, but may not improve performance in other scenarios.




[GitHub] [incubator-doris] Yao-MR commented on issue #4788: [Proposal] Use Rollup's Aggregation Key as Rollup's Bucket Key

Posted by GitBox <gi...@apache.org>.
Yao-MR commented on issue #4788:
URL: https://github.com/apache/incubator-doris/issues/4788#issuecomment-717792319


   The rollup table has the same replica number as the base table. Is this also the case in your design,
   or will you add a config for the rollup table's replica number?








[GitHub] [incubator-doris] morningman commented on issue #4788: [Proposal] Use Rollup's Aggregation Key as Rollup's Bucket Key

Posted by GitBox <gi...@apache.org>.
morningman commented on issue #4788:
URL: https://github.com/apache/incubator-doris/issues/4788#issuecomment-716314611


   Good proposal.
   I think we can allow users to create a rollup with a different distribution key when there is no data in the table.
   This is simple and does not need any "data shuffle".
   Then we can implement more complicated schema change in the future.




[GitHub] [incubator-doris] Yao-MR commented on issue #4788: [Proposal] Use Rollup's Aggregation Key as Rollup's Bucket Key

Posted by GitBox <gi...@apache.org>.
Yao-MR commented on issue #4788:
URL: https://github.com/apache/incubator-doris/issues/4788#issuecomment-716397086


   > > Why not leave the choice to the user and let them set the distribution key of the rollup table, instead of always using the aggregation key? That way
   > > the user can control where the data is placed. Using the aggregation key may improve performance in bitmap union scenarios, but may not improve performance in other scenarios.
   > 
   > @Yao-MR This is more flexible, and I also plan to do as you say.
   > I'll provide the user interface design later; please review it when you have time.
   > But I'm curious about your real usage scenarios for `user can control the location of data`
   
   First, the location of the rollup table's data does not affect data consistency, even in bitmap union scenarios.
   
   Second, the bucket key only controls where the data is placed. If a user's subquery hits the rollup table, we cannot know how the rollup's result will be used afterwards, so we can simply treat the rollup table like the base table.
   Just as the base table can have a user-defined bucket key, which we do not refuse, why not allow a user-defined key as the bucket key?
   
   One usage scenario I know of: two rollup tables have the same user-defined key. When they are joined on that key, join performance improves, in the same way as a colocation join that uses DISTRIBUTED BY HASH(`join key`).
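
The join scenario above can be sketched in Python as follows. This is only an illustration under stated assumptions: the toy hash `bucket_of`, the helpers `bucketize` and `local_join`, and the sample rows are hypothetical and are not Doris internals. The idea is that when both rollups use the same bucket key and the same bucket number, bucket i of one side only ever needs bucket i of the other side.

```python
def bucket_of(key, num_buckets):
    """Toy deterministic hash; an assumption, not Doris's real hash function."""
    return sum(str(key).encode()) % num_buckets


def bucketize(rows, key_index, num_buckets):
    """Place each row into a bucket chosen by hashing its key column."""
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[bucket_of(row[key_index], num_buckets)].append(row)
    return buckets


def local_join(left_buckets, right_buckets, left_key, right_key):
    """Join bucket i of the left side only with bucket i of the right side."""
    result = []
    for left_bucket, right_bucket in zip(left_buckets, right_buckets):
        index = {}
        for right_row in right_bucket:
            index.setdefault(right_row[right_key], []).append(right_row)
        for left_row in left_bucket:
            for right_row in index.get(left_row[left_key], []):
                result.append(left_row + right_row)
    return result


# Two hypothetical rollups, both bucketed by event_day into 4 buckets.
rollup_a = [("2020-10-01", 10), ("2020-10-02", 20), ("2020-10-03", 30)]
rollup_b = [("2020-10-01", 1), ("2020-10-03", 3)]

joined = local_join(bucketize(rollup_a, 0, 4), bucketize(rollup_b, 0, 4), 0, 0)
# Reference result from a plain nested-loop join over all rows.
expected = [a + b for a in rollup_a for b in rollup_b if a[0] == b[0]]
print(sorted(joined) == sorted(expected))
```

Because equal keys always hash to the same bucket index on both sides, the bucket-local join returns the same rows as a join over the full data, without moving rows between buckets.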
   
   
   




[GitHub] [incubator-doris] wangbo commented on issue #4788: [Proposal] Use Rollup's Aggregation Key as Rollup's Bucket Key

Posted by GitBox <gi...@apache.org>.
wangbo commented on issue #4788:
URL: https://github.com/apache/incubator-doris/issues/4788#issuecomment-717767146


   **User instructions**
   **1 create table**
   ```
   CREATE TABLE example_db.rolup_index_table
       (
           event_day DATE,
           siteid INT DEFAULT '10',
           citycode SMALLINT,
           username VARCHAR(32) DEFAULT '',
           pv BIGINT SUM DEFAULT '0'
       )
       AGGREGATE KEY(event_day, siteid, citycode, username)
       DISTRIBUTED BY HASH(siteid) BUCKETS 10
       rollup (
       r1(event_day,siteid) distributed by hash(event_day) buckets 5, -- rollup r1 uses the user-specified distribution
       r3(event_day) -- no distribution specified here, so the default distribution is used
       )
       PROPERTIES("replication_num" = "1");
   ```
   **2 add rollup**
   Adding a rollup is only supported when the table has no data.
   case 1: the user specifies the distribution
   ```
   alter table rolup_index_table add rollup r1(event_day,siteid) distributed by hash(event_day) buckets 5;
   ```
   case 2: the user does not specify distribution info, so the default distribution info is used
   ```
   alter table rolup_index_table add rollup r1(event_day,siteid);
   ```
   
   **3 add partition**
   Only the bucket number can be modified.
   case 1: modify the bucket number
   ```
   alter table rolup_index_table add partition p1 values [(2017,2018),(2018,2019))
     rollup (r1 distributed by hash(k1,k2) buckets 3);
   ```
   case 2: use the default distribution info
   ```
   alter table rolup_index_table add partition p1 values [(2017,2018),(2018,2019));
   ```
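
The rule "only the bucket number can be modified" could be checked roughly as in the Python sketch below. It assumes the rule means that a partition-level rollup distribution must keep the table-level hash columns and may only change the bucket count. `HashDistribution` and `resolve_partition_distribution` are hypothetical names for illustration, not Doris classes.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class HashDistribution:
    """Hypothetical stand-in for a rollup's hash distribution."""
    columns: Tuple[str, ...]
    bucket_num: int


def resolve_partition_distribution(
    table_level: HashDistribution,
    requested: Optional[HashDistribution] = None,
) -> HashDistribution:
    """Pick the distribution a new partition uses for one rollup."""
    if requested is None:
        # Case 2: nothing specified, so fall back to the table-level setting.
        return table_level
    if requested.columns != table_level.columns:
        # Assumed reading of the rule: the hash columns must stay the same.
        raise ValueError("only the bucket number may differ from the table-level distribution")
    if requested.bucket_num <= 0:
        raise ValueError("bucket number must be positive")
    # Case 1: same hash columns, different bucket number.
    return requested


r1_table_level = HashDistribution(columns=("event_day",), bucket_num=5)
p1_r1 = resolve_partition_distribution(
    r1_table_level, HashDistribution(columns=("event_day",), bucket_num=3)
)
p2_r1 = resolve_partition_distribution(r1_table_level)
```

Here `p1_r1` keeps `event_day` as the hash column with 3 buckets, and `p2_r1` is simply the table-level distribution.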
   
   **Current limitations (to be supported in the future)**
    * A rollup can only be added when the table is empty
    * Colocate join is not supported
    * Materialized views are not supported
    * Restore is not supported
    * Spark Load is not supported
   
   **Persistent Data Structure Written to the Edit Log**
   **1 OlapTable**
   ```
   Add a field:    Map<Long, DistributionInfo> indexIdToDistributionInfo, key = indexId, value = the rollup's distribution
   indexIdToDistributionInfo only keeps rollups' distributions
   If a rollup's distribution can't be found in indexIdToDistributionInfo, it uses defaultDistribution
   ```
   **2 Partition**
   ```
   Add a field:    Map<Long, DistributionInfo> indexIdToDistributionInfo, key = indexId, value = the rollup's distribution
   indexIdToDistributionInfo only keeps rollups' distributions
   If a rollup's distribution can't be found in indexIdToDistributionInfo, it uses defaultDistribution
   ```
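
The lookup-with-fallback behaviour described for `indexIdToDistributionInfo` can be sketched in Python as follows. The class `OlapTableSketch`, the method `distribution_for`, and the index ids are hypothetical; the real field would live in the Java FE classes, whose details are not shown in this thread.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass(frozen=True)
class DistributionInfo:
    """Hypothetical simplification: hash columns plus a bucket number."""
    columns: Tuple[str, ...]
    bucket_num: int


@dataclass
class OlapTableSketch:
    """Illustrative model of the proposed metadata, not the real OlapTable."""
    default_distribution: DistributionInfo
    # Only rollups with their own distribution appear in this map.
    index_id_to_distribution_info: Dict[int, DistributionInfo] = field(default_factory=dict)

    def distribution_for(self, index_id: int) -> DistributionInfo:
        """Return the rollup's own distribution, else the default one."""
        return self.index_id_to_distribution_info.get(index_id, self.default_distribution)


# Made-up index ids: 10001 stands for rollup r1, 10002 for rollup r3.
table = OlapTableSketch(
    default_distribution=DistributionInfo(columns=("siteid",), bucket_num=10),
    index_id_to_distribution_info={
        10001: DistributionInfo(columns=("event_day",), bucket_num=5),
    },
)
r1_distribution = table.distribution_for(10001)
r3_distribution = table.distribution_for(10002)
```

In this sketch `r1_distribution` is the user-specified `event_day` distribution with 5 buckets, while `r3_distribution` falls back to the table default, mirroring the `r1` and `r3` rollups in the create table example.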




[GitHub] [incubator-doris] Yao-MR edited a comment on issue #4788: [Proposal] Use Rollup's Aggregation Key as Rollup's Bucket Key

Posted by GitBox <gi...@apache.org>.
Yao-MR edited a comment on issue #4788:
URL: https://github.com/apache/incubator-doris/issues/4788#issuecomment-716397086


   > > Why not leave the choice to the user and let them set the distribution key of the rollup table, instead of always using the aggregation key? That way
   > > the user can control where the data is placed. Using the aggregation key may improve performance in bitmap union scenarios, but may not improve performance in other scenarios.
   > 
   > @Yao-MR This is more flexible, and I also plan to do as you say.
   > I'll provide the user interface design later; please review it when you have time.
   > But I'm curious about your real usage scenarios for `user can control the location of data`
   
   First, the location of the rollup table's data does not affect data consistency, even in bitmap union scenarios.
   
   Second, the bucket key only controls where the data is placed. If a user's subquery hits the rollup table, we cannot know how the rollup's result will be used afterwards, so we can simply treat the rollup table as a special 'base table'.
   Just as the base table can have a user-defined bucket key, which we do not refuse, why not allow a user-defined key as the bucket key of the rollup table?
   
   One usage scenario I know of: two rollup tables have the same user-defined key. When they are joined on that key, join performance improves, in the same way as a colocation join that uses DISTRIBUTED BY HASH(`join key`).
   
   
   




[GitHub] [incubator-doris] Yao-MR edited a comment on issue #4788: [Proposal] Use Rollup's Aggregation Key as Rollup's Bucket Key

Posted by GitBox <gi...@apache.org>.
Yao-MR edited a comment on issue #4788:
URL: https://github.com/apache/incubator-doris/issues/4788#issuecomment-717792319


   By default the rollup table has the same replica number as the base table. Is this also the case in your design,
   or will you add a config for the rollup table's replica number?




[GitHub] [incubator-doris] Yao-MR edited a comment on issue #4788: [Proposal] Use Rollup's Aggregation Key as Rollup's Bucket Key

Posted by GitBox <gi...@apache.org>.
Yao-MR edited a comment on issue #4788:
URL: https://github.com/apache/incubator-doris/issues/4788#issuecomment-716397086


   > > Why not leave the choice to the user and let them set the distribution key of the rollup table, instead of always using the aggregation key? That way
   > > the user can control where the data is placed. Using the aggregation key may improve performance in bitmap union scenarios, but may not improve performance in other scenarios.
   > 
   > @Yao-MR This is more flexible, and I also plan to do as you say.
   > I'll provide the user interface design later; please review it when you have time.
   > But I'm curious about your real usage scenarios for `user can control the location of data`
   
   First, the location of the rollup table's data does not affect data consistency, even in bitmap union scenarios.
   
   Second, the bucket key only controls where the data is placed. If a user's subquery hits the rollup table, we cannot know how the rollup's result will be used afterwards, so we can simply treat the rollup table as a special 'base table'.
   Just as the base table can have a user-defined bucket key, which we do not refuse, why not allow a user-defined key as the bucket key of the rollup table?
   
   One usage scenario I know of: two rollup tables have the same user-defined bucket key. When they are joined on that key, join performance improves, in the same way as a colocation join that uses DISTRIBUTED BY HASH(`join key`).
   
   
   




[GitHub] [incubator-doris] wangbo commented on issue #4788: [Proposal] Use Rollup's Aggregation Key as Rollup's Bucket Key

Posted by GitBox <gi...@apache.org>.
wangbo commented on issue #4788:
URL: https://github.com/apache/incubator-doris/issues/4788#issuecomment-716391152


   >Why not leave the choice to the user and let them set the distribution key of the rollup table, instead of always using the aggregation key? That way
   the user can control where the data is placed. Using the aggregation key may improve performance in bitmap union scenarios, but may not improve performance in other scenarios.
   
   @Yao-MR  This is more flexible, and I also plan to do as you say.
   I'll provide the user interface design later; please review it when you have time.
   But I'm curious about your real usage scenarios for ```user can control the location of data```




[GitHub] [incubator-doris] Yao-MR commented on issue #4788: [Proposal] Use Rollup's Aggregation Key as Rollup's Bucket Key

Posted by GitBox <gi...@apache.org>.
Yao-MR commented on issue #4788:
URL: https://github.com/apache/incubator-doris/issues/4788#issuecomment-718335902


   > > By default the rollup table has the same replica number as the base table. Is this also the case in your design,
   > > or will you add a config for the rollup table's replica number?
   > 
   > @Yao-MR I don't understand what you mean.
   
   Can we configure the replica number of the rollup table, or is it the same as the base table's?




[GitHub] [incubator-doris] Yao-MR edited a comment on issue #4788: [Proposal] Use Rollup's Aggregation Key as Rollup's Bucket Key

Posted by GitBox <gi...@apache.org>.
Yao-MR edited a comment on issue #4788:
URL: https://github.com/apache/incubator-doris/issues/4788#issuecomment-716397086


   > > Why not leave the choice to the user and let them set the distribution key of the rollup table, instead of always using the aggregation key? That way
   > > the user can control where the data is placed. Using the aggregation key may improve performance in bitmap union scenarios, but may not improve performance in other scenarios.
   > 
   > @Yao-MR This is more flexible, and I also plan to do as you say.
   > I'll provide the user interface design later; please review it when you have time.
   > But I'm curious about your real usage scenarios for `user can control the location of data`
   
   First, the location of the rollup table's data does not affect data consistency, even in bitmap union scenarios.
   
   Second, the bucket key only controls where the data is placed. If a user's subquery hits the rollup table, we cannot know how the rollup's result will be used afterwards, so we can simply treat the rollup table as a special 'base table'.
   Just as the base table can have a user-defined bucket key, which we do not refuse, why not allow a user-defined key as the bucket key?
   
   One usage scenario I know of: two rollup tables have the same user-defined key. When they are joined on that key, join performance improves, in the same way as a colocation join that uses DISTRIBUTED BY HASH(`join key`).
   
   
   

