Posted to commits@doris.apache.org by GitBox <gi...@apache.org> on 2019/10/30 09:41:36 UTC
[GitHub] [incubator-doris] EmmyMiao87 opened a new issue #2101: Support pre-aggregation on detail tables

EmmyMiao87 opened a new issue #2101: Support pre-aggregation on detail tables
URL: https://github.com/apache/incubator-doris/issues/2101
 
 
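   #Background
     In real business scenarios, two kinds of analysis needs usually coexist: aggregate analysis over fixed dimensions, and analysis of the raw detail data over arbitrary dimensions.
     
     For example, in a sales scenario each order record carries the fields (item\_id, sold\_time, customer\_id, price). Two analysis needs coexist here:
     
     1. The business wants the sales revenue of a given item on a given day. This only requires aggregating price over the dimensions (item\_id, sold\_time).
     2. Analyze the detailed purchase records of a given customer for a given item on a given day.
   
   With the existing Doris data models, if only an aggregate-model table is created, for example (item\_id, sold\_time, customer\_id, sum(price)), aggregation discards part of the information in the data, so analysis of detail data cannot be served.
     
     If only a Duplicate-model table is created, analysis over arbitrary dimensions is possible, but because Rollup is not supported on this model, performance is poor and analysis cannot finish quickly.
     
     If both an aggregate-model table and a Duplicate-model table are created, both performance and arbitrary-dimension analysis are covered, but the two tables are unrelated, so the business has to choose which table to query. This is neither flexible nor easy to use.
     
   #Design goals
     Support creating Rollup tables on top of the Duplicate data model, so that users can analyze the detail table directly and also get efficient queries on specific dimensions.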
     
   
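   #Introduction to pre-aggregation
   ##Terminology
   1. Duplicate data model: the Doris data model for storing detail data. It can be specified when creating a table, and its data is never aggregated.
   2. Base table: a table created in Doris with the CREATE TABLE command.
   3. Rollup (pre-aggregated) table: usually the result of a roll-up operation on a Base table; its dimensions are a subset of the Base table's dimensions.
   
   ##Overview
     Queries that use aggregate functions (such as sum and count) run more efficiently when a suitable pre-aggregated table already exists. The gain is most noticeable for queries over large amounts of data.
     
     A Rollup table works much like a materialized view: its data is materialized on the storage nodes and stays consistent with the Base table across incremental updates.
   
     After a user creates pre-aggregated tables, the query optimizer picks the most efficient pre-aggregation mapping and rewrites the SQL so that it reads the pre-aggregated table directly instead of the Base table.
     
     Because a Rollup table is usually much smaller than its Base table, queries that hit a Rollup table are much faster. Depending on how much the Rollup table aggregates, queries can be roughly 5 to 100 times faster, or even more.
     
   ##Example
   For sales analysis, suppose the business has created a table named sales that stores order information.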
     
   ```
   CREATE TABLE sales (
     order_time datetime,
     user_id int,
     sex string,
     country string,
     quantity int,
     price bigint) ENGINE=OLAP
   DUPLICATE KEY(`order_time`, `user_id`, `sex`, `country`, `quantity`)
   DISTRIBUTED BY HASH(`order_time`) BUCKETS 100
   PROPERTIES (
     "storage_type" = "COLUMN"
   )
   ```
     
     To compute the total quantity purchased and the total price per country and per sex, the following Rollup table can be created on the Base table sales.
     
   ```
   alter table sales add rollup agg_sales as
     SELECT country, sex, sum(quantity), sum(price)
     FROM sales
     GROUP BY country, sex
    
   -- or
     
   create rollup agg_sales as
     SELECT country, sex, sum(quantity), sum(price)
     FROM sales
     GROUP BY country, sex
     
   ```
     
   A query such as the one below then hits the Rollup table. Users can run an EXPLAIN statement to confirm whether the Rollup table was hit.
     
   ```
   select country, sex, sum(quantity), sum(price) 
   from sales 
   group by country, sex;
   ```
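
To make the idea concrete, here is a minimal Python sketch of what the agg_sales Rollup holds. The sample rows and the helper name `build_agg_sales` are hypothetical and only illustrate the grouping; this is not how Doris materializes the table.

```python
from collections import defaultdict

# Hypothetical detail rows: (order_time, user_id, sex, country, quantity, price)
sales = [
    ("2019-10-01 10:00:00", 1, "F", "CN", 2, 200),
    ("2019-10-01 11:00:00", 2, "M", "CN", 1, 150),
    ("2019-10-02 09:30:00", 1, "F", "CN", 3, 300),
    ("2019-10-02 12:00:00", 3, "M", "US", 5, 500),
]


def build_agg_sales(rows):
    """Group by (country, sex) and keep sum(quantity) and sum(price)."""
    rollup = defaultdict(lambda: [0, 0])
    for _order_time, _user_id, sex, country, quantity, price in rows:
        acc = rollup[(country, sex)]
        acc[0] += quantity
        acc[1] += price
    return {key: tuple(acc) for key, acc in rollup.items()}


agg_sales = build_agg_sales(sales)
for (country, sex), (total_quantity, total_price) in sorted(agg_sales.items()):
    print(country, sex, total_quantity, total_price)
```

Four detail rows collapse into three Rollup rows here; with real data the reduction is usually far larger, which is where the speed-up comes from.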
   
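   ##Supported aggregate functions
   SUM, MAX, MIN, HLL\_UNION (supported in phase two), BITMAP\_UNION (supported in phase two)
   
   ##Querying data
     At query time, either the Base table or the best Rollup table is chosen according to the query. Users can also name a specific Rollup table explicitly.
     
     How Doris picks the most suitable table for a query:
     
   1. Collect the candidate Rollup tables according to specific algebraic rules (the query's key and value columns must be a subset of the Rollup table's columns).
   2. From the candidates, keep those that match the most columns of the prefix index. If none of them matches, the candidates are not filtered.
   3. From the candidates left after step 2, pick the table with the smallest row count.
   4. Rewrite the original query to use the best Rollup table chosen in step 3.
   
   For example, the following queries all match the agg_sales Rollup table created above: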
   
   ```
   SELECT country, sex, sum(quantity), sum(price) from sales GROUP BY country, sex
   
   SELECT sex, sum(quantity) from sales GROUP BY sex
   
   SELECT sum(price), country from sales GROUP BY country
   ```
   
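   The following queries, however, cannot match it (user_id is not a key column of agg_sales, and avg(quantity) and max(price) are not among its aggregate columns):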
   
   ```
   SELECT user_id, country, sex, sum(quantity), sum(price) from sales GROUP BY user_id, country, sex
   
   SELECT sex, avg(quantity) from sales GROUP BY sex
   
   SELECT country, max(price) from sales GROUP BY country
   ```
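
The selection steps and the matching examples above can be sketched in Python as follows. This is a simplified model under stated assumptions, not the Doris optimizer: the `Table` and `Query` classes, the row counts, and the way the prefix-index match is measured (the number of leading key columns that the query filters on) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Table:
    name: str
    keys: tuple            # key columns in index order
    values: frozenset      # (function, column) pairs stored by a Rollup
    row_count: int         # hypothetical statistics
    is_base: bool = False  # the Duplicate Base table keeps every detail row


@dataclass(frozen=True)
class Query:
    group_by: frozenset
    aggregates: frozenset              # (function, column) pairs
    filter_columns: frozenset = frozenset()


def can_answer(table, query):
    """Step 1: the query's keys and aggregates must be a subset of the table's."""
    if table.is_base:
        return True
    keys_needed = query.group_by | query.filter_columns
    return keys_needed <= set(table.keys) and query.aggregates <= table.values


def prefix_match_length(table, query):
    """Step 2 (assumed metric): leading key columns that the query filters on."""
    length = 0
    for column in table.keys:
        if column not in query.filter_columns:
            break
        length += 1
    return length


def select_table(tables, query):
    candidates = [t for t in tables if can_answer(t, query)]
    if not candidates:
        raise ValueError("no table can answer the query")
    best = max(prefix_match_length(t, query) for t in candidates)
    if best > 0:  # if nothing matches the prefix index, keep all candidates
        candidates = [t for t in candidates if prefix_match_length(t, query) == best]
    return min(candidates, key=lambda t: t.row_count)  # step 3


SALES = Table("sales", ("order_time", "user_id", "sex", "country", "quantity"),
              frozenset(), row_count=1_000_000, is_base=True)
AGG_SALES = Table("agg_sales", ("country", "sex"),
                  frozenset({("sum", "quantity"), ("sum", "price")}), row_count=400)


def chosen(group_by, aggregates):
    """Return the name of the table the sketch would rewrite the query to use."""
    query = Query(frozenset(group_by), frozenset(aggregates))
    return select_table([SALES, AGG_SALES], query).name


print(chosen({"country", "sex"}, {("sum", "quantity"), ("sum", "price")}))  # agg_sales
print(chosen({"sex"}, {("sum", "quantity")}))                                # agg_sales
print(chosen({"user_id", "country", "sex"}, {("sum", "quantity")}))          # sales
print(chosen({"sex"}, {("avg", "quantity")}))                                # sales
print(chosen({"country"}, {("max", "price")}))                               # sales
```

In this sketch a query that cannot match agg_sales falls back to the Base table, which can always answer because it keeps every detail row.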
   
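   ###Querying a user-specified Rollup table
     Sometimes the user already knows which Rollup table a query should use. In that case, append the Rollup name after the Base table name, as follows: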
     
   ```
   select country, sex, sum(quantity), sum(price) from sales [agg_sales] group by country, sex
   ```
   
     *Note: if the Rollup table chosen by the user cannot match the query, the query fails.*
   
   ###DISTINCT
     A query that contains the DISTINCT keyword can also match a Rollup table, as the following example shows:
     
   ```
   -- Query statement
   select country, count(distinct user_id) from sales group by country;
   ```
   
     This query can match the Rollup table below. The table carries a sum(price) column mainly because a Rollup table needs at least one aggregate column.
     
   ```
   create rollup country_user_sales as
       select country, user_id , sum(price) 
       from sales 
       group by country, user_id;
   ```
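
A small Python sketch of why this works: a Rollup keyed by (country, user_id) has exactly one row per distinct user in each country, so count(distinct user_id) becomes a plain row count over the Rollup. The sample rows are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical detail rows: (country, user_id, price)
sales = [
    ("CN", 1, 200),
    ("CN", 1, 300),
    ("CN", 2, 150),
    ("US", 3, 500),
]

# Rollup country_user_sales: one row per (country, user_id) with sum(price)
country_user_sales = defaultdict(int)
for country, user_id, price in sales:
    country_user_sales[(country, user_id)] += price

# count(distinct user_id) per country is the number of Rollup rows per country
distinct_from_rollup = Counter(country for country, _user_id in country_user_sales)

# Reference answer computed directly from the detail rows
users_by_country = defaultdict(set)
for country, user_id, _price in sales:
    users_by_country[country].add(user_id)
distinct_from_base = {c: len(users) for c, users in users_by_country.items()}

print(dict(distinct_from_rollup))
print(distinct_from_base)
```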
   
   ###HLL
     Aggregate the detail data with HLL and analyze it with HLL functions at query time. This is mainly suited to computing PV, UV and count(distinct) quickly.
   
   ```
   -- Create the Rollup table
   create rollup dt_uv as 
       select dt, page_id, hll_hash(user_id) 
       from user_view
       group by dt, page_id
   ```
   
   At query time an HLL analysis function must be specified. For example, the queries below can match the Rollup table.
   
   ```
   -- PV of each page per day
   select dt, page_id, HLL_CARDINALITY(HLL_HASH(user_id)) from user_view;
   -- UV of the site per day
   select dt, HLL_CARDINALITY(HLL_HASH(user_id)) from user_view;
   ```
   
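   *Note: HLL\_HASH can be specified as the aggregate function when creating a Rollup table, but at query time HLL\_HASH cannot be used on its own; it must be combined with another HLL analysis function.*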
   
   ###BITMAP
     Aggregate the detail data with BITMAP_UNION and analyze it with BITMAP functions at query time.
     
   ```
   -- Create the Rollup table
   create rollup dt_uv as
       select dt, page_id, to_bitmap(user_id)
       from user_view
       group by dt, page_id
   ```
   
   At query time a BITMAP analysis function must be specified. For example, the queries below can match the Rollup table.
   
   ```
   -- PV of each page per day
   select dt, page_id, bitmap_count(bitmap_union(to_bitmap(user_id))) from user_view group by dt, page_id;
   -- UV of the site per day
   select dt, bitmap_count(bitmap_union(to_bitmap(user_id))) from user_view group by dt
   
   ```
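
The reason the Rollup stores a bitmap rather than a finished count can be shown with a short Python sketch in which a `set` stands in for the bitmap. The sample rows and helper names are hypothetical.

```python
from collections import defaultdict

# Hypothetical page views: (dt, page_id, user_id)
user_view = [
    ("2019-10-01", "home", 1),
    ("2019-10-01", "home", 2),
    ("2019-10-01", "cart", 1),
    ("2019-10-01", "cart", 1),
    ("2019-10-02", "home", 3),
]

# Rollup keyed by (dt, page_id); a Python set stands in for the bitmap
dt_uv = defaultdict(set)
for dt, page_id, user_id in user_view:
    dt_uv[(dt, page_id)].add(user_id)  # to_bitmap followed by bitmap_union


def daily_uv(rollup):
    """bitmap_count(bitmap_union(...)) grouped by dt."""
    merged = defaultdict(set)
    for (dt, _page_id), users in rollup.items():
        merged[dt] |= users
    return {dt: len(users) for dt, users in merged.items()}


def daily_sum_of_page_counts(rollup):
    """What storing only per-page distinct counts would give (over-counts)."""
    totals = defaultdict(int)
    for (dt, _page_id), users in rollup.items():
        totals[dt] += len(users)
    return dict(totals)


print(daily_uv(dt_uv))
print(daily_sum_of_page_counts(dt_uv))
```

User 1 visits two pages on 2019-10-01, so adding per-page counts reports 3 users for that day while the union of the bitmaps reports the correct 2.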
   
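   ##Loading data
     Every incremental load into the Base table is applied to all of its associated Rollup tables. A load is complete, and its data becomes visible, only after the Base table and all Rollup tables have finished.
     
     Data in the Base table and in the Rollup tables is consistent. Querying the Base table and querying a Rollup table never give different results.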
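
The consistency guarantee can be pictured with a toy in-memory model. The class `SalesTable` and its methods are hypothetical and say nothing about how Doris implements loading; the sketch only shows a batch being staged for the Base table and the Rollup and then published in a single step.

```python
from collections import Counter


class SalesTable:
    """Toy model: a Base table plus one Rollup keyed by (country, sex)."""

    def __init__(self):
        self.base = []           # visible detail rows: (country, sex, price)
        self.rollup = Counter()  # visible sum(price) per (country, sex)

    def load(self, batch):
        batch = list(batch)
        # Stage the new state of the Base table and of the Rollup first ...
        new_base = self.base + batch
        new_rollup = self.rollup.copy()
        for country, sex, price in batch:
            new_rollup[(country, sex)] += price
        # ... then publish both together, so readers never see only one of them.
        self.base, self.rollup = new_base, new_rollup

    def sum_price_from_base(self):
        totals = Counter()
        for country, sex, price in self.base:
            totals[(country, sex)] += price
        return totals


table = SalesTable()
table.load([("CN", "F", 200), ("CN", "M", 150)])
table.load([("CN", "F", 300), ("US", "M", 500)])
print(table.rollup == table.sum_price_from_base())
```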
     
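   ##Recommended usage
   + Aggregate analysis queries over large volumes of data
   + The aggregated Rollup table is much smaller than the Base table: 1~10% of the Base table, or less.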
   
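   ##Limitations
   + schema change: 
   	+ Adding a column: the column is added to the Base table only
   	+ Dropping a column: if the column exists in both the Base table and a Rollup table, the drop takes effect on both. Dropping a key column of a Rollup causes the Rollup to be re-aggregated on the new key. If dropping the column would leave the Rollup without any value column, the column cannot be dropped.
   	+ Changing a column's type: same restrictions as for the current Rollup
   + delete:
   	+ delete from: not allowed
   	+ drop partition: supported; the partition can be dropped even if the Rollup table has no partition column.
   	+ truncate table: supported; the Rollup data is deleted as well
   	+ drop table: supported; the Rollup tables are dropped as well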
   
   ##Prerequisites
   Doris version 0.12.0 or later
   
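   #Open questions
   ##How to support HLL_HASH aggregation on detail tables
   Current state:
   1. HLL_HASH aggregation over raw data is not supported; the HLL_HASH function can only be specified at load time.
   
   ##How to support the AVG aggregate operator
   1. A pre-aggregated table currently cannot specify AVG as its aggregate type.
   2. At query time, the AVG aggregate can be specified.
   3. AVG could be rewritten as SUM/COUNT, but COUNT is currently not supported in pre-aggregation either.
   
   + Option 1: if a pre-aggregated table is to support AVG as an aggregate function, the aggregate column has to be recomputed on every incremental update.
   + Option 2: the pre-aggregated table specifies both SUM and COUNT; when matching a query, AVG is rewritten as SUM/COUNT for Rollup matching. 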
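
The difference between the two options comes down to which state can be merged incrementally. A minimal Python sketch, with hypothetical batches of quantity values arriving for one Rollup key:

```python
def merge(state, batch):
    """Fold one incremental batch into a (sum, count) pair."""
    total, count = state
    return total + sum(batch), count + len(batch)


# Hypothetical incremental loads for a single Rollup key
batches = [[10, 20], [30], [40, 50, 60]]

state = (0, 0)
for batch in batches:
    state = merge(state, batch)

# Option 2: AVG is derived from the stored SUM and COUNT at query time
avg_from_sum_count = state[0] / state[1]

# A stored AVG cannot simply be merged: averaging the per-batch averages is wrong
batch_averages = [sum(batch) / len(batch) for batch in batches]
avg_of_averages = sum(batch_averages) / len(batch_averages)

print(avg_from_sum_count)
print(avg_of_averages)
```

The (sum, count) pair yields 35.0, the true average of the six values, while the average of the per-batch averages gives a different number. This is why Option 1 needs the aggregate column recomputed on each update, whereas Option 2 only needs SUM and COUNT to be maintained.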
   
   
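   ##How to support the Replace aggregate operator
   Background: the business wants to keep the detail data but also needs a Replace-type pre-aggregated table.
   
   Current state:
   
   1. Specifying Replace aggregation in a query is currently not supported at all.
   2. Replace can be specified when creating a pre-aggregated table, but the order of the detail rows cannot be determined.
   
     Replace is a special case: for a match, the key columns of the query must be identical to the key columns of the Rollup table; otherwise the Rollup table cannot be hit.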
   
   ```
   create rollup replace_quantity as 
       select order_time, user_id, sex, country, replace(quantity) 
       from sales
       group by order_time, user_id, sex, country
   ```
   
     The following query cannot match the replace_quantity table above:
     
   ```
   select user_id, sex, country, replace(quantity) 
   from sales
   group by user_id, sex, country
   ```
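
The ordering problem noted under the current state can be seen in a small Python sketch. It assumes that Replace keeps the last value seen for each key; the rows and the helper `replace_rollup` are hypothetical.

```python
def replace_rollup(rows):
    """Keep the last quantity seen for each key (assumed Replace semantics)."""
    latest = {}
    for key, quantity in rows:
        latest[key] = quantity
    return latest


# Two hypothetical detail rows with identical key columns
# (order_time, user_id, sex, country)
rows = [
    (("2019-10-01 10:00:00", 1, "F", "CN"), 2),
    (("2019-10-01 10:00:00", 1, "F", "CN"), 7),
]

in_arrival_order = replace_rollup(rows)
in_reverse_order = replace_rollup(reversed(rows))
print(in_arrival_order != in_reverse_order)
```

The same two detail rows produce different Rollup contents depending on the order in which they are processed, so a well-defined ordering of detail rows is needed before Replace can be pre-aggregated.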
