You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@doris.apache.org by GitBox <gi...@apache.org> on 2020/08/13 12:09:15 UTC
[GitHub] [incubator-doris] EmmyMiao87 commented on a change in pull request #4198: Add bitmap longitudinal cutting udaf

EmmyMiao87 commented on a change in pull request #4198:
URL: https://github.com/apache/incubator-doris/pull/4198#discussion_r469900668



##########
File path: docs/zh-CN/extending-doris/udf/contrib/udaf-orthogonal-bitmap-manual.md
##########
@@ -0,0 +1,228 @@
+---
+{
+    "title": "正交的BITMAP计算UDAF",
+    "language": "zh-CN"
+}
+---
+
+<!-- 
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# 正交的BITMAP计算UDAF
+
+## 背景
+
+Doris原有的Bitmap聚合函数设计比较通用，但对亿级别以上bitmap大基数的交并集计算性能较差。排查后端be的bitmap聚合函数逻辑，发现主要有两个原因。一是当bitmap基数较大时，如bitmap大小超过1g，网络/磁盘IO处理时间比较长；二是后端be实例在scan数据后全部传输到顶层节点进行求交和并运算，给顶层单节点带来压力，成为处理瓶颈。
+
+解决思路是将bitmap列的值按照range划分，不同range的值存储在不同的分桶中，保证了不同分桶的bitmap值是正交的。这样，数据分布更均匀，一个查询会扫描所有分桶，在每个分桶中将正交的BITMAP进行聚合计算，然后把计算结果传输至顶层节点汇总，如此会大大提高计算效率，解决了顶层单节点计算瓶颈问题。

Review comment:
       这一句有点歧义 `在每个分桶中将正交的BITMAP进行聚合计算` 是否改为 `先分别对不同分桶中的正交bitmap进行聚合计算， 然后顶层节点直接将聚合计算后的值合并汇总，并输出`。

##########
File path: docs/zh-CN/extending-doris/udf/contrib/udaf-orthogonal-bitmap-manual.md
##########
@@ -0,0 +1,228 @@
+---
+{
+    "title": "正交的BITMAP计算UDAF",
+    "language": "zh-CN"
+}
+---
+
+<!-- 
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# 正交的BITMAP计算UDAF
+
+## 背景
+
+Doris原有的Bitmap聚合函数设计比较通用，但对亿级别以上bitmap大基数的交并集计算性能较差。排查后端be的bitmap聚合函数逻辑，发现主要有两个原因。一是当bitmap基数较大时，如bitmap大小超过1g，网络/磁盘IO处理时间比较长；二是后端be实例在scan数据后全部传输到顶层节点进行求交和并运算，给顶层单节点带来压力，成为处理瓶颈。
+
+解决思路是将bitmap列的值按照range划分，不同range的值存储在不同的分桶中，保证了不同分桶的bitmap值是正交的。这样，数据分布更均匀，一个查询会扫描所有分桶，在每个分桶中将正交的BITMAP进行聚合计算，然后把计算结果传输至顶层节点汇总，如此会大大提高计算效率，解决了顶层单节点计算瓶颈问题。
+

Review comment:
       这里可以加一个总说，
   
   第一步：建表。这一步主要是为了xxx
   第二步：编译 udaf，也就是编译 xxx， 是为了xxx
   第三步：将udaf 注册到doris中
   第四部：如何使用
   
   然后再针对每项分别说。

##########
File path: docs/zh-CN/extending-doris/udf/contrib/udaf-orthogonal-bitmap-manual.md
##########
@@ -0,0 +1,209 @@
+---
+{
+    "title": "BITMAP正交计算UDAF",
+    "language": "zh-CN"
+}
+---
+
+<!-- 
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# BITMAP正交计算UDAF
+
+## 背景
+
+Doris原有的Bitmap聚合函数设计比较通用，但对亿级别以上bitmap大基数的交集和并集计算性能较差。排查后端be的bitmap聚合函数逻辑，发现主要有两个原因。一是当bitmap基数较大时，如数据大小超过1g，网络/磁盘IO处理时间比较长；二是后端be实例在scan数据后全部传输到顶层节点进行求交和并运算，给顶层单节点带来压力，成为处理瓶颈。
+
+解决方案是建表时增加hid列，罐库时hid列按照bitmap列的range划分，并且按hid均匀分桶。这样按range划分的聚合bitmap数据会均匀地分布在所有后端be实例上。在schema表的基础上，优化udaf聚合函数，使其在所有扫描节点参与分布式正交并算，然后在顶层节点进行汇总，如此会大大提高计算效率。
+
+## Create table
+
+建表时需要使用聚合模型，数据类型是 bitmap , 聚合函数是 bitmap_union
+
+```
+CREATE TABLE `user_tag_bitmap` (
+  `tag` bigint(20) NULL COMMENT "用户标签",
+  `hid` smallint(6) NULL COMMENT "分桶id",
+  `user_id` bitmap BITMAP_UNION NULL COMMENT ""
+) ENGINE=OLAP
+AGGREGATE KEY(`tag`, `hid`)
+COMMENT "OLAP"
+DISTRIBUTED BY HASH(`hid`) BUCKETS 3
+```
+表schema增加hid列，表示id范围, 作为hash分桶列。
+
+注：hid数和BUCKETS要设置合理，hid数设置至少是BUCKETS的5倍以上，以使数据hash分桶尽量均衡
+
+## Data Load
+
+``` 
+LOAD LABEL user_tag_bitmap_test
+(
+DATA INFILE('hdfs://abc')
+INTO TABLE user_tag_bitmap
+COLUMNS TERMINATED BY ','
+(tmp_tag, tmp_user_id)
+SET (
+tag = tmp_tag,
+hid = ceil(tmp_user_id/5000000),
+user_id = to_bitmap(tmp_user_id)
+)
+)
+...
+```
+数据格式：
+``` 
+11111111,1
+11111112,2
+11111113,3
+11111114,4
+...
+```
+注：第一列代表用户标签，如'男', '90后', '10-20万'等，已由中文转换成数字
+
+load数据时，对用户bitmap进行纵向切割，例如，用户id在1-5000000范围内的hid相同，hid相同的会被均匀的hash分配后端be实例进行union聚合。在bitmap的udaf实现上，可以利用tablet在be上平均分散的特性，在local节点scan数据后，直接进行交集、并集计算，在top节点merge阶段进行汇总计算结果，此设计能充分发挥所有be并发计算的特性。
+
+## 自定义UDAF
+Doris查询前设置参数
+```
+set parallel_fragment_exec_instance_num=5
+```
+注：根据集群情况设置并发参数，提高并发计算性能
+
+新udaf需要在doris定义聚合函数时注册函数符号，函数符号通过动态库.so的方式被加载。
+
+### bitmap_orthogonal_intersect 
+
+求交集函数
+  bitmap_orthogonal_intersect(bitmap_column, column_to_filter, filter_values)
+  
+参数：
+   第一个参数是Bitmap列，第二个参数是用来过滤的维度列，第三个参数开始是变长参数，含义是过滤维度列的不同取值
+   
+说明：
+  此udaf，在此表schema的基础上，查询规划上聚合分2层，在第一层be节点（update、serialize）先按filter_values为key进行hash聚合，然后对所有key的bitmap求交集，结果序列化后发送至第二层be节点(merge、finalize)，在第二层be节点对所有来源于第一层节点的bitmap值循环求并集
+   
+
+定义：
+```
+drop FUNCTION bitmap_orthogonal_intersect(BITMAP,BIGINT,BIGINT, ...);
+CREATE AGGREGATE FUNCTION bitmap_orthogonal_intersect(BITMAP,BIGINT,BIGINT, ...) RETURNS BITMAP INTERMEDIATE varchar(1)

Review comment:
       比如前面都是 orthogonal_bitmap 那么这里也最好是 orthogonal_bitmap




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org