Posted to commits@doris.apache.org by ji...@apache.org on 2022/07/29 05:41:21 UTC

[doris] branch master updated: [Doc]Add Introduction to Apache Doris (#11114)

This is an automated email from the ASF dual-hosted git repository.

jiafengzheng pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris.git


The following commit(s) were added to refs/heads/master by this push:
     new 8eac06134f [Doc]Add Introduction to Apache Doris (#11114)
8eac06134f is described below

commit 8eac06134fb5bf9f2e4390878ce2f31f2c899c2f
Author: jiafeng.zhang <zh...@gmail.com>
AuthorDate: Fri Jul 29 13:41:16 2022 +0800

    [Doc]Add Introduction to Apache Doris (#11114)
    
    Add Introduction to Apache Doris
---
 docs/en/docs/summary/basic-summary.md    | 84 ++++++++++++++++++++++++++------
 docs/zh-CN/docs/summary/basic-summary.md | 82 ++++++++++++++++++++++++++++---
 2 files changed, 146 insertions(+), 20 deletions(-)

diff --git a/docs/en/docs/summary/basic-summary.md b/docs/en/docs/summary/basic-summary.md
index c67af1a334..245468185c 100644
--- a/docs/en/docs/summary/basic-summary.md
+++ b/docs/en/docs/summary/basic-summary.md
@@ -1,11 +1,8 @@
 ---
-{
-    "title": "Doris base concept",
-    "language": "en"
-}
+{ 'title': 'Introduction to Apache Doris', 'language': 'en' }
 ---
 
-<!-- 
+<!--
 Licensed to the Apache Software Foundation (ASF) under one
 or more contributor license agreements.  See the NOTICE file
 distributed with this work for additional information
@@ -24,13 +21,72 @@ specific language governing permissions and limitations
 under the License.
 -->
 
-# Doris base concept
+# Introduction to Apache Doris
+
+Apache Doris is a high-performance, real-time analytical database based on MPP architecture, known for its extreme speed and ease of use. It returns query results over massive data sets with sub-second response times and supports both high-concurrency point queries and high-throughput complex analysis. As a result, Apache Doris is well suited to report analysis, ad-hoc queries, unified data warehouses, data lake query acceleration, etc.  [...]
+
+Apache Doris originated as the Palo project for Baidu's ad reporting business. It was officially open-sourced in 2017 and donated by Baidu to the Apache Software Foundation for incubation in July 2018, where it was incubated and operated by members of the incubator project management committee under the guidance of Apache mentors. Currently, the Apache Doris community has gathered more than 300 contributors from nearly 100 companies in different industries, and the number of active contributors is close to 100 [...]
+
+Apache Doris now has a wide user base in China and around the world; as of this writing, it is used in production at more than 500 companies worldwide. More than 80% of the top 50 Internet companies in China by market capitalization or valuation have been using Apache Doris for a long time, including Baidu, Meituan, Xiaomi, Jingdong, Bytedance, Tencent, NetEase, Kwai, Weibo, and Ke Holdings. It is also widely used in some traditional industries such as finance, en [...]
+
+# Usage Scenarios
+
+As shown in the figure below, after various data integration and processing steps, data from the sources is usually stored in the real-time data warehouse Doris and in an offline data lake or data warehouse (Apache Hive, Apache Iceberg, or Apache Hudi).
+![origin_img_v2_33e733e0-df43-4d69-8047-b8bd90cbbd7g](/images/origin_img_v2_33e733e0-df43-4d69-8047-b8bd90cbbd7g.png)
+
+Apache Doris is widely used in the following scenarios:
+
+-   Reporting Analysis
+
+    -   Real-time Dashboards
+    -   Reports for in-house analysts and managers
+    -   Highly concurrent user-oriented or customer-oriented report analysis: For example, site analysis for website owners and advertising reports for advertisers usually require thousands of QPS and sub-second query latency. The e-commerce company JD.com uses Doris for advertising reports, writing 10 billion rows of data per day, with tens of thousands of query QPS and a 99th-percentile query latency of 150 ms.
+
+-   Ad-Hoc Query: Analyst-oriented self-service analytics with irregular query patterns and high throughput requirements. Xiaomi has built a growth analytics platform (Growth Analytics, GA) based on Doris that uses user behavior data for business growth analysis, with an average query latency of 10 seconds, a 95th-percentile query latency of 30 seconds or less, and tens of thousands of SQL queries per day.
+
+-   Unified data warehouse construction: A single platform that meets the needs of unified data warehouse construction and simplifies the complicated data software stack. HaiDiLao's Doris-based unified data warehouse replaces the old architecture consisting of Apache Spark, Apache Hive, Apache Kudu, Apache HBase, and Apache Phoenix, greatly simplifying the architecture.
+
+-   Data Lake Query: Doris federates data located in Apache Hive, Apache Iceberg, and Apache Hudi through external tables, greatly improving query performance while avoiding data copying.
+
+# Technical Overview
+
+The overall architecture of Apache Doris is shown in the following figure. The Doris architecture is very simple, with only two types of processes:
+
+-   Frontend (FE): Mainly responsible for user request access, query parsing and planning, metadata management, and node management.
+-   Backend (BE): Mainly responsible for data storage and query plan execution.
+
+Both types of processes are horizontally scalable, and a single cluster can support up to hundreds of machines and tens of petabytes of storage capacity. Both also rely on consistency protocols to guarantee high availability of services and high reliability of data. This highly integrated architecture greatly reduces the operation and maintenance cost of a distributed system.
+
+![origin_img_v2_28d005e1-21d6-4801-956f-0c06373a7a9g](/images/origin_img_v2_28d005e1-21d6-4801-956f-0c06373a7a9g.png)
+
+Apache Doris adopts the MySQL protocol, is highly compatible with MySQL syntax, and supports standard SQL. Users can access Doris through various client tools, and Doris connects seamlessly with BI tools.
+
+In terms of the storage engine, Doris uses columnar storage, encoding, compressing, and reading data by column. This enables a very high compression ratio and avoids scanning large amounts of irrelevant data, making more efficient use of IO and CPU resources.
+Doris also supports a rich set of index structures to reduce data scans:
+
+-   Sorted compound key index: Up to three columns can be specified to form a compound sort key. With this index, data can be pruned effectively, which better supports high-concurrency reporting scenarios.
+-   Z-order index: With Z-order indexing, range queries on any combination of fields in the schema run efficiently.
+-   Min/Max index: Effectively filters equality and range queries on numeric types.
+-   Bloom Filter: Very effective for equality filtering and pruning on high-cardinality columns.
+-   Inverted index: Enables fast search on any field.
+
+Doris supports a variety of storage models, with specific optimizations for different scenarios:
+
+-   Aggregate Key model: Value columns of rows with the same key are merged; this pre-aggregation significantly improves performance.
+-   Unique Key model: Keys are unique. Data with the same key is overwritten, which achieves row-level data updates.
+-   Duplicate Key model: A detail data model that stores fact tables at full detail.
+
+Doris also supports strongly consistent materialized views. Materialized views are updated and selected automatically within the system, with no manual selection by the user, which significantly reduces the cost of maintaining them.
+
+In terms of the query engine, Doris adopts the MPP model, with parallel execution both between and within nodes. It also supports distributed shuffle joins across multiple large tables, which lets it better cope with complex queries.
+
+![origin_img_v2_cee507bd-d6ed-4359-9e52-51e9b8458f8g](/images/origin_img_v2_cee507bd-d6ed-4359-9e52-51e9b8458f8g.png)
+
+The Doris query engine is vectorized: all in-memory structures can be laid out in a columnar format, which significantly reduces virtual function calls, improves cache hit rates, and makes efficient use of SIMD instructions. Performance in wide table aggregation scenarios is 5–10 times higher than in non-vectorized engines.
+
+![origin_img_v2_ad65aae9-9ed0-463e-a34c-94e32b092a4g](/images/origin_img_v2_ad65aae9-9ed0-463e-a34c-94e32b092a4g.png)
+
+Apache Doris uses Adaptive Query Execution technology, which dynamically adjusts the execution plan based on runtime statistics. For example, runtime filter technology generates filters at runtime and pushes them to the probe side, automatically passing them down to the lowest-level scan node on the probe side. This drastically reduces the amount of data in the probe and speeds up join performance. Doris' runtime filter supports In/Min/Max/Bloom filter.
+
+In terms of the optimizer, Doris combines CBO (cost-based optimization) and RBO (rule-based optimization), with RBO supporting constant folding, subquery rewriting, predicate pushdown, etc., and CBO supporting Join Reorder. The CBO is still being improved, mainly through more accurate collection and derivation of statistics and more accurate cost model prediction.
 
-- FE: Frontend, the front-end node of Doris. It is mainly responsible for receiving and returning client requests, metadata, cluster management, and query plan generation.
-- BE: Backend, the backend node of Doris. Mainly responsible for data storage and management, query plan execution and other work.
-- Broker: Broker is a stateless process. It is mainly to help Doris access external data sources such as data on HDFS in a Unix-like file system interface. For example, it is used in data import or data export operations.
-- Tablet: Tablet is the actual physical storage unit of a table. A table is stored in units of Tablet in the distributed storage layer composed of BE after partitioning and bucketing. Each Tablet includes meta information and several consecutive RowSets. .
-- Rowset: Rowset is a data collection of a data change in the tablet, and the data change includes data import, deletion, and update. Rowset records by version information. A version is generated for each change.
-- Version: It consists of two attributes, Start and End, and maintains the record information of data changes. Usually used to indicate the version range of Rowset, after a new import, a Rowset with equal Start and End is generated, and a Rowset version with a range is generated after Compaction.
-- Segment: Indicates the data segment in the Rowset. Multiple Segments form a Rowset.
-- Compaction: The process of merging consecutive versions of Rowset is called Compaction, and the data will be compressed during the merging process.
\ No newline at end of file
diff --git a/docs/zh-CN/docs/summary/basic-summary.md b/docs/zh-CN/docs/summary/basic-summary.md
index 1c0c152d15..b511306fa0 100644
--- a/docs/zh-CN/docs/summary/basic-summary.md
+++ b/docs/zh-CN/docs/summary/basic-summary.md
@@ -1,11 +1,11 @@
 ---
-{
-    "title": "Doris basic concepts",
-    "language": "zh-CN"
+{ 
+'title': 'Introduction to Doris', 
+'language': 'zh-CN' 
 }
 ---
 
-<!-- 
+<!--
 Licensed to the Apache Software Foundation (ASF) under one
 or more contributor license agreements.  See the NOTICE file
 distributed with this work for additional information
@@ -24,6 +24,76 @@ specific language governing permissions and limitations
 under the License.
 -->
 
-# Doris basic concepts
+# Introduction to Doris
 
-(TODO)
\ No newline at end of file
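+Apache Doris is a high-performance, real-time analytical database based on MPP architecture, known for its extreme speed and ease of use. It returns query results over massive data sets with sub-second response times and supports both high-concurrency point queries and high-throughput complex analysis. As a result, Apache Doris is well suited to report analysis, ad-hoc queries, unified data warehouse construction, and federated data lake query acceleration, and users can build applications on top of it such as user behavior analysis, A/B testing platforms, log search and analysis, user profile analysis, and order analysis.
+
+Apache Doris originated as the Palo project for Baidu's ad reporting business. It was officially open-sourced in 2017 and donated by Baidu to the Apache Software Foundation for incubation in July 2018, where it was incubated and operated by members of the incubator project management committee under the guidance of Apache mentors. The Apache Doris community has now gathered more than 300 contributors from nearly 100 companies in different industries, with close to 100 active contributors each month. In June 2022, Apache Doris graduated from the Apache Incubator and officially became an Apache Top-Level Project (TLP).
+
+Apache Doris now has a wide user base in China and around the world. To date, it is used in production at more than 500 companies worldwide. More than 80% of the top 50 Internet companies in China by market capitalization or valuation have been using Apache Doris for a long time, including Baidu, Meituan, Xiaomi, JD.com, ByteDance, Tencent, NetEase, Kuaishou, Weibo, and Beike. It is also widely used in traditional industries such as finance, energy, manufacturing, and telecommunications.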
+
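+# Usage Scenarios
+
+As shown in the figure below, after various data integration and processing steps, data from the sources is usually stored in the real-time data warehouse Doris and in offline data lakes and warehouses (Hive, Iceberg, Hudi). Apache Doris is widely used in the following scenarios.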
+![origin_img_v2_33e733e0-df43-4d69-8047-b8bd90cbbd7g](/images/origin_img_v2_33e733e0-df43-4d69-8047-b8bd90cbbd7g.png)
+
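+-   Reporting Analysis
+
+    -   Real-time dashboards
+    -   Reports for in-house analysts and managers
+    -   Highly concurrent user-oriented or customer-oriented report analysis (Customer Facing Analytics). For example, site analysis for website owners and advertising reports for advertisers usually require thousands of QPS and millisecond-level query latency. The e-commerce company JD.com uses Apache Doris for advertising reports, writing 10 billion rows of data per day, with tens of thousands of query QPS and a 99th-percentile query latency of 150 ms.
+
+-   Ad-hoc Query: Analyst-oriented self-service analytics with irregular query patterns and high throughput requirements. Xiaomi has built a growth analytics platform (Growing Analytics, GA) based on Doris that uses user behavior data for business growth analysis, with an average query latency of 10 s, a 95th-percentile query latency within 30 s, and tens of thousands of SQL queries per day.
+
+-   Unified data warehouse construction: A single platform that meets the needs of unified data warehouse construction and simplifies the complicated big data software stack. HaiDiLao's Doris-based unified data warehouse replaces the old architecture consisting of Spark, Hive, Kudu, HBase, and Phoenix, greatly simplifying the architecture.
+
+-   Federated data lake query: Doris performs federated analysis of data located in Hive, Iceberg, and Hudi through external tables, greatly improving query performance while avoiding data copying.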
+
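+# Technical Overview
+
+The **overall architecture** of Doris is shown in the figure below. The Doris architecture is very simple, with only two types of processes:
+
+-   **Frontend (FE)**: Mainly responsible for user request access, query parsing and planning, metadata management, and node management.
+
+-   **Backend (BE)**: Mainly responsible for data storage and query plan execution.
+
+Both types of processes are horizontally scalable, and a single cluster can support up to hundreds of machines and tens of petabytes of storage capacity. Both also rely on consistency protocols to guarantee high availability of services and high reliability of data. This highly integrated architecture greatly reduces the operation and maintenance cost of a distributed system.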
+
+![origin_img_v2_28d005e1-21d6-4801-956f-0c06373a7a9g](/images/origin_img_v2_28d005e1-21d6-4801-956f-0c06373a7a9g.png)
+
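+In terms of the **user interface**, Doris adopts the MySQL protocol, is highly compatible with MySQL syntax, and supports standard SQL. Users can access Doris through various client tools, and Doris connects seamlessly with BI tools.
+
+In terms of the **storage engine**, Doris uses columnar storage, encoding, compressing, and reading data by column. This enables a very high compression ratio and avoids scanning large amounts of irrelevant data, making more efficient use of IO and CPU resources.
+
+Doris also supports a rich set of index structures to reduce data scans:
+
+-   Sorted Compound Key Index: Up to three columns can be specified to form a compound sort key. With this index, data can be pruned effectively, which better supports high-concurrency reporting scenarios.
+
+-   Z-order Index: With Z-order indexing, range queries on any combination of fields in the data model run efficiently.
+
+-   Min/Max: Effectively filters equality and range queries on numeric types.
+
+-   Bloom Filter: Very effective for equality filtering and pruning on high-cardinality columns.
+
+-   Invert Index: Enables fast search on any field.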
+
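+Doris supports a variety of storage models, with specific optimizations for different scenarios:
+
+-   Aggregate Key model: Value columns of rows with the same key are merged; this pre-aggregation significantly improves performance.
+
+-   Unique Key model: Keys are unique. Data with the same key is overwritten, which achieves row-level data updates.
+
+-   Duplicate Key model: A detail data model that stores fact tables at full detail.
+
+Doris also supports strongly consistent materialized views. Materialized views are updated and selected automatically within the system, with no manual selection by the user, which significantly reduces the cost of maintaining them.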
+
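+In terms of the **query engine**, Doris adopts the MPP model, with parallel execution both between and within nodes. It also supports distributed shuffle joins across multiple large tables, which lets it better cope with complex queries.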
+![origin_img_v2_cee507bd-d6ed-4359-9e52-51e9b8458f8g](/images/origin_img_v2_cee507bd-d6ed-4359-9e52-51e9b8458f8g.png)
+
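+**The Doris query engine is vectorized**: all in-memory structures can be laid out in a columnar format, which significantly reduces virtual function calls, improves cache hit rates, and makes efficient use of SIMD instructions. Performance in wide table aggregation scenarios is 5-10 times that of non-vectorized engines.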
+
+![origin_img_v2_ad65aae9-9ed0-463e-a34c-94e32b092a4g](/images/origin_img_v2_ad65aae9-9ed0-463e-a34c-94e32b092a4g.png)
+
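+**Doris uses Adaptive Query Execution technology**, which dynamically adjusts the execution plan based on runtime statistics. For example, Runtime Filter technology generates filters at runtime and pushes them to the probe side, automatically passing them down to the lowest-level Scan node on the probe side. This drastically reduces the amount of probe data and speeds up join performance. Doris' Runtime Filter supports In/Min/Max/Bloom Filter.
+
+In terms of the **optimizer**, Doris uses a combined CBO and RBO strategy, with RBO supporting constant folding, subquery rewriting, predicate pushdown, etc., and CBO supporting Join Reorder. The CBO is still being improved, mainly through more accurate collection and derivation of statistics and more accurate cost model estimation.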
\ No newline at end of file
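
As a reading aid for the Min/Max index described in the added documentation, here is a minimal Python sketch of how per-block minimum and maximum values let a range query skip blocks entirely. It is not Doris code: the block layout, the function names, and the sample values are assumptions made for illustration, and every block is assumed to be non-empty.

```python
def build_min_max_index(blocks):
    """Record the (min, max) value of each non-empty block of a numeric column."""
    return [(min(block), max(block)) for block in blocks]


def range_scan(blocks, index, low, high):
    """Return the values in [low, high] and the number of blocks actually read.

    A block is skipped without being read when its (min, max) range
    cannot overlap the query range.
    """
    hits = []
    blocks_read = 0
    for block, (block_min, block_max) in zip(blocks, index):
        if block_max < low or block_min > high:
            continue  # pruned by the min/max index
        blocks_read += 1
        hits.extend(value for value in block if low <= value <= high)
    return hits, blocks_read


# Hypothetical column data, already split into three blocks.
blocks = [[1, 4, 7], [10, 12, 15], [20, 25, 30]]
index = build_min_max_index(blocks)
hits, blocks_read = range_scan(blocks, index, 11, 21)
```

In this sketch the first block is never read for the query range 11..21, which is the kind of scan reduction the documentation attributes to the index.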

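The three storage models in the added documentation differ in how they treat rows that share a key. The Python sketch below illustrates that difference only; it is not Doris code. The function names, the choice of SUM as the aggregation, and the sample rows are assumptions made for illustration.

```python
from collections import defaultdict


def aggregate_key(rows):
    """Aggregate Key model (sketch): merge value columns of rows with the same key.

    SUM is assumed as the aggregation function for this example.
    """
    merged = defaultdict(int)
    for key, value in rows:
        merged[key] += value
    return dict(merged)


def unique_key(rows):
    """Unique Key model (sketch): a later row overwrites an earlier row with the same key."""
    latest = {}
    for key, value in rows:
        latest[key] = value
    return latest


def duplicate_key(rows):
    """Duplicate Key model (sketch): every row is kept as detail data."""
    return list(rows)


# Hypothetical (key, value) rows, in load order.
rows = [("2022-07-29", 10), ("2022-07-29", 5), ("2022-07-30", 7)]
aggregated = aggregate_key(rows)
deduplicated = unique_key(rows)
detail = duplicate_key(rows)
```

With these sample rows, the aggregate sketch keeps one row per key with summed values, the unique sketch keeps the last value per key, and the duplicate sketch keeps all three rows.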

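The runtime filter idea in the added documentation (build filters from the build side of a join at run time, then apply them in the probe-side scan) can be sketched in Python as follows. This is an illustration under stated assumptions, not the Doris implementation: the function names and the in-memory row format are hypothetical, and only the In and Min/Max filter kinds are modeled.

```python
def build_runtime_filters(build_keys):
    """Derive an IN filter and a min/max filter from the build-side join keys."""
    keys = set(build_keys)
    return {"in": keys, "min": min(keys), "max": max(keys)}


def scan_probe(probe_rows, filters):
    """Probe-side scan that drops rows which cannot match any build-side key."""
    for key, payload in probe_rows:
        if filters["min"] <= key <= filters["max"] and key in filters["in"]:
            yield key, payload


def hash_join(build_rows, probe_rows):
    """Hash join that pushes runtime filters down into the probe-side scan."""
    table = {}
    for key, payload in build_rows:
        table.setdefault(key, []).append(payload)
    if not table:
        return []  # empty build side: nothing can match
    filters = build_runtime_filters(table.keys())
    return [
        (key, build_payload, probe_payload)
        for key, probe_payload in scan_probe(probe_rows, filters)
        for build_payload in table[key]
    ]


# Hypothetical (key, payload) rows for the two join inputs.
build_rows = [(1, "a"), (3, "c")]
probe_rows = [(1, "x"), (2, "y"), (3, "z"), (9, "w")]
joined = hash_join(build_rows, probe_rows)
```

In this sketch the probe rows with keys 2 and 9 are discarded inside the scan, before they reach the join, which is where the reduction in probe data comes from.
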