You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@doris.apache.org by ji...@apache.org on 2022/05/20 05:01:48 UTC

[incubator-doris-website] branch master updated: Analysis of storage structure design of Apache Doris storage layer design

This is an automated email from the ASF dual-hosted git repository.

jiafengzheng pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-doris-website.git


The following commit(s) were added to refs/heads/master by this push:
     new 8e02b0806e Analysis of storage structure design of Apache Doris storage layer design
8e02b0806e is described below

commit 8e02b0806efa2f66ab06e73a080ae8c55902ee8a
Author: jiafeng.zhang <zh...@gmail.com>
AuthorDate: Fri May 20 13:01:13 2022 +0800

    Analysis of storage structure design of Apache Doris storage layer design
    
    Analysis of storage structure design of Apache Doris storage layer design
---
 blogs/en/doris-storage-struct-design.md            | 322 ++++++++++++++++++++
 .../storage/044434894abc13376ee9d14d78c5eff1.png   | Bin 0 -> 255536 bytes
 .../storage/2D89E0227253499AAFB77477B64DC2E5.png   | Bin 0 -> 47627 bytes
 .../storage/2a47fa7348f47e00e01bc93e38a1a547.png   | Bin 0 -> 151482 bytes
 .../storage/3001a1785a41628cd88c6e2928290d2f.png   | Bin 0 -> 106661 bytes
 .../storage/60C96B2D06D64E58A0B33384A59A0936.png   | Bin 0 -> 47623 bytes
 .../storage/694799b9202d288a80868175bc91c33f.png   | Bin 0 -> 139640 bytes
 .../storage/6abc0dd9922ec1768e127d4e94030731.png   | Bin 0 -> 71750 bytes
 .../storage/71b27dcd0a14ebe82562e2b5979d8c19.png   | Bin 0 -> 20523 bytes
 .../storage/89DBFA60C385454DBE666C574DCDE408.png   | Bin 0 -> 50194 bytes
 .../storage/C7EC885556D24E8587BC37E6EC70930B.png   | Bin 0 -> 47387 bytes
 .../storage/EAD7EEF330B048BC8C1EBD8EF4842772.png   | Bin 0 -> 46158 bytes
 .../storage/b9a87a028af1fc40babe2bf136334ec9.png   | Bin 0 -> 52366 bytes
 .../storage/dc49cfbc6dc5ac90fcc45c2b2bce54d4.png   | Bin 0 -> 60339 bytes
 .../storage/e2c62616c1c12fa05457eb6c443ebc48.png   | Bin 0 -> 22317 bytes
 .../storage/e9a2a4defc1204c507c0b9359225650f.png   | Bin 0 -> 14963 bytes
 .../storage/f74e7c5fc5358ce8faa3e79ad7e625d3.png   | Bin 0 -> 169036 bytes
 blogs/zh-CN/doris-storage-struct-design.md         | 326 +++++++++++++++++++++
 18 files changed, 648 insertions(+)

diff --git a/blogs/en/doris-storage-struct-design.md b/blogs/en/doris-storage-struct-design.md
new file mode 100644
index 0000000000..806d6f31de
--- /dev/null
+++ b/blogs/en/doris-storage-struct-design.md
@@ -0,0 +1,322 @@
+---
+{
+    "title": "Analysis of storage structure design of Apache Doris storage layer design",
+    "description": "This article mainly analyzes the implementation principle of the storage layer of the Doris BE module by reading the code of the Doris BE module, and expounds and decrypts the core technology behind the efficient writing and query capabilities of Doris. It includes Doris column storage design, index design, data read and write process, Compaction process, version management of Tablet and Rowset, data backup and other functions.",
+    "date": "2022-05-20",
+    "metaTitle": "Analysis of storage structure design of Apache Doris storage layer design",
+    "isArticle": true,
+    "language": "en",
+    "author": "ApacheDoris",
+    "layout": "Article",
+    "sidebar": false
+}
+---
+
+<!-- 
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# Analysis of storage structure design of Apache Doris storage layer design
+
+## 1. Overall introduction
+
+Doris is an interactive SQL data warehouse based on MPP architecture, mainly used to solve near real-time reporting and multidimensional analysis. Doris's efficient import and query are inseparable from the sophisticated design of its storage structure.
+
+This article mainly analyzes the implementation principle of the storage layer of the Doris BE module by reading the code of the Doris BE module, and expounds and decrypts the core technology behind the efficient writing and query capabilities of Doris. It includes Doris column storage design, index design, data read and write process, Compaction process, version management of Tablet and Rowset, data backup and other functions.
+
+This article introduces the storage layer structure of the Segment V2 version, including rich functions such as ordered storage, sparse index, prefix index, bitmap index, BloomFilter, etc., which can provide extremely fast query capabilities for various complex scenarios.
+
+## 2 Design goals
+
+- Bulk import, few updates
+- the vast majority of read requests
+- Wide table scenario, read a lot of rows, few columns
+- Non-transactional scenarios
+- good scalability
+
+## 3 save file format
+
+### 3.1 Storage directory structure
+
+The storage layer's management of storage data is configured through the storage_root_path path, which can be multiple. The next layer of the storage directory is organized according to buckets. Specific tablets are stored in the bucket directory, and subdirectories are named according to tablet_id.
+
+Segment files are stored in the tablet_id directory and managed according to SchemaHash. There can be multiple Segment files, generally divided according to size, the default is 256MB. Among them, the segment v2 file naming rule is: ${rowset_id}_${segment_id}.dat.
+
+The specific storage directory storage format is shown in the following figure:
+
+
+
+![img](/images/blogs/storage/b9a87a028af1fc40babe2bf136334ec9.png)
+
+
+
+### 3.2 Segment v2 file structure
+
+The overall file format of Segment is divided into three parts: data area, index area and footer, as shown in the following figure:
+
+
+
+![img](/images/blogs/storage/f74e7c5fc5358ce8faa3e79ad7e625d3.png)
+
+
+
+- Data Region: used to store the data information of each column, the data here is loaded in pages as needed
+- Index Region: Doris stores the index data of each column uniformly in the Index Region. The data here will be loaded according to the column granularity, so it is stored separately from the column data information.
+- Footer information
+- SegmentFooterPB: Define the metadata information of the file
+- 4 bytes checksum of FooterPB content
+- 4 bytes of FileFooterPB message length for reading FileFooterPB
+
+The following distribution introduces the design of the storage format of each part.
+
+## 4 Footer Information
+
+The Footer information segment is at the end of the file, which stores the overall structure of the file, including the location of the data field, the location of the index field and other information, including SegmentFooterPB, CheckSum, Length, MAGIC CODE 4 parts.
+
+SegmentFooterPB data structure is as follows:
+
+![img](/images/blogs/storage/044434894abc13376ee9d14d78c5eff1.png)
+
+SegmentFooterPB uses the PB format for storage, which mainly includes the meta information of the column, the meta information of the index, the short key index information of the segment, and the total number of rows.
+
+### 4.1 Column meta information
+
+ColumnId: the serial number of the current column in the schema
+
+UniqueId: globally unique id
+
+Type: the type information of the column
+
+Length: the length information of the column
+
+Encoding: encoding format
+
+Compression: Compression format
+
+Dict PagePointer: Dictionary information
+
+### 4.2 Meta information of column index
+
+- OrdinalIndex: Stores the sparse index meta information of the column.
+- ZoneMapIndex: Stores the meta information of the ZoneMap index, including the maximum value, the minimum value, whether there is a null value, and whether there is no non-null value. SegmentZoneMap stores the global ZoneMap information, and PageZoneMaps stores the statistical information of each page.
+- BitMapIndex: Stores the meta information of BitMap index, including BitMap type, dictionary data BitMap data.
+- BloomFilterIndex: Stores the BloomFilter index information.
+
+In order to prevent the data volume of the index itself from being too large, ZoneMapIndex, BitMapIndex, and BloomFilterIndex adopt two-level Page management. Corresponding to the structure of IndexColumnMeta, when a Page can be put down, the current Page directly stores the index data, that is, a level 1 structure is adopted; when a Page cannot be put down, the index data is written into a new Page, and the Root Page stores the address information of the data Page .
+
+## 5 Ordinal Index
+
+The Ordinal Index index provides the physical address of the Column Data Page data page by row number. Ordinal Index can align column-stored data by row, which can be understood as a first-level index. When looking for data in other indexes, the Ordinal Index is used to find the location of the data Page. Therefore, the Ordinal Index index is introduced here first.
+
+In a segment, data is always stored in the sorted order of keys (AGGREGATE KEY, UNIQ KEY, and DUPLICATE KEY), that is, the sorting of keys determines the physical structure of data storage. The physical structure order of the column data is determined. When writing data, the Column Data Page is managed by the Ordinal index. The Ordinal index records the position offset, size and row number information of the first data item of each Column Data Page. Namely Ordinal. In this way, each colu [...]
+
+### 5.1 Storage structure
+
+Ordinal index meta information is stored in OrdinalIndexMeta for each column in SegmentFooterPB . The specific structure is shown in the following figure:
+
+![img](/images/blogs/storage/694799b9202d288a80868175bc91c33f.png)
+
+The root page address corresponding to the index data is stored in OrdinalIndexMeta. Some optimizations are made here. When the data has only one page, the address here can directly point to the only data page; when a page cannot be placed, it points to the second page of the OrdinalIndex type Hierarchical structure index page, each data item in the index data corresponds to the Column Data Page offset position, size and ordinal row number information. The Ordinal index index granularity [...]
+
+## 6 column data store
+
+### 6.1 data page storage structure
+
+DataPage is mainly divided into two parts: Data part and Page Footer.
+
+The Data section stores the data of the columns of the current Page. When the Null value is allowed, the Bitmap of the Null value is stored separately for the null value, and the row number of the Null value is recorded by the RLE format encoding through the bool type.
+
+
+
+![img](/images/blogs/storage/71b27dcd0a14ebe82562e2b5979d8c19.png)
+
+
+
+Page Footer contains Page type Type, UncompressedSize uncompressed data size, FirstOrdinal RowId of the first row of the current Page, NumValues is the number of rows of the current Page, NullMapSize corresponds to the size of NullBitmap.
+
+## 6.2 Data compression
+
+Different encodings are used for different field types. By default, the correspondences adopted for different types are as follows:
+
+![img](/images/blogs/storage/89DBFA60C385454DBE666C574DCDE408.png)
+
+The data is compressed in LZ4F format by default.
+
+## 7 Short Key Index Index
+
+### 7.1 Storage structure
+
+Short Key Index prefix index is an index method to quickly query data according to a given prefix column based on the sorting of keys (AGGREGATE KEY, UNIQ KEY and DUPLICATE KEY). Here, the Short Key Index index also adopts a sparse index structure. During the data writing process, an index item will be generated every certain number of rows. The number of rows is 1024 rows by default for the index granularity, which can be configured. The process is shown in the following diagram:
+
+![img](/images/blogs/storage/2a47fa7348f47e00e01bc93e38a1a547.png)
+
+Among them, KeyBytes stores the index item data, and OffsetBytes stores the offset of the index item in KeyBytes.
+
+### 7.2 Index Generation Rules
+
+The Short Key Index uses the first 36 bytes as the prefix index for this row of data. Prefix indexes are simply truncated when a VARCHAR type is encountered.
+
+### 7.3 Application Cases
+
+(1) The prefix index of the following table structure is user_id(8Byte) + age(4Bytes) + message(prefix 24 Bytes).
+
+![img](/images/blogs/storage/C7EC885556D24E8587BC37E6EC70930B.png)
+
+(2) The prefix index of the following table structure is user_name (20 Bytes). Even if it does not reach 36 bytes, because VARCHAR is encountered, it is directly truncated and will not continue further.
+
+![img](/images/blogs/storage/60C96B2D06D64E58A0B33384A59A0936.png)
+
+When our query condition is the prefix of the prefix index, the query speed can be greatly accelerated. For example, in the first example, we execute the following query:
+
+```sql
+SELECT * FROM table WHERE user_id=1829239 and age=20;
+````
+
+This query will be much more efficient than the following query:
+
+```sql
+SELECT * FROM table WHERE age=20;
+````
+
+Therefore, when building a table, choosing the correct column order can greatly improve query efficiency.
+
+## 8 ZoneMap Index index
+
+The ZoneMap index stores the statistics of the Segment and each column corresponding to each Page. These statistics can help speed up the query and reduce the amount of scanned data. The statistics include the maximum value of Min, the minimum value of Max, HashNull null value, and HasNotNull not all null information.
+
+### 8.1 Storage structure
+
+The index storage structure of ZoneMap is shown in the following figure:
+
+
+
+![img](/images/blogs/storage/6abc0dd9922ec1768e127d4e94030731.png)
+
+
+
+In the SegmentFootPB structure, each column of index metadata ColumnIndexMeta stores the ZoneMapIndex index data information of the current column. ZoneMapIndex has two parts, SegmentZoneMap and PageZoneMaps. SegmentZoneMap stores the global ZoneMap index information of the current Segment, and PageZoneMaps stores the ZoneMap index information of each Data Page.
+
+PageZoneMaps corresponds to the IndexedColumnMeta structure of the Page information stored in the index data. Currently, there is no compression in the implementation, and the encoding method is also Plain. The OrdinalIndexPage in IndexedColumnMeta points to the offset and size of the root page of the index data. The second-level Page optimization is also done here. When there is only one DataPage, OrdinalIndexMeta directly points to this DataPage; when there are multiple DataPages, Ordi [...]
+
+### 8.2 Index Generation Rules
+
+Doris opens the ZoneMap index for the key column by default; when the model of the table is DUPULCATE, the ZoneMap index is enabled for all fields. When the column data is written to the Page, the data is automatically compared, and the index information of the ZoneMap of the current Segment and the ZoneMap of the current Page is continuously maintained.
+
+### 8.3 Application Cases
+
+During data query, the fields that will be filtered according to the range conditions will select the scanned data range according to the ZoneMap statistics. For example, in case 1, filter on the age field. The query statement is as follows:
+
+```sql
+SELECT * FROM table WHERE age > 20 and age < 1000
+````
+
+If the Short Key Index is not hit, it will use the ZoneMap index to find the ordinary range of data that should be scanned according to the query conditions of age in the conditional statement, reducing the number of pages to be scanned.
+
+## 9 BloomFilter
+
+Doris provides BloomFilter index when some fields cannot use Short Key Index and the field has a high degree of discrimination.
+
+### 9.1 Storage structure
+
+The storage structure of BloomFilter is shown in the following figure::
+
+![img](/images/blogs/storage/dc49cfbc6dc5ac90fcc45c2b2bce54d4.png)
+
+The BloomFilterIndex information stores the produced Hash strategy, Hash algorithm and the corresponding data Page information of BloomFilter. Hash algorithm adopts HASH_MURMUR3, Hash strategy adopts BlockSplitBloomFilter block implementation strategy, and the expected false positive rate fpp is configured to be 0.05 by default.
+
+The storage of data pages corresponding to BloomFilter index data is similar to that of ZoneMapIndex, and the optimization of secondary pages has been done, which will not be described in detail here.
+
+### 9.2 Index Generation Rules
+
+BloomFilter is generated by Page granularity. When data is written to a complete Page, Doris will generate the BloomFilter index data of this Page at the same time according to the Hash strategy. Currently bloom filter does not support tinyint/hll/float/double types, other types are already supported. When using, you need to specify bloom_filter_columns in PROPERTIES The fields to be indexed by BloomFilter.
+
+### 9.3 Application Cases
+
+When querying data, the query conditions are filtered in the field with bloom filter set. When the bloom filter is not hit, it means that there is no such data in the page, which can reduce the number of pages to be scanned.
+
+Case: The schema of the table is as follows:
+
+![img](/images/blogs/storage/2D89E0227253499AAFB77477B64DC2E5.png)
+
+The query sql here is as follows:
+
+```sql
+SELECT * FROM table WHERE name = 'Zhang San'
+````
+
+Due to the high degree of discrimination of name, in order to improve the query performance of sql, a BloomFilter index, PROPERTIES ( "bloom_filter_columns" = "name" ), is added to the name data. At query time, the BloomFilter index can filter out a large number of Pages.
+
+## 10 Bitmap Index index
+
+Doris also provides BitmapIndex to speed up data queries.
+
+## 10.1 Storage structure
+
+Bitmap storage format is as follows:
+
+![img](/images/blogs/storage/3001a1785a41628cd88c6e2928290d2f.png)
+
+The meta information of BitmapIndex is also stored in SegmentFootPB. BitmapIndex includes three parts, BitMap type, dictionary information DictColumn, and bitmap index data information BitMapColumn. Among them, DictColumn and BitMapColumn correspond to the IndexedColumnData structure, and store the Page address offset and size of dictionary data and index data respectively. The optimization of the secondary page is also done here, and will not be explained in detail.
+
+The difference between this and other index storage structures is that the DictColumn dictionary data is LZ4F compressed, and the first value in the Data Page is stored when the secondary Page offset is recorded.
+
+### 10.2 Index Generation Rules
+
+When creating a BitMap, it needs to be created through CREATE INDEX. The index of the Bitmap is the index of the Column field in the entire Segment, rather than generating a separate copy for each Page. When writing data, a map structure is maintained to record the row number corresponding to each key value, and the Roaring bitmap is used to encode the rowid. The main structure is as follows:
+
+![img](/images/blogs/storage/e9a2a4defc1204c507c0b9359225650f.png)
+
+When generating index data, the dictionary data is first written, and the key value of the map structure is written into the DictColumn. Then, the key corresponds to the Roaring-encoded rowid to write data into BitMapColumn in bytes.
+
+### 10.3 Application Cases
+
+When querying data, bitmap indexes can be used to optimize data columns with small degrees of differentiation and small column cardinality. For example, gender, marriage, geographic information, etc.
+
+Case: The schema of the table is as follows:
+
+![img](/images/blogs/storage/EAD7EEF330B048BC8C1EBD8EF4842772.png)
+
+The query sql here is as follows:
+
+```sql
+SELECT * FROM table WHERE city in ("Beijing", "Shanghai")
+````
+
+Since the value of city is relatively small, after the data dictionary and bitmap are established, matching rows can be quickly found by scanning the bitmap. And after bitmap compression, the amount of data itself is small, and the entire column can be accurately matched by scanning less data.
+
+## 11 Index query process
+
+When querying data in a Segment, according to the query conditions executed, the data will be filtered first according to the field indexing. Then read the data, the overall query process is as follows:
+
+
+
+![img](/images/blogs/storage/e2c62616c1c12fa05457eb6c443ebc48.png)
+
+
+
+1. First, a row_bitmap will be constructed according to the number of rows in the Segment, indicating that the data needs to be read to record. If no index is used, all data needs to be read.
+2. When the key is used in the query condition according to the prefix index rule, the ShortKey Index will be filtered first, and the ordinal row number range matched in the ShortKey Index can be merged into the row_bitmap.
+3. When there is a BitMap Index index in the column field in the query condition, the ordinal row number that meets the conditions will be directly found according to the BitMap index, and the intersection filter with row_bitmap will be obtained. The filtering here is accurate, and after removing the query condition, this field will not be filtered by the subsequent index.
+4. When there is a BloomFilter index in the column field in the query condition and the condition is equal (eq, in, is), it will be filtered by the BloomFilter index, here will go through all the indexes, filter the BloomFilter of each Page, and find out the query condition can be All Pages hit. Intersect the ordinal row number range in the index information with row_bitmap.
+5. When there is a ZoneMap index in the column field in the query condition, it will be filtered by the ZoneMap index. Here, all the indexes will also be traversed to find all the pages that the query condition can intersect with the ZoneMap. Intersect the ordinal row number range in the index information with row_bitmap.
+6. After the row_bitmap is generated, find the specific Data Page in batches through the OrdinalIndex of each Column.
+7. Batch read the data of the Column Data Page of each column. When reading, for a page with a null value, judge whether the current row is null according to the null value bitmap. If it is null, it can be filled directly.
+
+## 12 Summary
+
+Doris currently adopts a complete column storage structure and provides rich indexes to deal with different query scenarios, laying a solid foundation for Doris's efficient writing and query performance. The Doris storage layer is designed to be flexible, and functions such as new indexes and enhanced data deletion can be further added in the future.
\ No newline at end of file
diff --git a/blogs/images/blogs/storage/044434894abc13376ee9d14d78c5eff1.png b/blogs/images/blogs/storage/044434894abc13376ee9d14d78c5eff1.png
new file mode 100644
index 0000000000..9bf2dc1d44
Binary files /dev/null and b/blogs/images/blogs/storage/044434894abc13376ee9d14d78c5eff1.png differ
diff --git a/blogs/images/blogs/storage/2D89E0227253499AAFB77477B64DC2E5.png b/blogs/images/blogs/storage/2D89E0227253499AAFB77477B64DC2E5.png
new file mode 100644
index 0000000000..73cb18a330
Binary files /dev/null and b/blogs/images/blogs/storage/2D89E0227253499AAFB77477B64DC2E5.png differ
diff --git a/blogs/images/blogs/storage/2a47fa7348f47e00e01bc93e38a1a547.png b/blogs/images/blogs/storage/2a47fa7348f47e00e01bc93e38a1a547.png
new file mode 100644
index 0000000000..1b8ef0b27d
Binary files /dev/null and b/blogs/images/blogs/storage/2a47fa7348f47e00e01bc93e38a1a547.png differ
diff --git a/blogs/images/blogs/storage/3001a1785a41628cd88c6e2928290d2f.png b/blogs/images/blogs/storage/3001a1785a41628cd88c6e2928290d2f.png
new file mode 100644
index 0000000000..886d224923
Binary files /dev/null and b/blogs/images/blogs/storage/3001a1785a41628cd88c6e2928290d2f.png differ
diff --git a/blogs/images/blogs/storage/60C96B2D06D64E58A0B33384A59A0936.png b/blogs/images/blogs/storage/60C96B2D06D64E58A0B33384A59A0936.png
new file mode 100644
index 0000000000..84729d161b
Binary files /dev/null and b/blogs/images/blogs/storage/60C96B2D06D64E58A0B33384A59A0936.png differ
diff --git a/blogs/images/blogs/storage/694799b9202d288a80868175bc91c33f.png b/blogs/images/blogs/storage/694799b9202d288a80868175bc91c33f.png
new file mode 100644
index 0000000000..28f693e1f6
Binary files /dev/null and b/blogs/images/blogs/storage/694799b9202d288a80868175bc91c33f.png differ
diff --git a/blogs/images/blogs/storage/6abc0dd9922ec1768e127d4e94030731.png b/blogs/images/blogs/storage/6abc0dd9922ec1768e127d4e94030731.png
new file mode 100644
index 0000000000..2e00037b96
Binary files /dev/null and b/blogs/images/blogs/storage/6abc0dd9922ec1768e127d4e94030731.png differ
diff --git a/blogs/images/blogs/storage/71b27dcd0a14ebe82562e2b5979d8c19.png b/blogs/images/blogs/storage/71b27dcd0a14ebe82562e2b5979d8c19.png
new file mode 100644
index 0000000000..95369e91e5
Binary files /dev/null and b/blogs/images/blogs/storage/71b27dcd0a14ebe82562e2b5979d8c19.png differ
diff --git a/blogs/images/blogs/storage/89DBFA60C385454DBE666C574DCDE408.png b/blogs/images/blogs/storage/89DBFA60C385454DBE666C574DCDE408.png
new file mode 100644
index 0000000000..22e444dff7
Binary files /dev/null and b/blogs/images/blogs/storage/89DBFA60C385454DBE666C574DCDE408.png differ
diff --git a/blogs/images/blogs/storage/C7EC885556D24E8587BC37E6EC70930B.png b/blogs/images/blogs/storage/C7EC885556D24E8587BC37E6EC70930B.png
new file mode 100644
index 0000000000..a7744b9fbf
Binary files /dev/null and b/blogs/images/blogs/storage/C7EC885556D24E8587BC37E6EC70930B.png differ
diff --git a/blogs/images/blogs/storage/EAD7EEF330B048BC8C1EBD8EF4842772.png b/blogs/images/blogs/storage/EAD7EEF330B048BC8C1EBD8EF4842772.png
new file mode 100644
index 0000000000..40e240f9d3
Binary files /dev/null and b/blogs/images/blogs/storage/EAD7EEF330B048BC8C1EBD8EF4842772.png differ
diff --git a/blogs/images/blogs/storage/b9a87a028af1fc40babe2bf136334ec9.png b/blogs/images/blogs/storage/b9a87a028af1fc40babe2bf136334ec9.png
new file mode 100644
index 0000000000..3c99705c94
Binary files /dev/null and b/blogs/images/blogs/storage/b9a87a028af1fc40babe2bf136334ec9.png differ
diff --git a/blogs/images/blogs/storage/dc49cfbc6dc5ac90fcc45c2b2bce54d4.png b/blogs/images/blogs/storage/dc49cfbc6dc5ac90fcc45c2b2bce54d4.png
new file mode 100644
index 0000000000..439180e172
Binary files /dev/null and b/blogs/images/blogs/storage/dc49cfbc6dc5ac90fcc45c2b2bce54d4.png differ
diff --git a/blogs/images/blogs/storage/e2c62616c1c12fa05457eb6c443ebc48.png b/blogs/images/blogs/storage/e2c62616c1c12fa05457eb6c443ebc48.png
new file mode 100644
index 0000000000..14962ae96a
Binary files /dev/null and b/blogs/images/blogs/storage/e2c62616c1c12fa05457eb6c443ebc48.png differ
diff --git a/blogs/images/blogs/storage/e9a2a4defc1204c507c0b9359225650f.png b/blogs/images/blogs/storage/e9a2a4defc1204c507c0b9359225650f.png
new file mode 100644
index 0000000000..b3ef2f1c1d
Binary files /dev/null and b/blogs/images/blogs/storage/e9a2a4defc1204c507c0b9359225650f.png differ
diff --git a/blogs/images/blogs/storage/f74e7c5fc5358ce8faa3e79ad7e625d3.png b/blogs/images/blogs/storage/f74e7c5fc5358ce8faa3e79ad7e625d3.png
new file mode 100644
index 0000000000..7badd7cd39
Binary files /dev/null and b/blogs/images/blogs/storage/f74e7c5fc5358ce8faa3e79ad7e625d3.png differ
diff --git a/blogs/zh-CN/doris-storage-struct-design.md b/blogs/zh-CN/doris-storage-struct-design.md
new file mode 100644
index 0000000000..0f339f29f7
--- /dev/null
+++ b/blogs/zh-CN/doris-storage-struct-design.md
@@ -0,0 +1,326 @@
+---
+{
+    "title": "Apache Doris存储层设计之存储结构设计解析",
+    "description": "本文主要通过阅读 Doris BE 模块代码,详细分析了 Doris BE 模块存储层的实现原理,阐述和解密 Doris 高效的写入、查询能力背后的核心技术。其中包括 Doris 列存的设计、索引设计、数据读写流程、Compaction 流程、Tablet 和 Rowset 的版本管理、数据备份等功能.",
+    "date": "2022-05-20",
+    "metaTitle": "Apache Doris存储层设计之存储结构设计解析",
+    "isArticle": true,
+    "language": "zh-CN",
+    "author": "ApacheDoris",
+    "layout": "Article",
+    "sidebar": false
+}
+---
+
+<!-- 
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# 存储层设计之存储结构设计解析
+
+## 1. 整体介绍
+
+Doris 是基于 MPP 架构的交互式 SQL 数据仓库,主要用于解决近实时的报表和多维分析。Doris 高效的导入、查询离不开其存储结构精巧的设计。
+
+本文主要通过阅读 Doris BE 模块代码,详细分析了 Doris BE 模块存储层的实现原理,阐述和解密 Doris 高效的写入、查询能力背后的核心技术。其中包括 Doris 列存的设计、索引设计、数据读写流程、Compaction 流程、Tablet 和 Rowset 的版本管理、数据备份等功能。
+
+文章介绍了 Segment V2 版本的存储层结构,包括了有序存储、稀疏索引、前缀索引、位图索引、BloomFilter 等丰富功能,可以应对各种复杂的场景提供极速的查询能力。
+
+## 2 设计目标
+
+- 批量导入,少量更新
+- 绝大多数的读请求
+- 宽表场景,读取大量行,少量列
+- 非事务场景
+- 良好的扩展性
+
+## 3 储存文件格式
+
+### 3.1 存储目录结构
+
+存储层对存储数据的管理通过 storage_root_path 路径进行配置,路径可以是多个。存储目录下一层按照分桶进行组织,分桶目录下存放具体的 tablet,按照 tablet_id 命名子目录。
+
+Segment 文件存放在 tablet_id 目录下按 SchemaHash 管理。Segment 文件可以有多个,一般按照大小进行分割,默认为 256MB。其中,Segment v2 文件命名规则为:${rowset_id}_${segment_id}.dat。
+
+具体存储目录存放格式如下图所示:
+
+
+
+![img](/images/blogs/storage/b9a87a028af1fc40babe2bf136334ec9.png)
+
+
+
+### 3.2 Segment v2 文件结构
+
+Segment 整体的文件格式分为数据区域,索引区域和 footer 三个部分,如下图所示:
+
+
+
+![img](/images/blogs/storage/f74e7c5fc5358ce8faa3e79ad7e625d3.png)
+
+
+
+- Data Region: 用于存储各个列的数据信息,这里的数据是按需分 page 加载的
+- Index Region: Doris 中将各个列的 index 数据统一存储在 Index Region,这里的数据会按照列粒度进行加载,所以跟列的数据信息分开存储
+- Footer 信息
+- SegmentFooterPB: 定义文件的元数据信息
+- 4 个字节的 FooterPB 内容的 checksum
+- 4 个字节的 FileFooterPB 消息长度,用于读取 FileFooterPB
+
+下面分布介绍各个部分的存储格式的设计。
+
+## 4 Footer 信息
+
+Footer 信息段在文件的尾部,存储了文件的整体结构,包括数据域的位置,索引域的位置等信息,其中有 SegmentFooterPB,CheckSum,Length,MAGIC CODE 4 个部分。
+
+SegmentFooterPB 数据结构如下:
+
+![img](/images/blogs/storage/044434894abc13376ee9d14d78c5eff1.png)
+
+SegmentFooterPB 采用了 PB 格式进行存储,主要包含了列的 meta 信息、索引的 meta 信息,Segment 的 short key 索引信息、总行数。
+
+### 4.1 列的 meta 信息
+
+ColumnId:当前列在 schema 中的序号
+
+UniqueId:全局唯一的 id
+
+Type:列的类型信息
+
+Length:列的长度信息
+
+Encoding:编码格式
+
+Compression:压缩格式
+
+Dict PagePointer:字典信息
+
+### 4.2 列索引的 meta 信息
+
+- OrdinalIndex:存放列的稀疏索引 meta 信息。
+- ZoneMapIndex:存放 ZoneMap 索引的 meta 信息,内容包括了最大值、最小值、是否有空值、是否没有非空值。SegmentZoneMap 存放了全局的 ZoneMap 信息,PageZoneMaps 则存放了每个页面的统计信息。
+- BitMapIndex:存放 BitMap 索引的 meta 信息,内容包括了 BitMap 类型,字典数据 BitMap 数据。
+- BloomFilterIndex:存放了 BloomFilter 索引信息。
+
+为了防止索引本身数据量过大,ZoneMapIndex、BitMapIndex、BloomFilterIndex 采用了两级的 Page 管理。对应了 IndexColumnMeta 的结构,当一个 Page 能够放下时,当前 Page 直接存放索引数据,即采用 1 级结构;当一个 Page 无法放下时,索引数据写入新的 Page 中,Root Page 存储数据 Page 的地址信息。
+
+## 5 Ordinal Index (一级索引)
+
+Ordinal Index 索引提供了通过行号来查找 Column Data Page 数据页的物理地址。Ordinal Index 能够将按列存储数据按行对齐,可以理解为一级索引。其他索引查找数据时,都要通过 Ordinal Index 查找数据 Page 的位置。因此,这里先介绍 Ordinal Index 索引。
+
+在一个 segment 中,数据始终按照 key(AGGREGATE KEY、UNIQ KEY 和 DUPLICATE KEY)排序顺序进行存储,即 key 的排序决定了数据存储的物理结构。确定了列数据的物理结构顺序,在写入数据时,Column Data Page 是由 Ordinal index 进行管理,Ordinal index 记录了每个 Column Data Page 的位置 offset、大小 size 和第一个数据项行号信息,即 Ordinal。这样每个列具有按行信息进行快速扫描的能力。Ordinal index 采用的稀疏索引结构,就像是一本书目录,记录了每个章节对应的页码。
+
+### 5.1 存储结构
+
+Ordinal index 元信息存储在 SegmentFooterPB 中的每个列的 OrdinalIndexMeta 中。具体结构如下图所示:
+
+![img](/images/blogs/storage/694799b9202d288a80868175bc91c33f.png)
+
+在 OrdinalIndexMeta 中存放了索引数据对应的 root page 地址,这里做了一些优化,当数据仅有一个 page 时,这里的地址可以直接指向唯一的数据 page;当一个 page 放不下时,指向 OrdinalIndex 类型的二级结构索引 page,索引数据中每个数据项对应了 Column Data Page offset 位置、size 大小和 ordinal 行号信息。其中 Ordinal index 索引粒度与 page 粒度一致,默认 64*1024 字节。
+
+## 6 列数据存储
+
+### 6.1 data page 存储结构
+
+DataPage 主要为 Data 部分、Page Footer 两个部分。
+
+Data 部分存放了当前 Page 的列的数据。当允许存在 Null 值时,对空值单独存放了 Null 值的 Bitmap,由 RLE 格式编码通过 bool 类型记录 Null 值的行号。
+
+
+
+![img](/images/blogs/storage/71b27dcd0a14ebe82562e2b5979d8c19.png)
+
+
+
+Page Footer 包含了 Page 类型 Type、UncompressedSize 未压缩时的数据大小、FirstOrdinal 当前 Page 第一行的 RowId、NumValues 为当前 Page 的行数、NullMapSize 对应了 NullBitmap 的大小。
+
+## 6.2 数据压缩
+
+针对不同的字段类型采用了不同的编码。默认情况下,针对不同类型采用的对应关系如下:
+
+![img](/images/blogs/storage/89DBFA60C385454DBE666C574DCDE408.png)
+
+默认采用 LZ4F 格式对数据进行压缩。
+
+## 7 Short Key Index 索引
+
+### 7.1 存储结构
+
+Short Key Index 前缀索引,是在 key(AGGREGATE KEY、UNIQ KEY 和 DUPLICATE KEY)排序的基础上,实现的一种根据给定前缀列,快速查询数据的索引方式。这里 Short Key Index 索引也采用了稀疏索引结构,在数据写入过程中,每隔一定行数,会生成一个索引项。这个行数为索引粒度默认为 1024 行,可配置。该过程如下图所示:
+
+![img](/images/blogs/storage/2a47fa7348f47e00e01bc93e38a1a547.png)
+
+其中,KeyBytes 中存放了索引项数据,OffsetBytes 存放了索引项在 KeyBytes 中的偏移。
+
+### 7.2 索引生成规则
+
+Short Key Index 采用了前 36 个字节,作为这行数据的前缀索引。当遇到 VARCHAR 类型时,前缀索引会直接截断。
+
+### 7.3 应用案例
+
+(1)以下表结构的前缀索引为 user_id(8Byte) + age(4Bytes) + message(prefix 24 Bytes)。
+
+![img](/images/blogs/storage/C7EC885556D24E8587BC37E6EC70930B.png)
+
+(2)以下表结构的前缀索引为 user_name(20 Bytes)。即使没有达到 36 个字节,因为遇到 VARCHAR,所以直接截断,不再往后继续。
+
+![img](/images/blogs/storage/60C96B2D06D64E58A0B33384A59A0936.png)
+
+当我们的查询条件,是前缀索引的前缀时,可以极大的加快查询速度。比如在第一个例子中,我们执行如下查询:
+
+```sql
+SELECT * FROM table WHERE user_id=1829239 and age=20;
+```
+
+该查询的效率会远高于如下查询:
+
+```sql
+SELECT * FROM table WHERE age=20;
+```
+
+所以在建表时,正确的选择列顺序,能够极大地提高查询效率。
+
+## 8 ZoneMap Index 索引
+
+ZoneMap 索引存储了 Segment 和每个列对应每个 Page 的统计信息。这些统计信息可以帮助在查询时提速,减少扫描数据量,统计信息包括了 Min 最大值、Max 最小值、HashNull 空值、HasNotNull 不全为空的信息。
+
+### 8.1 存储结构
+
+ZoneMap 索引存储结构如下图所示:
+
+
+
+![img](/images/blogs/storage/6abc0dd9922ec1768e127d4e94030731.png)
+
+
+
+在 SegmentFootPB 结构中,每一列索引元数据 ColumnIndexMeta 中存放了当前列的 ZoneMapIndex 索引数据信息。ZoneMapIndex 有两个部分,SegmentZoneMap 和 PageZoneMaps。SegmentZoneMap 存放了当前 Segment 全局的 ZoneMap 索引信息,PageZoneMaps 存放了每个 Data Page 的 ZoneMap 索引信息。
+
+PageZoneMaps 对应了索引数据存放的 Page 信息 IndexedColumnMeta 结构,目前实现上没有进行压缩,编码方式也为 Plain。IndexedColumnMeta 中的 OrdinalIndexPage 指向索引数据 root page 的偏移和大小,这里同样做了优化二级 Page 优化,当仅有一个 DataPage 时,OrdinalIndexMeta 直接指向这个 DataPage;有多个 DataPage 时,OrdinalIndexMeta 先指向 OrdinalIndexPage,OrdinalIndexPage 是一个二级 Page 结构,里面的数据项为索引数据 DataPage 的地址偏移 offset,大小 Size 和 ordinal 信息。
+
+### 8.2 索引生成规则
+
+Doris 默认为 key 列开启 ZoneMap 索引;当表的模型为 DUPULCATE 时,会所有字段开启 ZoneMap 索引。在列数据写入 Page 时,自动对数据进行比较,不断维护当前 Segment 的 ZoneMap 和当前 Page 的 ZoneMap 索引信息。
+
+### 8.3 应用案例
+
+在数据查询时,会根据范围条件过滤的字段会按照 ZoneMap 统计信息选取扫描的数据范围。例如在案例 1 中,对 age 字段进行过滤。查询语句如下:
+
+```sql
+SELECT * FROM table WHERE age > 20 and age < 1000
+```
+
+在没有命中 Short Key Index 的情况下,会根据条件语句中 age 的查询条件,利用 ZoneMap 索引找到应该扫描的数据 ordinary 范围,减少要扫描的 page 数量。
+
+## 9 BloomFilter
+
+当一些字段不能利用 Short Key Index 并且字段存在区分度比较大时,Doris 提供了 BloomFilter 索引。
+
+### 9.1 存储结构
+
+BloomFilter 的存储结构如下图所示:
+
+![img](/images/blogs/storage/dc49cfbc6dc5ac90fcc45c2b2bce54d4.png)
+
+BloomFilterIndex 信息存放了生产的 Hash 策略、Hash 算法和 BloomFilter 过对应的数据 Page 信息。Hash 算法采用了 HASH_MURMUR3,Hash 策略采用了 BlockSplitBloomFilter 分块实现策略,期望的误判率 fpp 默认配置为 0.05。
+
+BloomFilter 索引数据对应数据 Page 的存放与 ZoneMapIndex 类似,做了二级 Page 的优化,这里不再详细阐述。
+
+
+
+### 9.2 索引生成规则
+
+BloomFilter 按 Page 粒度生成,在数据写入一个完整的 Page 时,Doris 会根据 Hash 策略同时生成这个 Page 的 BloomFilter 索引数据。目前 bloom 过滤器不支持 tinyint/hll/float/double 类型,其他类型均已支持。使用时需要在 PROPERTIES 中指定 bloom_filter_columns 要使用 BloomFilter 索引的字段。
+
+### 9.3 应用案例
+
+在数据查询时,查询条件在设置有 bloom 过滤器的字段进行过滤,当 bloom 过滤器没有命中时表示该 Page 中没有该数据,这样可以减少要扫描的 page 数量。
+
+案例:table 的 schema 如下:
+
+![img](/images/blogs/storage/2D89E0227253499AAFB77477B64DC2E5.png)
+
+这里的查询 sql 如下:
+
+```sql
+SELECT * FROM table WHERE name = '张三'
+```
+
+由于 name 的区分度较大,为了提升 sql 的查询性能,对 name 数据增加了 BloomFilter 索引,PROPERTIES ( "bloom_filter_columns" = "name" )。在查询时通过 BloomFilter 索引能够大量过滤掉 Page。
+
+## 10 Bitmap Index 索引
+
+Doris 还提供了 BitmapIndex 用来加速数据的查询。
+
+## 10.1 存储结构
+
+Bitmap 存储格式如下:
+
+![img](/images/blogs/storage/3001a1785a41628cd88c6e2928290d2f.png)
+
+BitmapIndex 的 meta 信息同样存放在 SegmentFootPB 中,BitmapIndex 包含了三部分,BitMap 的类型、字典信息 DictColumn、位图索引数据信息 BitMapColumn。其中 DictColumn、BitMapColumn 都对应 IndexedColumnData 结构,分别存放了字典数据和索引数据的 Page 地址 offset、大小 size。这里同样做了二级 page 的优化,不再具体阐述。
+
+这里与其他索引存储结构有差异的地方是 DictColumn 字典数据进行了 LZ4F 压缩,在记录二级 Page 偏移时存放的是 Data Page 中的第一个值。
+
+### 10.2 索引生成规则
+
+BitMap 创建时需要通过 CREATE INDEX 进行创建。Bitmap 的索引是整个 Segment 中的 Column 字段的索引,而不是为每个 Page 单独生成一份。在写入数据时,会维护一个 map 结构记录下每个 key 值对应的行号,并采用 Roaring 位图对 rowid 进行编码。主要结构如下:
+
+![img](/images/blogs/storage/e9a2a4defc1204c507c0b9359225650f.png)
+
+生成索引数据时,首先写入字典数据,将 map 结构的 key 值写入到 DictColumn 中。然后,key 对应 Roaring 编码的 rowid 以字节方式将数据写入到 BitMapColumn 中。
+
+
+
+### 10.3应用案例
+
+在数据查询时,对于区分度不大,列的基数比较小的数据列,可以采用位图索引进行优化。比如,性别,婚姻,地理信息等。
+
+案例:table 的 schema 如下:
+
+![img](/images/blogs/storage/EAD7EEF330B048BC8C1EBD8EF4842772.png)
+
+这里的查询 sql 如下:
+
+```sql
+SELECT * FROM table WHERE city in ("北京", "上海")
+```
+
+由于 city 的取值比较少,建立数据字典和位图后,通过扫描位图便可以快速查找出匹配行。并且位图压缩后,数据量本身较小,通过扫描较少数据便能够对整个列进行精确的匹配。
+
+## 11 索引的查询流程
+
+在查询一个 Segment 中的数据时,根据执行的查询条件,会对首先根据字段加索引的情况对数据进行过滤。然后在进行读取数据,整体的查询流程如下:
+
+
+
+![img](/images/blogs/storage/e2c62616c1c12fa05457eb6c443ebc48.png)
+
+
+
+1. 首先,会按照 Segment 的行数构建一个 row_bitmap,表示记录那些数据需要进行读取,没有使用任何索引的情况下,需要读取所有数据。
+2. 当查询条件中按前缀索引规则使用到了 key 时,会先进行 ShortKey Index 的过滤,可以在 ShortKey Index 中匹配到的 ordinal 行号范围,合入到 row_bitmap 中。
+3. 当查询条件中列字段存在 BitMap Index 索引时,会按照 BitMap 索引直接查出符合条件的 ordinal 行号,与 row_bitmap 求交过滤。这里的过滤是精确的,之后去掉该查询条件,这个字段就不会再进行后面索引的过滤。
+4. 当查询条件中列字段存在 BloomFilter 索引并且条件为等值(eq,in,is)时,会按 BloomFilter 索引过滤,这里会走完所有索引,过滤每一个 Page 的 BloomFilter,找出查询条件能命中的所有 Page。将索引信息中的 ordinal 行号范围与 row_bitmap 求交过滤。
+5. 当查询条件中列字段存在 ZoneMap 索引时,会按 ZoneMap 索引过滤,这里同样会走完所有索引,找出查询条件能与 ZoneMap 有交集的所有 Page。将索引信息中的 ordinal 行号范围与 row_bitmap 求交过滤。
+6. 生成好 row_bitmap 之后,批量通过每个 Column 的 OrdinalIndex 找到到具体的 Data Page。
+7. 批量读取每一列的 Column Data Page 的数据。在读取时,对于有 null 值的 page,根据 null 值位图判断当前行是否是 null,如果为 null 进行直接填充即可。
+
+## 12 总结
+
+Doris 目前采用了完全的列存储结构,并提供了丰富的索引应对不同查询场景,为 Doris 高效的写入、查询性能奠定了夯实的基础。Doris 存储层设计灵活,未来还可以进一步增加新的索引、强化数据删除等功能。
\ No newline at end of file


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org