Posted to commits@iotdb.apache.org by qi...@apache.org on 2020/03/25 02:52:44 UTC

[incubator-iotdb] branch master updated: add system design eng (#938)

This is an automated email from the ASF dual-hosted git repository.

qiaojialin pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-iotdb.git


The following commit(s) were added to refs/heads/master by this push:
     new 658f1c0  add system design eng (#938)
658f1c0 is described below

commit 658f1c098f9f40fbda66850c130480a9c003fb83
Author: Sail <37...@users.noreply.github.com>
AuthorDate: Wed Mar 25 10:52:34 2020 +0800

    add system design eng (#938)
---
 .../SystemDesign/0-Architecture/1-Architecture.md  |  54 +++
 .../SystemDesign/2-QueryEngine/1-QueryEngine.md    |  64 ++++
 .../SystemDesign/2-QueryEngine/2-Planner.md        |  65 ++++
 .../SystemDesign/2-QueryEngine/3-PlanExecutor.md   |  26 ++
 .../3-SchemaManager/1-SchemaManager.md             |  26 ++
 .../4-StorageEngine/1-StorageEngine.md             |  68 ++++
 .../SystemDesign/4-StorageEngine/2-WAL.md          |  26 ++
 .../SystemDesign/4-StorageEngine/3-FlushManager.md |  84 +++++
 .../SystemDesign/4-StorageEngine/4-MergeManager.md |  26 ++
 .../4-StorageEngine/5-DataPartition.md             |  86 +++++
 .../4-StorageEngine/6-DataManipulation.md          |  95 +++++
 .../SystemDesign/5-DataQuery/1-DataQuery.md        |  40 +++
 .../SystemDesign/5-DataQuery/2-SeriesReader.md     | 384 +++++++++++++++++++++
 .../SystemDesign/5-DataQuery/3-RawDataQuery.md     | 303 ++++++++++++++++
 .../SystemDesign/5-DataQuery/4-AggregationQuery.md | 114 ++++++
 .../SystemDesign/5-DataQuery/5-GroupByQuery.md     | 260 ++++++++++++++
 .../SystemDesign/5-DataQuery/6-LastQuery.md        | 122 +++++++
 .../5-DataQuery/7-AlignByDeviceQuery.md            | 203 +++++++++++
 docs/Documentation/SystemDesign/6-Tools/1-Sync.md  | 249 +++++++++++++
 .../SystemDesign/7-Connector/2-Hive-TsFile.md      | 114 ++++++
 .../SystemDesign/7-Connector/3-Spark-TsFile.md     |  94 +++++
 .../SystemDesign/7-Connector/4-Spark-IOTDB.md      |  87 +++++
 22 files changed, 2590 insertions(+)

diff --git a/docs/Documentation/SystemDesign/0-Architecture/1-Architecture.md b/docs/Documentation/SystemDesign/0-Architecture/1-Architecture.md
new file mode 100644
index 0000000..b4e3b56
--- /dev/null
+++ b/docs/Documentation/SystemDesign/0-Architecture/1-Architecture.md
@@ -0,0 +1,54 @@
+<!--
+
+    Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements.  See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership.  The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License.  You may obtain a copy of the License at
+    
+        http://www.apache.org/licenses/LICENSE-2.0
+    
+    Unless required by applicable law or agreed to in writing,
+    software distributed under the License is distributed on an
+    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+    KIND, either express or implied.  See the License for the
+    specific language governing permissions and limitations
+    under the License.
+
+-->
+
+# Application Overview
+
+<img style="width:100%; max-width:800px; max-height:600px; margin-left:auto; margin-right:auto; display:block;" src="https://user-images.githubusercontent.com/19167280/73625222-ddd88680-467e-11ea-9098-e808ed4979c5.png">
+
+The architecture diagram of the IoT time series database Apache IoTDB is shown above. It covers the full life cycle of time series data management, including collection, storage, query, analysis, and visualization. The gray parts are the IoTDB components.
+
+## Introduction to IoTDB architecture
+
+As shown in the following figure, IoTDB uses a client-server architecture.
+
+<img style="width:100%; max-width:400px; max-height:600px; margin-left:auto; margin-right:auto; display:block;" src="https://user-images.githubusercontent.com/19167280/73625221-ddd88680-467e-11ea-9cf3-70367e5886f4.png">
+
+The server is centered on a query engine that processes all user requests and dispatches them to the corresponding management components, including the data writing layer, data query, schema management, and administration modules.
+
+* [TsFile](/#/SystemDesign/progress/chap1/sec1)
+* [QueryEngine](/#/SystemDesign/progress/chap2/sec1)
+* [SchemaManager](/#/SystemDesign/progress/chap3/sec1)
+* [StorageEngine](/#/SystemDesign/progress/chap4/sec1)
+* [DataQuery](/#/SystemDesign/progress/chap5/sec1)
+
+## System Tools
+
+* [Data synchronization tool](/#/SystemDesign/progress/chap6/sec1)
+
+## Connector
+
+IoTDB integrates with big data systems through the following connectors.
+
+* [Hadoop-TsFile](/#/SystemDesign/progress/chap7/sec1)
+* [Hive-TsFile](/#/SystemDesign/progress/chap7/sec2)
+* [Spark-TsFile](/#/SystemDesign/progress/chap7/sec3)
+* [Spark-IoTDB](/#/SystemDesign/progress/chap7/sec4)
+* [Grafana](/#/SystemDesign/progress/chap7/sec5)
diff --git a/docs/Documentation/SystemDesign/2-QueryEngine/1-QueryEngine.md b/docs/Documentation/SystemDesign/2-QueryEngine/1-QueryEngine.md
new file mode 100644
index 0000000..549bcd5
--- /dev/null
+++ b/docs/Documentation/SystemDesign/2-QueryEngine/1-QueryEngine.md
@@ -0,0 +1,64 @@
+<!--
+
+    Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements.  See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership.  The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License.  You may obtain a copy of the License at
+    
+        http://www.apache.org/licenses/LICENSE-2.0
+    
+    Unless required by applicable law or agreed to in writing,
+    software distributed under the License is distributed on an
+    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+    KIND, either express or implied.  See the License for the
+    specific language governing permissions and limitations
+    under the License.
+
+-->
+
+# QueryEngine
+
+<img style="width:100%; max-width:800px; max-height:600px; margin-left:auto; margin-right:auto; display:block;" src="https://user-images.githubusercontent.com/19167280/73625242-f648a100-467e-11ea-921c-b954a3ecae7a.png">
+
+## Design ideas
+
+The query engine is responsible for parsing all user commands, generating plans, delivering them to the corresponding executors, and returning result sets.
+
+## Related classes
+
+* org.apache.iotdb.db.service.TSServiceImpl
+
+  IoTDB server-side RPC implementation, which directly interacts with the client.
+
+* org.apache.iotdb.db.qp.Planner
+
+  Parse SQL, generate logical plans, optimize logical plans, and generate physical plans.
+
+* org.apache.iotdb.db.qp.executor.PlanExecutor
+
+  Distributes the physical plan to the corresponding executor, which is one of the following four:
+
+  * MManager: Metadata operations
+  * StorageEngine: Data write
+  * QueryRouter: Data query
+  * LocalFileAuthorizer: Permission operation
+
+* org.apache.iotdb.db.query.dataset.*
+
+  Result sets that are returned to the client in batches; they also contain part of the query logic.
+
+## Query process
+
+* Parse the SQL
+* Generate the logical plan
+* Generate the physical plan
+* Construct the result set generator
+* Return result sets in batches (a sketch of the whole flow follows)
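+
+To make the flow concrete, the following is a minimal, hedged sketch of how a query travels through these steps; the method names (`parseSQLToPhysicalPlan`, `processQuery`) are assumptions for illustration and may not match the actual signatures.
+
+```
+// Hypothetical sketch of the query flow inside TSServiceImpl; names are assumptions.
+PhysicalPlan plan = planner.parseSQLToPhysicalPlan(sql);         // parse SQL, logical plan, physical plan
+QueryDataSet dataSet = planExecutor.processQuery(plan, context); // construct the result set generator
+while (dataSet.hasNext()) {                                      // return result sets in batches
+  RowRecord record = dataSet.next();
+  // serialize the record into the RPC response ...
+}
+```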
+
+## Related documents
+
+* [Query Plan Generator](/#/SystemDesign/progress/chap2/sec2)
+* [PlanExecutor](/#/SystemDesign/progress/chap2/sec3)
diff --git a/docs/Documentation/SystemDesign/2-QueryEngine/2-Planner.md b/docs/Documentation/SystemDesign/2-QueryEngine/2-Planner.md
new file mode 100644
index 0000000..d2a63a0
--- /dev/null
+++ b/docs/Documentation/SystemDesign/2-QueryEngine/2-Planner.md
@@ -0,0 +1,65 @@
+<!--
+
+    Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements.  See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership.  The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License.  You may obtain a copy of the License at
+    
+        http://www.apache.org/licenses/LICENSE-2.0
+    
+    Unless required by applicable law or agreed to in writing,
+    software distributed under the License is distributed on an
+    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+    KIND, either express or implied.  See the License for the
+    specific language governing permissions and limitations
+    under the License.
+
+-->
+
+# Execution plan generator
+
+* org.apache.iotdb.db.qp.Planner
+
+Transforms the syntax tree produced by SQL parsing into a logical plan, applies logical optimizations, and generates the physical plan.
+
+## SQL parsing
+
+SQL parsing is implemented with Antlr4. The grammar file is:
+
+* server/src/main/antlr4/org/apache/iotdb/db/qp/strategy/SqlBase.g4
+
+Run `mvn clean compile` to regenerate the parser.
+
+Generated code location: server/target/generated-sources/antlr4
+
+## Logical plan generator
+
+* org.apache.iotdb.db.qp.strategy.LogicalGenerator
+
+## Logical plan optimizer
+
+There are currently four logical plan optimizers:
+
+* org.apache.iotdb.db.qp.strategy.optimizer.ConcatPathOptimizer
+
+  The path optimizer concatenates the query paths in SQL, interacts with MManager to remove wildcards, and performs path checking.
+
+* org.apache.iotdb.db.qp.strategy.optimizer.RemoveNotOptimizer
+
+  The NOT-removal optimizer eliminates the negation (NOT) operators in the predicate logic.
+
+* org.apache.iotdb.db.qp.strategy.optimizer.DnfFilterOptimizer
+
+  Turn predicates into disjunctive normal form.
+
+* org.apache.iotdb.db.qp.strategy.optimizer.MergeSingleFilterOptimizer
+
+  Combine predicates of the same path logically.
+
+## Physical plan generator
+
+* org.apache.iotdb.db.qp.strategy.PhysicalGenerator
+
diff --git a/docs/Documentation/SystemDesign/2-QueryEngine/3-PlanExecutor.md b/docs/Documentation/SystemDesign/2-QueryEngine/3-PlanExecutor.md
new file mode 100644
index 0000000..9977a5d
--- /dev/null
+++ b/docs/Documentation/SystemDesign/2-QueryEngine/3-PlanExecutor.md
@@ -0,0 +1,26 @@
+<!--
+
+    Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements.  See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership.  The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License.  You may obtain a copy of the License at
+    
+        http://www.apache.org/licenses/LICENSE-2.0
+    
+    Unless required by applicable law or agreed to in writing,
+    software distributed under the License is distributed on an
+    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+    KIND, either express or implied.  See the License for the
+    specific language governing permissions and limitations
+    under the License.
+
+-->
+
+# Plan executor
+
+* org.apache.iotdb.db.qp.executor.PlanExecutor
+
+Executes a physical plan.
\ No newline at end of file
diff --git a/docs/Documentation/SystemDesign/3-SchemaManager/1-SchemaManager.md b/docs/Documentation/SystemDesign/3-SchemaManager/1-SchemaManager.md
new file mode 100644
index 0000000..e5cc2fc
--- /dev/null
+++ b/docs/Documentation/SystemDesign/3-SchemaManager/1-SchemaManager.md
@@ -0,0 +1,26 @@
+<!--
+
+    Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements.  See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership.  The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License.  You may obtain a copy of the License at
+    
+        http://www.apache.org/licenses/LICENSE-2.0
+    
+    Unless required by applicable law or agreed to in writing,
+    software distributed under the License is distributed on an
+    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+    KIND, either express or implied.  See the License for the
+    specific language governing permissions and limitations
+    under the License.
+
+-->
+
+# Metadata management
+
+<img style="width:100%; max-width:800px; max-height:600px; margin-left:auto; margin-right:auto; display:block;" src="https://user-images.githubusercontent.com/19167280/73625246-fc3e8200-467e-11ea-8815-67b9c4ab716e.png">
+
+IoTDB manages metadata as a directory tree: the second-to-last level is the device level and the last level is the sensor (measurement) level. For example, in the path root.ln.wf01.wt01.status, wt01 is the device and status is the sensor.
\ No newline at end of file
diff --git a/docs/Documentation/SystemDesign/4-StorageEngine/1-StorageEngine.md b/docs/Documentation/SystemDesign/4-StorageEngine/1-StorageEngine.md
new file mode 100644
index 0000000..dc862c4
--- /dev/null
+++ b/docs/Documentation/SystemDesign/4-StorageEngine/1-StorageEngine.md
@@ -0,0 +1,68 @@
+<!--
+
+    Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements.  See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership.  The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License.  You may obtain a copy of the License at
+    
+        http://www.apache.org/licenses/LICENSE-2.0
+    
+    Unless required by applicable law or agreed to in writing,
+    software distributed under the License is distributed on an
+    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+    KIND, either express or implied.  See the License for the
+    specific language governing permissions and limitations
+    under the License.
+
+-->
+
+# Storage engine
+
+<img style="width:100%; max-width:800px; max-height:600px; margin-left:auto; margin-right:auto; display:block;" src="https://user-images.githubusercontent.com/19167280/73625255-03fe2680-467f-11ea-91ae-64407ef1125c.png">
+
+## Design ideas
+
+The storage engine is based on an LSM design. Data is first written to a memory buffer (memtable) and later flushed to disk. For each device, the engine keeps in memory the largest timestamp that has been flushed or is being flushed; incoming data is classified as sequential or out-of-order according to this timestamp, and the two kinds of data are kept in separate memtables and flushed into separate TsFiles.
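+
+As a rough illustration of this sequential / out-of-order split (a hedged sketch, not the actual implementation; the surrounding names are hypothetical, while latestFlushedTimeForEachDevice is the per-device map described later in the data partition chapter):
+
+```
+// Hypothetical sketch: classify an insertion as sequential or out-of-order.
+Map<String, Long> latestFlushedTimeForEachDevice = new HashMap<>();
+
+boolean isSequence(String deviceId, long insertTime) {
+  long lastFlushedTime =
+      latestFlushedTimeForEachDevice.getOrDefault(deviceId, Long.MIN_VALUE);
+  // Data newer than everything flushed (or being flushed) for this device is sequential;
+  // anything at or before that timestamp goes to the out-of-order memtable and TsFile.
+  return insertTime > lastFlushedTime;
+}
+```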
+
+Each data file TsFile corresponds to a file index information TsFileResource in memory for query use.
+
+In addition, the storage engine includes asynchronous persistence and file merge mechanisms.
+
+## Write process
+
+### Related code
+
+* org.apache.iotdb.db.engine.StorageEngine
+
+  Responsible for writing to and reading from an IoTDB instance, and manages all StorageGroupProcessors.
+
+* org.apache.iotdb.db.engine.storagegroup.StorageGroupProcessor
+
+  Responsible for writing and querying the data of one storage group.
+
+  Manages the TsFileProcessors of all of its time partitions.
+
+* org.apache.iotdb.db.engine.storagegroup.TsFileProcessor
+
+  Responsible for writing to and reading from a single TsFile.
+
+## Data write
+See details:
+* [Data write](/#/SystemDesign/progress/chap4/sec6)
+
+## Data access
+
+* Main entry point (StorageEngine): ```public QueryDataSource query(SingleSeriesExpression seriesExpression, QueryContext context, QueryFileManager filePathsManager)```
+
+	* Finds all sequential and out-of-order TsFileResources that contain this time series and returns them for use by the query engine (a sketch follows below)
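+
+A hedged sketch of calling this entry point is shown below; the filter construction uses the TsFile read API (`Path`, `TimeFilter`, `SingleSeriesExpression`) as an assumption, and `context` / `filePathsManager` are assumed to be available from the query layer.
+
+```
+// Hypothetical sketch: ask the StorageEngine for the data source of one series.
+Path seriesPath = new Path("root.sg1.d1.s1");
+SingleSeriesExpression expression =
+    new SingleSeriesExpression(seriesPath, TimeFilter.gtEq(100L));
+QueryDataSource dataSource =
+    StorageEngine.getInstance().query(expression, context, filePathsManager);
+// dataSource now holds the sequential and out-of-order TsFileResources of the series.
+```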
+
+## Related documents
+
+* [Write Ahead Log (WAL)](/#/SystemDesign/progress/chap4/sec2)
+
+* [Flush Memtable](/#/SystemDesign/progress/chap4/sec3)
+
+* [File merge mechanism](/#/SystemDesign/progress/chap4/sec4)
diff --git a/docs/Documentation/SystemDesign/4-StorageEngine/2-WAL.md b/docs/Documentation/SystemDesign/4-StorageEngine/2-WAL.md
new file mode 100644
index 0000000..f15b6fe
--- /dev/null
+++ b/docs/Documentation/SystemDesign/4-StorageEngine/2-WAL.md
@@ -0,0 +1,26 @@
+<!--
+
+    Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements.  See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership.  The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License.  You may obtain a copy of the License at
+    
+        http://www.apache.org/licenses/LICENSE-2.0
+    
+    Unless required by applicable law or agreed to in writing,
+    software distributed under the License is distributed on an
+    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+    KIND, either express or implied.  See the License for the
+    specific language governing permissions and limitations
+    under the License.
+
+-->
+
+# WAL
+
+## Related code
+
+* org.apache.iotdb.db.writelog.*
diff --git a/docs/Documentation/SystemDesign/4-StorageEngine/3-FlushManager.md b/docs/Documentation/SystemDesign/4-StorageEngine/3-FlushManager.md
new file mode 100644
index 0000000..21050e6
--- /dev/null
+++ b/docs/Documentation/SystemDesign/4-StorageEngine/3-FlushManager.md
@@ -0,0 +1,84 @@
+<!--
+
+    Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements.  See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership.  The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License.  You may obtain a copy of the License at
+    
+        http://www.apache.org/licenses/LICENSE-2.0
+    
+    Unless required by applicable law or agreed to in writing,
+    software distributed under the License is distributed on an
+    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+    KIND, either express or implied.  See the License for the
+    specific language governing permissions and limitations
+    under the License.
+
+-->
+
+# Flush Memtable
+
+## Design ideas
+
+After the memory buffer memtable reaches a certain threshold, it will be handed over to the FlushManager for asynchronous persistence without blocking normal writes. The persistence process is pipelined.
+
+## Related classes
+
+* org.apache.iotdb.db.engine.flush.FlushManager
+
+	The flush task manager for memtables.
+	
+* org.apache.iotdb.db.engine.flush.MemtableFlushTask
+
+	Flushes a single memtable.
+
+## FlushManager: Persistence manager
+
+FlushManager accepts memtable persistence tasks. There are two kinds of submitters: the first is a TsFileProcessor, and the second is the persistence subthread FlushThread.
+
+For each TsFileProcessor, only one flush task is executed at a time, although a TsFileProcessor may correspond to multiple memtables that need to be persisted.
+
+## MemTableFlushTask: Persistent task
+
+<img style="width:100%; max-width:800px; max-height:600px; margin-left:auto; margin-right:auto; display:block;" src="https://user-images.githubusercontent.com/19167280/73625254-03fe2680-467f-11ea-8197-115f3a749cbd.png">
+
+Background: Each memtable can contain multiple devices, and each device can contain multiple measurements.
+
+### Three threads
+
+A memtable's persistence process has three threads, and the main thread's work does not end until all tasks are completed.
+
+* MemTableFlushTask  Thread
+
+  The sorting thread (the main thread), responsible for sorting the chunk of each measurement and submitting encoding tasks to the encoding thread.
+
+* encodingTask Thread
+
+  The encoding thread is responsible for encoding each Chunk into a byte array.
+
+* ioTask Thread
+
+  The IO thread is responsible for persisting the encoded Chunk to the TsFile on the disk.
+
+### Two task queues
+
+The three threads interact through two task queues (a minimal sketch of the pipeline follows the lists below):
+
+* encodingTaskQueue: sorting thread -> encoding thread, carrying three kinds of tasks
+	
+	* StartFlushGroupIOTask: starts persisting a device (ChunkGroup); the encoding thread does not process this command and forwards it directly to the IO thread.
+	
+	* Pair\<TVList, MeasurementSchema\>: a chunk to be encoded.
+	
+	* EndChunkGroupIoTask: ends the persistence of a device (ChunkGroup); the encoding thread does not process this command and forwards it directly to the IO thread.
+
+* ioTaskQueue: encoding thread -> IO thread, carrying three kinds of tasks
+	
+	* StartFlushGroupIOTask: starts persisting a device (ChunkGroup).
+	
+	* IChunkWriter: persists an encoded chunk to disk.
+	
+	* EndChunkGroupIoTask: ends the persistence of a device (ChunkGroup).
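+
+A minimal, self-contained sketch of this three-thread / two-queue pipeline is given below. It only models the hand-off pattern (tasks plus an end marker passed through two blocking queues); the real MemTableFlushTask has different task types, ordering logic, and error handling.
+
+```
+import java.util.List;
+import java.util.concurrent.BlockingQueue;
+import java.util.concurrent.LinkedBlockingQueue;
+
+// Sketch of the flush pipeline: sorting thread -> encoding thread -> IO thread.
+public class FlushPipelineSketch {
+  private static final Object END = new Object(); // end marker ("EndChunkGroupIoTask" stand-in)
+  private final BlockingQueue<Object> encodingTaskQueue = new LinkedBlockingQueue<>();
+  private final BlockingQueue<Object> ioTaskQueue = new LinkedBlockingQueue<>();
+
+  public void flush(List<String> chunksOfOneDevice) throws InterruptedException {
+    Thread encodingThread = new Thread(() -> {
+      try {
+        Object task;
+        while ((task = encodingTaskQueue.take()) != END) {
+          // "Encode" the chunk; Start/End group commands would be forwarded untouched.
+          ioTaskQueue.put("encoded(" + task + ")");
+        }
+        ioTaskQueue.put(END);
+      } catch (InterruptedException e) {
+        Thread.currentThread().interrupt();
+      }
+    });
+    Thread ioThread = new Thread(() -> {
+      try {
+        Object task;
+        while ((task = ioTaskQueue.take()) != END) {
+          System.out.println("write to TsFile: " + task); // stands in for disk IO
+        }
+      } catch (InterruptedException e) {
+        Thread.currentThread().interrupt();
+      }
+    });
+    encodingThread.start();
+    ioThread.start();
+
+    // The sorting thread (here: the caller) submits one task per chunk, then the end marker.
+    for (String chunk : chunksOfOneDevice) {
+      encodingTaskQueue.put(chunk);
+    }
+    encodingTaskQueue.put(END);
+    encodingThread.join();
+    ioThread.join();
+  }
+}
+```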
diff --git a/docs/Documentation/SystemDesign/4-StorageEngine/4-MergeManager.md b/docs/Documentation/SystemDesign/4-StorageEngine/4-MergeManager.md
new file mode 100644
index 0000000..32ce98f
--- /dev/null
+++ b/docs/Documentation/SystemDesign/4-StorageEngine/4-MergeManager.md
@@ -0,0 +1,26 @@
+<!--
+
+    Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements.  See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership.  The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License.  You may obtain a copy of the License at
+    
+        http://www.apache.org/licenses/LICENSE-2.0
+    
+    Unless required by applicable law or agreed to in writing,
+    software distributed under the License is distributed on an
+    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+    KIND, either express or implied.  See the License for the
+    specific language governing permissions and limitations
+    under the License.
+
+-->
+
+# File merge mechanism
+
+## Related code
+
+* org.apache.iotdb.db.engine.merge.*
\ No newline at end of file
diff --git a/docs/Documentation/SystemDesign/4-StorageEngine/5-DataPartition.md b/docs/Documentation/SystemDesign/4-StorageEngine/5-DataPartition.md
new file mode 100644
index 0000000..62a6711
--- /dev/null
+++ b/docs/Documentation/SystemDesign/4-StorageEngine/5-DataPartition.md
@@ -0,0 +1,86 @@
+<!--
+
+    Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements.  See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership.  The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License.  You may obtain a copy of the License at
+    
+        http://www.apache.org/licenses/LICENSE-2.0
+    
+    Unless required by applicable law or agreed to in writing,
+    software distributed under the License is distributed on an
+    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+    KIND, either express or implied.  See the License for the
+    specific language governing permissions and limitations
+    under the License.
+
+-->
+
+# Data partition
+
+Time series data is partitioned at two levels: by storage group and by time range.
+
+## Storage group
+
+Storage groups are specified explicitly by the user with the statement "SET STORAGE GROUP TO". Each storage group has a corresponding StorageGroupProcessor.
+
+The main fields it has are:
+
+* Read-write lock: insertLock
+
+* Unclosed sequential file processors for each time partition: workSequenceTsFileProcessors
+
+* Unclosed out-of-order file processor corresponding to each time partition: workUnsequenceTsFileProcessors
+
+* Full sequential file list for this storage group (sorted by time): sequenceFileTreeSet
+
+* List of all out-of-order files for this storage group (unordered): unSequenceFileList
+
+* A map that records the last write time of each device. When sequential data is flushed, the time recorded in this map is used: latestTimeForEachDevice
+
+* A map that records the last flush time of each device, used to distinguish between sequential and out-of-order data: latestFlushedTimeForEachDevice
+
+* A version generator map corresponding to each time partition, which is convenient for determining the priority of different chunks when querying: timePartitionIdVersionControllerMap
+
+
+### Related code
+
+* src/main/java/org/apache/iotdb/db/engine/StorageEngine.java
+
+
+## Time range
+
+The data in the same storage group is further partitioned by the time range specified by the user. The relevant parameter is partition_interval, whose default is one week; that is, data from different weeks is placed in different partitions.
+
+### Implementation logic
+
+StorageGroupProcessor computes the partition of every inserted record to locate the corresponding TsFileProcessor, and the TsFile of each TsFileProcessor is placed in the folder of its time partition.
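+
+For illustration, the partition calculation amounts to integer division of the timestamp by the partition interval; the names below are hypothetical, and the real code also handles the configured time unit.
+
+```
+// Hypothetical sketch: map a timestamp (in ms) to a time partition ID, assuming
+// partition_interval is expressed in milliseconds (one week by default).
+static final long PARTITION_INTERVAL_MS = 7L * 24 * 60 * 60 * 1000;
+
+static long getTimePartitionId(long timestampMs) {
+  // Negative timestamps are ignored for simplicity.
+  return timestampMs / PARTITION_INTERVAL_MS;
+}
+```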
+
+### File structure
+
+The file structure after partitioning is as follows:
+
+data
+
+-- sequence
+
+---- [storage group name 1]
+
+------ [time partition ID 1]
+
+-------- xxxx.tsfile
+
+-------- xxxx.resource
+
+------ [time partition ID 2]
+
+---- [storage group name 2]
+
+-- unsequence
+
+### Related code
+
+* getOrCreateTsFileProcessorIntern method in src/main/java/org/apache/iotdb/db/engine/storagegroup/StorageGroupProcessor.java
diff --git a/docs/Documentation/SystemDesign/4-StorageEngine/6-DataManipulation.md b/docs/Documentation/SystemDesign/4-StorageEngine/6-DataManipulation.md
new file mode 100644
index 0000000..3c771ca
--- /dev/null
+++ b/docs/Documentation/SystemDesign/4-StorageEngine/6-DataManipulation.md
@@ -0,0 +1,95 @@
+<!--
+
+    Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements.  See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership.  The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License.  You may obtain a copy of the License at
+    
+        http://www.apache.org/licenses/LICENSE-2.0
+    
+    Unless required by applicable law or agreed to in writing,
+    software distributed under the License is distributed on an
+    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+    KIND, either express or implied.  See the License for the
+    specific language governing permissions and limitations
+    under the License.
+
+-->
+
+# Data addition, deletion and modification
+
+The following describes four common data manipulation operations, which are insert, update, delete, and TTL settings.
+
+## Data insertion
+
+### Write a single line of data (one device, one timestamp, multiple values)
+
+* Corresponding interfaces (a hedged JDBC example follows this list)
+  * JDBC's execute and executeBatch interfaces
+  * Session's insert and insertInBatch
+* Main entry point: ```public void insert(InsertPlan insertPlan)``` in StorageEngine.java
+  * Find the corresponding StorageGroupProcessor
+  * Find the corresponding TsFileProcessor according to the timestamp of the written data and the last flushed timestamp of the device
+  * Write the write-ahead log
+  * Write to the memtable of the corresponding TsFileProcessor
+      * If the file is an out-of-order file, update the endTimeMap in the TsFileResource
+      * If the TsFile contains no information about this device yet, update the startTimeMap in the TsFileResource
+  * Decide, based on the memtable size, whether to trigger an asynchronous memtable flush
+      * If it is a sequential file and a flush is triggered, update the endTimeMap in the TsFileResource
+  * Decide, based on the size of the TsFile currently on disk, whether to trigger a file close operation
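+
+A hedged example of the JDBC path is shown below; it assumes the standard IoTDB JDBC driver class name, the default port 6667, and that the storage group and time series already exist.
+
+```
+import java.sql.Connection;
+import java.sql.DriverManager;
+import java.sql.Statement;
+
+public class JdbcInsertExample {
+  public static void main(String[] args) throws Exception {
+    Class.forName("org.apache.iotdb.jdbc.IoTDBDriver");
+    try (Connection connection =
+            DriverManager.getConnection("jdbc:iotdb://127.0.0.1:6667/", "root", "root");
+         Statement statement = connection.createStatement()) {
+      // One device, one timestamp, two measurement values.
+      statement.execute(
+          "INSERT INTO root.sg1.d1(timestamp, s1, s2) VALUES (1588830649000, 1.0, 20)");
+    }
+  }
+}
+```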
+
+### Batch data (multiple timestamp multiple values for one device) write
+
+* Corresponding interface
+	* Session's insertBatch
+
+* Main entry point: ```public Integer[] insertBatch(BatchInsertPlan batchInsertPlan)``` in StorageEngine.java
+    * Find the corresponding StorageGroupProcessor
+	* According to the timestamps of this batch and the last flushed timestamp of the device, split the batch into smaller batches, each of which corresponds to one TsFileProcessor
+	* Write the write-ahead log
+	* Write each small batch to the memtable of the corresponding TsFileProcessor
+	    * If the file is an out-of-order file, update the endTimeMap in the TsFileResource
+	    * If the TsFile contains no information about this device yet, update the startTimeMap in the TsFileResource
+	* Decide, based on the memtable size, whether to trigger an asynchronous memtable flush
+	    * If it is a sequential file and a flush is triggered, update the endTimeMap in the TsFileResource
+	* Decide, based on the size of the TsFile currently on disk, whether to trigger a file close operation
+
+
+## Data Update
+
+In-place updates, that is, UPDATE statements, are currently not supported. Users can instead insert new data directly: for the same time series at the same time point, the most recently inserted value takes effect.
+Old data is removed automatically by the merge process, see:
+
+* [File merge mechanism](/#/SystemDesign/progress/chap4/sec4)
+
+## Data deletion
+
+* Corresponding interface
+  * JDBC's execute interface, using delete SQL statements
+
+* Main entry point: ```public void delete(String deviceId, String measurementId, long timestamp)``` in StorageEngine.java
+    * Find the corresponding StorageGroupProcessor
+    * Find all affected TsFileProcessors
+    * Write the write-ahead log
+    * Find all affected TsFileResources
+    * Record the deletion time point in the corresponding mods file
+    * If the file is not yet closed (its TsFileProcessor still exists), also delete the data in memory
+
+
+## Data TTL setting
+
+* Corresponding interface
+	* JDBC's execute interface, using the SET TTL statement
+
+* Main entry point: ```public void setTTL(String storageGroup, long dataTTL)``` in StorageEngine.java
+    * Find the corresponding StorageGroupProcessor
+    * Set the new data TTL in the StorageGroupProcessor
+    * Check the TTL of all TsFileResources
+    * If a file has expired under the current TTL, delete the file
+
+In addition, StorageEngine starts a thread that periodically checks file TTLs; a minimal sketch of the expiry check follows.
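+
+A hedged sketch of the expiry check for a single file is given below; `fileEndTime` stands for the largest timestamp recorded in a TsFileResource, and the names are hypothetical.
+
+```
+// Hypothetical sketch of the TTL expiry check for one closed file.
+boolean isExpired(long fileEndTime, long dataTTL) {
+  // A file whose newest data is older than "now - TTL" can be removed entirely.
+  return dataTTL != Long.MAX_VALUE
+      && System.currentTimeMillis() - fileEndTime > dataTTL;
+}
+```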
+
+- start method in src/main/java/org/apache/iotdb/db/engine/StorageEngine.java
\ No newline at end of file
diff --git a/docs/Documentation/SystemDesign/5-DataQuery/1-DataQuery.md b/docs/Documentation/SystemDesign/5-DataQuery/1-DataQuery.md
new file mode 100644
index 0000000..a2a76b2
--- /dev/null
+++ b/docs/Documentation/SystemDesign/5-DataQuery/1-DataQuery.md
@@ -0,0 +1,40 @@
+<!--
+
+    Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements.  See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership.  The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License.  You may obtain a copy of the License at
+    
+        http://www.apache.org/licenses/LICENSE-2.0
+    
+    Unless required by applicable law or agreed to in writing,
+    software distributed under the License is distributed on an
+    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+    KIND, either express or implied.  See the License for the
+    specific language governing permissions and limitations
+    under the License.
+
+-->
+
+# Data query
+
+There are several types of data queries
+
+* Raw data query
+* Aggregate query
+* Downsampling query
+* Single-point fill query (filling in null values)
+* Latest data query
+
+To support these queries, the IoTDB query engine provides a basic query component for a single time series, on top of which the various query functions are implemented.
+
+## Related documents
+
+* [Basic query components](/#/SystemDesign/progress/chap5/sec2)
+* [Raw data query](/#/SystemDesign/progress/chap5/sec3)
+* [Aggregate query](/#/SystemDesign/progress/chap5/sec4)
+* [Downsampling query](/#/SystemDesign/progress/chap5/sec5)
+* [Recent timestamp query](/#/SystemDesign/progress/chap5/sec6)
diff --git a/docs/Documentation/SystemDesign/5-DataQuery/2-SeriesReader.md b/docs/Documentation/SystemDesign/5-DataQuery/2-SeriesReader.md
new file mode 100644
index 0000000..9ac7c1a
--- /dev/null
+++ b/docs/Documentation/SystemDesign/5-DataQuery/2-SeriesReader.md
@@ -0,0 +1,384 @@
+<!--
+
+    Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements.  See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership.  The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License.  You may obtain a copy of the License at
+    
+        http://www.apache.org/licenses/LICENSE-2.0
+    
+    Unless required by applicable law or agreed to in writing,
+    software distributed under the License is distributed on an
+    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+    KIND, either express or implied.  See the License for the
+    specific language governing permissions and limitations
+    under the License.
+
+-->
+
+# Query basic components
+
+## Design principle
+
+The IoTDB server module provides three different reading interfaces for a single time series to support different forms of queries.
+
+* The raw data query interface returns BatchData and accepts a time filter or a value filter, but not both at the same time.
+* Aggregation query interface (mainly used for aggregation query and downsampling query)
+* Interface for querying corresponding values by increasing timestamp (mainly used for queries with value filtering)
+
+## Related interfaces
+
+These three ways of reading a single time series correspond to three interfaces in the code.
+
+### org.apache.iotdb.tsfile.read.reader.IBatchReader
+
+#### Main method
+
+```
+// Determine if there is still BatchData
+boolean hasNextBatch() throws IOException;
+
+// Get the next BatchData and move the cursor back
+BatchData nextBatch() throws IOException;
+```
+
+#### General use process
+
+```
+while (batchReader.hasNextBatch()) {
+	BatchData batchData = batchReader.nextBatch();
+	
+	// use batchData to do some work
+	...
+}
+```
+
+### org.apache.iotdb.db.query.reader.series.IAggregateReader
+
+#### Main method
+
+```
+// Determine if there is still Chunk
+boolean hasNextChunk() throws IOException;
+
+// Determine if you can use the current Chunk statistics
+boolean canUseCurrentChunkStatistics();
+
+// Get statistics for the current Chunk
+Statistics currentChunkStatistics();
+
+// Skip the current Chunk
+void skipCurrentChunk();
+
+// Determine if the current Chunk has a next Page
+boolean hasNextPage() throws IOException;
+
+// Determine if the statistics of the current Page can be used
+boolean canUseCurrentPageStatistics() throws IOException;
+
+// Get statistics for the current Page
+Statistics currentPageStatistics() throws IOException;
+
+// Skip the current Page
+void skipCurrentPage();
+
+// Get data for the current Page
+BatchData nextPage() throws IOException;
+```
+
+#### General use process
+
+```
+while (aggregateReader.hasNextChunk()) {
+  if (aggregateReader.canUseCurrentChunkStatistics()) {
+    Statistics chunkStatistics = aggregateReader.currentChunkStatistics();
+    
+    // Calculate with the statistics of the chunk layer
+    ...
+    
+    aggregateReader.skipCurrentChunk();
+    continue;
+  }
+  
+  // Consume the pages in the current chunk
+  while (aggregateReader.hasNextPage()) {
+	 if (aggregateReader.canUseCurrentPageStatistics()) {
+	   // Can use statistics
+	   Statistics pageStatistic = aggregateReader.currentPageStatistics();
+	   
+	   // Calculate with page-level statistics
+	   ...
+	  
+	   aggregateReader.skipCurrentPage();
+	   continue;
+	 } else {
+	   // Can't use statistics, need to calculate with data
+	   BatchData batchData = aggregateReader.nextPage();
+	   
+	   // Calculate with batchData
+	   ...
+	 }
+  }
+}
+```
+
+### org.apache.iotdb.db.query.reader.IReaderByTimestamp
+
+#### Main method
+
+``` 
+// Get the value at the given timestamp, or null if none exists
+// (requires that the timestamps passed in are monotonically increasing)
+Object getValueInTimestamp(long timestamp) throws IOException;
+
+// Given a batch of timestamps, return a batch of values (reduces the number of method calls)
+Object[] getValuesInTimestamps(long[] timestamps) throws IOException;
+```
+
+#### General use process
+
+This interface is used in queries with value filtering. After TimeGenerator generates a timestamp, use this interface to obtain the value corresponding to the timestamp.
+
+```
+Object value = readerByTimestamp.getValueInTimestamp(timestamp);
+
+// or
+
+Object[] values = readerByTimestamp.getValuesInTimestamps(timestamps);
+```
+
+## Concrete implementation class
+
+The above three interfaces all have their corresponding implementation classes. As the above three queries have similarities, we have designed a basic SeriesReader tool class that encapsulates the basic methods for a time series read operation to help implement the above three interfaces. The following first introduces the design principle of the SeriesReader, and then introduces the specific implementation of the three interfaces in turn.
+
+### org.apache.iotdb.db.query.reader.series.SeriesReader
+
+#### Design ideas
+
+Background knowledge: a TsFile (TsFileResource) can be unpacked to get ChunkMetadata, a ChunkMetadata can be unpacked to get a set of PageReaders, and a PageReader can directly return BatchData data points.
+
+To support the above three interfaces:
+
+The data is divided into four granularities: file, chunk, page, and intersecting data points. In a raw data query, the largest unit returned is a page; if pages overlap one another because of out-of-order writes, they are broken down into data points and merged. An aggregate query first uses chunk statistics, then page statistics, and finally the intersecting data points.
+
+The design principle is to use the larger granularity instead of the smaller granularity.
+
+First introduce some important fields in SeriesReader
+
+```
+
+/*
+ * File layer
+ */
+private final List<TsFileResource> seqFileResource;
+	Sequential file list, because the sequential file itself is guaranteed to be ordered, and the timestamps do not overlap each other, just use List to store
+	
+private final PriorityQueue<TsFileResource> unseqFileResource;
+	Out-of-order file list, because out-of-order files do not guarantee order between each other, and may overlap
+	
+/*
+ * chunk layer
+ * 
+ * The data between the three fields is never duplicated, and first is always the first (minimum start time)
+ */
+private ChunkMetaData firstChunkMetaData;
+	This field is filled first when filling the chunk layer to ensure that this chunk has the current minimum start time
+	
+private final List<ChunkMetaData> seqChunkMetadatas;
+	The ChunkMetaData obtained after the sequential files are unpacked is stored here. It is ordered and does not overlap with each other, so the List is used for storage.
+
+private final PriorityQueue<ChunkMetaData> unseqChunkMetadatas;
+	The ChunkMetaData obtained after out-of-order files are unpacked is stored here; the chunks may overlap one another, so a priority queue is used to keep them ordered
+	
+/*
+ * page layer
+ *
+ * The data between the two fields is never duplicated, and first is always the first (minimum start time)
+ */ 
+private VersionPageReader firstPageReader;
+	Page reader with the smallest start time
+	
+private PriorityQueue<VersionPageReader> cachedPageReaders;
+	All page readers currently acquired, sorted by the start time of each page
+	
+/*
+ * Intersecting data point layer
+ */ 
+private PriorityMergeReader mergeReader;
+	Essentially, there are multiple pages with priority, and the data points are output from low to high according to the timestamp. When the timestamps are the same, the high priority page is retained.
+
+/*
+ * Caching of results from intersecting data points
+ */ 
+private boolean hasCachedNextOverlappedPage;
+	Whether the next batch is cached
+	
+private BatchData cachedBatchData;
+	Cached reference to the next batch
+```
+
+The following describes the important methods in SeriesReader
+
+#### hasNextChunk()
+
+* Main function: determine whether the time series has the next chunk.
+
+* Constraint: Before calling this method, you need to ensure that there is no page- or data-point-level data left in the `SeriesReader`, that is, all previously unpacked chunks have been consumed.
+
+* Implementation: If `firstChunkMetaData` is not empty, the first `ChunkMetaData` is already cached and not yet used, so return `true` directly;
+
+  otherwise, try to unpack the first sequential file and the first out-of-order file to fill the chunk layer, and unpack all files that overlap with `firstChunkMetadata`.
+
+#### isChunkOverlapped()
+
+* Main function: determine whether the current chunk overlaps with other Chunk
+
+* Constraint: Before calling this method, make sure that the chunk layer has cached `firstChunkMetadata`, that is, hasNextChunk() has been called and returned true.
+
+* Implementation: Compare `firstChunkMetadata` with `seqChunkMetadatas` and `unseqChunkMetadatas` directly; this is sufficient because it is guaranteed that all files intersecting `firstChunkMetadata` have already been unpacked.
+
+#### currentChunkStatistics()
+
+Returns statistics for `firstChunkMetaData`.
+
+#### skipCurrentChunk()
+
+Skip the current chunk. Just set `firstChunkMetaData` to `null`.
+
+#### hasNextPage()
+
+* Main function: determine whether the SeriesReader already has unpacked pages. If there are intersecting pages, construct `cachedBatchData` and cache it; otherwise cache `firstPageReader`.
+
+* Implementation: If `cachedBatchData` is already cached, or there are intersecting data points from which a `cachedBatchData` can be constructed, or `firstPageReader` is already cached, return directly.
+
+	If the current `firstChunkMetadata` has not been unpacked yet, unpack it together with all ChunkMetadata that overlap with it to construct `firstPageReader`.
+	
+	If `firstPageReader` intersects with pages in `cachedPageReaders`, construct `cachedBatchData`; otherwise return directly.
+
+#### isPageOverlapped()
+
+* Main function: determine whether the current page overlaps with other pages
+
+* Constraint: Before calling this method, you need to ensure that hasNextPage() has been called and returned true, that is, either an intersecting `cachedBatchData` or a disjoint `firstPageReader` is cached.
+
+* Implementation: First check whether there is a `cachedBatchData`; if not, the current page does not intersect with others and there is no data in `mergeReader`. Then check whether `firstPageReader` intersects with any page in `cachedPageReaders`.
+
+#### currentPageStatistics()
+
+Returns statistics for `firstPageReader`.
+
+#### skipCurrentPage()
+
+Skip the current Page. Just set `firstPageReader` to null.
+
+#### nextPage()
+
+* Main function: return the next intersecting or non-overlapping page
+
+* Constraint: Before calling this method, you need to ensure that hasNextPage() has been called and returned true, that is, either an intersecting `cachedBatchData` or a disjoint `firstPageReader` is cached.
+
+* Implementation: If `hasCachedNextOverlappedPage` is true, an intersecting page is cached and `cachedBatchData` is returned directly. Otherwise, the current page does not intersect with others, and its data is taken directly from `firstPageReader`.
+
+#### hasNextOverlappedPage()
+
+* Main function: internal method used to determine whether there is currently overlapping data, and to construct intersecting pages and cache them.
+
+* Implementation: If `hasCachedNextOverlappedPage` is `true`, return `true` directly.
+
+	Otherwise, first call the `tryToPutAllDirectlyOverlappedPageReadersIntoMergeReader()` method to put all of the cachedPageReaders that overlap with the firstPageReader into the mergeReader. `mergeReader` maintains a `currentLargestEndTime` variable, which is updated each time a new reader is added, to record the maximum end time of the pages currently added to `mergeReader`.
+	Then take the current maximum end time out of `mergeReader` as the end time of the first batch of data, record it as `currentPageEndTime`, and iterate over `mergeReader` until the current timestamp is greater than `currentPageEndTime`.
+	
+	Before taking a point out of mergeReader, we must first check whether there is a file, chunk, or page that overlaps with the current timestamp (the reason for another check here is that, for example, the current page may be 1-30, the page directly intersecting it may be 20-50, and there may be another page 40-60; every time a point is taken, 40-60 may also need to be unpacked). If so, the corresponding file, chunk, or page is unpacked and put into `mergeReader`. After the overla [...]
+
+	After completing the iteration, the data is cached in `cachedBatchData`, and `hasCachedNextOverlappedPage` is set to `true`.
+
+#### nextOverlappedPage()
+
+Return the cached `cachedBatchData` and set `hasCachedNextOverlappedPage` to `false`.
+
+### org.apache.iotdb.db.query.reader.series.SeriesRawDataBatchReader
+
+`SeriesRawDataBatchReader` implements `IBatchReader`.
+
+The core judgment flow of its method `hasNextBatch()` is:
+
+```
+// There are cached batches, return directly
+if (hasCachedBatchData) {
+  return true;
+}
+
+/*
+ * If there is still page data in the SeriesReader, return it
+ */
+if (readPageData()) {
+  hasCachedBatchData = true;
+  return true;
+}
+
+/*
+ * Otherwise, while there are still chunks, try to read page data from them
+ */
+while (seriesReader.hasNextChunk()) {
+  if (readPageData()) {
+    hasCachedBatchData = true;
+    return true;
+  }
+}
+return hasCachedBatchData;
+```
+
+### org.apache.iotdb.db.query.reader.series.SeriesReaderByTimestamp
+
+`SeriesReaderByTimestamp` implements `IReaderByTimestamp`.
+
+Design idea: When querying the value at a timestamp, the timestamp can be converted into a filter condition time >= x. This filter is updated continuously, and files, chunks, and pages that cannot satisfy it are skipped.
+
+Implementation:
+
+```
+/*
+ * First check the page data that is already unpacked, skipping pages that cannot contain the queried timestamp
+ */
+if (readPageData(timestamp)) {
+  return true;
+}
+
+/*
+ * Check whether the next chunk may contain the queried timestamp; skip it if it cannot
+ */
+while (seriesReader.hasNextChunk()) {
+  Statistics statistics = seriesReader.currentChunkStatistics();
+  if (!satisfyTimeFilter(statistics)) {
+    seriesReader.skipCurrentChunk();
+    continue;
+  }
+  /*
+   * The chunk cannot be skipped; continue to check the pages inside it
+   */
+  if (readPageData(timestamp)) {
+    return true;
+  }
+}
+return false;
+```
+
+### org.apache.iotdb.db.query.reader.series.SeriesAggregateReader
+
+`SeriesAggregateReader` implements `IAggregateReader`.
+
+Most interface methods of `IAggregateReader` have corresponding implementations in `SeriesReader`, except for `canUseCurrentChunkStatistics()` and `canUseCurrentPageStatistics()`.
+
+#### canUseCurrentChunkStatistics()
+
+Design idea: the statistics can be used only if the current chunk does not overlap with other chunks and its statistics satisfy the filter conditions.
+
+First call the `currentChunkStatistics()` method of `SeriesReader` to obtain the statistics of the current chunk, then call the `isChunkOverlapped()` method of `SeriesReader` to determine whether the current chunk overlaps with others. If the chunk does not overlap and its statistics satisfy the filter, return `true`; otherwise return `false`.
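+
+Expressed as code, the check is roughly the following (a hedged sketch; `containStartEndTime` is assumed to be the filter method that tests whether a whole time range satisfies the filter):
+
+```
+// Hypothetical sketch of canUseCurrentChunkStatistics().
+Statistics chunkStatistics = seriesReader.currentChunkStatistics();
+boolean canUse =
+    !seriesReader.isChunkOverlapped()
+        && (timeFilter == null
+            || timeFilter.containStartEndTime(
+                chunkStatistics.getStartTime(), chunkStatistics.getEndTime()));
+```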
+
+#### canUseCurrentPageStatistics()
+
+Design idea: the statistics can be used only if the current page does not overlap with other pages and its statistics satisfy the filter conditions.
+
+First call the `currentPageStatistics()` method of `SeriesReader` to obtain the statistics of the current page, then call the `isPageOverlapped()` method of `SeriesReader` to determine whether the current page overlaps with others. If the page does not overlap and its statistics satisfy the filter, return `true`; otherwise return `false`.
\ No newline at end of file
diff --git a/docs/Documentation/SystemDesign/5-DataQuery/3-RawDataQuery.md b/docs/Documentation/SystemDesign/5-DataQuery/3-RawDataQuery.md
new file mode 100644
index 0000000..3c1a551
--- /dev/null
+++ b/docs/Documentation/SystemDesign/5-DataQuery/3-RawDataQuery.md
@@ -0,0 +1,303 @@
+<!--
+
+    Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements.  See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership.  The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License.  You may obtain a copy of the License at
+    
+        http://www.apache.org/licenses/LICENSE-2.0
+    
+    Unless required by applicable law or agreed to in writing,
+    software distributed under the License is distributed on an
+    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+    KIND, either express or implied.  See the License for the
+    specific language governing permissions and limitations
+    under the License.
+
+-->
+
+# Raw data query
+
+## Design principle
+
+The raw data query can be divided into two categories based on whether it contains value filtering conditions.  When no value filter is included, it can be divided into two categories based on the result set structure.
+
+* No value filter (no filter or only time filter)
+	* Result set aligned by timestamp (default raw data query)
+	* The result set is not aligned with the timestamp (disable align)
+* Include value filters
+	* Result set aligned by timestamp
+
+The above three queries correspond to three different DataSets in the code and encapsulate the execution logic of these three queries.
+
+## No value filter + result set aligned by timestamp
+
+### org.apache.iotdb.db.query.dataset.RawQueryDataSetWithoutValueFilter
+
+`RawQueryDataSetWithoutValueFilter` implements query logic that has no value filtering conditions and needs to be aligned according to the timestamp.  Although the final query results require that each time series be aligned according to the timestamp, each time series query can be parallelized.  Here, with the idea of a consumer-producer queue, the operation of obtaining data for each time series is decoupled from the operation of finally aligning all time series.  Each time series corr [...]
+
+In the implementation, considering the resource constraints of the machine, instead of creating a thread for every time series of every query, a thread pool is used: the producer task of each time series is submitted to the thread pool as a Runnable task.
+
+The producer code is introduced first. It is encapsulated in ReadTask, an inner class of RawQueryDataSetWithoutValueFilter that implements the Runnable interface.
+
+### org.apache.iotdb.db.query.dataset.RawQueryDataSetWithoutValueFilter.ReadTask
+
+`ReadTask` has two fields
+
+* private final ManagedSeriesReader reader;
+* private BlockingQueue<BatchData> blockingQueue;
+
+The `ManagedSeriesReader` interface inherits the IBatchReader interface, which is mainly used to read data from a single time series, and adds the following four methods
+
+```
+boolean isManagedByQueryManager();
+
+void setManagedByQueryManager(boolean managedByQueryManager);
+
+boolean hasRemaining();
+
+void setHasRemaining(boolean hasRemaining);
+```
+
+The first two methods indicate whether the producer task of the time series is managed by the query manager, that is, whether the producer task has exited on its own because the blocking queue is full (the following explains why it exits directly instead of blocking); the latter two methods indicate whether the reader of the time series still has data.
+
+`blockingQueue` is the blocking queue of the producer task. The queue blocks only on the consumer side, when the consumer fetches data; when the producer puts data and finds the queue full, it exits directly instead of blocking.
+
+Let's take a look at the `run()` method of `ReadTask`; the execution process is explained in the comments in the code.
+
+#### run()
+
+```
+public void run() {
+  try {
+    // The reason for the lock here is to ensure that the judgment of the fullness of the blockingQueue is correctly synchronized.
+    synchronized (reader) {
+      // A producer task is only submitted (whether the producer resubmits itself recursively or the consumer finds that the producer has exited and submits it) after checking that the queue is not full,
+      // So once the producer task is submitted, there must be a free space in the blockingQueue, we do not need to check whether the queue is full
+      // If the reader corresponding to the time series still has data, enter the loop body
+      while (reader.hasNextBatch()) {
+        BatchData batchData = reader.nextBatch();
+        // Since the BatchData obtained may be empty, it needs to iterate to the first BatchData that is not empty
+        if (batchData.isEmpty()) {
+          continue;
+        }
+        // Put the non-empty batchData into the blocking queue. At this point the blocking queue cannot be full, so this will not block.
+        blockingQueue.put(batchData);
+        // If the blocking queue is still not full, the producer task recursively submits itself to the thread pool for the next batchData
+        if (blockingQueue.remainingCapacity() > 0) {
+          pool.submit(this);
+        }
+        // If the blocking queue is full, the producer task exits and the managedByQueryManager corresponding to the reader is set to false
+        else {
+          reader.setManagedByQueryManager(false);
+        }
+        return;
+      }
+      // The code is executed here, which means that the previous while loop condition is not satisfied, that is, there is no data in the reader corresponding to the time series.
+      // We put a SignalBatchData in the blocking queue to inform consumers that there is no more data in this time series, and there is no need to fetch data from the queue corresponding to this time series
+      blockingQueue.put(SignalBatchData.getInstance());
+      // Set the reader's hasRemaining field to false
+      // Inform consumers that they no longer need to submit producer tasks for this time series
+      reader.setHasRemaining(false);
+      // Set the reader's managedByQueryManager field to false
+      reader.setManagedByQueryManager(false);
+    }
+  } catch (InterruptedException e) {
+    LOGGER.error("Interrupted while putting into the blocking queue: ", e);
+    Thread.currentThread().interrupt();
+  } catch (IOException e) {
+    LOGGER.error("Something gets wrong while reading from the series reader: ", e);
+  } catch (Exception e) {
+    LOGGER.error("Something gets wrong: ", e);
+  }
+}
+```
+
+Next, the consumer code. The main logic of the consumer is to take values from the queue of each time series, align them by timestamp, and assemble the result set. Timestamp alignment is achieved through a minimum heap of timestamps: if a time series has a value at the timestamp on top of the heap, that value is taken out; otherwise the value of that time series at this timestamp is set to `null`.
+
+First introduce some important fields of consumer tasks
+
+* TreeSet<Long> timeHeap
+
+  The smallest heap of timestamps for timestamp alignment
+
+* BlockingQueue<BatchData>[] blockingQueueArray;
+
+  An array of blocking queues to store the blocking queues corresponding to each time series
+
+* boolean[] noMoreDataInQueueArray
+
+  Indicates whether the blocking queue of a time series has no more values. Once it is true, the consumer no longer calls the `take()` method for that series, to prevent the consumer thread from being blocked.
+  
+* BatchData[] cachedBatchDataArray
+
+  Caches a BatchData fetched from the blocking queue, because a `BatchData` obtained by `take()` usually cannot be consumed all at once, so it needs to be cached
+  
+
+The `init()` method is called first, in the constructor of the consumer `RawQueryDataSetWithoutValueFilter`
+
+#### init()
+
+```
+private void init() throws InterruptedException {
+	timeHeap = new TreeSet<>();
+	// Build producer tasks for each time series
+	for (int i = 0; i < seriesReaderList.size(); i++) {
+	  ManagedSeriesReader reader = seriesReaderList.get(i);
+	  reader.setHasRemaining(true);
+	  reader.setManagedByQueryManager(true);
+	  pool.submit(new ReadTask(reader, blockingQueueArray[i]));
+	}
+	// Initialize the minimum heap and fill the cache for each time series
+	for (int i = 0; i < seriesReaderList.size(); i++) {
+	  // Call fillCache (int) method to fill the cache
+	  fillCache(i);
+	  // Try to put the current minimum timestamp of each time series into the heap
+	  if (cachedBatchDataArray[i] != null && cachedBatchDataArray[i].hasCurrent()) {
+	    long time = cachedBatchDataArray[i].currentTime();
+	    timeHeap.add(time);
+	  }
+	}
+}
+```
+
+####  fillCache(int)
+
+This method is responsible for fetching data from the blocking queue and filling the cache. For the specific logic, see the note below.
+
+```
+private void fillCache(int seriesIndex) throws InterruptedException {
+    // Get data from the blocking queue, if there is no data, it will block waiting for data in the queue
+	BatchData batchData = blockingQueueArray[seriesIndex].take();
+	// If it is a signal BatchData, set noMoreDataInQueueArray of the corresponding time series to true
+	if (batchData instanceof SignalBatchData) {
+	  noMoreDataInQueueArray[seriesIndex] = true;
+	}
+	else {
+	  // Cache the retrieved BatchData into cachedBatchDataArray
+	  cachedBatchDataArray[seriesIndex] = batchData;
+	
+	  // The reason for locking here is the same as that of the producer task, in order to ensure that the judgment of the fullness of the blockingQueue is correctly synchronized.
+	  synchronized (seriesReaderList.get(seriesIndex)) {
+	    // Only when the blocking queue is not full, do we need to determine whether it is necessary to submit the producer task. This also guarantees that the producer task will be submitted if and only if the blocking queue is not full.
+	    if (blockingQueueArray[seriesIndex].remainingCapacity() > 0) {
+	      ManagedSeriesReader reader = seriesReaderList.get(seriesIndex);
+	      // If the reader of the time series is not managed by the query manager (that is, the producer task exits because the queue is full), and there is still data in the reader, we need to submit the producer task of the time series again
+	      if (!reader.isManagedByQueryManager() && reader.hasRemaining()) {
+	        reader.setManagedByQueryManager(true);
+	        pool.submit(new ReadTask(reader, blockingQueueArray[seriesIndex]));
+	      }
+	    }
+	  }
+	}
+}
+```
+
+With the data of each time series available, the next step is to align the data by timestamp and assemble the results into a TSQueryDataSet to return. The logic is encapsulated in the `fillBuffer()` method. This method also contains the limit and offset logic and the formatting of the result set; we do not go into those details here and only analyze the process of data reading and timestamp alignment.
+
+```
+// Fetch the current timestamp from the smallest heap
+long minTime = timeHeap.pollFirst();
+for (int seriesIndex = 0; seriesIndex < seriesNum; seriesIndex++) {
+	if (cachedBatchDataArray[seriesIndex] == null
+	    || !cachedBatchDataArray[seriesIndex].hasCurrent()
+	    || cachedBatchDataArray[seriesIndex].currentTime() != minTime) {
+	  // The time series has no data at the current timestamp and is set to null
+	  ...
+	  
+	} else {
+	  // The time series has data at the current timestamp, and the data is formatted into a result set format
+	  TSDataType type = cachedBatchDataArray[seriesIndex].getDataType();
+	  ...
+	  
+	}
+		
+  // Move the batchdata cursor of this time series buffer back
+  cachedBatchDataArray[seriesIndex].next();
+	
+  // If the currently cached batchdata is empty and the blocking queue still has data, call the fillCache () method again to fill the cache
+  if (!cachedBatchDataArray[seriesIndex].hasCurrent()
+      && !noMoreDataInQueueArray[seriesIndex]) {
+    fillCache(seriesIndex);
+  }
+	
+  // Try to put the next timestamp of that time series into the smallest heap
+  if (cachedBatchDataArray[seriesIndex].hasCurrent()) {
+    long time = cachedBatchDataArray[seriesIndex].currentTime();
+    timeHeap.add(time);
+  }
+}
+```
+
+## No value filter + result set is not aligned by timestamp
+
+### org.apache.iotdb.db.query.dataset.NonAlignEngineDataSet
+
+`NonAlignEngineDataSet` implements the query logic that has no value filter and does not need to be aligned by timestamp. The query logic is similar to RawQueryDataSetWithoutValueFilter, but its consumer logic is simpler because no timestamp alignment is needed. Each producer task can also do more work: it not only takes BatchData out of the Reader, but also goes further and formats the retrieved BatchData into the output required by the result set, [...]
+
+The specific query logic is not repeated here. You can refer to the query logic analysis of RawQueryDataSetWithoutValueFilter.
+
+## Include value filter + result set aligned by timestamp
+
+### org.apache.iotdb.db.query.dataset.EngineDataSetWithValueFilter
+
+`EngineDataSetWithValueFilter` implements query logic with value filter conditions.
+
+Its query logic is to first generate timestamps that satisfy the filter condition, then query the values of the projected columns at those timestamps, and finally return the result set. It has four fields
+
+* private EngineTimeGenerator timeGenerator;
+
+  Used to generate timestamps that satisfy the filter
+  
+* private List<IReaderByTimestamp> seriesReaderByTimestampList;
+
+  Reader for each time series, used to get data based on timestamp
+
+* private boolean hasCachedRowRecord;
+
+  Whether data rows are currently cached
+  
+* private RowRecord cachedRowRecord;
+
+  The currently cached data row
+  
+
+Its main query logic is encapsulated in the `cacheRowRecord()` method. For detailed analysis, see the comments in the code.
+
+#### cacheRowRecord()
+
+```
+private boolean cacheRowRecord() throws IOException {
+   // Determine if there is a next eligible timestamp
+	while (timeGenerator.hasNext()) {
+	  boolean hasField = false;
+	  // Get the current eligible timestamp
+	  long timestamp = timeGenerator.next();
+	  RowRecord rowRecord = new RowRecord(timestamp);
+	  for (int i = 0; i < seriesReaderByTimestampList.size(); i++) {
+	    // Get the value of each time series at the current timestamp
+	    IReaderByTimestamp reader = seriesReaderByTimestampList.get(i);
+	    Object value = reader.getValueInTimestamp(timestamp);
+	    // Null if the time series has no value under the current timestamp
+	    if (value == null) {
+	      rowRecord.addField(null);
+	    } 
+	    // Otherwise set hasField to true
+	    else {
+	      hasField = true;
+	      rowRecord.addField(value, dataTypes.get(i));
+	    }
+	  }
+	  // If there is a value in any time series under the timestamp, it means that the timestamp is valid, and the data line is cached
+	  if (hasField) {
+	    hasCachedRowRecord = true;
+	    cachedRowRecord = rowRecord;
+	    break;
+	  }
+	}
+	return hasCachedRowRecord;
+}
+```
\ No newline at end of file
diff --git a/docs/Documentation/SystemDesign/5-DataQuery/4-AggregationQuery.md b/docs/Documentation/SystemDesign/5-DataQuery/4-AggregationQuery.md
new file mode 100644
index 0000000..486f8de
--- /dev/null
+++ b/docs/Documentation/SystemDesign/5-DataQuery/4-AggregationQuery.md
@@ -0,0 +1,114 @@
+<!--
+
+    Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements.  See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership.  The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License.  You may obtain a copy of the License at
+    
+        http://www.apache.org/licenses/LICENSE-2.0
+    
+    Unless required by applicable law or agreed to in writing,
+    software distributed under the License is distributed on an
+    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+    KIND, either express or implied.  See the License for the
+    specific language governing permissions and limitations
+    under the License.
+
+-->
+
+# Aggregation query
+
+The main logic of the aggregation query is in AggregationExecutor
+
+* org.apache.iotdb.db.query.executor.AggregationExecutor
+
+## Aggregation query without value filter
+
+For aggregate queries without value filters, the results are obtained by the `executeWithoutValueFilter()` method and a dataSet is constructed from them. First, the `mergeSameSeries()` method is used to merge aggregate queries on the same time series. For example: to calculate count(s1), sum(s2), count(s3), sum(s1), two aggregate values need to be calculated for s1, so the resulting pathToAggrIndexesMap will be: s1 -> 0, 3; s2 -> 1; s3 -> 2.
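+
+As a rough, purely illustrative sketch (not the actual IoTDB code), the grouping performed by `mergeSameSeries()` can be thought of as building a path-to-index map:
+
+```java
+// Illustrative sketch only: group aggregation indexes by series path.
+// The input order mirrors the example above: count(s1), sum(s2), count(s3), sum(s1).
+import java.util.*;
+
+public class MergeSameSeriesSketch {
+  public static void main(String[] args) {
+    List<String> seriesOfEachAggregation = Arrays.asList("s1", "s2", "s3", "s1");
+    Map<String, List<Integer>> pathToAggrIndexesMap = new LinkedHashMap<>();
+    for (int i = 0; i < seriesOfEachAggregation.size(); i++) {
+      pathToAggrIndexesMap
+          .computeIfAbsent(seriesOfEachAggregation.get(i), k -> new ArrayList<>())
+          .add(i);
+    }
+    // Prints {s1=[0, 3], s2=[1], s3=[2]}
+    System.out.println(pathToAggrIndexesMap);
+  }
+}
+```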
+
+Each entry of the resulting `pathToAggrIndexesMap` corresponds to the aggregate queries on one series, so its aggregate values `aggregateResults` can be calculated by calling the `groupAggregationsBySeries()` method. Before the result set is finally created, the order needs to be restored to the order of the user's query. Finally, the `constructDataSet()` method is used to create the result set and return it.
+
+The `groupAggregationsBySeries()` method is explained in detail below. First, an `IAggregateReader` is created:
+```
+IAggregateReader seriesReader = new SeriesAggregateReader(
+        pathToAggrIndexes.getKey(), tsDataType, context, QueryResourceManager.getInstance()
+        .getQueryDataSource(seriesPath, context, timeFilter), timeFilter, null);
+```
+
+For each entry (that is, each series), first create an `AggregateResult` for each aggregate query on it. A boolean list `isCalculatedList` records whether each `AggregateResult` has already been calculated, and `remainingToCalculate` records the number of aggregate functions still to be calculated. The boolean list and the counter allow some aggregate functions (such as `FIRST_VALUE`) to stop early once their result is obtained, instead of continuing through the entire loop.
+
+Next, update `AggregateResult` according to the usage method of `aggregateReader` introduced in Section 5.2:
+
+```
+while (aggregateReader.hasNextChunk()) {
+  if (aggregateReader.canUseCurrentChunkStatistics()) {
+    Statistics chunkStatistics = aggregateReader.currentChunkStatistics();
+    
+    // do some aggregate calculation using chunk statistics
+    ...
+    
+    aggregateReader.skipCurrentChunk();
+    continue;
+  }
+	  
+  while (aggregateReader.hasNextPage()) {
+	 if (aggregateReader.canUseCurrentPageStatistics()) {
+	   Statistics pageStatistic = aggregateReader.currentPageStatistics();
+	   
+	   // do some aggregate calculation using page statistics
+      ...
+	   
+	   aggregateReader.skipCurrentPage();
+	   continue;
+	 } else {
+	 	BatchData batchData = aggregateReader.nextPage();
+	 	// do some aggregate calculation using batch data
+      ...
+	 }	 
+  }
+}
+```
+
+Note that before updating each result you first need to check whether it has already been calculated (using `isCalculatedList`); after each update, call the `isCalculatedAggregationResult()` method to update the boolean values in the list. If all values in the list are true, i.e. `remainingToCalculate` is 0, all aggregate function results have been calculated and can be returned.
+```
+if (Boolean.FALSE.equals(isCalculatedList.get(i))) {
+  AggregateResult aggregateResult = aggregateResultList.get(i);
+  ... // update
+  if (aggregateResult.isCalculatedAggregationResult()) {
+    isCalculatedList.set(i, true);
+    remainingToCalculate--;
+    if (remainingToCalculate == 0) {
+      return aggregateResultList;
+    }
+  }
+}
+```
+
+When updating with `overlapedPageData`, since each aggregate function traverses this batchData, the `resetBatchData()` method needs to be called after each traversal to reset the cursor to the beginning, so that the next function can traverse it as well.
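+
+A minimal, self-contained sketch of this reset-and-retraverse pattern; the `BatchCursor` and `Aggregation` types below are hypothetical stand-ins, not the real IoTDB classes:
+
+```java
+// Illustrative sketch of the reset-and-retraverse pattern described above.
+interface BatchCursor {
+  void resetBatchData();                       // rewind the cursor to the beginning
+}
+
+interface Aggregation {
+  boolean isCalculatedAggregationResult();     // true once the result is final, e.g. FIRST_VALUE
+  void updateFromBatch(BatchCursor batch);     // hypothetical update call that consumes the cursor
+}
+
+class ResetAndRetraverse {
+  static void updateAll(java.util.List<Aggregation> results, BatchCursor overlapedPageData) {
+    for (Aggregation result : results) {
+      if (result.isCalculatedAggregationResult()) {
+        continue;                              // already finished, no need to traverse again
+      }
+      result.updateFromBatch(overlapedPageData);
+      overlapedPageData.resetBatchData();      // rewind so the next function can traverse
+    }
+  }
+}
+```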
+
+## Aggregated query with value filter
+For an aggregate query with a value filter, the results are obtained through the `executeWithValueFilter()` method and a dataSet is built from them. First create a `timestampGenerator` based on the expression, then create a `SeriesReaderByTimestamp` for each time series and place it in the `readersOfSelectedSeries` list; create an aggregate result `AggregateResult` for each query and place it in the `aggregateResults` list.
+
+After initialization is complete, call the `aggregateWithValueFilter()` method to update the result:
+```
+while (timestampGenerator.hasNext()) {
+  // Generate timestamps
+  long[] timeArray = new long[aggregateFetchSize];
+  int timeArrayLength = 0;
+  for (int cnt = 0; cnt < aggregateFetchSize; cnt++) {
+    if (!timestampGenerator.hasNext()) {
+      break;
+    }
+    timeArray[timeArrayLength++] = timestampGenerator.next();
+  }
+
+  // Calculate aggregate results using timestamps
+  for (int i = 0; i < readersOfSelectedSeries.size(); i++) {
+    aggregateResults.get(i).updateResultUsingTimestamps(timeArray, timeArrayLength,
+      readersOfSelectedSeries.get(i));
+    }
+  }
+```
diff --git a/docs/Documentation/SystemDesign/5-DataQuery/5-GroupByQuery.md b/docs/Documentation/SystemDesign/5-DataQuery/5-GroupByQuery.md
new file mode 100644
index 0000000..a7987fe
--- /dev/null
+++ b/docs/Documentation/SystemDesign/5-DataQuery/5-GroupByQuery.md
@@ -0,0 +1,260 @@
+<!--
+
+    Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements.  See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership.  The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License.  You may obtain a copy of the License at
+    
+        http://www.apache.org/licenses/LICENSE-2.0
+    
+    Unless required by applicable law or agreed to in writing,
+    software distributed under the License is distributed on an
+    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+    KIND, either express or implied.  See the License for the
+    specific language governing permissions and limitations
+    under the License.
+
+-->
+
+# Downsampling query
+
+* org.apache.iotdb.db.query.dataset.groupby.GroupByEngineDataSet
+
+The result set of a downsampling query inherits from `GroupByEngineDataSet`. This class contains the following fields:
+* protected long queryId
+* private long interval
+* private long slidingStep
+
+The following two fields are for the entire query, and the time period is left closed and right open, which is `[startTime, endTime)`:
+* private long startTime
+* private long endTime
+
+The following fields are for the current segment, and the time period is left closed and right open, which is `[curStartTime, curEndTime)`:
+
+* protected long curStartTime;
+* protected long curEndTime;
+* private int usedIndex;
+* protected boolean hasCachedTimeInterval;
+
+
+The core method of `GroupByEngineDataSet` is straightforward. If a time segment is already cached, there is a next segment and `true` is returned directly; otherwise, the start time of the next segment is calculated and `usedIndex` is increased by 1. If the segment start time exceeds the query end time, `false` is returned; otherwise the segment end time is calculated, `hasCachedTimeInterval` is set to `true`, and `true` is returned:
+```
+protected boolean hasNextWithoutConstraint() {
+  if (hasCachedTimeInterval) {
+    return true;
+  }
+
+  curStartTime = usedIndex * slidingStep + startTime;
+  usedIndex++;
+  if (curStartTime < endTime) {
+    hasCachedTimeInterval = true;
+    curEndTime = Math.min(curStartTime + interval, endTime);
+    return true;
+  } else {
+    return false;
+  }
+}
+```
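+
+As a worked example of how these segments are produced (the parameter values are illustrative, not taken from the source): with `startTime = 0`, `endTime = 50`, `interval = 20` and `slidingStep = 10`, the logic above yields the segments `[0, 20)`, `[10, 30)`, `[20, 40)`, `[30, 50)` and `[40, 50)`. A self-contained sketch that enumerates them:
+
+```java
+// Enumerates the [curStartTime, curEndTime) segments the same way
+// hasNextWithoutConstraint() does, for illustrative parameter values.
+public class GroupBySegments {
+  public static void main(String[] args) {
+    long startTime = 0, endTime = 50, interval = 20, slidingStep = 10;
+    int usedIndex = 0;
+    while (true) {
+      long curStartTime = usedIndex * slidingStep + startTime;
+      usedIndex++;
+      if (curStartTime >= endTime) {
+        break;
+      }
+      long curEndTime = Math.min(curStartTime + interval, endTime);
+      System.out.println("[" + curStartTime + ", " + curEndTime + ")");
+    }
+    // Output: [0, 20) [10, 30) [20, 40) [30, 50) [40, 50)
+  }
+}
+```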
+
+## Downsampling query without value filter
+
+The downsampling query logic without value filter is mainly in the `GroupByWithoutValueFilterDataSet` class, which inherits `GroupByEngineDataSet`.
+
+
+This class has the following key fields:
+* private Map<Path, GroupByExecutor> pathExecutors groups the aggregate functions on the same `Path` and encapsulates them as a `GroupByExecutor`.
+  `GroupByExecutor` encapsulates the data reading and calculation logic of each Path and is described later.
+
+* private TimeRange timeRange encapsulates the time interval of the current calculation as an object, used to determine whether Statistics can directly participate in the calculation
+* private Filter timeFilter is the user-defined query interval generated as a Filter object, used to filter the available files, chunks, and pages
+
+
+First, in the initialization `initGroupBy()` method, the `timeFilter` is calculated based on the expression, and `GroupByExecutor` is generated for each `path`.
+
+The following method is used to convert the result list into a RowRecord. Note that when there are no results in the list, add `null` to the RowRecord:
+
+```
+for (AggregateResult res : fields) {
+  if (res == null) {
+    record.addField(null);
+    continue;
+  }
+  record.addField(res.getResult(), res.getResultDataType());
+}
+```
+
+
+### GroupByExecutor
+`GroupByExecutor` encapsulates the calculation of all aggregate functions under the same path. This class has the following key fields:
+* private IAggregateReader reader  the `SeriesAggregateReader` used to read the data of the current `Path`
+* private BatchData preCachedData  the data read from the `Reader` comes in batches and may exceed the current time period; such a `BatchData` is cached for the next use
+* private List<Pair<AggregateResult, Integer>> results  stores all aggregations on the current `Path`, for example: for `select count(a), sum(a), avg(b)`, `count` and `sum` can be stored together.
+     The `Integer` on the right is used to restore the results to the order of the user query before converting them to a RowRecord.
+
+#### Main method
+
+```
+//Read data from the reader and calculate the main method of this class.
+private List<Pair<AggregateResult, Integer>> calcResult() throws IOException, QueryProcessException;
+
+//Add aggregation operation for current path
+private void addAggregateResult(AggregateResult aggrResult, int index);
+
+//Determine whether the current path has completed all aggregation calculations
+private boolean isEndCalc();
+
+//Calculate results from BatchData that did not run out of cache last calculation
+private boolean calcFromCacheData() throws IOException;
+
+//Calculation using BatchData
+private void calcFromBatch(BatchData batchData) throws IOException;
+
+//Calculate results directly using Page or Chunk's Statistics
+private void calcFromStatistics(Statistics statistics) throws QueryProcessException;
+
+//Clear all calculation results
+private void resetAggregateResults();
+
+//Iterate through and calculate the data in the page
+private boolean readAndCalcFromPage() throws IOException, QueryProcessException;
+
+```
+
+In `GroupByExecutor`, because different aggregate functions on the same path use the same data, the entry method `calcResult` is responsible for reading all the data of the `Path`.
+The retrieved data is then passed to the `calcFromBatch` method, which runs all the aggregate functions over the `BatchData`.
+
+The `calcResult` method returns all AggregateResults of the current Path together with the position of each aggregated value in the user query order. Its main logic is:
+
+```
+//Calculate the data left over from last time, and end the calculation if the result can be obtained directly
+if (calcFromCacheData()) {
+    return results;
+}
+
+//Because a chunk contains multiple pages, the pages of the current chunk must be used up before the next chunk is opened
+if (readAndCalcFromPage()) {
+    return results;
+}
+
+//After the remaining data has been calculated, open new chunks to continue the calculation
+while (reader.hasNextChunk()) {
+    Statistics chunkStatistics = reader.currentChunkStatistics();
+    // Determine if Statistics is available and perform the calculation
+    if (/* chunkStatistics can be used */) {
+      ....
+      // Skip the current chunk
+      reader.skipCurrentChunk();
+      // End the calculation if all results have been obtained
+      if (isEndCalc()) {
+        return results;
+      }
+      continue;
+    }
+    //If chunkStatistics cannot be used, page data must be used for the calculation
+    if (readAndCalcFromPage()) {
+      return results;
+    }
+}
+```
+
+The `readAndCalcFromPage` method obtains the page data of the currently opened chunk and calculates the aggregate results. It returns true when all calculations are complete, otherwise false. The main logic:
+
+```
+while (reader.hasNextPage()) {
+    Statistics pageStatistics = reader.currentPageStatistics();
+    //PageStatistics can only be used if the page does not intersect with other pages
+    if (pageStatistics != null) {
+        // Determine if Statistics is available and perform the calculation
+        ....
+        // Skip the current page
+        reader.skipCurrentPage();
+        // End the calculation if all results have been obtained
+        if (isEndCalc()) {
+          return true;
+        }
+        continue;
+    }
+    // When Statistics is not available, all data has to be fetched for the calculation
+    BatchData batchData = reader.nextPage();
+    if (batchData == null || !batchData.hasCurrent()) {
+      continue;
+    }
+    // If the newly opened page exceeds the time range, cache the retrieved data and end the calculation directly
+    if (batchData.currentTime() >= curEndTime) {
+      preCachedData = batchData;
+      return true;
+    }
+    //Perform the calculation
+    calcFromBatch(batchData);
+    ...
+}
+
+```
+
+The `calcFromBatch` method traverses all the aggregate functions to process the retrieved BatchData. The main logic is:
+
+```
+for (Pair<AggregateResult, Integer> result : results) {
+    //If a function has already been calculated, it is not calculated again, e.g. a minimum-value calculation that is already complete
+    if (result.left.isCalculatedAggregationResult()) {
+      continue;
+    }
+    // Perform calculations
+    ....
+}
+//Determine whether the data in the current batchData can still be used next time; if so, add it to the cache
+if (batchData.getMaxTimestamp() >= curEndTime) {
+    preCachedData = batchData;
+}
+```
+
+## Downsampling query with value filter
+The downsampling query logic with value filtering conditions is mainly in the `GroupByWithValueFilterDataSet` class, which inherits `GroupByEngineDataSet`.
+
+This class has the following key fields:
+* private List\<IReaderByTimestamp\> allDataReaderList
+* private GroupByPlan groupByPlan
+* private TimeGenerator timestampGenerator
+* private long timestamp is used to cache the timestamp for the next group by partition
+* private boolean hasCachedTimestamp is used to determine whether there is a cached timestamp for the next group by partition
+* private int timeStampFetchSize is the batch size of a group by calculation
+
+First, in the initialization `initGroupBy()` method, create a `timestampGenerator` based on the expression; then create a `SeriesReaderByTimestamp` for each time series and place it in the `allDataReaderList` list. After initialization is complete, call the `nextWithoutConstraint()` method to update the result. If a timestamp is cached for the next group by partition and the time meets the requirements, add it to `timestampArray`, otherwise return the `aggregateResultList` result direct [...]
+
+```
+while (timestampGenerator.hasNext()) {
+  // Call constructTimeArrayForOneCal () method to get a list of timestamp
+  timeArrayLength = constructTimeArrayForOneCal(timestampArray, timeArrayLength);
+
+  // Call the updateResultUsingTimestamps () method to calculate the aggregate result using the timestamp list
+  for (int i = 0; i < paths.size(); i++) {
+    aggregateResultList.get(i).updateResultUsingTimestamps(
+        timestampArray, timeArrayLength, allDataReaderList.get(i));
+  }
+
+  timeArrayLength = 0;
+  // Determine if it is over
+  if (timestamp >= curEndTime) {
+    hasCachedTimestamp = true;
+    break;
+  }
+}
+```
+
+The `constructTimeArrayForOneCal()` method traverses the timestampGenerator to build a list of timestamps:
+
+```
+for (int cnt = 1; cnt < timeStampFetchSize && timestampGenerator.hasNext(); cnt++) {
+  timestamp = timestampGenerator.next();
+  if (timestamp < curEndTime) {
+    timestampArray[timeArrayLength++] = timestamp;
+  } else {
+    hasCachedTimestamp = true;
+    break;
+  }
+}
+```
diff --git a/docs/Documentation/SystemDesign/5-DataQuery/6-LastQuery.md b/docs/Documentation/SystemDesign/5-DataQuery/6-LastQuery.md
new file mode 100644
index 0000000..5411c66
--- /dev/null
+++ b/docs/Documentation/SystemDesign/5-DataQuery/6-LastQuery.md
@@ -0,0 +1,122 @@
+<!--
+
+    Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements.  See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership.  The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License.  You may obtain a copy of the License at
+    
+        http://www.apache.org/licenses/LICENSE-2.0
+    
+    Unless required by applicable law or agreed to in writing,
+    software distributed under the License is distributed on an
+    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+    KIND, either express or implied.  See the License for the
+    specific language governing permissions and limitations
+    under the License.
+
+-->
+
+# Last query
+
+The main logic of Last query is in LastQueryExecutor
+
+* org.apache.iotdb.db.query.executor.LastQueryExecutor
+
+The Last query executes the `calculateLastPairForOneSeries` method for each specified time series.
+
+## Read MNode cache data
+
+We add a Last data cache to the MNode structure corresponding to the time series that needs to be queried.
+
+For the Last query of a given time series, the `calculateLastPairForOneSeries` method first tries to read the cached data in the MNode.
+
+```
+try {
+  node = MManager.getInstance().getDeviceNodeWithAutoCreateStorageGroup(seriesPath.toString());
+} catch (MetadataException e) {
+  throw new QueryProcessException(e);
+}
+if (((LeafMNode) node).getCachedLast() != null) {
+  return ((LeafMNode) node).getCachedLast();
+}
+```
+If it is found that the cache has not been written, execute the following standard query process to read the TsFile data.
+
+## Last standard query process
+
+The standard Last query process needs to traverse all sequential and unsequential files to get the query result, and finally writes the result back to the MNode cache. In the algorithm, sequential and unsequential files are processed separately.
+- Sequential files are sorted by write time, so the `loadChunkMetadataFromTsFileResource` method is used directly to get the last `ChunkMetadata`, and the maximum timestamp and corresponding value are obtained from the statistics of that `ChunkMetadata`.
+    ```
+    if (!seqFileResources.isEmpty()) {
+      List<ChunkMetaData> chunkMetadata =
+          FileLoaderUtils.loadChunkMetadataFromTsFileResource(
+              seqFileResources.get(seqFileResources.size() - 1), seriesPath, context);
+      if (!chunkMetadata.isEmpty()) {
+        ChunkMetaData lastChunkMetaData = chunkMetadata.get(chunkMetadata.size() - 1);
+        Statistics chunkStatistics = lastChunkMetaData.getStatistics();
+        resultPair =
+            constructLastPair(
+                chunkStatistics.getEndTime(), chunkStatistics.getLastValue(), tsDataType);
+      }
+    }
+    ```
+- For unsequential files, all `ChunkMetadata` structures need to be traversed to get the data with the maximum timestamp. Note that when multiple `ChunkMetadata` have the same timestamp, the data in the `ChunkMetadata` with the largest `version` value is taken as the Last result.
+
+    ```
+    long version = 0;
+    for (TsFileResource resource : unseqFileResources) {
+      if (resource.getEndTimeMap().get(seriesPath.getDevice()) < resultPair.getTimestamp()) {
+        break;
+      }
+      List<ChunkMetaData> chunkMetadata =
+          FileLoaderUtils.loadChunkMetadataFromTsFileResource(resource, seriesPath, context);
+      for (ChunkMetaData chunkMetaData : chunkMetadata) {
+        if (chunkMetaData.getEndTime() == resultPair.getTimestamp()
+            && chunkMetaData.getVersion() > version) {
+          Statistics chunkStatistics = chunkMetaData.getStatistics();
+          resultPair =
+              constructLastPair(
+                  chunkStatistics.getEndTime(), chunkStatistics.getLastValue(), tsDataType);
+          version = chunkMetaData.getVersion();
+        }
+      }
+    }
+    ```
+ - Finally write the query results to the MNode's Last cache
+    ```
+    ((LeafMNode) node).updateCachedLast(resultPair, false, Long.MIN_VALUE);
+    ```
+
+## Last cache update strategy
+
+The Last cache update logic is located in the `updateCachedLast` method of `LeafMNode`. Two additional parameters, `highPriorityUpdate` and `latestFlushTime`, are introduced here. `highPriorityUpdate` indicates whether this update is high priority: cache updates caused by new data writes are considered high priority, while cache updates during a query default to low priority. `latestFlushTime` is used to record the maximum timestamp of data that has been curr [...]
+
+The cache update strategy is as follows:
+
+1. When there is no record in the cache, the Last data obtained by a query is written directly into the cache.
+2. When there is no record in the cache, newly written data is written into the cache if its timestamp is greater than or equal to `latestFlushTime`.
+3. When there is a record in the cache, the timestamp of the queried or written data is compared with the timestamp in the cache. Written data has high priority and updates the cache when its timestamp is not less than that of the cached record; queried data has low priority and its timestamp must be greater than that of the cached record to update the cache.
+
+The specific code is as follows
+```
+public synchronized void updateCachedLast(
+  TimeValuePair timeValuePair, boolean highPriorityUpdate, Long latestFlushedTime) {
+    if (timeValuePair == null || timeValuePair.getValue() == null) return;
+    
+    if (cachedLastValuePair == null) {
+      // If no cached last, (1) a last query (2) an unseq insertion or (3) a seq insertion will update cache.
+      if (!highPriorityUpdate || latestFlushedTime <= timeValuePair.getTimestamp()) {
+        cachedLastValuePair =
+            new TimeValuePair(timeValuePair.getTimestamp(), timeValuePair.getValue());
+      }
+    } else if (timeValuePair.getTimestamp() > cachedLastValuePair.getTimestamp()
+        || (timeValuePair.getTimestamp() == cachedLastValuePair.getTimestamp()
+            && highPriorityUpdate)) {
+      cachedLastValuePair.setTimestamp(timeValuePair.getTimestamp());
+      cachedLastValuePair.setValue(timeValuePair.getValue());
+    }
+}
+```
diff --git a/docs/Documentation/SystemDesign/5-DataQuery/7-AlignByDeviceQuery.md b/docs/Documentation/SystemDesign/5-DataQuery/7-AlignByDeviceQuery.md
new file mode 100644
index 0000000..f65611a
--- /dev/null
+++ b/docs/Documentation/SystemDesign/5-DataQuery/7-AlignByDeviceQuery.md
@@ -0,0 +1,203 @@
+<!--
+
+    Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements.  See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership.  The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License.  You may obtain a copy of the License at
+    
+        http://www.apache.org/licenses/LICENSE-2.0
+    
+    Unless required by applicable law or agreed to in writing,
+    software distributed under the License is distributed on an
+    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+    KIND, either express or implied.  See the License for the
+    specific language governing permissions and limitations
+    under the License.
+
+-->
+
+# Query by device alignment
+
+The table structure of AlignByDevice Query is:
+
+| time | device | sensor1 | sensor2 | sensor3 | ...  |
+| ---- | ------ | ------- | ------- | ------- | ---- |
+|      |        |         |         |         |      |
+
+## Design principle
+
+The align-by-device query is implemented mainly by calculating the measurements and filter conditions corresponding to each device in the query, then executing the query for each device separately, and finally assembling the result sets and returning them.
+
+### Meaning of important fields in AlignByDevicePlan
+
+First explain the meaning of some important fields in AlignByDevicePlan:
+- `List<String> measurements`: The list of measurements that appear in the query.
+- `Map<Path, TSDataType> dataTypeMapping`: This variable is inherited from the base class QueryPlan; its main role is to provide the data type corresponding to each path of this query when calculating the execution paths of each device.
+- `Map<String, Set<String>> deviceToMeasurementsMap`, `Map<String, IExpression> deviceToFilterMap`: These two fields store the measurements and filter conditions corresponding to each device.
+- `Map<String, TSDataType> measurementDataTypeMap`: AlignByDevicePlan requires that sensors with the same name have the same data type across different devices. This field is a Map of `measurementName -> dataType`. For example, `root.sg.d1.s1` and `root.sg.d2.s1` should have the same data type.
+- `enum MeasurementType`: Records three measurement types. Measurements that do not exist in any device are of type `NonExist`; measurements wrapped in single or double quotes are of type `Constant`; measurements that exist are of type `Exist`.
+- `Map<String, MeasurementType> measurementTypeMap`: This field is a Map of `measurementName -> measurementType`, recording the types of all measurements in the query.
+- groupByPlan, fillQueryPlan, aggregationPlan: To avoid redundancy, these three execution plans are set as subclasses of RawDataQueryPlan and stored as variables in AlignByDevicePlan. If the query plan belongs to one of these three types, the corresponding field is assigned and saved.
+
+Before explaining the specific implementation process, a relatively complete example is given first, and the following explanation will be used in conjunction with this example.
+
+```sql
+SELECT s1, "1", *, s2, s5 FROM root.sg.d1, root.sg.* WHERE time = 1 AND s1 < 25 ALIGN BY DEVICE
+```
+
+Among them, the time series in the system is:
+
+- root.sg.d1.s1
+- root.sg.d1.s2
+- root.sg.d2.s1
+
+The storage group `root.sg` contains two devices d1 and d2, where d1 has two sensors s1 and s2, d2 has only sensor s1, and the same sensor s1 has the same data type.
+
+The following will be explained according to the specific process:
+
+### Logical plan generation
+
+- org.apache.iotdb.db.qp.Planner
+
+Unlike a raw data query, an align-by-device query does not concatenate the suffix paths in the SELECT clause and the WHERE clause at this stage; instead, the measurements and filter conditions corresponding to each device are calculated later, when the physical plan is generated. Therefore, the only work done for align-by-device queries at this stage is the optimization of the filter conditions in the WHERE clause.
+
+The optimization of the filter conditions mainly includes three parts: removing NOT operators, converting to disjunctive normal form, and merging filter conditions on the same path. The corresponding optimizers are RemoveNotOptimizer, DnfFilterOptimizer, and MergeSingleFilterOptimizer. For this part of the logic, refer to: [Planner](/#/SystemDesign/progress/chap2/sec2).
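+
+For intuition, a purely illustrative example (not taken from the IoTDB code) of what these three optimizations do to a sample WHERE clause:
+
+```
+!(s1 > 25) AND (time = 1 OR time = 2) AND s1 < 30
+
+-- RemoveNotOptimizer: push NOT into the leaf predicates
+s1 <= 25 AND (time = 1 OR time = 2) AND s1 < 30
+
+-- DnfFilterOptimizer: convert to disjunctive normal form
+(s1 <= 25 AND time = 1 AND s1 < 30) OR (s1 <= 25 AND time = 2 AND s1 < 30)
+
+-- MergeSingleFilterOptimizer: merge predicates on the same path
+(s1 <= 25 AND time = 1) OR (s1 <= 25 AND time = 2)
+```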
+
+### Physical plan generation
+
+- org.apache.iotdb.db.qp.strategy.PhysicalGenerator
+
+After the logical plan is generated, the `transformToPhysicalPlan()` method in the PhysicalGenerator class is called to convert the logical plan into a physical plan. For align-by-device queries, the main logic of this method is implemented in the `transformQuery()` method.
+
+**The main work done at this stage is to generate the corresponding `AlignByDevicePlan` and fill in its variable information.**
+
+First explain the meaning of some important fields in the `transformQuery()` method (for the fields that also appear in AlignByDevicePlan, see above):
+
+- prefixPaths, suffixPaths: The former are the prefix paths in the FROM clause, in the example `[root.sg.d1, root.sg.*]`; the latter are the suffix paths in the SELECT clause, in the example `[s1, "1", *, s2, s5]`.
+- devices: The device list obtained by removing wildcards from and de-duplicating the prefix paths, in the example `[root.sg.d1, root.sg.d2]`.
+- measurementSetOfGivenSuffix: An intermediate variable recording the measurements corresponding to a suffix. In the example, for the suffix `*`, `measurementSetOfGivenSuffix = {s1, s2}`; for the suffix s1, `measurementSetOfGivenSuffix = {s1}`.
+
+Next, introduce the calculation process of AlignByDevicePlan:
+
+1. Check whether the query type is one of the three types groupByPlan, fillQueryPlan, or aggregationPlan. If so, assign the corresponding variable and change the query type of `AlignByDevicePlan`.
+2. Iterate over the SELECT suffix paths, and for each suffix path maintain an intermediate variable `measurementSetOfGivenSuffix` to record all measurements corresponding to that suffix. If the suffix path starts with a single or double quote, add the value directly to `measurements` and record its type as `Constant`.
+3. Otherwise, concatenate the suffix path with each device in the device list to obtain a complete path. If the concatenated path does not exist, further check whether the measurement exists in other devices: if it exists in none of them, temporarily mark it as `NonExist`; if the measurement appears in a later device, the `NonExist` value is overridden.
+4. If the concatenated path exists, the measurement is of type `Exist`, and the consistency of its data type needs to be checked. If it is not consistent, an error message is returned; if it is, the measurement is recorded.
+5. After the loop over one suffix ends, the `measurementSetOfGivenSuffix` collected in that loop is added to `measurements`. At the end of the entire loop, the collected variable information is assigned to the AlignByDevicePlan. The measurement list obtained here is not de-duplicated; de-duplication happens when the ColumnHeader is generated.
+6. Finally, call the `concatFilterByDevice()` method to calculate `deviceToFilterMap`, obtaining the corresponding Filter information after concatenation for each device separately.
+
+```java
+Map<String, IExpression> concatFilterByDevice(List<String> devices,
+      FilterOperator operator)
+Input: the de-duplicated device list and the un-concatenated FilterOperator
+Output: the concatenated deviceToFilterMap, recording the Filter information corresponding to each device
+```
+
+The main processing logic of the `concatFilterByDevice()` method is in `concatFilterPath()`:
+
+The `concatFilterPath()` method traverses the un-concatenated FilterOperator binary tree and determines whether a node is a leaf node. If so, the path of the leaf node is taken: if the path starts with time or root, it is left unchanged, otherwise the device name is concatenated with the node's path and returned. If the node is not a leaf, all of its children are processed recursively. In the example, the result of concatenating the filter conditions for device 1 is `time = 1 AND root.sg.d1.s1 < 25` [...]
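+
+A minimal, self-contained sketch of this per-device concatenation idea; the `Node`, `LeafNode` and `InnerNode` types below are hypothetical simplifications, not the actual IoTDB FilterOperator classes:
+
+```java
+import java.util.List;
+
+// A filter node that can prefix its leaf paths with a device name.
+abstract class Node {
+  abstract void concatDevice(String device);
+}
+
+class LeafNode extends Node {
+  String path;   // e.g. "s1" or "time"
+  LeafNode(String path) { this.path = path; }
+  @Override
+  void concatDevice(String device) {
+    // Leave "time" (and already-absolute "root." paths) untouched,
+    // otherwise prefix the device name.
+    if (!path.equals("time") && !path.startsWith("root.")) {
+      path = device + "." + path;
+    }
+  }
+}
+
+class InnerNode extends Node {
+  List<Node> children;
+  InnerNode(List<Node> children) { this.children = children; }
+  @Override
+  void concatDevice(String device) {
+    // Recursively process all children of a non-leaf node.
+    for (Node child : children) {
+      child.concatDevice(device);
+    }
+  }
+}
+```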
+
+The following example summarizes the variable information calculated through this stage:
+
+- measurement list `measurements`: `[s1, "1", s1, s2, s2, s5]`
+- measurement type `measurementTypeMap`:
+  -  `s1 -> Exist`
+  -  `s2 -> Exist`
+  -  `"1" -> Constant`
+  -  `s5 -> NonExist`
+- Measurements for each device `deviceToMeasurementsMap`:
+  -  `root.sg.d1 -> s1, s2`
+  -  `root.sg.d2 -> s1`
+- Filter condition `deviceToFilterMap` for each device:
+  -  `root.sg.d1 -> time = 1 AND root.sg.d1.s1 < 25`
+  -  `root.sg.d2 -> time = 1 AND root.sg.d2.s1 < 25`
+
+### Constructing a Header (ColumnHeader)
+
+- org.apache.iotdb.db.service.TSServiceImpl
+
+After the physical plan is generated, the executeQueryStatement() method in TSServiceImpl is executed to generate the result set and return it. The first step is to construct the header.
+
+For an align-by-device query, after the TSServiceImpl.getQueryColumnHeaders() method is called, it enters TSServiceImpl.getAlignByDeviceQueryHeaders() according to the query type to construct the header.
+
+The `getAlignByDeviceQueryHeaders()` method is declared as follows:
+
+```java
+private void getAlignByDeviceQueryHeaders(
+      AlignByDevicePlan plan, List<String> respColumns, List<String> columnTypes)
+Input: the currently executing physical plan AlignByDevicePlan, together with the lists respColumns and columnTypes to be filled with the output column names and their data types
+Output: the calculated column names respColumns and data types columnTypes
+```
+
+The specific implementation logic is as follows:
+
+1. First add the `Device` column, whose data type is `TEXT`;
+2. Traverse the non-deduplicated measurement list and determine the type of each measurement. If it is of type Exist, get its data type from the measurementDataTypeMap; for the other two types, set the data type to TEXT. Then add the measurement and its type to the header data structure.
+3. De-duplicate the measurements using the intermediate variable deduplicatedMeasurements.
+
+The resulting header is:
+
+| Time | Device | s1  | 1   | s1  | s2  | s2  | s5  |
+| ---- | ------ | --- | --- | --- | --- | --- | --- |
+|      |        |     |     |     |     |     |     |
+
+The de-duplicated `measurements` are `[s1, "1", s2, s5]`.
+
+### Result set generation
+
+After the ColumnHeader is generated, the final step is to populate the result set with the results and return.
+
+#### Result set creation
+
+- org.apache.iotdb.db.service.TSServiceImpl
+
+At this stage, `TSServiceImpl.createQueryDataSet()` needs to be called to create a new result set. This part of the logic is relatively simple: for an AlignByDeviceQuery, only a new `AlignByDeviceDataSet` needs to be created, and in its constructor the parameters of the AlignByDevicePlan are assigned to the newly created result set.
+
+#### Result set population
+
+- org.apache.iotdb.db.utils.QueryDataSetUtils
+
+Next, the results need to be filled in. An AlignByDeviceQuery calls the `TSServiceImpl.fillRpcReturnData()` method and then enters the `QueryDataSetUtils.convertQueryDataSetByFetchSize()` method according to the query type.
+
+The important method for getting results in the `convertQueryDataSetByFetchSize()` method is the `hasNext()` method of QueryDataSet.
+
+The main logic of the `hasNext()` method is as follows:
+
+1. Determine whether a row offset `rowOffset` is specified; if so, skip the rows that need to be offset. If the total number of results is less than the specified offset, return false.
+2. Determine whether a limit on the number of rows `rowLimit` is specified; if so, compare it with the current number of output rows and return false if the current number of output rows is greater than the limit.
+3. Enter the `AlignByDeviceDataSet.hasNextWithoutConstraint()` method.
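+
+A minimal, self-contained illustration of the offset/limit bookkeeping in steps 1 and 2 (illustrative only, not the actual IoTDB code):
+
+```java
+import java.util.*;
+
+public class OffsetLimitSketch {
+  public static void main(String[] args) {
+    // Skip `rowOffset` rows, then emit at most `rowLimit` rows.
+    Iterator<String> rows = Arrays.asList("r1", "r2", "r3", "r4", "r5").iterator();
+    int rowOffset = 1, rowLimit = 2, alreadyReturned = 0;
+    while (rows.hasNext()) {
+      String row = rows.next();
+      if (rowOffset > 0) {        // still skipping offset rows
+        rowOffset--;
+        continue;
+      }
+      if (rowLimit > 0 && alreadyReturned >= rowLimit) {
+        break;                    // limit reached
+      }
+      alreadyReturned++;
+      System.out.println(row);    // prints r2 and r3
+    }
+  }
+}
+```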
+
+<br>
+
+- org.apache.iotdb.db.query.dataset.AlignByDeviceDataSet
+
+First explain the meaning of the important fields in the result set:
+
+- `deviceIterator`: An align-by-device query essentially calculates the measurements and filter conditions for each device and then executes the query for each device separately. This field is the iterator over the devices; each round, one device is obtained and queried.
+- `currentDataSet`: This field represents the result set obtained by querying the current device.
+
+The `hasNextWithoutConstraint()` method mainly determines whether the current result set has a next result. If not, the next device is obtained, the paths, data types and filter conditions required for that device's query are calculated, and a specific query plan is executed according to its query type to obtain a result set, until no device remains to be queried.
+
+The specific implementation logic is as follows:
+
+1. First determine whether the current result set has been initialized and has a next result. If so, return true directly, i.e. the next() method can be called to get the next RowRecord; otherwise, the result set has not been initialized and the process goes to step 2.
+2. Iterate the deviceIterator to get the device for this round of execution, then look it up in the deviceToMeasurementsMap to get the corresponding measurements as executeColumns.
+3. Concatenate the current device name with the measurements to calculate the query paths, data types, and filter conditions of the current device. The corresponding fields are `executePaths`, `tsDataTypes`, and `expression`. If it is an aggregate query, `executeAggregations` also needs to be calculated.
+4. Determine whether the current subquery type is GroupByQuery, AggregationQuery, FillQuery or RawDataQuery, execute the corresponding query and obtain its result set. The implementation logic of [Raw data query](/#/SystemDesign/progress/chap5/sec3), [Aggregate query](/#/SystemDesign/progress/chap5/sec4) and [Downsampling query](/#/SystemDesign/progress/chap5/sec5) can be referenced.
+
+After the result set has been initialized by the `hasNextWithoutConstraint()` method and it is ensured that there is a next result, the `QueryDataSet.next()` method can be called to get the next `RowRecord`.
+
+The `next()` method is mainly implemented by the `AlignByDeviceDataSet.nextWithoutConstraint()` method.
+
+The work done by the `nextWithoutConstraint()` method is to **transform the time-aligned result set obtained by a single device query into a device-aligned result set** and return the transformed `RowRecord`.
+
+The specific implementation logic is as follows:
+
+1. First get the next time-aligned `originRowRecord` from the result set.
+2. Create a new `RowRecord` with the timestamp, add the device column to it, and build a Map `currentColumnMap` of `measurementName -> Field` from `executeColumns` and the obtained result.
+3. After that, only the de-duplicated `measurements` list needs to be traversed to determine the type of each measurement. If the type is `Exist`, get the corresponding result from `currentColumnMap` by measurementName, setting it to `null` if absent; if it is of type `NonExist`, set it to `null` directly; if it is of type `Constant`, use the measurement name itself as the value of this column.
+
+After writing the output data stream according to the transformed `RowRecord`, the result set can be returned.
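+
+A simplified, illustrative sketch of steps 2 and 3 above; the types here are plain Java stand-ins, not the actual IoTDB RowRecord/Field classes:
+
+```java
+import java.util.*;
+
+public class DeviceAlignedRowSketch {
+  // Turn the time-aligned row of one device into a device-aligned row.
+  static List<Object> toDeviceAlignedRow(
+      long timestamp, String device,
+      List<String> executeColumns, List<Object> originFields,   // time-aligned result of one device
+      List<String> deduplicatedMeasurements,
+      Map<String, String> measurementTypeMap) {                 // "Exist" / "NonExist" / "Constant"
+    // measurementName -> value of the current row
+    Map<String, Object> currentColumnMap = new HashMap<>();
+    for (int i = 0; i < executeColumns.size(); i++) {
+      currentColumnMap.put(executeColumns.get(i), originFields.get(i));
+    }
+    List<Object> row = new ArrayList<>();
+    row.add(timestamp);
+    row.add(device);
+    for (String m : deduplicatedMeasurements) {
+      switch (measurementTypeMap.getOrDefault(m, "NonExist")) {
+        case "Exist":
+          row.add(currentColumnMap.get(m));  // null if this device has no value for m
+          break;
+        case "Constant":
+          row.add(m);                        // the constant itself is the column value
+          break;
+        default:                             // NonExist
+          row.add(null);
+      }
+    }
+    return row;
+  }
+}
+```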
diff --git a/docs/Documentation/SystemDesign/6-Tools/1-Sync.md b/docs/Documentation/SystemDesign/6-Tools/1-Sync.md
new file mode 100644
index 0000000..32c3211
--- /dev/null
+++ b/docs/Documentation/SystemDesign/6-Tools/1-Sync.md
@@ -0,0 +1,249 @@
+<!--
+
+    Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements.  See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership.  The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License.  You may obtain a copy of the License at
+    
+        http://www.apache.org/licenses/LICENSE-2.0
+    
+    Unless required by applicable law or agreed to in writing,
+    software distributed under the License is distributed on an
+    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+    KIND, either express or implied.  See the License for the
+    specific language governing permissions and limitations
+    under the License.
+
+-->
+
+# Sync Tool
+
+The synchronization tool is a tool suite that periodically uploads newly persisted tsfile files from the local disk to the cloud and loads them into Apache IoTDB.
+
+## Overview
+
+This document mainly introduces the requirements definition and module design of synchronization tools.
+
+### Scenarios
+
+The requirements of synchronization tools are mainly the following:
+
+* In a production environment, Apache IoTDB collects data generated by data sources (industrial equipment, mobile terminals, etc.) and stores them locally.  Since the data sources may be distributed in different places, there may be multiple Apache IoTDBs responsible for collecting data at the same time.  For each IoTDB, it needs to synchronize its local data into the data center.  The data center is responsible for collecting and managing data from multiple Apache IoTDBs.
+
+* With the widespread application of the Apache IoTDB system, users need to load and apply the tsfile files generated by some Apache IoTDB instances to the data directory of another Apache IoTDB instance to achieve data synchronization according to the target business needs.
+
+* The synchronization module exists in the form of an independent process on the sending end, and is located in the same process as the Apache IoTDB on the receiving end.
+
+* One sender can synchronize data to multiple receivers, and one receiver can receive data from multiple senders at the same time, but it must be ensured that the data synchronized by multiple senders does not conflict (that is, each device has only one data source); otherwise, a conflict needs to be reported.
+
+### Goals
+
+The synchronization tool can be used to transfer and load data files between two Apache IoTDB instances.  When network instability or downtime occurs, ensure that files can be completely and correctly transferred to the data center.
+
+## Directory Structure
+
+For convenience of explanation, suppose the application scenario is that node `192.168.130.15` synchronizes data to node `192.168.130.16:5555`, and node `192.168.130.15` also receives data synchronized from node `192.168.130.14`. Since node `192.168.130.15` serves as both a sender and a receiver, the following describes the directory structure from the point of view of node `192.168.130.15`.
+
+### Directory structure design
+
+<img style="width:100%; max-width:800px; max-height:600px; margin-left:auto; margin-right:auto; display:block;" src="https://user-images.githubusercontent.com/26211279/74145347-849dc380-4c39-11ea-9ef2-e10a3fe2074d.png">
+
+### Directory structure description
+
+The sync-sender folder contains temporary files, status logs, etc. during the data synchronization when this node is used as the sender.
+
+The sync-receiver folder contains temporary files, status logs, and so on generated while this node receives and loads data as a receiver.
+
+The `schema/sync` folder holds the synchronization information that needs to be persisted.
+
+#### Sender
+
+`data/sync-sender` is the sender's folder. Each folder name in this directory represents the IP and port of a receiver; in this example, there is one receiver `192.168.130.16:5555`. Each such folder contains the following files:
+
+* last_local_files.txt 
+Records a list of all local tsfile files that have been synchronized after the synchronization task ends, and is updated after each synchronization task ends.
+
+* snapshot 
+During data synchronization, this folder contains hard links to all tsfile files to be synchronized.
+
+* sync.log
+Record the task progress log of the synchronization module for system downtime recovery. The structure of this file will be explained in detail later.
+
+#### Receiver
+
+`sync-receiver` is the receiver's folder. Each folder name in this directory represents the IP and UUID of a sender, and the folder holds the data files and file loading logs received from that sender. In this example, there is one sender `192.168.130.14`, whose UUID is `a45b6e63eb434aad891264b5c08d448e`. Each such folder contains the following files:
+
+* load.log 
+This file records the progress log of tsfile loading tasks and is used for recovery after a system crash.
+
+* data
+This folder contains the tsfile file that has been received from the sender.
+
+#### Others
+
+The `schema/sync` folder contains the following information:
+
+* As a sender, the sender instance file lock sync.lock ensures that the same sender can start only one sender instance towards the same receiver, that is, only one process synchronizes data to that receiver. The directory 192.168.130.16_5555/sync_lock in the figure indicates the instance lock for synchronization to the receiver 192.168.130.16_5555. Each time it starts, it first checks whether the file is locked. If the lock indicates that there is a [...]
+
+* As a sender, the sender's unique identifier UUID `uuid.txt`
+    Each sender has a unique identifier so that the receiver can distinguish between different senders
+
+* As a sender, the schema synchronization progress for each receiver, `sync_schema_pos`
+
+    Because the schema log `mlog.txt` is append-only and records the change history of all metadata, the current position is recorded after each schema synchronization; the next synchronization can then be incremental from that position, avoiding repeated schema transmission.
+
+* As a receiver, the ownership information `device_owner.log` of each device on the receiver
+    When using the synchronization tool, one receiver can receive data from multiple senders at the same time, but no conflict may occur, otherwise the receiver cannot guarantee the correctness of the data. Therefore, it is necessary to record which sender synchronizes each device, following the first-come-first-served principle.
+
+The reason for placing this information separately under the schema folder is that an Apache IoTDB instance can have multiple data file directories, that is, there can be multiple data folders, but there is only one schema folder, and this information is shared by one sender instance. The information in the data folder indicates the synchronization status within that file directory and belongs to subtask information (each data file directory is one subtask).
+
+## Sync tool sender
+
+### Statement of needs
+
+* At regular intervals, the latest data collected by the sender is transferred to the receiver. At the same time, updates and deletions of historical data are also synchronized to the receiver.
+
+* The synchronization data must be complete. If the data file is incomplete or damaged due to factors such as network instability and machine failure during the transmission, it needs to be repaired during the next transmission.
+
+### Module design
+
+#### File management module
+
+##### package
+
+org.apache.iotdb.db.sync.sender.manage
+
+##### File selection
+
+The function of file selection is to compare the set of closed tsfile files in the current Apache IoTDB instance (those with a corresponding `.resource` file and without a `.modification` or `.merge` file) against the tsfile list recorded at the end of the last synchronization task, producing two parts: the list of deleted tsfile files and the list of newly added tsfile files. All newly added files are hard-linked, to prevent operations such as file deletion caused by system operation during sync [...]
+
+##### File cleanup
+
+When the file transfer module signals that its task has finished, execute the following steps (a sketch follows the list):
+
+* Load the file names in the `last_local_files.txt` file into an in-memory set, then parse `sync.log` line by line and apply the recorded deletions and additions to the set
+* Write the file names in the set to the `current_local_files.txt` file
+* Delete the `last_local_files.txt` file
+* Rename `current_local_files.txt` to `last_local_files.txt`
+* Delete the sequence folder and the `sync.log` file
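+
+A minimal sketch of rebuilding the file list from `sync.log`, assuming the log contains exactly the section markers and successfully processed file names described in this document; the class and method names are illustrative.
+
+```
+import java.io.IOException;
+import java.nio.file.*;
+import java.util.*;
+
+public class FileListRebuildSketch {
+
+  /** Replays sync.log against the last known file list and writes current_local_files.txt. */
+  public static void rebuild(Path lastLocalFiles, Path syncLog, Path currentLocalFiles)
+      throws IOException {
+    Set<String> files = new HashSet<>(Files.readAllLines(lastLocalFiles));
+
+    boolean inDeletedSection = false;
+    for (String line : Files.readAllLines(syncLog)) {
+      switch (line) {
+        case "sync deleted file names start": inDeletedSection = true;  break;
+        case "sync tsfile start":             inDeletedSection = false; break;
+        case "sync start":
+        case "sync deleted file names end":
+        case "sync tsfile end":
+        case "sync end":                      break;
+        default:
+          // any other line is a tsfile name that was successfully processed
+          if (inDeletedSection) { files.remove(line); } else { files.add(line); }
+      }
+    }
+    Files.write(currentLocalFiles, new TreeSet<>(files));  // write in a deterministic order
+  }
+}
+```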
+
+#### File transfer module
+
+##### package
+
+org.apache.iotdb.db.sync.sender.transfer
+
+##### Synchronization schema
+
+Before synchronizing the data files, first synchronize the newly added schema information and update `sync_schema_pos`.
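+
+A minimal sketch of the incremental schema transfer, assuming `sync_schema_pos` stores a byte offset into `mlog.txt`; the transfer call is a placeholder, not the actual RPC.
+
+```
+import java.io.IOException;
+import java.io.RandomAccessFile;
+import java.nio.charset.StandardCharsets;
+import java.nio.file.*;
+
+public class SchemaSyncSketch {
+
+  /** Sends only the part of mlog.txt appended since the last schema synchronization. */
+  public static void syncSchema(Path mlog, Path syncSchemaPos) throws IOException {
+    long start = Files.exists(syncSchemaPos)
+        ? Long.parseLong(new String(Files.readAllBytes(syncSchemaPos), StandardCharsets.UTF_8).trim())
+        : 0L;
+
+    try (RandomAccessFile raf = new RandomAccessFile(mlog.toFile(), "r")) {
+      raf.seek(start);
+      byte[] increment = new byte[(int) (raf.length() - start)];
+      raf.readFully(increment);
+      sendToReceiver(increment);  // placeholder for the sender's transfer call
+      // Persist the new position only after the receiver has acknowledged the schema.
+      Files.write(syncSchemaPos, Long.toString(raf.length()).getBytes(StandardCharsets.UTF_8));
+    }
+  }
+
+  private static void sendToReceiver(byte[] mlogIncrement) {
+    // hypothetical: the real sender streams this increment to the receiving end
+  }
+}
+```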
+
+##### Sync data file
+
+For each file path, call the file management module to obtain a list of deleted files and a list of newly added files, and then perform the following process:
+
+1. Start the synchronization task and record `sync start` in `sync.log`
+2. Start synchronizing the list of deleted files and record `sync deleted file names start` in `sync.log`
+3. Notify the receiving end that the list of deleted file names is about to be synchronized
+4. For each file name in the deleted list:
+    4.1. Transfer the file name to the receiver (example `1581324718762-101-1.tsfile`)
+    4.2. After a successful transfer, record `1581324718762-101-1.tsfile` in `sync.log`
+5. Start synchronizing the list of newly added tsfile files and record `sync deleted file names end` and `sync tsfile start` in `sync.log`
+6. Notify the receiver to start syncing files
+7. For each tsfile in the new list:
+    7.1. Transfer the file to the receiver in blocks (example `1581324718762-101-1.tsfile`)
+    7.2. If the file transfer fails, retry it; if the number of attempts exceeds a threshold (configurable by the user, default 5), abandon the file. If the transfer succeeds, record `1581324718762-101-1.tsfile` in `sync.log` (a sketch of this retry loop follows the list)
+8. Notify the receiving end that the synchronization task has ended, and record `sync tsfile end` and `sync end` in `sync.log`
+9. Invoke the file management module to clean up files
+10. End the synchronization task
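+
+A minimal sketch of the per-file retry in step 7, assuming a hypothetical `transferChunked` call that returns whether the receiver accepted the whole file; the limit of 5 mirrors the default mentioned above.
+
+```
+import java.nio.file.Path;
+
+public class TsFileTransferSketch {
+
+  private static final int MAX_RETRIES = 5;  // user-configurable in the real sender
+
+  /** Returns true if the file was transferred and should be recorded in sync.log. */
+  public static boolean syncOneFile(Path tsfile) {
+    for (int attempt = 1; attempt <= MAX_RETRIES; attempt++) {
+      if (transferChunked(tsfile)) {
+        return true;               // the caller records the file name in sync.log
+      }
+    }
+    return false;                  // abandon this file for the current task
+  }
+
+  private static boolean transferChunked(Path tsfile) {
+    // placeholder: send the file in blocks and check the receiver's digest reply
+    return false;
+  }
+}
+```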
+
+#### Recovery module
+
+##### package
+
+org.apache.iotdb.db.sync.sender.recover
+
+##### Process
+
+Each time the sending end of the synchronization tool starts a synchronization task, it first checks whether a folder for the corresponding receiving end exists under the sender folder. If not, no synchronization task has ever been performed with this receiver and the recovery module is skipped; otherwise, the recovery algorithm is run based on the files in that folder (a sketch follows the list):
+
+1. If `current_local_files.txt` exists, skip to step 2; if not, skip to step 3
+2. If `last_local_files.txt` exists, delete the `current_local_files.txt` file and go to step 3; if not, go to step 7
+3. If `sync.log` exists, go to step 4; if not, go to step 8
+4. Load the file names in the `last_local_files.txt` file into an in-memory set, then parse `sync.log` line by line and apply the recorded deletions and additions to the set
+5. Write the file names in the set to the `current_local_files.txt` file
+6. Delete the `last_local_files.txt` file
+7. Rename `current_local_files.txt` to `last_local_files.txt`
+8. Delete the sequence folder and the `sync.log` file
+9. Algorithm ends
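+
+A minimal sketch of the existence checks that drive these steps, reusing the `FileListRebuildSketch.rebuild` routine sketched earlier; deleting the sequence folder is omitted, and the file names match the ones described above.
+
+```
+import java.io.IOException;
+import java.nio.file.*;
+
+public class SenderRecoverySketch {
+
+  public static void recover(Path receiverDir) throws IOException {
+    Path current = receiverDir.resolve("current_local_files.txt");
+    Path last    = receiverDir.resolve("last_local_files.txt");
+    Path syncLog = receiverDir.resolve("sync.log");
+
+    // Steps 1-2: if both lists exist, the half-written current list is stale.
+    if (Files.exists(current) && Files.exists(last)) {
+      Files.delete(current);
+    }
+    // Steps 3-6: replay sync.log (if any) and rebuild the current list.
+    if (Files.exists(last) && Files.exists(syncLog)) {
+      FileListRebuildSketch.rebuild(last, syncLog, current);
+      Files.delete(last);
+    }
+    // Step 7: promote the rebuilt (or leftover) current list.
+    if (Files.exists(current)) {
+      Files.move(current, last);
+    }
+    // Step 8: drop the task log (the sequence folder would be removed here as well).
+    Files.deleteIfExists(syncLog);
+  }
+}
+```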
+
+
+## Sync tool receiver
+
+### Statement of needs
+
+* Because the receiving end needs to receive files from multiple sending ends at the same time, it is necessary to distinguish files from different sending ends and manage these files in a unified manner.
+
+* The receiving end receives the file name, file data, and MD5 value of the file from the transmitting end. After the file is received, it is stored locally at the receiving end, and the received tsfile is verified by checking its MD5 value and its file tail. If the check fails, the file is retransmitted.
+
+* For the data file sent by the sender (which may include operations such as updating the old data and inserting new data), this part of data needs to be merged into the local file of the receiver.
+
+### Module design
+
+#### File transfer module
+
+##### package
+
+org.apache.iotdb.db.sync.receiver.transfer
+
+##### Process
+
+The file transfer module is responsible for receiving the file name and file transmitted from the sender. The process is as follows:
+
+1. Receive the synchronization start instruction from the sender and check whether a `sync.log` file exists. If it does, the data of the last synchronization task has not been fully loaded yet and the new task is rejected; otherwise, record `sync start` in `sync.log`
+2. Receive the sender's instruction to start synchronizing the list of deleted file names and record `sync deleted file names start` in `sync.log`
+3. Receive the deleted file names transmitted by the sender one by one
+    3.1. Receive a file name transmitted by the sender (example `1581324718762-101-1.tsfile`)
+    3.2. On success, record `1581324718762-101-1.tsfile` in `sync.log` and submit it to the file loading module for processing
+4. Receive the instruction to start the transmission of tsfile files and record `sync deleted file names end` and `sync tsfile start` in `sync.log`
+5. Receive the tsfile files transmitted by the sender one by one
+    5.1. Receive a file transmitted by the sender in blocks (example `1581324718762-101-2.tsfile`)
+    5.2. Verify the file; if the verification fails, delete the file and notify the sender of the failure, otherwise record `1581324718762-101-2.tsfile` in `sync.log` and submit it to the file loading module for processing (a sketch of the verification follows the list)
+6. Receive the sync task end command from the sender and record `sync tsfile end` and `sync end` in `sync.log`
+7. Create the empty file `sync.end`
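+
+A minimal sketch of the verification in step 5.2, assuming the sender transmits the expected MD5 digest alongside the file; only the digest comparison is shown, not the check of the file tail.
+
+```
+import java.io.IOException;
+import java.io.InputStream;
+import java.nio.file.*;
+import java.security.MessageDigest;
+import java.security.NoSuchAlgorithmException;
+
+public class FileVerificationSketch {
+
+  /** Returns true if the received file's MD5 digest matches the one sent by the sender. */
+  public static boolean verify(Path receivedFile, String expectedMd5Hex)
+      throws IOException, NoSuchAlgorithmException {
+    MessageDigest md = MessageDigest.getInstance("MD5");
+    try (InputStream in = Files.newInputStream(receivedFile)) {
+      byte[] buf = new byte[8192];
+      int n;
+      while ((n = in.read(buf)) != -1) {
+        md.update(buf, 0, n);
+      }
+    }
+    StringBuilder hex = new StringBuilder();
+    for (byte b : md.digest()) {
+      hex.append(String.format("%02x", b));
+    }
+    return hex.toString().equalsIgnoreCase(expectedMd5Hex);
+  }
+}
+```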
+
+#### File loading module
+
+##### package
+
+org.apache.iotdb.db.sync.receiver.load
+
+##### File deletion
+
+For a file that needs to be deleted (example `1581324718762-101-1.tsfile`), search the in-memory `sequence tsfile list` to see whether the file exists; if so, delete it from the list maintained in memory and delete the file on disk. After successful execution, record `delete 1581324718762-101-1.tsfile` in `load.log`.
+
+##### Load new file
+
+For a file that needs to be loaded (example `1581324718762-101-1.tsfile`), first use `device_owner.log` to check whether the file fits the application scenario, that is, whether the devices it contains are already being synchronized by another sender. If there is a conflict, reject the loading and send an error message to the sender; otherwise, update the `device_owner.log` information.
+
+If the check passes, insert the file into the appropriate position in the sequence tsfile list and move it to the `data/sequence` directory. After successful execution, record `load 1581324718762-101-1.tsfile` in `load.log`. After each file is loaded, check whether the `sync.end` file is present in the synchronization directory. If it is present and the sequence folder is empty, delete the `sync.log` file, and then delete the `load.log` and `sync.end` files.
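+
+A minimal sketch of the first-come-first-served ownership check described above, assuming `device_owner.log` is kept in memory as a map from device name to sender UUID; the names and the persistence format are assumptions.
+
+```
+import java.util.Map;
+import java.util.concurrent.ConcurrentHashMap;
+
+public class DeviceOwnerSketch {
+
+  // device name -> UUID of the sender that first synchronized this device
+  private final Map<String, String> deviceOwner = new ConcurrentHashMap<>();
+
+  /** Returns true if every device in the file may be loaded for this sender. */
+  public boolean checkAndClaim(Iterable<String> devicesInFile, String senderUuid) {
+    for (String device : devicesInFile) {
+      String owner = deviceOwner.putIfAbsent(device, senderUuid);
+      if (owner != null && !owner.equals(senderUuid)) {
+        return false;  // device already owned by another sender: reject the file
+      }
+    }
+    // The real module also appends the new ownership entries to device_owner.log here.
+    return true;
+  }
+}
+```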
+
+#### Recovery module
+
+##### package
+org.apache.iotdb.db.sync.receiver.recover
+
+##### Process
+
+When the Apache IoTDB system starts, each sub-folder under the sync folder is checked in turn; each sub-folder represents the synchronization task of the sender identified by the folder name. The following recovery algorithm is performed based on the files in each sub-folder (a sketch follows the list):
+
+1. If the `sync.log` file does not exist, go to step 4; if it does, go to step 2
+2. Scan `sync.log` line by line and perform the corresponding file deletion and file loading operations; if an operation is already recorded in the `load.log` file, it has already been completed and is skipped. Go to step 3
+3. Delete the `sync.log` file
+4. Delete the `load.log` file
+5. Delete the `sync.end` file
+6. Algorithm ends
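+
+A minimal sketch of the idempotent replay in step 2, assuming `load.log` lines have the `delete <file>` / `load <file>` form described above; the delete and load callbacks stand in for the real file loading module.
+
+```
+import java.io.IOException;
+import java.nio.file.*;
+import java.util.HashSet;
+import java.util.Set;
+
+public class ReceiverRecoverySketch {
+
+  /** Re-applies sync.log, skipping every operation already confirmed in load.log. */
+  public static void replay(Path syncLog, Path loadLog) throws IOException {
+    Set<String> done = Files.exists(loadLog)
+        ? new HashSet<>(Files.readAllLines(loadLog))
+        : new HashSet<>();
+
+    boolean deleting = false;
+    for (String line : Files.readAllLines(syncLog)) {
+      if (line.equals("sync deleted file names start")) { deleting = true;  continue; }
+      if (line.equals("sync tsfile start"))             { deleting = false; continue; }
+      if (line.startsWith("sync"))                      { continue; }  // other markers
+      String op = (deleting ? "delete " : "load ") + line;
+      if (!done.contains(op)) {
+        if (deleting) { deleteFile(line); } else { loadFile(line); }  // then append op to load.log
+      }
+    }
+  }
+
+  private static void deleteFile(String name) { /* delegate to the file loading module */ }
+
+  private static void loadFile(String name)   { /* delegate to the file loading module */ }
+}
+```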
+
+At the beginning of each synchronization task, the receiving end checks and restores the corresponding subfolders.
\ No newline at end of file
diff --git a/docs/Documentation/SystemDesign/7-Connector/2-Hive-TsFile.md b/docs/Documentation/SystemDesign/7-Connector/2-Hive-TsFile.md
new file mode 100644
index 0000000..b393c41
--- /dev/null
+++ b/docs/Documentation/SystemDesign/7-Connector/2-Hive-TsFile.md
@@ -0,0 +1,114 @@
+<!--
+
+    Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements.  See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership.  The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License.  You may obtain a copy of the License at
+    
+        http://www.apache.org/licenses/LICENSE-2.0
+    
+    Unless required by applicable law or agreed to in writing,
+    software distributed under the License is distributed on an
+    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+    KIND, either express or implied.  See the License for the
+    specific language governing permissions and limitations
+    under the License.
+
+-->
+
+## TsFile's Hive connector
+
+TsFile's Hive connector implements support for reading external TsFile type file formats through Hive, enabling users to manipulate TsFile through Hive.
+
+The main functions of the connector:
+
+* Load a single TsFile file into Hive, whether the file is stored on the local file system or in HDFS
+* Load all files in a specific directory into Hive, whether the files are stored in the local file system or HDFS
+* Querying TsFile with HQL
+* Until now, write operations are not supported in hive-connector. Therefore, insert operations in HQL are not allowed
+
+### Design principle
+
+The Hive connector needs to parse the TsFile file format and convert it into a row-by-row format that Hive can recognize. It also needs to format the output according to the user-defined table. Therefore, the implementation of the Hive connector is mainly divided into four parts
+
+* Slicing the entire TsFile file
+* Read data from shards and convert it into a data type that Hive can recognize
+* Parse user-defined Table
+* Deserialize data into Hive's output format
+
+### Concrete implementation class
+
+The above four main functional modules have their corresponding implementation classes. The four implementation classes are introduced below.
+
+#### org.apache.iotdb.hive.TSFHiveInputFormat
+
+This class is mainly responsible for formatting the input TsFile files. It extends `FileInputFormat<NullWritable, MapWritable>`, which already implements some general input-format operations. This class overrides the `getSplits(JobConf, int)` method to customize how TsFile files are split, and the `getRecordReader(InputSplit, JobConf, Reporter)` method to generate the `TSFHiveRecordReader` that actually reads data from a split.
+
+#### org.apache.iotdb.hive.TSFHiveRecordReader
+
+This class is mainly responsible for reading TsFile data from a shard.
+
+It implements the `IReaderSet` interface, a set of setters for internal properties of the class, introduced mainly to factor out the code duplicated between `TSFRecordReader` and `TSFHiveRecordReader`.
+
+```
+public interface IReaderSet {
+
+  void setReader(TsFileSequenceReader reader);
+
+  void setMeasurementIds(List<String> measurementIds);
+
+  void setReadDeviceId(boolean isReadDeviceId);
+
+  void setReadTime(boolean isReadTime);
+}
+```
+
+Let's first introduce some important fields of this class
+
+* private List<QueryDataSet> dataSetList = new ArrayList<>();
+
+  All QueryDataSets generated by this shard
+
+* private List<String> deviceIdList = new ArrayList<>();
+
+  The device name list; its order is consistent with that of dataSetList, that is, deviceIdList[i] is the device name of dataSetList[i].
+
+* private int currentIndex = 0;
+
+  The index of the QueryDataSet currently being processed
+  
+
+In its constructor, this class calls the `initialize(TSFInputSplit, Configuration, IReaderSet, List<QueryDataSet>, List<String>)` method of `TSFRecordReader` to initialize some of the class fields mentioned above. It overrides the `next()` method of `RecordReader` to return the data read from the TsFile.
+
+##### next(NullWritable, MapWritable)
+
+Note that the data read from the TsFile is returned in the form of a `MapWritable`. A `MapWritable` is essentially a `Map` whose keys and values are specially adapted for Hadoop serialization and deserialization. The reading process is as follows (a sketch follows the list):
+
+1. First check whether the `QueryDataSet` at the current position of `dataSetList` still has values. If not, increase `currentIndex` by 1 until the first `QueryDataSet` that has values is found
+2. Then call the `next()` method of that `QueryDataSet` to get a `RowRecord`
+3. Finally, call the `getCurrentValue()` method of `TSFRecordReader`, which puts the values of the `RowRecord` into the `MapWritable`
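+
+A minimal sketch of that iteration with simplified fields; the real method delegates the value filling to `TSFRecordReader` and is only approximated here.
+
+```
+import java.io.IOException;
+import java.util.List;
+import org.apache.hadoop.io.MapWritable;
+import org.apache.hadoop.io.NullWritable;
+import org.apache.iotdb.tsfile.read.common.RowRecord;
+import org.apache.iotdb.tsfile.read.query.dataset.QueryDataSet;
+
+public class NextSketch {
+
+  private List<QueryDataSet> dataSetList;
+  private List<String> deviceIdList;
+  private int currentIndex = 0;
+
+  /** Mirrors the control flow of next(NullWritable, MapWritable). */
+  public boolean next(NullWritable key, MapWritable value) throws IOException {
+    // 1. advance to the first QueryDataSet that still has rows
+    while (currentIndex < dataSetList.size() && !dataSetList.get(currentIndex).hasNext()) {
+      currentIndex++;
+    }
+    if (currentIndex >= dataSetList.size()) {
+      return false;  // the whole split has been consumed
+    }
+    // 2. read one row from the current data set
+    RowRecord row = dataSetList.get(currentIndex).next();
+    String deviceId = deviceIdList.get(currentIndex);
+    // 3. fill the MapWritable (the real code calls TSFRecordReader.getCurrentValue)
+    fillValue(value, deviceId, row);
+    return true;
+  }
+
+  private void fillValue(MapWritable value, String deviceId, RowRecord row) {
+    // placeholder for copying time, device_id and the sensor fields into the map
+  }
+}
+```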
+
+
+#### org.apache.iotdb.hive.TsFileSerDe
+
+This class extends `AbstractSerDe` and is also required for Hive to read data from our custom input format.
+
+It overrides the `initialize()` method of `AbstractSerDe`. In this method, the device names, sensor names, and sensor types are parsed from the SQL of the user-created table. An `ObjectInspector` object is also constructed; it is mainly responsible for data type conversion. Since TsFile only supports primitive data types, an exception must be thrown when any other data type occurs. The specific construction process can be seen in the createObjectIns [...]
+
+The main responsibility of this class is to serialize and deserialize data of different file formats. Since our Hive connector currently only supports read operations and not inserts, only the deserialization path is needed; therefore only the `deserialize(Writable)` method is overridden, and it delegates to the `deserialize()` method of `TsFileDeserializer`.
+
+
+#### org.apache.iotdb.hive.TsFileDeserializer
+
+This class deserializes the data into Hive's output format. It has only one `deserialize()` method.
+
+##### public Object deserialize(List<String>, List<TypeInfo>, Writable, String)
+
+The `Writable` parameter of this method is the `MapWritable` generated by `next()` of `TSFHiveRecordReader`.
+
+First determine whether the `Writable` parameter is of type `MapWritable`; if not, throw an exception.
+
+Then take the value of each sensor of the device out of the `MapWritable` in turn, throwing an exception if a type mismatch is encountered, and finally return the generated result.
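+
+A minimal sketch of that check-and-extract flow using only the Hadoop `Writable` types; the per-column validation against the declared `TypeInfo` is reduced to a comment.
+
+```
+import java.util.ArrayList;
+import java.util.List;
+import org.apache.hadoop.hive.serde2.SerDeException;
+import org.apache.hadoop.io.MapWritable;
+import org.apache.hadoop.io.Text;
+import org.apache.hadoop.io.Writable;
+
+public class DeserializeSketch {
+
+  /** Turns one MapWritable row into a list of column values in column order. */
+  public static Object deserialize(List<String> columnNames, Writable writable)
+      throws SerDeException {
+    if (!(writable instanceof MapWritable)) {
+      throw new SerDeException("Expecting a MapWritable, got " + writable.getClass());
+    }
+    MapWritable map = (MapWritable) writable;
+    List<Object> row = new ArrayList<>(columnNames.size());
+    for (String column : columnNames) {
+      Writable cell = map.get(new Text(column));  // may be null for absent sensors
+      // the real deserializer also checks the cell against the declared TypeInfo here
+      row.add(cell);
+    }
+    return row;
+  }
+}
+```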
diff --git a/docs/Documentation/SystemDesign/7-Connector/3-Spark-TsFile.md b/docs/Documentation/SystemDesign/7-Connector/3-Spark-TsFile.md
new file mode 100644
index 0000000..18708b2
--- /dev/null
+++ b/docs/Documentation/SystemDesign/7-Connector/3-Spark-TsFile.md
@@ -0,0 +1,94 @@
+<!--
+
+    Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements.  See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership.  The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License.  You may obtain a copy of the License at
+    
+        http://www.apache.org/licenses/LICENSE-2.0
+    
+    Unless required by applicable law or agreed to in writing,
+    software distributed under the License is distributed on an
+    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+    KIND, either express or implied.  See the License for the
+    specific language governing permissions and limitations
+    under the License.
+
+-->
+
+# Spark Tsfile connector
+
+## Design goals
+
+* Use Spark SQL to read the data of the specified Tsfile and return it to the client in the form of a Spark DataFrame (a usage sketch follows this list)
+
+* Generate Tsfile with data from Spark Dataframe
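+
+A usage sketch of the read path, written with the Spark Java API; the data source name, the file path and the column used in the filter are illustrative assumptions.
+
+```
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SparkSession;
+
+public class SparkTsFileReadSketch {
+
+  public static void main(String[] args) {
+    SparkSession spark = SparkSession.builder()
+        .appName("tsfile-read-sketch")
+        .master("local[*]")
+        .getOrCreate();
+
+    // Load one TsFile (or a directory of TsFiles) as a DataFrame in the wide table structure.
+    Dataset<Row> df = spark.read()
+        .format("org.apache.iotdb.spark.tsfile")     // assumed data source name
+        .load("hdfs://localhost:9000/test.tsfile");  // illustrative path
+
+    df.createOrReplaceTempView("tsfile_table");
+    spark.sql("SELECT * FROM tsfile_table WHERE `root.ln.wf01.wt01.temperature` > 2.1").show();
+
+    spark.stop();
+  }
+}
+```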
+
+## Supported formats
+Wide table structure: Tsfile native format, IOTDB native path format
+
+| time | root.ln.wf02.wt02.temperature | root.ln.wf02.wt02.status | root.ln.wf02.wt02.hardware | root.ln.wf01.wt01.temperature | root.ln.wf01.wt01.status | root.ln.wf01.wt01.hardware |
+|------|-------------------------------|--------------------------|----------------------------|-------------------------------|--------------------------|----------------------------|
+|    1 | null                          | true                     | null                       | 2.2                           | true                     | null                       |
+|    2 | null                          | false                    | aaa                        | 2.2                           | null                     | null                       |
+|    3 | null                          | null                     | null                       | 2.1                           | true                     | null                       |
+|    4 | null                          | true                     | bbb                        | null                          | null                     | null                       |
+|    5 | null                          | null                     | null                       | null                          | false                    | null                       |
+|    6 | null                          | null                     | ccc                        | null                          | null                     | null                       |
+
+Narrow table structure: Relational database schema, IOTDB align by device format
+
+| time | device_name                   | status                   | hardware                   | temperature |
+|------|-------------------------------|--------------------------|----------------------------|-------------------------------|
+|    1 | root.ln.wf02.wt01             | true                     | null                       | 2.2                           |
+|    1 | root.ln.wf02.wt02             | true                     | null                       | null                          |
+|    2 | root.ln.wf02.wt01             | null                     | null                       | 2.2                          |
+|    2 | root.ln.wf02.wt02             | false                    | aaa                        | null                           |
+|    3 | root.ln.wf02.wt01             | true                     | null                       | 2.1                           |
+|    4 | root.ln.wf02.wt02             | true                     | bbb                        | null                          |
+|    5 | root.ln.wf02.wt01             | false                    | null                       | null                          |
+|    6 | root.ln.wf02.wt02             | null                     | ccc                        | null                          |
+
+## Query process steps
+
+#### 1. Table structure inference and generation
+This step is to make the table structure of the DataFrame match the table structure of the Tsfile to be queried.
+The main logic is the inferSchema function in src/main/scala/org/apache/iotdb/spark/tsfile/DefaultSource.scala
+
+#### 2. SQL parsing
+The purpose of this step is to transform user SQL statements into Tsfile native query expressions.
+
+The main logic is the buildReader function in src/main/scala/org/apache/iotdb/spark/tsfile/DefaultSource.scala. SQL parsing differs between the wide table structure and the narrow table structure, as described below.
+
+#### 3. Wide table structure
+
+The main logic of the SQL analysis for the wide table structure is in src/main/scala/org/apache/iotdb/spark/tsfile/WideConverter.scala. This structure is basically the same as the Tsfile native query structure, so no special processing is required and the SQL statement is converted directly into the corresponding query expression.
+
+#### 4. Narrow table structure
+The main logic of the SQL analysis for the narrow table structure is in src/main/scala/org/apache/iotdb/spark/tsfile/NarrowConverter.scala. Because the narrow table structure differs from the Tsfile native query structure, after the SQL is converted to an expression, the expression must first be converted into a device-related disjunction expression before it can become a Tsfile query. The conversion code is in src/main/java/org/apache/iotdb/spark/tsfile/qp
+
+#### 5. Query execution
+The actual data query execution is performed by the Tsfile native component, see:
+
+* [Tsfile native query process](../1-TsFile/4-Read.md)
+
+## Write process steps
+Writing mainly converts the data in the DataFrame into Tsfile RowRecords and writes them with the Tsfile writer.
+
+#### Wide table structure
+The main conversion code is in the following two files:
+
+* src/main/scala/org/apache/iotdb/spark/tsfile/WideConverter.scala responsible for structural transformation
+
+* src/main/scala/org/apache/iotdb/spark/tsfile/WideTsFileOutputWriter.scala responsible for matching the spark interface and performing writes, which will call the structure conversion function in the previous file
+
+#### Narrow table structure
+The main conversion code is in the following two files:
+
+* src/main/scala/org/apache/iotdb/spark/tsfile/NarrowConverter.scala responsible for structural transformation
+
+* src/main/scala/org/apache/iotdb/spark/tsfile/NarrowTsFileOutputWriter.scala responsible for matching the spark interface and performing writes, which will call the structure conversion function in the previous file
+
diff --git a/docs/Documentation/SystemDesign/7-Connector/4-Spark-IOTDB.md b/docs/Documentation/SystemDesign/7-Connector/4-Spark-IOTDB.md
new file mode 100644
index 0000000..3440638
--- /dev/null
+++ b/docs/Documentation/SystemDesign/7-Connector/4-Spark-IOTDB.md
@@ -0,0 +1,87 @@
+<!--
+
+    Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements.  See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership.  The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License.  You may obtain a copy of the License at
+    
+        http://www.apache.org/licenses/LICENSE-2.0
+    
+    Unless required by applicable law or agreed to in writing,
+    software distributed under the License is distributed on an
+    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+    KIND, either express or implied.  See the License for the
+    specific language governing permissions and limitations
+    under the License.
+
+-->
+
+# Spark IOTDB connector
+
+## Design goals
+
+* Use Spark SQL to read IOTDB data and return it to the client in the form of a Spark DataFrame
+
+## Main idea
+Because IOTDB can parse and execute SQL itself, this part can forward the SQL directly to the IOTDB process for execution and then convert the returned data into an RDD.
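+
+A usage sketch of this idea with the Spark Java API; the data source name and the option keys are assumptions made for illustration.
+
+```
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SparkSession;
+
+public class SparkIotdbReadSketch {
+
+  public static void main(String[] args) {
+    SparkSession spark = SparkSession.builder()
+        .appName("iotdb-read-sketch")
+        .master("local[*]")
+        .getOrCreate();
+
+    // The SQL is executed by IOTDB itself; Spark only wraps the returned data as a DataFrame.
+    Dataset<Row> df = spark.read()
+        .format("org.apache.iotdb.spark.db")            // assumed data source name
+        .option("url", "jdbc:iotdb://127.0.0.1:6667/")  // assumed option keys and values
+        .option("user", "root")
+        .option("password", "root")
+        .option("sql", "select * from root")
+        .load();
+
+    df.printSchema();
+    df.show();
+    spark.stop();
+  }
+}
+```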
+
+## Implementation process
+#### 1. Entrance
+
+* src/main/scala/org/apache/iotdb/spark/db/DefaultSource.scala
+
+#### 2. Building Relation
+Relation mainly saves RDD meta-information, such as column names, partitioning strategies, and so on. Calling Relation's buildScan method can create RDDs
+
+* src/main/scala/org/apache/iotdb/spark/db/IoTDBRelation.scala
+
+#### 3. Building RDD
+The RDD sends the SQL request to IOTDB and keeps the result cursor
+
+* The compute method in src/main/scala/org/apache/iotdb/spark/db/IoTDBRDD.scala
+
+#### 4. Iterating over the RDD
+Because of Spark's lazy evaluation, the RDD is actually iterated only when the user traverses it, which is when the results are fetched from IOTDB
+
+* The getNext method in src/main/scala/org/apache/iotdb/spark/db/IoTDBRDD.scala
+
+
+## Wide and narrow table structure conversion
+Wide table structure: IOTDB native path format
+
+| time | root.ln.wf02.wt02.temperature | root.ln.wf02.wt02.status | root.ln.wf02.wt02.hardware | root.ln.wf01.wt01.temperature | root.ln.wf01.wt01.status | root.ln.wf01.wt01.hardware |
+|------|-------------------------------|--------------------------|----------------------------|-------------------------------|--------------------------|----------------------------|
+|    1 | null                          | true                     | null                       | 2.2                           | true                     | null                       |
+|    2 | null                          | false                    | aaa                        | 2.2                           | null                     | null                       |
+|    3 | null                          | null                     | null                       | 2.1                           | true                     | null                       |
+|    4 | null                          | true                     | bbb                        | null                          | null                     | null                       |
+|    5 | null                          | null                     | null                       | null                          | false                    | null                       |
+|    6 | null                          | null                     | ccc                        | null                          | null                     | null                       |
+
+Narrow table structure: Relational database schema, IOTDB align by device format
+
+| time | device_name                   | status                   | hardware                   | temperature |
+|------|-------------------------------|--------------------------|----------------------------|-------------------------------|
+|    1 | root.ln.wf02.wt01             | true                     | null                       | 2.2                           |
+|    1 | root.ln.wf02.wt02             | true                     | null                       | null                          |
+|    2 | root.ln.wf02.wt01             | null                     | null                       | 2.2                          |
+|    2 | root.ln.wf02.wt02             | false                    | aaa                        | null                           |
+|    3 | root.ln.wf02.wt01             | true                     | null                       | 2.1                           |
+|    4 | root.ln.wf02.wt02             | true                     | bbb                        | null                          |
+|    5 | root.ln.wf02.wt01             | false                    | null                       | null                          |
+|    6 | root.ln.wf02.wt02             | null                     | ccc                        | null                          |
+
+Because the data queried from IOTDB is in the wide table structure by default, a wide-to-narrow table conversion is required when the narrow structure is needed. There are two ways to implement it:
+
+#### 1. Use the IOTDB group by device statement
+This way you can get the narrow table structure directly, and the calculation is done by IOTDB
+
+#### 2. Use Transformer
+You can use Transformer to convert between wide and narrow tables. The calculation is done by Spark.
+
+* src/main/scala/org/apache/iotdb/spark/db/Transformer.scala
+
+The wide-to-narrow conversion traverses the device list to generate the corresponding narrow table, so its parallelization strategy is good (no shuffle). The narrow-to-wide conversion uses a timestamp-based join, which may shuffle data and cause performance issues.
\ No newline at end of file