You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@iotdb.apache.org by hx...@apache.org on 2020/06/29 15:11:39 UTC
[incubator-iotdb] branch master updated: Comparison IoTDB with
other TSDBs (#1189)
This is an automated email from the ASF dual-hosted git repository.
hxd pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-iotdb.git
The following commit(s) were added to refs/heads/master by this push:
new 6da16b4 Comparison IoTDB with other TSDBs (#1189)
6da16b4 is described below
commit 6da16b4902ef9e6ad2736a935a512c9ad4f9fda1
Author: Xiangdong Huang <hu...@tsinghua.edu.cn>
AuthorDate: Mon Jun 29 23:11:21 2020 +0800
Comparison IoTDB with other TSDBs (#1189)
* add a tsdb comaprison article
---
docs/UserGuide/Comparison/TSDB-Comparison.md | 386 +++++++++++++++++++++++++++
site/src/main/.vuepress/config.js | 6 +
2 files changed, 392 insertions(+)
diff --git a/docs/UserGuide/Comparison/TSDB-Comparison.md b/docs/UserGuide/Comparison/TSDB-Comparison.md
new file mode 100644
index 0000000..c12b8f5
--- /dev/null
+++ b/docs/UserGuide/Comparison/TSDB-Comparison.md
@@ -0,0 +1,386 @@
+<!--
+
+ Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements. See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership. The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing,
+ software distributed under the License is distributed on an
+ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ KIND, either express or implied. See the License for the
+ specific language governing permissions and limitations
+ under the License.
+
+-->
+
+# Comparison
+
+## Known Time Series Database
+
+As the time series data is more and more important,
+several open sourced time series databases are intorduced in the world.
+However, few of them are developed for IoT or IIoT (Industrial IoT) scenario in particular.
+
+
+We choose 3 kinds of TSDBs here.
+
+* InfluxDB - Native Time series database
+
+ InfluxDB is one of the most popular TSDBs.
+
+ Interface: InfluxQL and HTTP API
+
+* OpenTSDB and KairosDB - Time series database based on NoSQL
+
+ These two DBs are similar, while the first is based on HBase and the second is based on Cassandra.
+ Both of them provides RESTful style API.
+
+ Interface: Restful API
+
+* TimesacleDB - Time series database based on Relational Database
+
+ Interface: SQL
+
+Prometheus and Druid are also famous for time series data management.
+However, Prometheus focuses on how to collect data, how to visualize data and how to alert warnings.
+Druid focuses on how to analyze data with OLAP workload. We omit them here.
+
+
+## Comparison
+We compare the above time series database from two aspects: the feature comparison and the performance
+comparison.
+
+
+### Feature Comparison
+
+I list the basic features comparison of these databases.
+
+Legend:
+- O: big support greatly
+- o: support
+- x: not support
+- :\-( : support but not very good
+- ?: unknown
+
+
+#### Basic Features
+
+| TSDB | IoTDB | InfluxDB | OpenTSDB | KairosDB | TimescaleDB |
+|-----------------------------|-----------------------------|------------|------------|------------|-------------|
+| OpenSource | **o** | o | o | **o** | o |
+| SQL\-like | o | o | x | x | **O** |
+| Schema | "Tree\-based, tag\-based\" | tag\-based | tag\-based | tag\-based | Relational |
+| Writing out\-of\-order data | o | o | o | o | o |
+| Schema\-less | o | o | o | o | o |
+| Batch insertion | o | o | o | o | o |
+| Time range filter | o | o | o | o | o |
+| Order by time | **O** | o | x | x | o |
+| Value filter | o | o | x | x | o |
+| Downsampling | **O** | o | o | o | o |
+| Fill | **O** | o | o | x | o |
+| LIMIT | o | o | o | o | o |
+| SLIMIT | o | o | x | x | ? |
+| Latest value | O | o | o | x | o |
+
+**Details**
+
+* OpenSource:
+
+ * IoTDB uses Apache License 2.0 and it is in Apache incubator.
+ * InfluxDB uses MIT license. However, **the cluster version is not open sourced**.
+ * OpenTSDB uses LGPL2.1, which **is not compatible with Apache License**.
+ * KairosDB uses Apache License 2.0.
+ * TimescaleDB uses Timescale License, which is not free for enterprise.
+
+* SQL like:
+
+ * IoTDB and InfluxDB supports SQL like language. Besides, The integration of IoTDB and Calcite is alomost done (a PR has been submitted), which means IoTDB will support Standard SQL.
+ * OpenTSDB and KairosDB only support Rest API. Besides, IoTDB also supports Rest API (a PR has been submitted).
+ * TimescaleDB uses the SQL the same with PG.
+
+* Schema:
+
+ * IoTDB: IoTDB proposes a [Tree based schema](http://iotdb.apache.org/UserGuide/Master/Concept/Data%20Model%20and%20Terminology.html).
+ It is quite different with other TSDBs. However, the kind of schema has the following advantages:
+
+ * In many industrial scenarios, the management of devices are hierarchical, rather than flat.
+ That is why we think a tree based schema is better than tag-value based schema.
+
+ * In many real world applications, tag names are constant. For example, a wind turbine manufacturer
+ always identify their wind turbines by which country it locates, the farm name it belongs to, and its ID in the farm.
+ So, a 4-depth tree ("root.the-country-name.the-farm-name.the-id") is fine.
+ You do not need to repeat to tell IoTDB the 2nd level of the tree is for country name,
+ the 3rd level is for farm id, etc..
+
+ * A path based time series ID definition also supports flexible queries, like "root.\*.a.b.\*", wehre \* is wildcard character.
+
+ * InfluxDB, KairosDB, OpenTSDB are tag-value based, which is more popular currently.
+
+ * TimescaleDB uses relational table.
+
+* Order by time:
+
+ Order by time seems quite trivil for time series database. But... if we consider another feature, called align by time,
+ something becomes interesting. And, that is why we mark OpenTSDB and KairosDB unsupported.
+
+ Actually, in each time series, all these TSDBs support order data by timestamps.
+
+ However, OpenTSDB and KairosDB do not support order the data from different timeseries in the time order.
+
+ Ok, considering a new case: I have two time series, one is for the wind speed in wind farm1,
+ another is for the generated energy of wind turbine1 in farm1. If we want to analyze the relation between the
+ wind speed and the generated energy, we have to know the values of both at the same time.
+ That is to say, we have to align the two time series in the time dimension.
+
+ So, the result should be:
+
+ | timestamp | wind speed | generated energy |
+ |-----------|-------------|------------------|
+ | 1 | 5.0 | 13.1 |
+ | 2 | 6.0 | 13.3 |
+ | 3 | null | 13.1 |
+
+ or,
+
+ | timestamp | series name | value |
+ |-----------|-------------------|------------|
+ | 1 | wind speed | 5.0 |
+ | 1 | generated energy | 13.1 |
+ | 2 | wind speed | 6.0 |
+ | 2 | generated energy | 13.3 |
+ | 3 | generated energy | 13.1 |
+
+ Though the second table format does not align data by the time dimension, but it is easy to be implemented in the client-side,
+ by just scanning data row by row.
+
+ IoTDB supports the first table format (called align by time), InfluxDB supports the second table format.
+
+* Downsampling:
+
+ Downsampling is for changing the granularity of timeseries, e.g., from 10Hz to 1Hz, or 1 point per day.
+
+ Different with other systems, IoTDB downsamples data in real time, while others serialized downsampled data on disk.
+ That is to say,
+
+ * IoTDB supports **adhoc** downsampling data in **arbitrary time**.
+ e.g., a SQL returns 1 point per 5 minutes and start with 2020-04-27 08:00:00 while another SQL returns 1 point per 5 minutes + 10 seconds and start with 2020-04-27 08:00:01.
+ (InfluxDB also supports adhoc downsampling but the performance is ..... hm)
+
+ * There is no disk loss for IoTDB.
+
+
+* Fill:
+
+ Sometimes we thought the data is collected in some fixed frequency, e.g., 1Hz (1 point per second).
+ But usually, we may lost some data points, because the network is unstable, the machine is busy, or the machine is down for several minutes.
+
+ In this case, filling these holes is important. Data scientists can avoid to many so called dirty work, e.g., data clean.
+
+ InfluxDB and OpenTSDB only support using fill in a group by statement, while IoTDB supports to fill data when just given a particular timestamp.
+ Besides, IoTDB supports several strategies for filling data.
+
+* Slimit:
+
+ Slimit means return limited number of measurements (or, fields in InfluxDB).
+ For example, a wind turbine may have 1000 measurements (speed, voltage, etc..), using slimit and soffset can just return a part of them.
+
+
+* Latest value:
+
+ As one of the most basic timeseries based applications is monitoring the latest data.
+ Therefore, a query to return the latest value of a time series is very important.
+ IoTDB and OpenTSDB support that with a special SQL or API,
+ while InfluxDB supports that using an aggregation function.
+ (the reason why IoTDB provides a special SQL is IoTDB optimizes the query expressly.)
+
+
+
+**Conclusion**:
+
+Well, if we compare the basic features, we can find that OpenTSDB and KairosDB somehow lack some important query features.
+TimescaleDB can not be freely used in business.
+IoTDB and InfluxDB can meet most requirements of time series data management, while they have some difference.
+
+
+#### Advanced Features
+
+I listed some interesting features that these systems may differ.
+
+| TSDB | IoTDB | InfluxDB | OpenTSDB | KairosDB | TimescaleDB |
+|-----------------------------|---------------------------------|------------|------------|------------|-------------|
+| Align by time | **O** | o | x | x | o |
+| Compression | **O** | :\-( | :\-\( | :\-\( | :\-\( |
+| MQTT support | **O** | o | x | x | :\-\( |
+| Run on Edge-side Device | **O** | o | x | :\-\( | o |
+| Multi\-instance Sync | **O** | x | x | x | x |
+| JDBC Driver | **o** | x | x | x | x |
+| Standard SQL | o | x | x | x | **O** |
+| Spark integration | **O** | x | x | x | x |
+| Hive integration | **O** | x | x | x | x |
+| Writing data to NFS (HDFS) | **O** | x | o | x | x |
+| Flink integration | **O** | x | x | x | x |
+
+
+* Align by time: have been introduced. Let's skip it..
+
+* Compression:
+ * IoTDB supports many encoding and compression for time series, like RLE, 2DIFF, Gorilla, etc.. and Snappy compression.
+ In IoTDB, you can choose which encoding method you want, according to the data distribution. For more info, see [here](http://iotdb.apache.org/UserGuide/Master/Concept/Encoding.html).
+ * InfluxDB also supports encoding and compression, but you can not define which encoding method you want.
+ It just depends on the data type. For more info, see [here](https://docs.influxdata.com/influxdb/v1.7/concepts/storage_engine/).
+ * OpenTSDB and KairosDB use HBase and Cassandra in backend, and have no special encoding for time series.
+
+* MQTT protocol support:
+
+ MQTT protocol is an international standard and widely known in industrial users. only IoTDB and InfluxDB support user using MQTT client to write data.
+
+* Running on Edge-side Device:
+
+ Nowdays, edge computing is more and more popular, which means the edge device has more powerful compution resources.
+ Deploying a TSDB on the edge side is useful for managing data on the edge side and serve for edge computing.
+ As OpenTSDB and KairosDB rely another DB, the architecture is a little heavy. Especially, it is hard to run Hadoop on the edge side.
+
+* Multi-instance Sync:
+
+ Ok, now we have many TSDB instances on the edge-side. Then, how to upload their data to the data center, to form a ... data lake (or ocean, river,..., whatever).
+ One choice is read data from these instances and write the data point by point to the data center instance.
+ IoTDB provides another choice, just uploading the data file into the data center incrementally, then the data center can support service on the data.
+
+* JDBC driver:
+
+ Now only IoTDB supports a JDBC driver (though not all interfaces are implemented), and makes it possible to integrate many other JDBC driver based softwares.
+
+* Standard SQL:
+
+ As mentioned, the integration of IoTDB and Calcite is almost done (a PR has been submitted), which means IoTDB will support Standard SQL.
+
+* Spark and Hive integration:
+
+ It is very very important that letting big data analysis software to access the data in database for more complex data analysis.
+ IoTDB supports Hive-connector and Spark connector for better integration.
+
+* Writing data to NFS (HDFS):
+ Sharing nothing architecture is good, but sometimes you have to add new servers even your CPU and memory is idle but the disk is full...
+ Besides, if we can save the data file directly to HDFS, it will be more easy to use Spark and other softwares to analyze data, without ETL.
+
+ * IoTDB supports write data locally or on HDFS directly. IoTDB also allows user extend to store data on other NFS.
+ * InfluxDB, KairosDB have to write data locally.
+ * OpenTSDB has to write data on HDFS.
+
+**Conclusion**:
+
+ We can find that IoTDB has many powerful features that other TSDBs do not support.
+
+### Performance Comparison
+
+Ok... If you say, "well, I just want to use the basic features. If so, IoTDB has little difference with others.".
+It is somehow right. But, if you consider the performance, you may change your mind.
+
+#### quick review
+
+Given a workload:
+
+* Write:
+
+10 clients write data concurrently. The number of storage group is 50. There are 1000 devices and each device has 100 measurements (i.e.,, 100K time series totally).
+The data type is float and IoTDB uses RLE encoding and Snappy compression.
+IoTDB uses batch insertion API and the batch size is 100 (write 100 data points per write API call).
+
+* Read:
+
+50 clients read data concurrently. Each client just read data from 1 device with 10 measurements in one storage group.
+
+IoTDB is v0.9.0.
+
+**Write performance**:
+
+We write 112GB data totally.
+
+The write throughput (points/second) is:
+
+![Write Throughput (points/second)](https://user-images.githubusercontent.com/1021782/80472896-f1db0e00-8977-11ea-9424-96bf0021588d.png)
+<span id = "exp1"> <center>Figure 1. Write throughput (points/second) IoTDB v0.9</center></span>
+
+
+The disk occupation is:
+
+![Disk Occupation](https://user-images.githubusercontent.com/1021782/80472899-f3a4d180-8977-11ea-8233-268ad4e3713e.png)
+<center>Figure 2. Disk occupation(GB) IoTDB v0.9</center>
+
+**Query performance**
+
+![Aggregation query](https://user-images.githubusercontent.com/1021782/80472924-fef7fd00-8977-11ea-9ad4-b4d3c899605e.png)
+<center>Figure 3. Aggregation query time cost(ms) IoTDB v0.9</center>
+
+We can see that IoTDB outperforms others.
+
+
+#### More details
+
+We provide a benchmarking tool, called IoTDB-benchamrk (https://github.com/thulab/iotdb-benchmark, you may have to use the dev branch to compile it),
+it supports IoTDB, InfluxDB, KairosDB, TimescaleDB, OpenTSDB. We have a [article](https://arxiv.org/abs/1901.08304) for comparing these systems using the benchmark tool.
+When we publishing the article, IoTDB just entered Apache incubator, so we deleted the performance of IoTDB in that article. But we really did the comparison, and I will
+disclose some results here.
+
+- **IoTDB: 0.8.0**. (notice: **IoTDB v0.9 outperforms than v0.8**, we will update the result once we finish the experiments on v0.9)
+- InfluxDB: 1.5.1.
+- OpenTSDB: 2.3.1 (HBase 1.2.8)
+- KairosDB: 1.2.1 (Cassandra 3.11.3)
+- TimescaleDB: 1.0.0 (PostgreSQL 10.5)
+
+All TSDB run on the same server one by one.
+
+- For InfluxDB, we set the cache-max-memory-size and max-series-perbase as unlimited (otherwise it will be timeout quickly)
+
+- For OpenTSDB, we modified tsd.http.request.enable_chunked, tsd.http.request.max_chunk and tsd.storage.fix_duplicates for supporting write data in batch
+and write out-of-order data.
+
+- For KairosDB, we set Cassandra's read_repair_chance as 0.1 (However it has no effect because we just have one node).
+
+- For TimescaleDB, we use PGTune tool to optimize PostgreSQL.
+
+All TSDBs run on a server with Intel Xeon CPU E5-2697 v4 @2.3GHz, 256GB memory and 10 HDD disks with RAID-5.
+The OS is Ubuntu 16.04.2 LTS, 64bits.
+
+Another server run IoTDB benchmark tool.
+
+I omit the detailed workload here, let's see the result:
+
+Legend:
+- I: InfluxDB
+- O: OpenTSDB
+- T: TimescaleDB
+- K: KairosDB
+- **D: IoTDB**
+
+![Write experiments](https://user-images.githubusercontent.com/1021782/80476160-95c6b880-897c-11ea-9bb3-9d810cc0c79e.png)
+<span id = "exp4"><center>Figure 4. Write experiments IoTDB v0.8.0</center></span>
+
+![Query experiments](https://user-images.githubusercontent.com/1021782/80476181-9c553000-897c-11ea-8170-4768134f5841.png)
+<center>Figure 5. Query experiments IoTDB v0.8.0</center>
+
+We can see that IoTDB outperforms others hugely.
+
+In [Figure. 4(c)](#exp4), when the batch size reaches to 10000 points, InfluxDB is better than IoTDB v0.8.
+It is because in IoTDB v0.8, batch insert API is not optimized.
+
+From IoTDB v0.9 on, using batch insert API can obtain 8 to 10 times write performance improvement.
+
+
+For example, using IoTDB v0.8, the write throughput can only reach to 6 million data points per second.
+But using IoTDB v0.9, the write throughput can reach to 40 million data points per second on the same server with the same workload.
+(see [Figure. 4(a)](#exp4) vs [Figure. 1](#exp1)).
+
+
+## Conclusion
+
+If you are considering to find a TSDB for your IIoT application, then Apache IoTDB, a new time series, is your best choice.
+
+We will update this page once we release new version and finish the experiments.
+We also welcome more contributors correct this article and contribute IoTDB and reproduce experiments.
diff --git a/site/src/main/.vuepress/config.js b/site/src/main/.vuepress/config.js
index c0d9bb9..e135026 100644
--- a/site/src/main/.vuepress/config.js
+++ b/site/src/main/.vuepress/config.js
@@ -489,6 +489,12 @@ var config = {
['Architecture/Shared Nothing Cluster','Shared Nothing Cluster']
]
},
+ {
+ title: 'Comparison with TSDBs',
+ children: [
+ ['Comparison/TSDB-Comparison','Comparison']
+ ]
+ }
],
'/SystemDesign/': [
{