Posted to general@incubator.apache.org by hxd <hx...@qq.com> on 2018/11/14 11:28:49 UTC

[Result][Vote] vote for IoTDB incubation proposal

Hi,

With 8 binding +1 votes, 2 non-binding +1 votes, and no +0 or -1 votes, this VOTE passes.

Thanks to everyone who voted!

Below is the voting tally:

Binding
 Von Gosling
 Christofer Dutz
 Kevin A. McGrail
 Felix Cheung
 Matt Sticker
 Joe Witt
 Justin Mclean
 Willem Jiang


Non-binding
 Sheng Wu
 Yang Bo
 
The vote thread: https://lists.apache.org/thread.html/077f029ab2b52a2b19fc8d41c07438f660a8e93dd87b3895d262263c@%3Cgeneral.incubator.apache.org%3E
The proposal: https://wiki.apache.org/incubator/IoTDBProposal

Thanks,

Xiangdong Huang


> On Nov 7, 2018, at 3:46 PM, hxd <hx...@qq.com> wrote:
> 
> Hi,
> 
> Sorry for the previous mail with bad format.
> I'd like to call a VOTE to accept the IoTDB project, a database for managing large amounts of time series data from IoT sensors in industrial applications, into the Apache Incubator.
> The full proposal is available on the wiki: https://wiki.apache.org/incubator/IoTDBProposal
> and it is also attached below for your convenience.
> 
> Please cast your vote:
> 
>   [ ] +1, bring IoTDB into Incubator
>   [ ] +0, I don't care either way,
>   [ ] -1, do not bring IoTDB into Incubator, because...
> 
> The vote will be open for at least 72 hours.
> 
> Thanks,
> Xiangdong Huang.
> 
> 
> = IoTDB Proposal  =
> v0.1.1
> 
> 
> == Abstract ==
> IoTDB is a data store for managing large amounts of time series data such as timestamped data from IoT sensors in industrial applications.
> 
> == Proposal ==
> IoTDB is a database for managing large amounts of time series data with columnar storage, data encoding, pre-computation, and index techniques. It has a SQL-like interface, can write millions of data points per second per node, and is optimized to return query results within a few seconds over trillions of data points. It can also be easily integrated with Apache Hadoop MapReduce and Apache Spark for analytics.
> 
> == Background ==
> 
> A new class of data management system requirements is becoming increasingly important with the rise of the Internet of Things. There are some database systems and technologies aimed at time series data management. For example, Gorilla and InfluxDB are mainly built for data centers and monitoring application metrics. Other systems, for example, OpenTSDB and KairosDB, are built on Apache HBase and Apache Cassandra, respectively.
> 
> However, many time series data management applications, especially industrial ones, have additional requirements:
> 
>  * Supporting time series data which has high data frequency. For example, a turbine engine may generate 1000 points per second (i.e., 1000Hz), while each CPU only reports 1 data point every 5 seconds in a data center monitoring application.
> 
>  * Supporting data scans at multiple resolutions. For example, aggregation operations are important for time series data.
> 
>  * Supporting special queries for time series, such as pattern matching, time series segmentation, time-frequency transformation and frequency query.
> 
>  * Supporting a large number of monitoring targets (i.e., time series). An excavator may report more than 1000 time series, for example, the revolving speed of the motor-engine, the speed of the excavator, the acceleration, the temperature of the water tank and so on, while a CPU or an application monitor has far fewer time series.
> 
>  * Optimization for out-of-order data points. In the industrial sector, it is common that equipment sends data using the UDP protocol rather than the TCP protocol. Sometimes, the network connection is unstable and parts of the data will be buffered for later sending.
> 
>  * Supporting long-term storage. Historical data is precious for equipment manufacturers. Therefore, removing or unloading historical data is highly desired for most industrial applications. The database system must not only support fast retrieval of historical data, but also should guarantee that the historical data does not impact the processing speed for “hot” or current data.
> 
>  * Supporting online transaction processing (OLTP) as well as complex analytics. Analyzing the data files directly with Apache Spark/Apache Hadoop MapReduce is better than first transforming the data files to another file format for Big Data analytics.
> 
>  * Flexible deployment either on premise or in the cloud. IoTDB is simple enough to be deployed on a Raspberry Pi handling hundreds of time series. Meanwhile, the system can also be deployed in the cloud so that it supports tens of millions of ingestions per second, OLTP queries in milliseconds, and analytics using Apache Spark/Apache Hadoop MapReduce.
> 
>  * * (1) If users deploy IoTDB on a device, such as a Raspberry Pi, a wind turbine, or a meteorological station, the deployment needs to be simple. A device may have hundreds of time series (but fewer than a thousand) and the database needs to handle them.
>  * * (2) When IoTDB is deployed in a data center, computational resources (i.e., the hardware configuration of the servers) are not a constraint compared to a Raspberry Pi. In this deployment, IoTDB can use more computational resources and can handle more time series (e.g., millions of time series).
> 
> Based on these requirements, we developed IoTDB, a new data store system for managing time series data. 
> 
> IoTDB started as a Tsinghua University research project. IoTDB's developer community has since grown to include additional institutions, for example, universities (e.g., Fudan University), research labs (e.g., NEL-BDS lab), and corporations (e.g., K2Data, Tencent). Funding has been provided by various institutions including the National Natural Science Foundation of China, and industry sponsors, such as Lenovo and K2Data.
> 
> == Rationale ==
> Because no existing open-source time series database covers all the above requirements, we developed IoTDB. As the system matures, we are seeking a long-term home for the project. We believe the Apache Software Foundation would be an ideal fit. Joining Apache will also help coordinate and improve the development effort of the growing number of organizations which contribute to IoTDB, improving the diversity of our community.
> 
> IoTDB contains multiple modules, which are classified into categories:
> 
>  * '''TsFile Format''': TsFile is a new columnar file format. 
>  * '''Adaptor for Analytics and Visualization''': Integrating TsFile with Apache Hadoop HDFS, Apache Hadoop MapReduce and Apache Spark. Examples of integrating IoTDB with Apache Kafka, Apache Storm and Grafana are also provided.
>  * '''IoTDB Engine''': An engine which consists of a SQL parser, query plan generator, memtable, authentication and authorization, write ahead log (WAL), crash recovery, out-of-order data handler, and index for aggregation and pattern matching. The engine stores system data in TsFile format.
>  * '''IoTDB JDBC''': An implementation of Java Database Connectivity (JDBC) for clients to connect to IoTDB using Java.
> 
> === TsFile Format ===
> 
> The TsFile format is a columnar store, which is similar to Apache Parquet and Apache CarbonData. It has the concepts of Chunk Group, Column Chunk, Page and Footer. Compared with Apache Parquet and Apache CarbonData, it is designed and optimized for time series:
> 
> ==== Time Series Friendly Encoding ====
> IoTDB currently supports run length encoding (RLE), delta-of-delta encoding, and Facebook's Gorilla encoding. 
> 
> Lossy encoding methods (e.g., Piecewise Linear Approximation (PLA) and time-frequency transformation) are works-in-progress.
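
To illustrate why delta-of-delta encoding suits regularly sampled timestamps, here is a minimal sketch. It only shows the integer transform, not the bit-packing that a real encoder applies afterwards, and the function names `dod_encode` and `dod_decode` are hypothetical, not part of TsFile.

```python
def dod_encode(timestamps):
    """Turn timestamps into [first, delta-of-delta, delta-of-delta, ...].

    Assumption: the delta before the first point is treated as 0, so the
    second element of the output is simply the first delta.
    """
    if not timestamps:
        return []
    encoded = [timestamps[0]]
    prev_delta = 0
    for prev, cur in zip(timestamps, timestamps[1:]):
        delta = cur - prev
        encoded.append(delta - prev_delta)  # small when sampling is regular
        prev_delta = delta
    return encoded


def dod_decode(encoded):
    """Inverse of dod_encode."""
    if not encoded:
        return []
    timestamps = [encoded[0]]
    delta = 0
    for dod in encoded[1:]:
        delta += dod
        timestamps.append(timestamps[-1] + delta)
    return timestamps


ts = [1521622662000, 1521622662001, 1521622662002, 1521622662003, 1521622662005]
print(dod_encode(ts))
```

For a sensor sampled at a fixed rate, almost every value after the first two is 0, which is what makes the subsequent compact encoding effective.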
> 
> 
> ==== Chunk Group ====
> The data part of a TsFile consists of many Chunk Groups. Each Chunk Group stores the data of a device over a time interval. A Chunk Group is similar to the row group in Apache Parquet, but with some constraints on the time dimension: for each device, the time intervals of different Chunk Groups do not overlap, and a later Chunk Group always has larger timestamps.
> 
> Given a TsFile and a query with a time range filter, the query process can terminate scanning data once it reads data points whose timestamp reaches the time limit of the filter. We call the feature ''fast-return'' and it makes the time range query in a TsFile very efficient.
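
The ''fast-return'' idea can be sketched as follows. This is an illustration under simplifying assumptions (one device, Chunk Groups modeled as in-memory lists of `(timestamp, value)` pairs), and `scan_time_range` is a hypothetical name, not an IoTDB API.

```python
def scan_time_range(chunk_groups, start, end):
    """Scan time-ordered, non-overlapping chunk groups for start <= ts <= end.

    Returns (matching_points, points_read). Because later chunk groups only
    hold larger timestamps, the scan stops at the first timestamp past `end`.
    """
    result = []
    points_read = 0
    for group in chunk_groups:
        for ts, value in group:
            points_read += 1
            if ts > end:
                return result, points_read  # fast-return: nothing later can match
            if ts >= start:
                result.append((ts, value))
    return result, points_read


groups = [[(1, "a"), (2, "b")], [(3, "c"), (4, "d")], [(5, "e")]]
print(scan_time_range(groups, 2, 3))
```

In this example the scan reads 4 of the 5 stored points and never touches the last Chunk Group.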
> 
> 
> 
> ==== Different Column Chunk Format (No Repetition (R) and Definition (D) Fields Needed) ====
> 
> While Apache Parquet and Apache CarbonData support complex data types, e.g., nested data and sparse columns, TsFile is exclusively designed for time series whose data model is \<device_id, series_id, timestamp, value\>. 
> 
> In a `Chunk Group`, each time series is a `Column Chunk`. Even though these time series belong to the same device, the data points in different time series are not aligned in the time dimension originally. 
> 
> For example, if you have a device with 2 sensors that collect data at the same frequency, sensor 1 may collect data at time 1521622662000 while the other one collects data at time 1521622662001 (delta=1ms). Therefore, each Column Chunk has its own timestamps and values, which is quite different from Apache Parquet and Apache CarbonData. Because different time series have different data frequencies, we store the time column along with each value column instead of making different chunks share the same time column; as a result, we do not store any null values on disk to align data across time series. Besides, we do not need to attach `repetition` (R) and `definition` (D) fields to each value. Therefore, disk space is saved and query latency is reduced (because we do not align data by calculating R and D fields).
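
A small sketch of the padding cost that a shared time column would incur for the two-sensor example. The helper names `aligned_rows` and `count_nulls` are hypothetical and exist only for this illustration; the value readings are made up.

```python
def aligned_rows(series):
    """Align several series on the union of their timestamps.

    series: dict mapping a series name to a list of (timestamp, value).
    Missing readings are padded with None, as a shared time column requires.
    """
    names = sorted(series)
    lookup = {name: dict(series[name]) for name in names}
    all_ts = sorted({ts for points in series.values() for ts, _ in points})
    return [(ts, *[lookup[name].get(ts) for name in names]) for ts in all_ts]


def count_nulls(rows):
    """Count padded cells in the value columns."""
    return sum(1 for row in rows for cell in row[1:] if cell is None)


series = {
    "sensor1": [(1521622662000, 1.0), (1521622663000, 1.1)],
    "sensor2": [(1521622662001, 7.0), (1521622663001, 7.1)],
}
rows = aligned_rows(series)
print(len(rows), count_nulls(rows))
```

With a 1 ms offset between the sensors, half of the value cells in the aligned layout are padding, whereas storing each series with its own time column stores 4 timestamps and 4 values with no nulls.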
> 
> 
> ==== Domain Specific Information in Each Page ====
> Similar to Apache Parquet and Apache CarbonData, a `Column Chunk` consists of several `Pages`, and each `Page` has a `Page header`. The `Page header` is a summary of the data in the page. 
> 
> Because TsFile is optimized for time series, the page header contains more domain specific information, such as the minimal and maximal value, the minimal and the maximal timestamp, the frequency and so on. TsFile can even store the histogram of values in the page header. 
> 
> This header information helps IoTDB in speeding up queries by skipping unnecessary pages.
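
Page skipping from header statistics can be sketched like this. The `PageHeader` fields below are a simplified assumption based on the statistics listed above, not the actual TsFile layout, and `pages_to_read` is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass
class PageHeader:
    # Simplified, assumed subset of the per-page statistics.
    min_time: int
    max_time: int
    min_value: float
    max_value: float


def pages_to_read(headers, t_start, t_end, value_gt):
    """Indexes of pages that may hold points with t_start <= t <= t_end
    and value > value_gt. All other pages can be skipped unread."""
    keep = []
    for i, header in enumerate(headers):
        if header.max_time < t_start or header.min_time > t_end:
            continue  # no time overlap with the query range
        if header.max_value <= value_gt:
            continue  # no value in this page can satisfy the filter
        keep.append(i)
    return keep


headers = [
    PageHeader(0, 99, 1.0, 5.0),
    PageHeader(100, 199, 2.0, 9.0),
    PageHeader(200, 299, 0.5, 3.0),
]
print(pages_to_read(headers, 50, 250, 4.0))
```

Here the third page overlaps the time range but is still skipped, because its maximum value cannot pass the value filter.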
> 
> 
> === Adaptor for Analytics ===
> The TsFile provides:
> 
>  * InputFormat/OutputFormat interfaces for Reading/Writing data.
>  * Deep integration with Apache Spark/Hadoop MapReduce including predicate push-down, column pruning, aggregation push down, etc. So users can use Apache Spark SQL/HiveQL to connect and query TsFiles.
> 
> 
> === IoTDB Engine ===
> The IoTDB engine is a database engine, which uses TsFile as its storage file format. The IoTDB Engine supports SQL-like query plus many useful functions:
> 
>  * Tree-based time series schema
>  * Log-Structured Merge (LSM)-based storage
>  * Overflow file for out-of-order data
>  * Scalable index framework
>  * Special queries for time series
> 
> ==== Tree-based Time Series Schema ====
> IoTDB manages all the time series definitions using a tree structure. A path from the root of the tree to a leaf node represents a time series. Therefore, the unique id of a time series is a path, e.g., `root.China.beijing.windFarm1.windTurbine1.speed`. 
> 
> This kind of schema can express `group by` naturally. For example, `root.China.beijing.windFarm1.*.speed` represents the speed of all the wind turbines in wind farm 1 in Beijing, China.
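
A minimal sketch of how such a path pattern selects time series. It assumes that `*` matches exactly one level of the tree, which is an assumption made for this illustration; `match_path` and `select_series` are hypothetical names.

```python
def match_path(pattern, path):
    """True if `path` matches `pattern`, where `*` stands for one tree level."""
    pattern_nodes = pattern.split(".")
    path_nodes = path.split(".")
    if len(pattern_nodes) != len(path_nodes):
        return False
    return all(p == "*" or p == n for p, n in zip(pattern_nodes, path_nodes))


def select_series(pattern, all_paths):
    """Return every registered series path selected by the pattern."""
    return [path for path in all_paths if match_path(pattern, path)]


paths = [
    "root.China.beijing.windFarm1.windTurbine1.speed",
    "root.China.beijing.windFarm1.windTurbine2.speed",
    "root.China.beijing.windFarm1.windTurbine1.temperature",
    "root.China.beijing.windFarm2.windTurbine1.speed",
]
print(select_series("root.China.beijing.windFarm1.*.speed", paths))
```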
> 
> ==== Log-Structured Merge (LSM)-based Storage ====
> In a time series, the data points should be ordered by their timestamps. In IoTDB, we use a Log-Structured Merge (LSM) based mechanism. A part of the data is stored in memory first, in what is called the `memtable`. At this stage, if data points arrive out of order, we re-sort them in memory. When this part of the data exceeds the configured memory limit, we flush it to disk as a `Chunk Group` in an unclosed TsFile. As a result, a TsFile may contain several Chunk Groups, which reduces the number of small data files; this helps reduce the I/O load of the storage system and the execution time of a file-merge in LSM. Notice that the data is time-ordered within one Chunk Group on disk, and this layout is helpful for fast filtering within one Chunk Group for a query.
> 
> Rule 1: In a TsFile, the Chunk Groups of one device are ordered by timestamp, which is helpful for fast filtering among Chunk Groups for a query.
> 
> Rule 2: When the size of the unclosed TsFile reaches the threshold defined in the configuration file, we close the file and generate a new one to store newly arriving data. Like many systems which use LSM-based storage, we never modify a TsFile which has been closed, except in the file-merge process.
> 
> Rule 3: To reduce the number of TsFiles involved in a query process, we guarantee that the data points in different TsFiles do not overlap in the time dimension after file merging.
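
The memtable behaviour described in this section can be sketched as follows. This is a toy model under stated assumptions: a single time series, a limit expressed as a point count rather than a memory size, and Chunk Groups kept in a Python list instead of being written to a TsFile. The class name `Memtable` and its members are hypothetical.

```python
class Memtable:
    """Toy memtable: buffers points, re-sorts them, flushes sorted chunk groups."""

    def __init__(self, max_points):
        self.max_points = max_points      # stand-in for the configured memory limit
        self.points = {}                  # timestamp -> value
        self.flushed_chunk_groups = []    # stand-in for chunk groups on disk

    def insert(self, timestamp, value):
        self.points[timestamp] = value
        if len(self.points) >= self.max_points:
            self.flush()

    def flush(self):
        if self.points:
            # Sorting here is what absorbs out-of-order arrivals within the buffer.
            self.flushed_chunk_groups.append(sorted(self.points.items()))
            self.points = {}


memtable = Memtable(max_points=3)
for ts, value in [(3, 0.3), (1, 0.1), (2, 0.2), (4, 0.4)]:
    memtable.insert(ts, value)
print(memtable.flushed_chunk_groups)
```

Points that arrive out of order while still in the buffer end up time-ordered in the flushed Chunk Group; points that arrive after the flush need the separate handling described in the next section.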
> 
> ==== Overflow File for Out-of-order Data ====
> When a part of the data has been flushed to disk (forming a `Chunk Group` in a TsFile), newly arriving data points whose timestamps are smaller than the largest timestamp in the TsFile are `out-of-order`.
> 
> To store the out-of-order data, we write all `out-of-order` data point insertions into a special TsFile, named `UnSequenceTsFile`. In an UnSequenceTsFile, the Chunk Groups of one device may overlap in the time dimension, which violates Rule 1 and makes query filtering slower than on a normal TsFile.
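
The routing decision can be sketched as a simple comparison against the largest timestamp already flushed. `split_by_order` is a hypothetical helper; following the wording above, only points strictly smaller than that timestamp are treated as out-of-order.

```python
def split_by_order(points, largest_flushed_ts):
    """Split incoming (timestamp, value) points into in-order points and
    out-of-order points, relative to the largest timestamp already on disk."""
    in_order, out_of_order = [], []
    for ts, value in points:
        if ts < largest_flushed_ts:
            out_of_order.append((ts, value))  # would go to the UnSequenceTsFile
        else:
            in_order.append((ts, value))      # continues the normal TsFile
    return in_order, out_of_order


incoming = [(105, "a"), (98, "b"), (110, "c"), (42, "d")]
print(split_by_order(incoming, largest_flushed_ts=100))
```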
>   
> There is another special operation: updating all the data points in a time range, e.g., `update all the speed values of device1 as 0 where the data time is in [1521622000000, 1521622662000]`. The operation is called when: (1) a sensor malfunctions and the database receives wrong data for a period; (2) we may want to reset all the records. Many NoSQL time series databases do not support such an operation. To support the operation in IoTDB, we use a tree-based structure, Treap, to store this part of operations and store them as `Overflow` files. 
> 
> Therefore, there are 3 kinds of data files: TsFiles, UnSequenceTsFiles and Overflow files. TsFiles should store most of the data. The volume of UnSequenceTsFiles depends on the workload: if there are many out-of-order data points and they span a long time range, the volume will be large. Overflow files handle the fewest data operations, and their volume depends on how often the special operations are used.
> 
> ==== LSM-tree ====
> Normally, LSM-based storage engines merge data files level by level so that it looks like a tree structure. In this way, data is well organized. The disadvantage is that data will be read and written several times. If the tree has 4 levels, each data point will be rewritten at least 4 times. 
> 
> Currently, we do not merge all the TsFiles into one because (1) the number of TsFiles is kept lower than many LSM storage engines because a memtable is mapped to several Chunk Groups rather than a file; (2) different TsFiles are not overlapping with each other in the time dimension (because of Rule 3). 
> 
> As mentioned before, TsFile supports ''fast-return'' to accelerate queries. However, UnSequenceTsFiles and Overflow files do not support this feature. The time spans of UnSequenceTsFiles, Overflow files and TsFiles may overlap, which leads to more files being involved in the query process. To accelerate these queries, a background merging process reorganizes the files. All three kinds of files (TsFiles, UnSequenceTsFiles and Overflow files) are involved in the merging process. The merging process is multi-threaded, with each thread responsible for a series family.
> After merging, only TsFiles are left. These files have non-overlapping time spans and support the ''fast-return'' feature. 
> 
> ==== Scalable Index Framework ====
> We allow users to implement indexes for faster queries. We currently support an index for pattern matching query (KV-Match index, ICDE 2019). Another index for fast aggregation (PISA index, CIKM 2016) is a work-in-progress. 
> 
> ==== Special Queries ====
> We currently support `group by time interval` aggregation queries and `Fill by` operations, which are similar to those of InfluxDB. Time series segmentation operations and frequency queries are work-in-progress.
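
To make the `group by time interval` aggregation concrete, here is a minimal sketch that averages values per fixed-width time bucket. It illustrates the idea only and does not reflect IoTDB's SQL syntax or implementation; `group_by_interval` is a hypothetical name.

```python
from collections import defaultdict


def group_by_interval(points, start, interval):
    """Average the values of (timestamp, value) points per time bucket.

    Buckets are [start, start + interval), [start + interval, ...), and so on.
    Points before `start` are ignored. Returns {bucket_start: average}.
    """
    buckets = defaultdict(list)
    for ts, value in points:
        if ts < start:
            continue
        bucket_start = start + ((ts - start) // interval) * interval
        buckets[bucket_start].append(value)
    return {b: sum(vs) / len(vs) for b, vs in sorted(buckets.items())}


points = [(0, 1.0), (500, 3.0), (1000, 5.0), (1999, 7.0), (2500, 9.0)]
print(group_by_interval(points, start=0, interval=1000))
```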
> 
> == Initial Goals ==
> The initial goals are to be open sourced and to integrate with the Apache development process. Furthermore, we plan for incremental development, and releases along with the Apache guidelines.
> 
> == Current Status ==
> The system has been under development for more than 2 years. There are currently 13k lines of code, some of which are generated by Antlr3 and Thrift. 230 issues have been resolved, and there are more than 1500 commits.
> 
> The system has been deployed in the staging environment of the State Grid Corporation of China to handle ~3 million time series (i.e., ~30,000 power generation assemblies * ~100 sensors) and at an equipment service company in China managing ~2 million time series (i.e., ~20k devices * 100 sensors). The insertion speed reaches ~2 million points/second/node, which is faster than InfluxDB, OpenTSDB and Apache Cassandra in our environment.
> 
> There are many new features in the works including those mentioned herein. We will add more analytics functions, improve the data file merge process, and finish the first released version of IoTDB. 
> 
> == Meritocracy ==
> The IoTDB project operates on meritocratic principles. Developers who submit more code with higher quality earn more merit. We have used the `Issues` and `Pull Requests` modules on Github to collect users' suggestions and patches. Users who submit issues, pull requests, or documentation, or who help with community management, are welcome and encouraged to become committers.
> 
> == Community ==
> 
> The IoTDB project users communicate on Github (https://github.com/thulab/tsfile). Developers communicate on a website similar to JIRA (currently, only registered users can apply for access to the project; URL: https://tower.im/projects/36de8571a0ff4833ae9d7f1c5c400c22/). We have also introduced IoTDB at many technical conferences. Next, we will set up a mailing list for more convenient, broader communication and archived discussions.
> 
> If IoTDB is accepted for incubation at the Apache Software Foundation, the primary goal is to build a larger community. We believe that IoTDB will become a key project for time series data management, and so, we will rely on a large community of users and developers.
> 
> TODO: IoTDB is currently in a private Github repository (https://github.com/thulab/iotdb), while its subproject TsFile (a file format for storing time series data) is open sourced on Github (https://github.com/thulab/tsfile).
> 
> == Core Developers ==
> IoTDB was initially developed by two dozen students and teachers at Tsinghua University. Since then, more developers have joined from other universities in China: Fudan University, Northwestern Polytechnical University and Harbin Institute of Technology. Other developers come from companies such as Lenovo and Microsoft. We will work to bring more developers into the project to contribute to IoTDB.
> 
> == Relationships with Other Apache Products ==
> IoTDB requires some Apache products (Apache Thrift, commons, collections, httpclient). 
> 
> IoTDB-Spark-connector and IoTDB-Hadoop-connector have been developed for supporting analysing time series data by using Apache Spark and MapReduce. 
> 
> Overall, IoTDB is designed as an open architecture, and it can be integrated with many other systems in the future.
> 
> As mentioned before, in the IoTDB project, we designed a new columnar file format, called TsFile, which is similar to Apache Parquet. However, the new file format is optimized for time series data. 
> 
> 
> 
> == Known Risks ==
> 
> === Orphaned Products ===
> Given the current level of investment in IoTDB, the risk of the project being abandoned is minimal. Time series data is becoming more and more important, and there are several constituents who are highly motivated to continue development. Tsinghua and NEL-BDS Lab rely on IoTDB as a platform for a large number of long-term research projects. We have deployed IoTDB in the staging environments of some companies for future applications.
> 
> === Inexperience with Open Source ===
> Students and researchers at Tsinghua University have been developing and using open source software for a long time. Being guided through a formal open-source process will be valuable for the students. Some of our committers have experience contributing to open source, for example:
> 
>  * druid: https://github.com/druid-io/druid/commit/f18cc5df97e5826c2dd8ffafba9fcb69d10a4d44
>  * druid: https://github.com/druid-io/druid/commit/aa7aee53ce524b7887b218333166941654788794
>  * YCSB: https://github.com/brianfrankcooper/YCSB/pull/776
> 
> 
> Additionally, several ASF veterans and industry veterans have agreed to mentor the project and are listed in this proposal. The project will rely on their guidance and collective wisdom to quickly transition the entire team of initial committers towards practicing the Apache Way.
> 
> 
> === Reliance on Salaried Developers ===
> Most of the current developers are students and researchers/professors at universities, and their research focuses on big data management and analytics. It is unlikely that they will change their research focus away from big data management. We will work to ensure that the project can continue to be stewarded and move forward independently of salaried developers.
> 
> === An Excessive Fascination with the Apache Brand ===
> Most of the initial developers come from Tsinghua University and have no intent to use the Apache brand for profit. We have no plans to use the Apache brand in press releases or to post billboards advertising the acceptance of IoTDB into the Apache Incubator.
> 
> 
> == Initial Source ==
> IoTDB's Github addresses and some required dependencies:
> 
>  * The storage file format: https://github.com/thulab/tsfile
>  * Adaptor for Apache Hadoop MapReduce: https://github.com/thulab/tsfile-hadoop-connector
>  * Adaptor for Apache Spark: https://github.com/thulab/tsfile-spark-connector
>  * Adaptor for Grafana: https://github.com/thulab/iotdb-grafana
>  * The database engine: https://github.com/thulab/iotdb (private project up to now)
>  * The client driver: https://github.com/thulab/iotdb-jdbc
> 
> 
> 
> === External Dependencies ===
> To the best of our knowledge, all dependencies of IoTDB are distributed under Apache compatible licenses. Upon acceptance to the incubator, we would begin a thorough analysis of all transitive dependencies to verify this fact and introduce license checking into the build and release process.
> 
> == Documentation ==
>  * Documentation for TsFile: https://github.com/thulab/tsfile/wiki
>  * Documentation for IoTDB and its JDBC: http://tsfile.org/document (Chinese only. An English version is in progress.)
> 
> == Required Resources ==
> === Mailing Lists ===
>  * private@iotdb.incubator.apache.org
>  * dev@iotdb.incubator.apache.org
>  * commits@iotdb.incubator.apache.org
> 
> 
> === Git Repositories ===
>  * https://git-wip-us.apache.org/repos/asf/incubator-iotdb.git
> 
> 
> === Issue Tracking ===
>  *  JIRA IoTDB (We currently use the issue management provided by Github to track issues.)
> 
> 
> == Initial Committers ==
> Tsinghua University, K2Data Company, Lenovo, Microsoft
> 
> Jianmin Wang (jimwang at tsinghua dot edu dot cn )
> 
> Xiangdong Huang (sainthxd at gmail dot com)
> 
> Jun Yuan (richard_yuan16 at 163 dot com)
> 
> Chen Wang (wang_chen at tsinghua dot edu dot cn)
> 
> Jialin Qiao (qjl16 at mails dot tsinghua dot edu dot cn)
> 
> Jinrui Zhang (jinrzhan at microsoft dot com)
> 
> Rong Kang (kr11 at mails dot tsinghua dot edu dot cn)
> 
> Tian Jiang (jiangtia18 at mails dot tsinghua dot edu dot cn)
> 
> Shuo Zhang (zhangshuo at k2data dot com dot cn)
> 
> Lei Rui (rl18 at mails dot tsinghua dot edu dot cn)
> 
> Rui Liu (liur17 at mails dot tsinghua dot edu dot cn)
> 
> Kun Liu (liukun16 at mails dot tsinghua dot edu dot cn)
> 
> Gaofei Cao (cgf16 at mails dot tsinghua dot edu dot cn)
> 
> Xinyi Zhao (xyzhao16 at mails dot tsinghua dot edu dot cn)
> 
> Dongfang Mao (maodf17 at mails dot tsinghua dot edu dot cn)
> 
> Tianan Li (lta18 at mails dot tsinghua dot edu dot cn)
> 
> Yue Su (suy18 at mails dot tsinghua dot edu dot cn)
> 
> Hui Dai (daihui_iot at lenovo dot com, yuct_iot at lenovo dot com )
> 
> == Sponsors ==
> === Champion ===
> Kevin A. McGrail (kmcgrail@apache.org)
> 
> === Nominated Mentors ===
> Justin Mclean (justin at classsoftware dot com)
> 
> Christofer Dutz (christofer.dutz at c-ware dot de)
> 
> Willem Jiang (willem.jiang at gmail dot com)
> 
> 


Re: [Result][Vote] vote for IoTDB incubation proposal

Posted by "Kevin A. McGrail" <km...@apache.org>.
I will defer the intake of code to the secretary.

On Thu, Nov 15, 2018, 12:20 黄向东 <sainthxd@gmail.com> wrote:

> > - When you say "open source" repo, do you mean private repo vs public
> > repo?
>
> Yes.
>
> >
> > - I believe Craig as Secretary will say an SGA never hurts but isn't
> > everything already licensed ASLv2?  It's been a few weeks and a few
> > proposals reviewed so it could be my memory.
>
> Currently, the licenses of the dependency libs of IoTDB include:
> Apache2.0, BSD (antlr3), EPL1.0 (logback) and EPL2.0 (junit).
> We are checking all the licenses once again to avoid
> mistakes.
>
> Regards,
> Xiangdong Huang
>
>
> > On Nov 15, 2018, at 10:43 PM, Kevin A. McGrail <km...@apache.org> wrote:
> >
> > Well, first, let's ask some questions:
> >
> > - When you say "open source" repo, do you mean private repo vs public
> > repo?
> >
> > - I believe Craig as Secretary will say an SGA never hurts but isn't
> > everything already licensed ASLv2?  It's been a few weeks and a few
> > proposals reviewed so it could be my memory.
> >
> > Regards,
> > KAM
> >
> > --
> > Kevin A. McGrail
> > VP Fundraising, Apache Software Foundation
> > Chair Emeritus Apache SpamAssassin Project
> > https://www.linkedin.com/in/kmcgrail - 703.798.0171
> >
> >
> > On Thu, Nov 15, 2018 at 7:27 AM hxd <hx...@qq.com> wrote:
> >
> >> Currently, there are 6 repositories in total (IoTDB, IoTDB-JDBC, TsFile,
> >> Spark-Connector, Hive-Connector, and Grafana-Connector), and we will
> >> merge them all into one repository.
> >>
> >> Only the first one is private.
> >>
> >> Actually, we lack experience with how to open source a project.
> >>
> >> Should we open all the source now, or after all the Apache legal
> >> documents are done?
> >>
> >> Best,
> >>
> >> Xiangdong Huang
> >>
> >>> On Nov 15, 2018, at 5:06 PM, Willem Jiang <wi...@gmail.com> wrote:
> >>>
> >>> Here is a question for the source code repository
> >>>
> >>> The main source git repo[1] is still a private repo.  I think we need
> >>> to open source the repo before sending the SGA?
> >>>
> >>>
> >>> [1]https://github.com/thulab/iotdb
> >>>
> >>> Willem Jiang
> >>>
> >>> Twitter: willemjiang
> >>> Weibo: 姜宁willem
> >>> On Thu, Nov 15, 2018 at 4:08 PM hxd <hx...@qq.com> wrote:
> >>>>
> >>>> Hi,
> >>>>
> >>>> In the proposal discussion process, we got 3 mentors: Justin Mclean,
> >>>> Christofer Dutz, and Willem Ning Jiang.
> >>>>
> >>>> In the vote process, we got a new mentor, Joe Witt.
> >>>>
> >>>> In total, there is one Champion and four mentors:
> >>>>
> >>>> Kevin A. McGrail (the Champion),
> >>>> Justin Mclean,
> >>>> Christofer Dutz,
> >>>> Willem Ning Jiang, and
> >>>> Joe Witt
> >>>>
> >>>> I have checked their names on
> >>>> http://people.apache.org/committer-index.html, and they are accurate now.
> >>>> The name list in the proposal
> >>>> (https://wiki.apache.org/incubator/IoTDBProposal) is also correct.
> >>>>
> >>>> Regards,
> >>>> Xiangdong Huang
> >>>>
> >>>>
> >>>>
> >>>> On Nov 15, 2018, at 12:51 AM, Kevin A. McGrail <km...@apache.org> wrote:
> >>>>
> >>>> Congratulations!  As champion, I think the next steps are:
> >>>>
> >>>> 1 - Xiangdong, Can you confirm the list of mentors on the proposal is
> >>>> accurate?
> >>>>
> >>>> 2 - Also Xiangdong, Is there anyone else that stepped forward as a
> >>>> mentor during the voting process that the project wants the IPMC to
> >>>> approve?
> >>>>
> >>>> 3 - Justin, I think you have to request the creation of the podling and
> >>>> then I as champion work on things like the meta data file from this page,
> >>>> https://incubator.apache.org/policy/incubation.html, correct?
> >>>>
> >>>> Regards,
> >>>> KAM
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> Kevin A. McGrail
> >>>> VP Fundraising, Apache Software Foundation
> >>>> Chair Emeritus Apache SpamAssassin Project
> >>>> https://www.linkedin.com/in/kmcgrail - 703.798.0171
> >>>>
> >>>>
> >>>> On Wed, Nov 14, 2018 at 6:29 AM hxd <hx...@qq.com> wrote:
> >>>>>
> >>>>> Hi,
> >>>>>
> >>>>> With 8 +1 binding votes,  2 +1 non-binding votes and No +/-0 or -1
> >> votes, this VOTE passes.
> >>>>>
> >>>>> Thanks to everyone who voted!
> >>>>>
> >>>>> Bellow is a voting tally:
> >>>>>
> >>>>> Binding
> >>>>> Von Gosling
> >>>>> Christofer Dutz
> >>>>> Kevin A. McGrail
> >>>>> Felix Cheung
> >>>>> Matt Sticker
> >>>>> Joe Witt
> >>>>> Justin Mclean
> >>>>> Willem Jiang
> >>>>>
> >>>>>
> >>>>> Non-binding
> >>>>> Sheng Wu
> >>>>> Yang Bo
> >>>>>
> >>>>> The vote thread:
> >>
> https://lists.apache.org/thread.html/077f029ab2b52a2b19fc8d41c07438f660a8e93dd87b3895d262263c@%3Cgeneral.incubator.apache.org%3E
> >> <
> >>
> https://lists.apache.org/thread.html/077f029ab2b52a2b19fc8d41c07438f660a8e93dd87b3895d262263c@%3Cgeneral.incubator.apache.org%3E
> >>>
> >>>>> The proposal: https://wiki.apache.org/incubator/IoTDBProposal <
> >> https://wiki.apache.org/incubator/IoTDBProposal>
> >>>>>
> >>>>> Thanks,
> >>>>>
> >>>>> Xiangdong Huang
> >>>>>
> >>>>>
> >>>>>> 在 2018年11月7日,下午3:46,hxd <hx...@qq.com> 写道:
> >>>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> Sorry for the previous mail with bad format.
> >>>>>> I'd like to call a VOTE to accept IoTDB project, a database for
> >> managing large amounts of time series data  from IoT sensors in
> industrial
> >> applications, into the Apache Incubator.
> >>>>>> The full proposal is available on the wiki:
> >> https://wiki.apache.org/incubator/IoTDBProposal
> >>>>>> and it is also attached below for your convenience.
> >>>>>>
> >>>>>> Please cast your vote:
> >>>>>>
> >>>>>> [ ] +1, bring IoTDB into Incubator
> >>>>>> [ ] +0, I don't care either way,
> >>>>>> [ ] -1, do not bring IoTDB into Incubator, because...
> >>>>>>
> >>>>>> The vote will open at least for 72 hours.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Xiangdong Huang.
> >>>>>>
> >>>>>>
> >>>>>> = IoTDB Proposal  =
> >>>>>> v0.1.1
> >>>>>>
> >>>>>>
> >>>>>> == Abstract ==
> >>>>>> IoTDB is a data store for managing large amounts of time series data
> >> such as timestamped data from IoT sensors in industrial applications.
> >>>>>>
> >>>>>> == Proposal ==
> >>>>>> IoTDB is a database for managing large amounts of time series data
> >>>>>> with columnar storage, data encoding, pre-computation, and index
> >>>>>> techniques. It has a SQL-like interface, can write millions of data
> >>>>>> points per second per node, and is optimized to return query results
> >>>>>> within a few seconds over trillions of data points. It can also be
> >>>>>> easily integrated with Apache Hadoop MapReduce and Apache Spark for
> >>>>>> analytics.
> >>>>>>
> >>>>>> == Background ==
> >>>>>>
> >>>>>> A new class of data management system requirements is becoming
> >>>>>> increasingly important with the rise of the Internet of Things.
> >>>>>> Several database systems and technologies are aimed at time series
> >>>>>> data management. For example, Gorilla and InfluxDB are mainly built
> >>>>>> for data centers and for monitoring application metrics. Other
> >>>>>> systems, for example OpenTSDB and KairosDB, are built on Apache HBase
> >>>>>> and Apache Cassandra, respectively.
> >>>>>>
> >>>>>> However, many applications for time series data management have
> >>>>>> additional requirements, especially in industrial settings:
> >>>>>>
> >>>>>> * Supporting time series data with a high data frequency. For
> >>>>>> example, a turbine engine may generate 1000 points per second (i.e.,
> >>>>>> 1000 Hz), while each CPU reports only 1 data point every 5 seconds
> >>>>>> in a data center monitoring application.
> >>>>>>
> >>>>>> * Supporting scans of data at multiple resolutions. For example,
> >>>>>> aggregation operations are important for time series data.
> >>>>>>
> >>>>>> * Supporting special queries for time series, such as pattern
> >> matching, time series segmentation, time-frequency transformation and
> >> frequency query.
> >>>>>>
> >>>>>> * Supporting a large number of monitoring targets (i.e., time
> >>>>>> series). An excavator may report more than 1000 time series, for
> >>>>>> example the rotational speed of the motor-engine, the speed of the
> >>>>>> excavator, the acceleration, the temperature of the water tank and
> >>>>>> so on, while a CPU or an application monitor has far fewer time
> >>>>>> series.
> >>>>>>
> >>>>>> * Optimization for out-of-order data points. In the industrial
> >>>>>> sector, it is common for equipment to send data over UDP rather than
> >>>>>> TCP. Sometimes the network connection is unstable and part of the
> >>>>>> data is buffered and sent later, so data points may arrive out of
> >>>>>> timestamp order.
> >>>>>>
> >>>>>> * Supporting long-term storage. Historical data is precious for
> >>>>>> equipment manufacturers. Therefore, removing or unloading historical
> >>>>>> data is highly undesirable for most industrial applications. The
> >>>>>> database system must not only support fast retrieval of historical
> >>>>>> data, but should also guarantee that the historical data does not
> >>>>>> impact the processing speed for “hot” or current data.
> >>>>>>
> >>>>>> * Supporting online transaction processing (OLTP) as well as complex
> >>>>>> analytics. Analyzing the data files directly with Apache Spark or
> >>>>>> Apache Hadoop MapReduce is preferable to first transforming them
> >>>>>> into another file format for Big Data analytics.
> >>>>>>
> >>>>>> * Flexible deployment either on premises or in the cloud. IoTDB is
> >>>>>> simple enough to be deployed on a Raspberry Pi handling hundreds of
> >>>>>> time series. Meanwhile, the system can also be deployed in the cloud
> >>>>>> so that it supports tens of millions of ingestions per second, OLTP
> >>>>>> queries in milliseconds, and analytics using Apache Spark/Apache
> >>>>>> Hadoop MapReduce.
> >>>>>>
> >>>>>> * * (1) If users deploy IoTDB on a device, such as a Raspberry Pi, a
> >>>>>> wind turbine, or a meteorological station, the deployment of the
> >>>>>> chosen database is designed to be simple. A device may have hundreds
> >>>>>> of time series (but fewer than a thousand) and the database needs to
> >>>>>> handle them.
> >>>>>> * * (2) When deploying IoTDB in a data center, the computational
> >>>>>> resources (i.e., the hardware configuration of servers) are not a
> >>>>>> constraint compared to a Raspberry Pi. In this deployment, IoTDB can
> >>>>>> use more computational resources and can handle more time series
> >>>>>> (e.g., millions of time series).
> >>>>>>
> >>>>>> Based on these requirements, we developed IoTDB, a new data store
> >> system for managing time series data.
> >>>>>>
> >>>>>> IoTDB started as a Tsinghua University research project. IoTDB's
> >>>>>> developer community has since grown to include additional
> >>>>>> institutions, for example universities (e.g., Fudan University),
> >>>>>> research labs (e.g., NEL-BDS lab), and corporations (e.g., K2Data,
> >>>>>> Tencent). Funding has been provided by various institutions,
> >>>>>> including the National Natural Science Foundation of China, and by
> >>>>>> industry sponsors such as Lenovo and K2Data.
> >>>>>>
> >>>>>> == Rationale ==
> >>>>>> Because no existing open-source time series database covers all the
> >>>>>> above requirements, we developed IoTDB. As the system matures, we
> >>>>>> are seeking a long-term home for the project. We believe the Apache
> >>>>>> Software Foundation would be an ideal fit. Joining Apache will also
> >>>>>> help coordinate and improve the development effort of the growing
> >>>>>> number of organizations that contribute to IoTDB, improving the
> >>>>>> diversity of our community.
> >>>>>>
> >>>>>> IoTDB contains multiple modules, which are classified into
> categories:
> >>>>>>
> >>>>>> * '''TsFile Format''': TsFile is a new columnar file format.
> >>>>>> * '''Adaptor for Analytics and Visualization''': Integrates TsFile
> >>>>>> with Apache Hadoop HDFS, Apache Hadoop MapReduce and Apache Spark.
> >>>>>> Examples of integrating IoTDB with Apache Kafka, Apache Storm and
> >>>>>> Grafana are also provided.
> >>>>>> * '''IoTDB Engine''': An engine which consists of a SQL parser, query
> >>>>>> plan generator, memtable, authentication and authorization,
> >>>>>> write-ahead log (WAL), crash recovery, out-of-order data handler, and
> >>>>>> indexes for aggregation and pattern matching. The engine stores
> >>>>>> system data in TsFile format.
> >>>>>> * '''IoTDB JDBC''': An implementation of Java Database Connectivity
> >> (JDBC) for clients to connect to IoTDB using Java.
> >>>>>>
> >>>>>> === TsFile Format ===
> >>>>>>
> >>>>>> The TsFile format is a columnar store, similar to Apache Parquet and
> >>>>>> Apache CarbonData. It has the concepts of Chunk Group, Column Chunk,
> >>>>>> Page and Footer. Compared with Apache Parquet and Apache CarbonData,
> >>>>>> it is designed and optimized for time series:
> >>>>>>
> >>>>>> ==== Time Series Friendly Encoding ====
> >>>>>> IoTDB currently supports run length encoding (RLE), delta-of-delta
> >> encoding, and Facebook's Gorilla encoding.
> >>>>>>
> >>>>>> Lossy encoding methods (e.g., Piecewise Linear Approximation (PLA)
> >>>>>> and time-frequency transformation) are works in progress.
> >>>>>>
> >>>>>>
> >>>>>> ==== Chunk Group ====
> >>>>>> The data part of a TsFile consists of many Chunk Groups. Each Chunk
> >>>>>> Group stores the data of one device over a time interval. A Chunk
> >>>>>> Group is similar to the row group in Apache Parquet, but with some
> >>>>>> constraints on the time dimension: for each device, the time
> >>>>>> intervals of different Chunk Groups do not overlap, and a later
> >>>>>> Chunk Group always has larger timestamps.
> >>>>>>
> >>>>>> Given a TsFile and a query with a time range filter, the query
> >> process can terminate scanning data once it reads data points whose
> >> timestamp reaches the time limit of the filter. We call the feature
> >> ''fast-return'' and it makes the time range query in a TsFile very
> >> efficient.
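
The fast-return behaviour described above can be sketched as follows. This is only an illustration, assuming the points of one series are available as a time-ordered sequence of (timestamp, value) pairs; the name `scan_time_range` and the in-memory layout are hypothetical and are not taken from the TsFile code base.

```python
from typing import Iterable, List, Tuple

Point = Tuple[int, float]  # (timestamp in ms, value); hypothetical layout


def scan_time_range(points: Iterable[Point], start: int, end: int) -> List[Point]:
    """Return the points with start <= timestamp <= end.

    The input must be ordered by timestamp. Because of that ordering, the
    scan can stop at the first timestamp past `end` instead of reading the
    rest of the data (the "fast-return" idea).
    """
    result: List[Point] = []
    for timestamp, value in points:
        if timestamp > end:
            break  # all remaining points are later, so stop scanning
        if timestamp >= start:
            result.append((timestamp, value))
    return result


chunk = [(1521622662000 + i, float(i)) for i in range(10)]
hits = scan_time_range(chunk, 1521622662002, 1521622662004)
print(hits)
```

The early `break` is only valid because timestamps never decrease within the scanned data, which is the ordering constraint the proposal places on Chunk Groups.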
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> ==== Different Column Chunk Format (No Repetition (R) and Definition (D) Fields Needed) ====
> >>>>>>
> >>>>>> While Apache Parquet and Apache CarbonData support complex data
> >>>>>> types, e.g., nested data and sparse columns, TsFile is designed
> >>>>>> exclusively for time series whose data model is <device_id,
> >>>>>> series_id, timestamp, value>.
> >>>>>>
> >>>>>> In a `Chunk Group`, each time series is a `Column Chunk`. Even
> >>>>>> though these time series belong to the same device, the data points
> >>>>>> in different time series are not necessarily aligned in the time
> >>>>>> dimension.
> >>>>>>
> >>>>>> For example, if you have a device with 2 sensors that collect data
> >>>>>> at the same frequency, sensor 1 may collect data at time
> >>>>>> 1521622662000 while the other one collects data at time
> >>>>>> 1521622662001 (delta=1ms). Therefore, each Column Chunk has its own
> >>>>>> timestamps and values, which is quite different from Apache Parquet
> >>>>>> and Apache CarbonData. Because different time series may have
> >>>>>> different data frequencies, we store the time column along with each
> >>>>>> value column instead of making different chunks share the same time
> >>>>>> column. As a result, we do not store any null values on disk to
> >>>>>> align across time series, and we do not need to attach `repetition`
> >>>>>> (R) and `definition` (D) fields to each value. Disk space is saved
> >>>>>> and query latency is reduced (because we do not align data by
> >>>>>> calculating R and D fields).
> >>>>>>
> >>>>>>
> >>>>>> ==== Domain Specific Information in Each Page ====
> >>>>>> Similar to Apache Parquet and Apache CarbonData, a `Column Chunk`
> >> consists of several `Pages`, and each `Page` has a `Page header`. The
> `Page
> >> header` is a summary of the data in the page.
> >>>>>>
> >>>>>> Because TsFile is optimized for time series, the page header
> contains
> >> more domain specific information, such as the minimal and maximal value,
> >> the minimal and the maximal timestamp, the frequency and so on. TsFile
> can
> >> even store the histogram of values in the page header.
> >>>>>>
> >>>>>> This header information helps IoTDB in speeding up queries by
> >> skipping unnecessary pages.
> >>>>>>
> >>>>>>
> >>>>>> === Adaptor for Analytics ===
> >>>>>> The TsFile provides:
> >>>>>>
> >>>>>> * InputFormat/OutputFormat interfaces for Reading/Writing data.
> >>>>>> * Deep integration with Apache Spark/Hadoop MapReduce including
> >> predicate push-down, column pruning, aggregation push down, etc. So
> users
> >> can use Apache Spark SQL/HiveQL to connect and query TsFiles.
> >>>>>>
> >>>>>>
> >>>>>> === IoTDB Engine ===
> >>>>>> The IoTDB engine is a database engine, which uses TsFile as its
> >> storage file format. The IoTDB Engine supports SQL-like query plus many
> >> useful functions:
> >>>>>>
> >>>>>> * Tree-based time series schema
> >>>>>> * Log-Structured Merge (LSM)-based storage
> >>>>>> * Overflow file for out-of-order data
> >>>>>> * Scalable index framework
> >>>>>> * Special queries for time series
> >>>>>>
> >>>>>> ==== Tree-based Time Series Schema ====
> >>>>>> IoTDB manages all the time series definitions using a tree
> structure.
> >> A path from the root of the tree to a leaf node represents a time
> series.
> >> Therefore, the unique id of a time series is a path, e.g.,
> >> `root.China.beijing.windFarm1.windTurbine1.speed`.
> >>>>>>
> >>>>>> This kind of schema can express `group by` naturally. For example,
> >> `root.China.beijing.windFarm1.*.speed` represents the speed of all the
> wind
> >> turbines in wind farm 1 in Beijing, China.
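
A small sketch of how such a path pattern could select a group of series. It assumes that `*` stands for exactly one level of the tree; that reading, the function name `match_series`, and every path other than the `windTurbine1.speed` one quoted above are assumptions made for illustration, not IoTDB's actual path-matching rules.

```python
from typing import List


def match_series(pattern: str, paths: List[str]) -> List[str]:
    """Return the paths matching a dotted pattern.

    Assumption for this sketch: `*` matches exactly one node of the path.
    """
    pattern_nodes = pattern.split(".")
    matched = []
    for path in paths:
        nodes = path.split(".")
        if len(nodes) != len(pattern_nodes):
            continue  # different depth in the tree, cannot match
        if all(p == "*" or p == n for p, n in zip(pattern_nodes, nodes)):
            matched.append(path)
    return matched


series = [
    "root.China.beijing.windFarm1.windTurbine1.speed",
    "root.China.beijing.windFarm1.windTurbine2.speed",
    "root.China.beijing.windFarm1.windTurbine1.temperature",
    "root.China.beijing.windFarm2.windTurbine1.speed",
]
selected = match_series("root.China.beijing.windFarm1.*.speed", series)
print(selected)
```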
> >>>>>>
> >>>>>> ==== Log-Structured Merge (LSM)-based Storage ====
> >>>>>> In a time series, the data points should be ordered by their
> >>>>>> timestamps. In IoTDB, we use a Log-Structured Merge (LSM) based
> >>>>>> mechanism. A part of the data is first stored in memory, in what is
> >>>>>> called a `memtable`. At this stage, if data points arrive out of
> >>>>>> order, we re-sort them in memory. When this part of the data exceeds
> >>>>>> the configured memory limit, we flush it to disk as a `Chunk Group`
> >>>>>> in an unclosed TsFile. As a result, a TsFile may contain several
> >>>>>> Chunk Groups, which reduces the number of small data files. This
> >>>>>> helps to reduce the I/O load on the storage system and the execution
> >>>>>> time of a file merge in LSM. Notice that the data is time-ordered
> >>>>>> within one Chunk Group on disk, and this layout is helpful for fast
> >>>>>> filtering within one Chunk Group for a query.
> >>>>>>
> >>>>>> Rule 1: In a TsFile, the Chunk Groups of one device are ordered by
> >>>>>> timestamp, which is helpful for fast filtering among Chunk Groups
> >>>>>> for a query.
> >>>>>>
> >>>>>> Rule 2: When the size of the unclosed TsFile reaches the threshold
> >>>>>> defined in the configuration file, we close the file and generate a
> >>>>>> new one to store newly arriving data spanning the entire data set.
> >>>>>> Like many systems which use LSM-based storage, we never modify a
> >>>>>> TsFile which has been closed, except in the file-merge process.
> >>>>>>
> >>>>>> Rule 3: To reduce the number of TsFiles involved in a query process,
> >>>>>> we guarantee that the data points in different TsFiles do not
> >>>>>> overlap in the time dimension after a file merge.
> >>>>>>
> >>>>>> ==== Overflow File for Out-of-order Data ====
> >>>>>> When a part of the data has been flushed to disk (forming a `Chunk
> >>>>>> Group` in a TsFile), newly arriving data points whose timestamps are
> >>>>>> smaller than the largest timestamp in the TsFile are `out-of-order`.
> >>>>>>
> >>>>>> To store the out-of-order data, we organize all the troublesome
> >>>>>> `out-of-order` data point insertions into a special TsFile, named
> >>>>>> `UnSequenceTsFile`. In an UnSequenceTsFile, the Chunk Groups of one
> >>>>>> device may overlap in the time dimension, which violates Rule 1 and
> >>>>>> costs additional query filtering time compared to a normal TsFile.
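
The routing decision described above might look roughly like the sketch below. It assumes the largest flushed timestamp is tracked per device and that a point equal to it still counts as in order (the text only says "smaller than"); the class `WriteRouter` and its methods are hypothetical names for illustration, not IoTDB's implementation.

```python
from typing import Dict, List, Tuple

Record = Tuple[str, int, float]  # (device, timestamp in ms, value)


class WriteRouter:
    """Toy model: separate in-order writes from out-of-order writes."""

    def __init__(self) -> None:
        self.largest_flushed: Dict[str, int] = {}  # per-device assumption
        self.sequence: List[Record] = []      # destined for a normal TsFile
        self.unsequence: List[Record] = []    # destined for an UnSequenceTsFile

    def mark_flushed(self, device: str, largest_timestamp: int) -> None:
        """Record the largest timestamp already flushed to disk for a device."""
        current = self.largest_flushed.get(device)
        if current is None or largest_timestamp > current:
            self.largest_flushed[device] = largest_timestamp

    def insert(self, device: str, timestamp: int, value: float) -> str:
        """Route one point and report which kind of file it belongs to."""
        flushed = self.largest_flushed.get(device)
        if flushed is not None and timestamp < flushed:
            self.unsequence.append((device, timestamp, value))
            return "unsequence"
        self.sequence.append((device, timestamp, value))
        return "sequence"


router = WriteRouter()
router.mark_flushed("device1", 1521622662000)
print(router.insert("device1", 1521622663000, 1.0))
print(router.insert("device1", 1521622000000, 2.0))
```

Points that arrive late but before any flush never reach this path, since the proposal states they are re-sorted in the memtable.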
> >>>>>>
> >>>>>> There is another special operation: updating all the data points in
> a
> >> time range, e.g., `update all the speed values of device1 as 0 where the
> >> data time is in [1521622000000, 1521622662000]`. The operation is called
> >> when: (1) a sensor malfunctions and the database receives wrong data
> for a
> >> period; (2) we may want to reset all the records. Many NoSQL time series
> >> databases do not support such an operation. To support the operation in
> >> IoTDB, we use a tree-based structure, Treap, to store this part of
> >> operations and store them as `Overflow` files.
> >>>>>>
> >>>>>> Therefore, there are 3 kinds of data files: TsFiles,
> >>>>>> UnSequenceTsFiles and Overflow files. TsFiles should store most of
> >>>>>> the data. The volume of UnSequenceTsFiles depends on the workload:
> >>>>>> if there are many out-of-order data points and they span a long
> >>>>>> time range, the volume will be large. Overflow files handle the
> >>>>>> fewest data operations, but their volume depends on how often the
> >>>>>> special operations are used.
> >>>>>>
> >>>>>> ==== LSM-tree ====
> >>>>>> Normally, LSM-based storage engines merge data files level by level
> >> so that it looks like a tree structure. In this way, data is well
> >> organized. The disadvantage is that data will be read and written
> several
> >> times. If the tree has 4 levels, each data point will be rewritten at
> least
> >> 4 times.
> >>>>>>
> >>>>>> Currently, we do not merge all the TsFiles into one because (1) the
> >> number of TsFiles is kept lower than many LSM storage engines because a
> >> memtable is mapped to several Chunk Groups rather than a file; (2)
> >> different TsFiles are not overlapping with each other in the time
> dimension
> >> (because of Rule 3).
> >>>>>>
> >>>>>> As mentioned before, TsFile supports ''fast-return'' to accelerate
> >>>>>> queries. However, UnSequenceTsFiles and Overflow files do not
> >>>>>> support this feature. The time spans of UnSequenceTsFiles, Overflow
> >>>>>> files and TsFiles may overlap, which leads to more files being
> >>>>>> involved in the query process. To accelerate these queries, a
> >>>>>> merging process reorganizes files in the background. All three
> >>>>>> kinds of files (TsFiles, UnSequenceTsFiles and Overflow files) are
> >>>>>> involved in the merging process. The merging process is
> >>>>>> multi-threaded, with each thread responsible for a series family.
> >>>>>> After merging, only TsFiles are left. These files have
> >>>>>> non-overlapping time spans and support the ''fast-return'' feature.
> >>>>>>
> >>>>>> ==== Scalable Index Framework ====
> >>>>>> We allow users to implement indexes for faster queries. We currently
> >> support an index for pattern matching query (KV-Match index, ICDE 2019).
> >> Another index for fast aggregation (PISA index, CIKM 2016) is a
> >> work-in-progress.
> >>>>>>
> >>>>>> ==== Special Queries ====
> >>>>>> We currently support `group by time interval` aggregation queries
> and
> >> `Fill by` operations, which are similar to those of InfluxDB. Time
> series
> >> segmentation operations and frequency queries are work-in-progress.
> >>>>>>
> >>>>>> == Initial Goals ==
> >>>>>> The initial goals are to be open sourced and to integrate with the
> >> Apache development process. Furthermore, we plan for incremental
> >> development, and releases along with the Apache guidelines.
> >>>>>>
> >>>>>> == Current Status ==
> >>>>>> We have developed the system for more than 2 years. There are
> >> currently 13k lines of code, some of which are generated by Antlr3 and
> >> Thrift.  There are 230 issues which have been solved and more than 1500
> >> commits.
> >>>>>>
> >>>>>> The system has been deployed in the staging environment of the State
> >>>>>> Grid Corporation of China to handle ~3 million time series (i.e.,
> >>>>>> ~30,000 power generation assemblies * ~100 sensors) and at an
> >>>>>> equipment service company in China managing ~2 million time series
> >>>>>> (i.e., ~20k devices * 100 sensors). The insertion speed reaches ~2
> >>>>>> million points/second/node, which is faster than InfluxDB, OpenTSDB
> >>>>>> and Apache Cassandra in our environment.
> >>>>>>
> >>>>>> There are many new features in the works including those mentioned
> >> herein. We will add more analytics functions, improve the data file
> merge
> >> process, and finish the first released version of IoTDB.
> >>>>>>
> >>>>>> == Meritocracy ==
> >>>>>> The IoTDB project operates on meritocratic principles. Developers
> >>>>>> who submit more code with higher quality earn more merit. We have
> >>>>>> used the `Issues` and `Pull Requests` modules on GitHub to collect
> >>>>>> users' suggestions and patches. Users who submit issues, pull
> >>>>>> requests and documents, and who help with community management, are
> >>>>>> welcomed and encouraged to become committers.
> >>>>>>
> >>>>>> == Community ==
> >>>>>>
> >>>>>> The IoTDB project users communicate on GitHub (
> >>>>>> https://github.com/thulab/tsfile). Developers communicate on a
> >>>>>> website which is similar to JIRA (currently, only registered users
> >>>>>> can apply to access the project for communication, url:
> >>>>>> https://tower.im/projects/36de8571a0ff4833ae9d7f1c5c400c22/
> >>>>>> ). We have also introduced IoTDB at many technical conferences.
> >>>>>> Next, we will set up a mailing list for more convenient and broader
> >>>>>> communication and for archived discussions.
> >>>>>>
> >>>>>> If IoTDB is accepted for incubation at the Apache Software
> >> Foundation, the primary goal is to build a larger community. We believe
> >> that IoTDB will become a key project for time series data management,
> and
> >> so, we will rely on a large community of users and developers.
> >>>>>>
> >>>>>> TODO: IoTDB is currently on a private Github repository (
> >>>>>> https://github.com/thulab/iotdb), while its subproject TsFile (a
> >> file format for storing time series data) is open sourced on Github (
> >> https://github.com/thulab/tsfile
> >>>>>> ).
> >>>>>>
> >>>>>> == Core Developers ==
> >>>>>> IoTDB was initially developed by two dozen students and teachers at
> >>>>>> Tsinghua University. Since then, more developers have joined from
> >>>>>> other universities in China: Fudan University, Northwestern
> >>>>>> Polytechnical University and Harbin Institute of Technology. Other
> >>>>>> developers come from companies such as Lenovo and Microsoft. We will
> >>>>>> work to bring more developers into the project to contribute to
> >>>>>> IoTDB.
> >>>>>>
> >>>>>> == Relationships with Other Apache Products ==
> >>>>>> IoTDB requires some Apache products (Apache Thrift, commons,
> >> collections, httpclient).
> >>>>>>
> >>>>>> IoTDB-Spark-connector and IoTDB-Hadoop-connector have been developed
> >> for supporting analysing time series data by using Apache Spark and
> >> MapReduce.
> >>>>>>
> >>>>>> Overall, IoTDB is designed as an open architecture, and it can be
> >> integrated with many other systems in the future.
> >>>>>>
> >>>>>> As mentioned before, in the IoTDB project, we designed a new
> columnar
> >> file format, called TsFile, which is similar to Apache Parquet. However,
> >> the new file format is optimized for time series data.
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> == Known Risks ==
> >>>>>>
> >>>>>> === Orphaned Products ===
> >>>>>> Given the current level of investment in IoTDB, the risk of the
> >>>>>> project being abandoned is minimal. Time series data is increasingly
> >>>>>> important and there are several constituents who are highly
> >>>>>> motivated to continue development. Tsinghua and NEL-BDS Lab rely on
> >>>>>> IoTDB as a platform for a large number of long-term research
> >>>>>> projects. We have deployed IoTDB in some companies' staging
> >>>>>> environments for future applications.
> >>>>>>
> >>>>>> === Inexperience with Open Source ===
> >>>>>> Students and researchers at Tsinghua University have been developing
> >>>>>> and using open source software for a long time. It is a wonderful
> >>>>>> opportunity for students to be guided into a formal open-source
> >>>>>> process. Some of our committers have experience contributing to open
> >>>>>> source, for example:
> >>>>>>
> >>>>>> * druid:
> >>>>>>
> >>
> https://github.com/druid-io/druid/commit/f18cc5df97e5826c2dd8ffafba9fcb69d10a4d44
> >>>>>>
> >>>>>> * druid:
> >>>>>>
> >>
> https://github.com/druid-io/druid/commit/aa7aee53ce524b7887b218333166941654788794
> >>>>>>
> >>>>>> * YCSB:
> >>>>>> https://github.com/brianfrankcooper/YCSB/pull/776
> >>>>>>
> >>>>>>
> >>>>>> Additionally, several ASF veterans and industry veterans have agreed
> >> to mentor the project and are listed in this proposal. The project will
> >> rely on their guidance and collective wisdom to quickly transition the
> >> entire team of initial committers towards practicing the Apache Way.
> >>>>>>
> >>>>>>
> >>>>>> === Reliance on Salaried Developers ===
> >>>>>> Most of the current developers are students and
> >>>>>> researchers/professors in universities, and their research focuses
> >>>>>> on big data management and analytics. It is unlikely that they will
> >>>>>> change their research focus away from big data management. We will
> >>>>>> work to ensure that the project can continue to be stewarded and to
> >>>>>> move forward independent of salaried developers.
> >>>>>>
> >>>>>> === An Excessive Fascination with the Apache Brand ===
> >>>>>> Most of the initial developers come from Tsinghua University with no
> >>>>>> intent to use the Apache brand for profit. We have no plans to make
> >>>>>> use of the Apache brand in press releases or to post billboards
> >>>>>> advertising acceptance of IoTDB into the Apache Incubator.
> >>>>>>
> >>>>>>
> >>>>>> == Initial Source ==
> >>>>>> IoTDB's github address and some required dependencies:
> >>>>>>
> >>>>>> * The storage file format:
> >>>>>> https://github.com/thulab/tsfile
> >>>>>>
> >>>>>> * Adaptor for Apache Hadoop MapReduce:
> >>>>>> https://github.com/thulab/tsfile-hadoop-connector
> >>>>>>
> >>>>>> * Adaptor for Apache Spark:
> >>>>>> https://github.com/thulab/tsfile-spark-connector
> >>>>>>
> >>>>>> * Adaptor for Grafana:
> >>>>>> https://github.com/thulab/iotdb-grafana
> >>>>>>
> >>>>>> * The database engine:
> >>>>>> https://github.com/thulab/iotdb
> >>>>>> (private project up to now)
> >>>>>> * The client driver:
> >>>>>> https://github.com/thulab/iotdb-jdbc
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> === External Dependencies ===
> >>>>>> To the best of our knowledge, all dependencies of IoTDB are
> >> distributed under Apache compatible licenses. Upon acceptance to the
> >> incubator, we would begin a thorough analysis of all transitive
> >> dependencies to verify this fact and introduce license checking into the
> >> build and release process.
> >>>>>>
> >>>>>> == Documentation ==
> >>>>>> * Documentation for TsFile:
> >>>>>> https://github.com/thulab/tsfile/wiki
> >>>>>>
> >>>>>> * Documentation for IoTDB and its JDBC:
> >>>>>> http://tsfile.org/document
> >>>>>> (Chinese only. An English version is in progress.)
> >>>>>>
> >>>>>> == Required Resources ==
> >>>>>> === Mailing Lists ===
> >>>>>> *
> >>>>>> private@iotdb.incubator.apache.org
> >>>>>>
> >>>>>> *
> >>>>>> dev@iotdb.incubator.apache.org
> >>>>>>
> >>>>>> *
> >>>>>> commits@iotdb.incubator.apache.org
> >>>>>>
> >>>>>>
> >>>>>> === Git Repositories ===
> >>>>>> *
> >>>>>> https://git-wip-us.apache.org/repos/asf/incubator-iotdb.git
> >>>>>>
> >>>>>>
> >>>>>> === Issue Tracking ===
> >>>>>> *  JIRA IoTDB (We currently use the issue management provided by
> >> Github to track issues.)
> >>>>>>
> >>>>>>
> >>>>>> == Initial Committers ==
> >>>>>> Tsinghua University, K2Data Company, Lenovo, Microsoft
> >>>>>>
> >>>>>> Jianmin Wang (jimwang at tsinghua dot edu dot cn )
> >>>>>>
> >>>>>> Xiangdong Huang (sainthxd at gmail dot com)
> >>>>>>
> >>>>>> Jun Yuan (richard_yuan16 at 163 dot com)
> >>>>>>
> >>>>>> Chen Wang ( wang_chen at tsinghua dot edu dot cn)
> >>>>>>
> >>>>>> Jialin Qiao (qjl16 at mails dot tsinghua dot edu dot cn)
> >>>>>>
> >>>>>> Jinrui Zhang (jinrzhan at microsoft dot com)
> >>>>>>
> >>>>>> Rong Kang (kr11 at mails dot tsinghua dot edu dot cn)
> >>>>>>
> >>>>>> Tian Jiang(jiangtia18 at mails dot tsinghua dot edu dot cn)
> >>>>>>
> >>>>>> Shuo Zhang (zhangshuo at k2data dot com dot cn)
> >>>>>>
> >>>>>> Lei Rui (rl18 at mails dot tsinghua dot edu dot cn)
> >>>>>>
> >>>>>> Rui Liu (liur17 at mails dot tsinghua dot edu dot cn)
> >>>>>>
> >>>>>> Kun Liu (liukun16 at mails dot tsinghua dot edu dot cn)
> >>>>>>
> >>>>>> Gaofei Cao (cgf16 at mails dot tsinghua dot edu dot cn)
> >>>>>>
> >>>>>> Xinyi Zhao (xyzhao16 at mails dot tsinghua dot edu dot cn)
> >>>>>>
> >>>>>> Dongfang Mao (maodf17 at mails dot tsinghua dot edu dot cn)
> >>>>>>
> >>>>>> Tianan Li(lta18 at mails dot tsinghua dot edu dot cn)
> >>>>>>
> >>>>>> Yue Su (suy18 at mails dot tsinghua dot edu dot cn)
> >>>>>>
> >>>>>> Hui Dai (daihui_iot at lenovo dot com, yuct_iot at lenovo dot com )
> >>>>>>
> >>>>>> == Sponsors ==
> >>>>>> === Champion ===
> >>>>>> Kevin A. McGrail (
> >>>>>> kmcgrail@apache.org
> >>>>>> )
> >>>>>>
> >>>>>> === Nominated Mentors ===
> >>>>>> Justin Mclean (justin at classsoftware dot com)
> >>>>>>
> >>>>>> Christofer Dutz (christofer.dutz at c-ware dot de)
> >>>>>>
> >>>>>> Willem Jiang (willem.jiang at gmail dot com)
> >>>>>>
> >>>>>>
> >>>>
> >>>>
> >>>
> >>> ---------------------------------------------------------------------
> >>> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
> >>> For additional commands, e-mail: general-help@incubator.apache.org
> >>>

Re: [Result][Vote] vote for IoTDB incubation proposal

Posted by 黄向东 <sa...@gmail.com>.
> - When you say "open source" repo, do you mean private repo vs public
> repo?

Yes.

> 
> - I believe Craig as Secretary will say an SGA never hurts but isn't
> everything already licensed ASLv2?  It's been a few weeks and a few
> proposals reviewed so it could be my memory.

Currently, the licenses of IoTDB's dependency libraries include: Apache 2.0, BSD (antlr3), EPL 1.0 (logback) and EPL 2.0 (junit).
We are checking all the licenses once more to avoid mistakes.

Regards,
Xiangdong Huang


> On Nov 15, 2018, at 10:43 PM, Kevin A. McGrail <km...@apache.org> wrote:
> 
> Well, first, let's ask some questions:
> 
> - When you say "open source" repo, do you mean private repo vs public
> repo?
> 
> - I believe Craig as Secretary will say an SGA never hurts but isn't
> everything already licensed ASLv2?  It's been a few weeks and a few
> proposals reviewed so it could be my memory.
> 
> Regards,
> KAM
> 
> --
> Kevin A. McGrail
> VP Fundraising, Apache Software Foundation
> Chair Emeritus Apache SpamAssassin Project
> https://www.linkedin.com/in/kmcgrail - 703.798.0171
> 
> 
> On Thu, Nov 15, 2018 at 7:27 AM hxd <hx...@qq.com> wrote:
> 
>> Currently, there are 6 repositories in total (IoTDB, IoTDB-JDBC, TsFile,
>> Spark-Connector, Hive-Connector, and Grafana-Connector), and we will
>> merge them all into one repository.
>> 
>> Only the first one is private.
>> 
>> Actually, we lack experience with how to open source a project.
>> 
>> Should we open all the source now or after all the Apache legal documents
>> are done?
>> 
>> Best,
>> 
>> Xiangdong Huang
>> 
>>> On Nov 15, 2018, at 5:06 PM, Willem Jiang <wi...@gmail.com> wrote:
>>> 
>>> Here is a question for the source code repository
>>> 
>>> The main source git repo[1] is still a private repo.  I think we need
>>> to open source the repo before sending the SGA?
>>> 
>>> 
>>> [1]https://github.com/thulab/iotdb
>>> 
>>> Willem Jiang
>>> 
>>> Twitter: willemjiang
>>> Weibo: 姜宁willem
>>> On Thu, Nov 15, 2018 at 4:08 PM hxd <hx...@qq.com> wrote:
>>>> 
>>>> Hi,
>>>> 
>>>> In the proposal discussion process, we got 3 mentors,  Justin Mclean,
>> Christofer Dutz, and Willem Ning Jiang.
>>>> 
>>>> In the vote process, we got a new mentor, Joe Witt.
>>>> 
>>>> In total, there is one Champion and four mentors:
>>>> 
>>>> Kevin A. McGrail (the Champion),
>>>> Justin Mclean,
>>>> Christofer Dutz,
>>>> Willem Ning Jiang, and
>>>> Joe Witt
>>>> 
>>>> I have checked their names on
>>>> http://people.apache.org/committer-index.html, and they are accurate now.
>>>> The name list on the proposal (
>>>> https://wiki.apache.org/incubator/IoTDBProposal) is also correct.
>>>> 
>>>> Regards,
>>>> Xiangdong Huang
>>>> 
>>>> 
>>>> 
>>>> On Nov 15, 2018, at 12:51 AM, Kevin A. McGrail <km...@apache.org> wrote:
>>>> 
>>>> Congratulations!  As champion, I think the next steps are:
>>>> 
>>>> 1 - Xiangdong, Can you confirm the list of mentors on the proposal is
>> accurate?
>>>> 
>>>> 2 - Also Xiangdong, Is there anyone else that stepped forward as a
>> mentor during the voting process that the project wants the IPMC to approve?
>>>> 
>>>> 3 - Justin, I think you have to request the creation of the podling and
>> then I as champion work on things like the meta data file from this page,
>>>> https://incubator.apache.org/policy/incubation.html, correct?
>>>> 
>>>> Regards,
>>>> KAM
>>>> 
>>>> 
>>>> 
>>>> 
>>>> --
>>>> Kevin A. McGrail
>>>> VP Fundraising, Apache Software Foundation
>>>> Chair Emeritus Apache SpamAssassin Project
>>>> https://www.linkedin.com/in/kmcgrail - 703.798.0171
>>>> 
>>>> 
>>>> On Wed, Nov 14, 2018 at 6:29 AM hxd <hx...@qq.com> wrote:
>>>>> 
>>>>> Hi,
>>>>> 
>>>>> With 8 +1 binding votes, 2 +1 non-binding votes and no +/-0 or -1
>>>>> votes, this VOTE passes.
>>>>> 
>>>>> Thanks to everyone who voted!
>>>>> 
>>>>> Below is the voting tally:
>>>>> 
>>>>> Binding
>>>>> Von Gosling
>>>>> Christofer Dutz
>>>>> Kevin A. McGrail
>>>>> Felix Cheung
>>>>> Matt Sticker
>>>>> Joe Witt
>>>>> Justin Mclean
>>>>> Willem Jiang
>>>>> 
>>>>> 
>>>>> Non-binding
>>>>> Sheng Wu
>>>>> Yang Bo
>>>>> 
>>>>> The vote thread:
>>>>> https://lists.apache.org/thread.html/077f029ab2b52a2b19fc8d41c07438f660a8e93dd87b3895d262263c@%3Cgeneral.incubator.apache.org%3E
>>>>> The proposal: https://wiki.apache.org/incubator/IoTDBProposal
>>>>> 
>>>>> Thanks,
>>>>> 
>>>>> Xiangdong Huang
>>>>> 
>>>>> 
>>>>>> On Nov 7, 2018, at 3:46 PM, hxd <hx...@qq.com> wrote:
>>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> Sorry for the badly formatted previous mail.
>>>>>> I'd like to call a VOTE to accept the IoTDB project, a database for
>> managing large amounts of time series data from IoT sensors in industrial
>> applications, into the Apache Incubator.
>>>>>> The full proposal is available on the wiki:
>> https://wiki.apache.org/incubator/IoTDBProposal
>>>>>> and it is also attached below for your convenience.
>>>>>> 
>>>>>> Please cast your vote:
>>>>>> 
>>>>>> [ ] +1, bring IoTDB into Incubator
>>>>>> [ ] +0, I don't care either way,
>>>>>> [ ] -1, do not bring IoTDB into Incubator, because...
>>>>>> 
>>>>>> The vote will be open for at least 72 hours.
>>>>>> 
>>>>>> Thanks,
>>>>>> Xiangdong Huang.
>>>>>> 
>>>>>> 
>>>>>> = IoTDB Proposal  =
>>>>>> v0.1.1
>>>>>> 
>>>>>> 
>>>>>> == Abstract ==
>>>>>> IoTDB is a data store for managing large amounts of time series data
>> such as timestamped data from IoT sensors in industrial applications.
>>>>>> 
>>>>>> == Proposal ==
>>>>>> IoTDB is a database for managing large amounts of time series data
>> with columnar storage, data encoding, pre-computation, and index
>> techniques. It has a SQL-like interface to write millions of data points per
>> second per node and is optimized to return query results in a few seconds over
>> trillions of data points. It can also be easily integrated with Apache
>> Hadoop MapReduce and Apache Spark for analytics.
>>>>>> 
>>>>>> == Background ==
>>>>>> 
>>>>>> A new class of data management system requirements is becoming
>> increasingly important with the rise of the Internet of Things. There are
>> some database systems and technologies aimed at time series data
>> management. For example, Gorilla and InfluxDB are mainly built for
>> data centers and monitoring application metrics. Other systems, for
>> example, OpenTSDB and KairosDB, are built on Apache HBase and Apache
>> Cassandra, respectively.
>>>>>> 
>>>>>> However, many applications for time series data management have additional
>> requirements, especially in industrial applications, as follows:
>>>>>> 
>>>>>> * Supporting time series data which has high data frequency. For
>> example, a turbine engine may generate 1000 points per second (i.e.,
>> 1000Hz), while each CPU only reports 1 data point every 5 seconds in a data
>> center monitoring application.
>>>>>> 
>>>>>> * Supporting scanning data at multiple resolutions. For example,
>> aggregation operations are important for time series data.
>>>>>> 
>>>>>> * Supporting special queries for time series, such as pattern
>> matching, time series segmentation, time-frequency transformation and
>> frequency query.
>>>>>> 
>>>>>> * Supporting a large number of monitoring targets (i.e. time series).
>> An excavator may report more than 1000 time series, for example, revolving
>> speed of the motor-engine, the speed of the excavator, the accelerated
>> speed, the temperature of the water tank and so on, while a CPU or an
>> application monitor has much fewer time series.
>>>>>> 
>>>>>> * Optimization for out-of-order data points. In the industrial
>> sector, it is common that equipment sends data using the UDP protocol
>> rather than the TCP protocol. Sometimes, the network connection is unstable
>> and parts of the data will be buffered for later sending.
>>>>>> 
>>>>>> * Supporting long-term storage. Historical data is precious for
>> equipment manufacturers. Therefore, removing or unloading historical data
>> is highly desired for most industrial applications. The database system
>> must not only support fast retrieval of historical data, but also should
>> guarantee that the historical data does not impact the processing speed for
>> “hot” or current data.
>>>>>> 
>>>>>> * Supporting online transaction processing (OLTP) as well as complex
>> analytics. Analyzing the data files directly with
>> Apache Spark/Apache Hadoop MapReduce is better than
>> transforming data files to another file format for Big Data analytics.
>>>>>> 
>>>>>> * Flexible deployment either on premise or in the cloud.  IoTDB is
>> simple enough to be deployed on a Raspberry Pi handling hundreds of time
>> series. Meanwhile, the system can also be deployed in the cloud so that it
>> supports tens of millions of ingestions per second, OLTP queries in
>> milliseconds, and analytics using Apache Spark/Apache Hadoop MapReduce.
>>>>>> 
>>>>>> * * (1) If users deploy IoTDB on a device, such as a Raspberry Pi, a
>> wind turbine, or a meteorological station, the deployment of the
>> database is designed to be simple. A device may have hundreds of time
>> series (but fewer than a thousand) and the database needs to
>> handle them.
>>>>>> * * (2) When deploying IoTDB in a data center, the computational
>> resources (i.e., the hardware configuration of servers) are not a constraint
>> when compared to a Raspberry Pi. In this deployment, IoTDB can use more
>> computational resources, and has the ability to handle more time series
>> (e.g., millions of time series).
>>>>>> 
>>>>>> Based on these requirements, we developed IoTDB, a new data store
>> system for managing time series data.
>>>>>> 
>>>>>> IoTDB started as a Tsinghua University research project. IoTDB's
>> developer community has also grown to include additional institutions, for
>> example, universities (e.g., Fudan University), research labs (e.g., NEL-BDS
>> lab), and corporations (e.g., K2Data, Tencent). Funding has been provided
>> by various institutions including the National Natural Science Foundation
>> of China, and industry sponsors, such as Lenovo and K2Data.
>>>>>> 
>>>>>> == Rationale ==
>>>>>> Because no existing open source time series database
>> covers all the above requirements, we developed IoTDB. As the system
>> matures, we are seeking a long-term home for the project. We believe the
>> Apache Software Foundation would be an ideal fit. Also, joining Apache will
>> help coordinate and improve the development effort of the growing number of
>> organizations which contribute to IoTDB, improving the diversity of our
>> community.
>>>>>> 
>>>>>> IoTDB contains multiple modules, which are classified into categories:
>>>>>> 
>>>>>> * '''TsFile Format''': TsFile is a new columnar file format.
>>>>>> * '''Adaptor for Analytics and Visualization''': Integrating TsFile
>> with Apache Hadoop HDFS, Apache Hadoop MapReduce and Apache Spark. Examples
>> of integrating IoTDB with Apache Kafka, Apache Storm and Grafana are also
>> provided.
>>>>>> * '''IoTDB Engine''': An engine which consists of a SQL parser, query
>> plan generator, memtable, authentication and authorization, write-ahead log
>> (WAL), crash recovery, out-of-order data handler, and index for aggregation
>> and pattern matching. The engine stores system data in TsFile format.
>>>>>> * '''IoTDB JDBC''': An implementation of Java Database Connectivity
>> (JDBC) for clients to connect to IoTDB using Java.
>>>>>> 
>>>>>> === TsFile Format ===
>>>>>> 
>>>>>> TsFile format is a columnar store, which is similar to Apache
>> Parquet and Apache CarbonData. It has the concepts of Chunk Group, Column
>> Chunk, Page and Footer. Compared with Apache Parquet and Apache
>> CarbonData, it is designed and optimized for time series:
>>>>>> 
>>>>>> ==== Time Series Friendly Encoding ====
>>>>>> IoTDB currently supports run length encoding (RLE), delta-of-delta
>> encoding, and Facebook's Gorilla encoding.
>>>>>> 
>>>>>> Lossy encoding methods (e.g., Piecewise Linear Approximation (PLA)
>> and time-frequency transformation) are works in progress.
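
To illustrate why delta-of-delta encoding suits regularly sampled series, here is a minimal Python sketch. It is not taken from the IoTDB code base: the function names are hypothetical, and the real encoder also bit-packs the resulting small integers, which this sketch leaves out.

```python
def delta_of_delta_encode(timestamps):
    """Return (first timestamp, first delta, list of second-order deltas)."""
    if not timestamps:
        return None, None, []
    if len(timestamps) == 1:
        return timestamps[0], None, []
    first_delta = timestamps[1] - timestamps[0]
    dods = []
    prev_delta = first_delta
    for prev, cur in zip(timestamps[1:], timestamps[2:]):
        delta = cur - prev
        # For a steady sampling rate this difference is 0 or close to 0.
        dods.append(delta - prev_delta)
        prev_delta = delta
    return timestamps[0], first_delta, dods


def delta_of_delta_decode(first, first_delta, dods):
    """Rebuild the original timestamps from the encoded triple."""
    if first is None:
        return []
    out = [first]
    if first_delta is None:
        return out
    out.append(first + first_delta)
    delta = first_delta
    for dod in dods:
        delta += dod
        out.append(out[-1] + delta)
    return out
```

A sensor reporting once per second with 1 ms of jitter yields second-order deltas such as 0, 1 and -1, which need far fewer bits than 13-digit millisecond timestamps.
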
>>>>>> 
>>>>>> 
>>>>>> ==== Chunk Group ====
>>>>>> The data part of a TsFile consists of many Chunk Groups. Each Chunk
>> Group stores the data of a device over a time interval.  A Chunk Group is
>> similar to the row group in Apache Parquet, but with some
>> constraints on the time dimension:  For each device, the time intervals of
>> different Chunk Groups do not overlap, and a later Chunk Group always
>> has larger timestamps.
>>>>>> 
>>>>>> Given a TsFile and a query with a time range filter, the query
>> process can terminate scanning data once it reads data points whose
>> timestamp reaches the time limit of the filter. We call the feature
>> ''fast-return'' and it makes the time range query in a TsFile very
>> efficient.
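
The ''fast-return'' idea can be sketched in a few lines of Python. This is only an illustration under the assumption that data points arrive as `(timestamp, value)` pairs in ascending time order; `scan_time_range` is a hypothetical name, not an IoTDB API.

```python
def scan_time_range(points, start, end):
    """Collect points with start <= timestamp <= end from a time-ordered scan.

    Returns the matching points and the number of points actually read,
    so the effect of stopping early is visible.
    """
    result = []
    scanned = 0
    for ts, value in points:
        scanned += 1
        if ts > end:
            # Data is time-ordered, so nothing later can match: stop here.
            break
        if ts >= start:
            result.append((ts, value))
    return result, scanned
```

Because of the ordering guarantee, the scan reads at most one point past the upper bound of the filter instead of the whole file.
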
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> ==== Different Column Chunk Format (No Repetition (R)
>> and Definition (D) Fields Needed) ====
>>>>>> 
>>>>>> While Apache Parquet and Apache CarbonData support complex data
>> types, e.g., nested data and sparse columns, TsFile is exclusively designed
>> for time series whose data model is <device_id, series_id, timestamp,
>> value>.
>>>>>> 
>>>>>> In a `Chunk Group`, each time series is a `Column Chunk`. Even though
>> these time series belong to the same device, the data points in different
>> time series are not aligned in the time dimension originally.
>>>>>> 
>>>>>> For example, if you have a device with 2 sensors with the same data
>> collection frequency, sensor 1 may collect data at time 1521622662000
>> while the other one collects data at time 1521622662001 (delta=1ms).
>> Therefore, each Column Chunk has its own timestamps and values, which is quite
>> different from Apache Parquet and Apache CarbonData.  Because we store the
>> time column along with each value column, instead of making different chunks
>> share the same time column, to accommodate the diverse data frequencies of
>> different time series, we do not store any null value on disk to align
>> across time series. Besides, we do not need to attach `repetition` (R) and
>> `definition` (D) fields to each value. Therefore, disk space is saved
>> and query latency is reduced (because we do not align data by
>> calculating R and D fields).
>>>>>> 
>>>>>> 
>>>>>> ==== Domain Specific Information in Each Page ====
>>>>>> Similar to Apache Parquet and Apache CarbonData, a `Column Chunk`
>> consists of several `Pages`, and each `Page` has a `Page header`. The `Page
>> header` is a summary of the data in the page.
>>>>>> 
>>>>>> Because TsFile is optimized for time series, the page header contains
>> more domain specific information, such as the minimal and maximal value,
>> the minimal and the maximal timestamp, the frequency and so on. TsFile can
>> even store the histogram of values in the page header.
>>>>>> 
>>>>>> This header information helps IoTDB in speeding up queries by
>> skipping unnecessary pages.
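
As a rough Python sketch of how such header statistics allow pages to be skipped: the `PageHeader` fields and the `pages_to_read` function below are assumptions made for illustration and do not reflect the actual TsFile layout.

```python
from dataclasses import dataclass


@dataclass
class PageHeader:
    """Hypothetical per-page summary; real headers carry more fields."""
    min_time: int
    max_time: int
    min_value: float
    max_value: float


def pages_to_read(headers, start, end, value_at_least=None):
    """Return indexes of pages that may contain matching data points.

    A page is skipped when its time range cannot overlap [start, end], or
    when its maximal value is below the requested value threshold.
    """
    selected = []
    for i, header in enumerate(headers):
        if header.max_time < start or header.min_time > end:
            continue  # no timestamp in this page can match
        if value_at_least is not None and header.max_value < value_at_least:
            continue  # no value in this page can match
        selected.append(i)
    return selected
```

Only the selected pages need to be decoded; the others are ruled out from their headers alone.
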
>>>>>> 
>>>>>> 
>>>>>> === Adaptor for Analytics ===
>>>>>> The TsFile provides:
>>>>>> 
>>>>>> * InputFormat/OutputFormat interfaces for Reading/Writing data.
>>>>>> * Deep integration with Apache Spark/Hadoop MapReduce including
>> predicate push-down, column pruning, aggregation push down, etc. So users
>> can use Apache Spark SQL/HiveQL to connect and query TsFiles.
>>>>>> 
>>>>>> 
>>>>>> === IoTDB Engine ===
>>>>>> The IoTDB engine is a database engine, which uses TsFile as its
>> storage file format. The IoTDB Engine supports SQL-like queries plus many
>> useful functions:
>>>>>> 
>>>>>> * Tree-based time series schema
>>>>>> * Log-Structured Merge (LSM)-based storage
>>>>>> * Overflow file for out-of-order data
>>>>>> * Scalable index framework
>>>>>> * Special queries for time series
>>>>>> 
>>>>>> ==== Tree-based Time Series Schema ====
>>>>>> IoTDB manages all the time series definitions using a tree structure.
>> A path from the root of the tree to a leaf node represents a time series.
>> Therefore, the unique id of a time series is a path, e.g.,
>> `root.China.beijing.windFarm1.windTurbine1.speed`.
>>>>>> 
>>>>>> This kind of schema can express `group by` naturally. For example,
>> `root.China.beijing.windFarm1.*.speed` represents the speed of all the wind
>> turbines in wind farm 1 in Beijing, China.
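
The path-based grouping can be illustrated with a small Python sketch. It assumes that `*` matches exactly one level of the tree, which is an assumption of this example rather than a statement about IoTDB's actual wildcard semantics; `match_paths` is a hypothetical helper.

```python
def match_paths(pattern, series_paths):
    """Return the series paths matching a dot-separated pattern.

    Assumption: '*' stands for exactly one node of the path.
    """
    pattern_nodes = pattern.split(".")
    matched = []
    for path in series_paths:
        nodes = path.split(".")
        if len(nodes) != len(pattern_nodes):
            continue
        if all(p == "*" or p == n for p, n in zip(pattern_nodes, nodes)):
            matched.append(path)
    return matched
```

With this helper, `root.China.beijing.windFarm1.*.speed` selects the `speed` series of every turbine registered under `windFarm1` and nothing else.
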
>>>>>> 
>>>>>> ==== Log-Structured Merge (LSM)-based Storage ====
>>>>>> In a time series, the data points should be ordered by their
>> timestamps. In IoTDB, we use a Log-Structured Merge (LSM)-based mechanism.
>> Therefore, a part of the data is stored in memory first; this part is called
>> the `memtable`. At this time, if data points arrive out of order, we re-sort
>> them in memory. When this part of data exceeds the configured memory limit,
>> we flush it to disk as a `Chunk Group` into an unclosed TsFile.  As a result, a
>> TsFile may contain several Chunk Groups, which reduces the number of small
>> data files, and in turn reduces the I/O load of the storage system
>> and the execution time of a file-merge in LSM. Notice that the data
>> is time-ordered in one Chunk Group on disk, and this layout is helpful for
>> fast filtering in one Chunk Group for a query.
>>>>>> 
>>>>>> Rule 1: In a TsFile, the Chunk Groups of one device are ordered by
>> timestamp, which is helpful for fast filtering among Chunk Groups
>> for a query.
>>>>>> 
>>>>>> Rule 2: When the size of the unclosed TsFile reaches the threshold
>> defined in the configuration file, we close the file and generate a new one
>> to store newly arriving data spanning the entire data set. Like many systems
>> which use LSM-based storage, we never modify a TsFile which has been closed
>> except for the file-merge process.
>>>>>> 
>>>>>> Rule 3: To reduce the number of TsFiles involved in a query process,
>> we guarantee that the data points in different TsFiles do not overlap
>> in the time dimension after file merging.
>>>>>> 
>>>>>> ==== Overflow File for Out-of-order Data ====
>>>>>> When a part of data is flushed to disk (and will form a `Chunk Group`
>> in a TsFile), the newly arriving data points whose timestamps are smaller
>> than the largest timestamp in the TsFile are `out-of-order`.
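
The write path described so far (buffer in a memtable, sort and flush as a Chunk Group, then treat later points with smaller timestamps as out-of-order) can be sketched in Python as follows. The class, its fields, and the count-based memory limit are simplifications assumed for this example; they are not the IoTDB implementation.

```python
class SeriesWriter:
    """Toy write path for a single time series."""

    def __init__(self, memtable_limit):
        self.memtable_limit = memtable_limit  # max points held in memory
        self.memtable = []         # (timestamp, value) pairs, arrival order
        self.chunk_groups = []     # flushed, time-ordered lists of points
        self.unsequence = []       # out-of-order points kept separately
        self.last_flushed_time = None

    def write(self, ts, value):
        # A point older than the largest flushed timestamp is out-of-order.
        if self.last_flushed_time is not None and ts < self.last_flushed_time:
            self.unsequence.append((ts, value))
            return
        self.memtable.append((ts, value))
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        if not self.memtable:
            return
        # Points that arrived disordered in memory are sorted before flushing.
        self.memtable.sort(key=lambda point: point[0])
        self.chunk_groups.append(self.memtable)
        self.last_flushed_time = self.memtable[-1][0]
        self.memtable = []
```

Disorder inside the memtable is cheap to repair with a sort, while disorder relative to data already flushed has to be routed to separate storage.
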
>>>>>> 
>>>>>> To store the out-of-order data, we organize all the troublesome
>> `out-of-order` data point insertions into a special TsFile, named
>> `UnSequenceTsFile`. In an UnSequenceTsFile, the Chunk Groups of one device
>> may overlap in the time dimension, which violates Rule 1 and
>> costs additional time compared to a normal TsFile for query filtering.
>>>>>> 
>>>>>> There is another special operation: updating all the data points in a
>> time range, e.g., `update all the speed values of device1 as 0 where the
>> data time is in [1521622000000, 1521622662000]`. The operation is called
>> when: (1) a sensor malfunctions and the database receives wrong data for a
>> period; (2) we may want to reset all the records. Many NoSQL time series
>> databases do not support such an operation. To support the operation in
>> IoTDB, we use a tree-based structure, Treap, to store this part of
>> operations and store them as `Overflow` files.
>>>>>> 
>>>>>> Therefore, there are 3 kinds of data files: TsFiles,
>> UnSequenceTsFiles and Overflow files.  TsFiles should store most of the
>> data. The volume of UnSequenceTsFiles depends on the workload: if there are
>> many out-of-order data points and they span a long time range, the volume
>> will be large. Overflow files handle the fewest data operations, and their volume depends
>> on the use of the special operations.
>>>>>> 
>>>>>> ==== LSM-tree ====
>>>>>> Normally, LSM-based storage engines merge data files level by level
>> so that it looks like a tree structure. In this way, data is well
>> organized. The disadvantage is that data will be read and written several
>> times. If the tree has 4 levels, each data point will be rewritten at least
>> 4 times.
>>>>>> 
>>>>>> Currently, we do not merge all the TsFiles into one because (1) the
>> number of TsFiles is kept lower than many LSM storage engines because a
>> memtable is mapped to several Chunk Groups rather than a file; (2)
>> different TsFiles are not overlapping with each other in the time dimension
>> (because of Rule 3).
>>>>>> 
>>>>>> As mentioned before, TsFile supports ''fast-return'' to accelerate
>> queries. However, UnSequenceTsFile and Overflow files do not support this
>> feature. The time spans of UnSequenceTsFiles, Overflow files and TsFiles may
>> overlap, which leads to more files being involved in the query process. To
>> accelerate these queries, there is a merging process to reorganize files in
>> the background. All three kinds of files (TsFiles, UnSequenceTsFiles
>> and Overflow files) are involved in the merging process. The merging
>> process is implemented using multi-threading, with each thread
>> responsible for a series family.
>>>>>> After merging, only TsFiles are left. These files have
>> non-overlapping time spans and support the ''fast-return'' feature.
>>>>>> 
>>>>>> ==== Scalable Index Framework ====
>>>>>> We allow users to implement indexes for faster queries. We currently
>> support an index for pattern matching query (KV-Match index, ICDE 2019).
>> Another index for fast aggregation (PISA index, CIKM 2016) is a
>> work-in-progress.
>>>>>> 
>>>>>> ==== Special Queries ====
>>>>>> We currently support `group by time interval` aggregation queries and
>> `Fill by` operations, which are similar to those of InfluxDB. Time series
>> segmentation operations and frequency queries are works in progress.
>>>>>> 
>>>>>> == Initial Goals ==
>>>>>> The initial goals are to be open sourced and to integrate with the
>> Apache development process. Furthermore, we plan for incremental
>> development and releases in line with the Apache guidelines.
>>>>>> 
>>>>>> == Current Status ==
>>>>>> We have developed the system for more than 2 years. There are
>> currently 13k lines of code, some of which are generated by Antlr3 and
>> Thrift.  There are 230 issues which have been solved and more than 1500
>> commits.
>>>>>> 
>>>>>> The system has been deployed in the staging environment of the State
>> Grid Corporation of China to handle ~3 million time series (i.e., ~30,000
>> power generation assemblies * ~100 sensors) and at an equipment service company
>> in China managing ~2 million time series (i.e., ~20k devices * 100 sensors).
>> The insertion speed reaches ~2 million points/second/node, which is faster
>> than InfluxDB, OpenTSDB and Apache Cassandra in our environment.
>>>>>> 
>>>>>> There are many new features in the works including those mentioned
>> herein. We will add more analytics functions, improve the data file merge
>> process, and finish the first released version of IoTDB.
>>>>>> 
>>>>>> == Meritocracy ==
>>>>>> The IoTDB project operates on meritocratic principles. Developers who
>> submit more code with higher quality earn more merit. We have used the `Issues`
>> and `Pull Requests` modules on GitHub for collecting users' suggestions and
>> patches. Users who submit issues, pull requests, or documentation, or who help with
>> community management, are welcomed and encouraged to become committers.
>>>>>> 
>>>>>> == Community ==
>>>>>> 
>>>>>> The IoTDB project users communicate on GitHub (
>>>>>> https://github.com/thulab/tsfile) . Developers
>> communicate on a website which is similar to JIRA (Currently, only
>> registered users can apply to access the project for communication, url:
>> https://tower.im/projects/36de8571a0ff4833ae9d7f1c5c400c22/
>>>>>> ). We have also introduced IoTDB at many technical conferences. Next,
>> we will set up a mailing list for more convenient, broader communication
>> and archived discussions.
>>>>>> 
>>>>>> If IoTDB is accepted for incubation at the Apache Software
>> Foundation, the primary goal is to build a larger community. We believe
>> that IoTDB will become a key project for time series data management, and
>> so, we will rely on a large community of users and developers.
>>>>>> 
>>>>>> TODO: IoTDB is currently on a private Github repository (
>>>>>> https://github.com/thulab/iotdb), while its subproject TsFile (a
>> file format for storing time series data) is open sourced on Github (
>> https://github.com/thulab/tsfile
>>>>>> ).
>>>>>> 
>>>>>> == Core Developers ==
>>>>>> IoTDB was initially developed by two dozen students and teachers at
>> Tsinghua University. Since then, more developers have joined from
>> other universities: Fudan University, Northwestern Polytechnical University
>> and Harbin Institute of Technology in China.  Other developers come from
>> companies such as Lenovo and Microsoft. We will be working to
>> bring more developers into the project to contribute to
>> IoTDB.
>>>>>> 
>>>>>> == Relationships with Other Apache Products ==
>>>>>> IoTDB requires some Apache products (Apache Thrift, commons,
>> collections, httpclient).
>>>>>> 
>>>>>> IoTDB-Spark-connector and IoTDB-Hadoop-connector have been developed
>> for supporting analysing time series data by using Apache Spark and
>> MapReduce.
>>>>>> 
>>>>>> Overall, IoTDB is designed as an open architecture, and it can be
>> integrated with many other systems in the future.
>>>>>> 
>>>>>> As mentioned before, in the IoTDB project, we designed a new columnar
>> file format, called TsFile, which is similar to Apache Parquet. However,
>> the new file format is optimized for time series data.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> == Known Risks ==
>>>>>> 
>>>>>> === Orphaned Products ===
>>>>>> Given the current level of investment in IoTDB, the risk of the
>> project being abandoned is minimal. Time series data is more and more
>> important and there are several constituents who are highly motivated to
>> continue development. Tsinghua and NEL-BDS Lab rely on IoTDB as a
>> platform for a large number of long-term research projects. We have
>> deployed IoTDB in some companies' staging environments for future
>> applications.
>>>>>> 
>>>>>> === Inexperience with Open Source ===
>>>>>> Students and researchers at Tsinghua University have been developing
>> and using open source software for a long time. It is wonderful for students to be
>> guided into a formal open source process. Some of our
>> committers
>>>>>> have experience contributing to open source, for example:
>>>>>> 
>>>>>> * druid:
>>>>>> 
>> https://github.com/druid-io/druid/commit/f18cc5df97e5826c2dd8ffafba9fcb69d10a4d44
>>>>>> 
>>>>>> * druid:
>>>>>> 
>> https://github.com/druid-io/druid/commit/aa7aee53ce524b7887b218333166941654788794
>>>>>> 
>>>>>> * YCSB:
>>>>>> https://github.com/brianfrankcooper/YCSB/pull/776
>>>>>> 
>>>>>> 
>>>>>> Additionally, several ASF veterans and industry veterans have agreed
>> to mentor the project and are listed in this proposal. The project will
>> rely on their guidance and collective wisdom to quickly transition the
>> entire team of initial committers towards practicing the Apache Way.
>>>>>> 
>>>>>> 
>>>>>> === Reliance on Salaried Developers ===
>>>>>> Most of the current developers are students and researchers/professors in
>> universities, and their research focuses on big data management and
>> analytics. It is unlikely that they will change their research focus away
>> from big data management.  We will work to ensure that the
>> project can continue to be stewarded and to move forward independent of
>> salaried developers.
>>>>>> 
>>>>>> === An Excessive Fascination with the Apache Brand ===
>>>>>> Most of the initial developers come from Tsinghua University with no
>> intent to use the Apache brand for profit. We have no plans to make use
>> of the Apache brand in press releases or to post billboards advertising
>> the acceptance of IoTDB into the Apache Incubator.
>>>>>> 
>>>>>> 
>>>>>> == Initial Source ==
>>>>>> IoTDB's GitHub address and some required dependencies:
>>>>>> 
>>>>>> * The storage file format:
>>>>>> https://github.com/thulab/tsfile
>>>>>> 
>>>>>> * Adaptor for Apache Hadoop MapReduce:
>>>>>> https://github.com/thulab/tsfile-hadoop-connector
>>>>>> 
>>>>>> * Adaptor for Apache Spark:
>>>>>> https://github.com/thulab/tsfile-spark-connector
>>>>>> 
>>>>>> * Adaptor for Grafana:
>>>>>> https://github.com/thulab/iotdb-grafana
>>>>>> 
>>>>>> * The database engine:
>>>>>> https://github.com/thulab/iotdb
>>>>>> (private project up to now)
>>>>>> * The client driver:
>>>>>> https://github.com/thulab/iotdb-jdbc
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> === External Dependencies ===
>>>>>> To the best of our knowledge, all dependencies of IoTDB are
>> distributed under Apache compatible licenses. Upon acceptance to the
>> incubator, we would begin a thorough analysis of all transitive
>> dependencies to verify this fact and introduce license checking into the
>> build and release process.
>>>>>> 
>>>>>> == Documentation ==
>>>>>> * Documentation for TsFile:
>>>>>> https://github.com/thulab/tsfile/wiki
>>>>>> 
>>>>>> * Documentation for IoTDB and its JDBC:
>>>>>> http://tsfile.org/document
>>>>>> (Chinese only. An English version is in progress.)
>>>>>> 
>>>>>> == Required Resources ==
>>>>>> === Mailing Lists ===
>>>>>> *
>>>>>> private@iotdb.incubator.apache.org
>>>>>> 
>>>>>> *
>>>>>> dev@iotdb.incubator.apache.org
>>>>>> 
>>>>>> *
>>>>>> commits@iotdb.incubator.apache.org
>>>>>> 
>>>>>> 
>>>>>> === Git Repositories ===
>>>>>> *
>>>>>> https://git-wip-us.apache.org/repos/asf/incubator-iotdb.git
>>>>>> 
>>>>>> 
>>>>>> === Issue Tracking ===
>>>>>> *  JIRA IoTDB (We currently use the issue management provided by
>> GitHub to track issues.)
>>>>>> 
>>>>>> 
>>>>>> == Initial Committers ==
>>>>>> Tsinghua University, K2Data Company, Lenovo, Microsoft
>>>>>> 
>>>>>> Jianmin Wang (jimwang at tsinghua dot edu dot cn )
>>>>>> 
>>>>>> Xiangdong Huang (sainthxd at gmail dot com)
>>>>>> 
>>>>>> Jun Yuan (richard_yuan16 at 163 dot com)
>>>>>> 
>>>>>> Chen Wang ( wang_chen at tsinghua dot edu dot cn)
>>>>>> 
>>>>>> Jialin Qiao (qjl16 at mails dot tsinghua dot edu dot cn)
>>>>>> 
>>>>>> Jinrui Zhang (jinrzhan at microsoft dot com)
>>>>>> 
>>>>>> Rong Kang (kr11 at mails dot tsinghua dot edu dot cn)
>>>>>> 
>>>>>> Tian Jiang(jiangtia18 at mails dot tsinghua dot edu dot cn)
>>>>>> 
>>>>>> Shuo Zhang (zhangshuo at k2data dot com dot cn)
>>>>>> 
>>>>>> Lei Rui (rl18 at mails dot tsinghua dot edu dot cn)
>>>>>> 
>>>>>> Rui Liu (liur17 at mails dot tsinghua dot edu dot cn)
>>>>>> 
>>>>>> Kun Liu (liukun16 at mails dot tsinghua dot edu dot cn)
>>>>>> 
>>>>>> Gaofei Cao (cgf16 at mails dot tsinghua dot edu dot cn)
>>>>>> 
>>>>>> Xinyi Zhao (xyzhao16 at mails dot tsinghua dot edu dot cn)
>>>>>> 
>>>>>> Dongfang Mao (maodf17 at mails dot tsinghua dot edu dot cn)
>>>>>> 
>>>>>> Tianan Li(lta18 at mails dot tsinghua dot edu dot cn)
>>>>>> 
>>>>>> Yue Su (suy18 at mails dot tsinghua dot edu dot cn)
>>>>>> 
>>>>>> Hui Dai (daihui_iot at lenovo dot com, yuct_iot at lenovo dot com )
>>>>>> 
>>>>>> == Sponsors ==
>>>>>> === Champion ===
>>>>>> Kevin A. McGrail (
>>>>>> kmcgrail@apache.org
>>>>>> )
>>>>>> 
>>>>>> === Nominated Mentors ===
>>>>>> Justin Mclean (justin at classsoftware dot com)
>>>>>> 
>>>>>> Christofer Dutz (christofer.dutz at c-ware dot de)
>>>>>> 
>>>>>> Willem Jiang (willem.jiang at gmail dot com)
>>>>>> 
>>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> 


---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
For additional commands, e-mail: general-help@incubator.apache.org


Re: [Result][Vote] vote for IoTDB incubation proposal

Posted by "Kevin A. McGrail" <km...@apache.org>.
Well, first, let's ask some questions:

- When you say "open source" repo, do you mean private repo vs public
repo?

- I believe Craig as Secretary will say an SGA never hurts but isn't
everything already licensed ASLv2?  It's been a few weeks and a few
proposals reviewed so it could be my memory.

Regards,
KAM

--
Kevin A. McGrail
VP Fundraising, Apache Software Foundation
Chair Emeritus Apache SpamAssassin Project
https://www.linkedin.com/in/kmcgrail - 703.798.0171


On Thu, Nov 15, 2018 at 7:27 AM hxd <hx...@qq.com> wrote:

> Currently, there are 6 repositories in total (IoTDB, IoTDB-JDBC, TsFile,
> Spark-Connector, Hive-Connector, and Grafana-Connector), and we will
> merge them all into one repository.
>
> Only the first one is private.
>
> Actually, we lack experience in how to open source a project.
>
> Should we open all the source now or after all the Apache legal documents
> are done?
>
> Best,
>
> Xiangdong Huang
>
> > On Nov 15, 2018, at 5:06 PM, Willem Jiang <wi...@gmail.com> wrote:
> >
> > Here is a question about the source code repository
> >
> > The main source git repo[1] is still a private repo.  I think we need
> > to open source the repo before sending the SGA?
> >
> >
> > [1]https://github.com/thulab/iotdb
> >
> > Willem Jiang
> >
> > Twitter: willemjiang
> > Weibo: 姜宁willem
> > On Thu, Nov 15, 2018 at 4:08 PM hxd <hx...@qq.com> wrote:
> >>
> >> Hi,
> >>
> >> In the proposal discussion process, we got 3 mentors,  Justin Mclean,
> Christofer Dutz, and Willem Ning Jiang.
> >>
> >> In the vote process, we got a new mentor, Joe Witt.
> >>
> >> In total, there are one Champion and four mentors:
> >>
> >> Kevin A. McGrail (the Champion),
> >> Justin Mclean,
> >> Christofer Dutz,
> >> Willem Ning Jiang, and
> >> Joe Witt
> >>
> >> I have checked their names on
> http://people.apache.org/committer-index.html, and they are accurate now.
> >> The name list on the proposal list (
> https://wiki.apache.org/incubator/IoTDBProposal) is also correct.
> >>
> >> Regards,
> >> Xiangdong Huang
> >>
> >>
> >>
> >> On Nov 15, 2018, at 12:51 AM, Kevin A. McGrail <km...@apache.org> wrote:
> >>
> >> Congratulations!  As champion, I think the next steps are:
> >>
> >> 1 - Xiangdong, Can you confirm the list of mentors on the proposal is
> accurate?
> >>
> >> 2 - Also Xiangdong, Is there anyone else that stepped forward as a
> mentor during the voting process that the project wants the IPMC to approve?
> >>
> >> 3 - Justin, I think you have to request the creation of the podling and
> then I as champion work on things like the meta data file from this page,
> >> https://incubator.apache.org/policy/incubation.html, correct?
> >>
> >> Regards,
> >> KAM
> >>
> >>
> >>
> >>
> >> --
> >> Kevin A. McGrail
> >> VP Fundraising, Apache Software Foundation
> >> Chair Emeritus Apache SpamAssassin Project
> >> https://www.linkedin.com/in/kmcgrail - 703.798.0171
> >>
> >>
> >> On Wed, Nov 14, 2018 at 6:29 AM hxd <hx...@qq.com> wrote:
> >>>
> >>> Hi,
> >>>
> >>> With 8 +1 binding votes, 2 +1 non-binding votes, and no +/-0 or -1
> votes, this VOTE passes.
> >>>
> >>> Thanks to everyone who voted!
> >>>
> >>> Below is the voting tally:
> >>>
> >>> Binding
> >>> Von Gosling
> >>> Christofer Dutz
> >>> Kevin A. McGrail
> >>> Felix Cheung
> >>> Matt Sicker
> >>> Joe Witt
> >>> Justin Mclean
> >>> Willem Jiang
> >>>
> >>>
> >>> Non-binding
> >>> Sheng Wu
> >>> Yang Bo
> >>>
> >>> The vote thread:
> https://lists.apache.org/thread.html/077f029ab2b52a2b19fc8d41c07438f660a8e93dd87b3895d262263c@%3Cgeneral.incubator.apache.org%3E
> <
> https://lists.apache.org/thread.html/077f029ab2b52a2b19fc8d41c07438f660a8e93dd87b3895d262263c@%3Cgeneral.incubator.apache.org%3E
> >
> >>> The proposal: https://wiki.apache.org/incubator/IoTDBProposal <
> https://wiki.apache.org/incubator/IoTDBProposal>
> >>>
> >>> Thanks,
> >>>
> >>> Xiangdong Huang
> >>>
> >>>
> >>>> On Nov 7, 2018, at 3:46 PM, hxd <hx...@qq.com> wrote:
> >>>>
> >>>> Hi,
> >>>>
> >>>> Sorry for the badly formatted previous mail.
> >>>> I'd like to call a VOTE to accept the IoTDB project, a database for
> managing large amounts of time series data from IoT sensors in industrial
> applications, into the Apache Incubator.
> >>>> The full proposal is available on the wiki:
> https://wiki.apache.org/incubator/IoTDBProposal
> >>>> and it is also attached below for your convenience.
> >>>>
> >>>> Please cast your vote:
> >>>>
> >>>>  [ ] +1, bring IoTDB into Incubator
> >>>>  [ ] +0, I don't care either way,
> >>>>  [ ] -1, do not bring IoTDB into Incubator, because...
> >>>>
> >>>> The vote will open at least for 72 hours.
> >>>>
> >>>> Thanks,
> >>>> Xiangdong Huang.
> >>>>
> >>>>
> >>>> = IoTDB Proposal  =
> >>>> v0.1.1
> >>>>
> >>>>
> >>>> == Abstract ==
> >>>> IoTDB is a data store for managing large amounts of time series data
> such as timestamped data from IoT sensors in industrial applications.
> >>>>
> >>>> == Proposal ==
> >>>> IoTDB is a database for managing large amount of time series data
> with columnar storage, data encoding, pre-computation, and index
> techniques. It has SQL-like interface to write millions of data points per
> second per node and is optimized to get query results in few seconds over
> trillions of data points. It can also be easily integrated with Apache
> Hadoop MapReduce and Apache Spark for analytics.
> >>>>
> >>>> == Background ==
> >>>>
> >>>> A new class of data management system requirements is becoming
> increasingly important with the rise of the Internet of Things. There are
> some database systems and technologies aimed at time series data
> management.  For example, Gorilla and InfluxDB which are mainly built for
> data centers and monitoring application metrics. Other systems, for
> example, OpenTSDB and KairosDB, are built on Apache HBase and Apache
> Cassandra, respectively.
> >>>>
> >>>> However, many applications for time series data management have more
> requirements especially in industrial applications as follows:
> >>>>
> >>>> * Supporting time series data which has high data frequency. For
> example, a turbine engine may generate 1000 points per second (i.e.,
> 1000Hz), while each CPU only reports 1 data points per 5 seconds in a data
> center monitoring application.
> >>>>
> >>>> * Supporting scanning data multi-resolutionally. For example,
> aggregation operation is important for time series data.
> >>>>
> >>>> * Supporting special queries for time series, such as pattern
> matching, time series segmentation, time-frequency transformation and
> frequency query.
> >>>>
> >>>> * Supporting a large number of monitoring targets (i.e. time series).
> An excavator may report more than 1000 time series, for example, revolving
> speed of the motor-engine, the speed of the excavator, the accelerated
> speed, the temperature of the water tank and so on, while a CPU or an
> application monitor has much fewer time series.
> >>>>
> >>>> * Optimization for out-of-order data points. In the industrial
> sector, it is common that equipment sends data using the UDP protocol
> rather than the TCP protocol. Sometimes, the network connect is unstable
> and parts of the data will be buffered for later sending.
> >>>>
> >>>> * Supporting long-term storage. Historical data is precious for
> equipment manufacturers. Therefore, removing or unloading historical data
> is highly desired for most industrial applications. The database system
> must not only support fast retrieval of historical data, but also should
> guarantee that the historical data does not impact the processing speed for
> “hot” or current data.
> >>>>
> >>>> * Supporting online transaction processing (OLTP) as well as complex
> analytics. It is obvious that supporting analyzing from the data files
> using Apache Spark/Apache Hadoop MapReduce directly is better than
> transforming data files to another file format for Big Data analytics.
> >>>>
> >>>> * Flexible deployment either on premise or in the cloud.  IoTDB is as
> simple and can be deployed on a Raspberry Pi handling hundreds of time
> series. Meanwhile, the system can be also deployed in the cloud so that it
> supports tens of millions ingestions per second, OLTP queries in
> milliseconds, and analytics using Apache Spark/Apache Hadoop MapReduce.
> >>>>
> >>>> * * (1) If users deploy IoTDB on a device, such as a Raspberry Pi, a
> wind turbine, or a meteorological station, the deployment of the chosen
> database is designed to be simple. A device may have hundreds of time
> series (but less than a thousand time series) and the database needs to
> handle them.
> >>>> * * (2) When deploying IoTDB in a data center, the computational
> resources (i.e., the hardware configuration of servers) is not a problem
> when compared to a Raspberry Pi. In this deployment, IoTDB can use more
> computation resources, and has the ability to handle more time seires
> (e.g., millions of time series).
> >>>>
> >>>> Based on these requirements, we developed IoTDB, a new data store
> system for managing time series data.
> >>>>
> >>>> IoTDB started as a Tsinghua University research project. IoTDB's
> developer community has also grown to include additional institutions, for
> example, universities (e.g., Fudan University), research labs (e.g, NEL-BDS
> lab), and corporations (e.g., K2Data, Tencent). Funding has been provided
> by various institutions including the National Natural Science Foundation
> of China, and industry sponsors, such as Lenovo and K2Data.
> >>>>
> >>>> == Rationale ==
> >>>> Because there is no existed open-sourced time series databases
> covering all the above requirements, we developed IoTDB. As the system
> matures, we are seeking a long-term home for the project. We believe the
> Apache Software Foundation would be an ideal fit. Also joining Apache will
> help coordinate and improve the development effort of the growing number of
> organizations which contribute to IoTDB improving the diversity of our
> community.
> >>>>
> >>>> IoTDB contains multiple modules, which are classified into categories:
> >>>>
> >>>> * '''TsFile Format''': TsFile is a new columnar file format.
> >>>> * '''Adaptor for Analytics and Visualization''': Integrating TsFile
> with Apache Hadoop HDFS, Apache Hadoop MapReduce and Apache Spark. Examples
> of integrating IoTDB with Apache Kafka, Apache Storm and Grafana are also
> provided.
> >>>> * '''IoTDB Engine''': An engine which consists of SQL parser, query
> plan generator, memtable, authentication and authorization,write ahead log
> (WAL), crash recovery, out-of-order data handler, and index for aggregation
> and pattern matching. The engine stores system data in TsFile format.
> >>>> * '''IoTDB JDBC''': An implementation of Java Database Connectivity
> (JDBC) for clients to connect to IoTDB using Java.
> >>>>
> >>>> === TsFile Format ===
> >>>>
> >>>> TsFile format is a columnar store, which is similar with Apache
> Parquet and Apache CarbonData. It has the concepts of Chunk Group, Column
> Chunk, Page and Footer. Comparing with Apache Parquet and Apache
> CarbonData, it is designed and optimized for time series:
> >>>>
> >>>> ==== Time Series Friendly Encoding ====
> >>>> IoTDB currently supports run length encoding (RLE), delta-of-delta
> encoding, and Facebook's Gorilla encoding.
> >>>>
> >>>> Lossy encoding methods (e.g., Piecewise Linear Approximation (PLA)
> and time-frequency transformation are works-in-progress.
> >>>>
> >>>>
> >>>> ==== Chunk Group ====
> >>>> The data part of a TsFile consists of many Chunk Groups. Each Chunk
> Group stores the data of a device at a time interval.  A Chunk Group is
> similar to the row group in Apache Parquet, while there are some
> constraints of the time dimension:  For each device, the time intervals of
> different Chunk Groups are not overlapped and the latter Chunk Group always
> has a larger timestamp.
> >>>>
> >>>> Given a TsFile and a query with a time range filter, the query
> process can terminate scanning data once it reads data points whose
> timestamp reaches the time limit of the filter. We call the feature
> ''fast-return'' and it makes the time range query in a TsFile very
> efficient.
> >>>>
> >>>>
> >>>>
> >>>> ==== Different Column Chunk Format (Unnecessary the Repetition (R)
> and Definition (D) Fields) ====
> >>>>
> >>>> While Apache Parquet and Apache CarbonData support complex data
> types, e.g., nested data and sparse columns, TsFile is exclusively designed
> for time series whose data model is \<device_id, series_id, timestamp,
> value\>.
> >>>>
> >>>> In a `Chunk Group`, each time series is a `Column Chunk`. Even though
> these time series belong to the same device, the data points in different
> time series are not aligned in the time dimension originally.
> >>>>
> >>>> For example, if you have a device with 2 sensors on the same data
> collection frequencies, sensor 1 may collect data at time 1521622662000
> while the other one collects data at time 1521622662001 (delta=1ms).
> Therefore, each Column Chunk has its timestamps and values, which is quite
> different from Apache Parquet and Apache CarbonData.  Because we store the
> time column along with each value column instead of making different chunks
> share the same time column for the sake of diverse data frequency for
> different time series, we do not store any null value on disk to align
> across time series. Besides, we do not need to attach  `repetition` (R) and
> `definition` (D) fields on each value. Therefore, the disk space is saved
> and the query latency is reduced (because we do not align data by
> calculating R and D fields).
> >>>>
> >>>>
> >>>> ==== Domain Specific Information in Each Page ====
> >>>> Similar to Apache Parquet and Apache CarbonData, a `Column Chunk`
> consists of several `Pages`, and each `Page` has a `Page header`. The `Page
> header` is a summary of the data in the page.
> >>>>
> >>>> Because TsFile is optimized for time series, the page header contains
> more domain specific information, such as the minimal and maximal value,
> the minimal and the maximal timestamp, the frequency and so on. TsFile can
> even store the histogram of values in the page header.
> >>>>
> >>>> This header information helps IoTDB in speeding up queries by
> skipping unnecessary pages.
> >>>>
> >>>>
> >>>> === Adaptor for Analytics ===
> >>>> The TsFile provides:
> >>>>
> >>>> * InputFormat/OutputFormat interfaces for Reading/Writing data.
> >>>> * Deep integration with Apache Spark/Hadoop MapReduce including
> predicate push-down, column pruning, aggregation push down, etc. So users
> can use Apache Spark SQL/HiveQL to connect and query TsFiles.
> >>>>
> >>>>
> >>>> === IoTDB Engine ===
> >>>> The IoTDB engine is a database engine, which uses TsFile as its
> storage file format. The IoTDB Engine supports SQL-like query plus many
> useful functions:
> >>>>
> >>>> * Tree-based time series schema
> >>>> * Log-Structured Merge (LSM)-based storage
> >>>> * Overflow file for out-of-order data
> >>>> * Scalable index framework
> >>>> * Special queries for time series
> >>>>
> >>>> ==== Tree-based Time Series Schema ====
> >>>> IoTDB manages all the time series definitions using a tree structure.
> A path from the root of the tree to a leaf node represents a time series.
> Therefore, the unique id of a time series is a path, e.g.,
> `root.China.beijing.windFarm1.windTurbine1.speed`.
> >>>>
> >>>> This kind of schema can express `group by` naturally. For example,
> `root.China.beijing.windFarm1.*.speed` represents the speed of all the wind
> turbines in wind farm 1 in Beijing, China.
> >>>>
> >>>> ==== Log-Structured Merge (LSM)-based Storage ====
> >>>> In a time series, the data points should be ordered by their
> timestamps. In IoTDB, we use Log-Structured Merge (LSM) based mechanism.
> Therefore, a part of the data is stored in memory first and can be called
> as `memtable`. At this time, if data points come out-of-order, we resort
> them in memory. When this part of data exceeds the configured memory limit,
> we flush it on disk as a `Chunk Group` into an unclosed TsFile.  Finally, a
> TsFile may contain several Chunk Groups, for reducing the number of small
> data files, which is helpful to reduce the I/O load of the storage system
> and reduces the execution time of a file-merge in LSM. Notice that the data
> is time-ordered in one Chunk Group on disk, and this layout is helpful for
> fast filtering in one Chunk Group for a query.
> >>>>
> >>>> Rule 1: In a TsFile, the Chunk Groups of one device are ordered by
> timestamp (Rule 1), and it is helpful for fast filtering among Chunk Groups
> for a query.
> >>>>
> >>>> Rule 2: When the size of the unclosed TsFile reaches the threshold
> defined in the configuration file, we close the file and generate a new one
> to store new arriving data spanning the entire data set. Like many systems
> which use LSM-based storage, we never modify a TsFile which has been closed
> except for the file-merge process (Rule 2).
> >>>>
> >>>> Rule 3: To reduce the number of TsFiles involved in a query process,
> we guarantee that the data points in different TsFiles are not overlapping
> on the time dimension after file mergence (Rule 3).
> >>>>
> >>>> ==== Overflow File for Out-of-order Data ====
> >>>> When a part of data is flushed on disk (and will form a `Chunk Group`
> in a TsFile), the newly arriving data points whose timestamps are smaller
> than the largest timestamp in the Tsfile are `out-of-order`.
> >>>>
> >>>> To store the out-of-order data, we organize all the troublesome
> `out-of-order` data point insertions into a special TsFile, named
> `UnSequenceTsFile`. In an UnSequenceTsFile, the Chunk Groups of one device
> may be overlapping in the time dimension, which violates the Rule 1 and
> costs additional time compared to a normal TsFile for query filtering.
> >>>>
> >>>> There is another special operation: updating all the data points in a
> time range, e.g., `update all the speed values of device1 as 0 where the
> data time is in [1521622000000, 1521622662000]`. The operation is called
> when: (1) a sensor malfunctions and the database receives wrong data for a
> period; (2) we may want to reset all the records. Many NoSQL time series
> databases do not support such an operation. To support the operation in
> IoTDB, we use a tree-based structure, Treap, to store this part of
> operations and store them as `Overflow` files.
> >>>>
> >>>> Therefore, there are 3 kinds of data files: TsFiles,
> UnSequenceTsFiles and Overflow files.  TsFiles should store most of the
> data. The volume of UnSequenceTsFiles depends on the workload: if there are
> too many out-of-order and the time span of out-of-order is huge, the volume
> will be large. Overflow files handle fewest data operations but will depend
> on the use of the special operations.
> >>>>
> >>>> ==== LSM-tree ====
> >>>> Normally, LSM-based storage engines merge data files level by level
> so that it looks like a tree structure. In this way, data is well
> organized. The disadvantage is that data will be read and written several
> times. If the tree has 4 levels, each data point will be rewritten at least
> 4 times.
> >>>>
> >>>> Currently, we do not merge all the TsFiles into one because (1) the
> number of TsFiles is kept lower than many LSM storage engines because a
> memtable is mapped to several Chunk Groups rather than a file; (2)
> different TsFiles are not overlapping with each other in the time dimension
> (because of Rule 3).
> >>>>
> >>>> As mentioned before,  TsFile supports ''fast-return'' to accelerate
> queries. However, UnSequenceTsFile and Overflow files do not allow this
> feature. The time spans of UnSequenceTsFile, Overflow file andTsFile may be
> overlapped, which leads to more files involved in the query process. To
> accelerate these queries, there is a merging process to reorganize files in
> the background. All the three kinds of files: TsFiles, UnSequenceTsFiles
> and Overflow files, are involved in the merging process. The merging
> process is implemented using multi-threading, while each thread is
> responsible for a series family.
> >>>> After merging, only TsFiles are left. These files have
> non-overlapping time spans and support the ''fast-return'' feature.
> >>>>
> >>>> ==== Scalable Index Framework ====
> >>>> We allow users to implement indexes for faster queries. We currently
> support an index for pattern matching query (KV-Match index, ICDE 2019).
> Another index for fast aggregation (PISA index, CIKM 2016) is a
> work-in-progress.
> >>>>
> >>>> ==== Special Queries ====
> >>>> We currently support `group by time interval` aggregation queries and
> `Fill by` operations, which are similar to those of InfluxDB. Time series
> segmentation operations and frequency queries are work-in-progress.
> >>>>
> >>>> == Initial Goals ==
> >>>> The initial goals are to be open sourced and to integrate with the
> Apache development process. Furthermore, we plan for incremental
> development, and releases along with the Apache guidelines.
> >>>>
> >>>> == Current Status ==
> >>>> We have developed the system for more than 2 years. There are
> currently 13k lines of code, some of which are generated by Antlr3 and
> Thrift.  There are 230 issues which have been solved and more than 1500
> commits.
> >>>>
> >>>> The system has been deployed in the staging environment of the State
> Grid Corporation of China to handle ~3 million time series (i.e, ~30,000
> power generation assembly * ~100 sensors) and an equipment service company
> in China managing ~2 million time series (i.e, ~20k devices * 100 sensors).
> The insertion speed reaches ~2 million points/second/node, which is faster
> than InfluxDB, OpenTSDB and Apache Cassandra in our environment.
> >>>>
> >>>> There are many new features in the works including those mentioned
> herein. We will add more analytics functions, improve the data file merge
> process, and finish the first released version of IoTDB.
> >>>>
> >>>> == Meritocracy ==
> >>>> The IoTDB project operates on meritocratic principles. Developers who
> submit more code with higher quality earn more merit. We have used `Issues`
> and `Pull Requests` modules on Github for collecting users' suggestions and
> patches. Users who submit issues, pull requests, documents and help the
> community management are welcomed and encouraged to become committers.
> >>>>
> >>>> == Community ==
> >>>>
> >>>> The IoTDB project users communicate on Github (
> >>>> https://github.com/thulab/tsfile) . Developers make the
> communication on a website which is similar with JIRA (Currently, only
> registered users can apply to access the project for communication, url:
> https://tower.im/projects/36de8571a0ff4833ae9d7f1c5c400c22/
> >>>> ). We have also introduced IoTDB at many technical conferences. Next,
> we will build the mailing list for more convenience, broader communication
> and archived discussions.
> >>>>
> >>>> If IoTDB is accepted for incubation at the Apache Software
> Foundation, the primary goal is to build a larger community. We believe
> that IoTDB will become a key project for time series data management, and
> so, we will rely on a large community of users and developers.
> >>>>
> >>>> TODO: IoTDB is currently on a private Github repository (
> >>>> https://github.com/thulab/iotdb), while its subproject TsFile (a
> file format for storing time series data) is open sourced on Github (
> https://github.com/thulab/tsfile
> >>>> ).
> >>>>
> >>>> == Core Developers ==
> >>>> IoTDB was initially developed by 2 dozen of students and teachers at
> Tsinghua University. Now, more and more developers have joined coming from
> other universities: Fudan University, Northwestern Polytechnical University
> and Harbin Institute of Technology in China.  Other developers come from
> business companies such as Lenovo and Microsoft. We will be working to
> bring more and more developers into the project making contributions to
> IoTDB.
> >>>>
> >>>> == Relationships with Other Apache Products ==
> >>>> IoTDB requires some Apache products (Apache Thrift, commons,
> collections, httpclient).
> >>>>
> >>>> IoTDB-Spark-connector and IoTDB-Hadoop-connector have been developed
> for supporting analysing time series data by using Apache Spark and
> MapReduce.
> >>>>
> >>>> Overall, IoTDB is designed as an open architecture, and it can be
> integrated with many other systems in the future.
> >>>>
> >>>> As mentioned before, in the IoTDB project, we designed a new columnar
> file format, called TsFile, which is similar to Apache Parquet. However,
> the new file format is optimized for time series data.
> >>>>
> >>>>
> >>>>
> >>>> == Known Risks ==
> >>>>
> >>>> === Orphaned Products ===
> >>>> Given the current level of investment in IoTDB, the risk of the
> project being abandoned is minimal. Time series data is more and more
> important and there are several constituents who are highly inspired to
> continue development. Tsinghua and NEL-BDS Lab relies on IoTDB as a
> platform for a large number of long-term research projects. We have
> deployed IoTDB in some company's staging environments for future
> applications.
> >>>>
> >>>> === Inexperience with Open Source ===
> >>>> Students and researchers in Tsinghua University have been developing
> and using open source software for a long time. It is wonderful to be
> guided to join a formal open-source process for students. Some of our
> committers
> >>>> have  experiences contributing to open source, for example:
> >>>>
> >>>> * druid:
> >>>>
> https://github.com/druid-io/druid/commit/f18cc5df97e5826c2dd8ffafba9fcb69d10a4d44
> >>>>
> >>>> * druid:
> >>>>
> https://github.com/druid-io/druid/commit/aa7aee53ce524b7887b218333166941654788794
> >>>>
> >>>> * YCSB:
> >>>> https://github.com/brianfrankcooper/YCSB/pull/776
> >>>>
> >>>>
> >>>> Additionally, several ASF veterans and industry veterans have agreed
> to mentor the project and are listed in this proposal. The project will
> rely on their guidance and collective wisdom to quickly transition the
> entire team of initial committers towards practicing the Apache Way.
> >>>>
> >>>>
> >>>> === Reliance on Salaried Developers ===
> >>>> Most of current developers are students and researchers/professors in
> universities, and their researches focus on big data management and
> analytics. It is unlikely that they will change their research focus away
> from big data management.  We will work to ensure that the ability for the
> project to continuously be stewarded and to proceed forward independent of
> salaried developers is continued.
> >>>>
> >>>> === An Excessive Fascination with the Apache Brand ===
> >>>> Most of the initial developers come from Tsinghua University with no
> intent to use the Apache brand for profit. We have no plans for making use
> of Apache brand in press releases nor posting billboards advertising
> acceptance of IoTDB into Apache Incubator.
> >>>>
> >>>>
> >>>> == Initial Source ==
> >>>> IoTDB's github address and some required dependencies:
> >>>>
> >>>> * The storage file format:
> >>>> https://github.com/thulab/tsfile
> >>>>
> >>>> * Adaptor for Apache Hadoop MapReduce:
> >>>> https://github.com/thulab/tsfile-hadoop-connector
> >>>>
> >>>> * Adaptor for Apache Spark:
> >>>> https://github.com/thulab/tsfile-spark-connector
> >>>>
> >>>> * Adaptor for Grafana:
> >>>> https://github.com/thulab/iotdb-grafana
> >>>>
> >>>> * The database engine:
> >>>> https://github.com/thulab/iotdb
> >>>> (private project up to now)
> >>>> * The client driver:
> >>>> https://github.com/thulab/iotdb-jdbc
> >>>>
> >>>>
> >>>>
> >>>> === External Dependencies ===
> >>>> To the best of our knowledge, all dependencies of IoTDB are
> distributed under Apache compatible licenses. Upon acceptance to the
> incubator, we would begin a thorough analysis of all transitive
> dependencies to verify this fact and introduce license checking into the
> build and release process.
> >>>>
> >>>> == Documentation ==
> >>>> * Documentation for TsFile:
> >>>> https://github.com/thulab/tsfile/wiki
> >>>>
> >>>> * Documentation for IoTDB and its JDBC:
> >>>> http://tsfile.org/document
> >>>> (Chinese only. An English version is in progress.)
> >>>>
> >>>> == Required Resources ==
> >>>> === Mailing Lists ===
> >>>> *
> >>>> private@iotdb.incubator.apache.org
> >>>>
> >>>> *
> >>>> dev@iotdb.incubator.apache.org
> >>>>
> >>>> *
> >>>> commits@iotdb.incubator.apache.org
> >>>>
> >>>>
> >>>> === Git Repositories ===
> >>>> *
> >>>> https://git-wip-us.apache.org/repos/asf/incubator-iotdb.git
> >>>>
> >>>>
> >>>> === Issue Tracking ===
> >>>> *  JIRA IoTDB (We currently use the issue management provided by
> Github to track issues.)
> >>>>
> >>>>
> >>>> == Initial Committers ==
> >>>> Tsinghua University, K2Data Company, Lenovo, Microsoft
> >>>>
> >>>> Jianmin Wang (jimwang at tsinghua dot edu dot cn )
> >>>>
> >>>> Xiangdong Huang (sainthxd at gmail dot com)
> >>>>
> >>>> Jun Yuan (richard_yuan16 at 163 dot com)
> >>>>
> >>>> Chen Wang ( wang_chen at tsinghua dot edu dot cn)
> >>>>
> >>>> Jialin Qiao (qjl16 at mails dot tsinghua dot edu dot cn)
> >>>>
> >>>> Jinrui Zhang (jinrzhan at microsoft dot com)
> >>>>
> >>>> Rong Kang (kr11 at mails dot tsinghua dot edu dot cn)
> >>>>
> >>>> Tian Jiang(jiangtia18 at mails dot tsinghua dot edu dot cn)
> >>>>
> >>>> Shuo Zhang (zhangshuo at k2data dot com dot cn)
> >>>>
> >>>> Lei Rui (rl18 at mails dot tsinghua dot edu dot cn)
> >>>>
> >>>> Rui Liu (liur17 at mails dot tsinghua dot edu dot cn)
> >>>>
> >>>> Kun Liu (liukun16 at mails dot tsinghua dot edu dot cn)
> >>>>
> >>>> Gaofei Cao (cgf16 at mails dot tsinghua dot edu dot cn)
> >>>>
> >>>> Xinyi Zhao (xyzhao16 at mails dot tsinghua dot edu dot cn)
> >>>>
> >>>> Dongfang Mao (maodf17 at mails dot tsinghua dot edu dot cn)
> >>>>
> >>>> Tianan Li(lta18 at mails dot tsinghua dot edu dot cn)
> >>>>
> >>>> Yue Su (suy18 at mails dot tsinghua dot edu dot cn)
> >>>>
> >>>> Hui Dai (daihui_iot at lenovo dot com, yuct_iot at lenovo dot com )
> >>>>
> >>>> == Sponsors ==
> >>>> === Champion ===
> >>>> Kevin A. McGrail (
> >>>> kmcgrail@apache.org
> >>>> )
> >>>>
> >>>> === Nominated Mentors ===
> >>>> Justin Mclean (justin at classsoftware dot com)
> >>>>
> >>>> Christofer Dutz (christofer.dutz at c-ware dot de)
> >>>>
> >>>> Willem Jiang (willem.jiang at gmail dot com)
> >>>>
> >>>>

Re: [Result][Vote] vote for IoTDB incubation proposal

Posted by hxd <hx...@qq.com>.
Hi,

Thank you, Justin!

Next, we will finish signing the SGA and then transfer the IoTDB source code as soon as possible.

Best,

Xiangdong Huang


> On Nov 25, 2018, at 8:35 PM, Justin Mclean <ju...@classsoftware.com> wrote:
> 
> Hi,
> 
> And the mailing lists are now set up, please subscribe.
> 
> Justin
> 
> 


Re: [Result][Vote] vote for IoTDB incubation proposal

Posted by Justin Mclean <ju...@classsoftware.com>.
Hi,

And the mailing lists are now set up, please subscribe.

Justin




Re: [Result][Vote] vote for IoTDB incubation proposal

Posted by Justin Mclean <ju...@classsoftware.com>.
Hi,

For the project:
- The DNS and LDAP have been set up
- I’ve just requested the dev@, private@ and commits@ email lists be created.
- Requested git

Thanks,
Justin


Re: [Result][Vote] vote for IoTDB incubation proposal

Posted by Craig Russell <ap...@gmail.com>.
Please note that changing the license headers only applies to code that is intended to be Apache licensed. Dependencies that are imported as source will retain their original licenses and headers.

Craig

> On Nov 19, 2018, at 4:36 PM, Craig Russell <ap...@gmail.com> wrote:
> 
> FTR, this is SOP for incubator podlings.
> 
> Here's what needs to happen, in any order.
> 
> Move the code to an Apache repository
> Establish the provenance of every line of code
> For each line of code, contact the author and get a license for it (one of: SGA, ICLA, CCLA)
> 
> If you cannot find the author for some small pieces of code or the author is unresponsive, you can proceed without the license but make a note of the issue. This applies, for example, to drive-by patches of several (up to dozens of) lines of code or a few paragraphs of documentation.
> 
> Specifically, relicensing and changing headers on code should be done after the code has arrived in Apache. That way, the changes are recorded in Apache source control. Ideally the author should make the header changes.
> 
> As part of due diligence before the first release, all the above should be done. 
> 
> Regards,
> 
> Craig
> 
>> On Nov 15, 2018, at 4:27 AM, hxd <hxdreg@qq.com> wrote:
>> 
>> Currently, there are 6 repositories in total (IoTDB, IoTDB-JDBC, TsFile, Spark-Connector, Hive-Connector, and Grafana-Connector), and we will merge them all into one repository.
>> 
>> Only the first one is private.
>> 
>> Actually, we lack experience with open-sourcing a project.
>> 
>> Should we open all the source now or after all the Apache legal documents are done?
>> 
>> Best,
>> 
>> Xiangdong Huang  
>> 
>>> On Nov 15, 2018, at 5:06 PM, Willem Jiang <willem.jiang@gmail.com> wrote:
>>> 
>>> Here is a question about the source code repository.
>>> 
>>> The main source git repo [1] is still a private repo.  I think we need
>>> to open source the repo before sending the SGA?
>>> 
>>> 
>>> [1] https://github.com/thulab/iotdb
>>> 
>>> Willem Jiang
>>> 
>>> Twitter: willemjiang
>>> Weibo: 姜宁willem
>>> On Thu, Nov 15, 2018 at 4:08 PM hxd <hxdreg@qq.com> wrote:
>>>> 
>>>> Hi,
>>>> 
>>>> During the proposal discussion, we got 3 mentors: Justin Mclean, Christofer Dutz, and Willem Ning Jiang.
>>>> 
>>>> During the vote, we got a new mentor, Joe Witt.
>>>> 
>>>> In total, there are one Champion and four mentors:
>>>> 
>>>> Kevin A. McGrail (the Champion),
>>>> Justin Mclean,
>>>> Christofer Dutz,
>>>> Willem Ning Jiang, and
>>>> Joe Witt
>>>> 
>>>> I have checked their names on http://people.apache.org/committer-index.html, and they are accurate now.
>>>> The list of names on the proposal (https://wiki.apache.org/incubator/IoTDBProposal) is also correct.
>>>> 
>>>> Regards,
>>>> Xiangdong Huang
>>>> 
>>>> 
>>>> 
>>>> On Nov 15, 2018, at 12:51 AM, Kevin A. McGrail <kmcgrail@apache.org> wrote:
>>>> 
>>>> Congratulations!  As champion, I think the next steps are:
>>>> 
>>>> 1 - Xiangdong, Can you confirm the list of mentors on the proposal is accurate?
>>>> 
>>>> 2 - Also Xiangdong, Is there anyone else that stepped forward as a mentor during the voting process that the project wants the IPMC to approve?
>>>> 
>>>> 3 - Justin, I think you have to request the creation of the podling and then I as champion work on things like the meta data file from this page,
>>>> https://incubator.apache.org/policy/incubation.html, correct?
>>>> 
>>>> Regards,
>>>> KAM
>>>> 
>>>> 
>>>> 
>>>> 
>>>> --
>>>> Kevin A. McGrail
>>>> VP Fundraising, Apache Software Foundation
>>>> Chair Emeritus Apache SpamAssassin Project
>>>> https://www.linkedin.com/in/kmcgrail <https://www.linkedin.com/in/kmcgrail> - 703.798.0171
>>>> 
>>>> 
>>>> On Wed, Nov 14, 2018 at 6:29 AM hxd <hxdreg@qq.com <ma...@qq.com>> wrote:
>>>>> 
>>>>> Hi,
>>>>> 
>>>>> With 8 +1 binding votes,  2 +1 non-binding votes and No +/-0 or -1 votes, this VOTE passes.
>>>>> 
>>>>> Thanks to everyone who voted!
>>>>> 
>>>>> Below is the voting tally:
>>>>> 
>>>>> Binding
>>>>> Von Gosling
>>>>> Christofer Dutz
>>>>> Kevin A. McGrail
>>>>> Felix Cheung
>>>>> Matt Sticker
>>>>> Joe Witt
>>>>> Justin Mclean
>>>>> Willem Jiang
>>>>> 
>>>>> 
>>>>> Non-binding
>>>>> Sheng Wu
>>>>> Yang Bo
>>>>> 
>>>>> The vote thread: https://lists.apache.org/thread.html/077f029ab2b52a2b19fc8d41c07438f660a8e93dd87b3895d262263c@%3Cgeneral.incubator.apache.org%3E
>>>>> The proposal: https://wiki.apache.org/incubator/IoTDBProposal
>>>>> 
>>>>> Thanks,
>>>>> 
>>>>> Xiangdong Huang
>>>>> 
>>>>> 
>>>>>> On Nov 7, 2018, at 3:46 PM, hxd <hxdreg@qq.com> wrote:
>>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> Sorry for the previous mail with bad format.
>>>>>> I'd like to call a VOTE to accept the IoTDB project, a database for managing large amounts of time series data from IoT sensors in industrial applications, into the Apache Incubator.
>>>>>> The full proposal is available on the wiki: https://wiki.apache.org/incubator/IoTDBProposal <https://wiki.apache.org/incubator/IoTDBProposal>
>>>>>> and it is also attached below for your convenience.
>>>>>> 
>>>>>> Please cast your vote:
>>>>>> 
>>>>>> [ ] +1, bring IoTDB into Incubator
>>>>>> [ ] +0, I don't care either way,
>>>>>> [ ] -1, do not bring IoTDB into Incubator, because...
>>>>>> 
>>>>>> The vote will be open for at least 72 hours.
>>>>>> 
>>>>>> Thanks,
>>>>>> Xiangdong Huang.
>>>>>> 
>>>>>> 
>>>>>> = IoTDB Proposal  =
>>>>>> v0.1.1
>>>>>> 
>>>>>> 
>>>>>> == Abstract ==
>>>>>> IoTDB is a data store for managing large amounts of time series data such as timestamped data from IoT sensors in industrial applications.
>>>>>> 
>>>>>> == Proposal ==
>>>>>> IoTDB is a database for managing large amounts of time series data with columnar storage, data encoding, pre-computation, and index techniques. It has a SQL-like interface, can write millions of data points per second per node, and is optimized to return query results within a few seconds over trillions of data points. It can also be easily integrated with Apache Hadoop MapReduce and Apache Spark for analytics.
>>>>>> 
>>>>>> == Background ==
>>>>>> 
>>>>>> A new class of data management system requirements is becoming increasingly important with the rise of the Internet of Things. There are some database systems and technologies aimed at time series data management. For example, Gorilla and InfluxDB are mainly built for data centers and for monitoring application metrics. Other systems, for example, OpenTSDB and KairosDB, are built on Apache HBase and Apache Cassandra, respectively.
>>>>>> 
>>>>>> However, many applications for time series data management have additional requirements, especially in industrial settings:
>>>>>> 
>>>>>> * Supporting time series data with a high data frequency. For example, a turbine engine may generate 1000 points per second (i.e., 1000 Hz), while each CPU reports only 1 data point every 5 seconds in a data center monitoring application.
>>>>>> 
>>>>>> * Supporting scanning data at multiple resolutions. For example, aggregation operations are important for time series data.
>>>>>> 
>>>>>> * Supporting special queries for time series, such as pattern matching, time series segmentation, time-frequency transformation and frequency query.
>>>>>> 
>>>>>> * Supporting a large number of monitoring targets (i.e. time series). An excavator may report more than 1000 time series, for example, revolving speed of the motor-engine, the speed of the excavator, the accelerated speed, the temperature of the water tank and so on, while a CPU or an application monitor has much fewer time series.
>>>>>> 
>>>>>> * Optimization for out-of-order data points. In the industrial sector, it is common for equipment to send data using UDP rather than TCP. Sometimes the network connection is unstable and parts of the data are buffered for later sending.
>>>>>> 
>>>>>> * Supporting long-term storage. Historical data is precious for equipment manufacturers. Therefore, removing or unloading historical data is highly undesirable for most industrial applications. The database system must not only support fast retrieval of historical data, but should also guarantee that historical data does not impact the processing speed for “hot” or current data.
>>>>>> 
>>>>>> * Supporting online transaction processing (OLTP) as well as complex analytics. Analyzing the data files directly with Apache Spark/Apache Hadoop MapReduce is clearly better than transforming them into another file format for Big Data analytics.
>>>>>> 
>>>>>> * Flexible deployment either on premise or in the cloud. IoTDB is simple enough to be deployed on a Raspberry Pi handling hundreds of time series. Meanwhile, the system can also be deployed in the cloud so that it supports tens of millions of ingestions per second, OLTP queries in milliseconds, and analytics using Apache Spark/Apache Hadoop MapReduce.
>>>>>> 
>>>>>> * * (1) If users deploy IoTDB on a device, such as a Raspberry Pi, a wind turbine, or a meteorological station, the deployment of the chosen database is designed to be simple. A device may have hundreds of time series (but fewer than a thousand) and the database needs to handle them.
>>>>>> * * (2) When deploying IoTDB in a data center, computational resources (i.e., the hardware configuration of servers) are not a problem compared to a Raspberry Pi. In this deployment, IoTDB can use more computational resources and is able to handle more time series (e.g., millions of time series).
>>>>>> 
>>>>>> Based on these requirements, we developed IoTDB, a new data store system for managing time series data.
>>>>>> 
>>>>>> IoTDB started as a Tsinghua University research project. IoTDB's developer community has since grown to include additional institutions, for example, universities (e.g., Fudan University), research labs (e.g., NEL-BDS Lab), and corporations (e.g., K2Data, Tencent). Funding has been provided by various institutions including the National Natural Science Foundation of China, and industry sponsors, such as Lenovo and K2Data.
>>>>>> 
>>>>>> == Rationale ==
>>>>>> Because no existing open-source time series database covers all of the above requirements, we developed IoTDB. As the system matures, we are seeking a long-term home for the project. We believe the Apache Software Foundation would be an ideal fit. Joining Apache will also help coordinate and improve the development effort of the growing number of organizations that contribute to IoTDB, improving the diversity of our community.
>>>>>> 
>>>>>> IoTDB contains multiple modules, which are classified into categories:
>>>>>> 
>>>>>> * '''TsFile Format''': TsFile is a new columnar file format.
>>>>>> * '''Adaptor for Analytics and Visualization''': Integrating TsFile with Apache Hadoop HDFS, Apache Hadoop MapReduce and Apache Spark. Examples of integrating IoTDB with Apache Kafka, Apache Storm and Grafana are also provided.
>>>>>> * '''IoTDB Engine''': An engine which consists of a SQL parser, query plan generator, memtable, authentication and authorization, write-ahead log (WAL), crash recovery, out-of-order data handler, and indexes for aggregation and pattern matching. The engine stores system data in TsFile format.
>>>>>> * '''IoTDB JDBC''': An implementation of Java Database Connectivity (JDBC) for clients to connect to IoTDB using Java.
>>>>>> 
>>>>>> === TsFile Format ===
>>>>>> 
>>>>>> The TsFile format is a columnar store, similar to Apache Parquet and Apache CarbonData. It has the concepts of Chunk Group, Column Chunk, Page and Footer. Compared with Apache Parquet and Apache CarbonData, it is designed and optimized for time series:
>>>>>> 
>>>>>> ==== Time Series Friendly Encoding ====
>>>>>> IoTDB currently supports run length encoding (RLE), delta-of-delta encoding, and Facebook's Gorilla encoding.
>>>>>> 
>>>>>> Lossy encoding methods (e.g., Piecewise Linear Approximation (PLA) and time-frequency transformation) are works in progress.
>>>>>> 
>>>>>> 
>>>>>> ==== Chunk Group ====
>>>>>> The data part of a TsFile consists of many Chunk Groups. Each Chunk Group stores the data of one device over a time interval. A Chunk Group is similar to a row group in Apache Parquet, but with constraints on the time dimension: for each device, the time intervals of different Chunk Groups do not overlap, and a later Chunk Group always has larger timestamps.
>>>>>> 
>>>>>> Given a TsFile and a query with a time range filter, the query process can terminate scanning data once it reads data points whose timestamp reaches the time limit of the filter. We call the feature ''fast-return'' and it makes the time range query in a TsFile very efficient.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> ==== Different Column Chunk Format (No Repetition (R) and Definition (D) Fields Needed) ====
>>>>>> 
>>>>>> While Apache Parquet and Apache CarbonData support complex data types, e.g., nested data and sparse columns, TsFile is exclusively designed for time series whose data model is <device_id, series_id, timestamp, value>.
>>>>>> 
>>>>>> In a `Chunk Group`, each time series is a `Column Chunk`. Even though these time series belong to the same device, the data points in different time series are not aligned in the time dimension originally.
>>>>>> 
>>>>>> For example, if a device has 2 sensors with the same data collection frequency, sensor 1 may collect data at time 1521622662000 while the other collects data at time 1521622662001 (delta=1ms). Therefore, each Column Chunk has its own timestamps and values, which is quite different from Apache Parquet and Apache CarbonData. Because different time series may have different data frequencies, we store the time column along with each value column instead of making different chunks share the same time column; as a result, we do not store any null values on disk to align time series. In addition, we do not need to attach `repetition` (R) and `definition` (D) fields to each value. Therefore, disk space is saved and query latency is reduced (because we do not align data by calculating R and D fields).
>>>>>> 
>>>>>> 
>>>>>> ==== Domain Specific Information in Each Page ====
>>>>>> Similar to Apache Parquet and Apache CarbonData, a `Column Chunk` consists of several `Pages`, and each `Page` has a `Page header`. The `Page header` is a summary of the data in the page.
>>>>>> 
>>>>>> Because TsFile is optimized for time series, the page header contains more domain specific information, such as the minimal and maximal value, the minimal and the maximal timestamp, the frequency and so on. TsFile can even store the histogram of values in the page header.
>>>>>> 
>>>>>> This header information helps IoTDB in speeding up queries by skipping unnecessary pages.
>>>>>> 
>>>>>> 
>>>>>> === Adaptor for Analytics ===
>>>>>> The TsFile provides:
>>>>>> 
>>>>>> * InputFormat/OutputFormat interfaces for Reading/Writing data.
>>>>>> * Deep integration with Apache Spark/Hadoop MapReduce including predicate push-down, column pruning, aggregation push down, etc. So users can use Apache Spark SQL/HiveQL to connect and query TsFiles.
>>>>>> 
>>>>>> 
>>>>>> === IoTDB Engine ===
>>>>>> The IoTDB engine is a database engine, which uses TsFile as its storage file format. The IoTDB Engine supports SQL-like query plus many useful functions:
>>>>>> 
>>>>>> * Tree-based time series schema
>>>>>> * Log-Structured Merge (LSM)-based storage
>>>>>> * Overflow file for out-of-order data
>>>>>> * Scalable index framework
>>>>>> * Special queries for time series
>>>>>> 
>>>>>> ==== Tree-based Time Series Schema ====
>>>>>> IoTDB manages all the time series definitions using a tree structure. A path from the root of the tree to a leaf node represents a time series. Therefore, the unique id of a time series is a path, e.g., `root.China.beijing.windFarm1.windTurbine1.speed`.
>>>>>> 
>>>>>> This kind of schema can express `group by` naturally. For example, `root.China.beijing.windFarm1.*.speed` represents the speed of all the wind turbines in wind farm 1 in Beijing, China.
>>>>>> 
>>>>>> ==== Log-Structured Merge (LSM)-based Storage ====
>>>>>> In a time series, the data points should be ordered by their timestamps. IoTDB uses a Log-Structured Merge (LSM)-based mechanism: a part of the data is first stored in memory, in what is called a `memtable`. At this stage, if data points arrive out of order, we re-sort them in memory. When this part of the data exceeds the configured memory limit, we flush it to disk as a `Chunk Group` in an unclosed TsFile. As a result, a TsFile may contain several Chunk Groups, which reduces the number of small data files, lowers the I/O load on the storage system, and shortens the execution time of a file merge in LSM. Note that the data is time-ordered within one Chunk Group on disk, and this layout helps fast filtering within a Chunk Group for a query.
>>>>>> 
>>>>>> Rule 1: In a TsFile, the Chunk Groups of one device are ordered by timestamp, which helps fast filtering among Chunk Groups for a query.
>>>>>> 
>>>>>> Rule 2: When the size of the unclosed TsFile reaches the threshold defined in the configuration file, we close the file and generate a new one to store newly arriving data spanning the entire data set. Like many systems which use LSM-based storage, we never modify a TsFile which has been closed, except during the file-merge process.
>>>>>> 
>>>>>> Rule 3: To reduce the number of TsFiles involved in a query, we guarantee that the data points in different TsFiles do not overlap in the time dimension after file merging.
>>>>>> 
>>>>>> ==== Overflow File for Out-of-order Data ====
>>>>>> When a part of the data has been flushed to disk (forming a `Chunk Group` in a TsFile), newly arriving data points whose timestamps are smaller than the largest timestamp in the TsFile are `out-of-order`.
>>>>>> 
>>>>>> To store the out-of-order data, we organize all `out-of-order` data point insertions into a special TsFile, named `UnSequenceTsFile`. In an UnSequenceTsFile, the Chunk Groups of one device may overlap in the time dimension, which violates Rule 1 and costs additional time for query filtering compared to a normal TsFile.
>>>>>> 
>>>>>> There is another special operation: updating all the data points in a time range, e.g., `update all the speed values of device1 as 0 where the data time is in [1521622000000, 1521622662000]`. The operation is used when: (1) a sensor malfunctions and the database receives wrong data for a period; (2) we want to reset all the records. Many NoSQL time series databases do not support such an operation. To support the operation in IoTDB, we use a tree-based structure, Treap, to hold these operations and store them as `Overflow` files.
>>>>>> 
>>>>>> Therefore, there are 3 kinds of data files: TsFiles, UnSequenceTsFiles and Overflow files. TsFiles should store most of the data. The volume of UnSequenceTsFiles depends on the workload: if there are many out-of-order data points spanning a long time range, the volume will be large. Overflow files handle the fewest data operations, and their volume depends on how often the special operations are used.
>>>>>> 
>>>>>> ==== LSM-tree ====
>>>>>> Normally, LSM-based storage engines merge data files level by level so that it looks like a tree structure. In this way, data is well organized. The disadvantage is that data will be read and written several times. If the tree has 4 levels, each data point will be rewritten at least 4 times.
>>>>>> 
>>>>>> Currently, we do not merge all the TsFiles into one because (1) the number of TsFiles is kept lower than many LSM storage engines because a memtable is mapped to several Chunk Groups rather than a file; (2) different TsFiles are not overlapping with each other in the time dimension (because of Rule 3).
>>>>>> 
>>>>>> As mentioned before, TsFile supports ''fast-return'' to accelerate queries. However, UnSequenceTsFiles and Overflow files do not support this feature. The time spans of UnSequenceTsFiles, Overflow files and TsFiles may overlap, which leads to more files being involved in the query process. To accelerate these queries, a background merging process reorganizes the files. All three kinds of files (TsFiles, UnSequenceTsFiles and Overflow files) are involved in the merging process. The merging process is multi-threaded, with each thread responsible for a series family.
>>>>>> After merging, only TsFiles are left. These files have non-overlapping time spans and support the ''fast-return'' feature.
>>>>>> 
>>>>>> ==== Scalable Index Framework ====
>>>>>> We allow users to implement indexes for faster queries. We currently support an index for pattern matching query (KV-Match index, ICDE 2019). Another index for fast aggregation (PISA index, CIKM 2016) is a work-in-progress.
>>>>>> 
>>>>>> ==== Special Queries ====
>>>>>> We currently support `group by time interval` aggregation queries and `Fill by` operations, which are similar to those of InfluxDB. Time series segmentation operations and frequency queries are works in progress.
>>>>>> 
>>>>>> == Initial Goals ==
>>>>>> The initial goals are to open source the project and to integrate with the Apache development process. Furthermore, we plan for incremental development and releases in line with the Apache guidelines.
>>>>>> 
>>>>>> == Current Status ==
>>>>>> We have developed the system for more than 2 years. There are currently 13k lines of code, some of which are generated by Antlr3 and Thrift.  There are 230 issues which have been solved and more than 1500 commits.
>>>>>> 
>>>>>> The system has been deployed in the staging environment of the State Grid Corporation of China to handle ~3 million time series (i.e., ~30,000 power generation assemblies * ~100 sensors) and at an equipment service company in China managing ~2 million time series (i.e., ~20k devices * 100 sensors). The insertion speed reaches ~2 million points/second/node, which is faster than InfluxDB, OpenTSDB and Apache Cassandra in our environment.
>>>>>> 
>>>>>> There are many new features in the works including those mentioned herein. We will add more analytics functions, improve the data file merge process, and finish the first released version of IoTDB.
>>>>>> 
>>>>>> == Meritocracy ==
>>>>>> The IoTDB project operates on meritocratic principles. Developers who submit more code of higher quality earn more merit. We have used the `Issues` and `Pull Requests` modules on GitHub to collect users' suggestions and patches. Users who submit issues, pull requests, or documents, or who help with community management, are welcomed and encouraged to become committers.
>>>>>> 
>>>>>> == Community ==
>>>>>> 
>>>>>> The IoTDB project users communicate on GitHub (https://github.com/thulab/tsfile). Developers communicate on a website similar to JIRA (currently, only registered users can apply to access the project for communication; URL: https://tower.im/projects/36de8571a0ff4833ae9d7f1c5c400c22/). We have also introduced IoTDB at many technical conferences. Next, we will set up mailing lists for more convenient, broader communication and archived discussions.
>>>>>> 
>>>>>> If IoTDB is accepted for incubation at the Apache Software Foundation, the primary goal is to build a larger community. We believe that IoTDB will become a key project for time series data management, and so, we will rely on a large community of users and developers.
>>>>>> 
>>>>>> TODO: IoTDB is currently in a private GitHub repository (https://github.com/thulab/iotdb), while its subproject TsFile (a file format for storing time series data) is open sourced on GitHub (https://github.com/thulab/tsfile).
>>>>>> 
>>>>>> == Core Developers ==
>>>>>> IoTDB was initially developed by two dozen students and teachers at Tsinghua University. Since then, more developers have joined from other universities in China: Fudan University, Northwestern Polytechnical University and Harbin Institute of Technology. Other developers come from companies such as Lenovo and Microsoft. We will work to bring more developers into the project to contribute to IoTDB.
>>>>>> 
>>>>>> == Relationships with Other Apache Products ==
>>>>>> IoTDB requires some Apache products (Apache Thrift, commons, collections, httpclient).
>>>>>> 
>>>>>> IoTDB-Spark-connector and IoTDB-Hadoop-connector have been developed for supporting analysing time series data by using Apache Spark and MapReduce.
>>>>>> 
>>>>>> Overall, IoTDB is designed as an open architecture, and it can be integrated with many other systems in the future.
>>>>>> 
>>>>>> As mentioned before, in the IoTDB project, we designed a new columnar file format, called TsFile, which is similar to Apache Parquet. However, the new file format is optimized for time series data.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> == Known Risks ==
>>>>>> 
>>>>>> === Orphaned Products ===
>>>>>> Given the current level of investment in IoTDB, the risk of the project being abandoned is minimal. Time series data is becoming more and more important, and there are several constituents who are highly motivated to continue development. Tsinghua and NEL-BDS Lab rely on IoTDB as a platform for a large number of long-term research projects. We have deployed IoTDB in some companies' staging environments for future applications.
>>>>>> 
>>>>>> === Inexperience with Open Source ===
>>>>>> Students and researchers at Tsinghua University have been developing and using open source software for a long time. It is wonderful for students to be guided into a formal open-source process. Some of our committers have experience contributing to open source, for example:
>>>>>> 
>>>>>> * druid:
>>>>>> https://github.com/druid-io/druid/commit/f18cc5df97e5826c2dd8ffafba9fcb69d10a4d44 <https://github.com/druid-io/druid/commit/f18cc5df97e5826c2dd8ffafba9fcb69d10a4d44>
>>>>>> 
>>>>>> * druid:
>>>>>> https://github.com/druid-io/druid/commit/aa7aee53ce524b7887b218333166941654788794
>>>>>> 
>>>>>> * YCSB:
>>>>>> https://github.com/brianfrankcooper/YCSB/pull/776
>>>>>> 
>>>>>> 
>>>>>> Additionally, several ASF veterans and industry veterans have agreed to mentor the project and are listed in this proposal. The project will rely on their guidance and collective wisdom to quickly transition the entire team of initial committers towards practicing the Apache Way.
>>>>>> 
>>>>>> 
>>>>>> === Reliance on Salaried Developers ===
>>>>>> Most of the current developers are students and researchers/professors at universities, and their research focuses on big data management and analytics. It is unlikely that they will change their research focus away from big data management. We will work to ensure that the project can continue to be stewarded and to move forward independently of salaried developers.
>>>>>> 
>>>>>> === An Excessive Fascination with the Apache Brand ===
>>>>>> Most of the initial developers come from Tsinghua University and have no intent to use the Apache brand for profit. We have no plans to use the Apache brand in press releases or to post billboards advertising the acceptance of IoTDB into the Apache Incubator.
>>>>>> 
>>>>>> 
>>>>>> == Initial Source ==
>>>>>> IoTDB's GitHub addresses and some required dependencies:
>>>>>> 
>>>>>> * The storage file format:
>>>>>> https://github.com/thulab/tsfile
>>>>>> 
>>>>>> * Adaptor for Apache Hadoop MapReduce:
>>>>>> https://github.com/thulab/tsfile-hadoop-connector
>>>>>> 
>>>>>> * Adaptor for Apache Spark:
>>>>>> https://github.com/thulab/tsfile-spark-connector
>>>>>> 
>>>>>> * Adaptor for Grafana:
>>>>>> https://github.com/thulab/iotdb-grafana
>>>>>> 
>>>>>> * The database engine:
>>>>>> https://github.com/thulab/iotdb
>>>>>> (private project up to now)
>>>>>> * The client driver:
>>>>>> https://github.com/thulab/iotdb-jdbc
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> === External Dependencies ===
>>>>>> To the best of our knowledge, all dependencies of IoTDB are distributed under Apache compatible licenses. Upon acceptance to the incubator, we would begin a thorough analysis of all transitive dependencies to verify this fact and introduce license checking into the build and release process.
>>>>>> 
>>>>>> == Documentation ==
>>>>>> * Documentation for TsFile:
>>>>>> https://github.com/thulab/tsfile/wiki
>>>>>> 
>>>>>> * Documentation for IoTDB and its JDBC:
>>>>>> http://tsfile.org/document
>>>>>> (Chinese only. An English version is in progress.)
>>>>>> 
>>>>>> == Required Resources ==
>>>>>> === Mailing Lists ===
>>>>>> * private@iotdb.incubator.apache.org
>>>>>> 
>>>>>> * dev@iotdb.incubator.apache.org
>>>>>> 
>>>>>> * commits@iotdb.incubator.apache.org
>>>>>> 
>>>>>> 
>>>>>> === Git Repositories ===
>>>>>> * https://git-wip-us.apache.org/repos/asf/incubator-iotdb.git
>>>>>> 
>>>>>> 
>>>>>> === Issue Tracking ===
>>>>>> *  JIRA IoTDB (We currently use the issue management provided by Github to track issues.)
>>>>>> 
>>>>>> 
>>>>>> == Initial Committers ==
>>>>>> Tsinghua University, K2Data Company, Lenovo, Microsoft
>>>>>> 
>>>>>> Jianmin Wang (jimwang at tsinghua dot edu dot cn )
>>>>>> 
>>>>>> Xiangdong Huang (sainthxd at gmail dot com)
>>>>>> 
>>>>>> Jun Yuan (richard_yuan16 at 163 dot com)
>>>>>> 
>>>>>> Chen Wang ( wang_chen at tsinghua dot edu dot cn)
>>>>>> 
>>>>>> Jialin Qiao (qjl16 at mails dot tsinghua dot edu dot cn)
>>>>>> 
>>>>>> Jinrui Zhang (jinrzhan at microsoft dot com)
>>>>>> 
>>>>>> Rong Kang (kr11 at mails dot tsinghua dot edu dot cn)
>>>>>> 
>>>>>> Tian Jiang(jiangtia18 at mails dot tsinghua dot edu dot cn)
>>>>>> 
>>>>>> Shuo Zhang (zhangshuo at k2data dot com dot cn)
>>>>>> 
>>>>>> Lei Rui (rl18 at mails dot tsinghua dot edu dot cn)
>>>>>> 
>>>>>> Rui Liu (liur17 at mails dot tsinghua dot edu dot cn)
>>>>>> 
>>>>>> Kun Liu (liukun16 at mails dot tsinghua dot edu dot cn)
>>>>>> 
>>>>>> Gaofei Cao (cgf16 at mails dot tsinghua dot edu dot cn)
>>>>>> 
>>>>>> Xinyi Zhao (xyzhao16 at mails dot tsinghua dot edu dot cn)
>>>>>> 
>>>>>> Dongfang Mao (maodf17 at mails dot tsinghua dot edu dot cn)
>>>>>> 
>>>>>> Tianan Li(lta18 at mails dot tsinghua dot edu dot cn)
>>>>>> 
>>>>>> Yue Su (suy18 at mails dot tsinghua dot edu dot cn)
>>>>>> 
>>>>>> Hui Dai (daihui_iot at lenovo dot com, yuct_iot at lenovo dot com )
>>>>>> 
>>>>>> == Sponsors ==
>>>>>> === Champion ===
>>>>>> Kevin A. McGrail (kmcgrail@apache.org)
>>>>>> 
>>>>>> === Nominated Mentors ===
>>>>>> Justin Mclean (justin at classsoftware dot com)
>>>>>> 
>>>>>> Christofer Dutz (christofer.dutz at c-ware dot de)
>>>>>> 
>>>>>> Willem Jiang (willem.jiang at gmail dot com)
>>>>>> 
>>>>>> 
>>>> 
>>>> 
>>> 
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org <ma...@incubator.apache.org>
>>> For additional commands, e-mail: general-help@incubator.apache.org <ma...@incubator.apache.org>
>>> 
>> 
>> 
>> 
> 
> Craig L Russell
> Secretary, Apache Software Foundation
> clr@apache.org <ma...@apache.org> http://db.apache.org/jdo <http://db.apache.org/jdo>

Craig L Russell
Secretary, Apache Software Foundation
clr@apache.org <ma...@apache.org> http://db.apache.org/jdo <http://db.apache.org/jdo>

Re: [Result][Vote] vote for IoTDB incubation proposal

Posted by Craig Russell <ap...@gmail.com>.
FTR, this is SOP for incubator podlings.

Here's what needs to happen, in any order.

Move the code to an Apache repository
Establish the provenance of every line of code
For each line of code, contact the author and get a license for it (one of: SGA, ICLA, CCLA)

If you cannot find the author for some small pieces of code or the author is unresponsive, you can skip getting the license but make a note of the issue. This applies, for example, to drive-by patches of several (up to dozens of) lines of code or a few paragraphs of documentation.

Specifically, relicensing and changing headers on code should be done after the code has arrived in Apache. That way, the changes are recorded in Apache source control. Ideally the author should make the header changes.

As part of due diligence before the first release, all the above should be done. 

Regards,

Craig

> On Nov 15, 2018, at 4:27 AM, hxd <hx...@qq.com> wrote:
> 
> Currently, there are 6 repositories in total (IoTDB, IoTDB-JDBC, TsFile, Spark-Connector, Hive-Connector, and Grafana-Connector), and we will merge them all into one repository.
> 
> Only the first one is private. 
> 
> Actually, we lack experience in how to open source a project.
> 
> Should we open all the source now or after all the Apache legal documents are done? 
> 
> Best,
> 
> Xiangdong Huang  
> 
>> On Nov 15, 2018, at 5:06 PM, Willem Jiang <wi...@gmail.com> wrote:
>> 
>> Here is a question for the source code repository
>> 
>> The main source git repo[1] is still a private repo.  I think we need
>> to open source the repo before sending the SGA?
>> 
>> 
>> [1]https://github.com/thulab/iotdb
>> 
>> Willem Jiang
>> 
>> Twitter: willemjiang
>> Weibo: 姜宁willem
>> On Thu, Nov 15, 2018 at 4:08 PM hxd <hx...@qq.com> wrote:
>>> 
>>> Hi,
>>> 
>>> In the proposal discussion process, we got 3 mentors,  Justin Mclean, Christofer Dutz, and Willem Ning Jiang.
>>> 
>>> In the vote process, we got a new mentor, Joe Witt.
>>> 
>>> In total, there is one Champion and four mentors:
>>> 
>>> Kevin A. McGrail (the Champion),
>>> Justin Mclean,
>>> Christofer Dutz,
>>> Willem Ning Jiang, and
>>> Joe Witt
>>> 
>>> I have checked their names on http://people.apache.org/committer-index.html, and they are accurate now.
>>> The list of names on the proposal page (https://wiki.apache.org/incubator/IoTDBProposal) is also correct.
>>> 
>>> Regards,
>>> Xiangdong Huang
>>> 
>>> 
>>> 
>>> On Nov 15, 2018, at 12:51 AM, Kevin A. McGrail <km...@apache.org> wrote:
>>> 
>>> Congratulations!  As champion, I think the next steps are:
>>> 
>>> 1 - Xiangdong, Can you confirm the list of mentors on the proposal is accurate?
>>> 
>>> 2 - Also Xiangdong, Is there anyone else that stepped forward as a mentor during the voting process that the project wants the IPMC to approve?
>>> 
>>> 3 - Justin, I think you have to request the creation of the podling and then I as champion work on things like the meta data file from this page,
>>> https://incubator.apache.org/policy/incubation.html, correct?
>>> 
>>> Regards,
>>> KAM
>>> 
>>> 
>>> 
>>> 
>>> --
>>> Kevin A. McGrail
>>> VP Fundraising, Apache Software Foundation
>>> Chair Emeritus Apache SpamAssassin Project
>>> https://www.linkedin.com/in/kmcgrail - 703.798.0171
>>> 
>>> 
>> 
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
>> For additional commands, e-mail: general-help@incubator.apache.org
>> 
> 

Craig L Russell
Secretary, Apache Software Foundation
clr@apache.org http://db.apache.org/jdo

Re: [Result][Vote] vote for IoTDB incubation proposal

Posted by hxd <hx...@qq.com>.
Currently, there are 6 repositories in total (IoTDB, IoTDB-JDBC, TsFile, Spark-Connector, Hive-Connector, and Grafana-Connector), and we will merge them all into one repository.

Only the first one is private.

Actually, we lack experience with how to open source a project.

Should we open all the source now or after all the Apache legal documents are done? 

Best,

Xiangdong Huang  

> On Nov 15, 2018, at 5:06 PM, Willem Jiang <wi...@gmail.com> wrote:
> 
> Here is a question about the source code repository.
> 
> The main source git repo[1] is still a private repo.  I think we need
> to open source the repo before sending the SGA?
> 
> 
> [1]https://github.com/thulab/iotdb
> 
> Willem Jiang
> 
> Twitter: willemjiang
> Weibo: 姜宁willem
> On Thu, Nov 15, 2018 at 4:08 PM hxd <hx...@qq.com> wrote:
>> 
>> Hi,
>> 
>> During the proposal discussion, we got 3 mentors: Justin Mclean, Christofer Dutz, and Willem Ning Jiang.
>> 
>> During the vote, we got a new mentor, Joe Witt.
>> 
>> In total, there are one Champion and four mentors:
>> 
>> Kevin A. McGrail (the Champion),
>> Justin Mclean,
>> Christofer Dutz,
>> Willem Ning Jiang, and
>> Joe Witt
>> 
>> I have checked their names against http://people.apache.org/committer-index.html, and they are accurate now.
>> The list of names on the proposal page (https://wiki.apache.org/incubator/IoTDBProposal) is also correct.
>> 
>> Regards,
>> Xiangdong Huang
>> 
>> 
>> 
>> On Nov 15, 2018, at 12:51 AM, Kevin A. McGrail <km...@apache.org> wrote:
>> 
>> Congratulations!  As champion, I think the next steps are:
>> 
>> 1 - Xiangdong, Can you confirm the list of mentors on the proposal is accurate?
>> 
>> 2 - Also Xiangdong, Is there anyone else that stepped forward as a mentor during the voting process that the project wants the IPMC to approve?
>> 
>> 3 - Justin, I think you have to request the creation of the podling and then I as champion work on things like the meta data file from this page,
>> https://incubator.apache.org/policy/incubation.html, correct?
>> 
>> Regards,
>> KAM
>> 
>> 
>> 
>> 
>> --
>> Kevin A. McGrail
>> VP Fundraising, Apache Software Foundation
>> Chair Emeritus Apache SpamAssassin Project
>> https://www.linkedin.com/in/kmcgrail - 703.798.0171
>> 
>> 
>> On Wed, Nov 14, 2018 at 6:29 AM hxd <hx...@qq.com> wrote:
>>> 
>>> Hi,
>>> 
>>> With 8 +1 binding votes, 2 +1 non-binding votes, and no +/-0 or -1 votes, this VOTE passes.
>>> 
>>> Thanks to everyone who voted!
>>> 
>>> Below is the voting tally:
>>> 
>>> Binding
>>> Von Gosling
>>> Christofer Dutz
>>> Kevin A. McGrail
>>> Felix Cheung
>>> Matt Sticker
>>> Joe Witt
>>> Justin Mclean
>>> Willem Jiang
>>> 
>>> 
>>> Non-binding
>>> Sheng Wu
>>> Yang Bo
>>> 
>>> The vote thread: https://lists.apache.org/thread.html/077f029ab2b52a2b19fc8d41c07438f660a8e93dd87b3895d262263c@%3Cgeneral.incubator.apache.org%3E
>>> The proposal: https://wiki.apache.org/incubator/IoTDBProposal
>>> 
>>> Thanks,
>>> 
>>> Xiangdong Huang
>>> 
>>> 
>>>> On Nov 7, 2018, at 3:46 PM, hxd <hx...@qq.com> wrote:
>>>> 
>>>> Hi,
>>>> 
>>>> Sorry for the badly formatted previous mail.
>>>> I'd like to call a VOTE to accept the IoTDB project, a database for managing large amounts of time series data from IoT sensors in industrial applications, into the Apache Incubator.
>>>> The full proposal is available on the wiki: https://wiki.apache.org/incubator/IoTDBProposal
>>>> and it is also attached below for your convenience.
>>>> 
>>>> Please cast your vote:
>>>> 
>>>>  [ ] +1, bring IoTDB into Incubator
>>>>  [ ] +0, I don't care either way,
>>>>  [ ] -1, do not bring IoTDB into Incubator, because...
>>>> 
>>>> The vote will be open for at least 72 hours.
>>>> 
>>>> Thanks,
>>>> Xiangdong Huang.
>>>> 
>>>> 
>>>> = IoTDB Proposal  =
>>>> v0.1.1
>>>> 
>>>> 
>>>> == Abstract ==
>>>> IoTDB is a data store for managing large amounts of time series data such as timestamped data from IoT sensors in industrial applications.
>>>> 
>>>> == Proposal ==
>>>> IoTDB is a database for managing large amounts of time series data with columnar storage, data encoding, pre-computation, and index techniques. It has a SQL-like interface, can write millions of data points per second per node, and is optimized to return query results within a few seconds over trillions of data points. It can also be easily integrated with Apache Hadoop MapReduce and Apache Spark for analytics.
>>>> 
>>>> == Background ==
>>>> 
>>>> A new class of data management system requirements is becoming increasingly important with the rise of the Internet of Things. There are some database systems and technologies aimed at time series data management. For example, Gorilla and InfluxDB are mainly built for data centers and monitoring application metrics. Other systems, for example, OpenTSDB and KairosDB, are built on Apache HBase and Apache Cassandra, respectively.
>>>> 
>>>> However, many applications for time series data management have additional requirements, especially in industrial applications, as follows:
>>>> 
>>>> * Supporting time series data with a high data frequency. For example, a turbine engine may generate 1000 points per second (i.e., 1000Hz), while each CPU reports only 1 data point every 5 seconds in a data center monitoring application.
>>>> 
>>>> * Supporting scanning data at multiple resolutions. For example, aggregation operations are important for time series data.
>>>> 
>>>> * Supporting special queries for time series, such as pattern matching, time series segmentation, time-frequency transformation and frequency query.
>>>> 
>>>> * Supporting a large number of monitoring targets (i.e. time series). An excavator may report more than 1000 time series, for example, the revolving speed of the motor-engine, the speed of the excavator, the acceleration, the temperature of the water tank and so on, while a CPU or an application monitor has far fewer time series.
>>>> 
>>>> * Optimization for out-of-order data points. In the industrial sector, it is common that equipment sends data using the UDP protocol rather than the TCP protocol. Sometimes, the network connection is unstable and parts of the data will be buffered for later sending.
>>>> 
>>>> * Supporting long-term storage. Historical data is precious for equipment manufacturers. Therefore, removing or unloading historical data is highly undesirable for most industrial applications. The database system must not only support fast retrieval of historical data, but should also guarantee that the historical data does not impact the processing speed for “hot” or current data.
>>>> 
>>>> * Supporting online transaction processing (OLTP) as well as complex analytics. Analyzing the data files directly with Apache Spark/Apache Hadoop MapReduce is clearly better than transforming the data files to another file format for Big Data analytics.
>>>> 
>>>> * Flexible deployment either on premise or in the cloud. IoTDB is simple enough to be deployed on a Raspberry Pi handling hundreds of time series. Meanwhile, the system can also be deployed in the cloud so that it supports tens of millions of ingestions per second, OLTP queries in milliseconds, and analytics using Apache Spark/Apache Hadoop MapReduce.
>>>> 
>>>> * * (1) If users deploy IoTDB on a device, such as a Raspberry Pi, a wind turbine, or a meteorological station, the deployment is designed to be simple. A device may have hundreds of time series (but fewer than a thousand) and the database needs to handle them.
>>>> * * (2) When deploying IoTDB in a data center, the computational resources (i.e., the hardware configuration of servers) are not a constraint compared to a Raspberry Pi. In this deployment, IoTDB can use more computational resources and can handle more time series (e.g., millions of time series).
>>>> 
>>>> Based on these requirements, we developed IoTDB, a new data store system for managing time series data.
>>>> 
>>>> IoTDB started as a Tsinghua University research project. IoTDB's developer community has also grown to include additional institutions, for example, universities (e.g., Fudan University), research labs (e.g., NEL-BDS lab), and corporations (e.g., K2Data, Tencent). Funding has been provided by various institutions including the National Natural Science Foundation of China, and industry sponsors, such as Lenovo and K2Data.
>>>> 
>>>> == Rationale ==
>>>> Because no existing open-source time series database covers all the above requirements, we developed IoTDB. As the system matures, we are seeking a long-term home for the project. We believe the Apache Software Foundation would be an ideal fit. Joining Apache will also help coordinate and improve the development effort of the growing number of organizations which contribute to IoTDB, improving the diversity of our community.
>>>> 
>>>> IoTDB contains multiple modules, which are classified into categories:
>>>> 
>>>> * '''TsFile Format''': TsFile is a new columnar file format.
>>>> * '''Adaptor for Analytics and Visualization''': Integrating TsFile with Apache Hadoop HDFS, Apache Hadoop MapReduce and Apache Spark. Examples of integrating IoTDB with Apache Kafka, Apache Storm and Grafana are also provided.
>>>> * '''IoTDB Engine''': An engine which consists of a SQL parser, query plan generator, memtable, authentication and authorization, write-ahead log (WAL), crash recovery, out-of-order data handler, and indexes for aggregation and pattern matching. The engine stores system data in TsFile format.
>>>> * '''IoTDB JDBC''': An implementation of Java Database Connectivity (JDBC) for clients to connect to IoTDB using Java.
>>>> 
>>>> === TsFile Format ===
>>>> 
>>>> TsFile format is a columnar store, which is similar to Apache Parquet and Apache CarbonData. It has the concepts of Chunk Group, Column Chunk, Page and Footer. Compared with Apache Parquet and Apache CarbonData, it is designed and optimized for time series:
>>>> 
>>>> ==== Time Series Friendly Encoding ====
>>>> IoTDB currently supports run length encoding (RLE), delta-of-delta encoding, and Facebook's Gorilla encoding.
>>>> 
>>>> Lossy encoding methods (e.g., Piecewise Linear Approximation (PLA) and time-frequency transformation) are works in progress.
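
To make the delta-of-delta encoding mentioned above concrete, here is a minimal Python sketch of the arithmetic only. It is not IoTDB's encoder: the function names are hypothetical, and a real encoder would additionally bit-pack the small residuals instead of keeping them as Python integers.

```python
from typing import List


def encode_delta_of_delta(timestamps: List[int]) -> List[int]:
    """Keep the first timestamp, then store how much each delta changed."""
    if not timestamps:
        return []
    encoded = [timestamps[0]]
    previous = timestamps[0]
    previous_delta = 0
    for ts in timestamps[1:]:
        delta = ts - previous
        encoded.append(delta - previous_delta)  # small when sampling is regular
        previous, previous_delta = ts, delta
    return encoded


def decode_delta_of_delta(encoded: List[int]) -> List[int]:
    """Invert encode_delta_of_delta by accumulating the deltas again."""
    if not encoded:
        return []
    timestamps = [encoded[0]]
    delta = 0
    for delta_of_delta in encoded[1:]:
        delta += delta_of_delta
        timestamps.append(timestamps[-1] + delta)
    return timestamps
```

For a sensor sampled at a nearly fixed interval, most encoded values after the second are 0 or close to it, which is what makes the scheme attractive for time series.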
>>>> 
>>>> 
>>>> ==== Chunk Group ====
>>>> The data part of a TsFile consists of many Chunk Groups. Each Chunk Group stores the data of a device over a time interval. A Chunk Group is similar to the row group in Apache Parquet, but with some constraints on the time dimension: for each device, the time intervals of different Chunk Groups do not overlap, and a later Chunk Group always has larger timestamps.
>>>> 
>>>> Given a TsFile and a query with a time range filter, the query process can terminate scanning data once it reads data points whose timestamp reaches the time limit of the filter. We call the feature ''fast-return'' and it makes the time range query in a TsFile very efficient.
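
The ''fast-return'' behaviour described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `ChunkGroup`, `range_query` and `make_group` are hypothetical names, and the groups are plain in-memory lists rather than the real on-disk layout.

```python
from typing import List, NamedTuple, Tuple

Point = Tuple[int, float]  # (timestamp, value)


class ChunkGroup(NamedTuple):
    """Hypothetical stand-in for one device's Chunk Group."""
    start: int           # smallest timestamp in the group
    end: int             # largest timestamp in the group
    points: List[Point]  # sorted by timestamp


def range_query(groups: List[ChunkGroup], t_min: int, t_max: int):
    """Return (points in [t_min, t_max], number of groups inspected).

    Assumes `groups` is sorted by time and the groups do not overlap.
    """
    result: List[Point] = []
    inspected = 0
    for group in groups:
        if group.start > t_max:
            break  # fast-return: every later group is newer still
        inspected += 1
        if group.end < t_min:
            continue  # entirely before the range, skip its points
        result.extend(p for p in group.points if t_min <= p[0] <= t_max)
    return result, inspected


def make_group(start: int, end: int) -> ChunkGroup:
    """Build a toy group with one point per integer timestamp."""
    return ChunkGroup(start, end, [(t, float(t)) for t in range(start, end + 1)])
```

With three groups covering 0-9, 10-19 and 20-29, a query for the range 5 to 12 inspects only the first two groups and never touches the third.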
>>>> 
>>>> 
>>>> 
>>>> ==== Different Column Chunk Format (No Need for Repetition (R) and Definition (D) Fields) ====
>>>> 
>>>> While Apache Parquet and Apache CarbonData support complex data types, e.g., nested data and sparse columns, TsFile is exclusively designed for time series whose data model is <device_id, series_id, timestamp, value>.
>>>> 
>>>> In a `Chunk Group`, each time series is a `Column Chunk`. Even though these time series belong to the same device, the data points in different time series are not aligned in the time dimension originally.
>>>> 
>>>> For example, if you have a device with 2 sensors with the same data collection frequency, sensor 1 may collect data at time 1521622662000 while the other one collects data at time 1521622662001 (delta=1ms). Therefore, each Column Chunk has its own timestamps and values, which is quite different from Apache Parquet and Apache CarbonData. Because different time series may have different data frequencies, we store the time column along with each value column instead of making different chunks share the same time column, so we do not store any null values on disk to align across time series. Besides, we do not need to attach `repetition` (R) and `definition` (D) fields to each value. Therefore, disk space is saved and query latency is reduced (because we do not align data by calculating R and D fields).
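
The two-sensor example above can be worked through in Python. The sketch compares a table that shares one time column (and therefore needs null padding) with per-series chunks that each carry their own timestamps. The function names and the sample values 3.1 and 7.4 are made up for illustration; only the two timestamps come from the proposal.

```python
from typing import Dict, List, Tuple

Series = Dict[str, List[Tuple[int, float]]]  # name -> [(timestamp, value)]


def aligned_rows(series: Series) -> List[tuple]:
    """Rows of a table with ONE shared time column; gaps become None."""
    all_times = sorted({ts for pts in series.values() for ts, _ in pts})
    lookup = {name: dict(pts) for name, pts in series.items()}
    names = sorted(series)
    return [(ts, *[lookup[n].get(ts) for n in names]) for ts in all_times]


def count_nulls(rows: List[tuple]) -> int:
    """Number of padding cells the shared-time layout has to store."""
    return sum(cell is None for row in rows for cell in row[1:])


def per_series_chunks(series: Series) -> Dict[str, Tuple[List[int], List[float]]]:
    """Each series keeps its own time column, so no padding is needed."""
    return {
        name: ([ts for ts, _ in pts], [v for _, v in pts])
        for name, pts in series.items()
    }


device = {
    "sensor1": [(1521622662000, 3.1)],
    "sensor2": [(1521622662001, 7.4)],
}
```

Here the shared-time table needs two rows and two null cells for two data points, while the per-series layout stores exactly two timestamps and two values.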
>>>> 
>>>> 
>>>> ==== Domain Specific Information in Each Page ====
>>>> Similar to Apache Parquet and Apache CarbonData, a `Column Chunk` consists of several `Pages`, and each `Page` has a `Page header`. The `Page header` is a summary of the data in the page.
>>>> 
>>>> Because TsFile is optimized for time series, the page header contains more domain specific information, such as the minimal and maximal value, the minimal and the maximal timestamp, the frequency and so on. TsFile can even store the histogram of values in the page header.
>>>> 
>>>> This header information helps IoTDB in speeding up queries by skipping unnecessary pages.
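
As a rough illustration of how page header statistics allow pages to be skipped, the Python sketch below keeps only the minimum and maximum value per page. The `Page` class and `values_above` function are hypothetical and do not reflect the actual TsFile header layout.

```python
from typing import List, Tuple

Point = Tuple[int, float]  # (timestamp, value)


class Page:
    """Hypothetical page: raw points plus a tiny header summary."""

    def __init__(self, points: List[Point]):
        self.points = points
        values = [v for _, v in points]
        # Header statistics, computed once when the page is written.
        self.min_value = min(values)
        self.max_value = max(values)


def values_above(pages: List[Page], threshold: float):
    """Return (points with value > threshold, number of pages actually read)."""
    matches: List[Point] = []
    pages_read = 0
    for page in pages:
        if page.max_value <= threshold:
            continue  # the header proves no point can match, skip the body
        pages_read += 1
        matches.extend(p for p in page.points if p[1] > threshold)
    return matches, pages_read
```

The same pattern applies to a time filter using the minimum and maximum timestamp in the header.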
>>>> 
>>>> 
>>>> === Adaptor for Analytics ===
>>>> The TsFile provides:
>>>> 
>>>> * InputFormat/OutputFormat interfaces for Reading/Writing data.
>>>> * Deep integration with Apache Spark/Hadoop MapReduce including predicate push-down, column pruning, aggregation push down, etc. So users can use Apache Spark SQL/HiveQL to connect and query TsFiles.
>>>> 
>>>> 
>>>> === IoTDB Engine ===
>>>> The IoTDB engine is a database engine, which uses TsFile as its storage file format. The IoTDB Engine supports SQL-like query plus many useful functions:
>>>> 
>>>> * Tree-based time series schema
>>>> * Log-Structured Merge (LSM)-based storage
>>>> * Overflow file for out-of-order data
>>>> * Scalable index framework
>>>> * Special queries for time series
>>>> 
>>>> ==== Tree-based Time Series Schema ====
>>>> IoTDB manages all the time series definitions using a tree structure. A path from the root of the tree to a leaf node represents a time series. Therefore, the unique id of a time series is a path, e.g., `root.China.beijing.windFarm1.windTurbine1.speed`.
>>>> 
>>>> This kind of schema can express `group by` naturally. For example, `root.China.beijing.windFarm1.*.speed` represents the speed of all the wind turbines in wind farm 1 in Beijing, China.
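
The path-with-wildcard idea above can be sketched as a simple matcher in Python. This assumes that `*` stands for exactly one level of the tree, which may differ from IoTDB's real wildcard rules; the function names and every path other than the wind farm example from the proposal are hypothetical.

```python
from typing import List


def matches(pattern: str, path: str) -> bool:
    """True if `path` fits `pattern`, where `*` matches any single level."""
    pattern_levels = pattern.split(".")
    path_levels = path.split(".")
    if len(pattern_levels) != len(path_levels):
        return False
    return all(p == "*" or p == s for p, s in zip(pattern_levels, path_levels))


def select(pattern: str, all_paths: List[str]) -> List[str]:
    """All time series paths selected by the pattern, in input order."""
    return [path for path in all_paths if matches(pattern, path)]


series = [
    "root.China.beijing.windFarm1.windTurbine1.speed",
    "root.China.beijing.windFarm1.windTurbine2.speed",
    "root.China.beijing.windFarm1.windTurbine1.temperature",
    "root.China.shanghai.windFarm1.windTurbine1.speed",
]
selected = select("root.China.beijing.windFarm1.*.speed", series)
```

In this toy schema the pattern selects the speed series of both turbines in wind farm 1 in Beijing and nothing else.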
>>>> 
>>>> ==== Log-Structured Merge (LSM)-based Storage ====
>>>> In a time series, the data points should be ordered by their timestamps. In IoTDB, we use a Log-Structured Merge (LSM) based mechanism. A part of the data is first stored in memory, in what is called the `memtable`. At this stage, if data points arrive out of order, we re-sort them in memory. When this part of the data exceeds the configured memory limit, we flush it to disk as a `Chunk Group` in an unclosed TsFile. As a result, a TsFile may contain several Chunk Groups, which reduces the number of small data files; this helps reduce the I/O load of the storage system and the execution time of a file-merge in LSM. Notice that the data is time-ordered within one Chunk Group on disk, and this layout is helpful for fast filtering in one Chunk Group for a query.
>>>> 
>>>> Rule 1: In a TsFile, the Chunk Groups of one device are ordered by timestamp (Rule 1), which helps fast filtering among Chunk Groups for a query.
>>>> 
>>>> Rule 2: When the size of the unclosed TsFile reaches the threshold defined in the configuration file, we close the file and generate a new one to store newly arriving data spanning the entire data set. Like many systems which use LSM-based storage, we never modify a TsFile which has been closed except for the file-merge process (Rule 2).
>>>> 
>>>> Rule 3: To reduce the number of TsFiles involved in a query process, we guarantee that the data points in different TsFiles do not overlap on the time dimension after file merging (Rule 3).
>>>> 
>>>> ==== Overflow File for Out-of-order Data ====
>>>> When a part of the data is flushed to disk (and will form a `Chunk Group` in a TsFile), the newly arriving data points whose timestamps are smaller than the largest timestamp in the TsFile are `out-of-order`.
>>>> 
>>>> To store the out-of-order data, we organize all the troublesome `out-of-order` data point insertions into a special TsFile, named `UnSequenceTsFile`. In an UnSequenceTsFile, the Chunk Groups of one device may overlap in the time dimension, which violates Rule 1 and costs additional time compared to a normal TsFile for query filtering.
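
The write path described in the last two sections (buffer in a memtable, re-sort, flush as a Chunk Group, and divert late points to an unsequence file) can be sketched as follows. This is a toy model under stated assumptions, not IoTDB's implementation: `SketchWriter` is a hypothetical name, the memtable limit is a point count instead of a memory size, and all "files" are plain Python lists.

```python
from typing import List, Optional, Tuple

Point = Tuple[int, float]  # (timestamp, value)


class SketchWriter:
    """Toy single-device writer illustrating the memtable and out-of-order rule."""

    def __init__(self, memtable_limit: int):
        self.memtable_limit = memtable_limit       # assumption: counted in points
        self.memtable: List[Point] = []            # in-memory buffer
        self.chunk_groups: List[List[Point]] = []  # flushed, time-ordered groups
        self.unsequence: List[Point] = []          # stand-in for UnSequenceTsFile
        self.max_flushed: Optional[int] = None     # largest timestamp on "disk"

    def write(self, ts: int, value: float) -> None:
        # Out-of-order: older than the largest timestamp already flushed.
        if self.max_flushed is not None and ts < self.max_flushed:
            self.unsequence.append((ts, value))
            return
        self.memtable.append((ts, value))
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self) -> None:
        if not self.memtable:
            return
        # Disorder inside the memtable is repaired by re-sorting in memory.
        self.memtable.sort(key=lambda point: point[0])
        self.chunk_groups.append(self.memtable)
        self.max_flushed = self.memtable[-1][0]
        self.memtable = []
```

Writing timestamps 3, 1, 2 with a limit of 3 produces one sorted Chunk Group; a later point with timestamp 5 stays in the memtable, while a point with timestamp 2 is routed to the unsequence list.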
>>>> 
>>>> There is another special operation: updating all the data points in a time range, e.g., `update all the speed values of device1 as 0 where the data time is in [1521622000000, 1521622662000]`. The operation is called when: (1) a sensor malfunctions and the database receives wrong data for a period; (2) we may want to reset all the records. Many NoSQL time series databases do not support such an operation. To support the operation in IoTDB, we use a tree-based structure, Treap, to store this part of operations and store them as `Overflow` files.
>>>> 
>>>> Therefore, there are 3 kinds of data files: TsFiles, UnSequenceTsFiles and Overflow files. TsFiles should store most of the data. The volume of UnSequenceTsFiles depends on the workload: if there are many out-of-order data points and they span a long time range, the volume will be large. Overflow files handle the fewest data operations, but their volume depends on how often the special operations are used.
>>>> 
>>>> ==== LSM-tree ====
>>>> Normally, LSM-based storage engines merge data files level by level so that it looks like a tree structure. In this way, data is well organized. The disadvantage is that data will be read and written several times. If the tree has 4 levels, each data point will be rewritten at least 4 times.
>>>> 
>>>> Currently, we do not merge all the TsFiles into one because (1) the number of TsFiles is kept lower than many LSM storage engines because a memtable is mapped to several Chunk Groups rather than a file; (2) different TsFiles are not overlapping with each other in the time dimension (because of Rule 3).
>>>> 
>>>> As mentioned before, TsFile supports ''fast-return'' to accelerate queries. However, UnSequenceTsFiles and Overflow files do not support this feature. The time spans of UnSequenceTsFiles, Overflow files and TsFiles may overlap, which leads to more files being involved in the query process. To accelerate these queries, there is a merging process that reorganizes files in the background. All three kinds of files (TsFiles, UnSequenceTsFiles and Overflow files) are involved in the merging process. The merging process is multi-threaded, with each thread responsible for a series family.
>>>> After merging, only TsFiles are left. These files have non-overlapping time spans and support the ''fast-return'' feature.
>>>> 
>>>> ==== Scalable Index Framework ====
>>>> We allow users to implement indexes for faster queries. We currently support an index for pattern matching query (KV-Match index, ICDE 2019). Another index for fast aggregation (PISA index, CIKM 2016) is a work-in-progress.
>>>> 
>>>> ==== Special Queries ====
>>>> We currently support `group by time interval` aggregation queries and `Fill by` operations, which are similar to those of InfluxDB. Time series segmentation operations and frequency queries are work-in-progress.
>>>> 
>>>> == Initial Goals ==
>>>> The initial goals are to be open sourced and to integrate with the Apache development process. Furthermore, we plan for incremental development and releases in line with the Apache guidelines.
>>>> 
>>>> == Current Status ==
>>>> We have developed the system for more than 2 years. There are currently 13k lines of code, some of which are generated by Antlr3 and Thrift.  There are 230 issues which have been solved and more than 1500 commits.
>>>> 
>>>> The system has been deployed in the staging environment of the State Grid Corporation of China to handle ~3 million time series (i.e., ~30,000 power generation assemblies * ~100 sensors) and at an equipment service company in China managing ~2 million time series (i.e., ~20k devices * 100 sensors). The insertion speed reaches ~2 million points/second/node, which is faster than InfluxDB, OpenTSDB and Apache Cassandra in our environment.
>>>> 
>>>> There are many new features in the works including those mentioned herein. We will add more analytics functions, improve the data file merge process, and finish the first released version of IoTDB.
>>>> 
>>>> == Meritocracy ==
>>>> The IoTDB project operates on meritocratic principles. Developers who submit more code with higher quality earn more merit. We have used the `Issues` and `Pull Requests` modules on Github to collect users' suggestions and patches. Users who submit issues, pull requests, or documents, or who help with community management, are welcomed and encouraged to become committers.
>>>> 
>>>> == Community ==
>>>> 
>>>> The IoTDB project users communicate on Github (https://github.com/thulab/tsfile). Developers communicate on a website similar to JIRA (currently, only registered users can apply to access the project for communication; URL: https://tower.im/projects/36de8571a0ff4833ae9d7f1c5c400c22/). We have also introduced IoTDB at many technical conferences. Next, we will set up a mailing list for more convenient, broader communication and archived discussions.
>>>> 
>>>> If IoTDB is accepted for incubation at the Apache Software Foundation, the primary goal is to build a larger community. We believe that IoTDB will become a key project for time series data management, and so, we will rely on a large community of users and developers.
>>>> 
>>>> TODO: IoTDB is currently on a private Github repository (https://github.com/thulab/iotdb), while its subproject TsFile (a file format for storing time series data) is open sourced on Github (https://github.com/thulab/tsfile).
>>>> 
>>>> == Core Developers ==
>>>> IoTDB was initially developed by about two dozen students and teachers at Tsinghua University. Since then, more developers have joined from other universities in China: Fudan University, Northwestern Polytechnical University and Harbin Institute of Technology. Other developers come from companies such as Lenovo and Microsoft. We will work to bring more developers into the project to contribute to IoTDB.
>>>> 
>>>> == Relationships with Other Apache Products ==
>>>> IoTDB requires some Apache products (Apache Thrift, commons, collections, httpclient).
>>>> 
>>>> IoTDB-Spark-connector and IoTDB-Hadoop-connector have been developed for supporting analysing time series data by using Apache Spark and MapReduce.
>>>> 
>>>> Overall, IoTDB is designed as an open architecture, and it can be integrated with many other systems in the future.
>>>> 
>>>> As mentioned before, in the IoTDB project, we designed a new columnar file format, called TsFile, which is similar to Apache Parquet. However, the new file format is optimized for time series data.
>>>> 
>>>> 
>>>> 
>>>> == Known Risks ==
>>>> 
>>>> === Orphaned Products ===
>>>> Given the current level of investment in IoTDB, the risk of the project being abandoned is minimal. Time series data is more and more important, and there are several constituents who are highly motivated to continue development. Tsinghua and NEL-BDS Lab rely on IoTDB as a platform for a large number of long-term research projects. We have deployed IoTDB in some companies' staging environments for future applications.
>>>> 
>>>> === Inexperience with Open Source ===
>>>> Students and researchers at Tsinghua University have been developing and using open source software for a long time. It is wonderful for students to be guided into a formal open-source process. Some of our committers have experience contributing to open source, for example:
>>>> 
>>>> * druid:
>>>> https://github.com/druid-io/druid/commit/f18cc5df97e5826c2dd8ffafba9fcb69d10a4d44
>>>> 
>>>> * druid:
>>>> https://github.com/druid-io/druid/commit/aa7aee53ce524b7887b218333166941654788794
>>>> 
>>>> * YCSB:
>>>> https://github.com/brianfrankcooper/YCSB/pull/776
>>>> 
>>>> 
>>>> Additionally, several ASF veterans and industry veterans have agreed to mentor the project and are listed in this proposal. The project will rely on their guidance and collective wisdom to quickly transition the entire team of initial committers towards practicing the Apache Way.
>>>> 
>>>> 
>>>> === Reliance on Salaried Developers ===
>>>> Most of the current developers are students and researchers/professors in universities, and their research focuses on big data management and analytics. It is unlikely that they will change their research focus away from big data management. We will work to ensure that the project can continue to be stewarded and to move forward independent of salaried developers.
>>>> 
>>>> === An Excessive Fascination with the Apache Brand ===
>>>> Most of the initial developers come from Tsinghua University with no intent to use the Apache brand for profit. We have no plans for making use of Apache brand in press releases nor posting billboards advertising acceptance of IoTDB into Apache Incubator.
>>>> 
>>>> 
>>>> == Initial Source ==
>>>> IoTDB's github address and some required dependencies:
>>>> 
>>>> * The storage file format:
>>>> https://github.com/thulab/tsfile
>>>> 
>>>> * Adaptor for Apache Hadoop MapReduce:
>>>> https://github.com/thulab/tsfile-hadoop-connector
>>>> 
>>>> * Adaptor for Apache Spark:
>>>> https://github.com/thulab/tsfile-spark-connector
>>>> 
>>>> * Adaptor for Grafana:
>>>> https://github.com/thulab/iotdb-grafana
>>>> 
>>>> * The database engine:
>>>> https://github.com/thulab/iotdb
>>>> (private project up to now)
>>>> * The client driver:
>>>> https://github.com/thulab/iotdb-jdbc
>>>> 
>>>> 
>>>> 
>>>> === External Dependencies ===
>>>> To the best of our knowledge, all dependencies of IoTDB are distributed under Apache compatible licenses. Upon acceptance to the incubator, we would begin a thorough analysis of all transitive dependencies to verify this fact and introduce license checking into the build and release process.
>>>> 
>>>> == Documentation ==
>>>> * Documentation for TsFile:
>>>> https://github.com/thulab/tsfile/wiki
>>>> 
>>>> * Documentation for IoTDB and its JDBC:
>>>> http://tsfile.org/document
>>>> (Chinese only. An English version is in progress.)
>>>> 
>>>> == Required Resources ==
>>>> === Mailing Lists ===
>>>> * private@iotdb.incubator.apache.org
>>>> 
>>>> * dev@iotdb.incubator.apache.org
>>>> 
>>>> * commits@iotdb.incubator.apache.org
>>>> 
>>>> 
>>>> === Git Repositories ===
>>>> * https://git-wip-us.apache.org/repos/asf/incubator-iotdb.git
>>>> 
>>>> 
>>>> === Issue Tracking ===
>>>> *  JIRA IoTDB (We currently use the issue management provided by Github to track issues.)
>>>> 
>>>> 
>>>> == Initial Committers ==
>>>> Tsinghua University, K2Data Company, Lenovo, Microsoft
>>>> 
>>>> Jianmin Wang (jimwang at tsinghua dot edu dot cn )
>>>> 
>>>> Xiangdong Huang (sainthxd at gmail dot com)
>>>> 
>>>> Jun Yuan (richard_yuan16 at 163 dot com)
>>>> 
>>>> Chen Wang ( wang_chen at tsinghua dot edu dot cn)
>>>> 
>>>> Jialin Qiao (qjl16 at mails dot tsinghua dot edu dot cn)
>>>> 
>>>> Jinrui Zhang (jinrzhan at microsoft dot com)
>>>> 
>>>> Rong Kang (kr11 at mails dot tsinghua dot edu dot cn)
>>>> 
>>>> Tian Jiang(jiangtia18 at mails dot tsinghua dot edu dot cn)
>>>> 
>>>> Shuo Zhang (zhangshuo at k2data dot com dot cn)
>>>> 
>>>> Lei Rui (rl18 at mails dot tsinghua dot edu dot cn)
>>>> 
>>>> Rui Liu (liur17 at mails dot tsinghua dot edu dot cn)
>>>> 
>>>> Kun Liu (liukun16 at mails dot tsinghua dot edu dot cn)
>>>> 
>>>> Gaofei Cao (cgf16 at mails dot tsinghua dot edu dot cn)
>>>> 
>>>> Xinyi Zhao (xyzhao16 at mails dot tsinghua dot edu dot cn)
>>>> 
>>>> Dongfang Mao (maodf17 at mails dot tsinghua dot edu dot cn)
>>>> 
>>>> Tianan Li(lta18 at mails dot tsinghua dot edu dot cn)
>>>> 
>>>> Yue Su (suy18 at mails dot tsinghua dot edu dot cn)
>>>> 
>>>> Hui Dai (daihui_iot at lenovo dot com, yuct_iot at lenovo dot com )
>>>> 
>>>> == Sponsors ==
>>>> === Champion ===
>>>> Kevin A. McGrail (kmcgrail@apache.org)
>>>> 
>>>> === Nominated Mentors ===
>>>> Justin Mclean (justin at classsoftware dot com)
>>>> 
>>>> Christofer Dutz (christofer.dutz at c-ware dot de)
>>>> 
>>>> Willem Jiang (willem.jiang at gmail dot com)
>>>> 
>>>> 
>> 
>> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
> For additional commands, e-mail: general-help@incubator.apache.org
> 




Re: [Result][Vote] vote for IoTDB incubation proposal

Posted by Willem Jiang <wi...@gmail.com>.
Here is a question about the source code repository.

The main source git repo[1] is still a private repo.  I think we need
to open source the repo before sending the SGA?


[1] https://github.com/thulab/iotdb

Willem Jiang

Twitter: willemjiang
Weibo: 姜宁willem
On Thu, Nov 15, 2018 at 4:08 PM hxd <hx...@qq.com> wrote:
>
> Hi,
>
> In the proposal discussion process, we got 3 mentors,  Justin Mclean, Christofer Dutz, and Willem Ning Jiang.
>
> In the vote process, we got a new mentor, Joe Witt.
>
> In total, there is one Champion and four mentors:
>
> Kevin A. McGrail (the Champion),
> Justin Mclean,
> Christofer Dutz,
> Willem Ning Jiang, and
> Joe Witt
>
> I have checked their names against http://people.apache.org/committer-index.html, and they are accurate now.
> The name list on the proposal page (https://wiki.apache.org/incubator/IoTDBProposal) is also correct.
>
> Regards,
> Xiangdong Huang
>
>
>
> On Nov 15, 2018, at 12:51 AM, Kevin A. McGrail <km...@apache.org> wrote:
>
> Congratulations!  As champion, I think the next steps are:
>
> 1 - Xiangdong, Can you confirm the list of mentors on the proposal is accurate?
>
> 2 - Also Xiangdong, Is there anyone else that stepped forward as a mentor during the voting process that the project wants the IPMC to approve?
>
> 3 - Justin, I think you have to request the creation of the podling and then I as champion work on things like the meta data file from this page,
> https://incubator.apache.org/policy/incubation.html, correct?
>
> Regards,
> KAM
>
>
>
>
> --
> Kevin A. McGrail
> VP Fundraising, Apache Software Foundation
> Chair Emeritus Apache SpamAssassin Project
> https://www.linkedin.com/in/kmcgrail - 703.798.0171
>
>
> On Wed, Nov 14, 2018 at 6:29 AM hxd <hx...@qq.com> wrote:
>>
>> Hi,
>>
>> With 8 +1 binding votes, 2 +1 non-binding votes, and no +/-0 or -1 votes, this VOTE passes.
>>
>> Thanks to everyone who voted!
>>
>> Below is the voting tally:
>>
>> Binding
>> Von Gosling
>>  Christofer Dutz
>>  Kevin A. McGrail
>>  Felix Cheung
>>  Matt Sticker
>>  Joe Witt
>>  Justin Mclean
>>  Willem Jiang
>>
>>
>> Non-binding
>>  Sheng Wu
>>  Yang Bo
>>
>> The vote thread: https://lists.apache.org/thread.html/077f029ab2b52a2b19fc8d41c07438f660a8e93dd87b3895d262263c@%3Cgeneral.incubator.apache.org%3E
>> The proposal: https://wiki.apache.org/incubator/IoTDBProposal
>>
>> Thanks,
>>
>> Xiangdong Huang
>>
>>
>> > On Nov 7, 2018, at 3:46 PM, hxd <hx...@qq.com> wrote:
>> >
>> > Hi,
>> >
>> > Sorry for the previous mail with bad format.
>> > I'd like to call a VOTE to accept the IoTDB project, a database for managing large amounts of time series data from IoT sensors in industrial applications, into the Apache Incubator.
>> > The full proposal is available on the wiki: https://wiki.apache.org/incubator/IoTDBProposal
>> > and it is also attached below for your convenience.
>> >
>> > Please cast your vote:
>> >
>> >   [ ] +1, bring IoTDB into Incubator
>> >   [ ] +0, I don't care either way,
>> >   [ ] -1, do not bring IoTDB into Incubator, because...
>> >
>> > The vote will be open for at least 72 hours.
>> >
>> > Thanks,
>> > Xiangdong Huang.
>> >
>> >
>> > = IoTDB Proposal  =
>> > v0.1.1
>> >
>> >
>> > == Abstract ==
>> > IoTDB is a data store for managing large amounts of time series data such as timestamped data from IoT sensors in industrial applications.
>> >
>> > == Proposal ==
>> > IoTDB is a database for managing large amounts of time series data with columnar storage, data encoding, pre-computation, and index techniques. It has a SQL-like interface to write millions of data points per second per node and is optimized to get query results in a few seconds over trillions of data points. It can also be easily integrated with Apache Hadoop MapReduce and Apache Spark for analytics.
>> >
>> > == Background ==
>> >
>> > A new class of data management system requirements is becoming increasingly important with the rise of the Internet of Things. There are some database systems and technologies aimed at time series data management. For example, Gorilla and InfluxDB are mainly built for data centers and monitoring application metrics. Other systems, for example, OpenTSDB and KairosDB, are built on Apache HBase and Apache Cassandra, respectively.
>> >
>> > However, many applications for time series data management have additional requirements, especially in industrial applications, as follows:
>> >
>> >  * Supporting time series data with a high data frequency. For example, a turbine engine may generate 1000 points per second (i.e., 1000Hz), while each CPU reports only 1 data point every 5 seconds in a data center monitoring application.
>> >
>> >  * Supporting scanning data at multiple resolutions. For example, aggregation operations are important for time series data.
>> >
>> >  * Supporting special queries for time series, such as pattern matching, time series segmentation, time-frequency transformation and frequency query.
>> >
>> >  * Supporting a large number of monitoring targets (i.e. time series). An excavator may report more than 1000 time series, for example, the revolving speed of the motor-engine, the speed of the excavator, the acceleration, the temperature of the water tank and so on, while a CPU or an application monitor has far fewer time series.
>> >
>> >  * Optimization for out-of-order data points. In the industrial sector, it is common that equipment sends data using the UDP protocol rather than the TCP protocol. Sometimes, the network connection is unstable and parts of the data will be buffered for later sending.
>> >
>> >  * Supporting long-term storage. Historical data is precious for equipment manufacturers. Therefore, removing or unloading historical data is highly undesirable for most industrial applications. The database system must not only support fast retrieval of historical data, but should also guarantee that the historical data does not impact the processing speed for “hot” or current data.
>> >
>> >  * Supporting online transaction processing (OLTP) as well as complex analytics. Analyzing the data files directly with Apache Spark/Apache Hadoop MapReduce is clearly better than transforming the data files to another file format for Big Data analytics.
>> >
>> >  * Flexible deployment either on premise or in the cloud. IoTDB is simple enough to be deployed on a Raspberry Pi handling hundreds of time series. Meanwhile, the system can also be deployed in the cloud so that it supports tens of millions of ingestions per second, OLTP queries in milliseconds, and analytics using Apache Spark/Apache Hadoop MapReduce.
>> >
>> >  * * (1) If users deploy IoTDB on a device, such as a Raspberry Pi, a wind turbine, or a meteorological station, the deployment is designed to be simple. A device may have hundreds of time series (but fewer than a thousand) and the database needs to handle them.
>> >  * * (2) When deploying IoTDB in a data center, the computational resources (i.e., the hardware configuration of servers) are not a constraint compared to a Raspberry Pi. In this deployment, IoTDB can use more computational resources and can handle more time series (e.g., millions of time series).
>> >
>> > Based on these requirements, we developed IoTDB, a new data store system for managing time series data.
>> >
>> > IoTDB started as a Tsinghua University research project. Its developer community has since grown to include additional institutions, for example, universities (e.g., Fudan University), research labs (e.g., NEL-BDS lab), and corporations (e.g., K2Data, Tencent). Funding has been provided by various institutions including the National Natural Science Foundation of China, and industry sponsors, such as Lenovo and K2Data.
>> >
>> > == Rationale ==
>> > Because no existing open source time series database covers all the above requirements, we developed IoTDB. As the system matures, we are seeking a long-term home for the project. We believe the Apache Software Foundation would be an ideal fit. Joining Apache will also help coordinate and improve the development effort of the growing number of organizations that contribute to IoTDB, and will improve the diversity of our community.
>> >
>> > IoTDB contains multiple modules, which are classified into categories:
>> >
>> >  * '''TsFile Format''': TsFile is a new columnar file format.
>> >  * '''Adaptor for Analytics and Visualization''': Integrating TsFile with Apache Hadoop HDFS, Apache Hadoop MapReduce and Apache Spark. Examples of integrating IoTDB with Apache Kafka, Apache Storm and Grafana are also provided.
>> >  * '''IoTDB Engine''': An engine which consists of a SQL parser, query plan generator, memtable, authentication and authorization, write ahead log (WAL), crash recovery, out-of-order data handler, and indexes for aggregation and pattern matching. The engine stores system data in TsFile format.
>> >  * '''IoTDB JDBC''': An implementation of Java Database Connectivity (JDBC) for clients to connect to IoTDB using Java.
>> >
>> > === TsFile Format ===
>> >
>> > TsFile format is a columnar store, similar to Apache Parquet and Apache CarbonData. It has the concepts of Chunk Group, Column Chunk, Page and Footer. Compared with Apache Parquet and Apache CarbonData, it is designed and optimized for time series:
>> >
>> > ==== Time Series Friendly Encoding ====
>> > IoTDB currently supports run length encoding (RLE), delta-of-delta encoding, and Facebook's Gorilla encoding.
>> >
>> > Lossy encoding methods (e.g., Piecewise Linear Approximation (PLA) and time-frequency transformation) are works-in-progress.
>> >
>> >
>> > ==== Chunk Group ====
>> > The data part of a TsFile consists of many Chunk Groups. Each Chunk Group stores the data of one device over a time interval.  A Chunk Group is similar to the row group in Apache Parquet, but with some constraints on the time dimension:  for each device, the time intervals of different Chunk Groups do not overlap, and a later Chunk Group always has larger timestamps.
>> >
>> > Given a TsFile and a query with a time range filter, the query process can terminate scanning data once it reads data points whose timestamp reaches the time limit of the filter. We call the feature ''fast-return'' and it makes the time range query in a TsFile very efficient.
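The ''fast-return'' idea described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not IoTDB's actual reader: the Chunk Groups of one device are modelled as hypothetical in-memory lists of (timestamp, value) pairs, sorted and non-overlapping in time, and `time_range_query` is a made-up name.

```python
from typing import List, Tuple

# Hypothetical stand-in for one device's Chunk Groups in a TsFile: each group
# is a list of (timestamp, value) pairs sorted by timestamp, and the groups
# themselves are ordered and do not overlap in time.
ChunkGroup = List[Tuple[int, float]]


def time_range_query(chunk_groups: List[ChunkGroup], start: int, end: int):
    """Return (points, groups_visited) for start <= timestamp <= end."""
    result = []
    groups_visited = 0
    for group in chunk_groups:
        if not group:
            continue
        if group[0][0] > end:
            # Fast-return: every later group only holds larger timestamps.
            break
        groups_visited += 1
        if group[-1][0] < start:
            continue  # the whole group lies before the requested range
        for ts, value in group:
            if ts > end:
                break  # the rest of this group is past the range
            if ts >= start:
                result.append((ts, value))
    return result, groups_visited


groups = [
    [(1, 0.1), (2, 0.2), (3, 0.3)],
    [(4, 0.4), (5, 0.5), (6, 0.6)],
    [(7, 0.7), (8, 0.8), (9, 0.9)],
]
points, visited = time_range_query(groups, 2, 5)
print(points, visited)
```

In this toy data set the third group is never touched, because the scan stops as soon as a group starts after the end of the time filter.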
>> >
>> >
>> >
>> > ==== Different Column Chunk Format (No Repetition (R) and Definition (D) Fields Needed) ====
>> >
>> > While Apache Parquet and Apache CarbonData support complex data types, e.g., nested data and sparse columns, TsFile is exclusively designed for time series whose data model is \<device_id, series_id, timestamp, value\>.
>> >
>> > In a `Chunk Group`, each time series is a `Column Chunk`. Even though these time series belong to the same device, the data points in different time series are not necessarily aligned in the time dimension.
>> >
>> > For example, if a device has 2 sensors with the same data collection frequency, sensor 1 may collect data at time 1521622662000 while the other collects data at time 1521622662001 (delta=1ms). Therefore, each Column Chunk has its own timestamps and values, which is quite different from Apache Parquet and Apache CarbonData.  Because different time series may have different data frequencies, we store a time column along with each value column instead of making different chunks share one time column, so we do not store any null values on disk to align time series with each other. Besides, we do not need to attach `repetition` (R) and `definition` (D) fields to each value. As a result, disk space is saved and query latency is reduced (because we do not align data by calculating R and D fields).
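To make the trade-off above concrete, here is a small sketch that compares the two layouts for the two-sensor example. It is an illustration only: the second reading of each sensor and the helper `aligned_rows` are assumptions added for the example, and cell counts are used as a rough proxy for storage, ignoring encoding and compression.

```python
# Two sensors of one device; sensor 2 lags sensor 1 by 1 ms.
# The first timestamps come from the text above, the second ones are made up.
sensor1 = [(1521622662000, 1.0), (1521622663000, 1.1)]
sensor2 = [(1521622662001, 2.0), (1521622663001, 2.1)]


def aligned_rows(*series):
    """Build a shared-time-column table; missing cells become None."""
    all_times = sorted({t for s in series for t, _ in s})
    lookups = [dict(s) for s in series]
    return [(t, *[lookup.get(t) for lookup in lookups]) for t in all_times]


rows = aligned_rows(sensor1, sensor2)

# Shared time column: one time cell plus one cell per sensor in every row.
aligned_cells = len(rows) * 3
null_cells = sum(cell is None for row in rows for cell in row[1:])

# Per-column layout as described above: each point stores its own
# timestamp and value, and nothing is stored for "missing" alignments.
per_column_cells = 2 * (len(sensor1) + len(sensor2))

print(aligned_cells, null_cells, per_column_cells)
```

With timestamps that never coincide, every aligned row carries one null, while the per-column layout stores only real timestamps and values.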
>> >
>> >
>> > ==== Domain Specific Information in Each Page ====
>> > Similar to Apache Parquet and Apache CarbonData, a `Column Chunk` consists of several `Pages`, and each `Page` has a `Page header`. The `Page header` is a summary of the data in the page.
>> >
>> > Because TsFile is optimized for time series, the page header contains more domain-specific information, such as the minimum and maximum value, the minimum and maximum timestamp, the frequency and so on. TsFile can even store a histogram of the values in the page header.
>> >
>> > This header information helps IoTDB in speeding up queries by skipping unnecessary pages.
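Page skipping based on header statistics can be sketched like this. The `Page` class and its field names are hypothetical and only mirror the statistics named above (minimum/maximum timestamp and value); the real page header layout is not shown in this proposal.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Page:
    # Hypothetical page: header statistics plus the points they summarize.
    min_time: int
    max_time: int
    min_value: float
    max_value: float
    points: List[Tuple[int, float]]


def make_page(points: List[Tuple[int, float]]) -> Page:
    times = [t for t, _ in points]
    values = [v for _, v in points]
    return Page(min(times), max(times), min(values), max(values), points)


def query(pages: List[Page], start: int, end: int, threshold: float):
    """Points with start <= time <= end and value > threshold."""
    hits, pages_read = [], 0
    for page in pages:
        # Decide from the header alone whether the page can contain a match.
        if page.max_time < start or page.min_time > end:
            continue
        if page.max_value <= threshold:
            continue
        pages_read += 1  # only now would the page body be decoded
        hits.extend(
            (t, v) for t, v in page.points if start <= t <= end and v > threshold
        )
    return hits, pages_read


pages = [
    make_page([(1, 10.0), (2, 12.0)]),
    make_page([(3, 50.0), (4, 55.0)]),
    make_page([(5, 11.0), (6, 9.0)]),
]
hits, pages_read = query(pages, 1, 6, 40.0)
print(hits, pages_read)
```

Here two of the three pages are rejected using only their maximum value, so only one page body is read.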
>> >
>> >
>> > === Adaptor for Analytics ===
>> > The TsFile provides:
>> >
>> >  * InputFormat/OutputFormat interfaces for Reading/Writing data.
>> >  * Deep integration with Apache Spark/Hadoop MapReduce including predicate push-down, column pruning, aggregation push-down, etc., so users can use Apache Spark SQL/HiveQL to connect to and query TsFiles.
>> >
>> >
>> > === IoTDB Engine ===
>> > The IoTDB engine is a database engine which uses TsFile as its storage file format. The IoTDB Engine supports SQL-like queries plus many useful functions:
>> >
>> >  * Tree-based time series schema
>> >  * Log-Structured Merge (LSM)-based storage
>> >  * Overflow file for out-of-order data
>> >  * Scalable index framework
>> >  * Special queries for time series
>> >
>> > ==== Tree-based Time Series Schema ====
>> > IoTDB manages all the time series definitions using a tree structure. A path from the root of the tree to a leaf node represents a time series. Therefore, the unique id of a time series is a path, e.g., `root.China.beijing.windFarm1.windTurbine1.speed`.
>> >
>> > This kind of schema can express `group by` naturally. For example, `root.China.beijing.windFarm1.*.speed` represents the speed of all the wind turbines in wind farm 1 in Beijing, China.
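The path-based grouping above can be illustrated with a small matcher. This is a sketch under an explicit assumption: `*` is taken to match exactly one level of the tree, which is how the example in the text reads, but the proposal does not define the wildcard semantics. The extra series names (`windTurbine2`, `temperature`, `windFarm2`) are made up for the example.

```python
from typing import List


def matches(pattern: str, path: str) -> bool:
    """True if the path matches the pattern, with '*' matching one level."""
    pattern_nodes = pattern.split(".")
    path_nodes = path.split(".")
    if len(pattern_nodes) != len(path_nodes):
        return False
    return all(p == "*" or p == n for p, n in zip(pattern_nodes, path_nodes))


def select(pattern: str, paths: List[str]) -> List[str]:
    """Return all time series paths that match the pattern."""
    return [path for path in paths if matches(pattern, path)]


series = [
    "root.China.beijing.windFarm1.windTurbine1.speed",
    "root.China.beijing.windFarm1.windTurbine2.speed",
    "root.China.beijing.windFarm1.windTurbine1.temperature",
    "root.China.beijing.windFarm2.windTurbine1.speed",
]
selected = select("root.China.beijing.windFarm1.*.speed", series)
print(selected)
```

The pattern selects the speed series of every turbine in wind farm 1 and leaves out other measurements and other wind farms.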
>> >
>> > ==== Log-Structured Merge (LSM)-based Storage ====
>> > In a time series, the data points should be ordered by their timestamps. IoTDB uses a Log-Structured Merge (LSM) based mechanism: a part of the data is first stored in memory, in what is called a `memtable`. If data points arrive out of order at this stage, we re-sort them in memory. When this part of the data exceeds the configured memory limit, we flush it to disk as a `Chunk Group` in an unclosed TsFile.  As a result, a TsFile may contain several Chunk Groups, which reduces the number of small data files, lowers the I/O load on the storage system, and shortens the execution time of a file merge in LSM. Note that the data within one Chunk Group on disk is time-ordered, and this layout helps a query filter quickly within a Chunk Group.
>> >
>> > Rule 1: In a TsFile, the Chunk Groups of one device are ordered by timestamp, which helps a query filter quickly among Chunk Groups.
>> >
>> > Rule 2: When the size of the unclosed TsFile reaches the threshold defined in the configuration file, we close the file and create a new one to store newly arriving data spanning the entire data set. Like many systems that use LSM-based storage, we never modify a closed TsFile except during the file-merge process.
>> >
>> > Rule 3: To reduce the number of TsFiles involved in a query, we guarantee that, after file merging, the data points in different TsFiles do not overlap in the time dimension.
>> >
>> > ==== Overflow File for Out-of-order Data ====
>> > Once a part of the data has been flushed to disk (forming a `Chunk Group` in a TsFile), newly arriving data points whose timestamps are smaller than the largest timestamp in the TsFile are `out-of-order`.
>> >
>> > To store the out-of-order data, we collect all the troublesome `out-of-order` data point insertions in a special TsFile, named `UnSequenceTsFile`. In an UnSequenceTsFile, the Chunk Groups of one device may overlap in the time dimension, which violates Rule 1 and makes query filtering slower than in a normal TsFile.
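The write path described above (memtable, flush to a Chunk Group, and routing of late points) can be sketched as follows. This is a simplified model, not the engine's code: `DeviceWriter` and its attributes are hypothetical names, the memory limit is approximated by a point count, and files are modelled as plain lists.

```python
class DeviceWriter:
    """Toy write path for a single device (all names are hypothetical)."""

    def __init__(self, memtable_limit: int):
        self.memtable_limit = memtable_limit  # stand-in for the memory limit
        self.memtable = []          # points not yet flushed
        self.sequence_groups = []   # flushed Chunk Groups (normal TsFile)
        self.unsequence = []        # late points (UnSequenceTsFile)
        self.last_flushed_time = None

    def insert(self, ts: int, value: float) -> None:
        # A point older than the largest flushed timestamp is out-of-order.
        if self.last_flushed_time is not None and ts < self.last_flushed_time:
            self.unsequence.append((ts, value))
            return
        self.memtable.append((ts, value))
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self) -> None:
        if not self.memtable:
            return
        # Disorder inside the memtable is fixed by sorting before the flush.
        group = sorted(self.memtable)
        self.sequence_groups.append(group)
        self.last_flushed_time = group[-1][0]
        self.memtable = []


writer = DeviceWriter(memtable_limit=3)
for ts, value in [(10, 1.0), (30, 3.0), (20, 2.0), (25, 2.5), (40, 4.0)]:
    writer.insert(ts, value)
print(writer.sequence_groups, writer.unsequence, writer.memtable)
```

The point at time 20 arrives late but is still in memory, so it is simply re-sorted; the point at time 25 arrives after the flush and therefore goes to the unsequence store.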
>> >
>> > There is another special operation: updating all the data points in a time range, e.g., `update all the speed values of device1 as 0 where the data time is in [1521622000000, 1521622662000]`. This operation is needed when (1) a sensor malfunctions and the database receives wrong data for a period, or (2) we want to reset all the records. Many NoSQL time series databases do not support such an operation. To support it in IoTDB, we use a tree-based structure, a Treap, to hold these operations and persist them as `Overflow` files.
>> >
>> > Therefore, there are 3 kinds of data files: TsFiles, UnSequenceTsFiles and Overflow files.  TsFiles should store most of the data. The volume of UnSequenceTsFiles depends on the workload: if there are many out-of-order data points and they span a long time range, the volume will be large. Overflow files handle the fewest data operations, and their volume depends on how often the special operations are used.
>> >
>> > ==== LSM-tree ====
>> > Normally, LSM-based storage engines merge data files level by level so that it looks like a tree structure. In this way, data is well organized. The disadvantage is that data will be read and written several times. If the tree has 4 levels, each data point will be rewritten at least 4 times.
>> >
>> > Currently, we do not merge all the TsFiles into one because (1) the number of TsFiles is kept lower than many LSM storage engines because a memtable is mapped to several Chunk Groups rather than a file; (2) different TsFiles are not overlapping with each other in the time dimension (because of Rule 3).
>> >
>> > As mentioned before, TsFile supports ''fast-return'' to accelerate queries. However, UnSequenceTsFiles and Overflow files do not support this feature. The time spans of UnSequenceTsFiles, Overflow files and TsFiles may overlap, which means more files are involved in a query. To accelerate these queries, a background merging process reorganizes the files. All three kinds of files (TsFiles, UnSequenceTsFiles and Overflow files) take part in the merging process. The merging process is multi-threaded, with each thread responsible for one series family.
>> > After merging, only TsFiles are left. These files have non-overlapping time spans and support the ''fast-return'' feature.
>> >
>> > ==== Scalable Index Framework ====
>> > We allow users to implement indexes for faster queries. We currently support an index for pattern matching query (KV-Match index, ICDE 2019). Another index for fast aggregation (PISA index, CIKM 2016) is a work-in-progress.
>> >
>> > ==== Special Queries ====
>> > We currently support `group by time interval` aggregation queries and `Fill by` operations, which are similar to those of InfluxDB. Time series segmentation operations and frequency queries are work-in-progress.
>> >
>> > == Initial Goals ==
>> > The initial goals are to be open sourced and to integrate with the Apache development process. Furthermore, we plan for incremental development and releases in line with the Apache guidelines.
>> >
>> > == Current Status ==
>> > We have been developing the system for more than 2 years. There are currently 13k lines of code, some of which are generated by Antlr3 and Thrift.  230 issues have been resolved and there are more than 1500 commits.
>> >
>> > The system has been deployed in the staging environment of the State Grid Corporation of China to handle ~3 million time series (i.e., ~30,000 power generation assemblies * ~100 sensors) and at an equipment service company in China managing ~2 million time series (i.e., ~20k devices * 100 sensors). The insertion speed reaches ~2 million points/second/node, which is faster than InfluxDB, OpenTSDB and Apache Cassandra in our environment.
>> >
>> > There are many new features in the works including those mentioned herein. We will add more analytics functions, improve the data file merge process, and finish the first released version of IoTDB.
>> >
>> > == Meritocracy ==
>> > The IoTDB project operates on meritocratic principles. Developers who submit more code of higher quality earn more merit. We have used the `Issues` and `Pull Requests` modules on GitHub to collect users' suggestions and patches. Users who submit issues, pull requests and documents, or who help with community management, are welcomed and encouraged to become committers.
>> >
>> > == Community ==
>> >
>> > IoTDB users communicate on GitHub (https://github.com/thulab/tsfile). Developers communicate on a website similar to JIRA (currently, only registered users can apply for access to the project: https://tower.im/projects/36de8571a0ff4833ae9d7f1c5c400c22/). We have also introduced IoTDB at many technical conferences. Next, we will set up a mailing list for more convenient, broader communication and archived discussions.
>> >
>> > If IoTDB is accepted for incubation at the Apache Software Foundation, the primary goal is to build a larger community. We believe that IoTDB will become a key project for time series data management, and so, we will rely on a large community of users and developers.
>> >
>> > TODO: IoTDB is currently in a private GitHub repository (https://github.com/thulab/iotdb), while its subproject TsFile (a file format for storing time series data) is open sourced on GitHub (https://github.com/thulab/tsfile).
>> >
>> > == Core Developers ==
>> > IoTDB was initially developed by two dozen students and teachers at Tsinghua University. Since then, more and more developers have joined from other universities in China: Fudan University, Northwestern Polytechnical University and Harbin Institute of Technology.  Other developers come from companies such as Lenovo and Microsoft. We will work to bring more developers into the project to contribute to IoTDB.
>> >
>> > == Relationships with Other Apache Products ==
>> > IoTDB requires some Apache products (Apache Thrift, commons, collections, httpclient).
>> >
>> > IoTDB-Spark-connector and IoTDB-Hadoop-connector have been developed to support analyzing time series data with Apache Spark and MapReduce.
>> >
>> > Overall, IoTDB is designed as an open architecture, and it can be integrated with many other systems in the future.
>> >
>> > As mentioned before, in the IoTDB project, we designed a new columnar file format, called TsFile, which is similar to Apache Parquet. However, the new file format is optimized for time series data.
>> >
>> >
>> >
>> > == Known Risks ==
>> >
>> > === Orphaned Products ===
>> > Given the current level of investment in IoTDB, the risk of the project being abandoned is minimal. Time series data is increasingly important and there are several constituents who are highly motivated to continue development. Tsinghua and NEL-BDS Lab rely on IoTDB as a platform for a large number of long-term research projects. We have deployed IoTDB in some companies' staging environments for future applications.
>> >
>> > === Inexperience with Open Source ===
>> > Students and researchers at Tsinghua University have been developing and using open source software for a long time. It is wonderful for students to be guided into a formal open-source process. Some of our committers have experience contributing to open source, for example:
>> >
>> >  * druid: https://github.com/druid-io/druid/commit/f18cc5df97e5826c2dd8ffafba9fcb69d10a4d44
>> >  * druid: https://github.com/druid-io/druid/commit/aa7aee53ce524b7887b218333166941654788794
>> >  * YCSB: https://github.com/brianfrankcooper/YCSB/pull/776
>> >
>> >
>> > Additionally, several ASF veterans and industry veterans have agreed to mentor the project and are listed in this proposal. The project will rely on their guidance and collective wisdom to quickly transition the entire team of initial committers towards practicing the Apache Way.
>> >
>> >
>> > === Reliance on Salaried Developers ===
>> > Most of the current developers are students and researchers/professors at universities, and their research focuses on big data management and analytics. It is unlikely that they will change their research focus away from big data management.  We will work to ensure that the project can continue to be stewarded and move forward independently of salaried developers.
>> >
>> > === An Excessive Fascination with the Apache Brand ===
>> > Most of the initial developers come from Tsinghua University and have no intent to use the Apache brand for profit. We have no plans to use the Apache brand in press releases or to post billboards advertising the acceptance of IoTDB into the Apache Incubator.
>> >
>> >
>> > == Initial Source ==
>> > IoTDB's GitHub addresses and some required dependencies:
>> >
>> >  * The storage file format: https://github.com/thulab/tsfile
>> >  * Adaptor for Apache Hadoop MapReduce: https://github.com/thulab/tsfile-hadoop-connector
>> >  * Adaptor for Apache Spark: https://github.com/thulab/tsfile-spark-connector
>> >  * Adaptor for Grafana: https://github.com/thulab/iotdb-grafana
>> >  * The database engine: https://github.com/thulab/iotdb (private project up to now)
>> >  * The client driver: https://github.com/thulab/iotdb-jdbc
>> >
>> >
>> >
>> > === External Dependencies ===
>> > To the best of our knowledge, all dependencies of IoTDB are distributed under Apache compatible licenses. Upon acceptance to the incubator, we would begin a thorough analysis of all transitive dependencies to verify this fact and introduce license checking into the build and release process.
>> >
>> > == Documentation ==
>> >  * Documentation for TsFile: https://github.com/thulab/tsfile/wiki
>> >  * Documentation for IoTDB and its JDBC: http://tsfile.org/document (Chinese only. An English version is in progress.)
>> >
>> > == Required Resources ==
>> > === Mailing Lists ===
>> >  * private@iotdb.incubator.apache.org
>> >  * dev@iotdb.incubator.apache.org
>> >  * commits@iotdb.incubator.apache.org
>> >
>> >
>> > === Git Repositories ===
>> >  * https://git-wip-us.apache.org/repos/asf/incubator-iotdb.git
>> >
>> >
>> > === Issue Tracking ===
>> >  *  JIRA IoTDB (We currently use the issue management provided by Github to track issues.)
>> >
>> >
>> > == Initial Committers ==
>> > Tsinghua University, K2Data Company, Lenovo, Microsoft
>> >
>> > Jianmin Wang (jimwang at tsinghua dot edu dot cn )
>> >
>> > Xiangdong Huang (sainthxd at gmail dot com)
>> >
>> > Jun Yuan (richard_yuan16 at 163 dot com)
>> >
>> > Chen Wang ( wang_chen at tsinghua dot edu dot cn)
>> >
>> > Jialin Qiao (qjl16 at mails dot tsinghua dot edu dot cn)
>> >
>> > Jinrui Zhang (jinrzhan at microsoft dot com)
>> >
>> > Rong Kang (kr11 at mails dot tsinghua dot edu dot cn)
>> >
>> > Tian Jiang(jiangtia18 at mails dot tsinghua dot edu dot cn)
>> >
>> > Shuo Zhang (zhangshuo at k2data dot com dot cn)
>> >
>> > Lei Rui (rl18 at mails dot tsinghua dot edu dot cn)
>> >
>> > Rui Liu (liur17 at mails dot tsinghua dot edu dot cn)
>> >
>> > Kun Liu (liukun16 at mails dot tsinghua dot edu dot cn)
>> >
>> > Gaofei Cao (cgf16 at mails dot tsinghua dot edu dot cn)
>> >
>> > Xinyi Zhao (xyzhao16 at mails dot tsinghua dot edu dot cn)
>> >
>> > Dongfang Mao (maodf17 at mails dot tsinghua dot edu dot cn)
>> >
>> > Tianan Li(lta18 at mails dot tsinghua dot edu dot cn)
>> >
>> > Yue Su (suy18 at mails dot tsinghua dot edu dot cn)
>> >
>> > Hui Dai (daihui_iot at lenovo dot com, yuct_iot at lenovo dot com )
>> >
>> > == Sponsors ==
>> > === Champion ===
>> > Kevin A. McGrail (kmcgrail@apache.org)
>> >
>> > === Nominated Mentors ===
>> > Justin Mclean (justin at classsoftware dot com)
>> >
>> > Christofer Dutz (christofer.dutz at c-ware dot de)
>> >
>> > Willem Jiang (willem.jiang at gmail dot com)
>> >
>> >
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
For additional commands, e-mail: general-help@incubator.apache.org


Re: [Result][Vote] vote for IoTDB incubation proposal

Posted by hxd <hx...@qq.com>.
Hi,

During the proposal discussion, we got 3 mentors: Justin Mclean, Christofer Dutz, and Willem Ning Jiang. 

During the vote, we got a new mentor, Joe Witt.

In total, there are one Champion and four mentors:

Kevin A. McGrail (the Champion),
Justin Mclean, 
Christofer Dutz, 
Willem Ning Jiang, and
Joe Witt

I have checked their names on http://people.apache.org/committer-index.html, and they are accurate now. 
The list of names on the proposal page (https://wiki.apache.org/incubator/IoTDBProposal) is also correct.

Regards,
Xiangdong Huang

 

> On Nov 15, 2018, at 12:51 AM, Kevin A. McGrail <km...@apache.org> wrote:
> 
> Congratulations!  As champion, I think the next steps are:
> 
> 1 - Xiangdong, Can you confirm the list of mentors on the proposal is accurate?
> 
> 2 - Also Xiangdong, Is there anyone else that stepped forward as a mentor during the voting process that the project wants the IPMC to approve?  
> 
> 3 - Justin, I think you have to request the creation of the podling and then I as champion work on things like the metadata file from this page, 
> https://incubator.apache.org/policy/incubation.html, correct?
> 
> Regards,
> KAM
> 
> 
> 
> 
> --
> Kevin A. McGrail
> VP Fundraising, Apache Software Foundation
> Chair Emeritus Apache SpamAssassin Project
> https://www.linkedin.com/in/kmcgrail - 703.798.0171
> 
> On Wed, Nov 14, 2018 at 6:29 AM hxd <hxdreg@qq.com> wrote:
> Hi,
> 
> With 8 +1 binding votes, 2 +1 non-binding votes and no +/-0 or -1 votes, this VOTE passes. 
> 
> Thanks to everyone who voted!
> 
> Below is the voting tally:
> 
> Binding
> Von Gosling
>  Christofer Dutz 
>  Kevin A. McGrail
>  Felix Cheung
>  Matt Sticker
>  Joe Witt
>  Justin Mclean 
>  Willem Jiang 
> 
> 
> Non-binding
>  Sheng Wu
>  Yang Bo
> 
> The vote thread: https://lists.apache.org/thread.html/077f029ab2b52a2b19fc8d41c07438f660a8e93dd87b3895d262263c@%3Cgeneral.incubator.apache.org%3E
> The proposal: https://wiki.apache.org/incubator/IoTDBProposal
> 
> Thanks,
> 
> Xiangdong Huang
> 
> 
> > On Nov 7, 2018, at 3:46 PM, hxd <hxdreg@qq.com> wrote:
> > 
> > Hi,
> > 
> > Sorry for the badly formatted previous mail.
> > I'd like to call a VOTE to accept the IoTDB project, a database for managing large amounts of time series data from IoT sensors in industrial applications, into the Apache Incubator. 
> > The full proposal is available on the wiki: https://wiki.apache.org/incubator/IoTDBProposal
> > and it is also attached below for your convenience.
> > 
> > Please cast your vote:
> > 
> >   [ ] +1, bring IoTDB into Incubator
> >   [ ] +0, I don't care either way,
> >   [ ] -1, do not bring IoTDB into Incubator, because...
> > 
> > The vote will be open for at least 72 hours.
> > 
> > Thanks,
> > Xiangdong Huang.
> > 
> > 
> > = IoTDB Proposal  =
> > v0.1.1
> > 
> > 
> > == Abstract ==
> > IoTDB is a data store for managing large amounts of time series data such as timestamped data from IoT sensors in industrial applications.
> > 
> > == Proposal ==
> > IoTDB is a database for managing large amounts of time series data with columnar storage, data encoding, pre-computation, and index techniques. It has a SQL-like interface, can write millions of data points per second per node, and is optimized to return query results in a few seconds over trillions of data points. It can also be easily integrated with Apache Hadoop MapReduce and Apache Spark for analytics.
> > 
> > == Background ==
> > 
> > A new class of data management system requirements is becoming increasingly important with the rise of the Internet of Things. There are some database systems and technologies aimed at time series data management.  For example, Gorilla and InfluxDB are mainly built for data centers and monitoring application metrics. Other systems, for example, OpenTSDB and KairosDB, are built on Apache HBase and Apache Cassandra, respectively. 
> > 
> > However, many applications for time series data management, especially industrial applications, have additional requirements:
> > 
> >  * Supporting time series data with a high data frequency. For example, a turbine engine may generate 1000 points per second (i.e., 1000Hz), while each CPU only reports 1 data point every 5 seconds in a data center monitoring application.
> > 
> >  * Supporting scanning data at multiple resolutions. For example, aggregation operations are important for time series data.
> > 
> >  * Supporting special queries for time series, such as pattern matching, time series segmentation, time-frequency transformation and frequency query.
> > 
> >  * Supporting a large number of monitoring targets (i.e., time series). An excavator may report more than 1000 time series, for example, the rotational speed of the engine, the speed of the excavator, its acceleration, the temperature of the water tank and so on, while a CPU or an application monitor has far fewer time series.
> > 
> >  * Optimization for out-of-order data points. In the industrial sector, it is common for equipment to send data using UDP rather than TCP. Sometimes the network connection is unstable and parts of the data are buffered and sent later.
> > 
> >  * Supporting long-term storage. Historical data is precious for equipment manufacturers. Therefore, removing or unloading historical data is highly undesirable for most industrial applications. The database system must not only support fast retrieval of historical data, but should also guarantee that historical data does not impact the processing speed for “hot” or current data.
> > 
> >  * Supporting online transaction processing (OLTP) as well as complex analytics. Analyzing the data files directly with Apache Spark/Apache Hadoop MapReduce is clearly better than first transforming them into another file format for Big Data analytics.
> > 
> >  * Flexible deployment either on premise or in the cloud.  IoTDB is simple enough to be deployed on a Raspberry Pi handling hundreds of time series. Meanwhile, the system can also be deployed in the cloud so that it supports tens of millions of ingestions per second, OLTP queries in milliseconds, and analytics using Apache Spark/Apache Hadoop MapReduce.
> > 
> >  * * (1) If users deploy IoTDB on a device, such as a Raspberry Pi, a wind turbine, or a meteorological station, the deployment needs to be simple. A device may have hundreds of time series (but fewer than a thousand) and the database needs to handle them.
> >  * * (2) When IoTDB is deployed in a data center, computational resources (i.e., the hardware configuration of the servers) are not a constraint compared to a Raspberry Pi. In this deployment, IoTDB can use more computational resources and can handle more time series (e.g., millions of time series).
> > 
> > Based on these requirements, we developed IoTDB, a new data store system for managing time series data. 
> > 
> > IoTDB started as a Tsinghua University research project. IoTDB's developer community has also grown to include additional institutions, for example, universities (e.g., Fudan University), research labs (e.g, NEL-BDS lab), and corporations (e.g., K2Data, Tencent). Funding has been provided by various institutions including the National Natural Science Foundation of China, and industry sponsors, such as Lenovo and K2Data. 
> > 
> > == Rationale ==
> > Because no existing open source time series database covers all the above requirements, we developed IoTDB. As the system matures, we are seeking a long-term home for the project. We believe the Apache Software Foundation would be an ideal fit. Joining Apache will also help coordinate and improve the development effort of the growing number of organizations that contribute to IoTDB, and will improve the diversity of our community.
> > 
> > IoTDB contains multiple modules, which are classified into categories:
> > 
> >  * '''TsFile Format''': TsFile is a new columnar file format. 
> >  * '''Adaptor for Analytics and Visualization''': Integrating TsFile with Apache Hadoop HDFS, Apache Hadoop MapReduce and Apache Spark. Examples of integrating IoTDB with Apache Kafka, Apache Storm and Grafana are also provided.
> >  * '''IoTDB Engine''': An engine which consists of a SQL parser, query plan generator, memtable, authentication and authorization, write ahead log (WAL), crash recovery, out-of-order data handler, and indexes for aggregation and pattern matching. The engine stores system data in TsFile format.
> >  * '''IoTDB JDBC''': An implementation of Java Database Connectivity (JDBC) for clients to connect to IoTDB using Java.
> > 
> > === TsFile Format ===
> > 
> > TsFile format is a columnar store, similar to Apache Parquet and Apache CarbonData. It has the concepts of Chunk Group, Column Chunk, Page and Footer. Compared with Apache Parquet and Apache CarbonData, it is designed and optimized for time series:
> > 
> > ==== Time Series Friendly Encoding ====
> > IoTDB currently supports run length encoding (RLE), delta-of-delta encoding, and Facebook's Gorilla encoding. 
> > 
> > Lossy encoding methods (e.g., Piecewise Linear Approximation (PLA) and time-frequency transformation) are works-in-progress.
> > 
> > 
> > ==== Chunk Group ====
> > The data part of a TsFile consists of many Chunk Groups. Each Chunk Group stores the data of one device over a time interval.  A Chunk Group is similar to the row group in Apache Parquet, but with some constraints on the time dimension:  for each device, the time intervals of different Chunk Groups do not overlap, and a later Chunk Group always has larger timestamps.
> > 
> > Given a TsFile and a query with a time range filter, the query process can terminate scanning data once it reads data points whose timestamp reaches the time limit of the filter. We call the feature ''fast-return'' and it makes the time range query in a TsFile very efficient.
> > 
> > 
> > 
> > ==== Different Column Chunk Format (No Repetition (R) and Definition (D) Fields Needed) ====
> > 
> > While Apache Parquet and Apache CarbonData support complex data types, e.g., nested data and sparse columns, TsFile is exclusively designed for time series whose data model is \<device_id, series_id, timestamp, value\>. 
> > 
> > In a `Chunk Group`, each time series is a `Column Chunk`. Even though these time series belong to the same device, the data points in different time series are not necessarily aligned in the time dimension.
> > 
> > For example, if a device has 2 sensors sampling at the same frequency, sensor 1 may collect data at time 1521622662000 while the other collects data at time 1521622662001 (delta=1ms). Therefore, each Column Chunk has its own timestamps and values, which is quite different from Apache Parquet and Apache CarbonData. Because different time series may have different data frequencies, we store a time column along with each value column instead of making different chunks share one time column. As a result, we do not store any null values on disk to align data across time series. Besides, we do not need to attach `repetition` (R) and `definition` (D) fields to each value. Therefore, disk space is saved and query latency is reduced (because we do not align data by calculating R and D fields).
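
The following Python sketch contrasts the two layouts for the two-sensor example above. It is only an illustration: `aligned_rows` models a shared time column that needs null padding, `column_chunks` models one time column per series, and both function names are hypothetical.

```python
from typing import Dict, List, Tuple

Point = Tuple[int, float]  # (timestamp in ms, value)


def aligned_rows(series: Dict[str, List[Point]]) -> List[tuple]:
    """Shared time column: one row per distinct timestamp, None where a series has no value."""
    names = sorted(series)
    lookup = {name: dict(points) for name, points in series.items()}
    timestamps = sorted({t for points in series.values() for t, _ in points})
    return [(t, *[lookup[name].get(t) for name in names]) for t in timestamps]


def count_nulls(rows: List[tuple]) -> int:
    """Count the padding cells that the shared time column forces us to store."""
    return sum(cell is None for row in rows for cell in row[1:])


def column_chunks(series: Dict[str, List[Point]]) -> Dict[str, Tuple[List[int], List[float]]]:
    """Per-series time column: each chunk keeps its own timestamps, so no padding is needed."""
    return {
        name: ([t for t, _ in points], [v for _, v in points])
        for name, points in series.items()
    }
```

For the 1 ms offset in the example, the aligned layout stores two rows with one null each, while the per-series layout stores two timestamps and two values and nothing else.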
> > 
> > 
> > ==== Domain Specific Information in Each Page ====
> > Similar to Apache Parquet and Apache CarbonData, a `Column Chunk` consists of several `Pages`, and each `Page` has a `Page header`. The `Page header` is a summary of the data in the page. 
> > 
> > Because TsFile is optimized for time series, the page header contains more domain specific information, such as the minimal and maximal value, the minimal and the maximal timestamp, the frequency and so on. TsFile can even store the histogram of values in the page header. 
> > 
> > This header information helps IoTDB in speeding up queries by skipping unnecessary pages.
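
A minimal Python sketch of page skipping, assuming a page header that records only the minimum and maximum timestamp and the maximum value. The `Page` class, `make_page` and `query` are hypothetical names for this example; the real header holds more statistics than shown here.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[int, float]  # (timestamp, value)


@dataclass
class Page:
    points: List[Point]
    min_time: int
    max_time: int
    max_value: float


def make_page(points: List[Point]) -> Page:
    """Build a page and compute its header statistics once, at write time."""
    times = [t for t, _ in points]
    values = [v for _, v in points]
    return Page(points, min(times), max(times), max(values))


def query(pages: List[Page], start: int, end: int, threshold: float) -> Tuple[List[Point], int]:
    """Return points in [start, end] with value > threshold, and how many pages were read."""
    hits: List[Point] = []
    pages_read = 0
    for page in pages:
        if page.max_time < start or page.min_time > end:
            continue  # header shows the page lies outside the time range
        if page.max_value <= threshold:
            continue  # header shows no value in the page can pass the filter
        pages_read += 1
        hits.extend((t, v) for t, v in page.points if start <= t <= end and v > threshold)
    return hits, pages_read
```

Only pages whose header cannot rule them out are decoded; the others are skipped without touching their data.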
> > 
> > 
> > === Adaptor for Analytics ===
> > The TsFile provides:
> > 
> >  * InputFormat/OutputFormat interfaces for Reading/Writing data.
> >  * Deep integration with Apache Spark/Hadoop MapReduce, including predicate push-down, column pruning, aggregation push-down, etc., so users can use Apache Spark SQL/HiveQL to connect to and query TsFiles.
> > 
> > 
> > === IoTDB Engine ===
> > The IoTDB Engine is a database engine which uses TsFile as its storage file format. It supports SQL-like queries plus the following features:
> > 
> >  * Tree-based time series schema
> >  * Log-Structured Merge (LSM)-based storage
> >  * Overflow file for out-of-order data
> >  * Scalable index framework
> >  * Special queries for time series
> > 
> > ==== Tree-based Time Series Schema ====
> > IoTDB manages all the time series definitions using a tree structure. A path from the root of the tree to a leaf node represents a time series. Therefore, the unique id of a time series is a path, e.g., `root.China.beijing.windFarm1.windTurbine1.speed`. 
> > 
> > This kind of schema can express `group by` naturally. For example, `root.China.beijing.windFarm1.*.speed` represents the speed of all the wind turbines in wind farm 1 in Beijing, China.
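
As a rough illustration of how a path pattern selects a group of series, here is a small Python sketch. It assumes that `*` matches exactly one level of the tree; that rule and the function name `matches` are assumptions for this example, since the proposal does not define the wildcard semantics precisely.

```python
from typing import List


def matches(pattern: str, path: str) -> bool:
    """True if every level of `path` equals the pattern level or the pattern level is '*'."""
    pattern_levels = pattern.split(".")
    path_levels = path.split(".")
    return len(pattern_levels) == len(path_levels) and all(
        p == "*" or p == s for p, s in zip(pattern_levels, path_levels)
    )


def select(pattern: str, paths: List[str]) -> List[str]:
    """Return the series paths that the pattern covers, in input order."""
    return [path for path in paths if matches(pattern, path)]
```

Applied to a list of series, `select("root.China.beijing.windFarm1.*.speed", paths)` keeps the `speed` series of every turbine in that wind farm and drops everything else.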
> > 
> > ==== Log-Structured Merge (LSM)-based Storage ====
> > In a time series, the data points should be ordered by their timestamps. IoTDB uses a Log-Structured Merge (LSM) based mechanism: a part of the data is stored in memory first, in what is called a `memtable`. At this stage, if data points arrive out of order, we re-sort them in memory. When this part of the data exceeds the configured memory limit, we flush it to disk as a `Chunk Group` in an unclosed TsFile. As a result, a TsFile may contain several Chunk Groups. This reduces the number of small data files, which in turn reduces the I/O load on the storage system and the execution time of a file merge in LSM. Notice that the data is time-ordered within one Chunk Group on disk, and this layout helps a query filter quickly within one Chunk Group.
> > 
> > Rule 1: In a TsFile, the Chunk Groups of one device are ordered by timestamp, which helps a query filter quickly among Chunk Groups.
> > 
> > Rule 2: When the size of the unclosed TsFile reaches the threshold defined in the configuration file, we close the file and create a new one to store newly arriving data spanning the entire data set. Like many systems which use LSM-based storage, we never modify a TsFile which has been closed, except during the file-merge process.
> > 
> > Rule 3: To reduce the number of TsFiles involved in a query process, we guarantee that the data points in different TsFiles do not overlap in the time dimension after file merging.
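
The write path described above can be sketched in Python as follows. This is a simplified model, not IoTDB's implementation: the memory limit and the file size threshold are replaced by simple counts (`memtable_limit` points and `groups_per_file` Chunk Groups), a single device is assumed, and the class name `Writer` is hypothetical.

```python
from typing import List, Tuple

Point = Tuple[int, float]  # (timestamp, value)


class Writer:
    def __init__(self, memtable_limit: int, groups_per_file: int) -> None:
        self.memtable_limit = memtable_limit    # stand-in for the memory limit
        self.groups_per_file = groups_per_file  # stand-in for the file size threshold
        self.memtable: List[Point] = []
        self.open_file: List[List[Point]] = []  # Chunk Groups of the unclosed file
        self.closed_files: List[List[List[Point]]] = []

    def write(self, timestamp: int, value: float) -> None:
        self.memtable.append((timestamp, value))
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self) -> None:
        # Re-sort in memory so that each Chunk Group is time-ordered on disk.
        self.open_file.append(sorted(self.memtable))
        self.memtable = []
        if len(self.open_file) >= self.groups_per_file:
            # Rule 2: a closed file is never modified again outside of merging.
            self.closed_files.append(self.open_file)
            self.open_file = []
```

Points that arrive slightly out of order within one memtable are fixed by the sort at flush time; points older than data that has already been flushed need the separate handling described in the next section.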
> > 
> > ==== Overflow File for Out-of-order Data ====
> > Once a part of the data has been flushed to disk (forming a `Chunk Group` in a TsFile), newly arriving data points whose timestamps are smaller than the largest timestamp in the TsFile are `out-of-order`. 
> > 
> > To store the out-of-order data, we organize all the troublesome `out-of-order` data point insertions into a special TsFile, named `UnSequenceTsFile`. In an UnSequenceTsFile, the Chunk Groups of one device may overlap in the time dimension, which violates Rule 1 and makes query filtering slower than on a normal TsFile.
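
A minimal Python sketch of the routing decision follows. It tracks, per device, the largest timestamp already flushed and sends strictly smaller timestamps to the unsequence path, following the wording above. The class `Router`, its method names and the string labels are assumptions for this example, and how a timestamp equal to the largest flushed one is treated is not specified in the proposal.

```python
from typing import Dict


class Router:
    def __init__(self) -> None:
        # Largest timestamp already flushed to a TsFile, per device.
        self.max_flushed: Dict[str, int] = {}

    def record_flush(self, device: str, largest_timestamp: int) -> None:
        current = self.max_flushed.get(device)
        if current is None or largest_timestamp > current:
            self.max_flushed[device] = largest_timestamp

    def route(self, device: str, timestamp: int) -> str:
        """Return 'unsequence' for out-of-order points, 'sequence' otherwise."""
        last = self.max_flushed.get(device)
        if last is not None and timestamp < last:
            return "unsequence"
        return "sequence"
```

The check is per device, so a late point from one device does not affect how points from other devices are routed.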
> >   
> > There is another special operation: updating all the data points in a time range, e.g., `update all the speed values of device1 as 0 where the data time is in [1521622000000, 1521622662000]`. The operation is needed when (1) a sensor malfunctions and the database receives wrong data for a period, or (2) we want to reset all the records. Many NoSQL time series databases do not support such an operation. To support it in IoTDB, we use a tree-based structure, a Treap, to hold these operations and store them in `Overflow` files. 
> > 
> > Therefore, there are 3 kinds of data files: TsFiles, UnSequenceTsFiles and Overflow files. TsFiles should store most of the data. The volume of UnSequenceTsFiles depends on the workload: if many data points arrive out of order and they span a long time range, the volume will be large. Overflow files handle the fewest data operations, and their volume depends on how often the special operations are used. 
> > 
> > ==== LSM-tree ====
> > Normally, LSM-based storage engines merge data files level by level so that it looks like a tree structure. In this way, data is well organized. The disadvantage is that data will be read and written several times. If the tree has 4 levels, each data point will be rewritten at least 4 times. 
> > 
> > Currently, we do not merge all the TsFiles into one because (1) the number of TsFiles is kept lower than in many LSM storage engines, since a memtable is mapped to several Chunk Groups rather than to a file; and (2) different TsFiles do not overlap with each other in the time dimension (because of Rule 3). 
> > 
> > As mentioned before, TsFile supports ''fast-return'' to accelerate queries. However, UnSequenceTsFiles and Overflow files do not support this feature. The time spans of UnSequenceTsFiles, Overflow files and TsFiles may overlap, which means more files are involved in the query process. To accelerate these queries, a merging process reorganizes files in the background. All three kinds of files (TsFiles, UnSequenceTsFiles and Overflow files) are involved in the merging process. The merging process is multi-threaded, with each thread responsible for one series family. 
> > After merging, only TsFiles are left. These files have non-overlapping time spans and support the ''fast-return'' feature. 
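
The effect of the merge on a single time series can be sketched in Python as below. This is a toy model under explicit assumptions that the proposal does not state: an out-of-order insert overrides a sequence point with the same timestamp, range updates are applied last, and the function name `merge` and its argument layout are hypothetical. Threading and file I/O are left out.

```python
from typing import Dict, List, Tuple

Point = Tuple[int, float]               # (timestamp, value)
RangeUpdate = Tuple[int, int, float]    # (start, end, new_value), inclusive range


def merge(
    sequence: List[Point],
    unsequence: List[Point],
    updates: List[RangeUpdate],
) -> List[Point]:
    """Combine the three kinds of data into one time-ordered series."""
    merged: Dict[int, float] = dict(sequence)
    merged.update(unsequence)  # assumption: the out-of-order insert wins on a timestamp clash
    for start, end, new_value in updates:
        for timestamp in merged:
            if start <= timestamp <= end:
                merged[timestamp] = new_value  # assumption: updates are applied last
    return sorted(merged.items())
```

The result is time-ordered with one value per timestamp, which is the property that lets a merged TsFile support ''fast-return'' again.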
> > 
> > ==== Scalable Index Framework ====
> > We allow users to implement indexes for faster queries. We currently support an index for pattern-matching queries (KV-Match index, ICDE 2019). Another index for fast aggregation (PISA index, CIKM 2016) is a work in progress. 
> > 
> > ==== Special Queries ====
> > We currently support `group by time interval` aggregation queries and `Fill by` operations, which are similar to those of InfluxDB. Time series segmentation operations and frequency queries are works in progress.
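
To make the two query types concrete, here is a Python sketch of averaging per time interval followed by a fill step. It is not IoTDB's query syntax or engine: the function names are hypothetical, the average is just one possible aggregate, and filling with the previous value is one possible fill policy chosen for the example.

```python
from typing import List, Optional, Tuple

Point = Tuple[int, float]              # (timestamp, value)
Row = Tuple[int, Optional[float]]      # (interval start, aggregate or None)


def group_by_time(points: List[Point], start: int, end: int, interval: int) -> List[Row]:
    """Average the values in each interval [b, b + interval) between start and end."""
    buckets = {b: [] for b in range(start, end, interval)}
    for timestamp, value in points:
        if start <= timestamp < end:
            bucket = start + (timestamp - start) // interval * interval
            buckets[bucket].append(value)
    return [(b, sum(vs) / len(vs) if vs else None) for b, vs in buckets.items()]


def fill_previous(rows: List[Row]) -> List[Row]:
    """Replace an empty interval with the most recent non-empty aggregate."""
    filled: List[Row] = []
    last: Optional[float] = None
    for bucket, value in rows:
        if value is None:
            value = last
        else:
            last = value
        filled.append((bucket, value))
    return filled
```

An interval with no data points yields `None` from `group_by_time`; the fill step then decides what a client sees for that interval.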
> > 
> > == Initial Goals ==
> > The initial goals are to open source the project and to integrate with the Apache development process. Furthermore, we plan for incremental development and releases in line with the Apache guidelines.
> > 
> > == Current Status ==
> > We have developed the system for more than 2 years. There are currently 13k lines of code, some of which are generated by Antlr3 and Thrift.  There are 230 issues which have been solved and more than 1500 commits.  
> > 
> > The system has been deployed in the staging environment of the State Grid Corporation of China to handle ~3 million time series (i.e., ~30,000 power generation assemblies * ~100 sensors) and at an equipment service company in China to manage ~2 million time series (i.e., ~20,000 devices * ~100 sensors). The insertion speed reaches ~2 million points/second/node, which is faster than InfluxDB, OpenTSDB and Apache Cassandra in our environment.
> > 
> > There are many new features in the works including those mentioned herein. We will add more analytics functions, improve the data file merge process, and finish the first released version of IoTDB. 
> > 
> > == Meritocracy ==
> > The IoTDB project operates on meritocratic principles. Developers who submit more code of higher quality earn more merit. We have used the `Issues` and `Pull Requests` modules on GitHub to collect users' suggestions and patches. Users who submit issues, pull requests and documents, or who help with community management, are welcomed and encouraged to become committers.
> > 
> > == Community ==
> > 
> > The IoTDB project users communicate on GitHub (https://github.com/thulab/tsfile). Developers communicate on a website similar to JIRA (currently, only registered users can apply for access to the project; URL: https://tower.im/projects/36de8571a0ff4833ae9d7f1c5c400c22/). We have also introduced IoTDB at many technical conferences. Next, we will set up a mailing list for more convenient, broader communication and archived discussions. 
> > 
> > If IoTDB is accepted for incubation at the Apache Software Foundation, the primary goal is to build a larger community. We believe that IoTDB will become a key project for time series data management, and so, we will rely on a large community of users and developers.
> > 
> > TODO: IoTDB is currently in a private GitHub repository (https://github.com/thulab/iotdb), while its subproject TsFile (a file format for storing time series data) is open sourced on GitHub (https://github.com/thulab/tsfile).
> > 
> > == Core Developers ==
> > IoTDB was initially developed by 2 dozen students and teachers at Tsinghua University. Since then, more developers have joined from other universities in China: Fudan University, Northwestern Polytechnical University and Harbin Institute of Technology. Other developers come from companies such as Lenovo and Microsoft. We will work to bring more developers into the project to contribute to IoTDB.
> > 
> > == Relationships with Other Apache Products ==
> > IoTDB requires some Apache products (Apache Thrift, commons, collections, httpclient). 
> > 
> > IoTDB-Spark-connector and IoTDB-Hadoop-connector have been developed to support analyzing time series data with Apache Spark and MapReduce. 
> > 
> > Overall, IoTDB is designed as an open architecture, and it can be integrated with many other systems in the future.
> > 
> > As mentioned before, in the IoTDB project, we designed a new columnar file format, called TsFile, which is similar to Apache Parquet. However, the new file format is optimized for time series data. 
> > 
> > 
> > 
> > == Known Risks ==
> > 
> > === Orphaned Products ===
> > Given the current level of investment in IoTDB, the risk of the project being abandoned is minimal. Time series data is becoming more and more important, and there are several constituents who are highly motivated to continue development. Tsinghua and the NEL-BDS Lab rely on IoTDB as a platform for a large number of long-term research projects. We have deployed IoTDB in some companies' staging environments for future applications.
> > 
> > === Inexperience with Open Source ===
> > Students and researchers at Tsinghua University have been developing and using open source software for a long time. It is wonderful for students to be guided into a formal open-source process. Some of our committers have experience contributing to open source, for example:
> > 
> >  * druid: https://github.com/druid-io/druid/commit/f18cc5df97e5826c2dd8ffafba9fcb69d10a4d44
> >  * druid: https://github.com/druid-io/druid/commit/aa7aee53ce524b7887b218333166941654788794
> >  * YCSB: https://github.com/brianfrankcooper/YCSB/pull/776
> > 
> > 
> > Additionally, several ASF veterans and industry veterans have agreed to mentor the project and are listed in this proposal. The project will rely on their guidance and collective wisdom to quickly transition the entire team of initial committers towards practicing the Apache Way.
> > 
> > 
> > === Reliance on Salaried Developers ===
> > Most of the current developers are students and researchers/professors at universities whose research focuses on big data management and analytics. It is unlikely that they will change their research focus away from big data management. We will work to ensure that the project can continue to be stewarded and to move forward independently of salaried developers.
> > 
> > === An Excessive Fascination with the Apache Brand ===
> > Most of the initial developers come from Tsinghua University and have no intent to use the Apache brand for profit. We have no plans to use the Apache brand in press releases or to post billboards advertising the acceptance of IoTDB into the Apache Incubator.
> > 
> > 
> > == Initial Source ==
> > IoTDB's GitHub addresses and some required dependencies: 
> > 
> >  * The storage file format: https://github.com/thulab/tsfile
> >  * Adaptor for Apache Hadoop MapReduce: https://github.com/thulab/tsfile-hadoop-connector
> >  * Adaptor for Apache Spark: https://github.com/thulab/tsfile-spark-connector
> >  * Adaptor for Grafana: https://github.com/thulab/iotdb-grafana
> >  * The database engine: https://github.com/thulab/iotdb (private project up to now)
> >  * The client driver: https://github.com/thulab/iotdb-jdbc
> > 
> > 
> > 
> > === External Dependencies ===
> > To the best of our knowledge, all dependencies of IoTDB are distributed under Apache compatible licenses. Upon acceptance to the incubator, we would begin a thorough analysis of all transitive dependencies to verify this fact and introduce license checking into the build and release process.
> > 
> > == Documentation ==
> >  * Documentation for TsFile: https://github.com/thulab/tsfile/wiki
> >  * Documentation for IoTDB and its JDBC: http://tsfile.org/document (Chinese only. An English version is in progress.)
> > 
> > == Required Resources ==
> > === Mailing Lists ===
> >  * private@iotdb.incubator.apache.org
> >  * dev@iotdb.incubator.apache.org
> >  * commits@iotdb.incubator.apache.org
> > 
> > 
> > === Git Repositories ===
> >  * https://git-wip-us.apache.org/repos/asf/incubator-iotdb.git
> > 
> > 
> > === Issue Tracking ===
> >  *  JIRA IoTDB (We currently use the issue management provided by Github to track issues.)
> > 
> > 
> > == Initial Committers ==
> > Tsinghua University, K2Data Company, Lenovo, Microsoft
> > 
> > Jianmin Wang (jimwang at tsinghua dot edu dot cn)
> > 
> > Xiangdong Huang (sainthxd at gmail dot com)
> > 
> > Jun Yuan (richard_yuan16 at 163 dot com)
> > 
> > Chen Wang (wang_chen at tsinghua dot edu dot cn)
> > 
> > Jialin Qiao (qjl16 at mails dot tsinghua dot edu dot cn)
> > 
> > Jinrui Zhang (jinrzhan at microsoft dot com)
> > 
> > Rong Kang (kr11 at mails dot tsinghua dot edu dot cn)
> > 
> > Tian Jiang (jiangtia18 at mails dot tsinghua dot edu dot cn)
> > 
> > Shuo Zhang (zhangshuo at k2data dot com dot cn)
> > 
> > Lei Rui (rl18 at mails dot tsinghua dot edu dot cn)
> > 
> > Rui Liu (liur17 at mails dot tsinghua dot edu dot cn)
> > 
> > Kun Liu (liukun16 at mails dot tsinghua dot edu dot cn)
> > 
> > Gaofei Cao (cgf16 at mails dot tsinghua dot edu dot cn)
> > 
> > Xinyi Zhao (xyzhao16 at mails dot tsinghua dot edu dot cn)
> > 
> > Dongfang Mao (maodf17 at mails dot tsinghua dot edu dot cn)
> > 
> > Tianan Li (lta18 at mails dot tsinghua dot edu dot cn)
> > 
> > Yue Su (suy18 at mails dot tsinghua dot edu dot cn)
> > 
> > Hui Dai (daihui_iot at lenovo dot com, yuct_iot at lenovo dot com)
> > 
> > == Sponsors ==
> > === Champion ===
> > Kevin A. McGrail (kmcgrail@apache.org)
> > 
> > === Nominated Mentors ===
> > Justin Mclean (justin at classsoftware dot com)
> > 
> > Christofer Dutz (christofer.dutz at c-ware dot de)
> > 
> > Willem Jiang (willem.jiang at gmail dot com)
> > 
> > 


Re: [Result][Vote] vote for IoTDB incubation proposal

Posted by "Kevin A. McGrail" <km...@apache.org>.
Congratulations!  As champion, I think the next steps are:

1 - Xiangdong, can you confirm that the list of mentors on the proposal is
accurate?

2 - Also Xiangdong, is there anyone else who stepped forward as a mentor
during the voting process that the project wants the IPMC to approve?

3 - Justin, I think you have to request the creation of the podling, and
then I as champion work on things like the metadata file from this page,
https://incubator.apache.org/policy/incubation.html, correct?

Regards,
KAM




--
Kevin A. McGrail
VP Fundraising, Apache Software Foundation
Chair Emeritus Apache SpamAssassin Project
https://www.linkedin.com/in/kmcgrail - 703.798.0171


On Wed, Nov 14, 2018 at 6:29 AM hxd <hx...@qq.com> wrote:

> Hi,
>
> With 8 +1 binding votes, 2 +1 non-binding votes and no +/-0 or -1 votes,
> this VOTE passes.
>
> Thanks to everyone who voted!
>
> Below is a voting tally:
>
> Binding
> Von Gosling
>  Christofer Dutz
>  Kevin A. McGrail
>  Felix Cheung
>  Matt Sticker
>  Joe Witt
>  Justin Mclean
>  Willem Jiang
>
>
> Non-binding
>  Sheng Wu
>  Yang Bo
>
> The vote thread:
> https://lists.apache.org/thread.html/077f029ab2b52a2b19fc8d41c07438f660a8e93dd87b3895d262263c@%3Cgeneral.incubator.apache.org%3E
>
> The proposal: https://wiki.apache.org/incubator/IoTDBProposal
>
> Thanks,
>
> Xiangdong Huang
>
>
> > On Nov 7, 2018, at 3:46 PM, hxd <hx...@qq.com> wrote:
> >
> > Hi,
> >
> > Sorry for the badly formatted previous mail.
> > I'd like to call a VOTE to accept the IoTDB project, a database for managing
> large amounts of time series data from IoT sensors in industrial
> applications, into the Apache Incubator.
> > The full proposal is available on the wiki:
> https://wiki.apache.org/incubator/IoTDBProposal
> > and it is also attached below for your convenience.
> >
> > Please cast your vote:
> >
> >   [ ] +1, bring IoTDB into Incubator
> >   [ ] +0, I don't care either way,
> >   [ ] -1, do not bring IoTDB into Incubator, because...
> >
> > The vote will be open for at least 72 hours.
> >
> > Thanks,
> > Xiangdong Huang.
> >
> >
> > = IoTDB Proposal  =
> > v0.1.1
> >
> >
> > == Abstract ==
> > IoTDB is a data store for managing large amounts of time series data
> such as timestamped data from IoT sensors in industrial applications.
> >
> > == Proposal ==
> > IoTDB is a database for managing large amounts of time series data with
> columnar storage, data encoding, pre-computation, and index techniques. It
> has an SQL-like interface to write millions of data points per second per
> node and is optimized to return query results in a few seconds over
> trillions of data points. It can also be easily integrated with Apache
> Hadoop MapReduce and Apache Spark for analytics.
> >
> > == Background ==
> >
> > A new class of data management system requirements is becoming
> increasingly important with the rise of the Internet of Things. There are
> some database systems and technologies aimed at time series data
> management. For example, Gorilla and InfluxDB are mainly built for
> data centers and monitoring application metrics. Other systems, for
> example, OpenTSDB and KairosDB, are built on Apache HBase and Apache
> Cassandra, respectively.
> >
> > However, many applications for time series data management, especially
> industrial applications, have further requirements as follows:
> >
> >  * Supporting time series data which has high data frequency. For
> example, a turbine engine may generate 1000 points per second (i.e.,
> 1000Hz), while each CPU only reports 1 data point every 5 seconds in a data
> center monitoring application.
> >
> >  * Supporting scanning data at multiple resolutions. For example,
> aggregation operations are important for time series data.
> >
> >  * Supporting special queries for time series, such as pattern matching,
> time series segmentation, time-frequency transformation and frequency query.
> >
> >  * Supporting a large number of monitoring targets (i.e. time series).
> An excavator may report more than 1000 time series, for example, revolving
> speed of the motor-engine, the speed of the excavator, the accelerated
> speed, the temperature of the water tank and so on, while a CPU or an
> application monitor has much fewer time series.
> >
> >  * Optimization for out-of-order data points. In the industrial sector,
> it is common that equipment sends data using the UDP protocol rather than
> the TCP protocol. Sometimes, the network connection is unstable and parts
> of the data will be buffered for later sending.
> >
> >  * Supporting long-term storage. Historical data is precious for
> equipment manufacturers. Therefore, removing or unloading historical data
> is highly undesirable for most industrial applications. The database system
> must not only support fast retrieval of historical data, but should also
> guarantee that the historical data does not impact the processing speed for
> "hot" or current data.
> >
> >  * Supporting online transaction processing (OLTP) as well as complex
> analytics. Analyzing the data files directly with Apache Spark/Apache
> Hadoop MapReduce is clearly better than transforming them into another
> file format for Big Data analytics.
> >
> >  * Flexible deployment either on premise or in the cloud. IoTDB is
> simple enough to be deployed on a Raspberry Pi handling hundreds of time
> series. Meanwhile, the system can also be deployed in the cloud so that it
> supports tens of millions of ingestions per second, OLTP queries in
> milliseconds, and analytics using Apache Spark/Apache Hadoop MapReduce.
> >
> >  * * (1) If users deploy IoTDB on a device, such as a Raspberry Pi, a
> wind turbine, or a meteorological station, the deployment of the chosen
> database is designed to be simple. A device may have hundreds of time
> series (but fewer than a thousand) and the database needs to
> handle them.
> >  * * (2) When deploying IoTDB in a data center, the computational
> resources (i.e., the hardware configuration of servers) are not a problem
> when compared to a Raspberry Pi. In this deployment, IoTDB can use more
> computational resources, and has the ability to handle more time series
> (e.g., millions of time series).
> >
> > Based on these requirements, we developed IoTDB, a new data store system
> for managing time series data.
> >
> > IoTDB started as a Tsinghua University research project. IoTDB's
> developer community has since grown to include additional institutions, for
> example, universities (e.g., Fudan University), research labs (e.g., NEL-BDS
> lab), and corporations (e.g., K2Data, Tencent). Funding has been provided
> by various institutions including the National Natural Science Foundation
> of China, and industry sponsors, such as Lenovo and K2Data.
> >
> > == Rationale ==
> > Because no existing open-source time series database covers
> all the above requirements, we developed IoTDB. As the system matures, we
> are seeking a long-term home for the project. We believe the Apache
> Software Foundation would be an ideal fit. Joining Apache will also help
> coordinate and improve the development effort of the growing number of
> organizations which contribute to IoTDB, improving the diversity of our
> community.
> >
> > IoTDB contains multiple modules, which are classified into categories:
> >
> >  * '''TsFile Format''': TsFile is a new columnar file format.
> >  * '''Adaptor for Analytics and Visualization''': Integrating TsFile
> with Apache Hadoop HDFS, Apache Hadoop MapReduce and Apache Spark. Examples
> of integrating IoTDB with Apache Kafka, Apache Storm and Grafana are also
> provided.
> >  * '''IoTDB Engine''': An engine which consists of an SQL parser, query
> plan generator, memtable, authentication and authorization, write-ahead log
> (WAL), crash recovery, out-of-order data handler, and indexes for
> aggregation and pattern matching. The engine stores system data in TsFile
> format.
> >  * '''IoTDB JDBC''': An implementation of Java Database Connectivity
> (JDBC) for clients to connect to IoTDB using Java.
> >
> > === TsFile Format ===
> >
> > TsFile format is a columnar store similar to Apache Parquet
> and Apache CarbonData. It has the concepts of Chunk Group, Column Chunk,
> Page and Footer. Compared with Apache Parquet and Apache CarbonData, it is
> designed and optimized for time series in the following ways:
> >
> > ==== Time Series Friendly Encoding ====
> > IoTDB currently supports run length encoding (RLE), delta-of-delta
> encoding, and Facebook's Gorilla encoding.
> >
> > Lossy encoding methods (e.g., Piecewise Linear Approximation (PLA) and
> time-frequency transformation) are works in progress.
> >
> >
> > ==== Chunk Group ====
> > The data part of a TsFile consists of many Chunk Groups. Each Chunk
> Group stores the data of one device over a time interval. A Chunk Group is
> similar to a row group in Apache Parquet, but with additional
> constraints on the time dimension: for each device, the time intervals of
> different Chunk Groups do not overlap, and a later Chunk Group always
> holds larger timestamps.
> >
> > Given a TsFile and a query with a time range filter, the query process
> can terminate scanning data once it reads data points whose timestamp
> reaches the time limit of the filter. We call the feature ''fast-return''
> and it makes the time range query in a TsFile very efficient.
> >
> >
> >
> > ==== Different Column Chunk Format (No Repetition (R) or
> Definition (D) Fields Needed) ====
> >
> > While Apache Parquet and Apache CarbonData support complex data types,
> e.g., nested data and sparse columns, TsFile is exclusively designed for
> time series whose data model is \<device_id, series_id, timestamp, value\>.
> >
> > In a `Chunk Group`, each time series is a `Column Chunk`. Even though
> these time series belong to the same device, the data points in different
> time series are not aligned in the time dimension originally.
> >
> > For example, if a device has 2 sensors sampling at the same
> frequency, sensor 1 may collect data at time 1521622662000
> while the other collects data at time 1521622662001 (delta=1ms).
> Therefore, each Column Chunk has its own timestamps and values, which is
> quite different from Apache Parquet and Apache CarbonData. Because
> different time series may have different data frequencies, we store a
> time column along with each value column instead of making different chunks
> share one time column. As a result, we do not store any null values on
> disk to align data across time series. Besides, we do not need to attach
> `repetition` (R) and `definition` (D) fields to each value. Therefore,
> disk space is saved and query latency is reduced (because we do not align
> data by calculating R and D fields).
> >
> >
> > ==== Domain Specific Information in Each Page ====
> > Similar to Apache Parquet and Apache CarbonData, a `Column Chunk`
> consists of several `Pages`, and each `Page` has a `Page header`. The `Page
> header` is a summary of the data in the page.
> >
> > Because TsFile is optimized for time series, the page header contains
> more domain specific information, such as the minimal and maximal value,
> the minimal and the maximal timestamp, the frequency and so on. TsFile can
> even store the histogram of values in the page header.
> >
> > This header information helps IoTDB in speeding up queries by skipping
> unnecessary pages.
> >
> >
> > === Adaptor for Analytics ===
> > The TsFile provides:
> >
> >  * InputFormat/OutputFormat interfaces for Reading/Writing data.
> >  * Deep integration with Apache Spark/Hadoop MapReduce including
> predicate push-down, column pruning, aggregation push down, etc. So users
> can use Apache Spark SQL/HiveQL to connect and query TsFiles.
> >
> >
> > === IoTDB Engine ===
> > The IoTDB engine is a database engine, which uses TsFile as its storage
> file format. The IoTDB Engine supports SQL-like query plus many useful
> functions:
> >
> >  * Tree-based time series schema
> >  * Log-Structured Merge (LSM)-based storage
> >  * Overflow file for out-of-order data
> >  * Scalable index framework
> >  * Special queries for time series
> >
> > ==== Tree-based Time Series Schema ====
> > IoTDB manages all the time series definitions using a tree structure. A
> path from the root of the tree to a leaf node represents a time series.
> Therefore, the unique id of a time series is a path, e.g.,
> `root.China.beijing.windFarm1.windTurbine1.speed`.
> >
> > This kind of schema can express `group by` naturally. For example,
> `root.China.beijing.windFarm1.*.speed` represents the speed of all the wind
> turbines in wind farm 1 in Beijing, China.
> >
> > ==== Log-Structured Merge (LSM)-based Storage ====
> > In a time series, the data points should be ordered by their timestamps.
> IoTDB uses a Log-Structured Merge (LSM) based mechanism: a part of the
> data is stored in memory first, in what is called a `memtable`.
> At this stage, if data points arrive out of order, we re-sort them in
> memory. When this part of the data exceeds the configured memory limit, we
> flush it to disk as a `Chunk Group` in an unclosed TsFile. As a result, a
> TsFile may contain several Chunk Groups. This reduces the number of small
> data files, which in turn reduces the I/O load on the storage system and
> the execution time of a file merge in LSM. Notice that the data is
> time-ordered within one Chunk Group on disk, and this layout helps a query
> filter quickly within one Chunk Group.
> >
> > Rule 1: In a TsFile, the Chunk Groups of one device are ordered by
> timestamp (Rule 1), and it is helpful for fast filtering among Chunk Groups
> for a query.
> >
> > Rule 2: When the size of the unclosed TsFile reaches the threshold
> defined in the configuration file, we close the file and generate a new one
> to store newly arriving data spanning the entire data set. Like many
> systems that use LSM-based storage, we never modify a closed TsFile except
> during the file-merge process.
> >
> > Rule 3: To reduce the number of TsFiles involved in a query, we
> guarantee that the data points in different TsFiles do not overlap in
> the time dimension after file merging.
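
The write path and Rules 1 and 2 above can be sketched roughly as follows. This is a toy in-memory model, not the IoTDB implementation: the class name `SequenceWriter` is hypothetical, and both thresholds are counted in points and Chunk Groups here, whereas the proposal describes a memory limit and a file-size threshold.

```python
class SequenceWriter:
    """Toy model: a memtable that is flushed as time-ordered Chunk Groups."""

    def __init__(self, memtable_limit, file_limit):
        self.memtable_limit = memtable_limit  # assumed: max points held in memory
        self.file_limit = file_limit          # assumed: max Chunk Groups per TsFile
        self.memtable = []                    # unsorted (timestamp, value) pairs
        self.unclosed = []                    # Chunk Groups of the unclosed TsFile
        self.closed_files = []                # closed TsFiles, never modified here

    def insert(self, timestamp, value):
        self.memtable.append((timestamp, value))
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Points that arrived out of order in memory are re-sorted before flushing.
        chunk_group = sorted(self.memtable)
        self.memtable = []
        self.unclosed.append(chunk_group)
        # Rule 2: close the file once it reaches the configured threshold.
        if len(self.unclosed) >= self.file_limit:
            self.closed_files.append(self.unclosed)
            self.unclosed = []


writer = SequenceWriter(memtable_limit=3, file_limit=2)
for ts in [3, 1, 2, 6, 4, 5, 7]:
    writer.insert(ts, ts * 10)
```

After these inserts, one closed file holds two time-ordered Chunk Groups, and the last point is still waiting in the memtable.
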
> >
> > ==== Overflow File for Out-of-order Data ====
> > When a part of the data has been flushed to disk (forming a `Chunk Group`
> in a TsFile), newly arriving data points whose timestamps are smaller than
> the largest timestamp in the TsFile are `out-of-order`.
> >
> > To store the out-of-order data, we collect all the troublesome
> `out-of-order` data point insertions into a special TsFile, named
> `UnSequenceTsFile`. In an UnSequenceTsFile, the Chunk Groups of one device
> may overlap in the time dimension, which violates Rule 1 and makes query
> filtering slower than in a normal TsFile.
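
As a rough illustration of the routing decision described above, the sketch below separates incoming points by comparing each timestamp with the largest timestamp already flushed. The function name `route` is hypothetical, and treating a point whose timestamp equals the flushed maximum as in-order is an assumption of this sketch.

```python
def route(points, flushed_max):
    """Split incoming points into in-order and out-of-order lists.

    `flushed_max` is the largest timestamp already flushed to the TsFile.
    """
    sequence, unsequence = [], []
    for timestamp, value in points:
        if timestamp < flushed_max:
            # Would be written to the UnSequenceTsFile.
            unsequence.append((timestamp, value))
        else:
            # Would go through the normal memtable and TsFile path.
            sequence.append((timestamp, value))
    return sequence, unsequence


incoming = [(105, 1.0), (98, 2.0), (110, 3.0), (42, 4.0)]
seq, unseq = route(incoming, flushed_max=100)
```
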
> >
> > There is another special operation: updating all the data points in a
> time range, e.g., `update all the speed values of device1 to 0 where the
> data time is in [1521622000000, 1521622662000]`. The operation is used
> when: (1) a sensor malfunctions and the database receives wrong data for a
> period; (2) we want to reset all the records. Many NoSQL time series
> databases do not support such an operation. To support it in IoTDB, we
> use a tree-based structure, Treap, to hold these operations and store
> them as `Overflow` files.
> >
> > Therefore, there are 3 kinds of data files: TsFiles, UnSequenceTsFiles
> and Overflow files. TsFiles should store most of the data. The volume of
> UnSequenceTsFiles depends on the workload: if there are many out-of-order
> data points and they span a large time range, the volume will be large.
> Overflow files handle the fewest data operations; their volume depends on
> how often the special operations are used.
> >
> > ==== LSM-tree ====
> > Normally, LSM-based storage engines merge data files level by level so
> that it looks like a tree structure. In this way, data is well organized.
> The disadvantage is that data will be read and written several times. If
> the tree has 4 levels, each data point will be rewritten at least 4 times.
> >
> > Currently, we do not merge all the TsFiles into one because (1) the
> number of TsFiles is kept lower than many LSM storage engines because a
> memtable is mapped to several Chunk Groups rather than a file; (2)
> different TsFiles are not overlapping with each other in the time dimension
> (because of Rule 3).
> >
> > As mentioned before, TsFile supports ''fast-return'' to accelerate
> queries. However, UnSequenceTsFiles and Overflow files do not support this
> feature. The time spans of UnSequenceTsFiles, Overflow files and TsFiles may
> overlap, which leads to more files being involved in the query process. To
> accelerate these queries, a merging process reorganizes files in
> the background. All three kinds of files (TsFiles, UnSequenceTsFiles
> and Overflow files) are involved in the merging process. The merging
> process is multi-threaded, with each thread responsible for a series
> family.
> > After merging, only TsFiles are left. These files have non-overlapping
> time spans and support the ''fast-return'' feature.
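
The effect of the merging process can be sketched as below. This is a single-threaded, in-memory simplification for one time series, not the actual merge algorithm: the function name `merge`, the `points_per_file` parameter, and the rule that an out-of-order point overwrites a point with the same timestamp are all assumptions made for illustration.

```python
def merge(sequence_files, unsequence_points, updates, points_per_file):
    """Merge all three kinds of data into non-overlapping, time-ordered files.

    sequence_files: list of files, each a list of (timestamp, value) pairs
    unsequence_points: out-of-order (timestamp, value) pairs
    updates: (start, end, new_value) range updates, applied in order
    """
    merged = {}
    for data_file in sequence_files:
        merged.update(data_file)
    # Assumption: an out-of-order point overwrites a point with the same timestamp.
    merged.update(unsequence_points)
    # Apply range updates (the Overflow data) to every point inside the range.
    for start, end, new_value in updates:
        for timestamp in merged:
            if start <= timestamp <= end:
                merged[timestamp] = new_value
    ordered = sorted(merged.items())
    # Cut the ordered points into files whose time spans cannot overlap.
    return [ordered[i:i + points_per_file]
            for i in range(0, len(ordered), points_per_file)]


files = merge(
    sequence_files=[[(1, 5.0), (4, 6.0)], [(7, 7.0), (9, 8.0)]],
    unsequence_points=[(3, 5.5), (8, 7.5)],
    updates=[(7, 9, 0.0)],
    points_per_file=3,
)
```

Each resulting file covers a time span that ends before the next one begins, which is the property Rule 3 asks for.
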
> >
> > ==== Scalable Index Framework ====
> > We allow users to implement indexes for faster queries. We currently
> support an index for pattern-matching queries (KV-Match index, ICDE 2019).
> Another index for fast aggregation (PISA index, CIKM 2016) is a
> work in progress.
> >
> > ==== Special Queries ====
> > We currently support `group by time interval` aggregation queries and
> `Fill by` operations, which are similar to those of InfluxDB. Time series
> segmentation operations and frequency queries are works in progress.
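
To make the two query types concrete, here is a small sketch of their semantics in plain Python. It does not show IoTDB query syntax; the function names, the choice of an average as the aggregate, and the previous-value fill strategy are assumptions chosen for illustration.

```python
def group_by_interval(points, start, end, interval):
    """Average the values in each [t, t + interval) bucket over [start, end)."""
    buckets = {t: [] for t in range(start, end, interval)}
    for timestamp, value in points:
        if start <= timestamp < end:
            bucket = start + (timestamp - start) // interval * interval
            buckets[bucket].append(value)
    return [(t, sum(v) / len(v) if v else None)
            for t, v in sorted(buckets.items())]


def fill_previous(rows):
    """Replace missing values with the most recent earlier value, if any."""
    filled, last = [], None
    for timestamp, value in rows:
        if value is not None:
            last = value
        filled.append((timestamp, last))
    return filled


points = [(0, 1.0), (5, 3.0), (25, 4.0)]
averages = group_by_interval(points, start=0, end=30, interval=10)
filled = fill_previous(averages)
```

The bucket starting at 10 has no data, so it is empty after aggregation and takes the value of the preceding bucket after the fill.
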
> >
> > == Initial Goals ==
> > The initial goals are to be open sourced and to integrate with the
> Apache development process. Furthermore, we plan for incremental
> development and releases in line with the Apache guidelines.
> >
> > == Current Status ==
> > We have developed the system for more than 2 years. There are currently
> 13k lines of code, some of which are generated by Antlr3 and Thrift. So far,
> 230 issues have been resolved and there are more than 1500 commits.
> >
> > The system has been deployed in the staging environment of the State
> Grid Corporation of China to handle ~3 million time series (i.e., ~30,000
> power generation assemblies * ~100 sensors) and at an equipment service
> company in China managing ~2 million time series (i.e., ~20k devices * 100
> sensors). The insertion speed reaches ~2 million points/second/node, which
> is faster than InfluxDB, OpenTSDB and Apache Cassandra in our environment.
> >
> > There are many new features in the works including those mentioned
> herein. We will add more analytics functions, improve the data file merge
> process, and finish the first released version of IoTDB.
> >
> > == Meritocracy ==
> > The IoTDB project operates on meritocratic principles. Developers who
> submit more code with higher quality earn more merit. We have used the
> `Issues` and `Pull Requests` modules on Github to collect users'
> suggestions and patches. Users who submit issues, pull requests or
> documents, or who help with community management, are welcomed and
> encouraged to become committers.
> >
> > == Community ==
> >
> > The IoTDB project users communicate on Github
> (https://github.com/thulab/tsfile). Developers communicate on a website
> similar to JIRA, https://tower.im/projects/36de8571a0ff4833ae9d7f1c5c400c22/
> (currently, only registered users can apply to access the project for
> communication). We have also introduced IoTDB at many technical
> conferences. Next, we will set up a mailing list for more convenient,
> broader communication and archived discussions.
> >
> > If IoTDB is accepted for incubation at the Apache Software Foundation,
> the primary goal is to build a larger community. We believe that IoTDB will
> become a key project for time series data management, and so, we will rely
> on a large community of users and developers.
> >
> > TODO: IoTDB is currently in a private Github repository
> (https://github.com/thulab/iotdb), while its subproject TsFile (a file
> format for storing time series data) is open sourced on Github
> (https://github.com/thulab/tsfile).
> >
> > == Core Developers ==
> > IoTDB was initially developed by two dozen students and teachers at
> Tsinghua University. Since then, more developers have joined from
> other universities in China: Fudan University, Northwestern Polytechnical
> University and Harbin Institute of Technology. Other developers come from
> companies such as Lenovo and Microsoft. We will work to bring more
> developers into the project to contribute to IoTDB.
> >
> > == Relationships with Other Apache Products ==
> > IoTDB requires some Apache products (Apache Thrift, commons,
> collections, httpclient).
> >
> > IoTDB-Spark-connector and IoTDB-Hadoop-connector have been developed to
> support analysing time series data with Apache Spark and MapReduce.
> >
> > Overall, IoTDB is designed as an open architecture, and it can be
> integrated with many other systems in the future.
> >
> > As mentioned before, in the IoTDB project, we designed a new columnar
> file format, called TsFile, which is similar to Apache Parquet. However,
> the new file format is optimized for time series data.
> >
> >
> >
> > == Known Risks ==
> >
> > === Orphaned Products ===
> > Given the current level of investment in IoTDB, the risk of the project
> being abandoned is minimal. Time series data is increasingly important, and
> there are several constituents who are highly motivated to continue
> development. Tsinghua and NEL-BDS Lab rely on IoTDB as a platform for a
> large number of long-term research projects. We have deployed IoTDB in some
> companies' staging environments for future applications.
> >
> > === Inexperience with Open Source ===
> > Students and researchers at Tsinghua University have been developing and
> using open source software for a long time. It is wonderful for students to
> be guided through a formal open-source process. Some of our committers
> have experience contributing to open source, for example:
> >
> >  * druid: https://github.com/druid-io/druid/commit/f18cc5df97e5826c2dd8ffafba9fcb69d10a4d44
> >
> >  * druid: https://github.com/druid-io/druid/commit/aa7aee53ce524b7887b218333166941654788794
> >
> >  * YCSB: https://github.com/brianfrankcooper/YCSB/pull/776
> >
> >
> > Additionally, several ASF veterans and industry veterans have agreed to
> mentor the project and are listed in this proposal. The project will rely
> on their guidance and collective wisdom to quickly transition the entire
> team of initial committers towards practicing the Apache Way.
> >
> >
> > === Reliance on Salaried Developers ===
> > Most of the current developers are students and researchers/professors at
> universities, and their research focuses on big data management and
> analytics. It is unlikely that they will change their research focus away
> from big data management. We will work to ensure that the project can
> continue to be stewarded and to move forward independently of salaried
> developers.
> >
> > === An Excessive Fascination with the Apache Brand ===
> > Most of the initial developers come from Tsinghua University with no
> intent to use the Apache brand for profit. We have no plans to make use
> of the Apache brand in press releases or to post billboards advertising
> acceptance of IoTDB into the Apache Incubator.
> >
> >
> > == Initial Source ==
> > IoTDB's github address and some required dependencies:
> >
> >  * The storage file format: https://github.com/thulab/tsfile
> >
> >  * Adaptor for Apache Hadoop MapReduce: https://github.com/thulab/tsfile-hadoop-connector
> >
> >  * Adaptor for Apache Spark: https://github.com/thulab/tsfile-spark-connector
> >
> >  * Adaptor for Grafana: https://github.com/thulab/iotdb-grafana
> >
> >  * The database engine: https://github.com/thulab/iotdb (private project up to now)
> >
> >  * The client driver: https://github.com/thulab/iotdb-jdbc
> >
> >
> >
> > === External Dependencies ===
> > To the best of our knowledge, all dependencies of IoTDB are distributed
> under Apache compatible licenses. Upon acceptance to the incubator, we
> would begin a thorough analysis of all transitive dependencies to verify
> this fact and introduce license checking into the build and release process.
> >
> > == Documentation ==
> >  * Documentation for TsFile: https://github.com/thulab/tsfile/wiki
> >
> >  * Documentation for IoTDB and its JDBC: http://tsfile.org/document (Chinese only. An English version is in progress.)
> >
> > == Required Resources ==
> > === Mailing Lists ===
> >  * private@iotdb.incubator.apache.org
> >
> >  * dev@iotdb.incubator.apache.org
> >
> >  * commits@iotdb.incubator.apache.org
> >
> >
> > === Git Repositories ===
> >  * https://git-wip-us.apache.org/repos/asf/incubator-iotdb.git
> >
> >
> > === Issue Tracking ===
> >  *  JIRA IoTDB (We currently use the issue management provided by Github
> to track issues.)
> >
> >
> > == Initial Committers ==
> > Tsinghua University, K2Data Company, Lenovo, Microsoft
> >
> > Jianmin Wang (jimwang at tsinghua dot edu dot cn)
> >
> > Xiangdong Huang (sainthxd at gmail dot com)
> >
> > Jun Yuan (richard_yuan16 at 163 dot com)
> >
> > Chen Wang (wang_chen at tsinghua dot edu dot cn)
> >
> > Jialin Qiao (qjl16 at mails dot tsinghua dot edu dot cn)
> >
> > Jinrui Zhang (jinrzhan at microsoft dot com)
> >
> > Rong Kang (kr11 at mails dot tsinghua dot edu dot cn)
> >
> > Tian Jiang (jiangtia18 at mails dot tsinghua dot edu dot cn)
> >
> > Shuo Zhang (zhangshuo at k2data dot com dot cn)
> >
> > Lei Rui (rl18 at mails dot tsinghua dot edu dot cn)
> >
> > Rui Liu (liur17 at mails dot tsinghua dot edu dot cn)
> >
> > Kun Liu (liukun16 at mails dot tsinghua dot edu dot cn)
> >
> > Gaofei Cao (cgf16 at mails dot tsinghua dot edu dot cn)
> >
> > Xinyi Zhao (xyzhao16 at mails dot tsinghua dot edu dot cn)
> >
> > Dongfang Mao (maodf17 at mails dot tsinghua dot edu dot cn)
> >
> > Tianan Li (lta18 at mails dot tsinghua dot edu dot cn)
> >
> > Yue Su (suy18 at mails dot tsinghua dot edu dot cn)
> >
> > Hui Dai (daihui_iot at lenovo dot com, yuct_iot at lenovo dot com)
> >
> > == Sponsors ==
> > === Champion ===
> > Kevin A. McGrail (kmcgrail@apache.org)
> >
> > === Nominated Mentors ===
> > Justin Mclean (justin at classsoftware dot com)
> >
> > Christofer Dutz (christofer.dutz at c-ware dot de)
> >
> > Willem Jiang (willem.jiang at gmail dot com)
> >
> >
>
>