Posted to dev@iotdb.apache.org by Jack Tsai <ja...@outlook.com> on 2019/07/15 06:13:48 UTC

Support batched ingestion

Hi,

I recently started working on the issue (https://issues.apache.org/jira/browse/IOTDB-13), which aims to implement a batched writing interface.

Users could invoke this interface to write a series of data with a static schema in one batch, as follows:

The idea is a TimeSeries class with two columns, long[] times and double[] values, plus a write method for users to load data into it:

TimeSeries series = new TimeSeries((int) columnSize);

for (int i = 0; i < columnSize; i++) {
    series.write((long) currentTime, (double) currentValue);
}

TsFileWriter.write(series);
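For concreteness, here is a minimal sketch of what such a TimeSeries class might look like. This is only an illustration under my assumptions (fixed capacity, double values only); names and signatures are not the final API:

```java
// Minimal illustrative sketch of the proposed TimeSeries class: two parallel
// primitive arrays plus a write cursor. Names and signatures are assumptions,
// not the final API; other value types (bool, int, float, text) are omitted.
public class TimeSeries {
    private final long[] times;
    private final double[] values;
    private int size = 0;

    public TimeSeries(int capacity) {
        this.times = new long[capacity];
        this.values = new double[capacity];
    }

    // Append one (timestamp, value) pair to the batch.
    public void write(long time, double value) {
        times[size] = time;
        values[size] = value;
        size++;
    }

    public int size() { return size; }
    public long timeAt(int i) { return times[i]; }
    public double valueAt(int i) { return values[i]; }
}
```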

If you have any different ideas or opinions, you are very welcome to discuss them with me.

Best regards,
Tsung-Han Tsai


Re: Support batched ingestion

Posted by Jack Tsai <ja...@outlook.com>.
Hi,

I have finished the implementation of this feature, which is called RowBatch (https://issues.apache.org/jira/browse/IOTDB-13).

Also, I ran some tests to compare the performance of writing aligned data with TSRecord versus RowBatch.

The following are the results for different numbers of data points. The measured time spans from the moment the user starts constructing the RowBatch or TSRecord until all the data have been written:


  *   Row number: 1000000
      Column (sensor) number: 1000 (all sensors with INT64 datatype)
      Total data point number: 1000000000 (1 billion)
      RowBatch average write throughput: 18338865 data points / s
      TSRecord average write throughput: 3275348.6 data points / s
      RowBatch / TSRecord: ~5.6x

  *   Row number: 1000000
      Column (sensor) number: 100 (all sensors with INT64 datatype)
      Total data point number: 100000000 (100 million)
      RowBatch average write throughput: 14501160.7 data points / s
      TSRecord average write throughput: 6369021.3 data points / s
      RowBatch / TSRecord: ~2.3x

  *   Row number: 100000
      Column (sensor) number: 1000 (all sensors with INT64 datatype)
      Total data point number: 100000000 (100 million)
      RowBatch average write throughput: 13960631.8 data points / s
      TSRecord average write throughput: 2993653.2 data points / s
      RowBatch / TSRecord: ~4.7x

  *   Row number: 1000000
      Column (sensor) number: 300 (100 sensors with INT32 datatype, 100 with INT64, and 100 with DOUBLE)
      Total data point number: 300000000 (300 million)
      RowBatch average write throughput: 15577941.3 data points / s
      TSRecord average write throughput: 759258.5 data points / s
      RowBatch / TSRecord: ~20.5x

  *   Row number: 10000000
      Column (sensor) number: 3 (1 sensor with INT32 datatype, 1 with INT64, and 1 with DOUBLE)
      Total data point number: 30000000 (30 million)
      RowBatch average write throughput: 3213367.6 data points / s
      TSRecord average write throughput: 1604363.8 data points / s
      RowBatch / TSRecord: ~2.0x
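As a quick sanity check, the speedup factors can be recomputed from the reported averages. This is just arithmetic on the figures listed above, not a new benchmark:

```java
// Recompute the RowBatch / TSRecord speedup ratios from the reported averages.
public class SpeedupCheck {
    static double ratio(double rowBatchAvg, double tsRecordAvg) {
        return rowBatchAvg / tsRecordAvg;
    }

    public static void main(String[] args) {
        System.out.printf("1 billion points: %.1fx%n", ratio(18338865.0, 3275348.6));  // ~5.6x
        System.out.printf("100 million (a):  %.1fx%n", ratio(14501160.7, 6369021.3));  // ~2.3x
        System.out.printf("100 million (b):  %.1fx%n", ratio(13960631.8, 2993653.2));  // ~4.7x
        System.out.printf("300 million:      %.1fx%n", ratio(15577941.3, 759258.5));   // ~20.5x
        System.out.printf("30 million:       %.1fx%n", ratio(3213367.6, 1604363.8));   // ~2.0x
    }
}
```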

As the results show, at these data sizes writing aligned data with RowBatch is clearly faster than with TSRecord, to varying degrees.

To achieve this write performance, I implemented RowBatch in a way similar to Apache ORC's columnar batches.

In RowBatch, each column (each sensor, plus the timestamp column) is stored in a primitive array; in TSRecord, each timestamp and the corresponding value of each sensor are stored in a list.

In other words, RowBatch stores data per sensor (column-oriented), while TSRecord stores data per timestamp (row-oriented).

This change of data structure and storage layout is what improves the write performance to varying degrees.
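To make the layout difference concrete, here is a sketch contrasting the two storage schemes. Row, ColumnBatch, and LayoutSketch are illustrative stand-ins, not the actual IoTDB classes, and only double-typed sensors are shown:

```java
import java.util.ArrayList;
import java.util.List;

// Row-oriented layout (TSRecord-like): one object per row; each sensor value
// is boxed and stored in a list.
class Row {
    long time;
    List<Double> sensorValues = new ArrayList<>();
}

// Column-oriented layout (RowBatch-like): one primitive array per column.
class ColumnBatch {
    long[] timestamps;         // the timestamp column
    double[][] sensorColumns;  // sensorColumns[s][r] = value of sensor s at row r

    ColumnBatch(int sensorCount, int rowCount) {
        timestamps = new long[rowCount];
        sensorColumns = new double[sensorCount][rowCount];
    }
}

public class LayoutSketch {
    public static void main(String[] args) {
        ColumnBatch batch = new ColumnBatch(100, 1000);
        batch.timestamps[0] = 42L;
        batch.sensorColumns[0][0] = 1.5;  // sensor 0, row 0
        System.out.println(batch.timestamps.length);  // 1000
    }
}
```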

Also, in the original approach, every time a TSRecord is written into the page writer, the page size has to be checked to decide whether a new page must be opened. This per-record check is too fine-grained and wastes time.

With RowBatch, a size check is only needed when a whole primitive array (all the data of one sensor) has been written into the page writer.
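The difference in check granularity can be illustrated with a stub. PageWriterStub is a hypothetical stand-in, not the real IoTDB page writer API; it is used here only to count how often the size check runs:

```java
// Contrast per-point (TSRecord-like) vs per-column (RowBatch-like) size checks.
// PageWriterStub is a hypothetical stand-in for the real page writer.
class PageWriterStub {
    static final int PAGE_SIZE = 64 * 1024;
    int bytes = 0;
    int checks = 0;

    void write(int nBytes) { bytes += nBytes; }

    void checkPageSize() {
        checks++;
        if (bytes >= PAGE_SIZE) bytes = 0;  // "open a new page"
    }
}

public class GranularityDemo {
    public static void main(String[] args) {
        int rows = 100_000;

        // TSRecord-like path: check the page size after every single point.
        PageWriterStub fine = new PageWriterStub();
        for (int i = 0; i < rows; i++) {
            fine.write(8);          // one INT64 value
            fine.checkPageSize();
        }

        // RowBatch-like path: write the whole primitive array, then check once.
        PageWriterStub coarse = new PageWriterStub();
        coarse.write(8 * rows);     // all values of one sensor
        coarse.checkPageSize();

        System.out.println(fine.checks + " vs " + coarse.checks);  // 100000 vs 1
    }
}
```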

In conclusion, writing aligned data is faster with RowBatch because of this implementation, while TSRecord should still be used for storing and writing unaligned data.

Best regards,
Tsung-Han Tsai
Re: Support batched ingestion

Posted by Jialin Qiao <qj...@mails.tsinghua.edu.cn>.
Hi, 

Just a reminder that TimeSeries should also support bool, int, long, float and text values apart from double. 

The supported data types can be found in TSDataType class.

Best,
--
Jialin Qiao
School of Software, Tsinghua University


> -----Original Message-----
> From: "Jack Tsai" <ja...@outlook.com>
> Sent: 2019-07-15 14:13:48 (Monday)
> To: "dev@iotdb.apache.org" <de...@iotdb.apache.org>
> Cc:
> Subject: Support batched ingestion