Posted to dev@iotdb.apache.org by Xiangdong Huang <sa...@gmail.com> on 2019/07/03 12:57:48 UTC

Re: Re: A new encoding method for regular timestamp column

Hi,

Thanks very much for your detailed experiment.

Firstly, I assume you are using the TS2DIFF encoding method.

Secondly, let's consider the compression ratio first.

Suppose the data is regular (i.e., the frequency is fixed) and there are no
missing data points. Then the compression ratio of the original method is
L / (8 + 8 + 8 + 4), i.e., raw data size / (start timestamp + level-1 delta +
level-2 delta + number of points).

With missing-data completion, the space cost is (8 + 8 + 8 + 4) + the size of
the bitmap, where (8 + 8 + 8 + 4) has the same meaning as above, and the size
of the bitmap equals "the number of points" / 8 bytes.
The new compression ratio is then L / (8 + 8 + 8 + 4 + n/8) (where n is the
number of points). The denominator stays constant no matter how the missing
data ratio changes, while the numerator changes with how many real data
points you have.
Because the less data is missing, the larger L is, the compression ratio
increases. That is why, in your experiment, the compression ratio of the new
method increases, but not by much.

In your method, the improvement of the compression ratio depends on how many
data points are in one page.
Suppose there are 100 points in a page. If no points are missing, the raw
data size is (100 * 8) = 800 bytes, so your method's compression ratio is
800 / (8 + 8 + 8 + 4 + 100/8) ~= 19.7, while the original method gives
800 / (8 + 8 + 8 + 4) ~= 28.6, which is better.
However, if 1% of the points are missing, then in your method the
compression ratio is 800*0.99 / (8 + 8 + 8 + 4 + 100/8) ~= 19.5. Supposing
the only missing point is in the middle of the page, the original method has
to split into two blocks, so its compression ratio is
800 / ((8 + 8 + 8 + 4) + (8 + 8 + 8 + 4)) ~= 14.3, which is worse.
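
To make the arithmetic concrete, here is a minimal Java sketch (an
illustration of the estimates above, not IoTDB code):

class RatioEstimate {
    // Block header cost: start timestamp (8) + level-1 delta (8)
    // + level-2 delta (8) + number of points (4), in bytes.
    static final double HEADER = 8 + 8 + 8 + 4;

    public static void main(String[] args) {
        double raw = 100 * 8;                           // 100 points, 8 bytes each
        System.out.println(raw / HEADER);               // original, no gap: ~28.6
        System.out.println(raw / (HEADER + 100 / 8.0)); // bitmap method, no gap: ~19.7
        System.out.println(raw * 0.99 / (HEADER + 100 / 8.0)); // 1% missing: ~19.5
        System.out.println(raw / (2 * HEADER));         // original, split by one gap: ~14.3
    }
}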

So, your experiment makes sense on the compression ratio aspect.

But I have no idea about the read and write time costs. Why do they differ?

Best,
-----------------------------------
Xiangdong Huang
School of Software, Tsinghua University

 黄向东
清华大学 软件学院


Jack Tsai <ja...@outlook.com> wrote on Wed, Jul 3, 2019, at 5:30 PM:

> Hi,
>
> Sorry for sending the unfinished mail (my last mail). This is the complete
> version.
>
> According to the results of the performance test I have done so far, I
> found some differences between the new encoding method I created and the
> original encoding method.
>
> I increased the data size to 2000000 rows (the data size in the original
> test of the old encoding method was around 3000) when testing the new
> encoding method. The comparison covers three aspects:
>
>
>   1.  Compression ratio
>   2.  Write (encode) operation time
>   3.  Read (decode) operation time
>
> The following is the comparison between the old encoding method and the
> new encoding method:
>
>   *   Missing a data point every 10 points (10% missing point percentage)
>
> Row number: 2332801
> Missing point percentage: 0.10000038580230375
> Source data size: 16796160 bytes
> Write time: 296
> Encoding data size: 2994679 bytes
> Read time: 173
> Compression ratio: 5.6086679073116015
>
>   *   Missing a data point every 20 points (5% missing point percentage)
>
> Row number: 2332801
> Missing point percentage: 0.05000040723576504
> Source data size: 17729280 bytes
> Write time: 287
> Encoding data size: 3161045 bytes
> Read time: 195
> Compression ratio: 5.608676877425029
>
>   *   Missing a data point every 80 points (1.25% missing point percentage)
>
> Row number: 2332801
> Missing point percentage: 0.012500423310861097
> Source data size: 18429120 bytes
> Write time: 291
> Encoding data size: 3285820 bytes
> Read time: 189
> Compression ratio: 5.608682155443694
>
>   *   Missing a data point every 1700 points (0.06% missing point percentage)
>
> Row number: 2332801
> Missing point percentage: 5.885628478382587E-4
> Source data size: 18651424 bytes
> Write time: 199
> Encoding data size: 651696 bytes
> Read time: 94
> Compression ratio: 28.619822739436792
>
>   *   Missing a data point every 40000 points (0.0025% missing point percentage)
>
> Row number: 2332801
> Missing point percentage: 2.5291484357259364E-5
> Source data size: 18661936 bytes
> Write time: 232
> Encoding data size: 443136 bytes
> Read time: 69
> Compression ratio: 42.11333766608897
>
>
>
> Here are the testing results for the new encoding method:
>
>   *   Missing a data point every 10 points (10% missing point percentage)
>
> Row number: 2332801
> Missing point percentage: 0.10000038580230375
> Source data size: 16796160 bytes
> Write time: 117
> Encoding data size: 653285 bytes
> Read time: 84
> Compression ratio: 25.71031020152001
>
>   *   Missing a data point every 20 points (5% missing point percentage)
>
> Row number: 2332801
> Missing point percentage: 0.05000040723576504
> Source data size: 17729280 bytes
> Write time: 171
> Encoding data size: 653285 bytes
> Read time: 116
> Compression ratio: 27.138660768271123
>
>   *   Missing a data point every 80 points (1.25% missing point percentage)
>
> Row number: 2332801
> Missing point percentage: 0.012500423310861097
> Source data size: 18429120 bytes
> Write time: 128
> Encoding data size: 653285 bytes
> Read time: 89
> Compression ratio: 28.209923693334456
>
>   *   Missing a data point every 1700 points (0.06% missing point percentage)
>
> Row number: 2332801
> Missing point percentage: 5.885628478382587E-4
> Source data size: 18651424 bytes
> Write time: 118
> Encoding data size: 653285 bytes
> Read time: 161
> Compression ratio: 28.550210092073137
>
>   *   Missing a data point every 40000 points (0.0025% missing point percentage)
>
> Row number: 2332801
> Missing point percentage: 0.002142488793514752
> Source data size: 18622424 bytes
> Write time: 110
> Encoding data size: 653286 bytes
> Read time: 81
> Compression ratio: 28.50577541842317
>
> As you can see above, the compression ratio increases as the missing point
> percentage decreases.
>
> At high missing point percentages, the new encoding method compresses the
> data better.
>
> When the missing point percentage comes down to around 0.06%, the
> compression performance of the two encoding methods is almost the same.
>
> However, as the missing point percentage keeps decreasing, the compression
> ratio of the new encoding method barely increases anymore, while the
> original encoding method keeps improving as the missing point percentage
> drops.
>
> In the last comparison of the compression ratio, the original encoding
> method reaches a compression ratio of about 42x, almost the same as
> encoding data without missing points, while the new encoding method still
> remains around 28x.
>
> The write and read times also differ between the two encoding methods. The
> new encoding method improves the write (encode) performance.
>
> In conclusion, the new encoding method is better suited to encoding when
> the missing point percentage is high (above 1%), but at lower missing
> point percentages (under 1%) the original encoding method may compress
> better.
>
> Given these results, do I need to make any changes to my implementation of
> the new encoding method? Do these results meet what this issue needs?
>
> If you see any problems in my testing or implementation, please feel free
> to give me some advice.
>
> Best regards,
> Tsung-Han Tsai
> ________________________________
> From: Jack Tsai
> Sent: Jul 3, 2019, 3:58 PM
> To: dev@iotdb.apache.org
> Subject: Re: Re: A new encoding method for regular timestamp column
>
> Hi,
>
> According to the results of the performance test I have done so far, I
> found some differences between the new encoding method I created and the
> original encoding method.
>
> I increased the data size to 2000000 rows (the data size in the original
> test of the old encoding method was around 3000) when testing the new
> encoding method. The comparison covers three aspects:
>
>
>   1.  Compression ratio
>   2.  Write (encode) operation time
>   3.  Read (decode) operation time
>
> I found that in the original encoding method, the determining factor for
> the compression performance is not the number of missing points in the
> data but the density of the missing points. I will provide some examples
> to illustrate.
>
>
>   *   Missing a data point every 500 points
>
> ________________________________
> From: Jialin Qiao <qj...@mails.tsinghua.edu.cn>
> Sent: Jun 28, 2019, 10:35 AM
> To: dev@iotdb.apache.org
> Subject: Re: Re: A new encoding method for regular timestamp column
>
> Hi,
>
> The bitmap is also needed. Just use another encoding for the regular time
> column.
>
> Suppose the time column is "2, 3, 4, 6". The generated column is "2, 3, 4,
> 5, 6".
>
> Then we could use "2 (the first value), 1 (the delta), 5 (the total data
> number)" to encode the original data.
>
> However, when decoding "2, 1, 5" into "2, 3, 4, 5, 6", we still need a
> mark column such as "11101" to denote that the fourth value is generated.
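>
> As a tiny Java sketch of that idea (an illustration only; the class name
> is made up):
>
> import java.util.BitSet;
>
> class MarkColumnDemo {
>     public static void main(String[] args) {
>         long first = 2, delta = 1;  // encoded header: first value and delta
>         int count = 5;              // total data number after completion
>         BitSet mark = new BitSet(count);
>         mark.set(0, 5);             // start from "11111"
>         mark.clear(3);              // "11101": the fourth value is generated
>         for (int i = 0; i < count; i++) {
>             if (mark.get(i)) {
>                 System.out.print((first + i * delta) + " ");  // prints: 2 3 4 6
>             }
>         }
>     }
> }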
>
> Best,
> --
> Jialin Qiao
> School of Software, Tsinghua University
>
> 乔嘉林
> 清华大学 软件学院
>
> > -----Original Message-----
> > From: "Jack Tsai" <ja...@outlook.com>
> > Sent: 2019-06-28 09:54:18 (Friday)
> > To: "dev@iotdb.apache.org" <de...@iotdb.apache.org>
> > Cc:
> > Subject: Re: A new encoding method for regular timestamp column
> >
> > Hi Jialin,
> >
> > I am not sure if my understanding is right.
> >
> > Does it mean to encode the data without the bitmap?
> >
> > Best regards,
> > Tsung-Han Tsai
> >
> > ________________________________
> > From: Jialin Qiao <qj...@mails.tsinghua.edu.cn>
> > Sent: Jun 28, 2019, 8:59 AM
> > To: dev@iotdb.apache.org
> > Subject: Re: A new encoding method for regular timestamp column
> >
> > Hi, Tsung-Han
> >
> > Nice try!
> > I think by "low compression ratio of regular time series (with missing
> > points)" you mean using TS2DIFF to encode the time column.
> > I wonder, if we could "generate a new data array with no missing points
> > (regular time series)", could we just use (1) the first value, (2) the
> > delta, and (3) the total number of values to encode the raw data instead
> > of using TS2DIFF?
> >
> > Best,
> > --
> > Jialin Qiao
> > School of Software, Tsinghua University
> >
> > 乔嘉林
> > 清华大学 软件学院
> >
> > > -----Original Message-----
> > > From: "Jack Tsai" <ja...@outlook.com>
> > > Sent: 2019-06-27 17:56:45 (Thursday)
> > > To: "dev@iotdb.apache.org" <de...@iotdb.apache.org>
> > > Cc:
> > > Subject: A new encoding method for regular timestamp column
> > >
> > > Hi all,
> > >
> > > I am working on this issue:
> https://issues.apache.org/jira/projects/IOTDB/issues/IOTDB-73
> > >
> > > The new encoding method deals with the problem of regular time series
> > > (where the time elapsed between data points is always the same) that
> > > have missing points. When missing points exist in the data, the
> > > compression ratio drops from 40x to 8x.
> > >
> > > To solve this issue, here is my solution. I will explain it in two
> > > parts: the write operation and the read operation.
> > >
> > > Write (Encode)
> > >
> > >   1.  First of all, calculate the delta between adjacent elements in the
> > > data. If a delta value differs from the previous delta, a missing point
> > > exists in the data.
> > >   2.  If missing points exist, get the minimum delta base between
> > > adjacent elements and generate a new data array with no missing points
> > > (a regular time series).
> > >   3.  Next, begin to write the info of this data into the byte array
> > > output stream.
> > >   4.  The first part is the identifier, a boolean value that shows
> > > whether missing points exist in the following data.
> > >   5.  Compare the original data (the one with missing points) with the
> > > new complete data to form a bitmap denoting the positions of the missing
> > > points (see the sketch after this list).
> > >   6.  Convert the bitmap into a byte array and write it and its length
> > > into the output stream.
> > >   7.  Start to encode the values of the data (the newly created one with
> > > no missing points) into the output stream.
> > >   8.  When the encoded data size reaches the target block size, flush it
> > > into the output stream. Repeat until all values in the data have been
> > > flushed into the output stream.
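> > >
> > > As a rough Java sketch of steps 1, 2, and 5 above (an illustration only,
> > > not the actual implementation; it assumes at least two points):
> > >
> > > import java.util.BitSet;
> > >
> > > class BitmapBuildSketch {
> > >     static BitSet buildBitmap(long[] times) {
> > >         // Steps 1-2: the minimum delta between adjacent points gives
> > >         // the interval of the regular series.
> > >         long minDelta = Long.MAX_VALUE;
> > >         for (int i = 1; i < times.length; i++) {
> > >             minDelta = Math.min(minDelta, times[i] - times[i - 1]);
> > >         }
> > >         // Size of the completed column with no missing points.
> > >         int count = (int) ((times[times.length - 1] - times[0]) / minDelta) + 1;
> > >         // Step 5: bit 1 marks a position where a real point exists.
> > >         BitSet bitmap = new BitSet(count);
> > >         for (long t : times) {
> > >             bitmap.set((int) ((t - times[0]) / minDelta));
> > >         }
> > >         return bitmap;
> > >     }
> > > }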
> > >
> > > Read (Decode)
> > >
> > >   1.  First of all, decode the first boolean value in the buffer to
> > > check whether the following data has missing points.
> > >   2.  If missing points exist in the data, a byte array holding the
> > > bitmap follows in this buffer.
> > >   3.  To decode the bitmap, first read the next int value, which is the
> > > length of the bitmap's byte array.
> > >   4.  Decode the byte array and convert it into the bitmap denoting the
> > > missing points in the following data.
> > >   5.  Decode the following data (the data array without missing points)
> > > and compare it with the bitmap: when a bit is 0, a missing point exists
> > > at that position; return (read out) the decoded value when the bit is 1.
> > >
> > > The compression ratio goes up to 20x in the original unit test (the old
> > > encoding method results in 8x).
> > >
> > > To compare with the original encoding method more precisely, I will
> > > soon run a more detailed unit test and a performance test for this new
> > > encoding method.
> > >
> > > If anything is unclear or you think there is a problem in my
> > > implementation, please feel free to give me some advice.
> > >
> > > Best regards,
> > > Tsung-Han Tsai
>

Re: Re: A new encoding method for regular timestamp column

Posted by Jack Tsai <ja...@outlook.com>.
Hi,

I think we should let the user declare which encoding to use (maybe in the same way as GorillaEncoder).

RegularDataEncoder (the new encoder) can only be used on regular data (where the interval between adjacent elements is the same), possibly with missing points.
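
As a minimal sketch of what user-declared selection could look like (the
enum, stub classes, and factory here are hypothetical, not the actual IoTDB
API):

// Stand-in stubs for the two encoders discussed in this thread.
interface Encoder {}
class DeltaBinaryEncoder implements Encoder {}
class RegularDataEncoder implements Encoder {}

enum TimeEncoding { TS_2DIFF, REGULAR }

class EncoderFactory {
    // The user declares the encoding per time series (as with
    // GorillaEncoder); REGULAR is only safe on regular timestamp columns.
    static Encoder encoderFor(TimeEncoding declared) {
        switch (declared) {
            case REGULAR:  return new RegularDataEncoder();
            case TS_2DIFF: return new DeltaBinaryEncoder();
            default: throw new IllegalArgumentException("unknown encoding");
        }
    }
}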

The reason it cannot handle irregular data is that this kind of implementation might make the encoded data very large when the data is unsequence (out of order). Let's assume there is a data block like this:

100000 1 2 10000 200000 15000   (minimum delta base: 1)

If it had any missing point in it, the implementation in RegularDataEncoder would reformat this data block into regular data to be encoded, and with a minimum delta base of 1 the completed column would span (200000 - 1) / 1 + 1 = 200000 positions for only 6 real points, so the block size would be very large.
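
A quick sketch of that blow-up (my own illustration): the completed regular
column has to span (max - min) / minDelta + 1 positions, so for the block
above:

import java.util.Arrays;

class BlowUpDemo {
    public static void main(String[] args) {
        long[] block = {100000, 1, 2, 10000, 200000, 15000};
        long min = Arrays.stream(block).min().getAsLong();  // 1
        long max = Arrays.stream(block).max().getAsLong();  // 200000
        long minDelta = 1;                                  // from the example
        long slots = (max - min) / minDelta + 1;
        System.out.println(slots);      // 200000 positions for 6 real points
        System.out.println(slots / 8);  // plus a 25000-byte bitmap
    }
}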

So it is better to use DeltaBinaryEncoder (the original encoder) to encode irregular data, as it does not need to fill in the missing points (DeltaBinaryEncoder treats unsequence data the same way as regular data with missing points).

I believe there is a better way to deal with unsequence data, but I cannot think of one right away. Maybe I will move on to the next issue first and think about this problem for a while.

Best regards,
Tsung-Han Tsai
________________________________
From: Xiangdong Huang <sa...@gmail.com>
Sent: Jul 9, 2019, 9:38 PM
To: dev@iotdb.apache.org
Subject: Re: Re: A new encoding method for regular timestamp column

Hi,

Thanks for your detailed experiment.

My question is: given a page, how do we decide the encoding method?

Should users declare it? Or should we just always use the new encoding
method? If the latter, what is the performance when the new method meets
irregular data?

Best,
-----------------------------------
Xiangdong Huang
School of Software, Tsinghua University

 黄向东
清华大学 软件学院


Jack Tsai <ja...@outlook.com> wrote on Tue, Jul 9, 2019, at 2:59 PM:

> Hi,
>
> I have made some improvements to the new encoding method I created
> before; I think the initial version did not fully fit how users invoke
> the encoder's API.
>
> The following are the details of the new implementation:
>
> Write (Encode)
>
>   1.  Invoke the RegularDataEncoder (the new encoder I created) to encode
> data points. Let's assume each data point is a timestamp in milliseconds.
>   2.  Create a new array with a default size to store the incoming data.
>   3.  When the data points fill the array, check whether missing points
> exist in it.
>   4.  If a missing point is detected, create a new array to store the
> regular data, in which the time elapsed between data points is the same
> (calculate the minimum delta between adjacent points to get the time
> interval for the new regular data array).
>   5.  Next, mark the identifier that lets the decoder know whether missing
> points exist in the following data, and create a bitmap denoting where the
> missing points are in this data array. Encode this info into the output
> stream.
>   6.  In this step, encode the array info (the array size, the minimum
> delta value, and the first value of the array) into the output stream.
>   7.  Repeat steps 1-6 until the end of the data.
>
> Read (Decode)
>
>   1.  Read the identifier of the pack (data array) to know whether missing
> points exist in the following data.
>   2.  If missing points exist in the data, read the bitmap that was
> encoded.
>   3.  Load the first value and the minimum delta base. Repeatedly add the
> minimum delta base to the previous value (beginning from the first value
> of this pack) and store the results into the data array.
>   4.  Once all the values in this pack have been loaded, walk the bitmap
> together with the data array (a bitmap element of 0 means a missing point
> exists at this position). If a missing point is detected, skip this data
> point in the array (see the sketch after this list).
>   5.  Repeat steps 1-4 until all the data has been read.
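>
> As a minimal Java sketch of steps 3-4 (an illustration only, not the
> actual RegularDataEncoder code; the class and method names are made up):
>
> import java.util.BitSet;
>
> class PackDecodeSketch {
>     // Regenerate the full regular column from (first, minDelta, count)
>     // and keep only the positions whose bitmap bit is 1.
>     static long[] decodePack(long first, long minDelta, int count, BitSet bitmap) {
>         long[] out = new long[bitmap.cardinality()];
>         int k = 0;
>         long value = first;
>         for (int i = 0; i < count; i++, value += minDelta) {
>             if (bitmap.get(i)) {   // a 0 bit is a generated (missing) point
>                 out[k++] = value;
>             }
>         }
>         return out;
>     }
> }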
>
> Result
> Below are the testing results of this encoder, compared with the results
> of the DeltaBinaryEncoder:
>
> DeltaBinaryEncoder (the original encoding method)
>
>   *   No missing point
>
> Row number: 2332801
> Source data size: 18662408 bytes
> Write time: 158
> Encoding data size: 434016 bytes
> Read time: 68
> Compression ratio: 42.99935486249355
>
>   *   Missing a data point every 10 points (10% missing point percentage)
>
> Row number: 2332801
> Missing point percentage: 0.10000038580230375
> Source data size: 16796160 bytes
> Write time: 296
> Encoding data size: 2994679 bytes
> Read time: 173
> Compression ratio: 5.6086679073116015
>
>   *   Missing a data point every 20 points (5% missing point percentage)
>
> Row number: 2332801
> Missing point percentage: 0.05000040723576504
> Source data size: 17729280 bytes
> Write time: 287
> Encoding data size: 3161045 bytes
> Read time: 195
> Compression ratio: 5.608676877425029
>
>   *   Missing a data point every 80 points (1.25% missing point percentage)
>
> Row number: 2332801
> Missing point percentage: 0.012500423310861097
> Source data size: 18429120 bytes
> Write time: 291
> Encoding data size: 3285820 bytes
> Read time: 189
> Compression ratio: 5.608682155443694
>
>   *   Missing a data point every 1700 points (0.06% missing point percentage)
>
> Row number: 2332801
> Missing point percentage: 5.885628478382587E-4
> Source data size: 18651424 bytes
> Write time: 199
> Encoding data size: 651696 bytes
> Read time: 94
> Compression ratio: 28.619822739436792
>
>   *   Missing a data point every 40000 points (0.0025% missing point percentage)
>
> Row number: 2332801
> Missing point percentage: 2.5291484357259364E-5
> Source data size: 18661936 bytes
> Write time: 232
> Encoding data size: 443136 bytes
> Read time: 69
> Compression ratio: 42.11333766608897
>
> RegularDataEncoder (the new encoding method I created)
>
>   *   No missing point
>
> Row number: 2332801
> Source data size: 18662408 bytes
> Write time: 83
> Encoding data size: 382746 bytes
> Read time: 53
> Compression ratio: 48.75925025996353
>
>   *   Missing a data point every 10 points (10% missing point percentage)
>
> Row number: 2332801
> Missing point percentage: 0.10000038580230375
> Source data size: 16796160 bytes
> Write time: 133
> Encoding data size: 705320 bytes
> Read time: 108
> Compression ratio: 23.81353144671922
>
>   *   Missing a data point every 20 points (5% missing point percentage)
>
> Row number: 2332801
> Missing point percentage: 0.05000040723576504
> Source data size: 17729280 bytes
> Write time: 125
> Encoding data size: 727184 bytes
> Read time: 121
> Compression ratio: 24.38073444960285
>
>   *   Missing a data point every 80 points (1.25% missing point percentage)
>
> Row number: 2332801
> Missing point percentage: 0.012500423310861097
> Source data size: 18429120 bytes
> Write time: 161
> Encoding data size: 755895 bytes
> Read time: 159
> Compression ratio: 24.380529041731986
>
>   *   Missing a data point every 1700 points (0.06% missing point percentage)
>
> Row number: 2332801
> Missing point percentage: 5.885628478382587E-4
> Source data size: 18651424 bytes
> Write time: 100
> Encoding data size: 411117 bytes
> Read time: 73
> Compression ratio: 45.36767878730386
>
>   *   Missing a data point every 40000 points (0.0025% missing point percentage)
>
> Row number: 2332801
> Missing point percentage: 2.5291484357259364E-5
> Source data size: 18661936 bytes
> Write time: 89
> Encoding data size: 383943 bytes
> Read time: 65
> Compression ratio: 48.60600662077444
>
> Conclusion
>
> As you can see above, whether or not missing points exist in the regular
> data, the RegularDataEncoder (the new encoding method) performs better on
> write time, read time, and compression ratio.
>
> As the missing point percentage decreases, the compression ratio
> increases. Even regular data without missing points gets a better
> compression ratio than with the DeltaBinaryEncoder (the original encoding
> method).
>
> The reason the write time, read time, and compression ratio improve to
> different degrees is that I changed which elements are written (encoded)
> into the output stream.
>
> In the original encoding method, not only the first value, the minimum
> delta base, and the pack length but also every delta between adjacent
> elements is encoded into the output stream, which slows down the read and
> write performance.
>
> After adopting what Jialin suggested, I encode only the missing-point
> identifier, the first value, the minimum delta base, and the pack length
> (sometimes with the bitmap), and while decoding I repeatedly add the
> minimum delta to the previous value to rebuild the data array (sometimes
> decoding with the bitmap). The size of the data in the output stream
> decreases, and since less data needs to be encoded and decoded, the read
> and write times improve.
>
> Please feel free to discuss this with me if you see any problems. I will
> open the pull request if everything is fine.
>
> Best regards,
> Tsung-Han Tsai
> ________________________________
> From: Jack Tsai <ja...@outlook.com>
> Sent: Jul 4, 2019, 4:06 PM
> To: dev@iotdb.apache.org
> Subject: Re: Re: A new encoding method for regular timestamp column
>
> Hi,
>
> I also found that when adopting the new encoding method I created, the
> compression ratio and the operation time costs for data without missing
> points also improved. Please see the following for details:
>
>
>   *   Original encoding method (data without missing points)
>
> Row number: 2332801
> Source data size: 18662408 bytes
> Write time: 158
> Encoding data size: 434016 bytes
> Read time: 68
> Compression ratio: 42.99935486249355
>
>   *   New encoding method (data without missing points)
>
> Row number: 2332801
> Source data size: 18662408 bytes
> Write time: 78
> Encoding data size: 361681 bytes
> Read time: 55
> Compression ratio: 51.59908316997575
>
> Based on these experimental results, when the data has no missing points,
> adopting the new encoding method might be better.
>
> Best regards,
> Tsung-Han Tsai
Re: 回覆: A new encoding method for regular timestamp column

Posted by Xiangdong Huang <sa...@gmail.com>.
Hi,

Thanks for your detailed experiment.

My question is, given a page, how to decide the encoding method?

Let users declare that? Or just always use the new encoding method? If the
latter, how the performance if the new method meets irregular data?

Best,
-----------------------------------
Xiangdong Huang
School of Software, Tsinghua University

 黄向东
清华大学 软件学院


Jack Tsai <ja...@outlook.com> 于2019年7月9日周二 下午2:59写道:

> Hi,
>
> I have made some improvement in the new encoding method I created before,
> which I think the one I've done initially was not fully meet the user's
> usage for invoking the encoder's API.
>
> The following is the detail for the new implementation:
>
> Write (Encode)
>
>   1.  Invoke the RegularDataEncoder (the new encoder I created) to encode
> data points. Let's assume each data point is the millisecond of the
> timestamp (date).
>   2.  Create a new array with default size for storing the incoming data.
>   3.  When the data points fill the array, start to check whether missing
> points exist in it.
>   4.  If the missing point is detected, create a new array for storing the
> regular data, which the time elapsed between each data point is the same
> (calculate the minimum delta between each point to get the time interval
> for the new regular data array).
>   5.  Next, mark the identifier for decoder to know if the missing point
> exist in the following data, and start to create a bitmap for denoting
> where the missing point is in this data array. Encode these info into the
> output stream.
>   6.  In this step, encode the the array info (the array size, minimum
> delta value and first value of this array) into the output stream.
>   7.  Repeat 1-6 steps until the end of the data.
>
> Read (Decode)
>
>   1.  Read the identifier of the pack (data array) to know if the missing
> point exists in the following data.
>   2.  If the missing point exists in the data, read the bitmap which had
> been encoded.
>   3.  Load the first value and the minimum delta base. Add up the minimum
> delta base along with the previous value (begin from the first value of
> this pack) and store it into the data array.
>   4.  While all the value in this pack have been loaded, read the bitmap
> with the data array (while the bitmap element return 0, means the missing
> point exists in this position). If the missing point is detected, skip this
> data point in the array.
>   5.  Repeat 1-4 steps until all the data has been read.
>
> Result
> It is the testing result of this encoder below, which I would compare with
> the result of the DeltaBinaryEncoder:
>
> DeltaBinaryEncoder (the original encoding method)
>
>   *   No missing point
>
> Row number: 2332801
> source data size:18662408 byte
> Write time: 158
> encoding data size:434016 byte
> Read time: 68
> Compression ratio: 42.99935486249355
>
>   *   Missing a data point every 10 points (10% missing point percentage)
>
> Row number: 2332801
> Missing point percentage: 0.10000038580230375
> source data size:16796160 byte
> Write time: 296
> encoding data size:2994679 byte
> Read time: 173
> Compression ratio: 5.6086679073116015
>
>   *   Missing a data point every 20 points (5% missing point percentage)
>
> row number: 2332801
> missing point percentage: 0.05000040723576504
> source data size:17729280 byte
> write time: 287
> encoding data size:3161045 byte
> read time: 195
> compression ratio: 5.608676877425029
>
>   *   Missing a data point every 80 points (1% missing point percentage)
>
> row number: 2332801
> missing point percentage: 0.012500423310861097
> source data size:18429120 byte
> write time: 291
> encoding data size:3285820 byte
> read time: 189
> compression ratio: 5.608682155443694
>
>   *   Missing a data point every 1700 points (0.0005% missing point
> percentage)
>
> Row number: 2332801
> Missing point percentage: 5.885628478382587E-4
> source data size:18651424 byte
> Write time: 199
> encoding data size:651696 byte
> Read time: 94
> Compression ratio: 28.619822739436792
>
>   *   Missing a data point every 40000 points (0.00002% missing point
> percentage)
>
> row number: 2332801
> missing point percentage: 2.5291484357259364E-5
> source data size:18661936 byte
> write time: 232
> encoding data size:443136 byte
> read time: 69
> compression ratio: 42.11333766608897
>
> RegularDataEncoder (the new encoding method I created)
>
>   *   No missing point
>
> Row number: 2332801
> source data size:18662408 byte
> Write time: 83
> encoding data size:382746 byte
> Read time: 53
> Compression ratio: 48.75925025996353
>
>   *   Missing a data point every 10 points (10% missing point percentage)
>
> Row number: 2332801
> Missing point percentage: 0.10000038580230375
> source data size:16796160 byte
> Write time: 133
> encoding data size:705320 byte
> Read time: 108
> Compression ratio: 23.81353144671922
>
>   *   Missing a data point every 20 points (5% missing point percentage)
>
> Row number: 2332801
> Missing point percentage: 0.05000040723576504
> source data size:17729280 byte
> Write time: 125
> encoding data size:727184 byte
> Read time: 121
> Compression ratio: 24.38073444960285
>
>   *   Missing a data point every 80 points (1% missing point percentage)
>
> Row number: 2332801
> Missing point percentage: 0.012500423310861097
> source data size:18429120 byte
> Write time: 161
> encoding data size:755895 byte
> Read time: 159
> Compression ratio: 24.380529041731986
>
>   *   Missing a data point every 1700 points (0.0005% missing point
> percentage)
>
> Row number: 2332801
> Missing point percentage: 5.885628478382587E-4
> source data size:18651424 byte
> Write time: 100
> encoding data size:411117 byte
> Read time: 73
> Compression ratio: 45.36767878730386
>
>   *   Missing a data point every 40000 points (0.00002% missing point
> percentage)
>
> Row number: 2332801
> Missing point percentage: 2.5291484357259364E-5
> source data size:18661936 byte
> Write time: 89
> encoding data size:383943 byte
> Read time: 65
> Compression ratio: 48.60600662077444
>
> Conclusion
>
> As you could see above, no matter the missing point exists in the regular
> data or not, RegularDataEncoder (the new encoding method) got a better
> performance on the write time, read time and the compression ratio.
>
> As the missing point percentage decreased, the compression ratio would
> increase. Even the regular data without missing point could have the better
> compression ratio than the DeltaBinaryEncoder (the original encoding
> method).
>
> The reason why the performance of the write time, read time and
> compression ratio could have different degrees of improvement is because I
> have changed the element to write (encode) into the output stream.
>
> In the original encoding method, not only the first value, minimum delta
> base and the pack length, but also the the whole delta between each element
> would be encode into the output stream, which could slow down the read and
> write performance.
>
> After adapting what Jialin told me, which just encode the missing point
> data identifier, first value, minimum delta base and the pack length
> (sometimes with the bitmap), then add up the minimum delta along with the
> previous value to form the data array while decoding (sometimes decode with
> the bitmap). The size of the data in the output stream could be decreased,
> and as the data which need to be encoded and decoded decreased, the read
> time and write time performance could be risen up.
>
> Please welcome to discuss with me if you got any problem. I would commit
> the pull request if everything is fine.
>
> Best regards,
> Tsung-Han Tsai
> ________________________________
> 寄件者: Jack Tsai <ja...@outlook.com>
> 寄件日期: 2019年7月4日 下午 04:06
> 收件者: dev@iotdb.apache.org
> 主旨: 回覆: 回覆: A new encoding method for regular timestamp column
>
> Hi,
>
> I also found that while adopting the new encoding method I created, the
> compression ratio and the operation time cost for data without missing
> point also got an improvement. Please see the following for details:
>
>
>   *   Original encoding method (data without missing point)
>
> Row number: 2332801
> source data size:18662408 byte
> Write time: 158
> encoding data size:434016 byte
> Read time: 68
> Compression ratio: 42.99935486249355
>
>
>   *   New encoding method (data without missing point)
>
> Row number: 2332801
> source data size:18662408 byte
> Write time: 78
> encoding data size:361681 byte
> Read time: 55
> Compression ratio: 51.59908316997575
>
> As the experimental result, while the data got no missing point, adopting
> the new encoding method might be better.
>
> Best regards,
> Tsung-Han Tsai
> ________________________________
> 寄件者: Xiangdong Huang <sa...@gmail.com>
> 寄件日期: 2019年7月3日 下午 08:57
> 收件者: dev@iotdb.apache.org
> 主旨: Re: 回覆: A new encoding method for regular timestamp column
>
> Hi,
>
> Thanks very much for your detailed experiment.
>
> Firstly, I assume you are using TS2DIFF encoding method.
>
> Secondly, let's considering the compression ratio first.
>
> In my imagination, suppose the data is regular (i.e., the frequency is
> fixed) and there is no missing data points,
> then the compression ratio (of the original method is): L / (8 + 8 + 8 +
> 4), referring to: raw data size / (start timestamp + level 1 delta + level
> 2 delta + number of points).
>
> Using missing data completion, then the data space cost is (8 + 8+ 8 + 4) +
> the size of bitmap, where  (8 + 8 + 8 + 4) has the same meaning as above,
> and the size of bitmap should equal to  "the number of points"/8 bytes.
> The new compression ratio can be: L / (8 + 8 + 8 + 4 + n/8)  (suppose the
> number of points is $n$). The denominator keeps constant no matter how the
> missing data ratio changes, while the numerator changes according to how
> many real data points you have.
> Because the fewer the missing data is, the larger the L is, then you
> compression ratio will increase. That is why in your experiment, the
> compression ratio of the new method increase but not so much.
>
> In your method, the improvement of the compression ratio depends on how
> many data points in one page.
> If there is 100 points in a page,  if there is no missing points, and the
> raw data size is (100 * 8) = 800 bytes, then the original compression is:
> 800 / (8 + 8 + 8 + 4 +100/8) ~=19.7, but the original method is 800 / (8 +
> 8 + 8 + 4) ~=28, which is better.
> However,  if there is 1% point missing, then in your method, the
> compression ratio is 800*0.99 / (8 + 8 + 8 +4 +100/8) ~=19.5. Suppose the
> only one missing point is at the middle of the points, then the original
> method's compression ratio is : 800 / ((8 + 8 + 8 + 4) + (8 + 8 + 8 + 4) )
> ~= 14, which is worse.
>
> So, your experiment make sense on the compression ratio aspect.
>
> But I have no idea about the read and write time cost. Why?
>
> Best,
> -----------------------------------
> Xiangdong Huang
> School of Software, Tsinghua University
>
>  黄向东
> 清华大学 软件学院
>
>
> Jack Tsai <ja...@outlook.com> 于2019年7月3日周三 下午5:30写道:
>
> > Hi,
> >
> > Sorry for sending the unfinished mail (last mail). This is the complete
> > version.
> >
> > According to the result of the performance test I have done so far, I
> > found some difference between the new encoding method I created and the
> > original encoding method.
> >
> > I increased the data size to 2000000 rows (the data size in the original
> > test of the old encoding method was around 3000) when testing the new
> > encoding method. There are some aspects for the comparison:
> >
> >
> >   1.  Compression ratio
> >   2.  Write (encode) operation time
> >   3.  Read (decode) operation time
> >
> > The following is the comparison between the old encoding method and the
> > new encoding method:
> >
> >   *   Missing a data point every 10 points (10% missing point percentage)
> >
> > Row number: 2332801
> > Missing point percentage: 0.10000038580230375
> > source data size:16796160 byte
> > Write time: 296
> > encoding data size:2994679 byte
> > Read time: 173
> > Compression ratio: 5.6086679073116015
> >
> >
> >   *   Missing a data point every 20 points (5% missing point percentage)
> >
> > row number: 2332801
> > missing point percentage: 0.05000040723576504
> > source data size:17729280 byte
> > write time: 287
> > encoding data size:3161045 byte
> > read time: 195
> > compression ratio: 5.608676877425029
> >
> >
> >   *   Missing a data point every 80 points (1% missing point percentage)
> >
> > row number: 2332801
> > missing point percentage: 0.012500423310861097
> > source data size:18429120 byte
> > write time: 291
> > encoding data size:3285820 byte
> > read time: 189
> > compression ratio: 5.608682155443694
> >
> >   *   Missing a data point every 1700 points (0.0005% missing point
> > percentage)
> >
> > Row number: 2332801
> > Missing point percentage: 5.885628478382587E-4
> > source data size:18651424 byte
> > Write time: 199
> > encoding data size:651696 byte
> > Read time: 94
> > Compression ratio: 28.619822739436792
> >
> >   *   Missing a data point every 40000 points (0.00002% missing point
> > percentage)
> >
> > row number: 2332801
> > missing point percentage: 2.5291484357259364E-5
> > source data size:18661936 byte
> > write time: 232
> > encoding data size:443136 byte
> > read time: 69
> > compression ratio: 42.11333766608897
> >
> >
> >
> > There are few testing results for the new encoding method as follows:
> >
> >   *   Missing a data point every 10 points (10% missing point percentage)
> >
> > Row number: 2332801
> > Missing point percentage: 0.10000038580230375
> > source data size:16796160 byte
> > Write time: 117
> > encoding data size:653285 byte
> > Read time: 84
> > Compression ratio: 25.71031020152001
> >
> >   *   Missing a data point every 20 points (5% missing point percentage)
> >
> > Row number: 2332801
> > Missing point percentage: 0.05000040723576504
> > source data size:17729280 byte
> > Write time: 171
> > encoding data size:653285 byte
> > Read time: 116
> > Compression ratio: 27.138660768271123
> >
> >   *   Missing a data point every 80 points (1% missing point percentage)
> >
> > Row number: 2332801
> > Missing point percentage: 0.012500423310861097
> > source data size:18429120 byte
> > Write time: 128
> > encoding data size:653285 byte
> > Read time: 89
> > Compression ratio: 28.209923693334456
> >
> >   *   Missing a data point every 1700 points (0.0005% missing point
> > percentage)
> >
> > Row number: 2332801
> > Missing point percentage: 5.885628478382587E-4
> > source data size:18651424 byte
> > Write time: 118
> > encoding data size:653285 byte
> > Read time: 161
> > Compression ratio: 28.550210092073137
> >
> >   *   Missing a data point every 40000 points (0.00002% missing point
> > percentage)
> >
> > Row number: 2332801
> > Missing point percentage: 0.002142488793514752
> > source data size:18622424 byte
> > Write time: 110
> > encoding data size:653286 byte
> > Read time: 81
> > Compression ratio: 28.50577541842317
> >
> > As you could see above, the compression ratio would increase along with
> > the decreasing of the missing point percentage.
> >
> > In the high missing point percentage, the new encoding method has the
> > better performance on compressing data.
> >
> > While the missing point percentage came down to 0.0005%, both compression
> > performance of the encoding method would be almost the same.
> >
> > However, as the missing point percentage keep decreasing, the compression
> > ratio in the new encoding method almost won't increase anymore. On the
> > contrary, the performance of the original encoding method still improving
> > along with the decreasing of the missing point percentage.
> >
> > In the last comparison of the compression ratio, the original encoding
> > method could have the compression ratio up to 40x, which almost has the
> > same performance as encoding the data without missing point. But in the
> new
> > encoding method, the compression ratio still remain on 20x.
> >
> > The write time and the read time is also different between two encoding
> > method. With using the new encoding method, the write (encode)
> performance
> > has made an improvement.
> >
> > In conclusion, the new encoding method could be more adapted to encode
> > when the missing point percentage is high (above 1%), but in the lower
> > missing point percentage (which could be stated as under 1%), the
> original
> > encoding method may have better performance in the compression.
> >
> > As a result, do I need to make any changes to my implementation on the
> new
> > encoding method? Does this result meet what this issue need?
> >
> > If you have some problems on my testing or implementation, please welcome
> > to give me some advice.
> >
> > Best regards,
> > Tsung-Han Tsai
> > ________________________________
> > 寄件者: Jack Tsai
> > 寄件日期: 2019年7月3日 下午 03:58
> > 收件者: dev@iotdb.apache.org
> > 主旨: 回覆: 回覆: A new encoding method for regular timestamp column
> >
> > Hi,
> >
> > According to the result of the performance test I have done so far, I
> > found some difference between the new encoding method I created and the
> > original encoding method.
> >
> > I increased the data size to 2000000 rows (the data size in the original
> > test of the old encoding method was around 3000) when testing the new
> > encoding method. There are some aspects for the comparison:
> >
> >
> >   1.  Compression ratio
> >   2.  Write (encode) operation time
> >   3.  Read (decode) operation time
> >
> > I found that in the original encoding method, the determining factor for
> > the compression performance is not the number of the missing point in the
> > data. Actually, it is the density of the missing point. I would provide
> > some examples for you to understand.
> >
> >
> >   *   Missing a data point every 500 points
> >
> > ________________________________
> > From: Jialin Qiao <qj...@mails.tsinghua.edu.cn>
> > Sent: June 28, 2019 10:35 AM
> > To: dev@iotdb.apache.org
> > Subject: Re: Re: A new encoding method for regular timestamp column
> >
> > Hi,
> >
> > The bitmap is also needed. Just use another encoding for the regular
> > time column.
> >
> > Suppose the time column is "2, 3, 4, 6". The generated column is
> > "2, 3, 4, 5, 6".
> >
> > Then we could use "2 (the first value), 1 (the delta), 5 (total data
> > number)" to encode the original data.
> >
> > However, when decoding "2, 1, 5" into "2, 3, 4, 5, 6", we still need a
> > mark column such as "11101" to denote that the fourth data point is
> > generated.
> >
> > Best,
> > --
> > Jialin Qiao
> > School of Software, Tsinghua University
> >
> > 乔嘉林
> > 清华大学 软件学院
> >
> > > -----Original Message-----
> > > From: "Jack Tsai" <ja...@outlook.com>
> > > Sent: 2019-06-28 09:54:18 (Friday)
> > > To: "dev@iotdb.apache.org" <de...@iotdb.apache.org>
> > > Cc:
> > > Subject: Re: A new encoding method for regular timestamp column
> > >
> > > Hi Jialin,
> > >
> > > I am not sure if my understanding is right.
> > >
> > > Does it mean to encode the data without the bitmap?
> > >
> > > Best regards,
> > > Tsung-Han Tsai
> > >
> > > ________________________________
> > > From: Jialin Qiao <qj...@mails.tsinghua.edu.cn>
> > > Sent: June 28, 2019 8:59 AM
> > > To: dev@iotdb.apache.org
> > > Subject: Re: A new encoding method for regular timestamp column
> > >
> > > Hi, Tsung-Han
> > >
> > > Nice try!
> > > I think by "low compression ratio of regular time series (with missing
> > points)" you mean using TS2DIFF to encode the time column.
> > > I wonder if we could "generate a new data array with no missing points
> > (regular time series)",
> > > could we just use (1) the first value, (2) the delta and (3) the total
> > number of value to encode the raw data instead of using TS2DIFF?
> > >
> > > Best,
> > > --
> > > Jialin Qiao
> > > School of Software, Tsinghua University
> > >
> > > 乔嘉林
> > > 清华大学 软件学院
> > >
> > > > -----Original Message-----
> > > > From: "Jack Tsai" <ja...@outlook.com>
> > > > Sent: 2019-06-27 17:56:45 (Thursday)
> > > > To: "dev@iotdb.apache.org" <de...@iotdb.apache.org>
> > > > Cc:
> > > > Subject: A new encoding method for regular timestamp column
> > > >
> > > > Hi all,
> > > >
> > > > I am working on this issue:
> > > > https://issues.apache.org/jira/projects/IOTDB/issues/IOTDB-73
> > > >
> > > > The new encoding method deals with the problem of a regular time
> > > > series (one where the time elapsed between consecutive data points
> > > > is always the same) that has missing points. When missing points
> > > > exist in the data, the compression ratio drops from 40x to 8x.
> > > >
> > > > To solve this issue, here is my solution. I will explain it in two
> > > > parts: the write operation and the read operation.
> > > >
> > > > Write (Encode)
> > > >
> > > >   1.  First of all, calculate the delta between consecutive elements
> > > > in the data. If a delta value differs from the previous delta, it
> > > > can be stated that missing points exist in the data.
> > > >   2.  If missing points exist, get the minimum delta base between
> > > > elements. Generate a new data array with no missing points (a
> > > > regular time series).
> > > >   3.  Next, begin to write the info of this data into the byte array
> > > > output stream.
> > > >   4.  The first part is the identifier, a boolean value showing
> > > > whether missing points exist in the following data.
> > > >   5.  Compare the original data (the one with missing points) with
> > > > the new complete data. This forms a bitmap denoting the positions of
> > > > the missing points.
> > > >   6.  Convert the bitmap into a byte array and write it and its
> > > > length into the output stream.
> > > >   7.  Start to encode the data values (of the newly created array,
> > > > which has no missing points) into the output stream.
> > > >   8.  When the encoded data size reaches the target block size,
> > > > flush it into the output stream. Repeat until all values in the data
> > > > are flushed into the output stream.
> > > >
> > > > Read (Decode)
> > > >
> > > >   1.  First of all, decode the first boolean value in the buffer to
> > > > check whether the following data has missing points.
> > > >   2.  If missing points exist in the data, it means there is a byte
> > > > array for the bitmap in the following part of this buffer.
> > > >   3.  To decode the bitmap, first read the next int value, which is
> > > > the length of the following byte array of the bitmap.
> > > >   4.  Decode the byte array of the bitmap and convert it into the
> > > > bitmap denoting the missing points in the following data.
> > > >   5.  Decode the following data, which is the data array without
> > > > missing points, and compare it with the bitmap (per the bitmap's
> > > > mechanism, when a bit is 0, a missing point exists there; return
> > > > (read out) the value in the decoded data when the bit value is 1).
> > > >
> > > > In the original unit test, the compression ratio comes up to 20x
> > > > (the old encoding method results in 8x).
> > > >
> > > > To compare with the original encoding method more precisely, I will
> > > > soon run more detailed unit tests and a performance test for this
> > > > new encoding method.
> > > >
> > > > If anything is unclear or you think there is a problem in my
> > > > implementation, you are welcome to give me some advice.
> > > >
> > > > Best regards,
> > > > Tsung-Han Tsai
> >
>

Re: Re: A new encoding method for regular timestamp column

Posted by Jack Tsai <ja...@outlook.com>.
Hi,

I have made some improvements to the new encoding method I created before, since I think my initial version did not fully match the way users invoke the encoder's API.

The following are the details of the new implementation:

Write (Encode)

  1.  Invoke the RegularDataEncoder (the new encoder I created) to encode data points. Let's assume each data point is a timestamp in milliseconds.
  2.  Create a new array of default size for storing the incoming data.
  3.  When the data points fill the array, check whether missing points exist in it.
  4.  If missing points are detected, create a new array for storing the regular data, in which the time elapsed between consecutive data points is the same (calculate the minimum delta between points to get the time interval of the new regular data array).
  5.  Next, write the identifier that tells the decoder whether missing points exist in the following data, and create a bitmap denoting where the missing points are in this data array. Encode this info into the output stream.
  6.  In this step, encode the array info (the array size, the minimum delta value and the first value of this array) into the output stream.
  7.  Repeat steps 1-6 until the end of the data (see the sketch after this list).
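
To make these steps concrete, below is a minimal sketch of the write path in Java. This is not the actual RegularDataEncoder code: the class and helper names are hypothetical, and it assumes every gap is an exact multiple of the minimum delta and that a pack holds at least two points.

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;
    import java.util.BitSet;

    public class RegularEncoderSketch {

        /** Encodes one pack of millisecond timestamps (hypothetical layout). */
        public static void encodePack(long[] data, ByteArrayOutputStream out)
                throws IOException {
            DataOutputStream dos = new DataOutputStream(out);

            // Step 4: the minimum delta between consecutive points is the
            // time interval of the underlying regular series.
            long minDelta = Long.MAX_VALUE;
            for (int i = 1; i < data.length; i++) {
                minDelta = Math.min(minDelta, data[i] - data[i - 1]);
            }

            // The gap-free regular series would occupy this many slots.
            int totalSlots = (int) ((data[data.length - 1] - data[0]) / minDelta) + 1;
            boolean hasMissing = totalSlots != data.length;

            // Step 5: identifier, then a bitmap where bit = 1 marks a real point.
            dos.writeBoolean(hasMissing);
            if (hasMissing) {
                BitSet bitmap = new BitSet(totalSlots);
                for (long t : data) {
                    bitmap.set((int) ((t - data[0]) / minDelta));
                }
                byte[] bytes = bitmap.toByteArray();
                dos.writeInt(bytes.length);
                dos.write(bytes);
            }

            // Step 6: constant-size array info instead of per-point deltas.
            dos.writeInt(totalSlots);   // array size of the regular series
            dos.writeLong(minDelta);    // minimum delta value
            dos.writeLong(data[0]);     // first value of this pack
        }
    }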

Read (Decode)

  1.  Read the identifier of the pack (data array) to know whether missing points exist in the following data.
  2.  If missing points exist in the data, read the bitmap that was encoded.
  3.  Load the first value and the minimum delta base. Add the minimum delta base to the previous value (beginning from the first value of this pack) and store the result into the data array.
  4.  Once all the values in this pack have been loaded, read the bitmap together with the data array (when a bitmap bit is 0, a missing point exists at that position). If a missing point is detected, skip that data point in the array.
  5.  Repeat steps 1-4 until all the data has been read (see the sketch after this list).
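
A matching minimal sketch of the read path, again with hypothetical names, assuming the pack layout produced by the encoder sketch above:

    import java.io.DataInputStream;
    import java.io.IOException;
    import java.util.Arrays;
    import java.util.BitSet;

    public class RegularDecoderSketch {

        /** Decodes one pack and returns only the real (non-missing) points. */
        public static long[] decodePack(DataInputStream dis) throws IOException {
            // Steps 1-2: identifier, then the bitmap if missing points exist.
            boolean hasMissing = dis.readBoolean();
            BitSet bitmap = null;
            if (hasMissing) {
                byte[] bytes = new byte[dis.readInt()];
                dis.readFully(bytes);
                bitmap = BitSet.valueOf(bytes);
            }

            int totalSlots = dis.readInt();
            long minDelta = dis.readLong();
            long value = dis.readLong();   // first value of this pack

            // Steps 3-4: regenerate the regular series by repeatedly adding
            // minDelta, skipping slots whose bitmap bit is 0.
            long[] result = new long[totalSlots];
            int n = 0;
            for (int i = 0; i < totalSlots; i++) {
                if (bitmap == null || bitmap.get(i)) {
                    result[n++] = value;
                }
                value += minDelta;
            }
            return Arrays.copyOf(result, n);
        }
    }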

Result
Below are the test results of this encoder, compared with the results of the DeltaBinaryEncoder:

DeltaBinaryEncoder (the original encoding method)

  *   No missing points

Row number: 2332801
source data size:18662408 byte
Write time: 158
encoding data size:434016 byte
Read time: 68
Compression ratio: 42.99935486249355

  *   Missing a data point every 10 points (10% missing point percentage)

Row number: 2332801
Missing point percentage: 0.10000038580230375
source data size:16796160 byte
Write time: 296
encoding data size:2994679 byte
Read time: 173
Compression ratio: 5.6086679073116015

  *   Missing a data point every 20 points (5% missing point percentage)

Row number: 2332801
Missing point percentage: 0.05000040723576504
source data size:17729280 byte
Write time: 287
encoding data size:3161045 byte
Read time: 195
Compression ratio: 5.608676877425029

  *   Missing a data point every 80 points (1% missing point percentage)

Row number: 2332801
Missing point percentage: 0.012500423310861097
source data size:18429120 byte
Write time: 291
encoding data size:3285820 byte
Read time: 189
Compression ratio: 5.608682155443694

  *   Missing a data point every 1700 points (0.06% missing point percentage)

Row number: 2332801
Missing point percentage: 5.885628478382587E-4
source data size:18651424 byte
Write time: 199
encoding data size:651696 byte
Read time: 94
Compression ratio: 28.619822739436792

  *   Missing a data point every 40000 points (0.0025% missing point percentage)

Row number: 2332801
Missing point percentage: 2.5291484357259364E-5
source data size:18661936 byte
Write time: 232
encoding data size:443136 byte
Read time: 69
Compression ratio: 42.11333766608897

RegularDataEncoder (the new encoding method I created)

  *   No missing points

Row number: 2332801
source data size:18662408 byte
Write time: 83
encoding data size:382746 byte
Read time: 53
Compression ratio: 48.75925025996353

  *   Missing a data point every 10 points (10% missing point percentage)

Row number: 2332801
Missing point percentage: 0.10000038580230375
source data size:16796160 byte
Write time: 133
encoding data size:705320 byte
Read time: 108
Compression ratio: 23.81353144671922

  *   Missing a data point every 20 points (5% missing point percentage)

Row number: 2332801
Missing point percentage: 0.05000040723576504
source data size:17729280 byte
Write time: 125
encoding data size:727184 byte
Read time: 121
Compression ratio: 24.38073444960285

  *   Missing a data point every 80 points (1% missing point percentage)

Row number: 2332801
Missing point percentage: 0.012500423310861097
source data size:18429120 byte
Write time: 161
encoding data size:755895 byte
Read time: 159
Compression ratio: 24.380529041731986

  *   Missing a data point every 1700 points (0.06% missing point percentage)

Row number: 2332801
Missing point percentage: 5.885628478382587E-4
source data size:18651424 byte
Write time: 100
encoding data size:411117 byte
Read time: 73
Compression ratio: 45.36767878730386

  *   Missing a data point every 40000 points (0.0025% missing point percentage)

Row number: 2332801
Missing point percentage: 2.5291484357259364E-5
source data size:18661936 byte
Write time: 89
encoding data size:383943 byte
Read time: 65
Compression ratio: 48.60600662077444
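
(For reference, the compression ratio above is simply source data size divided by encoding data size; e.g., for the no-missing-points case, 18662408 / 382746 ≈ 48.76, which matches the printed figure.)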

Conclusion

As you can see above, whether or not missing points exist in the regular data, the RegularDataEncoder (the new encoding method) performs better in write time, read time and compression ratio.

As the missing point percentage decreases, the compression ratio increases. Even regular data without missing points gets a better compression ratio than with the DeltaBinaryEncoder (the original encoding method).

The reason the write time, read time and compression ratio improve to different degrees is that I changed what is written (encoded) into the output stream.

In the original encoding method, not only the first value, the minimum delta base and the pack length, but also the whole list of deltas between consecutive elements is encoded into the output stream, which slows down read and write performance.

After adopting Jialin's suggestion, only the missing point identifier, the first value, the minimum delta base and the pack length (plus the bitmap when needed) are encoded; while decoding, the minimum delta is repeatedly added to the previous value to rebuild the data array (with the bitmap when needed). The output stream therefore holds less data, and as the amount of data to encode and decode shrinks, both the write time and the read time improve.
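
For clarity, this is the per-pack layout the steps above imply, as I read them (my own summary; the field widths follow the sketches earlier in this mail, and the actual on-disk layout may differ):

    identifier (1 byte)
    bitmap length (4 bytes) + bitmap (~packLength/8 bytes)   <- only when missing points exist
    pack length (4 bytes)
    minimum delta (8 bytes)
    first value (8 bytes)

So each pack costs roughly 25 bytes of constant overhead plus the optional one-bit-per-slot bitmap, whereas a delta-based encoding pays a packed delta for every point.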

You are welcome to discuss it with me if you find any problems. I will open the pull request if everything is fine.

Best regards,
Tsung-Han Tsai

Re: Re: A new encoding method for regular timestamp column

Posted by Jack Tsai <ja...@outlook.com>.
Hi,

I also found that when adopting the new encoding method I created, the compression ratio and the operation time cost for data without missing points also improved. Please see the following for details:


  *   Original encoding method (data without missing points)

Row number: 2332801
source data size:18662408 byte
Write time: 158
encoding data size:434016 byte
Read time: 68
Compression ratio: 42.99935486249355


  *   New encoding method (data without missing points)

Row number: 2332801
source data size:18662408 byte
Write time: 78
encoding data size:361681 byte
Read time: 55
Compression ratio: 51.59908316997575

As the experimental results show, even when the data has no missing points, adopting the new encoding method may be the better choice.

Best regards,
Tsung-Han Tsai

Re: Re: A new encoding method for regular timestamp column

Posted by Jack Tsai <ja...@outlook.com>.
Hi,

First of all, thank you for your detailed explanation.

"But I have no idea about the read and write time cost. Why?"

As you can see, the write operation time improved. This is because I adopted Jialin's suggestion in the encoding process.

To reduce the time cost of the writing (encoding) process, the encoder writes only the first value together with the pack info, instead of writing a delta for every point of the pack (data array) into the output stream. This improves the writing process (see the sketch below).
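
To illustrate the asymmetry, here is a small self-contained sketch (hypothetical code, not the actual IoTDB encoders; the real TS2DIFF also bit-packs each delta, which costs even more than the plain writes shown here). The old-style path does one write per point, while the regular path writes a constant-size pack header plus an optional bitmap:

    import java.io.DataOutputStream;
    import java.io.IOException;

    public class WriteCostSketch {

        // Old style: one write per point, so encode work grows with the
        // number of points in the pack.
        static void deltaPerPoint(long[] data, DataOutputStream dos)
                throws IOException {
            long prev = data[0];
            dos.writeLong(prev);
            for (int i = 1; i < data.length; i++) {
                dos.writeLong(data[i] - prev);
                prev = data[i];
            }
        }

        // New style: constant-size pack info; the only per-point cost is
        // one bitmap bit, and only when missing points exist.
        static void regularPackInfo(long first, long minDelta, int packLength,
                                    byte[] bitmapOrNull, DataOutputStream dos)
                throws IOException {
            dos.writeBoolean(bitmapOrNull != null);
            if (bitmapOrNull != null) {
                dos.writeInt(bitmapOrNull.length);
                dos.write(bitmapOrNull);
            }
            dos.writeInt(packLength);
            dos.writeLong(minDelta);
            dos.writeLong(first);
        }
    }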

As for the reading process, apart from reading the bitmap and checking against it when missing points exist in the data, I use almost the same approach as the original decoding solution (read the first value and keep adding the min delta value to produce the following elements). Thus, the reading time cost remains nearly the same as with the original decoding method.

This is what I think accounts for the time cost difference between the two encoding methods. If you have a different opinion, you are welcome to discuss it with me.

Best regards,
Tsung-Han Tsai

