Posted to dev@iotdb.apache.org by "DaweiLiu (Jira)" <ji...@apache.org> on 2020/02/21 15:48:00 UTC

[jira] [Created] (IOTDB-509) Optimize TsFileReader to reduce unnecessary GC and IO.

DaweiLiu created IOTDB-509:
------------------------------

             Summary: Optimize TsFileReader to reduce unnecessary GC and IO.
                 Key: IOTDB-509
                 URL: https://issues.apache.org/jira/browse/IOTDB-509
             Project: Apache IoTDB
          Issue Type: Wish
          Components: Core/TsFile
            Reporter: DaweiLiu


I think there are still two parts of TsFile reading that can be optimized:
 # Reduce unnecessary IO. Reads are currently performed at the chunk level. I think we can keep a page index alongside the chunk. When the time range of the filter contains the time range of the chunk, the whole chunk is read and returned directly. When the two ranges only intersect, we can consult the page index to decide which pages to read, which avoids reading unnecessary data.
 # Reduce GC. Query results are returned as BatchData, and each BatchData holds exactly one page of data. Every call to next() therefore creates a new BatchData, so a query that goes through 10000 pages creates 10000 BatchData objects. I think we should decouple the returned data from the page. We would still do IO and deserialization/decoding from disk one page at a time, but what is handed to the caller should be a reusable, fixed-length data structure, just like read(ByteBuffer) in the JDK.
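
As a rough illustration of the first point, here is a minimal Java sketch of page selection through a page index. The names PageIndexEntry, PageSelector, selectPages and coversChunk are hypothetical and are not part of the TsFile API. The sketch also assumes that the index is ordered by time and that filters are closed time ranges.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical page index entry: time range of one page plus its position in the chunk. */
final class PageIndexEntry {
    final long startTime;
    final long endTime;
    final long offset; // byte offset of the page inside the chunk (assumption)
    final int length;  // byte length of the page (assumption)

    PageIndexEntry(long startTime, long endTime, long offset, int length) {
        this.startTime = startTime;
        this.endTime = endTime;
        this.offset = offset;
        this.length = length;
    }
}

final class PageSelector {
    private PageSelector() {}

    /** True when [filterStart, filterEnd] contains the whole chunk, so the chunk can be read in one go. */
    static boolean coversChunk(List<PageIndexEntry> index, long filterStart, long filterEnd) {
        if (index.isEmpty()) {
            return false;
        }
        long chunkStart = index.get(0).startTime;
        long chunkEnd = index.get(index.size() - 1).endTime;
        return filterStart <= chunkStart && chunkEnd <= filterEnd;
    }

    /** Returns only the pages whose time range intersects [filterStart, filterEnd]. */
    static List<PageIndexEntry> selectPages(List<PageIndexEntry> index, long filterStart, long filterEnd) {
        List<PageIndexEntry> selected = new ArrayList<>();
        for (PageIndexEntry page : index) {
            if (page.startTime <= filterEnd && page.endTime >= filterStart) {
                selected.add(page);
            }
        }
        return selected;
    }
}
```

A reader could call coversChunk first and fall back to selectPages only when the ranges merely intersect, then issue one read per selected (offset, length) pair.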



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Re: [jira] [Created] (IOTDB-509) Optimize TsFileReader to reduce unnecessary GC and IO.

Posted by Dawei Liu <at...@163.com>.

Hi,

I want to make my point clearer.

1. The number of pages in a chunk is not fixed; a chunk may have few or many pages. Let us assume that each chunk contains two pages of 64K each.
With a filter time = x we need only one point, yet we have to read the whole chunk to get it.
This means we read 128K of data but use only 16 bytes, so we waste about 127.9K.

If the user runs "select * from root.sg1" over N * M sensor series (N is the number of devices), and we assume N * M is 50000, then we read about 6G of data from disk into the cache and use only about 782K of it.

In my opinion, the waste is even more obvious for a query with a valueFilter, because the timeFilter then effectively looks like time in (1, 200, 3000).

Therefore, using a pageIndex can significantly reduce the amount of data read from disk, and the benefit is larger in random read scenarios.

When the pageIndex is used to read only the needed page, the amount read from disk drops by about 3G.

With a disk read speed of 200M/s, a rough calculation gives a saving of about 15s.
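
The rough numbers above can be reproduced with a small Java sketch. The class name IoSavingsEstimate is hypothetical. The constants come from the assumptions in this mail (two 64K pages per chunk, 16 bytes used per series, 50000 series, 200M/s disk), with K, M and G taken as binary units, which is an assumption.

```java
/** Back-of-the-envelope estimate for chunk-level versus page-level reads. */
final class IoSavingsEstimate {
    static final long PAGE_BYTES = 64L * 1024;              // 64K per page
    static final int PAGES_PER_CHUNK = 2;                   // assumed chunk layout
    static final long POINT_BYTES = 16;                     // bytes actually used per series
    static final long SERIES = 50_000;                      // N * M sensor series
    static final long DISK_BYTES_PER_SECOND = 200L * 1024 * 1024;

    private IoSavingsEstimate() {}

    /** Bytes read when every series reads one whole chunk. */
    static long totalChunkBytes() {
        return SERIES * PAGE_BYTES * PAGES_PER_CHUNK;
    }

    /** Bytes read when every series reads only the one page it needs. */
    static long totalPageBytes() {
        return SERIES * PAGE_BYTES;
    }

    /** Bytes that the query actually uses. */
    static long usefulBytes() {
        return SERIES * POINT_BYTES;
    }

    /** Disk time saved by reading one page instead of the whole chunk. */
    static double secondsSaved() {
        return (double) (totalChunkBytes() - totalPageBytes()) / DISK_BYTES_PER_SECOND;
    }

    public static void main(String[] args) {
        double gib = 1024.0 * 1024 * 1024;
        System.out.printf("chunk-level read: %.2f G%n", totalChunkBytes() / gib);
        System.out.printf("page-level read:  %.2f G%n", totalPageBytes() / gib);
        System.out.printf("useful data:      %.2f K%n", usefulBytes() / 1024.0);
        System.out.printf("time saved:       %.3f s%n", secondsSaved());
    }
}
```

Under these assumptions the chunk-level read is 6,553,600,000 bytes (about 6.1G), the page-level read is half of that (about 3.05G), the useful data is 800,000 bytes (about 781K), and the saving is 15.625 s, which matches the rounded figures above.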

2. I agree with your idea of using object pooling to solve the GC problem.



Thanks
---
Dawei Liu



> On Feb 22, 2020, at 10:27 AM, Jialin Qiao <qjl16@mails.tsinghua.edu.cn <ma...@mails.tsinghua.edu.cn>> wrote:
> 
> Hi,
> 
> Interesting thoughts!
> 
> (1) A page-level index could optimize the scenario where a chunk has many pages.
> When a chunk has only a few pages, reading the whole chunk at once may still be better. We could leave it as an option.
> (2) The queried BatchData is never modified and is discarded after it is returned to the client through RPC. We could use a pool for BatchData,
> just like the MemtablePool, to reuse BatchData.
> 
> Thanks,
> --
> Jialin Qiao
> School of Software, Tsinghua University
> 
> 乔嘉林
> 清华大学 软件学院
> 

Re: [jira] [Created] (IOTDB-509) Optimize TsFileReader to reduce unnecessary GC and IO.

Posted by Jialin Qiao <qj...@mails.tsinghua.edu.cn>.

Hi,

Interesting thoughts!

(1) A page-level index could optimize the scenario where a chunk has many pages.
When a chunk has only a few pages, reading the whole chunk at once may still be better. We could leave it as an option.
(2) The queried BatchData is never modified and is discarded after it is returned to the client through RPC. We could use a pool for BatchData,
just like the MemtablePool, to reuse BatchData.
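
To make the pooling idea in (2) concrete, here is a minimal Java sketch. ReusableBatch and BatchPool are hypothetical names; they stand in for BatchData and are not the actual IoTDB classes, and the pool is not modeled on the real MemtablePool code. The sketch is single-threaded; a real pool would need synchronization and probably a size limit.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical stand-in for BatchData: fixed-capacity arrays that can be refilled. */
final class ReusableBatch {
    final long[] times;
    final double[] values;
    int size;

    ReusableBatch(int capacity) {
        this.times = new long[capacity];
        this.values = new double[capacity];
    }

    /** Appends one point; returns false when the batch is already full. */
    boolean put(long time, double value) {
        if (size == times.length) {
            return false;
        }
        times[size] = time;
        values[size] = value;
        size++;
        return true;
    }

    /** Marks the batch as empty so that its arrays can be reused. */
    void reset() {
        size = 0;
    }
}

/** Hypothetical pool: hands out released batches before allocating new ones. */
final class BatchPool {
    private final Deque<ReusableBatch> free = new ArrayDeque<>();
    private final int batchCapacity;
    private int created;

    BatchPool(int batchCapacity) {
        this.batchCapacity = batchCapacity;
    }

    ReusableBatch acquire() {
        ReusableBatch batch = free.pollFirst();
        if (batch == null) {
            created++;
            batch = new ReusableBatch(batchCapacity);
        }
        return batch;
    }

    /** Call once the batch has been serialized to the client and is no longer referenced. */
    void release(ReusableBatch batch) {
        batch.reset();
        free.addFirst(batch);
    }

    int createdCount() {
        return created;
    }
}
```

With this shape, a query that acquires and releases one batch per page allocates a single ReusableBatch for the whole query instead of one object per page, provided each batch is released before the next is acquired.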

Thanks,
--
Jialin Qiao
School of Software, Tsinghua University

乔嘉林
清华大学 软件学院
