
Posted to dev@iotdb.apache.org by "RUI, LEI" <10...@qq.com> on 2019/06/24 14:19:50 UTC

Re: Discussion: IoTDB Query on Value Columns

This is the picture (bmp format) in 2.1.






------------------ Original Message ------------------
From: "suyue"<su...@mails.tsinghua.edu.cn>;
Sent: Monday, June 24, 2019, 10:14 PM
To: "dev"<de...@iotdb.apache.org>;

Subject: Re: Discussion: IoTDB Query on Value Columns



This is the picture in 2.1.
 

On June 24, 2019, at 9:58 PM, RUI, LEI <10...@qq.com> wrote:


1. Problem Description

Consider four data points (t,v) written to IoTDB in the following order:

(1,1)

(2,2)

(3,3)

(1,100)

Then, given the query “select * from root where v<10”, the expected result is (2,2),(3,3). This is because the later-inserted data point (1,100) should cover (i.e., overwrite) the earlier-inserted data point (1,1), and 100 does not satisfy v<10.

However, we find that IoTDB returns (1,100),(2,2),(3,3).

For more details, see JIRA-121 <https://jira.apache.org/jira/projects/IOTDB/issues/IOTDB-121>.
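
As a reference point, the expected result can be reproduced with a minimal Python sketch. It assumes last-write-wins semantics per timestamp, as described above; the helper name apply_writes is hypothetical and is not an IoTDB API.

```python
def apply_writes(writes):
    """Replay (t, v) writes in arrival order.

    A later write to the same timestamp replaces the earlier one.
    """
    latest = {}
    for t, v in writes:
        latest[t] = v
    return sorted(latest.items())


# The four points from the example, in write order.
writes = [(1, 1), (2, 2), (3, 3), (1, 100)]

visible = apply_writes(writes)                      # (1,100) covers (1,1)
expected = [(t, v) for t, v in visible if v < 10]   # then apply v<10

print(visible)   # [(1, 100), (2, 2), (3, 3)]
print(expected)  # [(2, 2), (3, 3)]
```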




2. IoTDB Background

2.1 Data Organization

In IoTDB, the above data points are divided into a sequential data source and an unsequential data source, as shown below.



2.2 Query Process

The execution process of the SQL “select * from root where v<10” is as follows:

(1) Create a timeGenerator for the value filter “v<10”. It returns the satisfying timestamps iteratively.

(2) Fetch the values at the timestamps generated by the timeGenerator.
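
The two steps can be sketched in Python as follows. This is only an illustration on a single in-memory series: the function names time_generator and fetch_by_time, the dict-based storage, and the sample values are assumptions made for this sketch, not IoTDB code.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def time_generator(series: Dict[int, int],
                   value_filter: Callable[[int], bool]) -> Iterator[int]:
    """Step (1): yield, in ascending order, the timestamps whose value
    passes the value filter."""
    for t in sorted(series):
        if value_filter(series[t]):
            yield t


def fetch_by_time(series: Dict[int, int],
                  timestamps: Iterable[int]) -> List[Tuple[int, int]]:
    """Step (2): look up the value stored at each generated timestamp."""
    return [(t, series[t]) for t in timestamps]


# Illustrative series (hypothetical values): timestamp -> value.
series = {1: 1, 2: 2, 3: 30}

rows = fetch_by_time(series, time_generator(series, lambda v: v < 10))
print(rows)  # [(1, 1), (2, 2)]
```

With a single data source the two steps agree with each other. The problem discussed in this thread only appears when step (1) and step (2) read from different views of the data.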

 

3. Analysis

3.1 Notation

s: data source

s1<s2: s2 has higher priority than s1, which means that data points in s2 always cover those with the same timestamps in s1.

ss: sequential data source.

us: unsequential data source. us>ss, i.e., the unsequential data source always has higher priority than the sequential data source.

merge(s1,s2): the union of the data points from s1 and s2. When a data point from s1 and a data point from s2 have the same timestamp, keep the one from the higher-priority source.

query(s): apply the query pushdown to the data source s and return the query result.
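
For readers who prefer code, merge and query can be modeled in a few lines of Python. This is a sketch under simplifying assumptions: a data source is a list of (t, v) tuples, the second argument of merge is the higher-priority source, and the predicate is passed to query explicitly. None of these names are IoTDB APIs.

```python
def merge(low, high):
    """Union of two sources; on equal timestamps the point from `high`
    (the higher-priority source) is kept."""
    merged = dict(low)
    merged.update(dict(high))
    return sorted(merged.items())


def query(source, predicate):
    """Keep only the points of `source` whose value satisfies the predicate."""
    return [(t, v) for t, v in source if predicate(v)]


ss = [(1, 1), (2, 2), (3, 3)]   # sequential data source
us = [(1, 100)]                 # unsequential data source, us > ss

print(merge(ss, us))                  # [(1, 100), (2, 2), (3, 3)]
print(query(us, lambda v: v < 10))    # []
```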

 

3.2 Current Query Plan

The current query plan in IoTDB is: timeGenerator=merge(query(ss),query(us))

Applied to the above example:

ss=((1,1),(2,2),(3,3))

us=(1,100)

query(ss)=((1,1),(2,2),(3,3))

query(us)=ϕ

timeGenerator=merge(query(ss),query(us))=((1,1),(2,2),(3,3))







Then the values are fetched at the timestamps generated by the above timeGenerator. Note that in this step the values are fetched from the merged data source, i.e., merge(ss,us). The final result is ((1,100),(2,2),(3,3)). This is where the bug comes from: timestamp 1 is a false positive in the timeGenerator, and no post-filter is applied to it.
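
The trace above can be replayed with a short Python sketch. It models data sources as lists of (t, v) tuples and uses hypothetical helpers merge and query that follow the definitions in 3.1; it is not the actual IoTDB code path.

```python
def merge(low, high):
    """Union of two sources; on equal timestamps `high` wins."""
    merged = dict(low)
    merged.update(dict(high))
    return sorted(merged.items())


def query(source, predicate):
    """Keep the points whose value satisfies the predicate."""
    return [(t, v) for t, v in source if predicate(v)]


def pred(v):
    return v < 10


ss = [(1, 1), (2, 2), (3, 3)]
us = [(1, 100)]

# Current plan: filter each source separately, then merge.
time_gen = merge(query(ss, pred), query(us, pred))
timestamps = [t for t, _ in time_gen]          # [1, 2, 3]

# Values are then fetched from the merged source, without a post-filter.
merged = dict(merge(ss, us))
result = [(t, merged[t]) for t in timestamps]
print(result)  # [(1, 100), (2, 2), (3, 3)]

# Timestamps whose fetched value does not satisfy the filter.
false_positives = [t for t, v in result if not pred(v)]
print(false_positives)  # [1]
```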

 

3.3 Possible Solutions

We have come up with several alternative solutions.

(1) timeGenerator=query(merge(ss,us))

(2) timeGenerator=query(merge(query(ss),us))

(3) timeGenerator=query(merge(query(ss),query(us)))



(1) is the simplest solution: merge first, then filter, with no pushdown to either source.

(2) and (3) have different advantages.

(2): The query condition is pushed down to ss first and then applied to the merged result of query(ss) and us. When the selection query (corresponding to the timeGenerator) and the projection query have series in common, the values of those series cached in the timeGenerator can be used to speed up the projection.

(3): The query condition is pushed down to the unsequential data source as well, so data not satisfying the query condition can be filtered out at an early stage.
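
To compare the candidates, the three formulas can be evaluated on the example from section 1 with a small Python sketch. It takes the notation in 3.1 literally (merge keeps the higher-priority point, query keeps the points satisfying v<10); the helpers are hypothetical and do not model any extra bookkeeping an actual implementation might add.

```python
def merge(low, high):
    """Union of two sources; on equal timestamps `high` wins."""
    merged = dict(low)
    merged.update(dict(high))
    return sorted(merged.items())


def query(source):
    """Apply the value filter v<10 to a source."""
    return [(t, v) for t, v in source if v < 10]


ss = [(1, 1), (2, 2), (3, 3)]
us = [(1, 100)]

plan1 = query(merge(ss, us))
plan2 = query(merge(query(ss), us))
plan3 = query(merge(query(ss), query(us)))

print(plan1)  # [(2, 2), (3, 3)]
print(plan2)  # [(2, 2), (3, 3)]
print(plan3)  # [(1, 1), (2, 2), (3, 3)]
```

Under this literal reading, (1) and (2) yield timestamps 2 and 3, while (3) still yields timestamp 1, because (1,100) is dropped from us before it can cover (1,1). An implementation of (3) would presumably need to keep such filtered-out unsequential points visible to merge; that part is not modeled here.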




3.4 Discussion

Does anyone know of any mature solutions in other systems? Or which solution do you think is better, (2) or (3)?

Looking forward to your advice.




Sincerely,

Lei Rui, Yue Su

Re: Discussion: IoTDB Query on Value Columns

Posted by Xiangdong Huang <sa...@gmail.com>.

Hi all,

The mailing list does not support media (e.g., images or document
attachments), so if you want to share images or other materials,
please submit them on JIRA :D

By the way, it is better to limit the number of words per line on the
mailing list; it is friendlier for users with small screens.

Besides, your solutions make sense. I think this is a common but important
issue for LSM-based systems, so I suggest having a look at how
Cassandra, HBase, etc. handle it.

Best,
-----------------------------------
Xiangdong Huang
School of Software, Tsinghua University

 黄向东
清华大学 软件学院


RUI, LEI <10...@qq.com> wrote on Monday, June 24, 2019 at 10:20 PM:

> For more details, see JIRA-121
> <https://jira.apache.org/jira/projects/IOTDB/issues/IOTDB-121>.