You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@iotdb.apache.org by Lei Rui <ne...@163.com> on 2019/10/06 15:15:21 UTC

Re: A new result set format

Hi,

I have finished the PR#433 (add "group by device" function for narrow table display) to support this user demand.

Below are my development notes for your information.

Contents
# ORIGINAL DEMANDS
# DEMAND ANALYSIS
# DEMAND INDUCTION
# REALIZATION OF GROUP BY DEVICE - Design Details
# REALIZATION OF GROUP BY DEVICE - Code Realization
# USER GUIDANCE OF GROUP BY DEVICE

# ORIGINAL DEMANDS
1. As stated in JIRA IOTDB-203 and discussed in this email thread, a new result set format / narrow table / group by device
2. filter the devices using regular expression. For example, "root.sg.17xx" to select all devices starting with "17".

As for group by path: this PR#433 is proposed to support it.
As for range select path: Now only support select all using * or select an exact one. Others like "select root.sg.17xx" are not supported.

# DEMAND INDUCTION
## Demand 1: group by path

Question 1: Path only mean device-level path?
- Yes. Driven by real demands, there is no demand on the group-by of higher level paths such as storage groups. For now only group by device is in need.

Question 2: Do different devices need to explicitly select different measurements? E.g., select d0.s1, d0.s2, d1.s0, d1.s3 from root.sg group by device?
- No. It is validated that users don't have this demand.

## Demand 2: range select path
The user wants to filter the devices using regular expression such as "root.sg.17xx".
However, from another point of view, IoTDB can use different levels of a path to achieve the same effect as range select. For example, "root.sg.17.xx" instead of "root.sg.17xx".
This demand still needs more discussion and is not realized in this PR#433. For now we suggest the user to record this demand on the JIRA.

# REALIZATION OF GROUP BY DEVICE - Design Details
## Question 1: How to design sql rules to avoid different devices explicitly selecting different measurements?
Solution:
select <suffixPath> from <prefixPath>
Restrict that the suffixPath can only be single level. In other words, the suffixPath can only be measurements or STAR, without DOT.

Correct example: select s0,s1 from root.sg1.* group by device
Correct example: select s0,s1 from root.sg1.d0, root.sg1.d1 group by device
Correct example: select * from root.sg1.* group by device
Correct example: select * from root group by device
Correct example: select s0,s1,* from root.*.* group by device

Wrong example: select d0.s1, d0.s2, d1.s0 from root.sg1 group by device
Wrong example: select *.s0, *.s1 from root.* group by device
Wrong example: select *.*.* from root group by device

## Question 2: What if different devices with the same measurement have different data types?
Solution:
If it happens, reject service.
Restrict that the data types of the same measurement column should be the same across devices.
Note that when it comes to aggregated paths, the data type of the measurement column will reflect the aggregation function rather than the original timeseries.

Correct example: select s0 from root.sg1.d0,root.sg1.d1 group by device
root.sg1.d0.s0 and root.sg1.d1.s0 are both INT32.
Correct example: select count(s0) from root.sg1.d0,root.sg1.d1 group by device
count(root.sg1.d0.s0) and count(root.sg1.d1.s0) are both INT64.
Wrong example: select s0 from root.sg1.d0, root.sg2.d3 group by device
root.sg1.d0.s0 is INT32 while root.sg2.d3.s0 is FLOAT.

## Question 3: Show nonexistent timeseries or not in the result table?
Solution:
The principle is that only when the column (or row) has existing data will the column (or row) be shown, with nonexistent cells being null.

For example, "select s0,s1,s2 from root.sg.d0, root.sg.d1, root.sg.d2 group by device".
Suppose that the actual existing timeseries are as follows:
- root.sg.d0.s0
- root.sg.d0.s1
- root.sg.d1.s0
Then the header of the result table will be: [Time, Device, s0, s1].
And you could expect a table like:
| Time | Device | s0 | s1 |
| 1 |root.sg.d0| 20 | 2.5|
| 2 |root.sg.d0| 23 | 3.1|
| 1 |root.sg.d1| 12 |null|
| 2 |root.sg.d1| 19 |null|
Note that the cells of measurement 's0' and device 'root.sg.d1' are all null.
Also note that the column of 's2' and the rows of 'root.sg.d2' are not existent.

## Question 4: How to handle duplicated devices?
Solution:
The duplicated devices in the prefix paths are neglected.
For example, "select s0,s1 from root.sg.d0,root.sg.d0,root.sg.d1 group by device" is equal to "select s0,s1 from root.sg.d0,root.sg.d1 group by device".
For example. "select s0,s1 from root.sg.*,root.sg.d0 group by device" is equal to "select s0,s1 from root.sg.* group by device".

# REALIZATION OF GROUP BY DEVICE - Code Realization
1. sql lexer & parser
related classes: TSLexer, TSParser

2. sql -> logical operator
related classes: QueryOperator, LogicalGenerator

3. logical opeator -> optimized logical operator
related classes: ConcatPathOptimizer

4. optimized logical operator -> physical plan (core step)
related classes: QueryPlan, PhysicalGenerator

5. physical plan -> executeQueryStatement & client
related classes: QueryPlan, TSServiceImpl, SQLConstant

6. physical plan -> fetchResults & server
related classes: QueryPlan, AbstractQueryProcessorExecutor, DeviceIterateDataSet, QueryDataSet

7. For table display format
related classes: AbstractClient

# USER GUIDANCE OF GROUP BY DEVICE
GroupbyDeviceClause : GROUP BY DEVICE
Rules:
1. Both uppercase and lowercase are ok.
Correct example: select * from root.sg1 group by device
Correct example: select * from root.sg1 GROUP BY DEVICE
2. GroupbyDeviceClause can only be used at the end of a query statement.
Correct example: select * from root.sg1 where time > 10 group by device
Wrong example: select * from root.sg1 group by device where time > 10
3. The paths of the SELECT clause can only be single level. In other words, the paths of the SELECT clause can only be measurements or STAR, without DOT.
Correct example: select s0,s1 from root.sg1.* group by device
Correct example: select s0,s1 from root.sg1.d0, root.sg1.d1 group by device
Correct example: select * from root.sg1.* group by device
Correct example: select * from root group by device
Correct example: select s0,s1,* from root.*.* group by device
Wrong example: select d0.s1, d0.s2, d1.s0 from root.sg1 group by device
Wrong example: select *.s0, *.s1 from root.* group by device
Wrong example: select *.*.* from root group by device
4. The data types of the same measurement column should be the same across devices.
Note that when it comes to aggregated paths, the data type of the measurement column will reflect the aggregation function rather than the original timeseries.
Correct example: select s0 from root.sg1.d0,root.sg1.d1 group by device
root.sg1.d0.s0 and root.sg1.d1.s0 are both INT32.
Correct example: select count(s0) from root.sg1.d0,root.sg1.d1 group by device
count(root.sg1.d0.s0) and count(root.sg1.d1.s0) are both INT64.
Wrong example: select s0 from root.sg1.d0, root.sg2.d3 group by device
root.sg1.d0.s0 is INT32 while root.sg2.d3.s0 is FLOAT.
5. The display principle of the result table is that only when the column (or row) has existing data will the column (or row) be shown, with nonexistent cells being null.
For example, "select s0,s1,s2 from root.sg.d0, root.sg.d1, root.sg.d2 group by device". Suppose that the actual existing timeseries are as follows:
- root.sg.d0.s0
- root.sg.d0.s1
- root.sg.d1.s0
Then the header of the result table will be: [Time, Device, s0, s1].
And you could expect a table like:

| Time | Device | s0 | s1 |
| 1 |root.sg.d0| 20 | 2.5|
| 2 |root.sg.d0| 23 | 3.1|
| 1 |root.sg.d1| 12 |null|
| 2 |root.sg.d1| 19 |null|
Note that the cells of measurement 's0' and device 'root.sg.d1' are all null.
Also note that the column of 's2' and the rows of 'root.sg.d2' are not existent.
6. The duplicated devices in the prefix paths are neglected.
For example, "select s0,s1 from root.sg.d0,root.sg.d0,root.sg.d1 group by device" is equal to "select s0,s1 from root.sg.d0,root.sg.d1 group by device".
For example. "select s0,s1 from root.sg.*,root.sg.d0 group by device" is equal to "select s0,s1 from root.sg.* group by device".
7. The duplicated measurements in the suffix paths are not neglected.
For example, "select s0,s0,s1 from root.sg.* group by device" is not equal to "select s0,s1 from root.sg.* group by device".
8. More correct examples:
- select * from root.vehicle group by device
- select s0,s0,s1 from root.vehicle.* group by device
- select s0,s1 from root.vehicle.* limit 10 offset 1 group by device
- select * from root.vehicle slimit 10 soffset 2 group by device
- select * from root.vehicle where time > 10 group by device
- select * from root.vehicle where root.vehicle.d0.s0>0 group by device
- select count(*) from root.vehicle group by device
- select sum(*) from root.vehicle GROUP BY (20ms,0,[2,50]) group by device
- select * from root.vehicle where time = 3 Fill(int32[previous, 5ms]) group by device

---
Sincerely,
Lei Rui
On 9/8/2019 04:36，Julian Feinauer<j....@pragmaticminds.de> wrote：
Hi,

I think this is one oft he most important discussion to make it easily accessible to users.
As Xiangdong states, we have to make it comfortable for users to map to relational schemes.

And I agree that there are use cases for both, so probably really just provide both.

Julian

Am 07.09.19, 13:18 schrieb "Xiangdong Huang" <sa...@gmail.com>:

Hi,

Glad to see this discussion.

I am on my travel so I have no enough time these days to join with you but
I will never miss this discussion..

The discussion is important because the query is the highest frequent
function that user use IoTDB and make data valuable.
Besides, it will impact how we transfer ioTDB's schema view to a relational
schema view (e.g., be integrated with Calcite.).

I think Lei Rui got the key difference: the wide, the narrow and the
narrowest table depends on "how to align the data".

- If we want to align all timeseries according to the timestamp, then it is
a wide table. (minor issue: I think for sql "select * from root.sg_1", the
printed result should be "d1.s1, d1.s2..." rather than "root.sg1.d1.s1,
root.sg1.d1.s2". So the table's head row can be concise)

- If we want to align all timeseries that belong to the same device (i.e.,
the data source) according to the time stamp, then it is a narrow table.

- If we do not want to align data, then it is the narrowest table.

I do not know which one that users like. If we decide support all the
three format, maybe an ALIGN clause can be introduced in our SQL. (Well,
Jialin said it as "Group By", I am not sure which one is better).

Best,

在 2019年9月7日星期六，Rui, Lei <ne...@163.com> 写道：

Sorry, pictures cannot be attached in the last email I sent . So I
supplement them here in text.
The "wide" table is:
| time | root.sg_1.device_1.sensor_1 | root.sg_1.device_1.sensor_2 |
root.sg_1.device_2.sensor_1 | root.sg_1.device_2.sensor_2 |
| 1 | 100 | 2.5 | 99 | 1.3 |
| ... | ... | ... | ... | ... |

The "narrow" table is :
| time | device_Id | sensor_1 | sensor_2 |
| 1 | root.sg_1.device_1 | 100 | 2.5 |
| ... | ... | ... | ... |
| 1 | root.sg_1.device_2 | 99 | 1.3 |
| ... | ... | ... | ... |
On 9/7/2019 15:51，Rui, Lei<ne...@163.com> wrote：
Hi,

I try to make this proposal more concrete from a semantic perspective.

Consider the sql "select * from root.sg_1". The following format is the
"wide" table:

The following format is the "narrow" table:

The levels of data from low to high are:
- sensor data, or series data, e.g., from root.sg_1.device_1.sensor_1
- device data, e.g., from root.sg_1.device_1
- storage group data , e.g., from root.sg_1

So, the sql "select * from root.sg_1" queries data at the storage group
level. To present the results,
the wide table aligns all series data across multiple devices in the
storage group by timestamp,
while the narrow table aligns series data in a single device by timestamp,
and does the same for other devices in the storage group.

By the way, I guess the "narrowest" table is for a single sensor's data,
without the need to align with any other series data.

I have one question:
Why not make full use of sql and just use "select * from
root.sg_1.device_1" to specify the device (or the data level) they care
about?
Why use "select * from root.sg_1" with a narrow table format?

Lastly, I think the better query execution efficiency that a narrow table
may sometimes has is not the drive purpose,
because presenting the query result in a wide table and in a narrow table
are two different tasks.

Sincerely,
Lei Rui

From： Jialin Qiao <qj...@mails.tsinghua.edu.cn>
Date： 9/7/2019 15:26
To： <de...@iotdb.apache.org>
Subject： Re: A new result set format
Hi Julian，

He is my friend and contacted me offline, because I advertise IoTDB in my
weChat(like facebook or twitter).

Next time I will try to let him put issue in the mail list himself :)

Best,
--
Jialin Qiao
School of Software, Tsinghua University

乔嘉林
清华大学 软件学院

-----原始邮件-----
发件人: "Julian Feinauer" <j....@pragmaticminds.de>
发送时间: 2019-09-07 13:52:17 (星期六)
收件人: "dev@iotdb.apache.org" <de...@iotdb.apache.org>
抄送:
主题: Re: A new result set format

Hi Jialin,

perhaps one question about "wanted by users" means (as I didn’t see
anything on the list).
How do these users get in contact with you?

Julian

Am 07.09.19, 04:29 schrieb "Jialin Qiao" <qj...@mails.tsinghua.edu.cn>:

Hi,

As described in this issue, a new result set format is wanted by users.
I'd like to open a discussion here.

For simplicity, I refer this format "time, root.sg1.d1.s1, root.sg1.d2.s1"
to wide table, and "time, deviceId, s1" as narrow table.

This issue is not only about how to organize the results, but also the
query process.

There are some advantages about narrow table.

(1) For wide table, we need to open a SeriesReader for each series at the
same time, each SeriesReader holds some ChunkMetadatas. For narrow table,
we only need to open SeriesReaders for one device at one time, then return
results and open SeriesReaders for the next device, which occupies less
memory compared to the wide table.
(2) Avoid reading all series at once may also improve the query latency.

There is also a question:

(1) If we show result in the narrow table format for users, do we need to
highlight the concept of table and device?
(2) If the answer of the first question is yes, do we need to support sql:
"select time, deviceId, s1, s2, s3 from root.sg1 where deviceId=d1"? This
may involve a lot of work...

From my side, I prefer the answers of the two questions are all NO. Then
we do not need to change the sql grammar and only use a new query process
to organize the result set.

Best,
--
Jialin Qiao
School of Software, Tsinghua University

乔嘉林
清华大学 软件学院

-----原始邮件-----
发件人: "Jialin Qiao (Jira)" <ji...@apache.org>
发送时间: 2019-09-07 09:40:00 (星期六)
收件人: dev@iotdb.apache.org
抄送:
主题: [jira] [Created] (IOTDB-203) A new result set format

Jialin Qiao created IOTDB-203:
---------------------------------

Summary: A new result set format
Key: IOTDB-203
URL: https://issues.apache.org/jira/browse/IOTDB-203
Project: Apache IoTDB
Issue Type: New Feature
Reporter: Jialin Qiao

When executing a SQL like "select d1.s1, d2.s1 from root.sg1", the default
result set format in IoTDB is

"time, root.sg1.d1.s1, root.sg1.d2.s1"

1 , 1, 1

2, 2, 2

However, some users want to get another format, The results could be
grouped by device, then sorted by time.

"time, deviceId, s1".

1, root.sg1.d1, 1

2, root.sg1.d2, 2

This can be done in the client, but it would be better if we support this
format in the server.

--
This message was sent by Atlassian Jira
(v8.3.2#803003)

--
-----------------------------------
Xiangdong Huang
School of Software, Tsinghua University

黄向东
清华大学 软件学院