Posted to dev@iotdb.apache.org by 魏祥威 <52...@qq.com> on 2020/02/09 09:29:36 UTC

Re: [DISCUSS] Table schema of group by device

Hi,


I agree with the opinion of Xiangdong Huang.


(3) is the friendliest option for users who use a relational DB, and if they want a relational query (a group by device query), their applications should guarantee that data types are consistent.

Best,
Xiangwei Wei



------------------ Original Message ------------------
From: "Xiangdong Huang" <sainthxd@gmail.com>
Sent: Friday, February 7, 2020, 2:58 PM
To: "dev" <dev@iotdb.apache.org>

Subject: Re: [DISCUSS] Table schema of group by device



One more suggestion: "align by device" is clearer than "group by
device".

-----------------------------------
Xiangdong Huang
School of Software, Tsinghua University

黄向东
清华大学 软件学院


On Fri, Feb 7, 2020 at 2:56 PM, Xiangdong Huang <sainthxd@gmail.com> wrote:

> -1 for (2), forever; I think I will never vote +1 for it...
>
> If you do it like that, there is no chance of replacing those applications
> that use a relational DB to manage time series data.
>
> (3) is the friendliest for developers who use a relational DB, because
> when they write SQL like "select c1, c2, c3 FROM", they naturally expect
> the result set to have 3 columns...
>
> Of course, for users who use an RDB and want a table like "Time,
> DeviceId, s1, s2", their applications can guarantee that the data type of
> the data in s2 stays constant.
> If s2 holds several data types, RDB users may use a "text" or
> "varchar2" format directly.
>
> Considering that, I think the choice is: if all data in a column has the
> same data type, use the correct data type. Otherwise, use String.
>
> (1) Well, it can be an option. But my suggestion is: if all data in a
> column has the same data type, do not change its column name.
>
> Best,
> -----------------------------------
> Xiangdong Huang
> School of Software, Tsinghua University
>
> 黄向东
> 清华大学 软件学院
>
>
> On Fri, Feb 7, 2020 at 2:29 PM, Jialin Qiao <qiaojialin@apache.org> wrote:
>
>> Hi,
>>
>> In IOTDB-243 [1], we want to allow creating measurements with the same
>> name but different types in the same storage group.
>>
>> For example,
>> root.sg1.d1.s1 int32
>> root.sg1.d1.s2 int32
>> root.sg1.d2.s1 boolean
>> root.sg1.d2.s2 int32
>>
>> This may cause trouble in group by device queries. How do we organize the
>> result (table schema)? I thought of three ways:
>>
>> (1) Time, Device, s1_int32, s1_boolean, s2_int32
>>
>> * advantage:
>> - No ambiguity
>> - The number of columns is acceptable.
>>
>> * disadvantage:
>> - In most cases, the data type indicator is redundant and weird.
>> - Difficult to parallelize across devices in the query.
>>
>> (2) Time, d1, s1, s2 Time, d2, s1, s2
>>
>> * advantage:
>> - No ambiguity
>> - This could leverage the parallelization among devices in the query.
>>
>> * disadvantage:
>> - The number of columns may be large.
>>
>> (3) Time, DeviceId, s1, s2
>>
>> This may require a lot of work in the QueryDataSet, and users need to
>> read values carefully according to the measurement type of each device.
>> Otherwise, it may cause a RuntimeException in the JDBC client.
>>
>> * advantage:
>> - The number of columns is minimal.
>>
>> * disadvantage:
>> - May cause ambiguity: a column of one table has more than one type, which
>> also conflicts with the Spark connector or Hive connector.
>> - Difficult to use parallelization in the query.
>>
>> _______________
>>
>> From my perspective, I prefer (1) ≈ (2) > (3).
>>
>> What's your opinion?
>>
>> [1] https://issues.apache.org/jira/browse/IOTDB-243
>>
>> Thanks,
>> -----------------
>> Jialin Qiao
>> School of Software, Tsinghua University
>>
>> 乔嘉林
>> 清华大学 软件学院
>>
>
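
To make option (1) and the naming suggestion for it more concrete, here is a minimal Python sketch. It is only an illustration, not IoTDB code: the `SCHEMA` dictionary, the function name `layout_one_header`, and the flag `suffix_only_on_conflict` are all hypothetical. It builds the header for the four example series, once with a type suffix on every column and once with a suffix only where devices disagree on the type.

```python
# Hypothetical in-memory schema for the example series in this thread:
# device path -> {measurement name: data type}. Not an IoTDB API.
SCHEMA = {
    "root.sg1.d1": {"s1": "int32", "s2": "int32"},
    "root.sg1.d2": {"s1": "boolean", "s2": "int32"},
}


def layout_one_header(schema, suffix_only_on_conflict=False):
    """Build the column header for layout (1).

    By default every measurement column carries a type suffix. With
    suffix_only_on_conflict=True, a measurement whose type is the same on
    all devices keeps its plain name.
    """
    # Collect the distinct types seen for each measurement, in first-seen order.
    types_by_name = {}
    for measurements in schema.values():
        for name, dtype in measurements.items():
            seen = types_by_name.setdefault(name, [])
            if dtype not in seen:
                seen.append(dtype)

    header = ["Time", "Device"]
    for name in sorted(types_by_name):
        dtypes = types_by_name[name]
        if suffix_only_on_conflict and len(dtypes) == 1:
            header.append(name)
        else:
            header.extend(f"{name}_{dtype}" for dtype in dtypes)
    return header


always_suffixed = layout_one_header(SCHEMA)
conflict_only = layout_one_header(SCHEMA, suffix_only_on_conflict=True)
print(always_suffixed)
print(conflict_only)
```

With the flag set, `s2` keeps its plain name because both devices store it as int32, while `s1` still splits into two typed columns.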

Re: [DISCUSS] Table schema of group by device

Posted by Xiangdong Huang <sa...@gmail.com>.
Hi Jialin,

Very glad that you can agree with that. :-D

-----------------------------------
Xiangdong Huang
School of Software, Tsinghua University

 黄向东
清华大学 软件学院


On Tue, Feb 11, 2020 at 5:50 PM, Jialin Qiao <qi...@apache.org> wrote:

> Hi,
>
> If we use text when a column has multiple types, I'm ok with (3).
>
> Thanks,
> —————————————————
> Jialin Qiao
> School of Software, Tsinghua University
>
> 乔嘉林
> 清华大学 软件学院
>

Re: [DISCUSS] Table schema of group by device

Posted by Jialin Qiao <qi...@apache.org>.
Hi,

If we use text when a column has multiple types, I'm ok with (3).

Thanks,
—————————————————
Jialin Qiao
School of Software, Tsinghua University

乔嘉林
清华大学 软件学院
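
The rule the thread converges on for layout (3) can be sketched roughly as follows: keep the real data type when every device agrees on it, and fall back to text when they do not. This is a hedged illustration in Python rather than the actual IoTDB implementation; the `SCHEMA` dictionary, the type labels, and the function names `resolve_column_types` and `table_header` are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical in-memory schema for the example series in this thread:
# device path -> {measurement name: data type}. Not an IoTDB API.
SCHEMA = {
    "root.sg1.d1": {"s1": "INT32", "s2": "INT32"},
    "root.sg1.d2": {"s1": "BOOLEAN", "s2": "INT32"},
}


def resolve_column_types(schema):
    """Return {measurement: column type} for layout (3).

    A measurement keeps its type if all devices agree on it;
    otherwise the column is exposed as TEXT.
    """
    seen = defaultdict(set)
    for measurements in schema.values():
        for name, dtype in measurements.items():
            seen[name].add(dtype)
    return {
        name: next(iter(types)) if len(types) == 1 else "TEXT"
        for name, types in sorted(seen.items())
    }


def table_header(schema):
    """Header for layout (3): Time, Device, then one column per measurement."""
    return ["Time", "Device"] + list(resolve_column_types(schema))


columns = resolve_column_types(SCHEMA)
header = table_header(SCHEMA)
print(columns)
print(header)
```

Here `s1` becomes a text column because d1 stores it as INT32 and d2 as BOOLEAN, while `s2` stays INT32, so the number of columns stays at the minimum and each column still has a single type.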

