You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@iotdb.apache.org by Jiaye Wu <wu...@gmail.com> on 2019/01/22 14:52:44 UTC

Design of Time Series Similarity Search (KV-match) Integration - SQL Part

Hi all,

Let's start the thread to discuss about the integration. As the first step,
we need to design the SQL interface.

Time series similarity search is a fundamental problem in data mining. It
is quite useful for analyzing sensor data stored in IoTDB.
KV-match is one of approaches to solve this problem, which supports both
raw subsequence matching (RSM) and constrained normalized subsequence
matching (cNSM) under either Euclidean Distance (ED) or Dynamic Time
Warping (DTW).

As KV-match is just one implementation, in SQL layer we should add a new
query type as follows.


> *SELECT similar(<pattern_path>, <parameter_1[,parameter_2,...]>)
> from <query_path> where <single_range_time_criterion>;*


IoTDB should can be configured to use KV-match or other algorithms, or even
provide a naive brute force option for user to choose.
Add a configuration e.g.

similarity-search-implementation = KV-match | KV-match_simple | Naive | ...


The parameter list is highly related to the implementation.
For KV-match, we have following parameters (Please refer to the paper for
their meanings):

query_type           : enum   [RSM|cNSM] (required)
> distance_measure     : enum   [ED|DTW]   (required)
> distance_threshold   : double            (required)
> amplitude_scaling    : double            (optional, for cNSM only, default
> 1.1)
> offset_shifting      : double            (optional, for cNSM only, default
> 5.0)

warping_path         : double            (optional, for  DTW only, default
> 5.0)

max_answers          : int               (optional, default 100)

remove_overlapped    : bool              (optional, default TRUE)

overlapped_threshold : double            (optional, default 0.5)


It is ideal that IoTDB can support temporary memory tables (time series)
which is associated with session life-cycle, likes MySQL as follows:

*CREATE TEMPORARY TIMESERIES <pattern_path> SELECT *
> FROM <pattern_original_from_path> WHERE <criterion>;*


User can use the above SQL to select a pattern from other series and modify
it if they want. Then pass the <pattern_path> to the first SQL for
similarity search.
If this can not be implemented in the near future, we can just let user
store the pattern to a normal path and use the full series as the pattern
for query.

The result of similarity search will like follows:

+--------------------------+

| Rank |   Time | Distance |

+--------------------------+

|    1 |    124 |      0.3 |
> |    2 |    544 |      1.4 |
> |    3 |    346 |      3.5 |

+--------------------------+


KV-match or other approaches may need to build index on timeseries to
facilitate queries. IoTDB should support index creation SQL.

*CREATE INDEX
> ON <query_path> USING <index_type> <parameter_1[,parameter_2,...]>*



For KV-match, we use KV-index as our index structure, which has following
parameters (Please refer to the paper for their meanings):

window_lengths       : [int, int, ...]   (optional, default
> [25,50,100,200,400])
> merge_enabled        : bool              (optional, default TRUE)
> initial_row_range    : double            (optional, default 0.5)


Any suggestion on this? Appreciate your idea and help!

Thanks,
Jiaye Wu

Re: Design of Time Series Similarity Search (KV-match) Integration - SQL Part

Posted by Jiaye Wu <wu...@gmail.com>.

Thanks for your summary.

*1. Sub query*
I think it is useful for data analysis. But supporting sub query or nested
query may need to refactor the current query processing flow heavily?
So I suggest to support temp table in the near future, which ought to be
easier to implement.
If anyone has better idea, please let me know. We should provide
opportunity for user to select and change the pattern as desired.

*2. Create index*
*Index name:*
It is good for index maintenance (alter, drop etc.).
Maybe it should be optional, and we will generate a random one when it is
omitted. And provide a statement for user to find the index built on
specific timeseries:
SHOW INDEX ON <timeseries>;

*Index on group:*
Supporting index creation on a group of series is also a good idea. Though
that will increase the total processing time, it can facilitate users.
If the index support combination on multiple series, it can use this
command to build single index for multiple series.

*USING clause:*
I think we should use the alias like KV-index, unless we support plugin
mechanism later.

*Parameters: *
I prefer to choose key_value map, e.g. window_lengths=[10,20,40,80,160]

*3. Similarity search*
*Position of SIMILAR()*
I do agree to move the SIMILAR() to where clause, because it is also a
criterion not a projection.
SELECT * FROM <query_path> WHERE SIMILAR(<pattern_path>, <parameters>) [
AND <single_range_time_criterion> ];
But I am not sure whether this is hard to implement based on current query
logical / physical plan?
@Gaofei Cao, do you have any idea about this?

*Parameters: *
I prefer to choose key_value map too, e.g. type=RSM, measure=DTW,
threshold=1.0 [, overlapped_threshold=1.0]
But that will cause the SQL becomes long. Any suggestion on this?

*Keyword name:*
Whether to use SIMILAR()?

Besides, the JIRA ticket IOTDB-9
<https://issues.apache.org/jira/browse/IOTDB-9> is created, and I will
update the example later.

Thanks,
Jiaye

On Wed, Jan 23, 2019 at 11:04 AM Xiangdong Huang <sa...@gmail.com> wrote:

> Hi,
>
> Jiaye, welcome to join the community. It is great that adding Time series
> similarity search  function into IoTDB.
>
> I re-organize your suggestions and there are some questions:
>
> 1. sub query (or called nested query).
> - Temp Table (and we should to call it as Temp Time Series in IoTDB) is a
> good solution. (Is there any other choices? @Stefanie Zhao FYI).
>
> 2. Create Index
> - CREATE INDEX <name> ON <series/group> USING <type> WITH <parameters>
> - So, do we declare an index on a time series, or a series group?
> - How to describe the using clause? (using `org.apache.iotdb.index.ABCImpl`
> or a alias)
> - How to describe the parameters? (a list? a key value map in json format?)
>
> 3. SELECT... FROM... WHERE...
> - FUN SIMILAR(<pattern_path>, <parameter_1[,parameter_2,...]>)
> I think a better way is putting query conditions in the WHERE clause, not
> the SELECT clause.
>
> If possible, can you give some examples and declare there input and output
> in detail? An example can be found at
> https://github.com/thulab/iotdb/issues/499
>
> @Gaofei Cao, maybe you can give more advices to let Jiaye know which
> classes he has to deal with.
>
> By the way, Jiaye, can you open a ticket on IoTDB JIRA for this feature?
>
> Best,
> -----------------------------------
> Xiangdong Huang
> School of Software, Tsinghua University
>
>  黄向东
> 清华大学 软件学院
>
>
> Jiaye Wu <wu...@gmail.com> 于2019年1月22日周二 下午10:53写道：
>
> > Hi all,
> >
> > Let's start the thread to discuss about the integration. As the first
> step,
> > we need to design the SQL interface.
> >
> > Time series similarity search is a fundamental problem in data mining. It
> > is quite useful for analyzing sensor data stored in IoTDB.
> > KV-match is one of approaches to solve this problem, which supports both
> > raw subsequence matching (RSM) and constrained normalized subsequence
> > matching (cNSM) under either Euclidean Distance (ED) or Dynamic Time
> > Warping (DTW).
> >
> > As KV-match is just one implementation, in SQL layer we should add a new
> > query type as follows.
> >
> >
> > > *SELECT similar(<pattern_path>, <parameter_1[,parameter_2,...]>)
> > > from <query_path> where <single_range_time_criterion>;*
> >
> >
> > IoTDB should can be configured to use KV-match or other algorithms, or
> even
> > provide a naive brute force option for user to choose.
> > Add a configuration e.g.
> >
> > similarity-search-implementation = KV-match | KV-match_simple | Naive |
> ...
> >
> >
> > The parameter list is highly related to the implementation.
> > For KV-match, we have following parameters (Please refer to the paper for
> > their meanings):
> >
> > query_type           : enum   [RSM|cNSM] (required)
> > > distance_measure     : enum   [ED|DTW]   (required)
> > > distance_threshold   : double            (required)
> > > amplitude_scaling    : double            (optional, for cNSM only,
> > default
> > > 1.1)
> > > offset_shifting      : double            (optional, for cNSM only,
> > default
> > > 5.0)
> >
> > warping_path         : double            (optional, for  DTW only,
> default
> > > 5.0)
> >
> > max_answers          : int               (optional, default 100)
> >
> > remove_overlapped    : bool              (optional, default TRUE)
> >
> > overlapped_threshold : double            (optional, default 0.5)
> >
> >
> > It is ideal that IoTDB can support temporary memory tables (time series)
> > which is associated with session life-cycle, likes MySQL as follows:
> >
> > *CREATE TEMPORARY TIMESERIES <pattern_path> SELECT *
> > > FROM <pattern_original_from_path> WHERE <criterion>;*
> >
> >
> > User can use the above SQL to select a pattern from other series and
> modify
> > it if they want. Then pass the <pattern_path> to the first SQL for
> > similarity search.
> > If this can not be implemented in the near future, we can just let user
> > store the pattern to a normal path and use the full series as the pattern
> > for query.
> >
> > The result of similarity search will like follows:
> >
> > +--------------------------+
> >
> > | Rank |   Time | Distance |
> >
> > +--------------------------+
> >
> > |    1 |    124 |      0.3 |
> > > |    2 |    544 |      1.4 |
> > > |    3 |    346 |      3.5 |
> >
> > +--------------------------+
> >
> >
> > KV-match or other approaches may need to build index on timeseries to
> > facilitate queries. IoTDB should support index creation SQL.
> >
> > *CREATE INDEX
> > > ON <query_path> USING <index_type> <parameter_1[,parameter_2,...]>*
> >
> >
> >
> > For KV-match, we use KV-index as our index structure, which has following
> > parameters (Please refer to the paper for their meanings):
> >
> > window_lengths       : [int, int, ...]   (optional, default
> > > [25,50,100,200,400])
> > > merge_enabled        : bool              (optional, default TRUE)
> > > initial_row_range    : double            (optional, default 0.5)
> >
> >
> > Any suggestion on this? Appreciate your idea and help!
> >
> > Thanks,
> > Jiaye Wu
> >
>

Re: Design of Time Series Similarity Search (KV-match) Integration - SQL Part

Posted by Xiangdong Huang <sa...@gmail.com>.

Hi,

Jiaye, welcome to join the community. It is great that adding Time series
similarity search  function into IoTDB.

I re-organize your suggestions and there are some questions:

1. sub query (or called nested query).
- Temp Table (and we should to call it as Temp Time Series in IoTDB) is a
good solution. (Is there any other choices? @Stefanie Zhao FYI).

2. Create Index
- CREATE INDEX <name> ON <series/group> USING <type> WITH <parameters>
- So, do we declare an index on a time series, or a series group?
- How to describe the using clause? (using `org.apache.iotdb.index.ABCImpl`
or a alias)
- How to describe the parameters? (a list? a key value map in json format?)

3. SELECT... FROM... WHERE...
- FUN SIMILAR(<pattern_path>, <parameter_1[,parameter_2,...]>)
I think a better way is putting query conditions in the WHERE clause, not
the SELECT clause.

If possible, can you give some examples and declare there input and output
in detail? An example can be found at
https://github.com/thulab/iotdb/issues/499

@Gaofei Cao, maybe you can give more advices to let Jiaye know which
classes he has to deal with.

By the way, Jiaye, can you open a ticket on IoTDB JIRA for this feature?

Best,
-----------------------------------
Xiangdong Huang
School of Software, Tsinghua University

 黄向东
清华大学 软件学院


Jiaye Wu <wu...@gmail.com> 于2019年1月22日周二 下午10:53写道：

> Hi all,
>
> Let's start the thread to discuss about the integration. As the first step,
> we need to design the SQL interface.
>
> Time series similarity search is a fundamental problem in data mining. It
> is quite useful for analyzing sensor data stored in IoTDB.
> KV-match is one of approaches to solve this problem, which supports both
> raw subsequence matching (RSM) and constrained normalized subsequence
> matching (cNSM) under either Euclidean Distance (ED) or Dynamic Time
> Warping (DTW).
>
> As KV-match is just one implementation, in SQL layer we should add a new
> query type as follows.
>
>
> > *SELECT similar(<pattern_path>, <parameter_1[,parameter_2,...]>)
> > from <query_path> where <single_range_time_criterion>;*
>
>
> IoTDB should can be configured to use KV-match or other algorithms, or even
> provide a naive brute force option for user to choose.
> Add a configuration e.g.
>
> similarity-search-implementation = KV-match | KV-match_simple | Naive | ...
>
>
> The parameter list is highly related to the implementation.
> For KV-match, we have following parameters (Please refer to the paper for
> their meanings):
>
> query_type           : enum   [RSM|cNSM] (required)
> > distance_measure     : enum   [ED|DTW]   (required)
> > distance_threshold   : double            (required)
> > amplitude_scaling    : double            (optional, for cNSM only,
> default
> > 1.1)
> > offset_shifting      : double            (optional, for cNSM only,
> default
> > 5.0)
>
> warping_path         : double            (optional, for  DTW only, default
> > 5.0)
>
> max_answers          : int               (optional, default 100)
>
> remove_overlapped    : bool              (optional, default TRUE)
>
> overlapped_threshold : double            (optional, default 0.5)
>
>
> It is ideal that IoTDB can support temporary memory tables (time series)
> which is associated with session life-cycle, likes MySQL as follows:
>
> *CREATE TEMPORARY TIMESERIES <pattern_path> SELECT *
> > FROM <pattern_original_from_path> WHERE <criterion>;*
>
>
> User can use the above SQL to select a pattern from other series and modify
> it if they want. Then pass the <pattern_path> to the first SQL for
> similarity search.
> If this can not be implemented in the near future, we can just let user
> store the pattern to a normal path and use the full series as the pattern
> for query.
>
> The result of similarity search will like follows:
>
> +--------------------------+
>
> | Rank |   Time | Distance |
>
> +--------------------------+
>
> |    1 |    124 |      0.3 |
> > |    2 |    544 |      1.4 |
> > |    3 |    346 |      3.5 |
>
> +--------------------------+
>
>
> KV-match or other approaches may need to build index on timeseries to
> facilitate queries. IoTDB should support index creation SQL.
>
> *CREATE INDEX
> > ON <query_path> USING <index_type> <parameter_1[,parameter_2,...]>*
>
>
>
> For KV-match, we use KV-index as our index structure, which has following
> parameters (Please refer to the paper for their meanings):
>
> window_lengths       : [int, int, ...]   (optional, default
> > [25,50,100,200,400])
> > merge_enabled        : bool              (optional, default TRUE)
> > initial_row_range    : double            (optional, default 0.5)
>
>
> Any suggestion on this? Appreciate your idea and help!
>
> Thanks,
> Jiaye Wu
>