Posted to dev@calcite.apache.org by Julian Feinauer <j....@pragmaticminds.de> on 2018/10/21 19:24:10 UTC

Calcite on Traces / Industry 4.0 data

Dear calcite devs,

I have been following the project for a long time and love how Calcite has made it possible to use SQL everywhere (I have built several SQL interfaces on top of specific file formats myself). I also like the strong support for streaming SQL.

The reason I'm writing this email is not only to give the project some love, but also because we are thinking about a SQL "extension" which, I think, is not too use-case-specific and could serve others in different settings as well.

Specifically, we are working with streams of data from devices (think Industry 4.0). We read data, e.g., from PLCs (using the incubating Apache PLC4X project, to which I contribute) and run analytics on it. The analyses done there are pretty similar to those for traces from tests, e.g., automotive test drives, or from related industries. What all these streams have in common is:
* usually ordered by time
* elements of different groups of signals ("rows" from "tables") arrive ordered by time, but not with equal timestamps (e.g., one value each second for some quantities, much more frequent for others)
* the "natural" join for these signal groups ("tables") is some kind of interpolation (sample-and-hold, linear interpolation, splines, ...) with respect to (event) time
* in some cases signal types are not known and can only be guessed from the first value; e.g., on CAN there is no strict notion of "double" or "integer" channels, but rather integer base values plus a conversion formula (like a*x + b) plus possible lookup tables for "other" values (SNA, NULL, DISABLED, ...); see the sketch below
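
To make the last point concrete, here is a minimal sketch of such a decoding step, with a hypothetical channel table can_frames(ts, raw_value), invented constants a = 0.1, b = -40, and the raw code 255 reserved for SNA:

```
SELECT ts,
       CASE
         WHEN raw_value = 255 THEN NULL    -- lookup table: raw code 255 means SNA
         ELSE 0.1 * raw_value - 40.0       -- linear conversion a * x + b
       END AS engine_temp
FROM can_frames
```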

On the other hand, the analyses we would like to perform often revolve around timestamps:
* get timestamps where a condition becomes true
  * boolean value toggled
  * numeric value is above / below threshold
  * signal change rate is above / below threshold
  * ...
* get the values of certain signals at the point in time when a condition becomes true (see above)
* get windows based on conditions
  * while signal is true
  * while value above ...
  * ...
* do aggregations on signals within these windows

Parts of this can be done in most SQL dialects (I'm no expert on the standard, but in Postgres one could use LAG and window partitions), yet this is not efficient, and not all of the above can be expressed that way.
So we are thinking about an extension (or a dialect) for "traces" or "time series" whose syntax is slightly extended to allow queries such as those stated above.
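
For illustration, the LAG workaround for the "condition becomes true" case might look like this in Postgres-flavored SQL (the table trace(ts, cycle_in_progress) is hypothetical):

```
SELECT ts
FROM (SELECT ts,
             cycle_in_progress,
             LAG(cycle_in_progress) OVER (ORDER BY ts) AS prev_flag
      FROM trace) t
WHERE cycle_in_progress AND NOT prev_flag   -- rising edge: FALSE -> TRUE
```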

To give you an example of what such an extension could look like:

```
SELECT start(), end(), MAX(current) FROM s7://127.0.0.1/0/0 WHILE cycle_in_progress = TRUE
SELECT timestamp, current AS start_current FROM s7://127.0.0.1/0/0 WHERE cycle_in_progress = TRUE TRIGGER ON_BECOME_TRUE
```

Why am I bothering you with this?
Well, first, you are the experts and I would love to get some feedback on these thoughts.
But, most importantly, I am thinking about writing (yet another) SQL parser with slight extensions and would then have to maintain a "runtime" which would be partially similar (in functionality, not in maturity or sophistication) to Calcite's Enumerable trait. So I was wondering whether there is a way to make all of this work "on top" of Calcite (custom RelNodes and an extension to the parser), but I'm unsure about that, as some of the internals of Calcite are tied very specifically to SQL, e.g., SqlToRelConverter.
Do you have any ideas on how one could implement this in a "minimally invasive" way on top of Calcite, and whether this is possible "ex situ", or whether it should be done in the same codebase (e.g., as a subproject), as it would need some changes near Calcite's core?

Please excuse this rather long email, but I would really appreciate any answers, comments or suggestions.

Best
Julian

Re: Calcite on Traces / Industry 4.0 data

Posted by Julian Feinauer <j....@pragmaticminds.de>.
Hey,

thanks for your reply and your ideas; reading your answer, it becomes quite clear that we have some kind of "duality" between the SQL world and the "signal" world. And indeed, your approach would allow us to bridge the gap in one direction (signal to SQL).

For many use cases (a temperature sensor that emits a value when it changes, or at least once a minute) this is a good solution and, as you state, would make the application of MATCH_RECOGNIZE easier (and also other operations, like AVG). It would also allow joins with other tables to enrich the data. Regarding the implementation: would this be done as a table function?

But (at least in our use cases) there are many situations where this approach would probably not be sufficient or feasible. Sampling rates are often pretty low, but timestamps are given with ms precision, which means we would have to "expand" to one point per ms; that is very inefficient (all sensors have their own timestamps, so we really have to go down to the lowest "tick"). Also, on your note about the join: I agree that the "natural join" for this problem is not a SQL join and that your "expansion" would solve it, but think of a situation where one sensor sends temperature values once a minute and another sends current values every ms... I don't know if this can be handled efficiently by the engine, as it is pretty easy to do some things wrong.
Thus, my idea was to have a separate join (perhaps realized via a different trait) which does the sample-and-hold implicitly and uses "backpressure" to fetch values only from the stream that is behind (we already use these algorithms in our framework).
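
To make the intended semantics concrete, here is what such a join would have to compute, written as plain (and deliberately naive) SQL over hypothetical tables current_values(ts, current) and temperature_values(ts, temperature): for every fast row, the latest slow value at or before its timestamp.

```
SELECT c.ts, c.current, t.temperature
FROM current_values c
LEFT JOIN LATERAL (
  SELECT temperature
  FROM temperature_values
  WHERE ts <= c.ts              -- last temperature at or before the current row
  ORDER BY ts DESC
  LIMIT 1
) t ON TRUE
```

A dedicated sample-and-hold join could produce the same result with a single ordered merge over both streams instead of a per-row lookup.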

I like this discussion; we have been thinking about these problems for more than two years now and developed our very own approach, so now is a good point in time to reflect and see how we can align it with well-established solutions.

So, apart from your approach (transforming the problem into a valid SQL problem), do you see the possibility to extend Calcite (I'm not speaking about the parser now), e.g., by using a different "Timeseries trait" which can handle the problem in a more optimized way and only does things like the expansion you propose "at the end"?

Best
Julian

Re: Calcite on Traces / Industry 4.0 data

Posted by Julian Hyde <jh...@apache.org>.
I’ve been thinking a bit more about this use case.

It’s tricky because signal processing and SQL have different models of “event”. EE folks talk about “edge triggering” and “level triggering”, which are really just two different data models for the same physical phenomenon.

Consider an example of a step function. Suppose a signal is 0 until time 3.25, when it becomes 1. SQL would probably represent this as an event

  (timestamp=0, value=0)

and another event

  (timestamp=3.25, value=1)

But to use MATCH_RECOGNIZE effectively, you would need a stream of events

  {(0, 0), (1, 0), (2, 0), (3, 0), (4, 1), (5, 1), …}

And even then, the stream does not capture exactly when the transition occurs, just that it happens sometime between 3 and 4.

We could provide a transform in SQL that converts the edge events

  {(0, 0), (3.25, 1)}

into level events on clock ticks

  {(0, 0), (1, 0), (2, 0), (3, 0), (4, 1), …}

We could also provide a transform from an event stream that has time gaps; e.g. 

  {(0, 0), (1, 0), (6, 1)}

becomes

  {(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (5, 0), (6, 1), (7, 1)}

These transforms produce virtual data streams in a form that is easier to write SQL against. For example, joins are easier if you can guarantee that there is a value for every clock tick. So are windowed aggregates. They also bring streams, constant tables and time-varying tables under the same roof.
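
One could imagine exposing such a transform as a table function; the name and signature here are invented, just to sketch the idea:

  SELECT ts, v
  FROM TABLE(LEVELIZE(TABLE edge_events, INTERVAL '1' SECOND))

which would turn {(0, 0), (3.25, 1)} into {(0, 0), (1, 0), (2, 0), (3, 0), (4, 1), …} as above.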

Do you think this approach is worthwhile? Are there other data models in signal processing/event processing that we could bring into the relational model by applying transforms?

Julian

Re: Calcite on Traces / Industry 4.0 data

Posted by Julian Feinauer <j....@pragmaticminds.de>.
.... I just noticed that I forgot to comment on Flink's implementation, sorry.

I went through the patch that implemented the basic functionality on master [1], and I think we cannot learn much from their approach directly, as they reduce the query to a CEP pattern which is then forwarded to CEP, where most of the magic happens.
Thus, what they implemented to make this feature work is, from my impression, at the level of what is already implemented here with the parsing and the LogicalMatch.

Sorry for the two emails
Julian

[1] https://github.com/apache/flink/commit/3acd92b45c21e081f781affc8cb5700d972f9b0b

Re: Calcite on Traces / Industry 4.0 data

Posted by Julian Feinauer <j....@pragmaticminds.de>.
Hi Julian,

I filed a JIRA for my general suggestion about "Timeseries SQL" (CALCITE-2640).
For the discussion in the other thread, I had a look at the present state of the code (from you and Zhiqiang He) for parsing and the logical node.

I also thought about the necessary implementation for EnumerableMatch.
I'm pretty familiar with the regex-to-NFA / DFA part (from our implementations) and the DEFINE part.
But what I'm pretty unfamiliar with is the order and partition part (and especially how it's implemented in Calcite).
Do you see any possibility to transform the matching part into a window aggregation function, or am I making things overly easy with this thought?

Wouldn’t this also make it easier with regard to the PREV, NEXT, FIRST, LAST navigation functions?
I can try to help with the implementation of the "inner" parts, but I don't feel familiar enough with the codebase to make the whole thing work.
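
For reference, the clauses in question, sketched over an assumed table trace(device_id, ts, val):

```
SELECT *
FROM trace MATCH_RECOGNIZE (
  PARTITION BY device_id               -- matching runs independently per device
  ORDER BY ts                          -- rows are fed to the automaton in time order
  MEASURES FIRST(up.ts) AS start_ts,
           LAST(up.ts)  AS end_ts
  PATTERN (strt up+)
  DEFINE up AS up.val > PREV(up.val)   -- PREV navigates within the match
)
```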

Thus, if any of the seasoned Calcite devs could offer some help, I would be happy to discuss details of the implementation and to support it as well as possible.

Best
Julian


> On Oct 21, 2018, at 10:04 PM, Julian Feinauer <j....@pragmaticminds.de> wrote:
> 
> Hi Mark,
> 
> thanks for your reply.
> In fact, I'm sorry that I failed to mention MATCH_RECOGNIZE in my original mail.
> I was really excited when I first heard about MATCH_RECOGNIZE, as it is incredibly powerful and could be used to solve many of the problems I state in my mail.
> The only "drawback" I see is that it feels so technical and complex.
> By that I mean that it took me quite a while to figure out how to use it (and I would consider myself an experienced SQL user). And it kind of "breaks" the foundation of SQL in the sense that it is pretty imperative and not very declarative.
> 
> This is not general criticism of the feature. The point I'm trying to make is that there is (from my perspective) a large class of similar problems, and I would love to have a solution which "feels" natural and offers suitable "semantics" for the field.
> 
> But coming back to the MATCH_RECOGNIZE support in Calcite: is there any progress with regard to Julian's post from July?
> If not, I can offer to give some support with the implementation of the FSM / NFA.
> One solution for us could then also be to take a query in the "Timeseries SQL" dialect and transform it into a query with MATCH_RECOGNIZE.
> 
> So if help is still needed, please let me know (a quick search through JIRA turned up CALCITE-1935, which suggests some implementation is still missing).
> 
> Best
> Julian
> 
> 
> Am 22.10.18, 02:41 schrieb "Mark Hammond" <gp...@themarkhammond.com>:
> 
>    Hi Julian Feinauer,
> 
>    Do share your thoughts on MATCH_RECOGNIZE operator suitability, http://mail-archives.apache.org/mod_mbox/calcite-dev/201807.mbox/%3cC6A37DAE-F884-4D90-8EC0-8FD4EFDE1B0E@apache.org%3e
> 
>    Cheers,
>    Mark.

Re: Calcite on Traces / Industry 4.0 data

Posted by Julian Feinauer <j....@pragmaticminds.de>.
Hi Julian,

first off, thanks for your reply and your thoughts.
Thinking about your arguments, I fully agree with what you say, and we should really consider using MATCH_RECOGNIZE first and see where it gets us.

As for our second "problem", the different channel groups (with unequal timestamps), we then also need a sound mapping to SQL. My first thought was to use the "Drill approach" and simply simulate a table that has all the columns somebody might want (as we do not know them upfront), returning NULL or NaN values when a channel has no sample at evaluation time (and doing all the interpolation and the like in the background); a rough sketch of what I mean is below. Or does anybody have a better idea?
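
To make that concrete, here is an untested sketch (all table and column names are invented, and IGNORE NULLS is not available in every dialect) of two channel groups simulated as one sparse wide table, with a sample-and-hold fill on top:

```
-- Hypothetical per-channel streams: temperature_chan(ts, val) sampled once
-- per second, current_chan(ts, val) sampled much more frequently.
WITH sparse AS (
  SELECT ts, val AS temperature, CAST(NULL AS DOUBLE) AS current
  FROM temperature_chan
  UNION ALL
  SELECT ts, CAST(NULL AS DOUBLE) AS temperature, val AS current
  FROM current_chan
)
-- Sample and hold: carry the last non-NULL value of each channel forward.
SELECT ts,
       LAST_VALUE(temperature) IGNORE NULLS OVER (ORDER BY ts) AS temperature,
       LAST_VALUE(current)     IGNORE NULLS OVER (ORDER BY ts) AS current
FROM sparse
ORDER BY ts
```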

Regarding your suggested approach, I agree and will try to write some of our analyses (currently expressed in our Java DSL) with MATCH_RECOGNIZE to see how well it fits, and then come back to the list; a first attempt is sketched below.
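
For example, the "TRIGGER ON_BECOME_TRUE" query from my original mail might translate roughly like this (an untested sketch; I assume an event-time column `ts`, with `plc_stream` standing in for the s7:// source):

```
SELECT *
FROM plc_stream MATCH_RECOGNIZE (
  ORDER BY ts
  MEASURES UP.ts AS event_ts, UP.current AS start_current
  ONE ROW PER MATCH
  AFTER MATCH SKIP PAST LAST ROW
  -- one row where the condition is false, followed by the row
  -- where it becomes true
  PATTERN (DOWN UP)
  DEFINE DOWN AS NOT DOWN.cycle_in_progress,
         UP   AS UP.cycle_in_progress
)
```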

Thanks
Julian



Re: Calcite on Traces / Industry 4.0 data

Posted by Julian Hyde <jh...@apache.org>.
Julian,

Thanks for posting this to Calcite. We appreciate the opportunity to mull over a proposed language and perhaps prevent a misguided SQL-like language.

I agree with both you and Mark: MATCH_RECOGNIZE seems to be very well suited to your problem domain. And, yes, MATCH_RECOGNIZE is non-trivial and difficult to learn.

But in its favor, MATCH_RECOGNIZE is in the SQL standard and has reference implementations in systems like Oracle, so we can assume that it is well specified. And, in my opinion, it is well designed: it delivers significant extra power that SQL could not achieve efficiently, or at all, without it, and it is consistent with existing SQL semantics. Lastly, streaming systems such as Flink and Beam are adopting it.

When your proposed language has gone through the same process, I suspect that it would end up being very similar to MATCH_RECOGNIZE. MATCH_RECOGNIZE may seem “imperative” because it is creating a state-transition engine, but finite-state automata can be reasoned about and safely transformed, and are therefore to all intents and purposes “declarative”.

The biggest reason not to use MATCH_RECOGNIZE is your audience. There’s no point creating the perfect language if the audience doesn’t like it and won’t adopt it. So perhaps your best path is to design your own language, find some examples and code them up as use cases in that language, and iterate based on your users’ feedback.

If I were you, I would also code each of those examples in SQL using MATCH_RECOGNIZE, and make sure that there is a sound mapping between the two languages. And maybe your language could be implemented as a thin layer above MATCH_RECOGNIZE.
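
For instance, the WHILE aggregation from your original mail might look roughly like this (an untested sketch; `plc_stream` and the `ts` ordering column are stand-ins for your source):

```
SELECT *
FROM plc_stream MATCH_RECOGNIZE (
  ORDER BY ts
  MEASURES FIRST(UP.ts)    AS start_ts,
           LAST(UP.ts)     AS end_ts,
           MAX(UP.current) AS max_current
  ONE ROW PER MATCH
  AFTER MATCH SKIP PAST LAST ROW
  -- a maximal run of rows where the condition holds
  PATTERN (UP+)
  DEFINE UP AS UP.cycle_in_progress
)
```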

This is the same advice I would give to everyone who is writing a database: I don’t care whether you use SQL, but make sure your language maps onto (extended) relational algebra. (And if you create a SQL-like language that breaks some of the concepts of SQL, such as automatically joining tables, please don’t tell people that your language is SQL.)

I’m sorry to say that Calcite’s implementation of MATCH_RECOGNIZE has not moved forward much since my email. Maybe your effort is the kick necessary to get it going. I can assure you that I still believe that MATCH_RECOGNIZE, and the algebra that underlies it, is a solid foundation.

Julian


Re: Calcite on Traces / Industry 4.0 data

Posted by Julian Feinauer <j....@pragmaticminds.de>.
Hi Mark,

thanks for your reply.
In fact, I'm sorry that I failed to mention MATCH_RECOGNIZE in my original mail.
I was really excited when I first heard about MATCH_RECOGNIZE, as it is incredibly powerful and could be used to solve many of the problems I state in my mail.
The only "drawback" I see is that it feels very technical and complex.
By that I mean that it took me quite a while to figure out how to use it (and I would consider myself an experienced SQL user). And it kind of "breaks" the foundation of SQL in the sense that it is pretty imperative and not so declarative.

This is no general criticism of the feature. The point I'm trying to make is that there is (from my perspective) a large class of similar problems, and I would love to have a solution which "feels" natural and offers suitable "semantics" for the field.

But coming back to the MATCH_RECOGNIZE support in Calcite: is there any progress with regard to Julian's post from July?
If not, I can offer to give some support with the implementation of the FSM / NFA.
One solution for us could then also be to take a query in the "Timeseries SQL" dialect and transform it into a query with MATCH_RECOGNIZE; a rough sketch of such a translation is below.
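
To illustrate what I mean (everything here is hypothetical: the dialect syntax as well as the `plc_stream` table and `ts` column), a threshold query in the dialect could be rewritten mechanically:

```
-- Hypothetical dialect query:
--   SELECT timestamp FROM s7://... WHERE current > 100.0 TRIGGER ON_BECOME_TRUE
-- could become:
SELECT event_ts
FROM plc_stream MATCH_RECOGNIZE (
  ORDER BY ts
  MEASURES ABOVE.ts AS event_ts
  ONE ROW PER MATCH
  AFTER MATCH SKIP PAST LAST ROW
  -- the row where the threshold condition flips from false to true
  PATTERN (BELOW ABOVE)
  DEFINE BELOW AS BELOW.current <= 100.0,
         ABOVE AS ABOVE.current > 100.0
)
```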

So if there is still help needed, please let me know (a quick search through JIRA turned up CALCITE-1935, which suggests that some implementation is still missing).

Best
Julian




Re: Calcite on Traces / Industry 4.0 data

Posted by Mark Hammond <gp...@themarkhammond.com>.
Hi Julian Feinauer,

Do share your thoughts on the suitability of the MATCH_RECOGNIZE operator: http://mail-archives.apache.org/mod_mbox/calcite-dev/201807.mbox/%3cC6A37DAE-F884-4D90-8EC0-8FD4EFDE1B0E@apache.org%3e

Cheers,
Mark.
