Posted to dev@calcite.apache.org by Julian Feinauer <j....@pragmaticminds.de> on 2018/12/16 23:38:48 UTC

Relational algebra and signal processing

Hi Calcite-devs,

I just had a very interesting mail exchange with Julian (Hyde) on the incubator list [1]. It was about our project CRUNCH (which is mostly about time series analysis and signal processing) and its relation to relational algebra, and I wanted to bring the discussion to this list to continue it here.
We already had some discussion about how time series would work in Calcite [2], and it’s closely related to MATCH_RECOGNIZE.

But I have a more general question in mind to ask the experts here on the list:
I ask myself whether we can see signal processing and analysis tasks as a proper application of relational algebra.
Disclaimer: I’m a mathematician, so I know the formalism of (relational) algebra pretty well, but I’m lacking a lot of experience and knowledge in database theory. Most of my knowledge there comes from Calcite’s source code and the book by Garcia-Molina and Ullman.

So if we take, for example, a stream of signals from a sensor, we can of course do filtering or smoothing on it, and this can be seen as a mapping between the input relation and the output relation. But as we usually need more than just one tuple at a time, we lose many of the advantages of relational theory. And if we then analyze the signal, we can again model it as a mapping between relations, but the input relation is a “time series” and the output relation consists of “events”, so these are in some way different dimensions. In this situation it becomes mostly obvious where the main differences between time series and relational algebra lie. Think of something simple: an event should be registered whenever the signal switches from FALSE to TRUE (so not for every TRUE). This could be modelled with MATCH_RECOGNIZE pretty easily. But to me it feels “unnatural”, because we cannot use any indices (we don’t care about the ratio of TRUE to FALSE in the DB, except perhaps for some very rough outer bounds), and we lack the “right” information for the optimizer, like estimates of the number of analysis results.
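
To make the rising-edge example concrete, it might look roughly like this in MATCH_RECOGNIZE (just a sketch; the table Signals with columns rowtime and val is made up for illustration):

 -- one output row per FALSE -> TRUE transition
 SELECT event_time
 FROM Signals MATCH_RECOGNIZE (
   ORDER BY rowtime
   MEASURES UP.rowtime AS event_time   -- timestamp of the rising edge
   ONE ROW PER MATCH
   AFTER MATCH SKIP PAST LAST ROW
   PATTERN (DOWN UP)                   -- a FALSE row directly followed by a TRUE row
   DEFINE
     DOWN AS DOWN.val = FALSE,
     UP AS UP.val = TRUE
 ) AS t
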
It gets even more complicated when moving to continuous-valued signals (INT, DOUBLE, …), e.g. temperature readings.
If we want to count the number of times the temperature changes by more than 5 degrees in under 4 hours, this should also be doable with MATCH_RECOGNIZE, but again there is no index to help us and we have no information for the optimizer, so it feels very “black box” to the relational algebra.
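
Again just as a sketch of how that could look (table and column names are made up, and I’m not sure about the exact interval arithmetic Calcite would want here):

 SELECT start_time, end_time
 FROM Temperatures MATCH_RECOGNIZE (
   ORDER BY rowtime
   MEASURES A.rowtime AS start_time, B.rowtime AS end_time
   ONE ROW PER MATCH
   AFTER MATCH SKIP PAST LAST ROW
   PATTERN (A X* B)                 -- any row, some rows in between, then the jump
   DEFINE
     X AS TIMESTAMPDIFF(MINUTE, A.rowtime, X.rowtime) < 240,   -- stay within 4 hours
     B AS TIMESTAMPDIFF(MINUTE, A.rowtime, B.rowtime) < 240
          AND ABS(B.temp - A.temp) > 5
 ) AS t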

I’m not sure if you get my point, but for me the elegance of relational algebra was always this optimization machinery, which starts from a declarative query and ends in an “optimal” physical plan. And I do not see how we can use much of this for the examples given above.

Perhaps one solution would be to do the same as for spatial queries (or the JSON / JSONB support in Postgres [3]) and add specialized indices, statistics and optimizer rules. Then this would make it more “relational algebra”-esque, in the sense that there really is a possibility to apply transformations to a given query.
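
For comparison, the Postgres pattern [3] looks roughly like this: a specialized index plus operators that the planner knows how to use (a sketch with made-up table and column names):

 CREATE TABLE events (id BIGINT, payload JSONB);
 -- GIN index over the JSONB column
 CREATE INDEX events_payload_idx ON events USING GIN (payload);
 -- the planner can answer containment queries from the index
 SELECT * FROM events WHERE payload @> '{"type": "alarm"}';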

What do you think? Am I overcomplicating things, or am I missing something?

Julian

[1] https://lists.apache.org/thread.html/1d5a5aae1d4f5f5a966438a2850860420b674f98b0db7353e7b476f2@%3Cgeneral.incubator.apache.org%3E
[2] https://lists.apache.org/thread.html/250575a56165851ab55351b90a26eaa30e84d5bbe2b31203daaaefb9@%3Cdev.calcite.apache.org%3E
[3] https://www.postgresql.org/docs/9.4/datatype-json.html


Re: Relational algebra and signal processing

Posted by Julian Feinauer <j....@pragmaticminds.de>.
Hey,

Julian (H), you are right with your assumptions. But in our situation timestamps do not necessarily need to be aligned on a regular grid; they do have to be ordered (for processing, at least).
I think stock prices are a very good example.

Three reasons why regular grids usually don’t work:
1. It’s very hard to sample regularly if the time resolution is high enough (jitter!).
2. Sometimes you want to reduce data by storing values only when they change.
3. Sometimes it is not clear what the “minimal” time (the width of the grid) is, or how it should be chosen.

But nonetheless, I read the link about array databases (Graphite is one too, I think), because I didn’t know this description. Thanks!

I see two ways of achieving the same thing.
First, we could try to make the time series a “proper relational problem” by using something like FILL.
The other option could be to do something like what is done for geo data: use another trait (it should be a trait in Calcite, right?) together with appropriate functions, and stay in this trait as long as necessary before switching over to a “regular” relation (see above).

Does that make sense?

JulianF


Re: Relational algebra and signal processing

Posted by Ruhollah Farchtchi <ru...@gmail.com>.
Hi Julian, I know Apache Phoenix uses materialized views in Calcite as a
secondary index. So a function-based index may be just a materialization of
your function, which would give you the level set of the tuples you are
looking for. You would need to implement some way to look up the other
values you need based on the index, or for basic counts you could possibly
use the index alone. There’s a pretty good writeup of materialized views
here (http://calcite.apache.org/docs/materialized_views), and see slide 29
here (https://www.slideshare.net/julianhyde/costbased-query-optimization-in-apache-phoenix-using-apache-calcite).
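
As a rough sketch of what I mean (made-up names, and assuming a system that
can create and maintain the materialized view):

 -- materialize the "interesting" tuples once
 CREATE MATERIALIZED VIEW rising_edges AS
 SELECT event_time
 FROM Signals MATCH_RECOGNIZE (
   ORDER BY rowtime
   MEASURES UP.rowtime AS event_time
   ONE ROW PER MATCH
   PATTERN (DOWN UP)
   DEFINE DOWN AS DOWN.val = FALSE, UP AS UP.val = TRUE
 ) AS t;

 -- queries over the pattern can then be rewritten to scan the view
 SELECT COUNT(*) FROM rising_edges;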

Ruhollah Farchtchi
ruhollah.farchtchi@gmail.com



Re: Relational algebra and signal processing

Posted by Julian Feinauer <j....@pragmaticminds.de>.
Hi Ruhollah,

thanks for your mail.
Regarding your MATCH_RECOGNIZE question, I’m not sure whether this could work or not (I’m skeptical, but it is a really powerful feature).

But to your other question: yes, the thing you describe should be a perfect fit for what we usually do.
In our situations we usually have pretty loose windows (bounded only by time, or lasting while a condition is met).
But then it is absolutely doable.

Regarding your suggestion for indices... this sounds very interesting, but I didn’t fully get it. Could you explain a bit more what you mean by a function-based index?
In our situation a proper index could be based on level sets [1].
A query like “current is above xxx” could be optimized with such an index.

JulianF

[1] https://en.wikipedia.org/wiki/Level_set




Re: Relational algebra and signal processing

Posted by Ruhollah Farchtchi <ru...@gmail.com>.
Maybe this is a separate but related problem; however, we see the same thing
with events in other complex use cases, such as path analysis.
Let's say you are a cable provider and you want to identify channel
surfers. You define a channel surfer as any user that has flipped across 3
channels in a 5-minute window. Now you want to count the number of channel
surfers you had watching the Super Bowl. Can that be accomplished with
MATCH_RECOGNIZE? Some of this seems very similar to the use case Julian F
kicked this thread off with, as it requires a transformation from time
series to events by way of pattern identification within a window of time.
Julian, you may need to FILL the window to achieve equal time increments so
the pattern match can be accomplished, but I'm not sure you need to in this
use case. Julian F, is this use case similar? I would imagine you could
index the pattern-matching part with a function-based index, which could be
implemented as some kind of secondary index via materialized views in
Calcite. Since it is on time series, you could optimize maintenance of that
index as long as your window for pattern discovery was small enough.
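
For the channel-surfer definition above, I imagine something along these
lines (just a sketch; the table ChannelEvents with columns userId, channel
and rowtime is made up):

 SELECT userId
 FROM ChannelEvents MATCH_RECOGNIZE (
   PARTITION BY userId
   ORDER BY rowtime
   ONE ROW PER MATCH
   AFTER MATCH SKIP PAST LAST ROW
   PATTERN (A B B)                        -- three channels in sequence
   DEFINE
     B AS B.channel <> PREV(B.channel)    -- an actual flip
          AND TIMESTAMPDIFF(MINUTE, A.rowtime, B.rowtime) <= 5
 ) AS surfers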

Ruhollah Farchtchi
ruhollah.farchtchi@gmail.com



Re: Relational algebra and signal processing

Posted by Julian Hyde <jh...@apache.org>.
I think the difficulty with JulianF’s signal processing domain is that he needs there to be precisely one record at every clock tick (or more generally, at every point in an N-dimensional discrete space).

Consider stock trading. A stock trade is an event that happens in continuous time, say

  (9:58:02 ORCL 41), (10:01:55 ORCL 43)

Our query wants to know the stock price at 10:00 (or at any 1-minute interval). Therefore we have to convert the event-oriented data into an array:

  (9:59 ORCL 41), (10:00 ORCL 41), (10:01 ORCL 41), (10:02 ORCL 43).

JulianF’s domain may be more naturally in the realm of array databases [1], but there are a lot of advantages to relational algebra and SQL, not least that we have a reasonable story for streaming data, so let’s try to bridge the gap. Suppose we add a FILL operator that converts an event-based relation into a dense array:

 SELECT *
 FROM TABLE(FILL(Trades, 'ROWTIME', INTERVAL '1' MINUTE))

Now we can safely join with other data at the same granularity.
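
For example (a sketch; the News table, its columns, and the “price” column name are made up for illustration):

 SELECT t.rowtime, t.price, n.headline
 FROM TABLE(FILL(Trades, 'ROWTIME', INTERVAL '1' MINUTE)) AS t
 JOIN TABLE(FILL(News, 'ROWTIME', INTERVAL '1' MINUTE)) AS n
   ON t.rowtime = n.rowtime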

Is this a step in the right direction?

Julian

[1] https://en.wikipedia.org/wiki/Array_DBMS



Re: Relational algebra and signal processing

Posted by Michael Mior <mm...@apache.org>.
I would say a similar theory applies. Some things are different when you're
dealing with streams, mainly joins and aggregations. Semantics are
necessarily different whenever you have operations involving more than one
row at a time from the input stream. When dealing with a relation, an
aggregation is straightforward since you just consume the entire input and
output the result of the aggregation. Since streams don't end, you need to
decide how this is handled, which usually amounts to a choice of windowing
algorithm. There are a few other things to think about. The presentation
linked below from Julian Hyde has a nice overview.

https://www.slideshare.net/julianhyde/streaming-sql-62376119
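
For instance, a windowed aggregation in Calcite's streaming SQL looks like
this (adapted from the streaming docs linked earlier; Orders is the example
stream used there):

 SELECT STREAM TUMBLE_END(rowtime, INTERVAL '1' HOUR) AS rowtime,
   productId,
   COUNT(*) AS c
 FROM Orders
 GROUP BY TUMBLE(rowtime, INTERVAL '1' HOUR), productId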

--
Michael Mior
mmior@apache.org



Re: Relational algebra and signal processing

Posted by Julian Feinauer <j....@pragmaticminds.de>.
Hi Michael,

yes, our workloads are usually in the context of streaming (but for replay and the like we also use batch).
But if I understand it correctly, the same theory applies to both tables (“relations”) and streaming tables, right?
I hope to find time soon to write a PLC4X Calcite source which creates one or many streams based on readings from a PLC.

Julian



Re: Relational algebra and signal processing

Posted by Michael Mior <mm...@apache.org>.
Perhaps you've thought of this already, but it sounds like streaming
relational algebra could be a good fit here.

https://calcite.apache.org/docs/stream.html
--
Michael Mior
mmior@apache.org

