Posted to dev@calcite.apache.org by Tyler Akidau <ta...@google.com.INVALID> on 2016/10/17 20:00:38 UTC

Re: Streams, joins, and temporal tables

Hi Julian,

Thanks a lot for writing this up. And sorry for taking so long to respond
on this; attending Strata and then catching up after Strata ended up being
more time-consuming than I'd hoped. :-P

Overall, I love the temporal table idea. I think it's a very clean way of
capturing and accessing the evolution of a table over time. I also like how
the example used concisely motivates the challenges involved.

I actually don't have any major comments for you at this time. I think the
idea is solid at a high level (and I'm generally not qualified to comment
on the nitty-gritty SQL details :-).

My only real complaint is that, as you allude to in the Further Work
section, it doesn't address the details of handling out-of-order data. But
after thinking about it a bit, I believe it meshes nicely with the
semantics of the EMIT statement, at least as I was envisioning them (still
not 100% sure we're in agreement there yet, but maybe). I was not expecting
the two to interact so cleanly when I first started reading your doc, so I
find that encouraging.

I'm going to try to write up my thoughts on this over the next week or two
with some diagrams to show what I mean. I'll post the doc on this thread
once it's done.

-Tyler

On Wed, Sep 21, 2016 at 6:57 PM Julian Hyde <jh...@apache.org> wrote:

> At FlinkForward the SQL BoF (birds-of-a-feather) group discussed several
> issues relating to streaming SQL, but the discussion of stream-to-table
> joins was particularly interesting. It raises issues of time semantics if
> the table is changing over time, especially if we wish to be able to
> re-play the stream later, join to the table, and get the same results.
>
> I have been thinking about the problem for the past few days, and have
> written up my thoughts in a document [1]. The link is read-only, but I
> welcome comments, so contact me back-channel if you would like write access.
>
> It is a rather lengthy read, I am afraid, partly because I include a lot
> of examples. But I believe that the new concept that I have introduced, the
> temporal table function (which I often abbreviate temporal table) has huge
> expressive power, and can be a foundation for the semantics of streaming
> query systems that need to interact with “historic” data.
>
> Please give it a read, and comment either in this email thread or in the
> document. Also, please feel free to forward it to others outside the
> Calcite community who are working on streaming query semantics, and invite
> them to participate also.
>
> Julian
>
>
> [1] https://docs.google.com/document/d/1RvnLEEQK92axdAaZ9XIU5szpkbGqFMBtzYiIY4dHe0Q/edit?usp=sharing

Re: Streams, joins, and temporal tables

Posted by Fabian Hueske <fh...@gmail.com>.

Hi everybody,

I second Tyler: thanks for this write-up, Julian, and sorry for not
responding earlier to this thread.
After Flink Forward, a few folks and I started to draft a proposal for
Flink's StreamSQL semantics.
I wanted to finalize our proposal (just shared it on the Flink dev list)
before replying here to see how our approaches align with each other.

According to your document, temporal tables can be queried for arbitrary
points in time, i.e., you can fetch historic data as well as the most
recent data that arrives from a stream.
An implementation of this model would require integrating a streaming
source (e.g., a Kafka topic) with a batch source (e.g., an HBase table) and
making both queryable.
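
To make the "arbitrary points in time" idea concrete, here is a minimal sketch in Python. It is not taken from Julian's document or from Calcite; the class TemporalTable, its methods upsert and as_of, and the helper join_stream are hypothetical names chosen for illustration. It assumes every row version is kept with a timestamp, so a stream event can be joined to the version that was valid at the event's own time, which is what makes a later replay return the same results.

```python
import bisect
from collections import defaultdict


class TemporalTable:
    """Hypothetical in-memory table that keeps every version of each row."""

    def __init__(self):
        # key -> parallel, time-sorted lists of timestamps and values
        self._times = defaultdict(list)
        self._values = defaultdict(list)

    def upsert(self, key, value, ts):
        # Insert in timestamp order so versions can arrive in any order.
        i = bisect.bisect_right(self._times[key], ts)
        self._times[key].insert(i, ts)
        self._values[key].insert(i, value)

    def as_of(self, key, ts):
        # Latest version whose timestamp is <= ts, or None if none exists yet.
        i = bisect.bisect_right(self._times[key], ts)
        return self._values[key][i - 1] if i else None


def join_stream(orders, table):
    # Join each event to the table version valid at the event's own time.
    return [(ts, product, qty, table.as_of(product, ts))
            for ts, product, qty in orders]


prices = TemporalTable()
prices.upsert("widget", 10, ts=100)
prices.upsert("widget", 12, ts=200)

orders = [(150, "widget", 2), (250, "widget", 1)]
first_run = join_stream(orders, prices)

# The table keeps changing after the first run...
prices.upsert("widget", 15, ts=300)

# ...but replaying the stream later still yields the same join results.
replay = join_stream(orders, prices)
assert first_run == replay
```

A real implementation would have to serve old versions from a batch store and recent ones from the stream; the sketch sidesteps that by holding everything in memory.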

Since this is a very ambitious goal, the StreamSQL proposal for Flink
focuses on the streaming aspect of temporal tables.
We define dynamic tables which are conceptually similar to temporal tables,
i.e., tables with time-varying content.
However, in contrast to temporal tables, dynamic tables are only evaluated
at the current logical time (defined by watermarks or wall clock time).
Conceptually, the result of a query on a dynamic table should always be
equivalent to a batch query being executed on the current data of the
dynamic table.
Therefore, the query result is continuously computed (updated) based on the
data of its input dynamic tables at time 'now'.
Essentially, a query is translated into a streaming application, which is
similar to a view maintenance operation.
Since we do not allow specifying a timestamp for a dynamic table, we plan
to use regular batch SQL without the special notation for temporal tables.
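
The equivalence described above can be sketched in a few lines of Python. This is an illustration only, not Flink's API: batch_query, ContinuousCount, and the clicks data are hypothetical, and the query (a per-user count) is an assumed example. The point is that after every appended row, the incrementally maintained result matches what a batch query over the current contents of the dynamic table would return.

```python
from collections import Counter


def batch_query(rows):
    """Batch reference: SELECT user, COUNT(*) FROM clicks GROUP BY user."""
    return dict(Counter(user for user, _url in rows))


class ContinuousCount:
    """Hypothetical operator that maintains the same result incrementally."""

    def __init__(self):
        self.result = {}

    def on_row(self, row):
        # Update only the affected group instead of recomputing everything.
        user, _url = row
        self.result[user] = self.result.get(user, 0) + 1
        return self.result


clicks = [("alice", "/a"), ("bob", "/b"), ("alice", "/c")]
dynamic_table = []          # contents of the dynamic table at time 'now'
query = ContinuousCount()

for row in clicks:
    dynamic_table.append(row)
    query.on_row(row)
    # The continuous result always equals the batch result on current data.
    assert query.result == batch_query(dynamic_table)
```

The sketch covers append-only input; retractions and late updates would need the operator to handle decrements as well.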

I think both proposals align quite well. Our approach can cover many
of the variations you listed, such as append-only tables, rolling time
periods or row counts, and paged window tables.
We do not plan to support the nice shortcut of lateral joins against
temporal tables because we want to restrict the syntax to regular SQL
initially.
Our proposal also describes how to define early results, update
frequency, and late updates.

Here is the link to our proposal:

-->
https://docs.google.com/document/d/1qVVt_16kdaZQ8RTfA_f4konQPW4tnl8THw6rzGUdaqU/edit#

Best,
Fabian
