You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@iceberg.apache.org by Ryan Blue <bl...@tabular.io> on 2022/02/02 22:58:51 UTC

Re: Time-sliced incremental scan

Thanks for the reply, Walaa. Your example is exactly what I was pointing
out.

I think there are a couple of topics here. First, I agree that we should
have a way of reading changes with columns that describe what happened.
That’s really useful for actually consuming incremental changes from a
table and is supported natively in some processing engines, like Flink. I’m
all for coming up with better ways of working with the data that return
rows along with this information. I’m not sure how to do this in SQL, but I
can imagine a few helpers to produce or consume DataFrames.

Second, there’s the more immediate question: what does a query for VERSIONS
BETWEEN SYSTEM TIME 2 AND 4 actually mean?

My initial answer is that I would return the equivalent of diffing the two
versions, (SELECT * FROM t AS OF TIMESTAMP 2) start_state and (SELECT *
FROM t AS OF TIMESTAMP 4) end_state. That is, SELECT * FROM end_state
EXCEPT start_state. This behavior may be specified in the spec about VERSIONS
BETWEEN but assuming that there aren’t requirements, I would assume that
this is basically a way of getting the same result without doing the full
table scans and joins.

Ryan

On Thu, Jan 20, 2022 at 7:05 PM Walaa Eldin Moustafa <wa...@gmail.com>
wrote:

> Thanks Ryan and Kyle for your input! I am glad we are directionally
> aligned on the importance and convenience of this feature.
>
> Regarding the choice between table history versus snapshot creation times,
> table history makes a lot of sense, since it captures a change log. This
> aligns with the current implementation of asOfTime() too.
>
> I think the result of a SQL query on timestamp (or snapshot ID) ranges
> should be the changes that took place on the table during that time,
> ideally capturing insertions and deletions, which brings us to Ryan's great
> point about rollbacks, which I think can be modeled as deletions. For
> example, let us assume the following sequence of events (hopefully it
> reflects Ryan's example):
>
> Snapshot A is committed at timestamp 1
> Snapshot B is committed at timestamp 2
> Snapshot C is committed at timestamp 3
> Table is rolled back to Snapshot B at timestamp 4
>
> The table history would look like:
> (Snapshot ID, event time)
> (A, 1)
> (B, 2)
> (C, 3)
> (B, 4)
>
> The rollback issue arises when someone queries the table for VERSIONS
> BETWEEN SYSTEM TIME 3 AND 4, which is linear in time, but translates to
> snapshot range (C, B) from the table history. (C, B) may appear to be an
> unacceptable range (since B is created before C); however, that can be
> considered acceptable if we treat reverse snapshot ranges as delete
> operations. In that case, even if anything was consumed from timestamp 2 to
> timestamp 3, it will now be taken back (deleted) when consuming from
> timestamp 3 to timestamp 4.
>
> That leads to the question of how to model a SQL query like SELECT * FROM
> T VERSIONS BETWEEN SYSTEM TIME ... AND ... from the user and the engine
> points of view.
>
> As stated above, I think the answer is that the result should capture the
> changes that happened between those times, and also have extra metadata in
> the result (e.g., extra column) to tell if those changes were inserts or
> deletes. That, of course, means that engines need to support those
> semantics in their data source APIs too. Does that sound reasonable?
>
> Thanks,
> Walaa.
>
>
> On Sat, Jan 8, 2022 at 12:58 AM Kyle Bendickson <ky...@tabular.io> wrote:
>
>> Thank you Ryan for summarizing that so well.
>>
>> I'm in agreement that it's too convenient to simply ignore due to those
>> caveats, though they are admittedly potentially large caveats.
>>
>> However, some people don't interact with their table that way and I see
>> discussion around ways to implement incremental scans often. I agree this
>> syntax is reasonable and that it's suefuk enough to implement despite the
>> challenges.
>>
>> I believe there's been some previous work on this. Without looking at the
>> other thread again, I do think we should either resurrect that previous
>> work or try to prioritize this.
>>
>> Especially with certain other convenience mechanisms coming up such as
>> snapshots and branches on the horizon, I would hopefully get this in sooner
>> than later so that it's existence can be considered as these new features
>> are developed.
>>
>> - Kyle (kbendick)
>>
>>
>> On Fri, Jan 7, 2022 at 4:35 PM Ryan Blue <bl...@tabular.io> wrote:
>>
>>> Walaa,
>>>
>>> At supporting syntax for VERSIONS BETWEEN SYSTEM TIME ... AND ... seems
>>> reasonable to me. I think it’s often really nice to be able to select the
>>> changes between two points in time for debugging. It would also be nice to
>>> be able to do the same for snapshot IDs, so you could reliably use similar
>>> syntax for incremental consumption.
>>>
>>> There are some challenges with implementing VERSIONS BETWEEN ... that I
>>> want to highlight, though. First, the FOR SYSTEM_TIME syntax produces
>>> the result that would have been returned if you ran the query at that time.
>>> That uses the table history instead of snapshot creation timestamps. If you
>>> update the table and then roll back to a previous snapshot, FOR
>>> SYSTEM_TIME will read the rolled back snapshot if the timestamp is in
>>> the interval when it was the table’s current state.
>>>
>>> We will need to decide whether VERSIONS BETWEEN ... uses table history
>>> or snapshot creation timestamps. If it uses table history, there may be
>>> intervals that don’t have a linear history and the query would fail. If it
>>> uses snapshot creation timestamps, then you’d be able to select intervals
>>> that never really existed, like between commits in a transaction.
>>>
>>> So there are major issues with timestamps. We would want to make it
>>> clear that timestamps are for convenience, and not for incremental
>>> consumption (because of the issue from that thread) and may not reflect the
>>> actual table state (or may fail). That said, it’s a really convenient
>>> feature and I would support adding both VERSIONS BETWEEN with timestamp
>>> and snapshot ID.
>>>
>>> Ryan
>>>
>>> On Thu, Jan 6, 2022 at 9:52 PM Walaa Eldin Moustafa <
>>> wa.moustafa@gmail.com> wrote:
>>>
>>>> Hi Iceberg devs,
>>>>
>>>> We have been considering the problem of Time-sliced incremental scan
>>>> (i.e., reading data that is committed between two timestamps), and I ran
>>>> into this thread [1] in the Iceberg dev mailing list. The summary of the
>>>> thread is that incremental scan should leverage snapshot IDs as opposed to
>>>> timestamps since there is no guarantee that commit timestamps are linear.
>>>>
>>>> I wanted to follow up on that discussion to see if folks are open to
>>>> still supporting time-sliced incremental scan APIs with the caveat above.
>>>> The reasons are many fold:
>>>>
>>>> * Time-slice APIs are more human friendly and simplify the need for a
>>>> state store to track last read snapshot IDs. While some sort of state
>>>> recording may be required in some cases, it is not always the case,
>>>> especially if data consumption happens at regular intervals.
>>>>
>>>> * I understand the caveat discussed in the thread still applies to the
>>>> existing "TIMESTAMP AS OF" API, yet the API is supported. By extension, I
>>>> think it is fair to extend the support to time-sliced incremental scan.
>>>>
>>>> * Iceberg already provides a deterministic function to translate a
>>>> timestamp to a snapshot ID [2]. Since a table can be scanned by snapshot
>>>> range, I think it is reasonable to allow scanning by timestamp range, since
>>>> the translation mechanism already exists.
>>>>
>>>> * Incremental scan using timestamp range is part of the SQL Standard
>>>> (e.g. "VERSIONS BETWEEN SYSTEM TIME ... AND ...") and is supported by some
>>>> existing engines. See [3] for SQL Server support.
>>>>
>>>> * Conceptually, it is possible to implement the same query semantics at
>>>> the SQL level using existing APIs and operators such as "TIMESTAMP AS OF"
>>>> and "EXCEPT" (e.g., by selecting: (T as of timestamp1) EXCEPT (T as of
>>>> timestamp2)). It sounds that incremental scan is an optimization to push
>>>> the differencing operation to the data source as opposed to letting the
>>>> engine deal with it, so it is better to implement the SQL shorthand, and
>>>> its corresponding data source optimization.
>>>>
>>>> Therefore, I would like to ask if we can proceed with supporting an API
>>>> along the lines of "VERSIONS BETWEEN SYSTEM TIME ... AND ...", and consider
>>>> the discussion in [1] as a caveat that folks using this API (or existing
>>>> timestamp APIs) have to keep in mind, until the implementation guarantees
>>>> linearity at some point in the future.
>>>>
>>>> Thanks,
>>>> Walaa.
>>>>
>>>> [1] https://www.mail-archive.com/dev@iceberg.apache.org/msg01504.html
>>>> [2]
>>>> https://github.com/apache/iceberg/blob/1a11038baaca9f4cf03674654a6d9a85f39a1ce1/core/src/main/java/org/apache/iceberg/util/SnapshotUtil.java#L265
>>>> [3]
>>>> https://docs.microsoft.com/en-us/sql/relational-databases/tables/temporal-tables?view=sql-server-ver15
>>>>
>>>
>>>
>>> --
>>> Ryan Blue
>>> Tabular
>>>
>>

-- 
Ryan Blue
Tabular

Re: Time-sliced incremental scan

Posted by Peidian Li <li...@gmail.com>.

*Second, there’s the more immediate question: what does a query
for VERSIONS BETWEEN SYSTEM TIME 2 AND 4 actually mean?*

*My initial answer is that I would return the equivalent of diffing the two
versions, (SELECT * FROM t AS OF TIMESTAMP 2) start_state and (SELECT *
FROM t AS OF TIMESTAMP 4) end_state. That is, SELECT * FROM end_state
EXCEPT start_state. This behavior may be specified in the spec
about VERSIONS BETWEEN but assuming that there aren’t requirements, I would
assume that this is basically a way of getting the same result without
doing the full table scans and joins.*

*--*

We have a similar need. There are some hive tables partitioned by day that
store the full amount of data every day. And users can query the history
data easily, but this method is a waste of storage space. If  `VERSIONS
BETWEEN SYSTEM TIME 2 AND 4` can return all records in every snapshot, the
problem can be solved. Considering the different operations of snapshots,
maybe it should probably filter by operation type.

Ryan Blue <bl...@tabular.io> 于2022年2月3日周四 06:59写道：

> Thanks for the reply, Walaa. Your example is exactly what I was pointing
> out.
>
> I think there are a couple of topics here. First, I agree that we should
> have a way of reading changes with columns that describe what happened.
> That’s really useful for actually consuming incremental changes from a
> table and is supported natively in some processing engines, like Flink. I’m
> all for coming up with better ways of working with the data that return
> rows along with this information. I’m not sure how to do this in SQL, but I
> can imagine a few helpers to produce or consume DataFrames.
>
> Second, there’s the more immediate question: what does a query for VERSIONS
> BETWEEN SYSTEM TIME 2 AND 4 actually mean?
>
> My initial answer is that I would return the equivalent of diffing the two
> versions, (SELECT * FROM t AS OF TIMESTAMP 2) start_state and (SELECT *
> FROM t AS OF TIMESTAMP 4) end_state. That is, SELECT * FROM end_state
> EXCEPT start_state. This behavior may be specified in the spec about VERSIONS
> BETWEEN but assuming that there aren’t requirements, I would assume that
> this is basically a way of getting the same result without doing the full
> table scans and joins.
>
> Ryan
>
> On Thu, Jan 20, 2022 at 7:05 PM Walaa Eldin Moustafa <
> wa.moustafa@gmail.com> wrote:
>
>> Thanks Ryan and Kyle for your input! I am glad we are directionally
>> aligned on the importance and convenience of this feature.
>>
>> Regarding the choice between table history versus snapshot creation
>> times, table history makes a lot of sense, since it captures a change log.
>> This aligns with the current implementation of asOfTime() too.
>>
>> I think the result of a SQL query on timestamp (or snapshot ID) ranges
>> should be the changes that took place on the table during that time,
>> ideally capturing insertions and deletions, which brings us to Ryan's great
>> point about rollbacks, which I think can be modeled as deletions. For
>> example, let us assume the following sequence of events (hopefully it
>> reflects Ryan's example):
>>
>> Snapshot A is committed at timestamp 1
>> Snapshot B is committed at timestamp 2
>> Snapshot C is committed at timestamp 3
>> Table is rolled back to Snapshot B at timestamp 4
>>
>> The table history would look like:
>> (Snapshot ID, event time)
>> (A, 1)
>> (B, 2)
>> (C, 3)
>> (B, 4)
>>
>> The rollback issue arises when someone queries the table for VERSIONS
>> BETWEEN SYSTEM TIME 3 AND 4, which is linear in time, but translates to
>> snapshot range (C, B) from the table history. (C, B) may appear to be an
>> unacceptable range (since B is created before C); however, that can be
>> considered acceptable if we treat reverse snapshot ranges as delete
>> operations. In that case, even if anything was consumed from timestamp 2 to
>> timestamp 3, it will now be taken back (deleted) when consuming from
>> timestamp 3 to timestamp 4.
>>
>> That leads to the question of how to model a SQL query like SELECT *
>> FROM T VERSIONS BETWEEN SYSTEM TIME ... AND ... from the user and the
>> engine points of view.
>>
>> As stated above, I think the answer is that the result should capture the
>> changes that happened between those times, and also have extra metadata in
>> the result (e.g., extra column) to tell if those changes were inserts or
>> deletes. That, of course, means that engines need to support those
>> semantics in their data source APIs too. Does that sound reasonable?
>>
>> Thanks,
>> Walaa.
>>
>>
>> On Sat, Jan 8, 2022 at 12:58 AM Kyle Bendickson <ky...@tabular.io> wrote:
>>
>>> Thank you Ryan for summarizing that so well.
>>>
>>> I'm in agreement that it's too convenient to simply ignore due to those
>>> caveats, though they are admittedly potentially large caveats.
>>>
>>> However, some people don't interact with their table that way and I see
>>> discussion around ways to implement incremental scans often. I agree this
>>> syntax is reasonable and that it's suefuk enough to implement despite the
>>> challenges.
>>>
>>> I believe there's been some previous work on this. Without looking at
>>> the other thread again, I do think we should either resurrect that previous
>>> work or try to prioritize this.
>>>
>>> Especially with certain other convenience mechanisms coming up such as
>>> snapshots and branches on the horizon, I would hopefully get this in sooner
>>> than later so that it's existence can be considered as these new features
>>> are developed.
>>>
>>> - Kyle (kbendick)
>>>
>>>
>>> On Fri, Jan 7, 2022 at 4:35 PM Ryan Blue <bl...@tabular.io> wrote:
>>>
>>>> Walaa,
>>>>
>>>> At supporting syntax for VERSIONS BETWEEN SYSTEM TIME ... AND ...
>>>> seems reasonable to me. I think it’s often really nice to be able to select
>>>> the changes between two points in time for debugging. It would also be nice
>>>> to be able to do the same for snapshot IDs, so you could reliably use
>>>> similar syntax for incremental consumption.
>>>>
>>>> There are some challenges with implementing VERSIONS BETWEEN ... that
>>>> I want to highlight, though. First, the FOR SYSTEM_TIME syntax
>>>> produces the result that would have been returned if you ran the query at
>>>> that time. That uses the table history instead of snapshot creation
>>>> timestamps. If you update the table and then roll back to a previous
>>>> snapshot, FOR SYSTEM_TIME will read the rolled back snapshot if the
>>>> timestamp is in the interval when it was the table’s current state.
>>>>
>>>> We will need to decide whether VERSIONS BETWEEN ... uses table history
>>>> or snapshot creation timestamps. If it uses table history, there may be
>>>> intervals that don’t have a linear history and the query would fail. If it
>>>> uses snapshot creation timestamps, then you’d be able to select intervals
>>>> that never really existed, like between commits in a transaction.
>>>>
>>>> So there are major issues with timestamps. We would want to make it
>>>> clear that timestamps are for convenience, and not for incremental
>>>> consumption (because of the issue from that thread) and may not reflect the
>>>> actual table state (or may fail). That said, it’s a really convenient
>>>> feature and I would support adding both VERSIONS BETWEEN with
>>>> timestamp and snapshot ID.
>>>>
>>>> Ryan
>>>>
>>>> On Thu, Jan 6, 2022 at 9:52 PM Walaa Eldin Moustafa <
>>>> wa.moustafa@gmail.com> wrote:
>>>>
>>>>> Hi Iceberg devs,
>>>>>
>>>>> We have been considering the problem of Time-sliced incremental scan
>>>>> (i.e., reading data that is committed between two timestamps), and I ran
>>>>> into this thread [1] in the Iceberg dev mailing list. The summary of the
>>>>> thread is that incremental scan should leverage snapshot IDs as opposed to
>>>>> timestamps since there is no guarantee that commit timestamps are linear.
>>>>>
>>>>> I wanted to follow up on that discussion to see if folks are open to
>>>>> still supporting time-sliced incremental scan APIs with the caveat above.
>>>>> The reasons are many fold:
>>>>>
>>>>> * Time-slice APIs are more human friendly and simplify the need for a
>>>>> state store to track last read snapshot IDs. While some sort of state
>>>>> recording may be required in some cases, it is not always the case,
>>>>> especially if data consumption happens at regular intervals.
>>>>>
>>>>> * I understand the caveat discussed in the thread still applies to the
>>>>> existing "TIMESTAMP AS OF" API, yet the API is supported. By extension, I
>>>>> think it is fair to extend the support to time-sliced incremental scan.
>>>>>
>>>>> * Iceberg already provides a deterministic function to translate a
>>>>> timestamp to a snapshot ID [2]. Since a table can be scanned by snapshot
>>>>> range, I think it is reasonable to allow scanning by timestamp range, since
>>>>> the translation mechanism already exists.
>>>>>
>>>>> * Incremental scan using timestamp range is part of the SQL Standard
>>>>> (e.g. "VERSIONS BETWEEN SYSTEM TIME ... AND ...") and is supported by some
>>>>> existing engines. See [3] for SQL Server support.
>>>>>
>>>>> * Conceptually, it is possible to implement the same query semantics
>>>>> at the SQL level using existing APIs and operators such as "TIMESTAMP AS
>>>>> OF" and "EXCEPT" (e.g., by selecting: (T as of timestamp1) EXCEPT (T as of
>>>>> timestamp2)). It sounds that incremental scan is an optimization to push
>>>>> the differencing operation to the data source as opposed to letting the
>>>>> engine deal with it, so it is better to implement the SQL shorthand, and
>>>>> its corresponding data source optimization.
>>>>>
>>>>> Therefore, I would like to ask if we can proceed with supporting an
>>>>> API along the lines of "VERSIONS BETWEEN SYSTEM TIME ... AND ...", and
>>>>> consider the discussion in [1] as a caveat that folks using this API (or
>>>>> existing timestamp APIs) have to keep in mind, until the implementation
>>>>> guarantees linearity at some point in the future.
>>>>>
>>>>> Thanks,
>>>>> Walaa.
>>>>>
>>>>> [1] https://www.mail-archive.com/dev@iceberg.apache.org/msg01504.html
>>>>> [2]
>>>>> https://github.com/apache/iceberg/blob/1a11038baaca9f4cf03674654a6d9a85f39a1ce1/core/src/main/java/org/apache/iceberg/util/SnapshotUtil.java#L265
>>>>> [3]
>>>>> https://docs.microsoft.com/en-us/sql/relational-databases/tables/temporal-tables?view=sql-server-ver15
>>>>>
>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Tabular
>>>>
>>>
>
> --
> Ryan Blue
> Tabular
>


-- 
-----------
Best  Regards
Peidian Li