You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by "CodingCat (via GitHub)" <gi...@apache.org> on 2023/04/08 15:21:40 UTC
[GitHub] [iceberg] CodingCat commented on a diff in pull request #7147: Spark 3.3: Add doc for the changelog view procedure.

CodingCat commented on code in PR #7147:
URL: https://github.com/apache/iceberg/pull/7147#discussion_r1161124315


##########
docs/spark-procedures.md:
##########
@@ -587,3 +587,119 @@ Get all the snapshot ancestors by a particular snapshot
 CALL spark_catalog.system.ancestors_of('db.tbl', 1)
 CALL spark_catalog.system.ancestors_of(snapshot_id => 1, table => 'db.tbl')
 ```
+
+## Change Data Capture 
+
+### `create_changelog_view`
+
+Creates a view that contains the changes from a given table. 
+
+#### Remove carry-over rows
+
+The procedure removes the carry-over rows by default. Carry-over rows are the result of row-level operations(`MERGE`, `UPDATE` and `DELETE`)
+when using copy-on-write. For example, given a file which contains row1 `(id=1, name='Alice')` and row2 `(id=2, name='Bob')`.
+A copy-on-write delete of row2 would require erasing this file and preserving row1 in a new file. The changelog table
+reports this as the following pair of rows, despite it not being an actual change to the table.
+
+| id  | name  | _change_type |
+|-----|-------|--------------|
+| 1   | Alice | DELETE       |
+| 1   | Alice | INSERT       |
+
+By default, this view finds the carry-over rows and removes them from the result. User can disable this 
+behavior by setting the `remove_carryovers` option to `false`.
+
+#### Compute pre/post update images
+
+The procedure computes the pre/post update images if configured. Pre/post update images are converted from a
+pair of a delete row and an insert row. Identifier columns are used for determining whether an insert and a delete record
+refer to the same row. If the two records share the same values for the identity columns they are considered to be before
+and after states of the same row. You can either set identifier fields in the table schema or input them as the procedure parameters.
+
+The following example shows pre/post update images computation with an identifier column(`id`), where a row deletion
+and an insertion with the same `id` are treated as a single update operation. Specifically, suppose we have the following pair of rows:
+
+| id  | name   | _change_type |
+|-----|--------|--------------|
+| 3   | Robert | DELETE       |
+| 3   | Dan    | INSERT       |
+
+In this case, the procedure marks the row before the update as an `UPDATE_BEFORE` image and the row after the update
+as an `UPDATE_AFTER` image, resulting in the following pre/post update images:
+
+| id  | name   | _change_type |
+|-----|--------|--------------|
+| 3   | Robert | UPDATE_BEFORE|
+| 3   | Dan    | UPDATE_AFTER |
+
+#### Usage

Review Comment:
   I think we may want to move the basic usage as the first subsection under `create_changelog_view` , and then some special subsections to explain what's a carry-over row and what is a `pre/post update` and (how to configure it with some examples)....which is a more straightforward tutorial structure?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org