Posted to issues@iceberg.apache.org by "flyrain (via GitHub)" <gi...@apache.org> on 2023/03/20 20:35:34 UTC

[GitHub] [iceberg] flyrain opened a new pull request, #7147: Spark 3.3: Add doc for the changelog view procedure.

flyrain opened a new pull request, #7147:
URL: https://github.com/apache/iceberg/pull/7147

   cc @aokolnychyi @RussellSpitzer @chenjunjiedada @rdblue 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] flyrain commented on pull request #7147: Spark 3.3: Add doc for the changelog view procedure.

Posted by "flyrain (via GitHub)" <gi...@apache.org>.
flyrain commented on PR #7147:
URL: https://github.com/apache/iceberg/pull/7147#issuecomment-1504060283

   Thanks @RussellSpitzer @CodingCat for the reviews!




[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #7147: Spark 3.3: Add doc for the changelog view procedure.

Posted by "RussellSpitzer (via GitHub)" <gi...@apache.org>.
RussellSpitzer commented on code in PR #7147:
URL: https://github.com/apache/iceberg/pull/7147#discussion_r1143673162


##########
docs/spark-procedures.md:
##########
@@ -587,3 +587,75 @@ Get all the snapshot ancestors by a particular snapshot
 CALL spark_catalog.system.ancestors_of('db.tbl', 1)
 CALL spark_catalog.system.ancestors_of(snapshot_id => 1, table => 'db.tbl')
 ```
+
+## Change Data Capture 
+
+### `create_changelog_view`
+
+Creates a view that contains the changes from a given table. 
+
+#### Usage
+
+| Argument Name | Required? | Type | Description                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |
+|---------------|----------|------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| `table`       | ✔️ | string | Name of the table to create changlog view                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
+| `changelog_view`        |   | string | Name of the view to create                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |
+| `options`     |   | map<string, string> | A map of Spark read options to use. For example, `start-snapshot-id`, the snapshot id to start reading from exclusively. If not provided, the table’s first snapshot will be used as the starting point. `end-snapshot-id`, the snapshot id to stop reading at inclusively. Default to the current snapshot. `start-timestamp`, the timestamp to start reading from exclusively. If not provided, the table’s first snapshot will be used as the starting point.`end-timestamp`, the timestamp to stop reading at inclusively. If not provided, the table’s current snapshot will be used as the ending point. | 
+|`compute_updates`| | boolean | Whether to compute updates. Defaults to false                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   | 
+|`identifier_columns`| | array<string> | The list of identifier columns. If not provided, and `compute_updates` is true, the table’s current identifier fields will be used.                                                                                                                                                                                                                                                                                                                                                                                                                                                             |

Review Comment:
   The description should explain how these columns are used. In this case:
   "Used when `compute_updates` is true to identify pre- and post-update rows. If a delete record and an insert record have identical values in the identifier columns, they are considered the pre- and post-update images of the same row."
   
   Or something like that
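
To make the suggested wording concrete, here is a minimal sketch in plain Python of the pairing rule described above. It is an illustration only, not Iceberg's implementation: the function name `compute_update_images` and the dict-based row format are assumptions made for this example, and it ignores snapshot ordering, which a real changelog would also have to consider.

```python
from collections import defaultdict


def compute_update_images(rows, identifier_columns):
    """Relabel DELETE/INSERT pairs that share identifier values as update images.

    `rows` is a list of dicts with a `_change_type` key (hypothetical format).
    """
    # Group rows by the values of their identifier columns.
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[col] for col in identifier_columns)
        groups[key].append(row)

    result = []
    for group in groups.values():
        change_types = sorted(r["_change_type"] for r in group)
        if change_types == ["DELETE", "INSERT"]:
            # Exactly one delete and one insert for this identifier:
            # treat them as the before and after images of one update.
            for r in group:
                image = "UPDATE_BEFORE" if r["_change_type"] == "DELETE" else "UPDATE_AFTER"
                result.append({**r, "_change_type": image})
        else:
            # Anything else is left untouched in this sketch.
            result.extend(group)
    return result


changes = [
    {"id": 3, "name": "Robert", "_change_type": "DELETE"},
    {"id": 3, "name": "Dan", "_change_type": "INSERT"},
    {"id": 4, "name": "Eve", "_change_type": "INSERT"},
]
images = compute_update_images(changes, ["id"])
```

With `id` as the identifier column, the two rows for `id = 3` become an `UPDATE_BEFORE`/`UPDATE_AFTER` pair, while the unpaired insert for `id = 4` stays an `INSERT`.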





[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #7147: Spark 3.3: Add doc for the changelog view procedure.

Posted by "RussellSpitzer (via GitHub)" <gi...@apache.org>.
RussellSpitzer commented on code in PR #7147:
URL: https://github.com/apache/iceberg/pull/7147#discussion_r1162945043


##########
docs/spark-procedures.md:
##########
@@ -587,3 +587,119 @@ Get all the snapshot ancestors by a particular snapshot
 CALL spark_catalog.system.ancestors_of('db.tbl', 1)
 CALL spark_catalog.system.ancestors_of(snapshot_id => 1, table => 'db.tbl')
 ```
+
+## Change Data Capture 
+
+### `create_changelog_view`
+
+Creates a view that contains the changes from a given table. 
+
+#### Remove carry-over rows
+
+The procedure removes the carry-over rows by default. Carry-over rows are the result of row-level operations(`MERGE`, `UPDATE` and `DELETE`)
+when using copy-on-write. For example, given a file which contains row1 `(id=1, name='Alice')` and row2 `(id=2, name='Bob')`.
+A copy-on-write delete of row2 would require erasing this file and preserving row1 in a new file. The changelog table
+reports this as the following pair of rows, despite it not being an actual change to the table.
+
+| id  | name  | _change_type |
+|-----|-------|--------------|
+| 1   | Alice | DELETE       |
+| 1   | Alice | INSERT       |
+
+By default, this view finds the carry-over rows and removes them from the result. User can disable this 
+behavior by setting the `remove_carryovers` option to `false`.
+
+#### Compute pre/post update images
+
+The procedure computes the pre/post update images if configured. Pre/post update images are converted from a
+pair of a delete row and an insert row. Identifier columns are used for determining whether an insert and a delete record
+refer to the same row. If the two records share the same values for the identity columns they are considered to be before
+and after states of the same row. You can either set identifier fields in the table schema or input them as the procedure parameters.
+
+The following example shows pre/post update images computation with an identifier column(`id`), where a row deletion
+and an insertion with the same `id` are treated as a single update operation. Specifically, suppose we have the following pair of rows:
+
+| id  | name   | _change_type |
+|-----|--------|--------------|
+| 3   | Robert | DELETE       |
+| 3   | Dan    | INSERT       |
+
+In this case, the procedure marks the row before the update as an `UPDATE_BEFORE` image and the row after the update
+as an `UPDATE_AFTER` image, resulting in the following pre/post update images:
+
+| id  | name   | _change_type |
+|-----|--------|--------------|
+| 3   | Robert | UPDATE_BEFORE|
+| 3   | Dan    | UPDATE_AFTER |
+
+#### Usage
+
+| Argument Name | Required? | Type | Description                                                                                                                                                                                                           |
+|---------------|----------|------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| `table`       | ✔️ | string | Name of the source table for the changelog                                                                                                                                                                            |
+| `changelog_view`        |   | string | Name of the view to create                                                                                                                                                                                            |
+| `options`     |   | map<string, string> | A map of Spark read options to use                                                                                                                                                                                    |
+|`compute_updates`| | boolean | Whether to compute pre/post update images, defaults to false.                                                                                                                                                         | 

Review Comment:
   update images (see below for more information)





[GitHub] [iceberg] flyrain merged pull request #7147: Spark 3.3: Add doc for the changelog view procedure.

Posted by "flyrain (via GitHub)" <gi...@apache.org>.
flyrain merged PR #7147:
URL: https://github.com/apache/iceberg/pull/7147




[GitHub] [iceberg] CodingCat commented on a diff in pull request #7147: Spark 3.3: Add doc for the changelog view procedure.

Posted by "CodingCat (via GitHub)" <gi...@apache.org>.
CodingCat commented on code in PR #7147:
URL: https://github.com/apache/iceberg/pull/7147#discussion_r1161124315


##########
docs/spark-procedures.md:
##########
@@ -587,3 +587,119 @@ Get all the snapshot ancestors by a particular snapshot
 CALL spark_catalog.system.ancestors_of('db.tbl', 1)
 CALL spark_catalog.system.ancestors_of(snapshot_id => 1, table => 'db.tbl')
 ```
+
+## Change Data Capture 
+
+### `create_changelog_view`
+
+Creates a view that contains the changes from a given table. 
+
+#### Remove carry-over rows
+
+The procedure removes the carry-over rows by default. Carry-over rows are the result of row-level operations(`MERGE`, `UPDATE` and `DELETE`)
+when using copy-on-write. For example, given a file which contains row1 `(id=1, name='Alice')` and row2 `(id=2, name='Bob')`.
+A copy-on-write delete of row2 would require erasing this file and preserving row1 in a new file. The changelog table
+reports this as the following pair of rows, despite it not being an actual change to the table.
+
+| id  | name  | _change_type |
+|-----|-------|--------------|
+| 1   | Alice | DELETE       |
+| 1   | Alice | INSERT       |
+
+By default, this view finds the carry-over rows and removes them from the result. User can disable this 
+behavior by setting the `remove_carryovers` option to `false`.
+
+#### Compute pre/post update images
+
+The procedure computes the pre/post update images if configured. Pre/post update images are converted from a
+pair of a delete row and an insert row. Identifier columns are used for determining whether an insert and a delete record
+refer to the same row. If the two records share the same values for the identity columns they are considered to be before
+and after states of the same row. You can either set identifier fields in the table schema or input them as the procedure parameters.
+
+The following example shows pre/post update images computation with an identifier column(`id`), where a row deletion
+and an insertion with the same `id` are treated as a single update operation. Specifically, suppose we have the following pair of rows:
+
+| id  | name   | _change_type |
+|-----|--------|--------------|
+| 3   | Robert | DELETE       |
+| 3   | Dan    | INSERT       |
+
+In this case, the procedure marks the row before the update as an `UPDATE_BEFORE` image and the row after the update
+as an `UPDATE_AFTER` image, resulting in the following pre/post update images:
+
+| id  | name   | _change_type |
+|-----|--------|--------------|
+| 3   | Robert | UPDATE_BEFORE|
+| 3   | Dan    | UPDATE_AFTER |
+
+#### Usage

Review Comment:
   I think we should make the basic usage the first subsection under `create_changelog_view`, followed by dedicated subsections that explain what a carry-over row is, what pre/post update images are, and how to configure them, with some examples. Wouldn't that be a more straightforward tutorial structure?
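
As a rough idea of what a basic-usage example could look like, the sketch below assembles a `CALL` statement from the arguments listed in the usage table. The helper `changelog_view_call` is hypothetical and exists only for this illustration; it assumes the procedure is exposed under `<catalog>.system` like `ancestors_of` in the surrounding docs, and that Spark SQL's `map(...)` and `array(...)` functions are used to pass `options` and `identifier_columns`.

```python
def changelog_view_call(table, changelog_view=None, options=None,
                        compute_updates=None, identifier_columns=None,
                        catalog="spark_catalog"):
    """Build a CALL statement for create_changelog_view (hypothetical helper)."""

    def quote(value):
        # The sketch does no escaping, so reject values that contain quotes.
        text = str(value)
        if "'" in text:
            raise ValueError(f"unsupported quote in value: {text!r}")
        return "'" + text + "'"

    # `table` is the only required argument; the rest are added when given.
    args = [f"table => {quote(table)}"]
    if changelog_view is not None:
        args.append(f"changelog_view => {quote(changelog_view)}")
    if options:
        pairs = ", ".join(f"{quote(k)}, {quote(v)}" for k, v in options.items())
        args.append(f"options => map({pairs})")
    if compute_updates is not None:
        args.append(f"compute_updates => {str(bool(compute_updates)).lower()}")
    if identifier_columns:
        cols = ", ".join(quote(c) for c in identifier_columns)
        args.append(f"identifier_columns => array({cols})")

    joined = ", ".join(args)
    return f"CALL {catalog}.system.create_changelog_view({joined})"


statement = changelog_view_call(
    "db.tbl",
    options={"start-snapshot-id": "1", "end-snapshot-id": "2"},
)
```

The resulting string could then be passed to `spark.sql(...)` in a session where the Iceberg catalog is configured.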





[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #7147: Spark 3.3: Add doc for the changelog view procedure.

Posted by "RussellSpitzer (via GitHub)" <gi...@apache.org>.
RussellSpitzer commented on code in PR #7147:
URL: https://github.com/apache/iceberg/pull/7147#discussion_r1143664068


##########
docs/spark-procedures.md:
##########
@@ -587,3 +587,75 @@ Get all the snapshot ancestors by a particular snapshot
 CALL spark_catalog.system.ancestors_of('db.tbl', 1)
 CALL spark_catalog.system.ancestors_of(snapshot_id => 1, table => 'db.tbl')
 ```
+
+## Change Data Capture 
+
+### `create_changelog_view`
+
+Creates a view that contains the changes from a given table. 
+
+#### Usage
+
+| Argument Name | Required? | Type | Description                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |
+|---------------|----------|------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| `table`       | ✔️ | string | Name of the table to create changlog view                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
+| `changelog_view`        |   | string | Name of the view to create                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |
+| `options`     |   | map<string, string> | A map of Spark read options to use. For example, `start-snapshot-id`, the snapshot id to start reading from exclusively. If not provided, the table’s first snapshot will be used as the starting point. `end-snapshot-id`, the snapshot id to stop reading at inclusively. Default to the current snapshot. `start-timestamp`, the timestamp to start reading from exclusively. If not provided, the table’s first snapshot will be used as the starting point.`end-timestamp`, the timestamp to stop reading at inclusively. If not provided, the table’s current snapshot will be used as the ending point. | 
+|`compute_updates`| | boolean | Whether to compute updates. Defaults to false                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   | 

Review Comment:
   This is a bit confusing. Looking at this from an outside perspective, I'm not sure what "compute_updates" would mean. Ideally, all of our descriptions should be more descriptive than a restatement of the name of the column.





[GitHub] [iceberg] flyrain commented on pull request #7147: Spark 3.3: Add doc for the changelog view procedure.

Posted by "flyrain (via GitHub)" <gi...@apache.org>.
flyrain commented on PR #7147:
URL: https://github.com/apache/iceberg/pull/7147#issuecomment-1496716618

   Hi @RussellSpitzer, resolved comments. It's ready for another look.




[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #7147: Spark 3.3: Add doc for the changelog view procedure.

Posted by "RussellSpitzer (via GitHub)" <gi...@apache.org>.
RussellSpitzer commented on code in PR #7147:
URL: https://github.com/apache/iceberg/pull/7147#discussion_r1143674923


##########
docs/spark-procedures.md:
##########
@@ -587,3 +587,75 @@ Get all the snapshot ancestors by a particular snapshot
 CALL spark_catalog.system.ancestors_of('db.tbl', 1)
 CALL spark_catalog.system.ancestors_of(snapshot_id => 1, table => 'db.tbl')
 ```
+
+## Change Data Capture 
+
+### `create_changelog_view`
+
+Creates a view that contains the changes from a given table. 
+
+#### Usage
+
+| Argument Name | Required? | Type | Description                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |
+|---------------|----------|------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| `table`       | ✔️ | string | Name of the table to create changlog view                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
+| `changelog_view`        |   | string | Name of the view to create                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |
+| `options`     |   | map<string, string> | A map of Spark read options to use. For example, `start-snapshot-id`, the snapshot id to start reading from exclusively. If not provided, the table’s first snapshot will be used as the starting point. `end-snapshot-id`, the snapshot id to stop reading at inclusively. Default to the current snapshot. `start-timestamp`, the timestamp to start reading from exclusively. If not provided, the table’s first snapshot will be used as the starting point.`end-timestamp`, the timestamp to stop reading at inclusively. If not provided, the table’s current snapshot will be used as the ending point. | 
+|`compute_updates`| | boolean | Whether to compute updates. Defaults to false                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   | 
+|`identifier_columns`| | array<string> | The list of identifier columns. If not provided, and `compute_updates` is true, the table’s current identifier fields will be used.                                                                                                                                                                                                                                                                                                                                                                                                                                                             |
+|`remove_carryovers`| | boolean | Whether to remove carry-over rows. Defaults to true.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |

Review Comment:
   Carry-over rows are not defined, and the same comment as on the options above applies: each description needs to explain the usage and function.
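
For readers unfamiliar with the term, the following plain-Python sketch shows one way to think about carry-over removal: a `DELETE` and an `INSERT` whose data columns are all identical cancel each other out. It is an illustration under stated assumptions, not Iceberg's implementation; the function name `remove_carryovers` mirrors the argument name, and the dict-based row format is made up for this example.

```python
from collections import Counter


def remove_carryovers(rows):
    """Drop DELETE/INSERT pairs whose data columns are identical."""

    def data_key(row):
        # Everything except the change type identifies the row content.
        return tuple(sorted((k, v) for k, v in row.items() if k != "_change_type"))

    deletes = Counter(data_key(r) for r in rows if r["_change_type"] == "DELETE")
    inserts = Counter(data_key(r) for r in rows if r["_change_type"] == "INSERT")

    # For each distinct row content, min(#deletes, #inserts) pairs cancel out.
    pairs = deletes & inserts
    remaining = {"DELETE": Counter(pairs), "INSERT": Counter(pairs)}

    kept = []
    for row in rows:
        budget = remaining.get(row["_change_type"])
        key = data_key(row)
        if budget is not None and budget[key] > 0:
            budget[key] -= 1  # this row is one half of a carry-over pair
            continue
        kept.append(row)
    return kept


# Copy-on-write delete of Bob: Alice is rewritten into a new file unchanged.
raw_changes = [
    {"id": 1, "name": "Alice", "_change_type": "DELETE"},
    {"id": 2, "name": "Bob", "_change_type": "DELETE"},
    {"id": 1, "name": "Alice", "_change_type": "INSERT"},
]
net_changes = remove_carryovers(raw_changes)
```

Here only the deletion of Bob survives, because the Alice rows are an identical delete/insert pair produced by rewriting the file.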





[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #7147: Spark 3.3: Add doc for the changelog view procedure.

Posted by "RussellSpitzer (via GitHub)" <gi...@apache.org>.
RussellSpitzer commented on code in PR #7147:
URL: https://github.com/apache/iceberg/pull/7147#discussion_r1162944301


##########
docs/spark-procedures.md:
##########
@@ -587,3 +587,119 @@ Get all the snapshot ancestors by a particular snapshot
 CALL spark_catalog.system.ancestors_of('db.tbl', 1)
 CALL spark_catalog.system.ancestors_of(snapshot_id => 1, table => 'db.tbl')
 ```
+
+## Change Data Capture 
+
+### `create_changelog_view`
+
+Creates a view that contains the changes from a given table. 
+
+#### Remove carry-over rows
+
+The procedure removes the carry-over rows by default. Carry-over rows are the result of row-level operations(`MERGE`, `UPDATE` and `DELETE`)
+when using copy-on-write. For example, given a file which contains row1 `(id=1, name='Alice')` and row2 `(id=2, name='Bob')`.
+A copy-on-write delete of row2 would require erasing this file and preserving row1 in a new file. The changelog table
+reports this as the following pair of rows, despite it not being an actual change to the table.
+
+| id  | name  | _change_type |
+|-----|-------|--------------|
+| 1   | Alice | DELETE       |
+| 1   | Alice | INSERT       |
+
+By default, this view finds the carry-over rows and removes them from the result. User can disable this 
+behavior by setting the `remove_carryovers` option to `false`.
+
+#### Compute pre/post update images
+
+The procedure computes the pre/post update images if configured. Pre/post update images are converted from a
+pair of a delete row and an insert row. Identifier columns are used for determining whether an insert and a delete record
+refer to the same row. If the two records share the same values for the identity columns they are considered to be before
+and after states of the same row. You can either set identifier fields in the table schema or input them as the procedure parameters.
+
+The following example shows pre/post update images computation with an identifier column(`id`), where a row deletion
+and an insertion with the same `id` are treated as a single update operation. Specifically, suppose we have the following pair of rows:
+
+| id  | name   | _change_type |
+|-----|--------|--------------|
+| 3   | Robert | DELETE       |
+| 3   | Dan    | INSERT       |
+
+In this case, the procedure marks the row before the update as an `UPDATE_BEFORE` image and the row after the update
+as an `UPDATE_AFTER` image, resulting in the following pre/post update images:
+
+| id  | name   | _change_type |
+|-----|--------|--------------|
+| 3   | Robert | UPDATE_BEFORE|
+| 3   | Dan    | UPDATE_AFTER |
+
+#### Usage
+
+| Argument Name | Required? | Type | Description                                                                                                                                                                                                           |
+|---------------|----------|------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| `table`       | ✔️ | string | Name of the source table for the changelog                                                                                                                                                                            |
+| `changelog_view`        |   | string | Name of the view to create                                                                                                                                                                                            |
+| `options`     |   | map<string, string> | A map of Spark read options to use                                                                                                                                                                                    |
+|`compute_updates`| | boolean | Whether to compute pre/post update images, defaults to false.                                                                                                                                                         | 
+|`identifier_columns`| | array<string> | The list of identifier columns to compute updates. If the argument `compute_updates` is set to true and `identifier_columns` are not provided, the table’s current identifier fields will be used to compute updates. |
+|`remove_carryovers`| | boolean | Whether to remove carry-over rows. Defaults to true.                                                                                                                                                                  |

Review Comment:
   "Whether to remove carry-over rows (see below for more information)"
   
   After we reorder this so that usage, output, and examples come before the carry-over and post-image info.
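
   To make the two behaviours in the quoted doc concrete, here is a minimal Python sketch of carry-over removal and pre/post update image computation. It is not Iceberg's implementation: `remove_carryovers` and `compute_updates` are hypothetical helper functions that operate on plain tuples whose last element is the `_change_type`, and the identifier column is assumed to be the first element.

```python
from collections import Counter, defaultdict


def remove_carryovers(rows):
    """Drop DELETE/INSERT pairs whose data columns are identical (hypothetical helper)."""
    deletes = Counter(r[:-1] for r in rows if r[-1] == "DELETE")
    inserts = Counter(r[:-1] for r in rows if r[-1] == "INSERT")
    # For each distinct data tuple, the number of DELETE/INSERT pairs that cancel out.
    cancel_delete = {k: min(n, inserts[k]) for k, n in deletes.items()}
    cancel_insert = dict(cancel_delete)
    result = []
    for row in rows:
        data, change = row[:-1], row[-1]
        budget = {"DELETE": cancel_delete, "INSERT": cancel_insert}.get(change)
        if budget is not None and budget.get(data, 0) > 0:
            budget[data] -= 1  # this row is one half of a carry-over pair
            continue
        result.append(row)
    return result


def compute_updates(rows, id_index=0):
    """Relabel a DELETE plus an INSERT with the same identifier as update images."""
    by_id = defaultdict(list)
    for i, row in enumerate(rows):
        by_id[row[id_index]].append(i)
    result = list(rows)
    for indexes in by_id.values():
        # Only an exact one-DELETE, one-INSERT group is treated as an update here.
        if sorted(rows[i][-1] for i in indexes) == ["DELETE", "INSERT"]:
            for i in indexes:
                label = "UPDATE_BEFORE" if rows[i][-1] == "DELETE" else "UPDATE_AFTER"
                result[i] = rows[i][:-1] + (label,)
    return result


# Rows taken from the examples in the quoted doc, plus the deleted row2.
rows = [
    (1, "Alice", "DELETE"),
    (1, "Alice", "INSERT"),
    (2, "Bob", "DELETE"),
    (3, "Robert", "DELETE"),
    (3, "Dan", "INSERT"),
]
changes = compute_updates(remove_carryovers(rows))
```

   With these rows, the Alice pair is dropped as a carry-over, the Bob delete is kept, and the Robert/Dan pair is relabelled as `UPDATE_BEFORE`/`UPDATE_AFTER`.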





[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #7147: Spark 3.3: Add doc for the changelog view procedure.

Posted by "RussellSpitzer (via GitHub)" <gi...@apache.org>.
RussellSpitzer commented on code in PR #7147:
URL: https://github.com/apache/iceberg/pull/7147#discussion_r1143661223


##########
docs/spark-procedures.md:
##########
@@ -587,3 +587,75 @@ Get all the snapshot ancestors by a particular snapshot
 CALL spark_catalog.system.ancestors_of('db.tbl', 1)
 CALL spark_catalog.system.ancestors_of(snapshot_id => 1, table => 'db.tbl')
 ```
+
+## Change Data Capture 
+
+### `create_changelog_view`
+
+Creates a view that contains the changes from a given table. 
+
+#### Usage
+
+| Argument Name | Required? | Type | Description                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |
+|---------------|----------|------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| `table`       | ✔️ | string | Name of the table to create changelog view |
+| `changelog_view`        |   | string | Name of the view to create                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |
+| `options`     |   | map<string, string> | A map of Spark read options to use. For example, `start-snapshot-id`, the snapshot id to start reading from exclusively. If not provided, the table’s first snapshot will be used as the starting point. `end-snapshot-id`, the snapshot id to stop reading at inclusively. Defaults to the current snapshot. `start-timestamp`, the timestamp to start reading from exclusively. If not provided, the table’s first snapshot will be used as the starting point. `end-timestamp`, the timestamp to stop reading at inclusively. If not provided, the table’s current snapshot will be used as the ending point. |

Review Comment:
   I think this ends up being hard to read; maybe we should have a small "options" table below here.
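
   As a rough illustration of how the read options named in that row might be passed, the Python sketch below only assembles the SQL text of a call. `changelog_view_call` is a hypothetical helper, the `spark_catalog` catalog and `db.tbl` table names are borrowed from the `ancestors_of` example in the same diff, and the named-argument form is assumed to follow the same convention. It does no SQL escaping.

```python
def changelog_view_call(table, options=None, catalog="spark_catalog"):
    """Build the text of a create_changelog_view call (hypothetical helper, no escaping)."""
    args = [f"table => '{table}'"]
    if options:
        # `options` is map<string, string>, so every key and value is rendered as a string literal.
        pairs = ", ".join(f"'{key}', '{value}'" for key, value in options.items())
        args.append(f"options => map({pairs})")
    joined = ", ".join(args)
    return f"CALL {catalog}.system.create_changelog_view({joined})"


# Read changes after snapshot 1 (exclusive) up to snapshot 2 (inclusive).
statement = changelog_view_call(
    "db.tbl",
    {"start-snapshot-id": 1, "end-snapshot-id": 2},
)
```

   The resulting string could then be handed to a Spark session; keeping the options in a dict makes each key and its value easier to scan than the single long table cell.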





[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #7147: Spark 3.3: Add doc for the changelog view procedure.

Posted by "RussellSpitzer (via GitHub)" <gi...@apache.org>.
RussellSpitzer commented on code in PR #7147:
URL: https://github.com/apache/iceberg/pull/7147#discussion_r1143675439


##########
docs/spark-procedures.md:
##########
@@ -587,3 +587,75 @@ Get all the snapshot ancestors by a particular snapshot
 CALL spark_catalog.system.ancestors_of('db.tbl', 1)
 CALL spark_catalog.system.ancestors_of(snapshot_id => 1, table => 'db.tbl')
 ```
+
+## Change Data Capture 
+
+### `create_changelog_view`
+
+Creates a view that contains the changes from a given table. 
+
+#### Usage
+
+| Argument Name | Required? | Type | Description                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |
+|---------------|----------|------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| `table`       | ✔️ | string | Name of the table to create changelog view |
+| `changelog_view`        |   | string | Name of the view to create                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |
+| `options`     |   | map<string, string> | A map of Spark read options to use. For example, `start-snapshot-id`, the snapshot id to start reading from exclusively. If not provided, the table’s first snapshot will be used as the starting point. `end-snapshot-id`, the snapshot id to stop reading at inclusively. Defaults to the current snapshot. `start-timestamp`, the timestamp to start reading from exclusively. If not provided, the table’s first snapshot will be used as the starting point. `end-timestamp`, the timestamp to stop reading at inclusively. If not provided, the table’s current snapshot will be used as the ending point. |
+|`compute_updates`| | boolean | Whether to compute updates. Defaults to false                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   | 
+|`identifier_columns`| | array<string> | The list of identifier columns. If not provided, and `compute_updates` is true, the table’s current identifier fields will be used.                                                                                                                                                                                                                                                                                                                                                                                                                                                             |
+|`remove_carryovers`| | boolean | Whether to remove carry-over rows. Defaults to true.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |
+
+#### Output
+| Output Name | Type | Description |
+| ------------|------|-------------|
+| `changelog_view` | string | The name of the changelog view |

Review Comment:
   name of the created changelog view





[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #7147: Spark 3.3: Add doc for the changelog view procedure.

Posted by "RussellSpitzer (via GitHub)" <gi...@apache.org>.
RussellSpitzer commented on code in PR #7147:
URL: https://github.com/apache/iceberg/pull/7147#discussion_r1143665517


##########
docs/spark-procedures.md:
##########
@@ -587,3 +587,75 @@ Get all the snapshot ancestors by a particular snapshot
 CALL spark_catalog.system.ancestors_of('db.tbl', 1)
 CALL spark_catalog.system.ancestors_of(snapshot_id => 1, table => 'db.tbl')
 ```
+
+## Change Data Capture 
+
+### `create_changelog_view`
+
+Creates a view that contains the changes from a given table. 
+
+#### Usage
+
+| Argument Name | Required? | Type | Description                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |
+|---------------|----------|------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| `table`       | ✔️ | string | Name of the table to create changelog view |

Review Comment:
   The source table for the change log? 


