You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2020/10/28 19:06:21 UTC

[GitHub] [iceberg] rdblue commented on a change in pull request #1499: Update the Iceberg spec for row-level deletes

rdblue commented on a change in pull request #1499:
URL: https://github.com/apache/iceberg/pull/1499#discussion_r513694054



##########
File path: site/docs/spec.md
##########
@@ -208,136 +256,202 @@ Notes:
 
 ### Manifests
 
-A manifest is an immutable Avro file that lists a set of data files, along with each file’s partition data tuple, metrics, and tracking information. One or more manifest files are used to store a snapshot, which tracks all of the files in a table at some point in time.
+A manifest is an immutable Avro file that lists data files or delete files, along with each file’s partition data tuple, metrics, and tracking information. One or more manifest files are used to store a [snapshot](#snapshots), which tracks all of the files in a table at some point in time. Manifests are tracked by a [manifest list](#manifest-lists) for each table snapshot.
+
+A manifest is a valid Iceberg data file: files must use valid Iceberg formats, schemas, and column projection.
+
+A manifest may store either data files or delete files, but not both because manifests that contain delete files are scanned first during job planning. Whether a manifest is a data manifest or a delete manifest is stored in manifest metadata.
 
-A manifest is a valid Iceberg data file. Files must use Iceberg schemas and column projection.
+A manifest stores files for a single partition spec. When a table’s partition spec changes, old files remain in the older manifest and newer files are written to a new manifest. This is required because a manifest file’s schema is based on its partition spec (see below). The partition spec of each manifest is also used to transform predicates on the table's data rows into predicates on partition values that are used during job planning to select files from a manifest.
 
-A manifest stores files for a single partition spec. When a table’s partition spec changes, old files remain in the older manifest and newer files are written to a new manifest. This is required because a manifest file’s schema is based on its partition spec (see below). This restriction also simplifies selecting files from a manifest because the same boolean expression can be used to select or filter all rows.
+A manifest file must store the partition spec and other metadata as properties in the Avro file's key-value metadata:
 
-The partition spec for a manifest and the current table schema must be stored in the key-value properties of the manifest file. The partition spec is stored as a JSON string under the key `partition-spec`. The table schema is stored as a JSON string under the key `schema`.
+| v1         | v2         | Key                 | Value                                                                        |
+|------------|------------|---------------------|------------------------------------------------------------------------------|
+| _required_ | _required_ | `schema`            | JSON representation of the table schema at the time the manifest was written |
+| _required_ | _required_ | `partition-spec`    | JSON fields representation of the partition spec used to write the manifest  |
+| _optional_ | _required_ | `partition-spec-id` | Id of the partition spec used to write the manifest as a string              |
+| _optional_ | _required_ | `format-version`    | Table format version number of the manifest as a string                      |
+|            | _required_ | `content`           | Type of content files tracked by the manifest: "data" or "deletes"           |
 
 The schema of a manifest file is a struct called `manifest_entry` with the following fields:
 
-| Field id, name       | Type                                                      | Description                                                     |
-|----------------------|-----------------------------------------------------------|-----------------------------------------------------------------|
-| **`0  status`**      | `int` with meaning: `0: EXISTING` `1: ADDED` `2: DELETED` | Used to track additions and deletions                           |
-| **`1  snapshot_id`** | `long`                                                    | Snapshot id where the file was added, or deleted if status is 2 |
-| **`2  data_file`**   | `data_file` `struct` (see below)                          | File path, partition tuple, metrics, ...                        |
+| v1         | v2         | Field id, name           | Type                                                      | Description                                                                           |
+| ---------- | ---------- |--------------------------|-----------------------------------------------------------|---------------------------------------------------------------------------------------|
+| _required_ | _required_ | **`0  status`**          | `int` with meaning: `0: EXISTING` `1: ADDED` `2: DELETED` | Used to track additions and deletions                                                 |
+| _required_ | _optional_ | **`1  snapshot_id`**     | `long`                                                    | Snapshot id where the file was added, or deleted if status is 2. Inherited when null. |
+|            | _optional_ | **`3  sequence_number`** | `long`                                                    | Sequence number when the file was added. Inherited when null.                         |
+| _required_ | _required_ | **`2  data_file`**       | `data_file` `struct` (see below)                          | File path, partition tuple, metrics, ...                                              |
 
 `data_file` is a struct with the following fields:
 
-| Field id, name                    | Type                                  | Description                                                                                                                                                                                          |
-|-----------------------------------|---------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| **`100  file_path`**              | `string`                              | Full URI for the file with FS scheme                                                                                                                                                                 |
-| **`101  file_format`**            | `string`                              | String file format name, avro, orc or parquet                                                                                                                                                        |
-| **`102  partition`**              | `struct<...>`                         | Partition data tuple, schema based on the partition spec                                                                                                                                             |
-| **`103  record_count`**           | `long`                                | Number of records in this file                                                                                                                                                                       |
-| **`104  file_size_in_bytes`**     | `long`                                | Total file size in bytes                                                                                                                                                                             |
-| ~~**`105 block_size_in_bytes`**~~ | `long`                                | **Deprecated. Always write a default value and do not read.**                                                                                                                                        |
-| ~~**`106  file_ordinal`**~~       | `optional int`                        | **Deprecated. Do not use.**                                                                                                                                                                          |
-| ~~**`107  sort_columns`**~~       | `optional list`                       | **Deprecated. Do not use.**                                                                                                                                                                          |
-| **`108  column_sizes`**           | `optional map`                        | Map from column id to the total size on disk of all regions that store the column. Does not include bytes necessary to read other columns, like footers. Leave null for row-oriented formats (Avro). |
-| **`109  value_counts`**           | `optional map`                        | Map from column id to number of values in the column (including null values)                                                                                                                         |
-| **`110  null_value_counts`**      | `optional map`                        | Map from column id to number of null values in the column                                                                                                                                            |
-| ~~**`111 distinct_counts`**~~     | `optional map`                        | **Deprecated. Do not use.**                                                                                                                                                                          |
-| **`125  lower_bounds`**           | `optional map<126: int, 127: binary>` | Map from column id to lower bound in the column serialized as binary [1]. Each value must be less than or equal to all values in the column for the file.                                            |
-| **`128  upper_bounds`**           | `optional map<129: int, 130: binary>` | Map from column id to upper bound in the column serialized as binary [1]. Each value must be greater than or equal to all values in the column for the file.                                         |
-| **`131  key_metadata`**           | `optional binary`                     | Implementation-specific key metadata for encryption                                                                                                                                                  |
-| **`132  split_offsets`**          | `optional list`                       | Split offsets for the data file. For example, all row group offsets in a Parquet file. Must be sorted ascending.                                                                                     |
+| v1         | v2         | Field id, name                    | Type                         | Description |
+| ---------- | ---------- |-----------------------------------|------------------------------|-------------|
+|            | _required_ | **`134  content`**                | `int` with meaning: `0: DATA`, `1: POSITION DELETES`, `2: EQUALITY DELETES` | Type of content stored by the data file: data, equality deletes, or position deletes (all v1 files are data files) |
+| _required_ | _required_ | **`100  file_path`**              | `string`                     | Full URI for the file with FS scheme |
+| _required_ | _required_ | **`101  file_format`**            | `string`                     | String file format name, avro, orc or parquet |
+| _required_ | _required_ | **`102  partition`**              | `struct<...>`                | Partition data tuple, schema based on the partition spec |
+| _required_ | _required_ | **`103  record_count`**           | `long`                       | Number of records in this file |
+| _required_ | _required_ | **`104  file_size_in_bytes`**     | `long`                       | Total file size in bytes |
+| _required_ |            | ~~**`105 block_size_in_bytes`**~~ | `long`                       | **Deprecated. Always write a default in v1. Do not write in v2.** |
+| _optional_ |            | ~~**`106  file_ordinal`**~~       | `int`                        | **Deprecated. Do not write.** |
+| _optional_ |            | ~~**`107  sort_columns`**~~       | `list<112: int>`             | **Deprecated. Do not write.** |
+| _optional_ | _optional_ | **`108  column_sizes`**           | `map<117: int, 118: long>`   | Map from column id to the total size on disk of all regions that store the column. Does not include bytes necessary to read other columns, like footers. Leave null for row-oriented formats (Avro) |
+| _optional_ | _optional_ | **`109  value_counts`**           | `map<119: int, 120: long>`   | Map from column id to number of values in the column (including null values) |
+| _optional_ | _optional_ | **`110  null_value_counts`**      | `map<121: int, 122: long>`   | Map from column id to number of null values in the column |
+| _optional_ |            | ~~**`111 distinct_counts`**~~     | `map<123: int, 124: long>`   | **Deprecated. Do not write.** |
+| _optional_ | _optional_ | **`125  lower_bounds`**           | `map<126: int, 127: binary>` | Map from column id to lower bound in the column serialized as binary [1]. Each value must be less than or equal to all values in the column for the file |
+| _optional_ | _optional_ | **`128  upper_bounds`**           | `map<129: int, 130: binary>` | Map from column id to upper bound in the column serialized as binary [1]. Each value must be greater than or equal to all values in the column for the file |
+| _optional_ | _optional_ | **`131  key_metadata`**           | `binary`                     | Implementation-specific key metadata for encryption |
+| _optional_ | _optional_ | **`132  split_offsets`**          | `list<133: long>`            | Split offsets for the data file. For example, all row group offsets in a Parquet file. Must be sorted ascending |
+|            | _optional_ | **`135  equality_ids`**           | `list<136: int>`             | Field ids used to determine row equality in equality delete files. Required when `content=2` and should be null otherwise. Fields with ids listed in this column must be present in the delete file |
 
 Notes:
 
 1. Single-value serialization for lower and upper bounds is detailed in Appendix D.
 
-The `partition` struct stores the tuple of partition values for each file. Its type is derived from the partition fields of the partition spec for the manifest file.
+The `partition` struct stores the tuple of partition values for each file. Its type is derived from the partition fields of the partition spec used to write the manifest file. In v2, the partition struct's field ids must match the ids from the partition spec.
 
-Each manifest file must store its partition spec and the current table schema in the Avro file’s key-value metadata. The partition spec is used to transform predicates on the table’s data rows into predicates on the manifest’s partition values during job planning.
+The column metrics maps are used when filtering to select both data and delete files. For delete files, the metrics must store bounds and counts for all deleted rows, or must be omitted. Storing metrics for deleted rows ensures that the values can be used during job planning to find delete files that must be merged during a scan.
 
 
 #### Manifest Entry Fields
 
 The manifest entry fields are used to keep track of the snapshot in which files were added or logically deleted. The `data_file` struct is nested inside of the manifest entry so that it can be easily passed to job planning without the manifest entry fields.
 
-When a data file is added to the dataset, it’s manifest entry should store the snapshot ID in which the file was added and set status to 1 (added).
+When a file is added to the dataset, it’s manifest entry should store the snapshot ID in which the file was added and set status to 1 (added).
 
-When a data file is replaced or deleted from the dataset, it’s manifest entry fields store the snapshot ID in which the file was deleted and status 2 (deleted). The file may be deleted from the file system when the snapshot in which it was deleted is garbage collected, assuming that older snapshots have also been garbage collected [1].
+When a file is replaced or deleted from the dataset, it’s manifest entry fields store the snapshot ID in which the file was deleted and status 2 (deleted). The file may be deleted from the file system when the snapshot in which it was deleted is garbage collected, assuming that older snapshots have also been garbage collected [1].
+
+Iceberg v2 adds a sequence number to the entry and makes the snapshot id optional. Both fields, `sequence_number` and `snapshot_id`, are inherited from manifest metadata when `null`. That is, if the field is `null` for an entry, then the entry must inherit its value from the manifest file's metadata, stored in the manifest list [2].
 
 Notes:
 
 1. Technically, data files can be deleted when the last snapshot that contains the file as “live” data is garbage collected. But this is harder to detect and requires finding the diff of multiple snapshots. It is easier to track what files are deleted in a snapshot and delete them when that snapshot expires.
+2. Manifest list files are required in v2, so that the `sequence_number` and `snapshot_id` to inherit are always available.
+
+#### Sequence Number Inheritance
+
+Manifests track the sequence number when a data or delete file was added to the table.
+
+When adding new file, its sequence number is set to `null` because the snapshot's sequence number is not assigned until the snapshot is successfully committed. When reading, sequence numbers are inherited by replacing `null` with the manifest's sequence number from the manifest list.
+
+When writing an existing file to a new manifest, the sequence number must be non-null and set to the sequence number that was inherited.
+
+Inheriting sequence numbers through the metadata tree allows writing a new manifest without a known sequence number, so that a manifest can be written once and reused in commit retries. To change a sequence number for a retry, only the manifest list must be rewritten.
+
+When reading v1 manifests with no sequence number column, sequence numbers for all files must default to 0.
+
 
 ### Snapshots
 
 A snapshot consists of the following fields:
 
-*   **`snapshot-id`** -- A unique long ID.
-*   **`parent-snapshot-id`** -- (Optional) The snapshot ID of the snapshot’s parent. This field is not present for snapshots that have no parent snapshot, such as snapshots created before this field was added or the first snapshot of a table.
-*   **`sequence-number`** -- A monotonically increasing long that tracks the order of snapshots in a table. (**v2 only**)
-*   **`timestamp-ms`** -- A timestamp when the snapshot was created. This is used when garbage collecting snapshots.
-*   **`manifests`** -- A list of manifest file locations. The data files in a snapshot are the union of all data files listed in these manifests. (Deprecated in favor of `manifest-list`)
-*   **`manifest-list`** -- (Optional) The location of a manifest list file for this snapshot, which contains a list of manifest files with additional metadata. If present, the manifests field must be omitted.
-*   **`summary`** -- (Optional) A summary that encodes the `operation` that produced the snapshot and other relevant information specific to that operation. This allows some operations like snapshot expiration to skip processing some snapshots. Possible values of `operation` are:
-    *   `append` -- Data files were added and no files were removed.
-    *   `replace` -- Data files were rewritten with the same data; i.e., compaction, changing the data file format, or relocating data files.
-    *   `overwrite` -- Data files were deleted and added in a logical overwrite operation.
-    *   `delete` -- Data files were removed and their contents logically deleted.
+| v1         | v2         | Field                    | Description |
+| ---------- | ---------- | ------------------------ | ----------- |
+| _required_ | _required_ | **`snapshot-id`**        | A unique long ID |
+| _optional_ | _optional_ | **`parent-snapshot-id`** | The snapshot ID of the snapshot's parent. Omitted for any snapshot with no parent |
+|            | _required_ | **`sequence-number`**    | A monotonically increasing long that tracks the order of changes to a table |
+| _required_ | _required_ | **`timestamp-ms`**       | A timestamp when the snapshot was created, used for garbage collection and table inspection |
+| _optional_ | _required_ | **`manifest-list`**      | The location of a manifest list for this snapshot that tracks manifest files with additional meadata |
+| _optional_ |            | **`manifests`**          | A list of manifest file locations. Must be omitted if `manifest-list` is present |
+| _optional_ | _required_ | **`summary`**            | A string map that summarizes the snapshot changes, including `operation` (see below) |
 
-Snapshots can be split across more than one manifest. This enables:
+The snapshot summary's `operation` field is used by some operations, like snapshot expiration, to skip processing certain snapshots. Possible `operation` values are:
+
+*   `append` -- Only data files were added and no files were removed.
+*   `replace` -- Data and delete files were added and removed without changing table data; i.e., compaction, changing the data file format, or relocating data files.
+*   `overwrite` -- Data and delete files were added and removed in a logical overwrite operation.
+*   `delete` -- Data files were removed and their contents logically deleted and/or delete files were added to delete rows.
+
+Data and delete files for a snapshot can be stored in more than one manifest. This enables:
 
 *   Appends can add a new manifest to minimize the amount of data written, instead of adding new records by rewriting and appending to an existing manifest. (This is called a “fast append”.)
 *   Tables can use multiple partition specs. A table’s partition configuration can evolve if, for example, its data volume changes. Each manifest uses a single partition spec, and queries do not need to change because partition filters are derived from data predicates.
 *   Large tables can be split across multiple manifests so that implementations can parallelize job planning or reduce the cost of rewriting a manifest.
 
+Manifests for a snapshot are tracked by a manifest list.
+
 Valid snapshots are stored as a list in table metadata. For serialization, see Appendix C.
 
 
+#### Manifest Lists
+
+Snapshots are embedded in table metadata, but the list of manifests for a snapshot are stored in a separate manifest list file.
+
+A new manifest list is written for each attempt to commit a snapshot because the list of manifests always changes to produce a new snapshot. When a manifest list is written, the (optimistic) sequence number of the snapshot is written for all new manifest files tracked by the list.
+
+A manifest list includes summary metadata that can be used to avoid scanning all of the manifests in a snapshot when planning a table scan. This includes the number of added, existing, and deleted files, and a summary of values for each field of the partition spec used to write the manifest.
+
+A manifest list is a valid Iceberg data file: files must use valid Iceberg formats, schemas, and column projection.
+
+Manifest list files store `manifest_file`, a struct with the following fields:
+
+| v1         | v2         | Field id, name                 | Type                                        | Description |
+| ---------- | ---------- |--------------------------------|---------------------------------------------|-------------|
+| _required_ | _required_ | **`500 manifest_path`**        | `string`                                    | Location of the manifest file |
+| _required_ | _required_ | **`501 manifest_length`**      | `long`                                      | Length of the manifest file |
+| _required_ | _required_ | **`502 partition_spec_id`**    | `int`                                       | ID of a partition spec for the table; must be listed in table metadata `partition-specs` |

Review comment:
       Yes, you're right. It is the spec ID used for the manifest.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org