Posted to commits@iceberg.apache.org by bl...@apache.org on 2022/02/10 00:33:47 UTC

[iceberg-docs] branch main updated: Fix spec.md (#47)

This is an automated email from the ASF dual-hosted git repository.

blue pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/iceberg-docs.git


The following commit(s) were added to refs/heads/main by this push:
     new cb037b9  Fix spec.md (#47)
cb037b9 is described below

commit cb037b9c2908a73d2608e119e9c68761b0886d1f
Author: Prashant Singh <35...@users.noreply.github.com>
AuthorDate: Thu Feb 10 06:01:40 2022 +0530

    Fix spec.md (#47)
    
    Co-authored-by: Prashant Singh <ps...@amazon.com>
---
 landing-page/content/common/format/spec.md | 30 +++++++++++++++---------------
 1 file changed, 15 insertions(+), 15 deletions(-)

diff --git a/landing-page/content/common/format/spec.md b/landing-page/content/common/format/spec.md
index c04062f..4eeee46 100644
--- a/landing-page/content/common/format/spec.md
+++ b/landing-page/content/common/format/spec.md
@@ -34,7 +34,7 @@ The format version number is incremented when new features are added that will b
 
 Version 1 of the Iceberg spec defines how to manage large analytic tables using immutable file formats: Parquet, Avro, and ORC.
 
-All version 1 data and metadata files are valid after upgrading a table to version 2. [Appendix E](#version-2) documents how to default version 2 fields when reading version 1 metadata.
+All version 1 data and metadata files are valid after upgrading a table to version 2. [Appendix E](spec/#version-2) documents how to default version 2 fields when reading version 1 metadata.
 
 #### Version 2: Row-level Deletes
 
@@ -42,7 +42,7 @@ Version 2 of the Iceberg spec adds row-level updates and deletes for analytic ta
 
 The primary change in version 2 adds delete files to encode that rows that are deleted in existing data files. This version can be used to delete or replace individual rows in immutable data files without rewriting the files.
 
-In addition to row-level deletes, version 2 makes some requirements stricter for writers. The full set of changes are listed in [Appendix E](#version-2).
+In addition to row-level deletes, version 2 makes some requirements stricter for writers. The full set of changes are listed in [Appendix E](spec/#version-2).
 
 
 ## Goals
@@ -92,10 +92,10 @@ Row-level deletes are stored in delete files.
 
 There are two ways to encode a row-level delete:
 
-* [_Position deletes_](#position-delete-files) mark a row deleted by data file path and the row position in the data file
-* [_Equality deletes_](#equality-delete-files) mark a row deleted by one or more column values, like `id = 5`
+* [_Position deletes_](spec/#position-delete-files) mark a row deleted by data file path and the row position in the data file
+* [_Equality deletes_](spec/#equality-delete-files) mark a row deleted by one or more column values, like `id = 5`
 
-Like data files, delete files are tracked by partition. In general, a delete file must be applied to older data files with the same partition; see [Scan Planning](#scan-planning) for details. Column metrics can be used to determine whether a delete file's rows overlap the contents of a data file or a scan range.
+Like data files, delete files are tracked by partition. In general, a delete file must be applied to older data files with the same partition; see [Scan Planning](spec/#scan-planning) for details. Column metrics can be used to determine whether a delete file's rows overlap the contents of a data file or a scan range.
 
 
 #### File System Operations
@@ -236,7 +236,7 @@ Field mapping fields are constrained by the following rules:
 * Map types should contain mappings in `fields` for `key` and `value`. 
 * Struct types should contain mappings in `fields` for their child fields.
 
-For details on serialization, see [Appendix C](#name-mapping-serialization).
+For details on serialization, see [Appendix C](spec/#name-mapping-serialization).
 
 #### Identifier Field IDs
 
@@ -355,7 +355,7 @@ Users can sort their data within partitions by columns to gain performance. The
 A sort order is defined by an sort order id and a list of sort fields. The order of the sort fields within the list defines the order in which the sort is applied to the data. Each sort field consists of:
 
 *   A **source column id** from the table's schema
-*   A **transform** that is used to produce values to be sorted on from the source column. This is the same transform as described in [partition transforms](#partition-transforms).
+*   A **transform** that is used to produce values to be sorted on from the source column. This is the same transform as described in [partition transforms](spec/#partition-transforms).
 *   A **sort direction**, that can only be either `asc` or `desc`
 *   A **null order** that describes the order of null values when sorted. Can only be either `nulls-first` or `nulls-last`
 
@@ -363,12 +363,12 @@ Order id `0` is reserved for the unsorted order.
 
 Sorting floating-point numbers should produce the following behavior: `-NaN` < `-Infinity` < `-value` < `-0` < `0` < `value` < `Infinity` < `NaN`. This aligns with the implementation of Java floating-point types comparisons. 
 
-A data or delete file is associated with a sort order by the sort order's id within [a manifest](#manifests). Therefore, the table must declare all the sort orders for lookup. A table could also be configured with a default sort order id, indicating how the new data should be sorted by default. Writers should use this default sort order to sort the data on write, but are not required to if the default order is prohibitively expensive, as it would be for streaming writes.
+A data or delete file is associated with a sort order by the sort order's id within [a manifest](spec/#manifests). Therefore, the table must declare all the sort orders for lookup. A table could also be configured with a default sort order id, indicating how the new data should be sorted by default. Writers should use this default sort order to sort the data on write, but are not required to if the default order is prohibitively expensive, as it would be for streaming writes.
 
 
 ### Manifests
 
-A manifest is an immutable Avro file that lists data files or delete files, along with each file’s partition data tuple, metrics, and tracking information. One or more manifest files are used to store a [snapshot](#snapshots), which tracks all of the files in a table at some point in time. Manifests are tracked by a [manifest list](#manifest-lists) for each table snapshot.
+A manifest is an immutable Avro file that lists data files or delete files, along with each file’s partition data tuple, metrics, and tracking information. One or more manifest files are used to store a [snapshot](spec/#snapshots), which tracks all of the files in a table at some point in time. Manifests are tracked by a [manifest list](spec/#manifest-lists) for each table snapshot.
 
 A manifest is a valid Iceberg data file: files must use valid Iceberg formats, schemas, and column projection.
 
@@ -425,7 +425,7 @@ Notes:
 
 1. Single-value serialization for lower and upper bounds is detailed in Appendix D.
 2. For `float` and `double`, the value `-0.0` must precede `+0.0`, as in the IEEE 754 `totalOrder` predicate.
-3. If sort order ID is missing or unknown, then the order is assumed to be unsorted. Only data files and equality delete files should be written with a non-null order id. [Position deletes](#position-delete-files) are required to be sorted by file and position, not a table order, and should set sort order id to null. Readers must ignore sort order id for position delete files.
+3. If sort order ID is missing or unknown, then the order is assumed to be unsorted. Only data files and equality delete files should be written with a non-null order id. [Position deletes](spec/#position-delete-files) are required to be sorted by file and position, not a table order, and should set sort order id to null. Readers must ignore sort order id for position delete files.
 
 The `partition` struct stores the tuple of partition values for each file. Its type is derived from the partition fields of the partition spec used to write the manifest file. In v2, the partition struct's field ids must match the ids from the partition spec.
 
@@ -576,9 +576,9 @@ Notes:
 #### Snapshot Reference
 
 Iceberg tables keep track of branches and tags using snapshot references. 
-Tags are labels for individual snapshots. Branches are mutable named references that can be updated by committing a new snapshot as the branch's referenced snapshot using the [Commit Conflict Resolution and Retry](#commit-conflict-resolution-and-retry) procedures.
+Tags are labels for individual snapshots. Branches are mutable named references that can be updated by committing a new snapshot as the branch's referenced snapshot using the [Commit Conflict Resolution and Retry](spec/#commit-conflict-resolution-and-retry) procedures.
 
-The snapshot reference object records all the information of a reference including snapshot ID, reference type and [Snapshot Retention Policy](#snapshot-retention-policy).
+The snapshot reference object records all the information of a reference including snapshot ID, reference type and [Snapshot Retention Policy](spec/#snapshot-retention-policy).
 
 | v1         | v2         | Field name                   | Type      | Description |
 | ---------- | ---------- | ---------------------------- | --------- | ----------- |
@@ -727,9 +727,9 @@ The rows in the delete file must be sorted by `file_path` then `position` to opt
 
 Equality delete files identify deleted rows in a collection of data files by one or more column values, and may optionally contain additional columns of the deleted row.
 
-Equality delete files store any subset of a table's columns and use the table's field ids. The _delete columns_ are the columns of the delete file used to match data rows. Delete columns are identified by id in the delete file [metadata column `equality_ids`](#manifests). Float and double columns cannot be used as delete columns in equality delete files.
+Equality delete files store any subset of a table's columns and use the table's field ids. The _delete columns_ are the columns of the delete file used to match data rows. Delete columns are identified by id in the delete file [metadata column `equality_ids`](spec/#manifests). Float and double columns cannot be used as delete columns in equality delete files.
 
-A data row is deleted if its values are equal to all delete columns for any row in an equality delete file that applies to the row's data file (see [`Scan Planning`](#scan-planning)).
+A data row is deleted if its values are equal to all delete columns for any row in an equality delete file that applies to the row's data file (see [`Scan Planning`](spec/#scan-planning)).
 
 Each row of the delete file produces one equality predicate that matches any row where the delete columns are equal. Multiple columns can be thought of as an `AND` of equality predicates. A `null` value in a delete column matches a row if the row's value is `null`, equivalent to `col IS NULL`.
 
@@ -839,7 +839,7 @@ Note that the string map case is for maps where the key type is a string. Using
 
 Values should be stored in Parquet using the types and logical type annotations in the table below. Column IDs are required.
 
-Lists must use the [3-level representation](https://github.com/apache/parquet-format/blob/master/LogicalTypes#lists).
+Lists must use the [3-level representation](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists).
 
 | Type               | Parquet physical type                                              | Logical type                                | Notes                                                          |
 |--------------------|--------------------------------------------------------------------|---------------------------------------------|----------------------------------------------------------------|
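
Every hunk above except the last makes the same mechanical change: a same-page Markdown link target such as `(#manifests)` gains a `spec/` prefix. As a rough illustration only (the commit does not say how the edits were produced, and the function name `prefix_anchor_links` and its `page` parameter are hypothetical), the rewrite could be sketched like this:

```python
import re

# Matches a Markdown link target that is a bare same-page anchor,
# for example "](#manifests)". Targets that already carry a path or a
# full URL do not start with "#" and are left untouched.
ANCHOR_LINK = re.compile(r"\]\(#([^)\s]+)\)")


def prefix_anchor_links(markdown: str, page: str = "spec") -> str:
    """Rewrite "](#anchor)" to "](<page>/#anchor)" (hypothetical helper)."""
    return ANCHOR_LINK.sub(lambda m: "](" + page + "/#" + m.group(1) + ")", markdown)


before = "see [Scan Planning](#scan-planning) and [a manifest](#manifests)."
after = prefix_anchor_links(before)
```

Running the function a second time leaves the text unchanged, because the rewritten targets no longer begin with `#`. The final hunk (`LogicalTypes#lists` to `LogicalTypes.md#lists`) is a different kind of fix and is not covered by this pattern.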