Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/09/20 19:16:56 UTC

[GitHub] [iceberg] jackye1995 commented on a change in pull request #3159: Adding documentation for metadata tables

jackye1995 commented on a change in pull request #3159:
URL: https://github.com/apache/iceberg/pull/3159#discussion_r712442301



##########
File path: site/docs/metadata.md
##########
@@ -0,0 +1,134 @@
+# Metadata Tables
+
+This page describes the internal metadata tables maintained by Iceberg. Please refer to [definitions page](terms.md)

Review comment:
       nit: prefer to break lines at the end of a full sentence.

##########
File path: site/docs/metadata.md
##########
@@ -0,0 +1,134 @@
+# Metadata Tables
+
+This page describes the internal metadata tables maintained by Iceberg. Please refer to [definitions page](terms.md)

Review comment:
       nit: the definitions page

##########
File path: site/docs/metadata.md
##########
@@ -0,0 +1,134 @@
+# Metadata Tables
+
+This page describes the internal metadata tables maintained by Iceberg. Please refer to [definitions page](terms.md)
+for more information on terms and definitions and the [specifications page](spec.md) for more information on Iceberg's
+table specification. Complete metadata table schema can be found on the [Spark Queries page](spark-queries.md#metadata-table-schema). 
+
+| Name                                              | Description |
+| --------------------------------------------------| ------------|
+| [`AllDataFilesTable`](#AllDataFilesTable)         | Contains rows representing all of the data files in the table. Each row will contain metadata as well as path information stored by the Iceberg. This differs from the `DataFilesTable` because it contains all files currently referenced by any existing Snapshot from this table rather than just the current one.
+| [`AllEntriesTable`](#AllEntriesTable)             | Contains a table's manifest entries as rows, for both delete and data files. Please note that this table exposes internal details, like files that have been deleted. For a table of the live data files, please use `DataFilesTable`.
+| [`AllManifestsTable`](#AllManifestsTable)         | Contains a table's valid manifest files as rows. A valid manifest file is referenced from any snapshot currently tracked by the table. This table may contain duplicate rows. 
+| [`DataFilesTable`](#DataFilesTable)               | Contains a table's data files as rows.
+| [`HistoryTable`](#HistoryTable)                   | Contains a table's history as rows. History is based on the table's snapshot log, which logs each update to the table's current snapshot.
+| [`ManifestEntriesTable`](#ManifestEntriesTable)   | Contains a table's manifest entries as rows, for both delete and data files. Please note that this table exposes internal details, like files that have been deleted. For a table of the live data files, please use `DataFilesTable`.
+| [`ManifestsTable`](#ManifestsTable)               | Contains a table's manifest files as rows.
+| [`PartitionsTable`](#PartitionsTable)             | Contains a table's partitions as rows.
+| [`SnapshotsTable`](#SnapshotsTable)               | Contains a table's known snapshots as rows. This does not include snapshots that have been expired using [`ExpireSnapshots`](https://iceberg.apache.org/javadoc/master/org/apache/iceberg/ExpireSnapshots.html).
+
+
+## Table Schema
+
+### <a id="AllDataFilesTable"></a> 1. `AllDataFilesTable`

Review comment:
       What is the use of this HTML tag?

##########
File path: site/docs/spec.md
##########
@@ -375,7 +375,7 @@ A snapshot consists of the following fields:
 | _optional_ | _optional_ | **`parent-snapshot-id`** | The snapshot ID of the snapshot's parent. Omitted for any snapshot with no parent |
 |            | _required_ | **`sequence-number`**    | A monotonically increasing long that tracks the order of changes to a table |
 | _required_ | _required_ | **`timestamp-ms`**       | A timestamp when the snapshot was created, used for garbage collection and table inspection |
-| _optional_ | _required_ | **`manifest-list`**      | The location of a manifest list for this snapshot that tracks manifest files with additional meadata |

Review comment:
       nice catch!

##########
File path: site/docs/spark-queries.md
##########
@@ -234,6 +234,189 @@ SELECT * FROM prod.db.table.manifests
 +----------------------------------------------------------------------+--------+-------------------+---------------------+------------------------+---------------------------+--------------------------+---------------------------------+
 ```
 
+### Metadata Table Schema

Review comment:
       Not sure what others think, but I would prefer to keep the full schema in the `metadata.md` page. We could then remove this section from the Spark page and provide only a SQL example of the query syntax plus a link to that page.

##########
File path: site/docs/metadata.md
##########
@@ -0,0 +1,134 @@
+# Metadata Tables
+
+This page describes the internal metadata tables maintained by Iceberg. Please refer to [definitions page](terms.md)
+for more information on terms and definitions and the [specifications page](spec.md) for more information on Iceberg's
+table specification. Complete metadata table schema can be found on the [Spark Queries page](spark-queries.md#metadata-table-schema). 
+
+| Name                                              | Description |
+| --------------------------------------------------| ------------|
+| [`AllDataFilesTable`](#AllDataFilesTable)         | Contains rows representing all of the data files in the table. Each row will contain metadata as well as path information stored by the Iceberg. This differs from the `DataFilesTable` because it contains all files currently referenced by any existing Snapshot from this table rather than just the current one.
+| [`AllEntriesTable`](#AllEntriesTable)             | Contains a table's manifest entries as rows, for both delete and data files. Please note that this table exposes internal details, like files that have been deleted. For a table of the live data files, please use `DataFilesTable`.
+| [`AllManifestsTable`](#AllManifestsTable)         | Contains a table's valid manifest files as rows. A valid manifest file is referenced from any snapshot currently tracked by the table. This table may contain duplicate rows. 
+| [`DataFilesTable`](#DataFilesTable)               | Contains a table's data files as rows.
+| [`HistoryTable`](#HistoryTable)                   | Contains a table's history as rows. History is based on the table's snapshot log, which logs each update to the table's current snapshot.
+| [`ManifestEntriesTable`](#ManifestEntriesTable)   | Contains a table's manifest entries as rows, for both delete and data files. Please note that this table exposes internal details, like files that have been deleted. For a table of the live data files, please use `DataFilesTable`.
+| [`ManifestsTable`](#ManifestsTable)               | Contains a table's manifest files as rows.
+| [`PartitionsTable`](#PartitionsTable)             | Contains a table's partitions as rows.
+| [`SnapshotsTable`](#SnapshotsTable)               | Contains a table's known snapshots as rows. This does not include snapshots that have been expired using [`ExpireSnapshots`](https://iceberg.apache.org/javadoc/master/org/apache/iceberg/ExpireSnapshots.html).
+
+
+## Table Schema
+
+### <a id="AllDataFilesTable"></a> 1. `AllDataFilesTable`

Review comment:
       I think we should use the actual table names (files, manifests, entries, etc.) instead of the class names.
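
       To illustrate the naming point, here is a rough sketch of how the class names in this draft could map to the names users type in queries. The mapping reflects my understanding of the `MetadataTableType` names and should be checked against the code. `metadata_query` is a hypothetical helper for illustration only, not an Iceberg API.

```python
# Class name used in the draft docs -> table name used in queries.
# Assumption: these follow Iceberg's MetadataTableType names (lower-cased).
METADATA_TABLE_NAMES = {
    "AllDataFilesTable": "all_data_files",
    "AllEntriesTable": "all_entries",
    "AllManifestsTable": "all_manifests",
    "DataFilesTable": "files",
    "HistoryTable": "history",
    "ManifestEntriesTable": "entries",
    "ManifestsTable": "manifests",
    "PartitionsTable": "partitions",
    "SnapshotsTable": "snapshots",
}


def metadata_query(table: str, class_name: str) -> str:
    """Build a SELECT against a metadata table (hypothetical helper)."""
    return f"SELECT * FROM {table}.{METADATA_TABLE_NAMES[class_name]}"


# Reproduces the query already shown in spark-queries.md.
print(metadata_query("prod.db.table", "ManifestsTable"))
```

       Headings keyed on the right-hand names would match what readers already see in the Spark examples.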

##########
File path: site/docs/metadata.md
##########
@@ -0,0 +1,134 @@
+# Metadata Tables
+
+This page describes the internal metadata tables maintained by Iceberg. Please refer to [definitions page](terms.md)
+for more information on terms and definitions and the [specifications page](spec.md) for more information on Iceberg's
+table specification. Complete metadata table schema can be found on the [Spark Queries page](spark-queries.md#metadata-table-schema). 
+
+| Name                                              | Description |
+| --------------------------------------------------| ------------|
+| [`AllDataFilesTable`](#AllDataFilesTable)         | Contains rows representing all of the data files in the table. Each row will contain metadata as well as path information stored by the Iceberg. This differs from the `DataFilesTable` because it contains all files currently referenced by any existing Snapshot from this table rather than just the current one.
+| [`AllEntriesTable`](#AllEntriesTable)             | Contains a table's manifest entries as rows, for both delete and data files. Please note that this table exposes internal details, like files that have been deleted. For a table of the live data files, please use `DataFilesTable`.
+| [`AllManifestsTable`](#AllManifestsTable)         | Contains a table's valid manifest files as rows. A valid manifest file is referenced from any snapshot currently tracked by the table. This table may contain duplicate rows. 
+| [`DataFilesTable`](#DataFilesTable)               | Contains a table's data files as rows.
+| [`HistoryTable`](#HistoryTable)                   | Contains a table's history as rows. History is based on the table's snapshot log, which logs each update to the table's current snapshot.
+| [`ManifestEntriesTable`](#ManifestEntriesTable)   | Contains a table's manifest entries as rows, for both delete and data files. Please note that this table exposes internal details, like files that have been deleted. For a table of the live data files, please use `DataFilesTable`.
+| [`ManifestsTable`](#ManifestsTable)               | Contains a table's manifest files as rows.
+| [`PartitionsTable`](#PartitionsTable)             | Contains a table's partitions as rows.
+| [`SnapshotsTable`](#SnapshotsTable)               | Contains a table's known snapshots as rows. This does not include snapshots that have been expired using [`ExpireSnapshots`](https://iceberg.apache.org/javadoc/master/org/apache/iceberg/ExpireSnapshots.html).
+
+
+## Table Schema
+
+### <a id="AllDataFilesTable"></a> 1. `AllDataFilesTable`
+
+| Column name           | Required  | Data type         | Description |
+|-----------------------|-----------|-------------------|-------------|

Review comment:
       This table is missing a `Column ID` column.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org