Posted to commits@iceberg.apache.org by sz...@apache.org on 2022/09/27 17:11:27 UTC

[iceberg] branch master updated: Docs: Make it clear metadata tables support time travel in Spark (#4709)

This is an automated email from the ASF dual-hosted git repository.

szehon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/iceberg.git


The following commit(s) were added to refs/heads/master by this push:
     new d11d53ac50 Docs: Make it clear metadata tables support time travel in Spark (#4709)
d11d53ac50 is described below

commit d11d53ac5053b60237b1d55c6ee82f5b17f25837
Author: Mingliang Liu <li...@apache.org>
AuthorDate: Tue Sep 27 10:11:19 2022 -0700

    Docs: Make it clear metadata tables support time travel in Spark (#4709)
---
 docs/spark-queries.md | 53 ++++++++++++++++++++++++++++++++++-----------------
 format/spec.md        |  4 ++--
 2 files changed, 38 insertions(+), 19 deletions(-)

diff --git a/docs/spark-queries.md b/docs/spark-queries.md
index ab86996a09..acdd538209 100644
--- a/docs/spark-queries.md
+++ b/docs/spark-queries.md
@@ -49,15 +49,15 @@ Iceberg uses Apache Spark's DataSourceV2 API for data source and catalog impleme
 In Spark 3, tables use identifiers that include a [catalog name](../spark-configuration#using-catalogs).
 
 ```sql
-SELECT * FROM prod.db.table -- catalog: prod, namespace: db, table: table
+SELECT * FROM prod.db.table; -- catalog: prod, namespace: db, table: table
 ```
 
 Metadata tables, like `history` and `snapshots`, can use the Iceberg table name as a namespace.
 
 For example, to read from the `files` metadata table for `prod.db.table`:
 
-```
-SELECT * FROM prod.db.table.files
+```sql
+SELECT * FROM prod.db.table.files;
 ```
 |content|file_path                                                                                                                                   |file_format|spec_id|partition|record_count|file_size_in_bytes|column_sizes      |value_counts    |null_value_counts|nan_value_counts|lower_bounds           |upper_bounds           |key_metadata|split_offsets|equality_ids|sort_order_id|
 | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
@@ -107,14 +107,14 @@ SELECT * FROM prod.db.table VERSION AS OF 10963874102873;
 
 In addition, `FOR SYSTEM_TIME AS OF` and `FOR SYSTEM_VERSION AS OF` clauses are also supported:
 
-```
+```sql
 SELECT * FROM prod.db.table FOR SYSTEM_TIME AS OF '1986-10-26 01:21:00';
 SELECT * FROM prod.db.table FOR SYSTEM_VERSION AS OF 10963874102873;
 ```
 
 Timestamps may also be supplied as Unix timestamps, in seconds:
 
-```
+```sql
 -- timestamp in seconds
 SELECT * FROM prod.db.table TIMESTAMP AS OF 499162860;
 SELECT * FROM prod.db.table FOR SYSTEM_TIME AS OF 499162860;
@@ -211,7 +211,7 @@ For Spark 3, prior to 3.2, the Spark [session catalog](../spark-configuration#re
 To show table history:
 
 ```sql
-SELECT * FROM prod.db.table.history
+SELECT * FROM prod.db.table.history;
 ```
 
 | made_current_at | snapshot_id  | parent_id | is_current_ancestor |
@@ -232,7 +232,7 @@ SELECT * FROM prod.db.table.history
 To show table metadata log entries:
 
 ```sql
-SELECT * from prod.db.table.metadata_log_entries
+SELECT * from prod.db.table.metadata_log_entries;
 ```
 
 | timestamp | file | latest_snapshot_id | latest_schema_id | latest_sequence_number |
@@ -246,7 +246,7 @@ SELECT * from prod.db.table.metadata_log_entries
 To show the valid snapshots for a table:
 
 ```sql
-SELECT * FROM prod.db.table.snapshots
+SELECT * FROM prod.db.table.snapshots;
 ```
 
 | committed_at | snapshot_id | parent_id | operation | manifest_list | summary |
@@ -280,7 +280,7 @@ order by made_current_at
 To show a table's current data files:
 
 ```sql
-SELECT * FROM prod.db.table.files
+SELECT * FROM prod.db.table.files;
 ```
 
 |content|file_path                                                                                                                                   |file_format|spec_id|partition|record_count|file_size_in_bytes|column_sizes      |value_counts    |null_value_counts|nan_value_counts|lower_bounds           |upper_bounds           |key_metadata|split_offsets|equality_ids|sort_order_id|
@@ -294,7 +294,7 @@ SELECT * FROM prod.db.table.files
 To show a table's current file manifests:
 
 ```sql
-SELECT * FROM prod.db.table.manifests
+SELECT * FROM prod.db.table.manifests;
 ```
 
 | path | length | partition_spec_id | added_snapshot_id | added_data_files_count | existing_data_files_count | deleted_data_files_count | partition_summaries |
@@ -315,7 +315,7 @@ Note:
 To show a table's current partitions:
 
 ```sql
-SELECT * FROM prod.db.table.partitions
+SELECT * FROM prod.db.table.partitions;
 ```
 
 | partition | record_count | file_count | spec_id |
@@ -341,7 +341,7 @@ The "all" metadata tables may produce more than one row per data file or manifes
 To show all of the table's data files and each file's metadata:
 
 ```sql
-SELECT * FROM prod.db.table.all_data_files
+SELECT * FROM prod.db.table.all_data_files;
 ```
 
 | content | file_path | file_format | partition | record_count | file_size_in_bytes | column_sizes| value_counts | null_value_counts | nan_value_counts| lower_bounds| upper_bounds|key_metadata|split_offsets|equality_ids|sort_order_id|
@@ -355,7 +355,7 @@ SELECT * FROM prod.db.table.all_data_files
 To show all of the table's manifest files:
 
 ```sql
-SELECT * FROM prod.db.table.all_manifests
+SELECT * FROM prod.db.table.all_manifests;
 ```
 
 | path | length | partition_spec_id | added_snapshot_id | added_data_files_count | existing_data_files_count | deleted_data_files_count| partition_summaries|
@@ -376,7 +376,7 @@ Note:
 To show a table's known snapshot references:
 
 ```sql
-SELECT * FROM prod.db.table.refs
+SELECT * FROM prod.db.table.refs;
 ```
 
 | name | type | snapshot_id | max_reference_age_in_ms | min_snapshots_to_keep | max_snapshot_age_in_ms | 
@@ -384,13 +384,32 @@ SELECT * FROM prod.db.table.refs
 | main | BRANCH | 4686954189838128572 | 10 | 20 | 30 |
 | testTag | TAG | 4686954189838128572 | 10 | null | null |
 
-## Inspecting with DataFrames
+### Inspecting with DataFrames
 
 Metadata tables can be loaded in Spark 2.4 or Spark 3 using the DataFrameReader API:
 
 ```scala
 // named metastore table
-spark.read.format("iceberg").load("db.table.files").show(truncate = false)
+spark.read.format("iceberg").load("db.table.files")
 // Hadoop path table
-spark.read.format("iceberg").load("hdfs://nn:8020/path/to/table#files").show(truncate = false)
+spark.read.format("iceberg").load("hdfs://nn:8020/path/to/table#files")
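+
+// for readable output in the shell (e.g. to avoid truncating long file paths),
+// chain .show(truncate = false) onto either read:
+spark.read.format("iceberg").load("db.table.files").show(truncate = false)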
+```
+
+### Time Travel with Metadata Tables
+
+To inspect a table's metadata with the time travel feature:
+
+```sql
+-- get the table's file manifests at timestamp Sep 20, 2021 08:00:00
+SELECT * FROM prod.db.table.manifests TIMESTAMP AS OF '2021-09-20 08:00:00';
+
+-- get the table's partitions with snapshot id 10963874102873
+SELECT * FROM prod.db.table.partitions VERSION AS OF 10963874102873;
+```
+
+Metadata tables can also be inspected with time travel using the DataFrameReader API:
+
+```scala
+// load the table's file metadata at snapshot-id 10963874102873 as DataFrame
+spark.read.format("iceberg").option("snapshot-id", 10963874102873L).load("db.table.files")
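+
+// similarly, the as-of-timestamp option (milliseconds since the epoch) should
+// load the files metadata as of that time, assuming the table had a snapshot then
+spark.read.format("iceberg").option("as-of-timestamp", "499162860000").load("db.table.files")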
 ```
diff --git a/format/spec.md b/format/spec.md
index 2ce1e6c9b5..7301776feb 100644
--- a/format/spec.md
+++ b/format/spec.md
@@ -459,9 +459,9 @@ The column metrics maps are used when filtering to select both data and delete f
 
 The manifest entry fields are used to keep track of the snapshot in which files were added or logically deleted. The `data_file` struct is nested inside of the manifest entry so that it can be easily passed to job planning without the manifest entry fields.
 
-When a file is added to the dataset, it’s manifest entry should store the snapshot ID in which the file was added and set status to 1 (added).
+When a file is added to the dataset, its manifest entry should store the snapshot ID in which the file was added and set status to 1 (added).
 
-When a file is replaced or deleted from the dataset, it’s manifest entry fields store the snapshot ID in which the file was deleted and status 2 (deleted). The file may be deleted from the file system when the snapshot in which it was deleted is garbage collected, assuming that older snapshots have also been garbage collected [1].
+When a file is replaced or deleted from the dataset, its manifest entry fields store the snapshot ID in which the file was deleted and status 2 (deleted). The file may be deleted from the file system when the snapshot in which it was deleted is garbage collected, assuming that older snapshots have also been garbage collected [1].
 
 Iceberg v2 adds a sequence number to the entry and makes the snapshot id optional. Both fields, `sequence_number` and `snapshot_id`, are inherited from manifest metadata when `null`. That is, if the field is `null` for an entry, then the entry must inherit its value from the manifest file's metadata, stored in the manifest list [2].
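
To illustrate the inheritance rule, here is a minimal sketch in Scala (hypothetical types, not Iceberg's internal classes): a `null` `sequence_number` or `snapshot_id` on an entry falls back to the value recorded for the manifest in the manifest list.

```scala
// Hypothetical sketch of v2 manifest-entry field inheritance.
case class ManifestMeta(sequenceNumber: Long, snapshotId: Long)
case class ManifestEntry(
    status: Int,                  // 1 = added, 2 = deleted
    snapshotId: Option[Long],     // None models a null field in the entry
    sequenceNumber: Option[Long])

def inherited(entry: ManifestEntry, manifest: ManifestMeta): ManifestEntry =
  entry.copy(
    snapshotId = entry.snapshotId.orElse(Some(manifest.snapshotId)),
    sequenceNumber = entry.sequenceNumber.orElse(Some(manifest.sequenceNumber)))
```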