Posted to commits@iceberg.apache.org by bl...@apache.org on 2019/07/15 04:40:26 UTC
[incubator-iceberg] branch master updated: Add metadata table docs.
This is an automated email from the ASF dual-hosted git repository.
blue pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-iceberg.git
The following commit(s) were added to refs/heads/master by this push:
new 089343d Add metadata table docs.
089343d is described below
commit 089343d52190e90a8c2975747b375d9ede8f9de6
Author: Ryan Blue <bl...@apache.org>
AuthorDate: Sun Jul 14 21:39:56 2019 -0700
Add metadata table docs.
---
site/docs/css/extra.css | 19 ++++++++
site/docs/spark.md | 119 ++++++++++++++++++++++++++++++++++++++++++++++--
2 files changed, 134 insertions(+), 4 deletions(-)
diff --git a/site/docs/css/extra.css b/site/docs/css/extra.css
index dab91c4..b9b9f1e 100644
--- a/site/docs/css/extra.css
+++ b/site/docs/css/extra.css
@@ -47,6 +47,11 @@ h3:target .headerlink {
opacity: 1;
}
+h4 {
+ font-weight: 500;
+ font-size: 22px;
+}
+
h4:target .headerlink {
color: #008cba;
opacity: 1;
@@ -60,3 +65,17 @@ h5:target .headerlink {
code {
color: #458;
}
+
+pre {
+ width: max-content;
+ min-width: 60em;
+ margin-top: 0.5em;
+ margin-bottom: 0.5em;
+}
+
+.admonition {
+ margin: 0.5em;
+ margin-left: 0em;
+ padding: 0.5em;
+ padding-left: 1em;
+}
diff --git a/site/docs/spark.md b/site/docs/spark.md
index 5e6bd66..9489f37 100644
--- a/site/docs/spark.md
+++ b/site/docs/spark.md
@@ -4,6 +4,10 @@ Iceberg uses Spark's DataSourceV2 API for data source and catalog implementation
| Feature support | Spark 2.4 | Spark 3.0 (unreleased) | Notes |
|----------------------------------------------|-----------|------------------------|------------------------------------------------|
+| [DataFrame reads](#reading-an-iceberg-table) | ✔️ | ✔️ | |
+| [DataFrame append](#appending-data) | ✔️ | ✔️ | |
+| [DataFrame overwrite](#overwriting-data) | ✔️ | ✔️ | Overwrite mode replaces partitions dynamically |
+| [Metadata tables](#inspecting-tables) | ✔️ | ✔️ | |
| SQL create table | | ✔️ | |
| SQL alter table | | ✔️ | |
| SQL drop table | | ✔️ | |
@@ -12,9 +16,6 @@ Iceberg uses Spark's DataSourceV2 API for data source and catalog implementation
| SQL replace table as | | ✔️ | |
| SQL insert into | | ✔️ | |
| SQL insert overwrite | | ✔️ | |
-| [DataFrame reads](#reading-an-iceberg-table) | ✔️ | ✔️ | |
-| [DataFrame append](#appending-data) | ✔️ | ✔️ | |
-| [DataFrame overwrite](#overwriting-data) | ✔️ | ✔️ | Overwrite mode replaces partitions dynamically |
!!! Note
Spark 2.4 can't create Iceberg tables with DDL; instead, use the [Iceberg API](../api-quickstart).
@@ -65,7 +66,6 @@ spark.read
.load("db.table")
```
-
### Querying with SQL
To run SQL `SELECT` statements on Iceberg tables in 2.4, register the DataFrame as a temporary table:
@@ -104,3 +104,114 @@ data.write
!!! Warning
**Spark does not define the behavior of DataFrame overwrite**. Like most sources, Iceberg dynamically overwrites partitions: only the partitions for which the dataframe contains rows are replaced. Unpartitioned tables are completely overwritten.
+
+
+### Inspecting tables
+
+To inspect a table's history, snapshots, and other metadata, Iceberg provides metadata tables.
+
+Metadata tables are identified by adding the metadata table name after the original table name. For example, history for `db.table` is read using `db.table.history`.
+
+#### History
+
+To show table history, run:
+
+```scala
+spark.read.format("iceberg").load("db.table.history").show(truncate = false)
+```
+```text
++-------------------------+---------------------+---------------------+---------------------+
+| made_current_at         | snapshot_id         | parent_id           | is_current_ancestor |
++-------------------------+---------------------+---------------------+---------------------+
+| 2019-02-08 03:29:51.215 | 5781947118336215154 | NULL                | true                |
+| 2019-02-08 03:47:55.948 | 5179299526185056830 | 5781947118336215154 | true                |
+| 2019-02-09 16:24:30.13  | 296410040247533544  | 5179299526185056830 | false               |
+| 2019-02-09 16:32:47.336 | 2999875608062437330 | 5179299526185056830 | true                |
+| 2019-02-09 19:42:03.919 | 8924558786060583479 | 2999875608062437330 | true                |
+| 2019-02-09 19:49:16.343 | 6536733823181975045 | 8924558786060583479 | true                |
++-------------------------+---------------------+---------------------+---------------------+
+```
+
+!!! Note
+ **This shows a commit that was rolled back.** The example has two snapshots with the same parent, and one is *not* an ancestor of the current table state.
+
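+You can find rolled-back commits by keeping only the rows where `is_current_ancestor` is false; a minimal sketch using the history columns above:
+
+```scala
+// Sketch: snapshots that are no longer ancestors of the current table state
+spark.read.format("iceberg")
+  .load("db.table.history")
+  .filter("is_current_ancestor = false")
+  .show(truncate = false)
+```
+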
+#### Snapshots
+
+To show the valid snapshots for a table, run:
+
+```scala
+spark.read.format("iceberg").load("db.table.snapshots").show(truncate = false)
+```
+```text
++-------------------------+----------------+-----------+-----------+----------------------------------------------------+-------------------------------------------------------+
+| committed_at            | snapshot_id    | parent_id | operation | manifest_list                                      | summary                                               |
++-------------------------+----------------+-----------+-----------+----------------------------------------------------+-------------------------------------------------------+
+| 2019-02-08 03:29:51.215 | 57897183625154 | null      | append    | s3://.../table/metadata/snap-57897183625154-1.avro | { added-records -> 2478404, total-records -> 2478404, |
+|                         |                |           |           |                                                    | added-data-files -> 438, total-data-files -> 438,     |
+|                         |                |           |           |                                                    | spark.app.id -> application_1520379288616_155055 }    |
+| ...                     | ...            | ...       | ...       | ...                                                | ...                                                   |
++-------------------------+----------------+-----------+-----------+----------------------------------------------------+-------------------------------------------------------+
+```
+
+You can also join snapshots to table history. For example, this query will show table history, with the application ID that wrote each snapshot:
+
+```scala
+spark.read.format("iceberg").load("db.table.history").createOrReplaceTempView("history")
+spark.read.format("iceberg").load("db.table.snapshots").createOrReplaceTempView("snapshots")
+```
+```sql
+select
+ h.made_current_at,
+ s.operation,
+ h.snapshot_id,
+ h.is_current_ancestor,
+ s.summary['spark.app.id']
+from history h
+join snapshots s
+ on h.snapshot_id = s.snapshot_id
+order by made_current_at
+```
+```text
++-------------------------+-----------+----------------+---------------------+----------------------------------+
+| made_current_at         | operation | snapshot_id    | is_current_ancestor | summary[spark.app.id]            |
++-------------------------+-----------+----------------+---------------------+----------------------------------+
+| 2019-02-08 03:29:51.215 | append    | 57897183625154 | true                | application_1520379288616_155055 |
+| 2019-02-08 03:47:55.948 | overwrite | 51792995261850 | true                | application_1520379288616_152431 |
+| 2019-02-09 16:24:30.13  | delete    | 29641004024753 | false               | application_1520379288616_151109 |
+| 2019-02-09 16:32:47.336 | append    | 57897183625154 | true                | application_1520379288616_155055 |
++-------------------------+-----------+----------------+---------------------+----------------------------------+
+```
+
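+The same join can also be written with the DataFrame API instead of temporary views; a sketch, assuming the two metadata reads above:
+
+```scala
+import org.apache.spark.sql.functions.col
+
+val history = spark.read.format("iceberg").load("db.table.history")
+val snapshots = spark.read.format("iceberg").load("db.table.snapshots")
+
+history.as("h")
+  .join(snapshots.as("s"), col("h.snapshot_id") === col("s.snapshot_id"))
+  .select(
+    col("h.made_current_at"),
+    col("s.operation"),
+    col("h.snapshot_id"),
+    col("h.is_current_ancestor"),
+    col("s.summary").getItem("spark.app.id"))  // summary is a string-to-string map
+  .orderBy("made_current_at")
+  .show(truncate = false)
+```
+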
+#### Manifests
+
+To show a table's file manifests and each file's metadata, run:
+
+```scala
+spark.read.format("iceberg").load("db.table.manifests").show(truncate = false)
+```
+```text
++----------------------------------------------------------------------+--------+-------------------+---------------------+------------------------+---------------------------+--------------------------+---------------------------------+
+| path                                                                 | length | partition_spec_id | added_snapshot_id   | added_data_files_count | existing_data_files_count | deleted_data_files_count | partitions                      |
++----------------------------------------------------------------------+--------+-------------------+---------------------+------------------------+---------------------------+--------------------------+---------------------------------+
+| s3://.../table/metadata/45b5290b-ee61-4788-b324-b1e2735c0e10-m0.avro | 4479   | 0                 | 6668963634911763636 | 8                      | 0                         | 0                        | [[false,2019-05-13,2019-05-15]] |
++----------------------------------------------------------------------+--------+-------------------+---------------------+------------------------+---------------------------+--------------------------+---------------------------------+
+```
+
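+The counts in this table are also handy for summarizing writes; for example, a sketch that totals the data files added by each snapshot:
+
+```scala
+// Sketch: data files added per snapshot, aggregated from the manifests table
+spark.read.format("iceberg")
+  .load("db.table.manifests")
+  .groupBy("added_snapshot_id")
+  .sum("added_data_files_count")
+  .show(truncate = false)
+```
+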
+#### Files
+
+To show a table's data files and each file's metadata, run:
+
+```scala
+spark.read.format("iceberg").load("db.table.files").show(truncate = false)
+```
+```text
++-------------------------------------------------------------------------+-------------+--------------+--------------------+--------------------+------------------+-------------------+-----------------+-----------------+--------------+---------------+
+| file_path                                                               | file_format | record_count | file_size_in_bytes | column_sizes       | value_counts     | null_value_counts | lower_bounds    | upper_bounds    | key_metadata | split_offsets |
++-------------------------------------------------------------------------+-------------+--------------+--------------------+--------------------+------------------+-------------------+-----------------+-----------------+--------------+---------------+
+| s3:/.../table/data/00000-3-8d6d60e8-d427-4809-bcf0-f5d45a4aad96.parquet | PARQUET     | 1            | 597                | [1 -> 90, 2 -> 62] | [1 -> 1, 2 -> 1] | [1 -> 0, 2 -> 0]  | [1 -> , 2 -> c] | [1 -> , 2 -> c] | null         | [4]           |
+| s3:/.../table/data/00001-4-8d6d60e8-d427-4809-bcf0-f5d45a4aad96.parquet | PARQUET     | 1            | 597                | [1 -> 90, 2 -> 62] | [1 -> 1, 2 -> 1] | [1 -> 0, 2 -> 0]  | [1 -> , 2 -> b] | [1 -> , 2 -> b] | null         | [4]           |
+| s3:/.../table/data/00002-5-8d6d60e8-d427-4809-bcf0-f5d45a4aad96.parquet | PARQUET     | 1            | 597                | [1 -> 90, 2 -> 62] | [1 -> 1, 2 -> 1] | [1 -> 0, 2 -> 0]  | [1 -> , 2 -> a] | [1 -> , 2 -> a] | null         | [4]           |
++-------------------------------------------------------------------------+-------------+--------------+--------------------+--------------------+------------------+-------------------+-----------------+-----------------+--------------+---------------+
+```
+
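+Because each row includes `record_count` and `file_size_in_bytes`, the files table also gives a quick estimate of a table's size; a sketch:
+
+```scala
+import org.apache.spark.sql.functions.sum
+
+// Sketch: total rows and bytes across the table's data files
+spark.read.format("iceberg")
+  .load("db.table.files")
+  .agg(sum("record_count"), sum("file_size_in_bytes"))
+  .show()
+```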
+