Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/05/05 23:09:59 UTC

[GitHub] [iceberg] liuml07 opened a new pull request, #4709: Spark: Document that metadata tables support time travel

liuml07 opened a new pull request, #4709:
URL: https://github.com/apache/iceberg/pull/4709

   I think the current spark-queries doc is unclear about whether metadata tables can be inspected with the time travel feature. This PR documents that with a sample query.
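   A minimal sketch of the kind of query this adds to the doc, assuming the placeholder table name `db.table` and snapshot ID used elsewhere in this thread; it loads the `files` metadata table as of a given snapshot through the DataFrameReader `snapshot-id` option:

   ```scala
   // Hypothetical example: read the files metadata table as of an older snapshot.
   // `db.table` and the snapshot ID are placeholders, not a real table.
   val filesAtSnapshot = spark.read
     .format("iceberg")
     .option("snapshot-id", 10963874102873L)
     .load("db.table.files")
   ```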


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on pull request #4709: Docs: Make it clear metadata tables support time travel in Spark

Posted by GitBox <gi...@apache.org>.
szehon-ho commented on PR #4709:
URL: https://github.com/apache/iceberg/pull/4709#issuecomment-1259807136

   Thanks @liuml07 for the change and patience!




[GitHub] [iceberg] szehon-ho commented on a diff in pull request #4709: Docs: Make it clear metadata tables support time travel in Spark

Posted by GitBox <gi...@apache.org>.
szehon-ho commented on code in PR #4709:
URL: https://github.com/apache/iceberg/pull/4709#discussion_r979317742


##########
docs/spark-queries.md:
##########
@@ -394,3 +394,22 @@ spark.read.format("iceberg").load("db.table.files").show(truncate = false)
 // Hadoop path table
 spark.read.format("iceberg").load("hdfs://nn:8020/path/to/table#files").show(truncate = false)

Review Comment:
   Sorry, it's unrelated, but could you also remove the `show(truncate = false)` here? I should have added the comment in the previous PR. This doc should just explain how to load the dataframe, not necessarily how to show it.
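   A minimal sketch of what the snippet could look like once `show()` is dropped, keeping only the load; the table name and HDFS path are the placeholders already used in the doc:

   ```scala
   // Load the files metadata table as a DataFrame; displaying it is left to the reader.
   val catalogFiles = spark.read.format("iceberg").load("db.table.files")
   // Hadoop path table
   val pathFiles = spark.read.format("iceberg").load("hdfs://nn:8020/path/to/table#files")
   ```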





[GitHub] [iceberg] hililiwei commented on a diff in pull request #4709: Docs: Make it clear metadata tables support time travel in Spark

Posted by GitBox <gi...@apache.org>.
hililiwei commented on code in PR #4709:
URL: https://github.com/apache/iceberg/pull/4709#discussion_r867318606


##########
docs/spark/spark-queries.md:
##########
@@ -332,4 +332,15 @@ Metadata tables can be loaded in Spark 2.4 or Spark 3 using the DataFrameReader
 spark.read.format("iceberg").load("db.table.files").show(truncate = false)
 // Hadoop path table
 spark.read.format("iceberg").load("hdfs://nn:8020/path/to/table#files").show(truncate = false)
-```
\ No newline at end of file
+```
+
+You can also inspect Iceberg metadata tables with the time travel feature:
+
+```scala
+// get table's all data files and each data file's metadata at snapshot-id 7277403863961056344
+spark.read
+        .format("iceberg")
+        .option("snapshot-id", 7277403863961056344L)
+        .load("db.table.files")
+        .show()
+```

Review Comment:
   +1 for showing it here.





[GitHub] [iceberg] liuml07 commented on pull request #4709: Docs: Make it clear metadata tables support time travel in Spark

Posted by GitBox <gi...@apache.org>.
liuml07 commented on PR #4709:
URL: https://github.com/apache/iceberg/pull/4709#issuecomment-1254487516

   I have rebased this to avoid conflicts. Could I get a review? @kbendick




[GitHub] [iceberg] liuml07 commented on a diff in pull request #4709: Docs: Make it clear metadata tables support time travel in Spark

Posted by GitBox <gi...@apache.org>.
liuml07 commented on code in PR #4709:
URL: https://github.com/apache/iceberg/pull/4709#discussion_r979349868


##########
docs/spark-queries.md:
##########
@@ -394,3 +394,22 @@ spark.read.format("iceberg").load("db.table.files").show(truncate = false)
 // Hadoop path table
 spark.read.format("iceberg").load("hdfs://nn:8020/path/to/table#files").show(truncate = false)
 ```
+
+### Time Travel with Metadata Tables
+
+To inspect a tables's metadata with the time travel feature:
+
+```sql
+-- get the table's file manifests at timestamp Sep 20, 2021 08:00:00
+SELECT * FROM prod.db.table.manifests TIMESTAMP AS OF '2021-09-20 08:00:00';

Review Comment:
   I see that some places in this doc end with `;` and others do not. I do not have a strong preference for example code, but it seems a bit clearer for a code block with multiple SQL statements (like here). Do you think it's a good idea to make all places end with `;` instead?





[GitHub] [iceberg] liuml07 commented on a diff in pull request #4709: Docs: Make it clear metadata tables support time travel in Spark

Posted by GitBox <gi...@apache.org>.
liuml07 commented on code in PR #4709:
URL: https://github.com/apache/iceberg/pull/4709#discussion_r979350065


##########
docs/spark-queries.md:
##########
@@ -394,3 +394,22 @@ spark.read.format("iceberg").load("db.table.files").show(truncate = false)
 // Hadoop path table
 spark.read.format("iceberg").load("hdfs://nn:8020/path/to/table#files").show(truncate = false)
 ```
+
+### Time Travel with Metadata Tables
+
+To inspect a tables's metadata with the time travel feature:
+
+```sql
+-- get the table's file manifests at timestamp Sep 20, 2021 08:00:00
+SELECT * FROM prod.db.table.manifests TIMESTAMP AS OF '2021-09-20 08:00:00';
+
+-- get the table's partitions with snapshot id 10963874102873L
+SELECT * FROM prod.db.table.partitions VERSION AS OF 10963874102873;
+```
+
+Metadata tables can also be inspected with time travel using the DataFrameReader API:
+
+```scala
+// get table's data files and each data file's metadata at snapshot-id 10963874102873
+spark.read.format("iceberg").option("snapshot-id", 10963874102873L).load("db.table.files").show()

Review Comment:
   Yeah, this makes perfect sense: `show()` is irrelevant here, and we also do not attach result examples.





[GitHub] [iceberg] liuml07 commented on a diff in pull request #4709: Docs: Make it clear metadata tables support time travel in Spark

Posted by GitBox <gi...@apache.org>.
liuml07 commented on code in PR #4709:
URL: https://github.com/apache/iceberg/pull/4709#discussion_r867032827


##########
format/spec.md:
##########
@@ -456,9 +456,9 @@ The column metrics maps are used when filtering to select both data and delete f
 
 The manifest entry fields are used to keep track of the snapshot in which files were added or logically deleted. The `data_file` struct is nested inside of the manifest entry so that it can be easily passed to job planning without the manifest entry fields.
 
-When a file is added to the dataset, it’s manifest entry should store the snapshot ID in which the file was added and set status to 1 (added).

Review Comment:
   The character `’` appears to be a non-ASCII apostrophe. It should be `its` instead of `it's` anyway, so I fixed this as a trivial change in this pull request.





[GitHub] [iceberg] liuml07 commented on a diff in pull request #4709: Docs: Make it clear metadata tables support time travel in Spark

Posted by GitBox <gi...@apache.org>.
liuml07 commented on code in PR #4709:
URL: https://github.com/apache/iceberg/pull/4709#discussion_r980715212


##########
docs/spark-queries.md:
##########
@@ -394,3 +394,22 @@ spark.read.format("iceberg").load("db.table.files").show(truncate = false)
 // Hadoop path table
 spark.read.format("iceberg").load("hdfs://nn:8020/path/to/table#files").show(truncate = false)
 ```
+
+### Time Travel with Metadata Tables
+
+To inspect a tables's metadata with the time travel feature:
+
+```sql
+-- get the table's file manifests at timestamp Sep 20, 2021 08:00:00
+SELECT * FROM prod.db.table.manifests TIMESTAMP AS OF '2021-09-20 08:00:00';

Review Comment:
   I made all other SQL examples end with `;` to keep the style consistent and make it unambiguously clear that each statement is complete.





[GitHub] [iceberg] szehon-ho merged pull request #4709: Docs: Make it clear metadata tables support time travel in Spark

Posted by GitBox <gi...@apache.org>.
szehon-ho merged PR #4709:
URL: https://github.com/apache/iceberg/pull/4709




[GitHub] [iceberg] szehon-ho commented on a diff in pull request #4709: Docs: Make it clear metadata tables support time travel in Spark

Posted by GitBox <gi...@apache.org>.
szehon-ho commented on code in PR #4709:
URL: https://github.com/apache/iceberg/pull/4709#discussion_r978107679


##########
docs/spark-queries.md:
##########
@@ -394,3 +394,14 @@ spark.read.format("iceberg").load("db.table.files").show(truncate = false)
 // Hadoop path table
 spark.read.format("iceberg").load("hdfs://nn:8020/path/to/table#files").show(truncate = false)
 ```
+
+You can also inspect Iceberg metadata tables with the time travel feature:

Review Comment:
   1. I think we tend to avoid the second person 'you' in the docs and just state that something can be done.
   
   2. Also, what do you think about making a separate section (Time Travel with Metadata Tables) and showing both the DataFrame and the SQL way, as sketched below? Otherwise people may assume it's only possible via DataFrame.
   
   3. Also, this is unrelated, but I realized that this section (Inspecting with DataFrames) has the wrong heading level; it should be under the level of (Inspecting Tables), as that level is about all metadata tables. Can you help fix it as well?
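   A rough sketch, not part of the PR, of how such a combined section could cover both paths; the table names, timestamp, and snapshot ID are the placeholders used elsewhere in this thread, and the SQL variant is run through `spark.sql` so the whole sketch stays in Scala:

   ```scala
   // SQL path: time travel on metadata tables with TIMESTAMP AS OF / VERSION AS OF.
   val manifestsAtTime = spark.sql(
     "SELECT * FROM prod.db.table.manifests TIMESTAMP AS OF '2021-09-20 08:00:00'")
   val partitionsAtSnapshot = spark.sql(
     "SELECT * FROM prod.db.table.partitions VERSION AS OF 10963874102873")

   // DataFrame path: pin the same snapshot through the DataFrameReader option.
   val filesAtSnapshot = spark.read
     .format("iceberg")
     .option("snapshot-id", 10963874102873L)
     .load("db.table.files")
   ```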





[GitHub] [iceberg] szehon-ho commented on a diff in pull request #4709: Docs: Make it clear metadata tables support time travel in Spark

Posted by GitBox <gi...@apache.org>.
szehon-ho commented on code in PR #4709:
URL: https://github.com/apache/iceberg/pull/4709#discussion_r979317742


##########
docs/spark-queries.md:
##########
@@ -394,3 +394,22 @@ spark.read.format("iceberg").load("db.table.files").show(truncate = false)
 // Hadoop path table
 spark.read.format("iceberg").load("hdfs://nn:8020/path/to/table#files").show(truncate = false)

Review Comment:
   Sorry, it's unrelated, but could you also remove the `show(truncate = false)` here? I should have added the comment in the previous PR. This doc should just explain how to load the dataframe, not necessarily how to show it.



##########
docs/spark-queries.md:
##########
@@ -394,3 +394,22 @@ spark.read.format("iceberg").load("db.table.files").show(truncate = false)
 // Hadoop path table
 spark.read.format("iceberg").load("hdfs://nn:8020/path/to/table#files").show(truncate = false)
 ```
+
+### Time Travel with Metadata Tables
+
+To inspect a tables's metadata with the time travel feature:
+
+```sql
+-- get the table's file manifests at timestamp Sep 20, 2021 08:00:00
+SELECT * FROM prod.db.table.manifests TIMESTAMP AS OF '2021-09-20 08:00:00';
+
+-- get the table's partitions with snapshot id 10963874102873L
+SELECT * FROM prod.db.table.partitions VERSION AS OF 10963874102873;
+```
+
+Metadata tables can also be inspected with time travel using the DataFrameReader API:
+
+```scala
+// get table's data files and each data file's metadata at snapshot-id 10963874102873

Review Comment:
   What do you think about simplifying it to `// Load the table's file metadata at snapshot-id ... as a dataframe`?
   
   (Also, the files table may contain delete files as well, hence removing 'data'.)
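   For illustration, a small sketch of how the two kinds of entries can be told apart, assuming the `content` column exposed by the files metadata table (0 for data files, non-zero for delete files in format v2):

   ```scala
   // Illustration only: the files table lists data files and, for format v2 tables,
   // delete files as well; the content column distinguishes them.
   val files = spark.read.format("iceberg").load("db.table.files")
   val dataFiles   = files.filter("content = 0")
   val deleteFiles = files.filter("content != 0")
   ```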



##########
docs/spark-queries.md:
##########
@@ -394,3 +394,22 @@ spark.read.format("iceberg").load("db.table.files").show(truncate = false)
 // Hadoop path table
 spark.read.format("iceberg").load("hdfs://nn:8020/path/to/table#files").show(truncate = false)
 ```
+
+### Time Travel with Metadata Tables
+
+To inspect a tables's metadata with the time travel feature:
+
+```sql
+-- get the table's file manifests at timestamp Sep 20, 2021 08:00:00
+SELECT * FROM prod.db.table.manifests TIMESTAMP AS OF '2021-09-20 08:00:00';

Review Comment:
   Nit: can we remove the end semicolon for consistency?



##########
docs/spark-queries.md:
##########
@@ -394,3 +394,22 @@ spark.read.format("iceberg").load("db.table.files").show(truncate = false)
 // Hadoop path table
 spark.read.format("iceberg").load("hdfs://nn:8020/path/to/table#files").show(truncate = false)
 ```
+
+### Time Travel with Metadata Tables
+
+To inspect a tables's metadata with the time travel feature:
+
+```sql
+-- get the table's file manifests at timestamp Sep 20, 2021 08:00:00
+SELECT * FROM prod.db.table.manifests TIMESTAMP AS OF '2021-09-20 08:00:00';
+
+-- get the table's partitions with snapshot id 10963874102873L
+SELECT * FROM prod.db.table.partitions VERSION AS OF 10963874102873;
+```
+
+Metadata tables can also be inspected with time travel using the DataFrameReader API:
+
+```scala
+// get table's data files and each data file's metadata at snapshot-id 10963874102873
+spark.read.format("iceberg").option("snapshot-id", 10963874102873L).load("db.table.files").show()

Review Comment:
   Can we remove show()? I know the other one had it, but I just realized it's a bit unnecessary (this doc should just explain how to load the metadata table as a dataframe, just like the previous section explains how to load the data table as a dataframe, not necessarily how to show it).





[GitHub] [iceberg] szehon-ho commented on a diff in pull request #4709: Docs: Make it clear metadata tables support time travel in Spark

Posted by GitBox <gi...@apache.org>.
szehon-ho commented on code in PR #4709:
URL: https://github.com/apache/iceberg/pull/4709#discussion_r980595774


##########
docs/spark-queries.md:
##########
@@ -394,3 +394,22 @@ spark.read.format("iceberg").load("db.table.files").show(truncate = false)
 // Hadoop path table
 spark.read.format("iceberg").load("hdfs://nn:8020/path/to/table#files").show(truncate = false)
 ```
+
+### Time Travel with Metadata Tables
+
+To inspect a tables's metadata with the time travel feature:
+
+```sql
+-- get the table's file manifests at timestamp Sep 20, 2021 08:00:00
+SELECT * FROM prod.db.table.manifests TIMESTAMP AS OF '2021-09-20 08:00:00';

Review Comment:
   You are right, I didn't notice it.





[GitHub] [iceberg] singhpk234 commented on a diff in pull request #4709: Docs: Make it clear metadata tables support time travel in Spark

Posted by GitBox <gi...@apache.org>.
singhpk234 commented on code in PR #4709:
URL: https://github.com/apache/iceberg/pull/4709#discussion_r867298757


##########
docs/spark/spark-queries.md:
##########
@@ -332,4 +332,15 @@ Metadata tables can be loaded in Spark 2.4 or Spark 3 using the DataFrameReader
 spark.read.format("iceberg").load("db.table.files").show(truncate = false)
 // Hadoop path table
 spark.read.format("iceberg").load("hdfs://nn:8020/path/to/table#files").show(truncate = false)
-```
\ No newline at end of file
+```
+
+You can also inspect Iceberg metadata tables with the time travel feature:
+
+```scala
+// get table's all data files and each data file's metadata at snapshot-id 7277403863961056344
+spark.read
+        .format("iceberg")
+        .option("snapshot-id", 7277403863961056344L)
+        .load("db.table.files")
+        .show()
+```

Review Comment:
   [question] There is a dedicated `TimeTravel` section above; should we move this there? Your thoughts?





[GitHub] [iceberg] liuml07 commented on a diff in pull request #4709: Docs: Make it clear metadata tables support time travel in Spark

Posted by GitBox <gi...@apache.org>.
liuml07 commented on code in PR #4709:
URL: https://github.com/apache/iceberg/pull/4709#discussion_r867304962


##########
docs/spark/spark-queries.md:
##########
@@ -332,4 +332,15 @@ Metadata tables can be loaded in Spark 2.4 or Spark 3 using the DataFrameReader
 spark.read.format("iceberg").load("db.table.files").show(truncate = false)
 // Hadoop path table
 spark.read.format("iceberg").load("hdfs://nn:8020/path/to/table#files").show(truncate = false)
-```
\ No newline at end of file
+```
+
+You can also inspect Iceberg metadata tables with the time travel feature:
+
+```scala
+// get table's all data files and each data file's metadata at snapshot-id 7277403863961056344
+spark.read
+        .format("iceberg")
+        .option("snapshot-id", 7277403863961056344L)
+        .load("db.table.files")
+        .show()
+```

Review Comment:
   Yes, adding it to that section also works. But my concern is that the `TimeTravel` section comes before the metadata tables section, so at that point readers may not yet understand what the table name `db.table.files` refers to.
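   For context, a small sketch of the naming convention that example relies on: a metadata table is addressed by appending its name to the data table identifier, which is why the reader needs the metadata tables section first. The three metadata tables below are the ones mentioned in this thread:

   ```scala
   // Metadata tables are selected by suffixing the data table identifier.
   val files      = spark.read.format("iceberg").load("db.table.files")       // file-level metadata
   val manifests  = spark.read.format("iceberg").load("db.table.manifests")   // manifest files
   val partitions = spark.read.format("iceberg").load("db.table.partitions")  // partition summaries
   ```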


