You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/01/18 00:08:48 UTC
[GitHub] [iceberg] kbendick commented on a change in pull request #2101: Doc: add partition spec and sort order evolution doc

kbendick commented on a change in pull request #2101:
URL: https://github.com/apache/iceberg/pull/2101#discussion_r559259799



##########
File path: site/docs/evolution.md
##########
@@ -62,3 +62,31 @@ When you evolve a partition spec, the old data written with an earlier spec rema
 Iceberg uses [hidden partitioning](./partitioning.md), so you don't *need* to write queries for a specific partition layout to be fast. Instead, you can write queries that select the data you need, and Iceberg automatically prunes out files that don't contain matching data.
 
 Partition evolution is a metadata operation and does not eagerly rewrite files.
+
+Iceberg's Java table API provides `updateSpec` API to update partition spec. For example:
+
+```java
+sampleTable.updateSpec()
+    .addField(bucket("id", 8))
+    .renameField("category", "category")
+    .removeField("id_bucket_8", "shard")

Review comment:
       I'm unable to find the `removeField` function with two string parameters in the `UpdatePartitionSpec` interface. Are you sure you intended to call `removeField` here and not possibly `renameField`?

##########
File path: site/docs/evolution.md
##########
@@ -62,3 +62,31 @@ When you evolve a partition spec, the old data written with an earlier spec rema
 Iceberg uses [hidden partitioning](./partitioning.md), so you don't *need* to write queries for a specific partition layout to be fast. Instead, you can write queries that select the data you need, and Iceberg automatically prunes out files that don't contain matching data.
 
 Partition evolution is a metadata operation and does not eagerly rewrite files.
+
+Iceberg's Java table API provides `updateSpec` API to update partition spec. For example:
+
+```java
+sampleTable.updateSpec()
+    .addField(bucket("id", 8))
+    .renameField("category", "category")

Review comment:
       What does this partition spec update do? Seems like a no-op for a field rename (other than possibly reassigning IDs for this column? - just a guess on that front).
   
   I can appreciate that this is allowed, but unless this call does something that I'm not aware of, I think that adding this to the documentation's example is potentially more confusing than helpful to those who are learning. Otherwise, like I mentioned elsewhere, it's probably good to write out what this updateSpec call does as it's not immediately self evident - at least to me, though that could be my own limitation and maybe it's clear to others.

##########
File path: site/docs/evolution.md
##########
@@ -62,3 +62,31 @@ When you evolve a partition spec, the old data written with an earlier spec rema
 Iceberg uses [hidden partitioning](./partitioning.md), so you don't *need* to write queries for a specific partition layout to be fast. Instead, you can write queries that select the data you need, and Iceberg automatically prunes out files that don't contain matching data.
 
 Partition evolution is a metadata operation and does not eagerly rewrite files.
+
+Iceberg's Java table API provides `updateSpec` API to update partition spec. For example:
+
+```java
+sampleTable.updateSpec()
+    .addField(bucket("id", 8))
+    .renameField("category", "category")
+    .removeField("id_bucket_8", "shard")
+    .commit();
+```

Review comment:
       Something to consider:
   
   You might consider stating what this `updateSpec` code is going to do. Something like `For example, the following code could be used to update the partition spec to bucket `id` column into 8 buckets....`.
   
   Additionally, I think it would be helpful to indicate whether or not this changes the old data.
   
   Your added `Sort order evolution` docs say `When you evolve a sort order, the old data written with an earlier order remains unchanged.`, to me it begs the question of whether or not updating the partition spec via `updateSpec` will rewrite old data - and if it does not rewrite old data, what precautions do we recommend to people who might use this?

##########
File path: site/docs/evolution.md
##########
@@ -62,3 +62,31 @@ When you evolve a partition spec, the old data written with an earlier spec rema
 Iceberg uses [hidden partitioning](./partitioning.md), so you don't *need* to write queries for a specific partition layout to be fast. Instead, you can write queries that select the data you need, and Iceberg automatically prunes out files that don't contain matching data.
 
 Partition evolution is a metadata operation and does not eagerly rewrite files.
+
+Iceberg's Java table API provides `updateSpec` API to update partition spec. For example:
+
+```java
+sampleTable.updateSpec()
+    .addField(bucket("id", 8))
+    .renameField("category", "category")
+    .removeField("id_bucket_8", "shard")
+    .commit();
+```
+
+Spark supports updating partition spec through its `ALTER TABLE` SQL statement, see more details in [Spark SQL](../spark/#alter-table-add-partition-field).
+
+## Sort order evolution
+
+Similar to partition spec, Iceberg sort order can also be updated in an existing table.
+When you evolve a sort order, the old data written with an earlier order remains unchanged.
+Engines can always choose to write data in the latest sort order or unsorted when sorting is prohibitively expensive.

Review comment:
       When a table has a sort order spec, but the older data is not sorted according to the spec, can this cause queries to silently return incorrect data? Or is this not an issue given that engines can already choose to write data sorted or not based on how expensive it's deemed to be.
   
   Possibly this is more elucidated elsewhere, but otherwise I think it would be good to clarify if changes to the sort order can cause incorrect query results (e.g. if the query engine makes the assumption that data is sorted during execution planning).




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org