Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/01/17 06:12:08 UTC

[GitHub] [iceberg] jackye1995 opened a new pull request #2101: Doc: add partition spec and sort order evolution doc

jackye1995 opened a new pull request #2101:
URL: https://github.com/apache/iceberg/pull/2101


   1. add Spark SQL for partition and sort order updates
   2. add a section for sort order evolution
   3. add a section for all supported partition specs


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] jackye1995 commented on a change in pull request #2101: Doc: add partition spec and sort order evolution doc

Posted by GitBox <gi...@apache.org>.

jackye1995 commented on a change in pull request #2101:
URL: https://github.com/apache/iceberg/pull/2101#discussion_r560359417



##########
File path: site/docs/spark.md
##########
@@ -258,6 +258,37 @@ ALTER TABLE prod.db.sample DROP COLUMN id
 ALTER TABLE prod.db.sample DROP COLUMN point.z
 ```
 
+### `ALTER TABLE ... ADD PARTITION FIELD`
+
+```sql
+ALTER TABLE prod.db.sample ADD PARTITION FIELD catalog

Review comment:
       added





[GitHub] [iceberg] jackye1995 commented on a change in pull request #2101: Doc: add partition spec and sort order evolution doc

Posted by GitBox <gi...@apache.org>.

jackye1995 commented on a change in pull request #2101:
URL: https://github.com/apache/iceberg/pull/2101#discussion_r560355573



##########
File path: site/docs/evolution.md
##########
@@ -62,3 +62,31 @@ When you evolve a partition spec, the old data written with an earlier spec rema
 Iceberg uses [hidden partitioning](./partitioning.md), so you don't *need* to write queries for a specific partition layout to be fast. Instead, you can write queries that select the data you need, and Iceberg automatically prunes out files that don't contain matching data.
 
 Partition evolution is a metadata operation and does not eagerly rewrite files.
+
+Iceberg's Java table API provides `updateSpec` API to update partition spec. For example:
+
+```java
+sampleTable.updateSpec()
+    .addField(bucket("id", 8))
+    .renameField("category", "category")
+    .removeField("id_bucket_8", "shard")
+    .commit();
+```

Review comment:
       Thanks for the comments; I added explanations. For the second part, I think it is already explained earlier in the docs:
   
   > When you evolve a partition spec, the old data written with an earlier spec remains unchanged. New data is written using the new spec in a new layout. Metadata for each of the partition versions is kept separately. Because of this, when you start writing queries, you get split planning. This is where each partition layout plans files separately using the filter it derives for that specific partition layout.
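
The split planning described in the quoted paragraph can be illustrated with a small, self-contained sketch. It is not Iceberg code: `SplitPlanningSketch`, `Layout`, `DataFile`, and the file names are hypothetical, and the two layouts (by month, then by day) are assumed only for illustration. Each file is checked against the range implied by the layout it was written under, while the query itself names only a date range.

```java
import java.time.LocalDate;
import java.time.YearMonth;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SplitPlanningSketch {
    // Hypothetical layouts: files written before the spec change are grouped by month,
    // files written after it are grouped by day.
    enum Layout { BY_MONTH, BY_DAY }

    static final class DataFile {
        final String path;
        final Layout layout;
        final LocalDate partitionStart; // first day covered by the file's partition

        DataFile(String path, Layout layout, LocalDate partitionStart) {
            this.path = path;
            this.layout = layout;
            this.partitionStart = partitionStart;
        }

        // Last day covered by this file's partition, derived from its own layout.
        LocalDate partitionEnd() {
            return layout == Layout.BY_MONTH
                ? YearMonth.from(partitionStart).atEndOfMonth()
                : partitionStart;
        }
    }

    // Keeps the files whose partition overlaps the inclusive date range [from, to].
    static List<String> plan(List<DataFile> files, LocalDate from, LocalDate to) {
        List<String> selected = new ArrayList<>();
        for (DataFile file : files) {
            boolean overlaps =
                !file.partitionEnd().isBefore(from) && !file.partitionStart.isAfter(to);
            if (overlaps) {
                selected.add(file.path);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<DataFile> files = Arrays.asList(
            new DataFile("old-2020-11", Layout.BY_MONTH, LocalDate.of(2020, 11, 1)),
            new DataFile("old-2020-12", Layout.BY_MONTH, LocalDate.of(2020, 12, 1)),
            new DataFile("new-2021-01-01", Layout.BY_DAY, LocalDate.of(2021, 1, 1)),
            new DataFile("new-2021-01-02", Layout.BY_DAY, LocalDate.of(2021, 1, 2)));
        // One logical filter on the date column; no partition columns appear in the query.
        System.out.println(plan(files, LocalDate.of(2020, 12, 31), LocalDate.of(2021, 1, 1)));
    }
}
```

In this sketch the monthly file for December 2020 and the daily file for 2021-01-01 are selected, and the other two files are pruned, even though the caller never mentions either layout.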
   





[GitHub] [iceberg] jackye1995 commented on a change in pull request #2101: Doc: add partition spec and sort order evolution doc

Posted by GitBox <gi...@apache.org>.

jackye1995 commented on a change in pull request #2101:
URL: https://github.com/apache/iceberg/pull/2101#discussion_r560356068



##########
File path: site/docs/evolution.md
##########
@@ -62,3 +62,31 @@ When you evolve a partition spec, the old data written with an earlier spec rema
 Iceberg uses [hidden partitioning](./partitioning.md), so you don't *need* to write queries for a specific partition layout to be fast. Instead, you can write queries that select the data you need, and Iceberg automatically prunes out files that don't contain matching data.
 
 Partition evolution is a metadata operation and does not eagerly rewrite files.
+
+Iceberg's Java table API provides `updateSpec` API to update partition spec. For example:
+
+```java
+sampleTable.updateSpec()
+    .addField(bucket("id", 8))
+    .renameField("category", "category")

Review comment:
       It was a typo when I copied from tests, sorry for the confusion.





[GitHub] [iceberg] kbendick commented on a change in pull request #2101: Doc: add partition spec and sort order evolution doc

Posted by GitBox <gi...@apache.org>.

kbendick commented on a change in pull request #2101:
URL: https://github.com/apache/iceberg/pull/2101#discussion_r559259799



##########
File path: site/docs/evolution.md
##########
@@ -62,3 +62,31 @@ When you evolve a partition spec, the old data written with an earlier spec rema
 Iceberg uses [hidden partitioning](./partitioning.md), so you don't *need* to write queries for a specific partition layout to be fast. Instead, you can write queries that select the data you need, and Iceberg automatically prunes out files that don't contain matching data.
 
 Partition evolution is a metadata operation and does not eagerly rewrite files.
+
+Iceberg's Java table API provides `updateSpec` API to update partition spec. For example:
+
+```java
+sampleTable.updateSpec()
+    .addField(bucket("id", 8))
+    .renameField("category", "category")
+    .removeField("id_bucket_8", "shard")

Review comment:
       I'm unable to find the `removeField` function with two string parameters in the `UpdatePartitionSpec` interface. Are you sure you intended to call `removeField` here and not possibly `renameField`?

##########
File path: site/docs/evolution.md
##########
@@ -62,3 +62,31 @@ When you evolve a partition spec, the old data written with an earlier spec rema
 Iceberg uses [hidden partitioning](./partitioning.md), so you don't *need* to write queries for a specific partition layout to be fast. Instead, you can write queries that select the data you need, and Iceberg automatically prunes out files that don't contain matching data.
 
 Partition evolution is a metadata operation and does not eagerly rewrite files.
+
+Iceberg's Java table API provides `updateSpec` API to update partition spec. For example:
+
+```java
+sampleTable.updateSpec()
+    .addField(bucket("id", 8))
+    .renameField("category", "category")

Review comment:
       What does this partition spec update do? Seems like a no-op for a field rename (other than possibly reassigning IDs for this column? - just a guess on that front).
   
   I can appreciate that this is allowed, but unless this call does something that I'm not aware of, I think that adding this to the documentation's example is potentially more confusing than helpful to those who are learning. Otherwise, like I mentioned elsewhere, it's probably good to write out what this `updateSpec` call does, as it's not immediately self-evident - at least to me, though that could be my own limitation and maybe it's clear to others.

##########
File path: site/docs/evolution.md
##########
@@ -62,3 +62,31 @@ When you evolve a partition spec, the old data written with an earlier spec rema
 Iceberg uses [hidden partitioning](./partitioning.md), so you don't *need* to write queries for a specific partition layout to be fast. Instead, you can write queries that select the data you need, and Iceberg automatically prunes out files that don't contain matching data.
 
 Partition evolution is a metadata operation and does not eagerly rewrite files.
+
+Iceberg's Java table API provides `updateSpec` API to update partition spec. For example:
+
+```java
+sampleTable.updateSpec()
+    .addField(bucket("id", 8))
+    .renameField("category", "category")
+    .removeField("id_bucket_8", "shard")
+    .commit();
+```

Review comment:
       Something to consider:
   
   You might consider stating what this `updateSpec` code is going to do. Something like: "For example, the following code could be used to update the partition spec to bucket the `id` column into 8 buckets...".
   
   Additionally, I think it would be helpful to indicate whether or not this changes the old data.
   
   Your added `Sort order evolution` docs say "When you evolve a sort order, the old data written with an earlier order remains unchanged." To me, that raises the question of whether updating the partition spec via `updateSpec` will rewrite old data and, if it does not, what precautions we recommend to people who might use this.

##########
File path: site/docs/evolution.md
##########
@@ -62,3 +62,31 @@ When you evolve a partition spec, the old data written with an earlier spec rema
 Iceberg uses [hidden partitioning](./partitioning.md), so you don't *need* to write queries for a specific partition layout to be fast. Instead, you can write queries that select the data you need, and Iceberg automatically prunes out files that don't contain matching data.
 
 Partition evolution is a metadata operation and does not eagerly rewrite files.
+
+Iceberg's Java table API provides `updateSpec` API to update partition spec. For example:
+
+```java
+sampleTable.updateSpec()
+    .addField(bucket("id", 8))
+    .renameField("category", "category")
+    .removeField("id_bucket_8", "shard")
+    .commit();
+```
+
+Spark supports updating partition spec through its `ALTER TABLE` SQL statement, see more details in [Spark SQL](../spark/#alter-table-add-partition-field).
+
+## Sort order evolution
+
+Similar to partition spec, Iceberg sort order can also be updated in an existing table.
+When you evolve a sort order, the old data written with an earlier order remains unchanged.
+Engines can always choose to write data in the latest sort order or unsorted when sorting is prohibitively expensive.

Review comment:
       When a table has a sort order spec, but the older data is not sorted according to the spec, can this cause queries to silently return incorrect data? Or is this not an issue, given that engines can already choose to write data sorted or not based on how expensive it's deemed to be?
   
   Possibly this is explained more fully elsewhere, but otherwise I think it would be good to clarify whether changes to the sort order can cause incorrect query results (e.g. if the query engine assumes during execution planning that data is sorted).





[GitHub] [iceberg] jackye1995 commented on a change in pull request #2101: Doc: add partition spec and sort order evolution doc

Posted by GitBox <gi...@apache.org>.

jackye1995 commented on a change in pull request #2101:
URL: https://github.com/apache/iceberg/pull/2101#discussion_r560522744



##########
File path: site/docs/spark.md
##########
@@ -258,6 +258,38 @@ ALTER TABLE prod.db.sample DROP COLUMN id
 ALTER TABLE prod.db.sample DROP COLUMN point.z
 ```
 
+### `ALTER TABLE ... ADD PARTITION FIELD`
+
+```sql
+ALTER TABLE prod.db.sample ADD PARTITION FIELD catalog -- identity transform
+ALTER TABLE prod.db.sample ADD PARTITION FIELD bucket(16, id)
+ALTER TABLE prod.db.sample ADD PARTITION FIELD truncate(data, 4)
+ALTER TABLE prod.db.sample ADD PARTITION FIELD years(ts)
+-- use optional AS keyword to specify a custom name for the partition field 
+ALTER TABLE prod.db.sample ADD PARTITION FIELD bucket(16, id) AS shard
+```
+
+### `ALTER TABLE ... DROP PARTITION FIELD`

Review comment:
       Thanks for the information; I added a warning block for this. I also added a note to the Spark feature support table at the top saying that these commands need extensions enabled.





[GitHub] [iceberg] skambha commented on a change in pull request #2101: Doc: add partition spec and sort order evolution doc

Posted by GitBox <gi...@apache.org>.

skambha commented on a change in pull request #2101:
URL: https://github.com/apache/iceberg/pull/2101#discussion_r559946466



##########
File path: site/docs/spark.md
##########
@@ -258,6 +258,37 @@ ALTER TABLE prod.db.sample DROP COLUMN id
 ALTER TABLE prod.db.sample DROP COLUMN point.z
 ```
 
+### `ALTER TABLE ... ADD PARTITION FIELD`
+
+```sql
+ALTER TABLE prod.db.sample ADD PARTITION FIELD catalog
+ALTER TABLE prod.db.sample ADD PARTITION FIELD bucket(16, id)
+ALTER TABLE prod.db.sample ADD PARTITION FIELD truncate(data, 4)
+ALTER TABLE prod.db.sample ADD PARTITION FIELD years(ts)
+-- use optional AS keyword to specify a custom name for the partition field 
+ALTER TABLE prod.db.sample ADD PARTITION FIELD bucket(16, id) AS shard
+```
+
+### `ALTER TABLE ... DROP PARTITION FIELD`
+
+```sql
+ALTER TABLE prod.db.sample DROP PARTITION FIELD catalog
+ALTER TABLE prod.db.sample DROP PARTITION FIELD bucket(16, id)
+ALTER TABLE prod.db.sample DROP PARTITION FIELD truncate(data, 4)
+ALTER TABLE prod.db.sample DROP PARTITION FIELD years(ts)

Review comment:
       It would be good to add a statement that drops the `shard` partition field from the above example as well.





[GitHub] [iceberg] rdblue merged pull request #2101: Doc: add partition spec and sort order evolution doc

Posted by GitBox <gi...@apache.org>.

rdblue merged pull request #2101:
URL: https://github.com/apache/iceberg/pull/2101


   



[GitHub] [iceberg] jackye1995 commented on a change in pull request #2101: Doc: add partition spec and sort order evolution doc

Posted by GitBox <gi...@apache.org>.

jackye1995 commented on a change in pull request #2101:
URL: https://github.com/apache/iceberg/pull/2101#discussion_r560516334



##########
File path: site/docs/evolution.md
##########
@@ -62,3 +62,40 @@ When you evolve a partition spec, the old data written with an earlier spec rema
 Iceberg uses [hidden partitioning](./partitioning.md), so you don't *need* to write queries for a specific partition layout to be fast. Instead, you can write queries that select the data you need, and Iceberg automatically prunes out files that don't contain matching data.
 
 Partition evolution is a metadata operation and does not eagerly rewrite files.
+
+Iceberg's Java table API provides `updateSpec` API to update partition spec. 
+For example, the following code could be used to update the partition spec to 
+add a new partition field that places `id` column values into 8 buckets,
+remove an existing partition field `category`, and rename a partition field `id_bucket_8` to `shard`:

Review comment:
       Sure, I will remove it





[GitHub] [iceberg] jackye1995 commented on a change in pull request #2101: Doc: add partition spec and sort order evolution doc

Posted by GitBox <gi...@apache.org>.

jackye1995 commented on a change in pull request #2101:
URL: https://github.com/apache/iceberg/pull/2101#discussion_r561113660



##########
File path: site/docs/spark.md
##########
@@ -24,7 +24,7 @@ Iceberg uses Apache Spark's DataSourceV2 API for data source and catalog impleme
 | [SQL create table](#create-table)                | ✔️        |            |                                                |
 | [SQL create table as](#create-table-as-select)   | ✔️        |            |                                                |
 | [SQL replace table as](#replace-table-as-select) | ✔️        |            |                                                |
-| [SQL alter table](#alter-table)                  | ✔️        |            |                                                |
+| [SQL alter table](#alter-table)                  | ✔️        |            |  ⚠ Updating partition field or sort order requires extensions enabled  |

Review comment:
       I am following the stored procedure doc, which says "requires extensions enabled"; maybe "requires extensions enabled to update partition field and sort order" is better.





[GitHub] [iceberg] jackye1995 commented on a change in pull request #2101: Doc: add partition spec and sort order evolution doc

Posted by GitBox <gi...@apache.org>.

jackye1995 commented on a change in pull request #2101:
URL: https://github.com/apache/iceberg/pull/2101#discussion_r561120510



##########
File path: site/docs/spark.md
##########
@@ -258,6 +258,47 @@ ALTER TABLE prod.db.sample DROP COLUMN id
 ALTER TABLE prod.db.sample DROP COLUMN point.z
 ```
 
+### `ALTER TABLE ... ADD PARTITION FIELD`
+
+```sql
+ALTER TABLE prod.db.sample ADD PARTITION FIELD catalog -- identity transform
+ALTER TABLE prod.db.sample ADD PARTITION FIELD bucket(16, id)
+ALTER TABLE prod.db.sample ADD PARTITION FIELD truncate(data, 4)
+ALTER TABLE prod.db.sample ADD PARTITION FIELD years(ts)
+-- use optional AS keyword to specify a custom name for the partition field 
+ALTER TABLE prod.db.sample ADD PARTITION FIELD bucket(16, id) AS shard
+```
+
+!!! Warning
+    Changing partitioning will change the behavior of dynamic writes, which overwrite any partition that is written to. 

Review comment:
       Yes. The term is a bit convoluted, but I don't have a better wording either. I think the example on the next line should make this sentence clear to readers.





[GitHub] [iceberg] jackye1995 commented on a change in pull request #2101: Doc: add partition spec and sort order evolution doc

Posted by GitBox <gi...@apache.org>.

jackye1995 commented on a change in pull request #2101:
URL: https://github.com/apache/iceberg/pull/2101#discussion_r560355730



##########
File path: site/docs/evolution.md
##########
@@ -62,3 +62,31 @@ When you evolve a partition spec, the old data written with an earlier spec rema
 Iceberg uses [hidden partitioning](./partitioning.md), so you don't *need* to write queries for a specific partition layout to be fast. Instead, you can write queries that select the data you need, and Iceberg automatically prunes out files that don't contain matching data.
 
 Partition evolution is a metadata operation and does not eagerly rewrite files.
+
+Iceberg's Java table API provides `updateSpec` API to update partition spec. For example:
+
+```java
+sampleTable.updateSpec()
+    .addField(bucket("id", 8))
+    .renameField("category", "category")
+    .removeField("id_bucket_8", "shard")

Review comment:
       Sorry it was a typo, fixed





[GitHub] [iceberg] rdblue commented on a change in pull request #2101: Doc: add partition spec and sort order evolution doc

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2101:
URL: https://github.com/apache/iceberg/pull/2101#discussion_r560505439



##########
File path: site/docs/evolution.md
##########
@@ -62,3 +62,40 @@ When you evolve a partition spec, the old data written with an earlier spec rema
 Iceberg uses [hidden partitioning](./partitioning.md), so you don't *need* to write queries for a specific partition layout to be fast. Instead, you can write queries that select the data you need, and Iceberg automatically prunes out files that don't contain matching data.
 
 Partition evolution is a metadata operation and does not eagerly rewrite files.
+
+Iceberg's Java table API provides `updateSpec` API to update partition spec. 
+For example, the following code could be used to update the partition spec to 
+add a new partition field that places `id` column values into 8 buckets,
+remove an existing partition field `category`, and rename a partition field `id_bucket_8` to `shard`:

Review comment:
       It looks like this is renaming a partition field that was just added because `bucket("id", 8)` will be named `id_bucket_8`. What about simplifying this by removing the rename?
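
To make the point about the redundant rename concrete, here is a minimal sketch. It deliberately does not use Iceberg's `UpdatePartitionSpec`: `SpecUpdateSketch`, `addBucket`, and `addBucketAs` are hypothetical stand-ins that track only field names, and the only naming rule assumed is the one stated above, that bucketing `id` into 8 buckets yields a field called `id_bucket_8`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a partition spec update. This is NOT Iceberg's UpdatePartitionSpec.
final class SpecUpdateSketch {
    private final List<String> fieldNames;

    SpecUpdateSketch(List<String> existingFieldNames) {
        this.fieldNames = new ArrayList<>(existingFieldNames);
    }

    // Adds a bucket field under the default name from the review: <column>_bucket_<n>.
    SpecUpdateSketch addBucket(String column, int numBuckets) {
        return addBucketAs(column + "_bucket_" + numBuckets);
    }

    // Adds a bucket field under a caller-chosen name (transform details omitted in this sketch).
    SpecUpdateSketch addBucketAs(String name) {
        if (fieldNames.contains(name)) {
            throw new IllegalArgumentException("Duplicate partition field: " + name);
        }
        fieldNames.add(name);
        return this;
    }

    // Removal takes a single field name.
    SpecUpdateSketch removeField(String name) {
        if (!fieldNames.remove(name)) {
            throw new IllegalArgumentException("No such partition field: " + name);
        }
        return this;
    }

    // Renaming takes the current name and the new name.
    SpecUpdateSketch renameField(String name, String newName) {
        int index = fieldNames.indexOf(name);
        if (index < 0) {
            throw new IllegalArgumentException("No such partition field: " + name);
        }
        fieldNames.set(index, newName);
        return this;
    }

    List<String> fieldNames() {
        return new ArrayList<>(fieldNames);
    }

    public static void main(String[] args) {
        List<String> existing = Arrays.asList("category");
        List<String> addThenRename = new SpecUpdateSketch(existing)
            .addBucket("id", 8)
            .removeField("category")
            .renameField("id_bucket_8", "shard")
            .fieldNames();
        List<String> addWithName = new SpecUpdateSketch(existing)
            .addBucketAs("shard")
            .removeField("category")
            .fieldNames();
        // Both lists contain only "shard".
        System.out.println(addThenRename.equals(addWithName));
    }
}
```

Under these assumptions, adding the field and then renaming it ends in the same state as adding it under the custom name in the first place, which is why dropping the rename keeps the documentation example simpler.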





[GitHub] [iceberg] rdblue commented on a change in pull request #2101: Doc: add partition spec and sort order evolution doc

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2101:
URL: https://github.com/apache/iceberg/pull/2101#discussion_r560506957



##########
File path: site/docs/spark.md
##########
@@ -258,6 +258,38 @@ ALTER TABLE prod.db.sample DROP COLUMN id
 ALTER TABLE prod.db.sample DROP COLUMN point.z
 ```
 
+### `ALTER TABLE ... ADD PARTITION FIELD`
+
+```sql
+ALTER TABLE prod.db.sample ADD PARTITION FIELD catalog -- identity transform
+ALTER TABLE prod.db.sample ADD PARTITION FIELD bucket(16, id)
+ALTER TABLE prod.db.sample ADD PARTITION FIELD truncate(data, 4)
+ALTER TABLE prod.db.sample ADD PARTITION FIELD years(ts)
+-- use optional AS keyword to specify a custom name for the partition field 
+ALTER TABLE prod.db.sample ADD PARTITION FIELD bucket(16, id) AS shard
+```
+
+### `ALTER TABLE ... DROP PARTITION FIELD`

Review comment:
       Both of these new commands should have a warning that changing partitioning will change the behavior of dynamic writes, which overwrite any partition that is written to. If you partition by days and move to partitioning by hours, overwrites will overwrite hourly partitions but not days anymore. Definitely something to be aware of and a good reason to use `MERGE INTO` or the new `DataFrameWriterV2` API with an explicit overwrite filter.
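
The days-versus-hours behavior can be sketched in plain Java. This is a simplified model, not Spark or Iceberg code: `DynamicOverwriteSketch` and `overwrite` are hypothetical names, rows are reduced to their timestamps, and `ChronoUnit.DAYS` or `ChronoUnit.HOURS` stands in for the partition transform. It also assumes all existing rows were written under the same layout that the overwrite uses, so it does not model how files written under an earlier spec are treated.

```java
import java.time.LocalDateTime;
import java.time.temporal.ChronoUnit;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class DynamicOverwriteSketch {
    // Replaces every partition that receives at least one incoming row; other partitions are kept.
    // `unit` stands in for the partition transform: ChronoUnit.DAYS or ChronoUnit.HOURS.
    static List<LocalDateTime> overwrite(
            List<LocalDateTime> existing, List<LocalDateTime> incoming, ChronoUnit unit) {
        Set<LocalDateTime> touchedPartitions = new HashSet<>();
        for (LocalDateTime ts : incoming) {
            touchedPartitions.add(ts.truncatedTo(unit));
        }
        List<LocalDateTime> result = new ArrayList<>();
        for (LocalDateTime ts : existing) {
            if (!touchedPartitions.contains(ts.truncatedTo(unit))) {
                result.add(ts); // this row's partition was not written to, so it survives
            }
        }
        result.addAll(incoming);
        return result;
    }

    public static void main(String[] args) {
        List<LocalDateTime> existing = Arrays.asList(
            LocalDateTime.of(2021, 1, 17, 9, 0),
            LocalDateTime.of(2021, 1, 17, 10, 0),
            LocalDateTime.of(2021, 1, 17, 11, 0));
        List<LocalDateTime> incoming = Arrays.asList(LocalDateTime.of(2021, 1, 17, 10, 30));

        // Partitioned by day: the whole day is replaced by the single incoming row.
        System.out.println(overwrite(existing, incoming, ChronoUnit.DAYS).size());
        // Partitioned by hour: only the 10:00 hour is replaced; 09:00 and 11:00 remain.
        System.out.println(overwrite(existing, incoming, ChronoUnit.HOURS).size());
    }
}
```

The same write leaves one row under the daily layout and three rows under the hourly layout, which is the change in overwrite scope that the warning is meant to flag.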





[GitHub] [iceberg] yyanyy commented on a change in pull request #2101: Doc: add partition spec and sort order evolution doc

Posted by GitBox <gi...@apache.org>.

yyanyy commented on a change in pull request #2101:
URL: https://github.com/apache/iceberg/pull/2101#discussion_r560620148



##########
File path: site/docs/spark.md
##########
@@ -258,6 +258,47 @@ ALTER TABLE prod.db.sample DROP COLUMN id
 ALTER TABLE prod.db.sample DROP COLUMN point.z
 ```
 
+### `ALTER TABLE ... ADD PARTITION FIELD`
+
+```sql
+ALTER TABLE prod.db.sample ADD PARTITION FIELD catalog -- identity transform
+ALTER TABLE prod.db.sample ADD PARTITION FIELD bucket(16, id)
+ALTER TABLE prod.db.sample ADD PARTITION FIELD truncate(data, 4)
+ALTER TABLE prod.db.sample ADD PARTITION FIELD years(ts)
+-- use optional AS keyword to specify a custom name for the partition field 
+ALTER TABLE prod.db.sample ADD PARTITION FIELD bucket(16, id) AS shard
+```
+
+!!! Warning
+    Changing partitioning will change the behavior of dynamic writes, which overwrite any partition that is written to. 

Review comment:
       Based on my understanding: "dynamic writes" -> "`INSERT OVERWRITE` with dynamic overwrites"? Also "that is written to" is a bit hard to understand, but I couldn't come up with a good suggestion on this...

##########
File path: site/docs/spark.md
##########
@@ -24,7 +24,7 @@ Iceberg uses Apache Spark's DataSourceV2 API for data source and catalog impleme
 | [SQL create table](#create-table)                | ✔️        |            |                                                |
 | [SQL create table as](#create-table-as-select)   | ✔️        |            |                                                |
 | [SQL replace table as](#replace-table-as-select) | ✔️        |            |                                                |
-| [SQL alter table](#alter-table)                  | ✔️        |            |                                                |
+| [SQL alter table](#alter-table)                  | ✔️        |            |  ⚠ Updating partition field or sort order requires extensions enabled  |

Review comment:
       Nit: either "extensions-enabled" or "extensions to be enabled"? Took me some time to parse this sentence correctly... 





[GitHub] [iceberg] skambha commented on a change in pull request #2101: Doc: add partition spec and sort order evolution doc

Posted by GitBox <gi...@apache.org>.

skambha commented on a change in pull request #2101:
URL: https://github.com/apache/iceberg/pull/2101#discussion_r559948372



##########
File path: site/docs/spark.md
##########
@@ -258,6 +258,37 @@ ALTER TABLE prod.db.sample DROP COLUMN id
 ALTER TABLE prod.db.sample DROP COLUMN point.z
 ```
 
+### `ALTER TABLE ... ADD PARTITION FIELD`
+
+```sql
+ALTER TABLE prod.db.sample ADD PARTITION FIELD catalog

Review comment:
       Should we add a line saying that this maps to the identity transform in Iceberg?





[GitHub] [iceberg] kbendick commented on pull request #2101: Doc: add partition spec and sort order evolution doc

Posted by GitBox <gi...@apache.org>.

kbendick commented on pull request #2101:
URL: https://github.com/apache/iceberg/pull/2101#issuecomment-761905492


   Thank you so much for adding to the documentation @jackye1995! I left some thoughts, but overall I think that this is a good addition.



[GitHub] [iceberg] jackye1995 commented on a change in pull request #2101: Doc: add partition spec and sort order evolution doc

Posted by GitBox <gi...@apache.org>.

jackye1995 commented on a change in pull request #2101:
URL: https://github.com/apache/iceberg/pull/2101#discussion_r560359324



##########
File path: site/docs/spark.md
##########
@@ -258,6 +258,37 @@ ALTER TABLE prod.db.sample DROP COLUMN id
 ALTER TABLE prod.db.sample DROP COLUMN point.z
 ```
 
+### `ALTER TABLE ... ADD PARTITION FIELD`
+
+```sql
+ALTER TABLE prod.db.sample ADD PARTITION FIELD catalog
+ALTER TABLE prod.db.sample ADD PARTITION FIELD bucket(16, id)
+ALTER TABLE prod.db.sample ADD PARTITION FIELD truncate(data, 4)
+ALTER TABLE prod.db.sample ADD PARTITION FIELD years(ts)
+-- use optional AS keyword to specify a custom name for the partition field 
+ALTER TABLE prod.db.sample ADD PARTITION FIELD bucket(16, id) AS shard
+```
+
+### `ALTER TABLE ... DROP PARTITION FIELD`
+
+```sql
+ALTER TABLE prod.db.sample DROP PARTITION FIELD catalog
+ALTER TABLE prod.db.sample DROP PARTITION FIELD bucket(16, id)
+ALTER TABLE prod.db.sample DROP PARTITION FIELD truncate(data, 4)
+ALTER TABLE prod.db.sample DROP PARTITION FIELD years(ts)

Review comment:
       good point, added





[GitHub] [iceberg] jackye1995 commented on a change in pull request #2101: Doc: add partition spec and sort order evolution doc

Posted by GitBox <gi...@apache.org>.

jackye1995 commented on a change in pull request #2101:
URL: https://github.com/apache/iceberg/pull/2101#discussion_r560357281



##########
File path: site/docs/evolution.md
##########
@@ -62,3 +62,31 @@ When you evolve a partition spec, the old data written with an earlier spec rema
 Iceberg uses [hidden partitioning](./partitioning.md), so you don't *need* to write queries for a specific partition layout to be fast. Instead, you can write queries that select the data you need, and Iceberg automatically prunes out files that don't contain matching data.
 
 Partition evolution is a metadata operation and does not eagerly rewrite files.
+
+Iceberg's Java table API provides `updateSpec` API to update partition spec. For example:
+
+```java
+sampleTable.updateSpec()
+    .addField(bucket("id", 8))
+    .removeField("category")
+    .renameField("id_bucket_8", "shard")
+    .commit();
+```
+
+Spark supports updating partition spec through its `ALTER TABLE` SQL statement, see more details in [Spark SQL](../spark/#alter-table-add-partition-field).
+
+## Sort order evolution
+
+Similar to partition spec, Iceberg sort order can also be updated in an existing table.
+When you evolve a sort order, the old data written with an earlier order remains unchanged.
+Engines can always choose to write data in the latest sort order or unsorted when sorting is prohibitively expensive.

Review comment:
       It won't return incorrect data: the default sort order ID is always 0, which means unsorted.
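
       To make the point above concrete, here is a minimal, self-contained sketch. It is a toy model, not Iceberg code: `SortOrderSketch`, `addOrder`, and `orderFor` are hypothetical names, and it assumes a reader treats a missing or unknown sort order ID the same as ID 0 (unsorted).

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: these names are hypothetical and are not Iceberg classes.
class SortOrderSketch {
    // ID 0 is reserved for the unsorted order.
    static final int UNSORTED_ID = 0;

    // Sort orders known to the table, keyed by ID; ID 0 is always present.
    private final Map<Integer, String> orders = new HashMap<>();

    SortOrderSketch() {
        orders.put(UNSORTED_ID, "unsorted");
    }

    // Register an evolved sort order; earlier IDs stay valid for files written before.
    void addOrder(int id, String description) {
        if (id == UNSORTED_ID) {
            throw new IllegalArgumentException("ID 0 is reserved for unsorted");
        }
        orders.put(id, description);
    }

    // Resolve the order a data file claims.
    // Assumption: a missing (null) or unknown ID falls back to unsorted.
    String orderFor(Integer fileSortOrderId) {
        if (fileSortOrderId == null) {
            return orders.get(UNSORTED_ID);
        }
        return orders.getOrDefault(fileSortOrderId, orders.get(UNSORTED_ID));
    }

    public static void main(String[] args) {
        SortOrderSketch table = new SortOrderSketch();
        table.addOrder(1, "id ASC");
        System.out.println(table.orderFor(null)); // file written without a sort order ID
        System.out.println(table.orderFor(1));    // file written with the evolved order
    }
}
```

       Under that assumption, a file that carries no sort order ID is simply read as unsorted, so a reader never relies on an ordering the file does not have.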



