Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/01/11 18:35:11 UTC

[GitHub] [iceberg] rdblue commented on a change in pull request #2055: Spec: add sort order to spec

rdblue commented on a change in pull request #2055:
URL: https://github.com/apache/iceberg/pull/2055#discussion_r555257207



##########
File path: site/docs/spec.md
##########
@@ -254,6 +254,24 @@ Notes:
 2. The width, `W`, used to truncate decimal values is applied using the scale of the decimal column to avoid additional (and potentially conflicting) parameters.
 
 
+### Sorting
+
+Users could sort their data within partitions by columns to gain performance. The information on how the data is sorted could be declared per data or delete file, by a **sort order**.
+
+A sort order is defined by a sort order id and a list of sort fields. The order of the sort fields within the list defines the order in which the sort is applied to the data. Each sort field consists of:
+
+*   A **source column id** from the table's schema
+*   A **transform** that is used to produce values to be sorted on from the source column. This is the same transform as described in [partition transforms](#partition-transforms).
+*   A **sort direction**, that can only be either `asc` or `desc`
+*   A **null order** that describes the order of null values when sorted. Can only be either `nulls-first` or `nulls-last`
+
+Order id `0` is reserved for the unsorted order. 
+
+A data or delete file is associated with a sort order by the sort order's id within [a manifest](#manifests). Therefore, the table must declare all the sort orders for lookup. A table could also be configured with a default sort order id, indicating how the new data should be sorted by default. This default could be overridden per file if the file is sorted differently. For example, if the engine is incapable of ensuring the ordering of the data on write, the generated files should be annotated with sort order id 0 (unsorted).

Review comment:
       I think that it is a little unclear to say the "default could be overridden per file" because that makes it sound as though the default applies to files that do not have a sort order. The spec should specify in the manifests section that if the order ID is not present or unknown, then the order is assumed to be unsorted.
   
   Here, I think you want to clarify what the default sort order is used for: writers _should_ use the default order, but are not required to if the default order is prohibitively expensive, as it would be for streaming writes.
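
   The two rules suggested above could be sketched roughly as follows. This is only an illustration in Python, not Iceberg code: `SortField`, `SortOrder`, `resolve_file_order`, and `order_id_for_write` are hypothetical names, and the `can_sort` flag is an assumed stand-in for "the default order is not prohibitively expensive for this writer".

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

# Order id 0 is reserved for the unsorted order (from the proposed spec text).
UNSORTED_ORDER_ID = 0


@dataclass(frozen=True)
class SortField:
    source_id: int    # source column id from the table's schema
    transform: str    # e.g. "identity"; same transforms as partitioning
    direction: str    # "asc" or "desc"
    null_order: str   # "nulls-first" or "nulls-last"


@dataclass(frozen=True)
class SortOrder:
    order_id: int
    fields: Tuple[SortField, ...] = ()


UNSORTED = SortOrder(UNSORTED_ORDER_ID)


def resolve_file_order(file_order_id: Optional[int],
                       table_orders: Dict[int, SortOrder]) -> SortOrder:
    """Reader side: a missing or unknown order id is treated as unsorted.

    The table's default order is deliberately NOT consulted here.
    """
    if file_order_id is None:
        return UNSORTED
    return table_orders.get(file_order_id, UNSORTED)


def order_id_for_write(default_order_id: int, can_sort: bool) -> int:
    """Writer side: use the default order when feasible, else mark unsorted."""
    return default_order_id if can_sort else UNSORTED_ORDER_ID


# Hypothetical table with one declared order (id 1) that is also the default.
orders = {1: SortOrder(1, (SortField(3, "identity", "asc", "nulls-first"),))}
batch_id = order_id_for_write(default_order_id=1, can_sort=True)
streaming_id = order_id_for_write(default_order_id=1, can_sort=False)
```

   The point of keeping the reader and writer sides separate is that the default order only guides what writers attempt; it never changes how an existing file without a known order id is interpreted.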



