Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2019/10/29 15:39:24 UTC

[GitHub] [incubator-iceberg] aokolnychyi opened a new pull request #589: [WIP] Extend metadata with SortOrder

URL: https://github.com/apache/incubator-iceberg/pull/589
 
 
   This PR contains **_a very early preview_** of how we can extend Iceberg metadata with `SortOrder`.
   
   The API currently looks like this:
   
   ```
   SortOrder.builderFor(schema)
       .natural("col1")
       .natural("col2", ASC, NULLS_FIRST)
       .build()
   ```
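
To make the shape of this builder concrete, here is a minimal, self-contained sketch. It is not the code in this PR: `SortOrderSketch`, `SortField`, the use of a plain list of column names in place of an Iceberg `Schema`, and the default of ascending with nulls first for the one-argument `natural` are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: names mirror the snippet above, not the actual Iceberg classes.
final class SortOrderSketch {
  enum Direction { ASC, DESC }
  enum NullOrder { NULLS_FIRST, NULLS_LAST }

  /** One sort field: a source column plus its direction and null ordering. */
  static final class SortField {
    final String column;
    final Direction direction;
    final NullOrder nullOrder;

    SortField(String column, Direction direction, NullOrder nullOrder) {
      this.column = column;
      this.direction = direction;
      this.nullOrder = nullOrder;
    }

    @Override
    public String toString() {
      return column + " " + direction + " " + nullOrder;
    }
  }

  private final List<SortField> fields;

  private SortOrderSketch(List<SortField> fields) {
    this.fields = Collections.unmodifiableList(new ArrayList<>(fields));
  }

  List<SortField> fields() {
    return fields;
  }

  // Assumption: a list of column names stands in for an Iceberg Schema.
  static Builder builderFor(List<String> schemaColumns) {
    return new Builder(schemaColumns);
  }

  static final class Builder {
    private final List<String> schemaColumns;
    private final List<SortField> fields = new ArrayList<>();

    private Builder(List<String> schemaColumns) {
      this.schemaColumns = schemaColumns;
    }

    // Assumed default (not stated in the PR): ascending, nulls first.
    Builder natural(String column) {
      return natural(column, Direction.ASC, NullOrder.NULLS_FIRST);
    }

    Builder natural(String column, Direction direction, NullOrder nullOrder) {
      if (!schemaColumns.contains(column)) {
        throw new IllegalArgumentException("Unknown column: " + column);
      }
      fields.add(new SortField(column, direction, nullOrder));
      return this;
    }

    SortOrderSketch build() {
      return new SortOrderSketch(fields);
    }
  }

  public static void main(String[] args) {
    SortOrderSketch order = SortOrderSketch.builderFor(Arrays.asList("col1", "col2"))
        .natural("col1")
        .natural("col2", Direction.ASC, NullOrder.NULLS_FIRST)
        .build();
    System.out.println(order.fields());
  }
}
```

Validating column names in the builder means an order that references a missing column fails at build time rather than at write time.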
   
   The extended metadata looks like this:
   
   ```
     "default-sort-order-id" : 0,
     "sort-orders" : [ {
       "order-id" : 0,
       "fields" : [ {
         "name" : "id",
         "direction" : "desc",
         "null-order" : "nulls_last",
         "transform" : "identity",
         "source-ids" : [ 1 ]
       }, {
         "name" : "data",
         "direction" : "asc",
         "null-order" : "nulls_first",
         "transform" : "identity",
         "source-ids" : [ 2 ]
       } ]
     } ],
   ```
   
   Later, we can have something like this:
   
   ```
     "default-sort-order-id" : 3,
     "sort-orders" : [ {
       "order-id" : 3,
       "fields" : [ {
         "name" : "zvalue",
         "direction" : "asc",
         "null-order" : "nulls_first",
         "transform" : "zorder(8, 16)",
         "source-ids" : [ 1, 4 ]
       } ]
     } ],
   ```
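
For readers unfamiliar with Z-ordering, the sketch below shows the general idea of combining two source columns into one sortable value by interleaving their bits. This PR does not define what the arguments in `zorder(8, 16)` mean, so the sketch ignores them and assumes two unsigned 16-bit inputs; `ZOrderSketch` and `interleave` are hypothetical names, not part of Iceberg.

```java
// Hypothetical sketch of bit interleaving (Morton encoding) for two columns.
final class ZOrderSketch {
  /**
   * Interleaves the low 16 bits of x and y.
   * Bit i of x lands at bit 2i of the result, bit i of y at bit 2i + 1.
   * The result is returned as a long so it is never negative.
   */
  static long interleave(int x, int y) {
    long z = 0L;
    for (int i = 0; i < 16; i++) {
      z |= (long) ((x >>> i) & 1) << (2 * i);
      z |= (long) ((y >>> i) & 1) << (2 * i + 1);
    }
    return z;
  }

  public static void main(String[] args) {
    // Rows sorted by this value are clustered on both columns at once.
    System.out.println(interleave(2, 1));
    System.out.println(interleave(3, 3));
  }
}
```

Because both columns contribute to the high-order bits of the result, rows that are close in both columns tend to end up close in the sorted output, which is why a single `zvalue` field can list two `source-ids`.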
   
   **Open Questions:**
   - Overall approach (any ideas are welcome).
   - Whether the actual logic for sort transformations should live in Iceberg or in query engines. The current implementation assumes that query engines simply report the ordering of incoming data. The alternative is to implement sort transformations in Iceberg and register them in query engines (e.g. via a function catalog), but it might be tricky for query engines to report that ordering back to Iceberg on writes. Moreover, each query engine has its own internal memory format, and we want to avoid the cost of serializing data to Java objects just to perform the sorting step.
   - Whether to keep `Direction` and `NullOrder` as enums or convert them to booleans.
   - How to avoid major changes in the Catalog API. Breaking the existing API is not a good idea, but I am not sure that adding this many methods is worth it. Maybe we can simply reduce the number of overloaded methods added by this commit.
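
On the `Direction` / `NullOrder` question, one point worth keeping in mind is that the two settings are independent: a descending field can still place nulls first. The sketch below illustrates this with standard `java.util.Comparator` helpers. It is only an illustration of the semantics, not code from this PR, and `SortFieldComparators` and `comparatorFor` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: maps a (direction, null order) pair to a Comparator.
final class SortFieldComparators {
  enum Direction { ASC, DESC }
  enum NullOrder { NULLS_FIRST, NULLS_LAST }

  static <T extends Comparable<T>> Comparator<T> comparatorFor(
      Direction direction, NullOrder nullOrder) {
    // Direction applies to non-null values only.
    Comparator<T> values;
    if (direction == Direction.ASC) {
      values = Comparator.naturalOrder();
    } else {
      values = Comparator.reverseOrder();
    }
    // Null placement is applied on top and is not flipped by the direction.
    if (nullOrder == NullOrder.NULLS_FIRST) {
      return Comparator.nullsFirst(values);
    }
    return Comparator.nullsLast(values);
  }

  public static void main(String[] args) {
    List<Integer> ids = new ArrayList<>(Arrays.asList(3, null, 1, 2));
    ids.sort(SortFieldComparators.<Integer>comparatorFor(Direction.DESC, NullOrder.NULLS_FIRST));
    System.out.println(ids); // prints [null, 3, 2, 1]
  }
}
```

Two booleans could encode the same four combinations; enums mainly keep call sites readable and match the `"direction"` and `"null-order"` strings in the metadata.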
   
   **TODOs:**
   - Handle the sort order in all `TableMetadata` operations.
   - Propagate the sort order to files and annotate each file with orderId.
   - Remove `sortColumns` from `DataFile` and replace it with `sortOrderId`.
   - Add an API for updating the sort order of a table.
   - It looks like query engines will have to report the ordering of data before writing. Then we will need to propagate this info to files.
   - Document changes to the format.
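
As a rough illustration of the `sortOrderId` items above, the sketch below shows a file carrying only an order id, which is resolved against the orders recorded in table metadata. Everything here is hypothetical (`SortOrderLookupSketch`, `DataFileSketch`, and the use of a plain string to describe an order); it only shows why an id is enough once the table keeps every order it has ever used.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: table-level registry of sort orders keyed by order id.
final class SortOrderLookupSketch {
  /** Stand-in for a data file that records only the id of its sort order. */
  static final class DataFileSketch {
    final String path;
    final int sortOrderId;

    DataFileSketch(String path, int sortOrderId) {
      this.path = path;
      this.sortOrderId = sortOrderId;
    }
  }

  // A plain description string stands in for a full sort order definition.
  private final Map<Integer, String> sortOrders = new HashMap<>();

  void addSortOrder(int orderId, String description) {
    sortOrders.put(orderId, description);
  }

  /** Resolves the order a file was written with, even if the table default has changed since. */
  String sortOrderOf(DataFileSketch file) {
    String order = sortOrders.get(file.sortOrderId);
    if (order == null) {
      throw new IllegalStateException("Unknown sort order id: " + file.sortOrderId);
    }
    return order;
  }

  public static void main(String[] args) {
    SortOrderLookupSketch table = new SortOrderLookupSketch();
    table.addSortOrder(0, "id desc nulls_last, data asc nulls_first");
    table.addSortOrder(3, "zvalue asc nulls_first");
    DataFileSketch oldFile = new DataFileSketch("old-file.parquet", 0);
    System.out.println(table.sortOrderOf(oldFile));
  }
}
```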
