Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/01/06 22:44:13 UTC

[GitHub] [iceberg] yyanyy edited a comment on pull request #1975: Core: add sort order id to content file

yyanyy edited a comment on pull request #1975:
URL: https://github.com/apache/iceberg/pull/1975#issuecomment-755053818


   > > Not sure if sort order should be nullable by default or 0 (from unsorted_order)
   > 
   > The field should be optional because v1 manifests will not have the order field. Iceberg will read the value as null, so I think it makes sense to use null. And you're right about not storing it for position deletes.
   > 
   > > Do we want only sort order id, or actual sort order struct?
   > 
   > We want the ID. Sort orders are attached to table metadata, so loading the order should be a simple hash map lookup.
   > 
   > > For the next PR, do we assume the table's current sort order id is the authoritative place to get sort order information when adding a new file?
   > 
   > No. Engines must specify which sort order was used to write a file explicitly. So this needs to be exposed in the DataFile and DeleteFile builders. By default, we should write either null or 0 (unordered). Probably null.
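The default behavior described above (no sort order recorded unless the engine sets one explicitly) could be sketched roughly like this; the class and method names here are hypothetical, not Iceberg's actual builder API:

```java
// Hypothetical sketch of a file builder that records no sort order
// unless the engine sets one explicitly, so unset files read back as null.
// This is illustrative only, not the real DataFile/DeleteFile builder.
public class FileBuilderSketch {
  // default is null: the order is unknown (e.g. v1 manifests) or unordered
  private Integer sortOrderId = null;

  FileBuilderSketch withSortOrder(int id) {
    this.sortOrderId = id;
    return this;
  }

  Integer sortOrderId() {
    return sortOrderId;
  }
}
```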
   
   Thank you for the response! 
   
   > We want the ID. Sort orders are attached to table metadata, so loading the order should be a simple hash map lookup.
   
   I guess in order to do that, we may need to add the sort order map to `FileScanTask`: readers (e.g. `RowDataReader`) rely on the task for reading rows, but they don't have the table available for a metadata lookup?
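To make the idea concrete, here is a trimmed-down stand-in for a scan task that carries the table's sort orders alongside the file's sort order id, so a reader with no `Table` handle can still resolve the order locally. The names and the use of plain strings for orders are hypothetical, not Iceberg's actual `FileScanTask` API:

```java
import java.util.Map;

// Illustrative only: a task snapshots the table's sort orders by id so
// that a reader can resolve a file's order without a Table reference.
public class TaskWithSortOrders {
  final Integer fileSortOrderId;             // may be null for v1 manifests
  final Map<Integer, String> sortOrdersById; // snapshot of table metadata

  TaskWithSortOrders(Integer fileSortOrderId, Map<Integer, String> sortOrdersById) {
    this.fileSortOrderId = fileSortOrderId;
    this.sortOrdersById = sortOrdersById;
  }

  String fileSortOrder() {
    // fall back to the unsorted order (id 0) when the file carries no id
    int id = fileSortOrderId == null ? 0 : fileSortOrderId;
    return sortOrdersById.get(id);
  }

  // convenience factory with made-up orders, used for the example below
  static TaskWithSortOrders sample(Integer id) {
    return new TaskWithSortOrders(id, Map.of(0, "unsorted", 1, "asc(ts)"));
  }
}
```

With this shape, the lookup Ryan mentions stays a simple hash map access on the task itself.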
   
   > Engines must specify which sort order was used to write a file explicitly.
   
   (Sorry for the naive question) I guess the sort order needs to be decided when building the writer (e.g. by adding a `sortOrder` parameter to the [`SparkWriter` writer factory](https://github.com/apache/iceberg/blob/master/spark3/src/main/java/org/apache/iceberg/spark/source/SparkWrite.java#L518)), but how does the engine know which sort order to use when writing files? Maybe the sort order could be optional to specify when a job is created (e.g. as part of the SQL command for ingesting data), so the engine already knows the sort order when it creates the writer, although some validation against table metadata might be needed first (e.g. check that such a sort order exists, and create one or abort if not). And if nothing is specified for the job/command, the engine would fall back to the table's default sort order and use that for creating the writer?
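A minimal sketch of the selection logic I'm describing, assuming the engine has the table's known sort order ids and default order id at hand (all names and ids here are made up for illustration):

```java
import java.util.Set;

// Hypothetical sketch: pick the sort order id for a write job. An explicit
// request is validated against table metadata (abort on unknown ids);
// otherwise the table's default sort order is used.
public class WriteSortOrderSelector {
  // stand-ins for values that would come from table metadata
  static final Set<Integer> KNOWN_ORDER_IDS = Set.of(0, 1, 2);
  static final int TABLE_DEFAULT_ORDER_ID = 1;

  static int chooseSortOrderId(Integer requestedId) {
    if (requestedId == null) {
      // nothing specified for the job/command: fall back to the default
      return TABLE_DEFAULT_ORDER_ID;
    }
    if (!KNOWN_ORDER_IDS.contains(requestedId)) {
      // validation against table metadata failed: abort the job
      throw new IllegalArgumentException("unknown sort order id: " + requestedId);
    }
    return requestedId;
  }
}
```

The chosen id would then be passed into the writer factory so every file the job produces records the order it was actually written with.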


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org