You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2019/08/24 21:10:56 UTC

[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #348: Avoid floating point number ordering NaN semantics

rdblue commented on a change in pull request #348: Avoid floating point number ordering NaN semantics
URL: https://github.com/apache/incubator-iceberg/pull/348#discussion_r317374413
 
 

 ##########
 File path: site/docs/spec.md
 ##########
 @@ -206,19 +206,21 @@ The schema of a manifest file is a struct called `manifest_entry` with the follo
 | **`104  file_size_in_bytes`**     | `long`                                | Total file size in bytes                                                                                                                                                                             |
 | ~~**`105 block_size_in_bytes`**~~ | `long`                                | **Deprecated. Always write a default value and do not read.**                                                                                                                                        |
 | **`106  file_ordinal`**           | `optional int`                        | Ordinal of the file w.r.t files with the same partition tuple and snapshot id                                                                                                                        |
-| **`107  sort_columns`**           | `optional list`                       | Columns the file is sorted by                                                                                                                                                                        |
+| **`107  sort_columns`**           | `optional list`                       | Columns the file is sorted by [2]. If a column has type `float` or `double` and contains `NaN`, it must not be in `sort_columns`.                                                                    |
 | **`108  column_sizes`**           | `optional map`                        | Map from column id to the total size on disk of all regions that store the column. Does not include bytes necessary to read other columns, like footers. Leave null for row-oriented formats (Avro). |
 | **`109  value_counts`**           | `optional map`                        | Map from column id to number of values in the column (including null values)                                                                                                                         |
 | **`110  null_value_counts`**      | `optional map`                        | Map from column id to number of null values in the column                                                                                                                                            |
 | ~~**`111 distinct_counts`**~~     | `optional map`                        | **Deprecated. Do not use.**                                                                                                                                                                          |
-| **`125  lower_bounds`**           | `optional map<126: int, 127: binary>` | Map from column id to lower bound in the column serialized as binary [1]. Each value must be less than or equal to all values in the column for the file.                                            |
-| **`128  upper_bounds`**           | `optional map<129: int, 130: binary>` | Map from column id to upper bound in the column serialized as binary [1]. Each value must be greater than or equal to all values in the column for the file.                                         |
+| **`125  lower_bounds`**           | `optional map<126: int, 127: binary>` | Map from column id to lower bound in the column serialized as binary [1]. Each value must be less than or equal to all values in the column for the file. [3]                                        |
+| **`128  upper_bounds`**           | `optional map<129: int, 130: binary>` | Map from column id to upper bound in the column serialized as binary [1]. Each value must be greater than or equal to all values in the column for the file. [3]                                     |
 | **`131  key_metadata`**           | `optional binary`                     | Implementation-specific key metadata for encryption                                                                                                                                                  |
 | **`132  split_offsets`**          | `optional list`                       | Split offsets for the data file. For example, all row group offsets in a Parquet file. Must be sorted ascending.                                                                                     |
 
 Notes:
 
 1. Single-value serialization for lower and upper bounds is detailed in Appendix D.
+2. For `float` and `double`, the value `-0.0` must precede `+0.0`, as in the IEEE 754 `totalOrder` predicate.
+3. Since `NaN` is not less than or equal to or greater than any value, this implies that columns of type `float` or `double` may not appear in `lower_bounds` or `upper_bounds` when the column contains `NaN`. As for `float` or `double` columns in `sort_columns`, `-0.0` is considered to be strictly less than `+0.0`, following IEEE 754's `totalOrder` predicate.
 
 Review comment:
   Is there an alternative to this interpretation? Could we require NaN counts or specifically except NaN from this as we do with null values?
   
   To handle nulls, we never use `null` with equality/inequality predicates. We could do a similar thing for NaN, where the lower and upper bounds apply to non-null and non-NaN values.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org