Posted to commits@parquet.apache.org by bl...@apache.org on 2015/03/04 21:08:58 UTC

incubator-parquet-format git commit: PARQUET-113: Add specs for LIST and MAP annotations.

Repository: incubator-parquet-format
Updated Branches:
  refs/heads/master e0e4ce153 -> 0e2e0a469


PARQUET-113: Add specs for LIST and MAP annotations.

Draft specs for using `MAP` and `LIST` annotations.

Please help verify that this can read all existing map and list data correctly!

Author: Ryan Blue <bl...@apache.org>

Closes #17 from rdblue/PARQUET-113-add-list-and-map-spec and squashes the following commits:

7c50699 [Ryan Blue] PARQUET-113: Clarify LIST and MAP annotations.
eb627c7 [Ryan Blue] PARQUET-113: Add rules for maps written with Hive.
2515ffc [Ryan Blue] PARQUET-113: Clarify rules after working on implementations.
969a71e [Ryan Blue] PARQUET-113: Remove requirement for annotated repeated types.
3135c61 [Ryan Blue] PARQUET-113: Add specs for LIST and MAP annotations.


Project: http://git-wip-us.apache.org/repos/asf/incubator-parquet-format/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-parquet-format/commit/0e2e0a46
Tree: http://git-wip-us.apache.org/repos/asf/incubator-parquet-format/tree/0e2e0a46
Diff: http://git-wip-us.apache.org/repos/asf/incubator-parquet-format/diff/0e2e0a46

Branch: refs/heads/master
Commit: 0e2e0a469f3b2dd6b53210a89c851cbbf663fd6f
Parents: e0e4ce1
Author: Ryan Blue <bl...@apache.org>
Authored: Wed Mar 4 12:08:49 2015 -0800
Committer: Ryan Blue <bl...@apache.org>
Committed: Wed Mar 4 12:08:49 2015 -0800

----------------------------------------------------------------------
 LogicalTypes.md | 205 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 205 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-parquet-format/blob/0e2e0a46/LogicalTypes.md
----------------------------------------------------------------------
diff --git a/LogicalTypes.md b/LogicalTypes.md
index e686a27..6bbd27a 100644
--- a/LogicalTypes.md
+++ b/LogicalTypes.md
@@ -148,3 +148,208 @@ primitive type. The `binary` data is interpreted as an encoded BSON document as
 defined by the [BSON specification][bson-spec].
 
 [bson-spec]: http://bsonspec.org/spec.html
+
+## Nested Types
+
+This section specifies how `LIST` and `MAP` can be used to encode nested types
+by adding group levels, which are not present in the data, around repeated fields.
+
+This does not affect repeated fields that are not annotated: A repeated field
+that is neither contained by a `LIST`- or `MAP`-annotated group nor annotated
+by `LIST` or `MAP` should be interpreted as a required list of required
+elements where the element type is the type of the field.
+
+Implementations should use either `LIST` and `MAP` annotations _or_ unannotated
+repeated fields, but not both. When using the annotations, no unannotated
+repeated types are allowed.
+
+### Lists
+
+`LIST` is used to annotate types that should be interpreted as lists.
+
+`LIST` must always annotate a 3-level structure:
+
+```
+<list-repetition> group <name> (LIST) {
+  repeated group list {
+    <element-repetition> <element-type> element;
+  }
+}
+```
+
+* The outer-most level must be a group annotated with `LIST` that contains a
+  single field named `list`. The repetition of this level must be either
+  `optional` or `required` and determines whether the list is nullable.
+* The middle level, named `list`, must be a repeated group with a single
+  field named `element`.
+* The `element` field encodes the list's element type and repetition. Element
+  repetition must be `required` or `optional`.
+
+The following examples demonstrate two of the possible lists of string values.
+
+```
+// List<String> (list non-null, elements nullable)
+required group my_list (LIST) {
+  repeated group list {
+    optional binary element (UTF8);
+  }
+}
+
+// List<String> (list nullable, elements non-null)
+optional group my_list (LIST) {
+  repeated group list {
+    required binary element (UTF8);
+  }
+}
+```
+
+Element types can be nested structures. For example, a list of lists:
+
+```
+// List<List<Integer>>
+optional group array_of_arrays (LIST) {
+  repeated group list {
+    required group element (LIST) {
+      repeated group list {
+        required int32 element;
+      }
+    }
+  }
+}
+```
+
+#### Backward-compatibility rules
+
+It is required that the repeated group of elements is named `list` and that
+its element field is named `element`. However, these names may not be used in
+existing data and should not be enforced as errors when reading. For example,
+the following field schema should produce a nullable list of non-null strings,
+even though the repeated group is named `element`.
+
+```
+optional group my_list (LIST) {
+  repeated group element {
+    required binary str (UTF8);
+  };
+}
+```
+
+Some existing data did not include the inner element layer. For
+backward-compatibility, the type of elements in `LIST`-annotated structures
+should always be determined by the following rules based on the repeated field:
+
+1. If the repeated field is not a group, then its type is the element type and
+   elements are required.
+2. If the repeated field is a group with multiple fields, then its type is the
+   element type and elements are required.
+3. If the repeated field is a group with one field and is named either `array`
+   or uses the `LIST`-annotated group's name with `_tuple` appended, then the
+   repeated type is the element type and elements are required.
+4. Otherwise, the repeated field's type is the element type with the repeated
+   field's repetition.
+
+Examples that can be interpreted using these rules:
+
+```
+// List<Integer> (nullable list, non-null elements)
+optional group my_list (LIST) {
+  repeated int32 element;
+}
+
+// List<Tuple<String, Integer>> (nullable list, non-null elements)
+optional group my_list (LIST) {
+  repeated group element {
+    required binary str (UTF8);
+    required int32 num;
+  };
+}
+
+// List<OneTuple<String>> (nullable list, non-null elements)
+optional group my_list (LIST) {
+  repeated group array {
+    required binary str (UTF8);
+  };
+}
+
+// List<OneTuple<String>> (nullable list, non-null elements)
+optional group my_list (LIST) {
+  repeated group my_list_tuple {
+    required binary str (UTF8);
+  };
+}
+```
+
+### Maps
+
+`MAP` is used to annotate types that should be interpreted as a map from keys
+to values. `MAP` must annotate a 3-level structure:
+
+```
+<map-repetition> group <name> (MAP) {
+  repeated group key_value {
+    required <key-type> key;
+    <value-repetition> <value-type> value;
+  }
+}
+```
+
+* The outer-most level must be a group annotated with `MAP` that contains a
+  single field named `key_value`. The repetition of this level must be either
+  `optional` or `required` and determines whether the map is nullable.
+* The middle level, named `key_value`, must be a repeated group with a `key`
+  field for map keys and, optionally, a `value` field for map values.
+* The `key` field encodes the map's key type. This field must have
+  repetition `required` and must always be present.
+* The `value` field encodes the map's value type and repetition. This field can
+  be `required`, `optional`, or omitted.
+
+The following example demonstrates the type for a non-null map from strings to
+nullable integers:
+
+```
+// Map<String, Integer>
+required group my_map (MAP) {
+  repeated group key_value {
+    required binary key (UTF8);
+    optional int32 value;
+  }
+}
+```
+
+If there are multiple key-value pairs for the same key, then the value kept
+for that key must be the last one encoded. Earlier values may be ignored or may
+be added with replacement to the map container in the order they are encoded.
+The `MAP` annotation should not be used to encode multi-maps using duplicate
+keys.
+
+#### Backward-compatibility rules
+
+It is required that the repeated group of key-value pairs is named `key_value`
+and that its fields are named `key` and `value`. However, these names may not
+be used in existing data and should not be enforced as errors when reading.
+
+Some existing data incorrectly used `MAP_KEY_VALUE` in place of `MAP`. For
+backward-compatibility, a group annotated with `MAP_KEY_VALUE` that is not
+contained by a `MAP`-annotated group should be handled as a `MAP`-annotated
+group.
+
+Examples that can be interpreted using these rules:
+
+```
+// Map<String, Integer> (nullable map, non-null values)
+optional group my_map (MAP) {
+  repeated group map {
+    required binary str (UTF8);
+    required int32 num;
+  }
+}
+
+// Map<String, Integer> (nullable map, nullable values)
+optional group my_map (MAP_KEY_VALUE) {
+  repeated group map {
+    required binary key (UTF8);
+    optional int32 value;
+  }
+}
+```
+
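Note for implementers: the four backward-compatibility rules for `LIST` in the diff above can be sketched as a small resolver. This is a minimal Python illustration, not code from any Parquet library. `SchemaField` and `resolve_list_element` are hypothetical names introduced here. Rule 4 is read, based on the `repeated group element { required binary str (UTF8); }` example, as "the repeated group is only a wrapper and its single field is the element, with that field's own repetition"; that reading is an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SchemaField:
    """Hypothetical stand-in for a Parquet schema node (not a real library type)."""
    name: str
    repetition: str   # "required", "optional", or "repeated"
    type_name: str    # a primitive name such as "int32", or "group"
    children: List["SchemaField"] = field(default_factory=list)


def resolve_list_element(list_group: SchemaField) -> Tuple[SchemaField, bool]:
    """Return (element_field, element_is_nullable) for a LIST-annotated group."""
    if len(list_group.children) != 1:
        raise ValueError("LIST group must contain exactly one field")
    repeated = list_group.children[0]
    if repeated.repetition != "repeated":
        raise ValueError("the field inside a LIST group must be repeated")

    # Rule 1: a repeated non-group field is itself the element; elements are required.
    if repeated.type_name != "group":
        return repeated, False

    # Rule 2: a repeated group with multiple fields is the element; elements are required.
    if len(repeated.children) > 1:
        return repeated, False

    if len(repeated.children) != 1:
        raise ValueError("repeated group has no fields")

    # Rule 3: a single-field group named "array" or "<list name>_tuple" is the element.
    if repeated.name in ("array", list_group.name + "_tuple"):
        return repeated, False

    # Rule 4 (assumed reading): the repeated group is a wrapper; its single
    # field is the element and carries the element's nullability.
    element = repeated.children[0]
    return element, element.repetition == "optional"


# Usage: the two-level legacy form `repeated int32 element` from the examples.
legacy = SchemaField("my_list", "optional", "group",
                     [SchemaField("element", "repeated", "int32")])
element, nullable = resolve_list_element(legacy)
print(element.name, nullable)
```

The rules are checked in the order the spec lists them, so a single-field group named `array` is treated as the element type before the wrapper interpretation is ever considered.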