You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@arrow.apache.org by we...@apache.org on 2016/03/25 03:19:46 UTC

arrow git commit: ARROW-62: Clarify null bitmap interpretation, indicate bit-endianness, add null count, remove non-nullable physical distinction

Repository: arrow
Updated Branches:
  refs/heads/master fbbee3d2d -> c06b7654b


ARROW-62: Clarify null bitmap interpretation, indicate bit-endianness, add null count, remove non-nullable physical distinction

As the initial scribe for the Arrow format, I made a mistake in what the null bits mean (1 for not-null, 0 for null). I also addressed ARROW-56 (bit-numbering) here.

Database systems are split on this subject. PostgreSQL for example does it this way:

http://www.postgresql.org/docs/9.5/static/storage-page-layout.html

> In this list of bits, a 1 bit indicates not-null, a 0 bit is a null. When the bitmap is not present, all columns are assumed not-null.

Since the Drill implementation predates the Arrow project, I think it's safe to go with this.

This patch also includes ARROW-76 which adds a "null count" to the memory layout indicating the actual number of nulls in an array. This also strikes the "non-nullable" distinction from the memory layout as there is no semantic difference between arrays with null count 0 and a non-nullable array. Instead, users may choose to set `nullable=false` in the schema metadata and verify that Arrow memory conforms to the schema.

Author: Wes McKinney <we...@apache.org>

Closes #34 from wesm/ARROW-62 and squashes the following commits:

8c92926 [Wes McKinney] Add to README about what the format documents are
1f6fe03 [Wes McKinney] Account for null count and non-nullable removal from ARROW-76
648fd47 [Wes McKinney] Indicate that bitmaps should be a multiple of 8 bytes
4333d82 [Wes McKinney] Use 'null bitmap' similar to PostgreSQL documentation
dac77d4 [Wes McKinney] Revise format document language re: null bitmaps per feedback
f7a3898 [Wes McKinney] Revise format to indicate LSB bit numbering and 0/1 null/not-null distinction


Project: http://git-wip-us.apache.org/repos/asf/arrow/repo
Commit: http://git-wip-us.apache.org/repos/asf/arrow/commit/c06b7654
Tree: http://git-wip-us.apache.org/repos/asf/arrow/tree/c06b7654
Diff: http://git-wip-us.apache.org/repos/asf/arrow/diff/c06b7654

Branch: refs/heads/master
Commit: c06b7654bccfe8c461869a6e5922668896c27c45
Parents: fbbee3d
Author: Wes McKinney <we...@apache.org>
Authored: Thu Mar 24 19:19:22 2016 -0700
Committer: Wes McKinney <we...@apache.org>
Committed: Thu Mar 24 19:19:22 2016 -0700

----------------------------------------------------------------------
 format/Layout.md                          |  77 +++++++++++++++++--------
 format/Message.fbs                        |  10 ++--
 format/README.md                          |  17 ++++++
 format/diagrams/layout-list-of-struct.png | Bin 60600 -> 54122 bytes
 4 files changed, 74 insertions(+), 30 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/arrow/blob/c06b7654/format/Layout.md
----------------------------------------------------------------------
diff --git a/format/Layout.md b/format/Layout.md
index c393163..2d46ece 100644
--- a/format/Layout.md
+++ b/format/Layout.md
@@ -42,7 +42,7 @@ Base requirements
 * Capable of representing fully-materialized and decoded / decompressed Parquet
   data
 * All leaf nodes (primitive value arrays) use contiguous memory regions
-* Each relative type can be nullable or non-nullable
+* Any relative type can be have null slots
 * Arrays are immutable once created. Implementations can provide APIs to mutate
   an array, but applying mutations will require a new array data structure to
   be built.
@@ -56,7 +56,7 @@ Base requirements
 * To describe relative types (physical value types and a preliminary set of
   nested types) sufficient for an unambiguous implementation
 * Memory layout and random access patterns for each relative type
-* Null representation for nullable types
+* Null value representation
 
 ## Non-goals (for this document
 
@@ -79,28 +79,55 @@ Base requirements
 Any array has a known and fixed length, stored as a 32-bit signed integer, so a
 maximum of 2^31 - 1 elements. We choose a signed int32 for a couple reasons:
 
-* Enhance compatibility with Java and client languages which may have varying quality of support for unsigned integers.
+* Enhance compatibility with Java and client languages which may have varying
+  quality of support for unsigned integers.
 * To encourage developers to compose smaller arrays (each of which contains
   contiguous memory in its leaf nodes) to create larger array structures
   possibly exceeding 2^31 - 1 elements, as opposed to allocating very large
   contiguous memory blocks.
 
-## Nullable and non-nullable arrays
+## Null count
 
-Any relative type can be nullable or non-nullable.
+The number of null value slots is a property of the physical array and
+considered part of the data structure. The null count is stored as a 32-bit
+signed integer, as it may be as large as the array length.
 
-Nullable arrays have a contiguous memory buffer, known as the null bitmask,
-whose length is large enough to have 1 bit for each array slot. Whether any
-array slot is null is encoded in the respective bits of this bitmask, i.e.:
+## Null bitmaps
+
+Any relative type can have null value slots, whether primitive or nested type.
+
+An array with nulls must have a contiguous memory buffer, known as the null (or
+validity) bitmap, whose length is a multiple of 8 bytes (to avoid
+word-alignment concerns) and large enough to have at least 1 bit for each array
+slot.
+
+Whether any array slot is valid (non-null) is encoded in the respective bits of
+this bitmap. A 1 (set bit) for index `j` indicates that the value is not null,
+while a 0 (bit not set) indicates that it is null. Bitmaps are to be
+initialized to be all unset at allocation time.
 
 ```
-is_null[j] -> bitmask[j / 8] & (1 << (j % 8))
+is_valid[j] -> bitmap[j / 8] & (1 << (j % 8))
 ```
 
-Physically, non-nullable (NN) arrays do not have a null bitmask.
+We use [least-significant bit (LSB) numbering][1] (also known as
+bit-endianness). This means that within a group of 8 bits, we read
+right-to-left:
 
-For nested types, if the top-level nested type is nullable, it has its own
-bitmask regardless of whether the child types are nullable.
+```
+values = [0, 1, null, 2, null, 3]
+
+bitmap
+j mod 8   7  6  5  4  3  2  1  0
+          0  0  1  0  1  0  1  1
+```
+
+Arrays having a 0 null count may choose to not allocate the null
+bitmap. Implementations may choose to always allocate one anyway as a matter of
+convenience, but this should be noted when memory is being shared.
+
+Nested type arrays have their own null bitmap and null count regardless of
+the null count and null bits of their child arrays.
 
 ## Primitive value arrays
 
@@ -112,9 +139,8 @@ Internally, the array contains a contiguous memory buffer whose total size is
 equal to the slot width multiplied by the array length. For bit-packed types,
 the size is rounded up to the nearest byte.
 
-The associated null bitmask (for nullable types) is contiguously allocated (as
-described above) but does not need to be adjacent in memory to the values
-buffer.
+The associated null bitmap is contiguously allocated (as described above) but
+does not need to be adjacent in memory to the values buffer.
 
 (diagram not to scale)
 
@@ -180,22 +206,22 @@ For example, the struct (field names shown here as strings for illustration
 purposes)
 
 ```
-Struct [nullable] <
-  name: String (= List<char>) [nullable],
-  age: Int32 [not-nullable]
+Struct <
+  name: String (= List<char>),
+  age: Int32
 >
 ```
 
-has two child arrays, one List<char> array (layout as above) and one
-non-nullable 4-byte physical value array having Int32 (not-null) logical
-type. Here is a diagram showing the full physical layout of this struct:
+has two child arrays, one List<char> array (layout as above) and one 4-byte
+physical value array having Int32 logical type. Here is a diagram showing the
+full physical layout of this struct:
 
 <img src="diagrams/layout-list-of-struct.png" width="400"/>
 
 While a struct does not have physical storage for each of its semantic slots
 (i.e. each scalar C-like struct), an entire struct slot can be set to null via
-the bitmask. Whether each of the child field arrays can have null values
-depends on whether or not the respective relative type is nullable.
+the null bitmap. Any of the child field arrays can have null values according
+to their respective independent null bitmaps.
 
 ## Dense union type
 
@@ -233,8 +259,7 @@ Here is a diagram of an example dense union:
 
 A sparse union has the same structure as a dense union, with the omission of
 the offsets array. In this case, the child arrays are each equal in length to
-the length of the union. This is analogous to a large struct in which all
-fields are nullable.
+the length of the union.
 
 While a sparse union may use significantly more space compared with a dense
 union, it has some advantages that may be desirable in certain use cases:
@@ -251,3 +276,5 @@ the correct value.
 ## References
 
 Drill docs https://drill.apache.org/docs/value-vectors/
+
+[1]: https://en.wikipedia.org/wiki/Bit_numbering
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/arrow/blob/c06b7654/format/Message.fbs
----------------------------------------------------------------------
diff --git a/format/Message.fbs b/format/Message.fbs
index 3ffd203..fc849ee 100644
--- a/format/Message.fbs
+++ b/format/Message.fbs
@@ -129,8 +129,8 @@ struct FieldNode {
   length: int;
 
   /// The number of observed nulls. Fields with null_count == 0 may choose not
-  /// to write their physical null bitmap out as a materialized buffer, instead
-  /// setting the length of the null buffer to 0.
+  /// to write their physical validity bitmap out as a materialized buffer,
+  /// instead setting the length of the bitmap buffer to 0.
   null_count: int;
 }
 
@@ -148,9 +148,9 @@ table RecordBatch {
   /// Buffers correspond to the pre-ordered flattened buffer tree
   ///
   /// The number of buffers appended to this list depends on the schema. For
-  /// example, most primitive arrays will have 2 buffers, 1 for the null bitmap
-  /// and 1 for the values. For struct arrays, there will only be a single
-  /// buffer for the null bitmap
+  /// example, most primitive arrays will have 2 buffers, 1 for the validity
+  /// bitmap and 1 for the values. For struct arrays, there will only be a
+  /// single buffer for the validity (nulls) bitmap
   buffers: [Buffer];
 }
 

http://git-wip-us.apache.org/repos/asf/arrow/blob/c06b7654/format/README.md
----------------------------------------------------------------------
diff --git a/format/README.md b/format/README.md
index 1120e62..c84e007 100644
--- a/format/README.md
+++ b/format/README.md
@@ -3,3 +3,20 @@
 > **Work-in-progress specification documents**. These are discussion documents
 > created by the Arrow developers during late 2015 and in no way represents a
 > finalized specification.
+
+Currently, the Arrow specification consists of these pieces:
+
+- Physical memory layout specification (see Layout.md)
+- Metadata serialized representation (see Message.fbs)
+
+The metadata currently uses Google's [flatbuffers library][1] for serializing a
+couple related pieces of information:
+
+- Schemas for tables or record (row) batches. This contains the logical types,
+  field names, and other metadata. Schemas do not contain any information about
+  actual data.
+- *Data headers* for record (row) batches. These must correspond to a known
+   schema, and enable a system to send and receive Arrow row batches in a form
+   that can be precisely disassembled or reconstructed.
+
+[1]: http://github.com/google/flatbuffers
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/arrow/blob/c06b7654/format/diagrams/layout-list-of-struct.png
----------------------------------------------------------------------
diff --git a/format/diagrams/layout-list-of-struct.png b/format/diagrams/layout-list-of-struct.png
index 00d6c6f..fb6f2a2 100644
Binary files a/format/diagrams/layout-list-of-struct.png and b/format/diagrams/layout-list-of-struct.png differ