Posted to commits@arrow.apache.org by uw...@apache.org on 2018/12/23 16:31:51 UTC

[44/51] [partial] arrow-site git commit: Upload nightly docs

http://git-wip-us.apache.org/repos/asf/arrow-site/blob/62ef7145/docs/latest/_sources/format/IPC.rst.txt
----------------------------------------------------------------------
diff --git a/docs/latest/_sources/format/IPC.rst.txt b/docs/latest/_sources/format/IPC.rst.txt
new file mode 100644
index 0000000..8cb74b8
--- /dev/null
+++ b/docs/latest/_sources/format/IPC.rst.txt
@@ -0,0 +1,237 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+Interprocess messaging / communication (IPC)
+============================================
+
+Encapsulated message format
+---------------------------
+
+Data components in the stream and file formats are represented as encapsulated
+*messages* consisting of:
+
+* A length prefix indicating the metadata size
+* The message metadata as a `Flatbuffer`_
+* Padding bytes to an 8-byte boundary
+* The message body, which must be a multiple of 8 bytes
+
+Schematically, we have: ::
+
+    <metadata_size: int32>
+    <metadata_flatbuffer: bytes>
+    <padding>
+    <message body>
+
+The complete serialized message must be a multiple of 8 bytes so that messages
+can be relocated between streams. Otherwise the amount of padding between the
+metadata and the message body could be non-deterministic.
+
+The ``metadata_size`` includes the size of the flatbuffer plus padding. The
+``Message`` flatbuffer includes a version number, the particular message (as a
+flatbuffer union), and the size of the message body: ::
+
+    table Message {
+      version: org.apache.arrow.flatbuf.MetadataVersion;
+      header: MessageHeader;
+      bodyLength: long;
+    }
+
+Currently, we support 4 types of messages:
+
+* Schema
+* RecordBatch
+* DictionaryBatch
+* Tensor
+
+Streaming format
+----------------
+
+We provide a streaming format for record batches. It is presented as a sequence
+of encapsulated messages, each of which follows the format above. The schema
+comes first in the stream, and it is the same for all of the record batches
+that follow. If any fields in the schema are dictionary-encoded, one or more
+``DictionaryBatch`` messages will be included. ``DictionaryBatch`` and
+``RecordBatch`` messages may be interleaved, but before any dictionary key is used
+in a ``RecordBatch`` it should be defined in a ``DictionaryBatch``. ::
+
+    <SCHEMA>
+    <DICTIONARY 0>
+    ...
+    <DICTIONARY k - 1>
+    <RECORD BATCH 0>
+    ...
+    <DICTIONARY x DELTA>
+    ...
+    <DICTIONARY y DELTA>
+    ...
+    <RECORD BATCH n - 1>
+    <EOS [optional]: int32>
+
+When reading a stream, after each message a reader implementation reads the
+next 4 bytes to determine the size of the message metadata that follows. Once
+the message flatbuffer has been read, the reader can then read the message
+body.
+
+The stream writer can signal end-of-stream (EOS) either by writing 0 as the
+``int32`` metadata length prefix or by simply closing the stream interface.
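+
+For illustration, a minimal Python sketch of this read loop (not a reference
+implementation; ``parse_body_length``, which would extract ``bodyLength`` from
+the ``Message`` flatbuffer, is a hypothetical helper, and the length prefix is
+assumed little-endian): ::
+
+    import struct
+
+    def read_messages(stream):
+        while True:
+            prefix = stream.read(4)
+            if len(prefix) < 4:
+                break  # the writer closed the stream without an explicit EOS
+            (metadata_size,) = struct.unpack('<i', prefix)
+            if metadata_size == 0:
+                break  # explicit end-of-stream (EOS) marker
+            metadata = stream.read(metadata_size)  # flatbuffer plus padding
+            body_length = parse_body_length(metadata)  # hypothetical
+            body = stream.read(body_length)
+            yield metadata, body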
+
+File format
+-----------
+
+We define a "file format" that supports random access and is very similar to
+the streaming format. The file starts and ends with a magic string ``ARROW1``
+(plus padding). What follows in the file is identical to the stream format. At
+the end of the file, we write a *footer* containing a redundant copy of the
+schema (which is a part of the streaming format) plus memory offsets and sizes
+for each of the data blocks in the file. This enables random access to any
+record batch in the file. See ``File.fbs`` for the precise details of the file
+footer.
+
+Schematically we have: ::
+
+    <magic number "ARROW1">
+    <empty padding bytes [to 8 byte boundary]>
+    <STREAMING FORMAT>
+    <FOOTER>
+    <FOOTER SIZE: int32>
+    <magic number "ARROW1">
+
+In the file format, there is no requirement that dictionary keys should be
+defined in a ``DictionaryBatch`` before they are used in a ``RecordBatch``, as long
+as the keys are defined somewhere in the file.
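+
+For illustration, a minimal Python sketch of locating the footer based only on
+the layout above (parsing the footer flatbuffer itself, per ``File.fbs``, is
+omitted; the footer size is assumed to be a little-endian ``int32``): ::
+
+    import struct
+
+    def read_footer_bytes(f):
+        f.seek(0, 2)               # seek to the end of the file
+        file_size = f.tell()
+        f.seek(file_size - 6)      # trailing magic "ARROW1" (6 bytes)
+        assert f.read(6) == b'ARROW1'
+        f.seek(file_size - 6 - 4)  # int32 footer size precedes the magic
+        (footer_size,) = struct.unpack('<i', f.read(4))
+        f.seek(file_size - 6 - 4 - footer_size)
+        return f.read(footer_size)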
+
+RecordBatch body structure
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The ``RecordBatch`` metadata contains a depth-first (pre-order) flattened set of
+field metadata and physical memory buffers (some comments from ``Message.fbs``
+have been shortened / removed): ::
+
+    table RecordBatch {
+      length: long;
+      nodes: [FieldNode];
+      buffers: [Buffer];
+    }
+
+    struct FieldNode {
+      length: long;
+      null_count: long;
+    }
+
+    struct Buffer {
+      /// The relative offset into the shared memory page where the bytes for this
+      /// buffer starts
+      offset: long;
+
+      /// The absolute length (in bytes) of the memory buffer. The memory is found
+      /// from offset (inclusive) to offset + length (non-inclusive).
+      length: long;
+    }
+
+In the context of a file, the ``page`` is not used, and the ``Buffer`` offsets
+use the start of the message body as their frame of reference. So, while in a
+general IPC setting these offsets may be anyplace in one or more shared memory
+regions, in the file format the offsets start from 0.
+
+The location of a record batch and the sizes of its metadata block and body
+of buffers are stored in the file footer: ::
+
+    struct Block {
+      offset: long;
+      metaDataLength: int;
+      bodyLength: long;
+    }
+
+The ``metaDataLength`` here includes the metadata length prefix, serialized
+metadata, and any additional padding bytes, and by construction must be a
+multiple of 8 bytes.
+
+Some notes about this:
+
+* The ``Block`` offset indicates the starting byte of the record batch.
+* The metadata length includes the flatbuffer size, the record batch metadata
+  flatbuffer, and any padding bytes
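+
+Putting these notes together, a hedged Python sketch of reading one block from
+a seekable file ``f``, given a ``Block`` with the fields above: ::
+
+    def read_block(f, block):
+        # the metadata (length prefix + flatbuffer + padding) starts at offset
+        f.seek(block.offset)
+        metadata = f.read(block.metaDataLength)
+        # the body of buffers immediately follows the metadata
+        body = f.read(block.bodyLength)
+        return metadata, body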
+
+Dictionary Batches
+~~~~~~~~~~~~~~~~~~
+
+Dictionaries are written in the stream and file formats as a sequence of record
+batches, each having a single field. The complete semantic schema for a
+sequence of record batches, therefore, consists of the schema along with all of
+the dictionaries. The dictionary types are found in the schema, so it is
+necessary to read the schema first to determine the dictionary types so that
+the dictionaries can be properly interpreted. ::
+
+    table DictionaryBatch {
+      id: long;
+      data: RecordBatch;
+      isDelta: bool = false;
+    }
+
+The dictionary ``id`` in the message metadata can be referenced one or more times
+in the schema, so that dictionaries can even be used for multiple fields. See
+the :doc:`Layout` document for more about the semantics of
+dictionary-encoded data.
+
+The dictionary ``isDelta`` flag allows dictionary batches to be modified
+mid-stream.  A dictionary batch with ``isDelta`` set indicates that its vector
+should be concatenated with those of any previous batches with the same ``id``. A
+stream which encodes one column, the list of strings
+``["A", "B", "C", "B", "D", "C", "E", "A"]``, with a delta dictionary batch could
+take the form: ::
+
+    <SCHEMA>
+    <DICTIONARY 0>
+    (0) "A"
+    (1) "B"
+    (2) "C"
+
+    <RECORD BATCH 0>
+    0
+    1
+    2
+    1
+
+    <DICTIONARY 0 DELTA>
+    (3) "D"
+    (4) "E"
+
+    <RECORD BATCH 1>
+    3
+    2
+    4
+    0
+    EOS
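+
+A reader might maintain the dictionaries as a mapping from id to accumulated
+values; a minimal Python sketch of the replace-versus-delta logic: ::
+
+    dictionaries = {}
+
+    def on_dictionary_batch(dict_id, values, is_delta):
+        if is_delta and dict_id in dictionaries:
+            # concatenate with the vectors of previous batches of the same id
+            dictionaries[dict_id] = dictionaries[dict_id] + values
+        else:
+            dictionaries[dict_id] = values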
+
+Tensor (Multi-dimensional Array) Message Format
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The ``Tensor`` message type provides a way to write a multidimensional array of
+fixed-size values (such as a NumPy ndarray) using Arrow's shared memory
+tools. Arrow implementations in general are not required to implement this data
+format, though we provide a reference implementation in C++.
+
+When writing a standalone encapsulated tensor message, we use the format as
+indicated above, but additionally align the starting offset of the metadata as
+well as the starting offset of the tensor body (if writing to a shared memory
+region) to be multiples of 64 bytes: ::
+
+    <PADDING>
+    <metadata size: int32>
+    <metadata>
+    <tensor body>
+
+.. _Flatbuffer: https://github.com/google/flatbuffers

http://git-wip-us.apache.org/repos/asf/arrow-site/blob/62ef7145/docs/latest/_sources/format/Layout.rst.txt
----------------------------------------------------------------------
diff --git a/docs/latest/_sources/format/Layout.rst.txt b/docs/latest/_sources/format/Layout.rst.txt
new file mode 100644
index 0000000..868a99b
--- /dev/null
+++ b/docs/latest/_sources/format/Layout.rst.txt
@@ -0,0 +1,664 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+Physical memory layout
+======================
+
+Definitions / Terminology
+-------------------------
+
+Since different projects have used different words to describe various
+concepts, here is a small glossary to help disambiguate.
+
+* Array: a sequence of values with known length all having the same type.
+* Slot or array slot: a single logical value in an array of some particular data type
+* Contiguous memory region: a sequential virtual address space with a given
+  length. Any byte can be reached via a single pointer offset less than the
+  region's length.
+* Contiguous memory buffer: A contiguous memory region that stores
+  a multi-value component of an Array.  Sometimes referred to as just "buffer".
+* Primitive type: a data type that occupies a fixed-size memory slot specified
+  in bit width or byte width
+* Nested or parametric type: a data type whose full structure depends on one or
+  more other child relative types. Two fully-specified nested types are equal
+  if and only if their child types are equal. For example, ``List<U>`` is distinct
+  from ``List<V>`` iff U and V are different relative types.
+* Relative type or simply type (unqualified): either a specific primitive type
+  or a fully-specified nested type. When we say slot we mean a relative type
+  value, not necessarily any physical storage region.
+* Logical type: A data type that is implemented using some relative (physical)
+  type. For example, Decimal values are stored as 16 bytes in a fixed byte
+  size array. Similarly, strings can be stored as ``List<1-byte>``.
+* Parent and child arrays: names to express relationships between physical
+  value arrays in a nested type structure. For example, a ``List<T>``-type parent
+  array has a T-type array as its child (see more on lists below).
+* Leaf node or leaf: A primitive value array that may or may not be a child
+  array of some array with a nested type.
+
+Requirements, goals, and non-goals
+----------------------------------
+
+Base requirements
+
+* A physical memory layout enabling zero-deserialization data interchange
+  amongst a variety of systems handling flat and nested columnar data, including
+  such systems as Spark, Drill, Impala, Kudu, Ibis, ODBC protocols, and
+  proprietary systems that utilize the open source components.
+* All array slots are accessible in constant time, with complexity growing
+  linearly in the nesting level
+* Capable of representing fully-materialized and decoded / decompressed `Parquet`_
+  data
+* All contiguous memory buffers in an IPC payload are required to be aligned
+  at 8-byte boundaries; in other words, each buffer must start at an aligned
+  8-byte offset.
+* The general recommendation is to align buffers at a 64-byte boundary, but
+  this is not absolutely necessary.
+* Any relative type can have null slots
+* Arrays are immutable once created. Implementations can provide APIs to mutate
+  an array, but applying mutations will require a new array data structure to
+  be built.
+* Arrays are relocatable (e.g. for RPC/transient storage) without pointer
+  swizzling. Another way of putting this is that contiguous memory regions can
+  be migrated to a different address space (e.g. via a memcpy-type of
+  operation) without altering their contents.
+
+Goals (for this document)
+-------------------------
+
+* To describe relative types (physical value types and a preliminary set of
+  nested types) sufficient for an unambiguous implementation
+* Memory layout and random access patterns for each relative type
+* Null value representation
+
+Non-goals (for this document)
+-----------------------------
+
+* To enumerate or specify logical types that can be implemented as primitive
+  (fixed-width) value types. For example: signed and unsigned integers,
+  floating point numbers, boolean, exact decimals, date and time types,
+  CHAR(K), VARCHAR(K), etc.
+* To specify standardized metadata or a data layout for RPC or transient file
+  storage.
+* To define a selection or masking vector construct
+* Implementation-specific details
+* Details of a user or developer C/C++/Java API.
+* Any "table" structure composed of named arrays each having their own type or
+  any other structure that composes arrays.
+* Any memory management or reference counting subsystem
+* To enumerate or specify types of encodings or compression support
+
+Byte Order (`Endianness`_)
+---------------------------
+
+The Arrow format is little endian by default.
+The Schema metadata has an endianness field indicating the endianness of
+RecordBatches. Typically this is the endianness of the system where the
+RecordBatch was generated. The main use case is exchanging RecordBatches
+between systems with the same endianness. Initially, implementations will
+return an error when trying to read a Schema with an endianness that does not
+match the underlying system. The reference implementation is focused on little
+endian and provides tests for it. Eventually we may provide automatic
+conversion via byte swapping.
+
+Alignment and Padding
+---------------------
+
+As noted above, all buffers must be aligned in memory at 8-byte boundaries and padded
+to a length that is a multiple of 8 bytes.  The alignment requirement follows best
+practices for optimized memory access:
+
+* Elements in numeric arrays will be guaranteed to be retrieved via aligned access.
+* On some architectures alignment can help limit partially used cache lines.
+* 64 byte alignment is recommended by the `Intel performance guide`_ for
+  data-structures over 64 bytes (which will be a common case for Arrow Arrays).
+
+Recommending padding to a multiple of 64 bytes allows for using `SIMD`_ instructions
+consistently in loops without additional conditional checks.
+This should allow for simpler, efficient and CPU cache-friendly code.
+The specific padding length was chosen because it matches the largest known
+SIMD instruction registers available as of April 2016 (Intel AVX-512). In other
+words, we can load the entire 64-byte buffer into a 512-bit wide SIMD register
+and get data-level parallelism on all the columnar values packed into the 64-byte
+buffer. Guaranteed padding can also allow certain compilers
+to generate more optimized code directly (e.g. One can safely use Intel's
+``-qopt-assume-safe-padding``).
+
+Unless otherwise noted, padded bytes do not need to have a specific value.
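+
+For example, the padded length of a buffer can be computed with simple integer
+arithmetic (a sketch, not normative): ::
+
+    def padded_length(nbytes, alignment=64):
+        # round nbytes up to the next multiple of the alignment (8 or 64)
+        return ((nbytes + alignment - 1) // alignment) * alignment
+
+    padded_length(5)    # -> 64
+    padded_length(100)  # -> 128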
+
+Array lengths
+-------------
+
+Array lengths are represented in the Arrow metadata as a 64-bit signed
+integer. However, an implementation of Arrow is considered valid even if it
+only supports lengths up to the maximum 32-bit signed integer. If using
+Arrow in a multi-language environment, we recommend limiting lengths to
+2 :sup:`31` - 1 elements or less. Larger data sets can be represented using
+multiple array chunks.
+
+Null count
+----------
+
+The number of null value slots is a property of the physical array and
+considered part of the data structure. The null count is represented in the
+Arrow metadata as a 64-bit signed integer, as it may be as large as the array
+length.
+
+Null bitmaps
+------------
+
+Any relative type can have null value slots, whether it is a primitive or a
+nested type.
+
+An array with nulls must have a contiguous memory buffer, known as the null (or
+validity) bitmap, whose length is a multiple of 64 bytes (as discussed above)
+and large enough to have at least 1 bit for each array
+slot.
+
+Whether any array slot is valid (non-null) is encoded in the respective bits of
+this bitmap. A 1 (set bit) for index ``j`` indicates that the value is not null,
+while a 0 (bit not set) indicates that it is null. Bitmaps are to be
+initialized to be all unset at allocation time (this includes padding): ::
+
+    is_valid[j] -> bitmap[j / 8] & (1 << (j % 8))
+
+We use `least-significant bit (LSB) numbering`_ (also known as
+bit-endianness). This means that within a group of 8 bits, we read
+right-to-left: ::
+
+    values = [0, 1, null, 2, null, 3]
+
+    bitmap
+    j mod 8   7  6  5  4  3  2  1  0
+              0  0  1  0  1  0  1  1
+
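+For illustration, a direct Python transcription of this rule, applied to the
+example above: ::
+
+    def is_valid(bitmap, j):
+        # LSB numbering: bit j lives in byte j // 8 at bit position j % 8
+        return (bitmap[j // 8] & (1 << (j % 8))) != 0
+
+    # values = [0, 1, null, 2, null, 3] -> bitmap byte 0b00101011
+    bitmap = bytes([0b00101011])
+    [is_valid(bitmap, j) for j in range(6)]
+    # -> [True, True, False, True, False, True]
+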
+Arrays having a 0 null count may choose to not allocate the null
+bitmap. Implementations may choose to always allocate one anyway as a matter of
+convenience, but this should be noted when memory is being shared.
+
+Nested type arrays have their own null bitmap and null count regardless of
+the null count and null bits of their child arrays.
+
+Primitive value arrays
+----------------------
+
+A primitive value array represents a fixed-length array of values, each having
+the same physical slot width, typically measured in bytes, though the spec also
+provides for bit-packed types (e.g. boolean values encoded in bits).
+
+Internally, the array contains a contiguous memory buffer whose total size is
+equal to the slot width multiplied by the array length. For bit-packed types,
+the size is rounded up to the nearest byte.
+
+The associated null bitmap is contiguously allocated (as described above) but
+does not need to be adjacent in memory to the values buffer.
+
+
+Example Layout: Int32 Array
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+For example a primitive array of int32s: ::
+
+    [1, null, 2, 4, 8]
+
+Would look like: ::
+
+    * Length: 5, Null count: 1
+    * Null bitmap buffer:
+
+      |Byte 0 (validity bitmap) | Bytes 1-63            |
+      |-------------------------|-----------------------|
+      | 00011101                | 0 (padding)           |
+
+    * Value Buffer:
+
+      |Bytes 0-3   | Bytes 4-7   | Bytes 8-11  | Bytes 12-15 | Bytes 16-19 | Bytes 20-63 |
+      |------------|-------------|-------------|-------------|-------------|-------------|
+      | 1          | unspecified | 2           | 4           | 8           | unspecified |
+
+Example Layout: Non-null int32 Array
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+``[1, 2, 3, 4, 8]`` has two possible layouts: ::
+
+    * Length: 5, Null count: 0
+    * Null bitmap buffer:
+
+      | Byte 0 (validity bitmap) | Bytes 1-63            |
+      |--------------------------|-----------------------|
+      | 00011111                 | 0 (padding)           |
+
+    * Value Buffer:
+
+      |Bytes 0-3   | Bytes 4-7   | Bytes 8-11  | bytes 12-15 | bytes 16-19 | Bytes 20-63 |
+      |------------|-------------|-------------|-------------|-------------|-------------|
+      | 1          | 2           | 3           | 4           | 8           | unspecified |
+
+or with the bitmap elided: ::
+
+    * Length 5, Null count: 0
+    * Null bitmap buffer: Not required
+    * Value Buffer:
+
+      |Bytes 0-3   | Bytes 4-7   | Bytes 8-11  | bytes 12-15 | bytes 16-19 | Bytes 20-63 |
+      |------------|-------------|-------------|-------------|-------------|-------------|
+      | 1          | 2           | 3           | 4           | 8           | unspecified |
+
+List type
+---------
+
+List is a nested type in which each array slot contains a variable-size
+sequence of values all having the same relative type (heterogeneity can be
+achieved through unions, described later).
+
+A list type is specified like ``List<T>``, where ``T`` is any relative type
+(primitive or nested).
+
+A list-array is represented by the combination of the following:
+
+* A values array, a child array of type T. T may also be a nested type.
+* An offsets buffer containing 32-bit signed integers with length equal to the
+  length of the top-level array plus one. Note that this limits the size of the
+  values array to 2 :sup:`31` -1.
+
+The offsets array encodes a start position in the values array, and the length
+of the value in each slot is computed using the first difference with the next
+element in the offsets array. For example, the position and length of slot j is
+computed as: ::
+
+    slot_position = offsets[j]
+    slot_length = offsets[j + 1] - offsets[j]  // (for 0 <= j < length)
+
+The first value in the offsets array is 0, and the last element is the length
+of the values array.
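+
+For illustration, a small Python sketch of this access pattern, using the
+offsets and values from the ``List<Char>`` example in the next section: ::
+
+    def list_slot(offsets, values, j):
+        start = offsets[j]
+        length = offsets[j + 1] - offsets[j]
+        return values[start:start + length]
+
+    offsets = [0, 3, 3, 7, 7]
+    values = "joemark"
+    list_slot(offsets, values, 0)  # -> 'joe'
+    list_slot(offsets, values, 2)  # -> 'mark'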
+
+Example Layout: ``List<Char>`` Array
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Let's consider an example, the type ``List<Char>``, where Char is a 1-byte
+logical type.
+
+An array of length 4 with respective values: ::
+
+    [['j', 'o', 'e'], null, ['m', 'a', 'r', 'k'], []]
+
+will have the following representation: ::
+
+    * Length: 4, Null count: 1
+    * Null bitmap buffer:
+
+      | Byte 0 (validity bitmap) | Bytes 1-63            |
+      |--------------------------|-----------------------|
+      | 00001101                 | 0 (padding)           |
+
+    * Offsets buffer (int32)
+
+      | Bytes 0-3  | Bytes 4-7   | Bytes 8-11  | Bytes 12-15 | Bytes 16-19 | Bytes 20-63 |
+      |------------|-------------|-------------|-------------|-------------|-------------|
+      | 0          | 3           | 3           | 7           | 7           | unspecified |
+
+    * Values array (char array):
+      * Length: 7,  Null count: 0
+      * Null bitmap buffer: Not required
+
+        | Bytes 0-6  | Bytes 7-63  |
+        |------------|-------------|
+        | joemark    | unspecified |
+
+Example Layout: ``List<List<byte>>``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+``[[[1, 2], [3, 4]], [[5, 6, 7], null, [8]], [[9, 10]]]``
+
+will be represented as follows: ::
+
+    * Length: 3, Null count: 0
+    * Null bitmap buffer: Not required
+    * Offsets buffer (int32)
+
+      | Bytes 0-3  | Bytes 4-7  | Bytes 8-11 | Bytes 12-15 | Bytes 16-63 |
+      |------------|------------|------------|-------------|-------------|
+      | 0          |  2         |  5         |  6          | unspecified |
+
+    * Values array (`List<byte>`)
+      * Length: 6, Null count: 1
+      * Null bitmap buffer:
+
+        | Byte 0 (validity bitmap) | Bytes 1-63  |
+        |--------------------------|-------------|
+        | 00110111                 | 0 (padding) |
+
+      * Offsets buffer (int32)
+
+        | Bytes 0-27           | Bytes 28-63 |
+        |----------------------|-------------|
+        | 0, 2, 4, 7, 7, 8, 10 | unspecified |
+
+      * Values array (bytes):
+        * Length: 10, Null count: 0
+        * Null bitmap buffer: Not required
+
+          | Bytes 0-9                     | Bytes 10-63 |
+          |-------------------------------|-------------|
+          | 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 | unspecified |
+
+Struct type
+-----------
+
+A struct is a nested type parameterized by an ordered sequence of relative
+types (which can all be distinct), called its fields.
+
+Typically the fields have names, but the names and their types are part of the
+type metadata, not the physical memory layout.
+
+A struct array does not have any additional allocated physical storage for its values.
+A struct array must still have an allocated null bitmap, if it has one or more null values.
+
+Physically, a struct type has one child array for each field. The child arrays are independent and need not be adjacent to each other in memory.
+
+For example, the struct (field names shown here as strings for illustration
+purposes)::
+
+    Struct <
+      name: String (= List<char>),
+      age: Int32
+    >
+
+has two child arrays, one ``List<char>`` array (layout as above) and one 4-byte
+primitive value array having ``Int32`` logical type.
+
+Example Layout: ``Struct<List<char>, Int32>``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The layout for ``[{'joe', 1}, {null, 2}, null, {'mark', 4}]`` would be: ::
+
+    * Length: 4, Null count: 1
+    * Null bitmap buffer:
+
+      |Byte 0 (validity bitmap) | Bytes 1-63            |
+      |-------------------------|-----------------------|
+      | 00001011                | 0 (padding)           |
+
+    * Children arrays:
+      * field-0 array (`List<char>`):
+        * Length: 4, Null count: 2
+        * Null bitmap buffer:
+
+          | Byte 0 (validity bitmap) | Bytes 1-63            |
+          |--------------------------|-----------------------|
+          | 00001001                 | 0 (padding)           |
+
+        * Offsets buffer:
+
+          | Bytes 0-19     |
+          |----------------|
+          | 0, 3, 3, 3, 7  |
+
+         * Values array:
+            * Length: 7, Null count: 0
+            * Null bitmap buffer: Not required
+
+            * Value buffer:
+
+              | Bytes 0-6      |
+              |----------------|
+              | joemark        |
+
+      * field-1 array (int32 array):
+        * Length: 4, Null count: 1
+        * Null bitmap buffer:
+
+          | Byte 0 (validity bitmap) | Bytes 1-63            |
+          |--------------------------|-----------------------|
+          | 00001011                 | 0 (padding)           |
+
+        * Value Buffer:
+
+          |Bytes 0-3   | Bytes 4-7   | Bytes 8-11  | Bytes 12-15 | Bytes 16-63 |
+          |------------|-------------|-------------|-------------|-------------|
+          | 1          | 2           | unspecified | 4           | unspecified |
+
+While a struct does not have physical storage for each of its semantic slots
+(i.e. each scalar C-like struct), an entire struct slot can be set to null via
+the null bitmap. Any of the child field arrays can have null values according
+to their respective independent null bitmaps. This implies that for a
+particular struct slot, the null bitmap for the struct array might indicate a
+null slot even when one or more of its child arrays have a non-null value in
+their corresponding slot. When reading the struct array, the parent null
+bitmap is authoritative. This is illustrated in the example above: the child
+arrays have valid entries for the null struct but are 'hidden' from the
+consumer by the parent array's null bitmap. However, when treated
+independently, the corresponding values of the child arrays will be non-null.
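+
+For illustration, a hedged Python sketch of reading one struct slot, assuming
+``children`` is the list of child field arrays and each supports integer
+indexing: ::
+
+    def struct_slot(parent_bitmap, children, j):
+        # the parent null bitmap is authoritative: a null struct slot hides
+        # whatever values the child arrays hold at index j
+        if not (parent_bitmap[j // 8] & (1 << (j % 8))):
+            return None
+        return tuple(child[j] for child in children)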
+
+Dense union type
+----------------
+
+A dense union is semantically similar to a struct, and contains an ordered
+sequence of relative types. While a struct contains multiple arrays, a union is
+semantically a single array in which each slot can have a different type.
+
+The union types may be named, but like structs this will be a matter of the
+metadata and will not affect the physical memory layout.
+
+We define two distinct union types that are optimized for different use
+cases. The first, the dense union, represents a mixed-type array with 5 bytes
+of overhead for each value. Its physical layout is as follows:
+
+* One child array for each relative type
+* Types buffer: A buffer of 8-bit signed integers; the type ids are enumerated
+  from 0, with one id corresponding to each child type.  A union with more
+  than 127 possible types can be modeled as a union of unions.
+* Offsets buffer: A buffer of signed int32 values indicating the relative offset
+  into the respective child array for the type in a given slot. The respective
+  offsets for each child value array must be in order / increasing.
+
+Critically, the dense union allows for minimal overhead in the ubiquitous
+union-of-structs with non-overlapping-fields use case (``Union<s1: Struct1, s2:
+Struct2, s3: Struct3, ...>``).
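+
+For illustration, a minimal Python sketch of random access into a dense union,
+assuming ``children`` is a sequence of child arrays indexed by type id: ::
+
+    def dense_union_slot(types, offsets, children, j):
+        type_id = types[j]    # 8-bit type id from the types buffer
+        offset = offsets[j]   # position of slot j within that child array
+        return children[type_id][offset]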
+
+Example Layout: Dense union
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+An example layout for the logical union ``Union<f: float, i: int32>`` having
+the values ``[{f=1.2}, null, {f=3.4}, {i=5}]``: ::
+
+    * Length: 4, Null count: 1
+    * Null bitmap buffer:
+      |Byte 0 (validity bitmap) | Bytes 1-63            |
+      |-------------------------|-----------------------|
+      |00001101                 | 0 (padding)           |
+
+    * Types buffer:
+
+      |Byte 0   | Byte 1      | Byte 2   | Byte 3   | Bytes 4-63  |
+      |---------|-------------|----------|----------|-------------|
+      | 0       | unspecified | 0        | 1        | unspecified |
+
+    * Offset buffer:
+
+      |Byte 0-3 | Byte 4-7    | Byte 8-11 | Byte 12-15 | Bytes 16-63 |
+      |---------|-------------|-----------|------------|-------------|
+      | 0       | unspecified | 1         | 0          | unspecified |
+
+    * Children arrays:
+      * Field-0 array (f: float):
+        * Length: 2, nulls: 0
+        * Null bitmap buffer: Not required
+
+        * Value Buffer:
+
+          | Bytes 0-7 | Bytes 8-63  |
+          |-----------|-------------|
+          | 1.2, 3.4  | unspecified |
+
+
+      * Field-1 array (i: int32):
+        * Length: 1, nulls: 0
+        * Null bitmap buffer: Not required
+
+        * Value Buffer:
+
+          | Bytes 0-3 | Bytes 4-63  |
+          |-----------|-------------|
+          | 5         | unspecified |
+
+Sparse union type
+-----------------
+
+A sparse union has the same structure as a dense union, with the omission of
+the offsets array. In this case, the child arrays are each equal in length to
+the length of the union.
+
+While a sparse union may use significantly more space compared with a dense
+union, it has some advantages that may be desirable in certain use cases:
+
+* A sparse union is more amenable to vectorized expression evaluation in some use cases.
+* Equal-length arrays can be interpreted as a union by only defining the types array.
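+
+Random access is correspondingly simpler; a sketch under the same assumptions
+as the dense union example: ::
+
+    def sparse_union_slot(types, children, j):
+        # every child array has the same length as the union itself,
+        # so no offsets indirection is needed
+        return children[types[j]][j]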
+
+Example layout: ``SparseUnion<u0: Int32, u1: Float, u2: List<Char>>``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The union array: ::
+
+    [{u0=5}, {u1=1.2}, {u2='joe'}, {u1=3.4}, {u0=4}, {u2='mark'}]
+
+will have the following layout: ::
+
+    * Length: 6, Null count: 0
+    * Null bitmap buffer: Not required
+
+    * Types buffer:
+
+     | Byte 0     | Byte 1      | Byte 2      | Byte 3      | Byte 4      | Byte 5       | Bytes  6-63           |
+     |------------|-------------|-------------|-------------|-------------|--------------|-----------------------|
+     | 0          | 1           | 2           | 1           | 0           | 2            | unspecified (padding) |
+
+    * Children arrays:
+
+      * u0 (Int32):
+        * Length: 6, Null count: 4
+        * Null bitmap buffer:
+
+          |Byte 0 (validity bitmap) | Bytes 1-63            |
+          |-------------------------|-----------------------|
+          |00010001                 | 0 (padding)           |
+
+        * Value buffer:
+
+          |Bytes 0-3   | Bytes 4-7   | Bytes 8-11  | Bytes 12-15 | Bytes 16-19 | Bytes 20-23  | Bytes 24-63           |
+          |------------|-------------|-------------|-------------|-------------|--------------|-----------------------|
+          | 5          | unspecified | unspecified | unspecified | 4           |  unspecified | unspecified (padding) |
+
+      * u1 (float):
+        * Length: 6, Null count: 4
+        * Null bitmap buffer:
+
+          |Byte 0 (validity bitmap) | Bytes 1-63            |
+          |-------------------------|-----------------------|
+          | 00001010                | 0 (padding)           |
+
+        * Value buffer:
+
+          |Bytes 0-3    | Bytes 4-7   | Bytes 8-11  | Bytes 12-15 | Bytes 16-19 | Bytes 20-23  | Bytes 24-63           |
+          |-------------|-------------|-------------|-------------|-------------|--------------|-----------------------|
+          | unspecified |  1.2        | unspecified | 3.4         | unspecified |  unspecified | unspecified (padding) |
+
+      * u2 (`List<char>`)
+        * Length: 6, Null count: 4
+        * Null bitmap buffer:
+
+          | Byte 0 (validity bitmap) | Bytes 1-63            |
+          |--------------------------|-----------------------|
+          | 00100100                 | 0 (padding)           |
+
+        * Offsets buffer (int32)
+
+          | Bytes 0-3  | Bytes 4-7   | Bytes 8-11  | Bytes 12-15 | Bytes 16-19 | Bytes 20-23 | Bytes 24-27 | Bytes 28-63 |
+          |------------|-------------|-------------|-------------|-------------|-------------|-------------|-------------|
+          | 0          | 0           | 0           | 3           | 3           | 3           | 7           | unspecified |
+
+        * Values array (char array):
+          * Length: 7,  Null count: 0
+          * Null bitmap buffer: Not required
+
+            | Bytes 0-7  | Bytes 8-63            |
+            |------------|-----------------------|
+            | joemark    | unspecified (padding) |
+
+Note that nested types in a sparse union must be internally consistent
+(e.g. see the List in the diagram), i.e. random access at any index j
+on any child array will not cause an error.
+In other words, the array for the nested type must be valid if it is
+reinterpreted as a non-nested array.
+
+Similar to structs, a particular child array may have a non-null slot
+even if the null bitmap of the parent union array indicates the slot is
+null.  Additionally, a child array may have a non-null slot even if
+the types array indicates that a slot contains a different type at the index.
+
+Dictionary encoding
+-------------------
+
+When a field is dictionary encoded, the values are represented by an array of
+Int32 values representing the index of the value in the dictionary.  The
+dictionary is received as one or more DictionaryBatches with the id referenced
+by a dictionary attribute defined in the metadata (Schema.fbs) in the Field
+table.  The dictionary has the same layout as the type of the field would
+dictate. Each entry in the dictionary can be accessed by its index in the
+DictionaryBatches.  When a Schema references a dictionary id, it must send at
+least one DictionaryBatch for this id.
+
+As an example, you could have the following data: ::
+
+    type: List<String>
+
+    [
+     ['a', 'b'],
+     ['a', 'b'],
+     ['a', 'b'],
+     ['c', 'd', 'e'],
+     ['c', 'd', 'e'],
+     ['c', 'd', 'e'],
+     ['c', 'd', 'e'],
+     ['a', 'b']
+    ]
+
+In dictionary-encoded form, this could appear as: ::
+
+    data List<String> (dictionary-encoded, dictionary id i)
+    indices: [0, 0, 0, 1, 1, 1, 1, 0]
+
+    dictionary i
+
+    type: List<String>
+
+    [
+     ['a', 'b'],
+     ['c', 'd', 'e'],
+    ]
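+
+Decoding is then a simple indirection through the dictionary; a sketch using
+the example above: ::
+
+    dictionary = [['a', 'b'], ['c', 'd', 'e']]
+    indices = [0, 0, 0, 1, 1, 1, 1, 0]
+
+    decoded = [dictionary[i] for i in indices]
+    # -> the original eight List<String> values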
+
+References
+----------
+
+Apache Drill Documentation - `Value Vectors`_
+
+.. _least-significant bit (LSB) numbering: https://en.wikipedia.org/wiki/Bit_numbering
+.. _Intel performance guide: https://software.intel.com/en-us/articles/practical-intel-avx-optimization-on-2nd-generation-intel-core-processors
+.. _Endianness: https://en.wikipedia.org/wiki/Endianness
+.. _SIMD: https://software.intel.com/en-us/node/600110
+.. _Parquet: https://parquet.apache.org/documentation/latest/
+.. _Value Vectors: https://drill.apache.org/docs/value-vectors/

http://git-wip-us.apache.org/repos/asf/arrow-site/blob/62ef7145/docs/latest/_sources/format/Metadata.rst.txt
----------------------------------------------------------------------
diff --git a/docs/latest/_sources/format/Metadata.rst.txt b/docs/latest/_sources/format/Metadata.rst.txt
new file mode 100644
index 0000000..293d011
--- /dev/null
+++ b/docs/latest/_sources/format/Metadata.rst.txt
@@ -0,0 +1,396 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+Metadata: Logical types, schemas, data headers
+==============================================
+
+This is documentation for the Arrow metadata specification, which enables
+systems to communicate the
+
+* Logical array types (which are implemented using the physical memory layouts
+  specified in :doc:`Layout`)
+
+* Schemas for table-like collections of Arrow data structures
+
+* "Data headers" indicating the physical locations of memory buffers sufficient
+  to reconstruct Arrow data structures without copying memory.
+
+Canonical implementation
+------------------------
+
+We are using `Flatbuffers`_ for low-overhead reading and writing of the Arrow
+metadata. See ``Message.fbs``.
+
+Schemas
+-------
+
+The ``Schema`` type describes a table-like structure consisting of any number of
+Arrow arrays, each of which can be interpreted as a column in the table. A
+schema by itself does not describe the physical structure of any particular set
+of data.
+
+A schema consists of a sequence of **fields**, which are metadata describing
+the columns. The Flatbuffers IDL for a field is: ::
+
+    table Field {
+      // Name is not required, e.g. in a List
+      name: string;
+      nullable: bool;
+      type: Type;
+
+      // Present only if the field is dictionary encoded
+      dictionary: DictionaryEncoding;
+
+      // children apply only to Nested data types like Struct, List and Union
+      children: [Field];
+
+      // User-defined metadata
+      custom_metadata: [ KeyValue ];
+    }
+
+The ``type`` is the logical type of the field. Nested types, such as List,
+Struct, and Union, have a sequence of child fields.
+
+A JSON representation of the schema is also provided:
+
+Field: ::
+
+    {
+      "name" : "name_of_the_field",
+      "nullable" : false,
+      "type" : /* Type */,
+      "children" : [ /* Field */ ],
+    }
+
+Type: ::
+
+    {
+      "name" : "null|struct|list|union|int|floatingpoint|utf8|binary|fixedsizebinary|bool|decimal|date|time|timestamp|interval"
+      // fields as defined in the Flatbuffer depending on the type name
+    }
+
+Union: ::
+
+    {
+      "name" : "union",
+      "mode" : "Sparse|Dense",
+      "typeIds" : [ /* integer */ ]
+    }
+
+The ``typeIds`` field in the Union contains the codes used to denote each type,
+which may be different from the index of the child array. This is so that the
+union type ids do not have to be enumerated from 0.
+
+Int: ::
+
+    {
+      "name" : "int",
+      "bitWidth" : /* integer */,
+      "isSigned" : /* boolean */
+    }
+
+FloatingPoint: ::
+
+    {
+      "name" : "floatingpoint",
+      "precision" : "HALF|SINGLE|DOUBLE"
+    }
+
+Decimal: ::
+
+    {
+      "name" : "decimal",
+      "precision" : /* integer */,
+      "scale" : /* integer */
+    }
+
+Timestamp: ::
+
+    {
+      "name" : "timestamp",
+      "unit" : "SECOND|MILLISECOND|MICROSECOND|NANOSECOND"
+    }
+
+Date: ::
+
+    {
+      "name" : "date",
+      "unit" : "DAY|MILLISECOND"
+    }
+
+Time: ::
+
+    {
+      "name" : "time",
+      "unit" : "SECOND|MILLISECOND|MICROSECOND|NANOSECOND",
+      "bitWidth": /* integer: 32 or 64 */
+    }
+
+Interval: ::
+
+    {
+      "name" : "interval",
+      "unit" : "YEAR_MONTH|DAY_TIME"
+    }
+
+Schema: ::
+
+    {
+      "fields" : [
+        /* Field */
+      ]
+    }
+
+Record data headers
+-------------------
+
+A record batch is a collection of top-level named, equal length Arrow arrays
+(or vectors). If one of the arrays contains nested data, its child arrays are
+not required to be the same length as the top-level arrays.
+
+A record batch can be thought of as a realization of a particular schema. The
+metadata describing a particular record batch is called a "data header". Here
+is the Flatbuffers IDL for a record batch data header: ::
+
+    table RecordBatch {
+      length: long;
+      nodes: [FieldNode];
+      buffers: [Buffer];
+    }
+
+The ``RecordBatch`` metadata provides for record batches with length exceeding
+2 :sup:`31` - 1, but Arrow implementations are not required to implement support
+beyond this size.
+
+The ``nodes`` and ``buffers`` fields are produced by a depth-first traversal /
+flattening of a schema (possibly containing nested types) for a given in-memory
+data set.
+
+Buffers
+~~~~~~~
+
+A buffer is metadata describing a contiguous memory region relative to some
+virtual address space. This may include:
+
+* Shared memory, e.g. a memory-mapped file
+* An RPC message received in-memory
+* Data in a file
+
+The key form of the Buffer type is: ::
+
+    struct Buffer {
+      offset: long;
+      length: long;
+    }
+
+In the context of a record batch, each field has some number of buffers
+associated with it, which are derived from their physical memory layout.
+
+Each logical type (separate from its children, if it is a nested type) has a
+deterministic number of buffers associated with it. These will be specified in
+the logical types section.
+
+Field metadata
+~~~~~~~~~~~~~~
+
+The ``FieldNode`` values contain metadata about each level in a nested type
+hierarchy. ::
+
+    struct FieldNode {
+      /// The number of value slots in the Arrow array at this level of a nested
+      /// tree
+      length: long;
+
+      /// The number of observed nulls.
+      null_count: long;
+    }
+
+The ``FieldNode`` metadata provides for fields with length exceeding 2 :sup:`31` - 1,
+but Arrow implementations are not required to implement support for large
+arrays.
+
+Flattening of nested data
+-------------------------
+
+Nested types are flattened in the record batch in depth-first order. When
+visiting each field in the nested type tree, the metadata is appended to the
+top-level ``nodes`` array and the buffers associated with that field (but not
+its children) are appended to the ``buffers`` array.
+
+For example, let's consider the schema ::
+
+    col1: Struct<a: Int32, b: List<Int64>, c: Float64>
+    col2: Utf8
+
+The flattened version of this is: ::
+
+    FieldNode 0: Struct name='col1'
+    FieldNode 1: Int32 name='a'
+    FieldNode 2: List name='b'
+    FieldNode 3: Int64 name='item'  # arbitrary
+    FieldNode 4: Float64 name='c'
+    FieldNode 5: Utf8 name='col2'
+
+For the buffers produced, we would have the following (as described in more
+detail for each type below): ::
+
+    buffer 0: field 0 validity bitmap
+
+    buffer 1: field 1 validity bitmap
+    buffer 2: field 1 values <int32_t*>
+
+    buffer 3: field 2 validity bitmap
+    buffer 4: field 2 list offsets <int32_t*>
+
+    buffer 5: field 3 validity bitmap
+    buffer 6: field 3 values <int64_t*>
+
+    buffer 7: field 4 validity bitmap
+    buffer 8: field 4 values <double*>
+
+    buffer 9: field 5 validity bitmap
+    buffer 10: field 5 offsets <int32_t*>
+    buffer 11: field 5 data <uint8_t*>
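+
+For illustration, a sketch of the traversal that produces this flattening
+(``make_field_node`` and ``own_buffers`` are hypothetical helpers that build a
+``FieldNode`` and return the buffers owned directly by a field, per its
+logical type): ::
+
+    def flatten(field, nodes, buffers):
+        # depth-first, pre-order: visit the parent before its children
+        nodes.append(make_field_node(field))  # hypothetical
+        buffers.extend(own_buffers(field))    # this field's buffers only
+        for child in field.children:
+            flatten(child, nodes, buffers)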
+
+.. _spec-logical-types:
+
+Logical types
+-------------
+
+A logical type consists of a type name and metadata along with an explicit
+mapping to a physical memory representation. These may fall into some different
+categories:
+
+* Types represented as fixed-width primitive arrays (for example: C-style
+  integers and floating point numbers)
+* Types having equivalent memory layout to a physical nested type (e.g. strings
+  use the list representation, but logically are not nested types)
+
+Integers
+~~~~~~~~
+
+In the first version of Arrow we provide the standard C integer types in
+sizes from 8-bit through 64-bit, both signed and unsigned:
+
+* Signed types: Int8, Int16, Int32, Int64
+* Unsigned types: UInt8, UInt16, UInt32, UInt64
+
+The IDL looks like: ::
+
+    table Int {
+      bitWidth: int;
+      is_signed: bool;
+    }
+
+The integer endianness is currently set globally at the schema level. If a
+schema is set to be little-endian, then all integer types occurring within must
+be little-endian. Integers that are part of other data representations, such as
+list offsets and union types, must have the same endianness as the entire
+record batch.
+
+Floating point numbers
+~~~~~~~~~~~~~~~~~~~~~~
+
+We provide 3 types of floating point numbers as fixed bit-width primitive
+arrays:
+
+- Half precision, 16-bit width
+- Single precision, 32-bit width
+- Double precision, 64-bit width
+
+The IDL looks like: ::
+
+    enum Precision:int {HALF, SINGLE, DOUBLE}
+
+    table FloatingPoint {
+      precision: Precision;
+    }
+
+Boolean
+~~~~~~~
+
+The Boolean logical type is represented as a 1-bit wide primitive physical
+type. The bits are numbered using least-significant bit (LSB) ordering.
+
+Like other fixed bit-width primitive types, boolean data appears as 2 buffers
+in the data header (one bitmap for the validity vector and one for the values).
+
+List
+~~~~
+
+The ``List`` logical type is the logical (and identically-named) counterpart to
+the List physical type.
+
+In data header form, the list field node contains 2 buffers:
+
+* Validity bitmap
+* List offsets
+
+The buffers associated with a list's child field are handled recursively
+according to the child logical type (e.g. ``List<Utf8>`` vs. ``List<Boolean>``).
+
+Utf8 and Binary
+~~~~~~~~~~~~~~~
+
+We specify two logical types for variable length bytes:
+
+* ``Utf8`` data is Unicode values with UTF-8 encoding
+* ``Binary`` is any other variable length bytes
+
+These types both have the same memory layout as the nested type ``List<UInt8>``,
+with the constraint that the inner bytes can contain no null values. From a
+logical type perspective they are primitive, not nested types.
+
+In data header form, while ``List<UInt8>`` would appear as 2 field nodes (``List``
+and ``UInt8``) and 4 buffers (2 for each of the nodes, as per above), these types
+have a simplified representation: a single field node (of ``Utf8`` or ``Binary``
+logical type, which has no children) and 3 buffers:
+
+* Validity bitmap
+* List offsets
+* Byte data
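+
+For illustration, a hedged Python sketch of extracting value ``j`` from these
+three buffers (assuming ``data`` is a bytes object and the validity bitmap
+uses the LSB numbering described in :doc:`Layout`): ::
+
+    def utf8_slot(validity, offsets, data, j):
+        if not (validity[j // 8] & (1 << (j % 8))):
+            return None  # null slot
+        return data[offsets[j]:offsets[j + 1]].decode('utf-8')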
+
+Decimal
+~~~~~~~
+
+Decimals are represented as a 2's complement 128-bit (16 byte) signed integer
+in little-endian byte order.
+
+Timestamp
+~~~~~~~~~
+
+All timestamps are stored as a 64-bit integer, with one of four unit
+resolutions: second, millisecond, microsecond, and nanosecond.
+
+Date
+~~~~
+
+We support two different date types:
+
+* Days since the UNIX epoch as a 32-bit integer
+* Milliseconds since the UNIX epoch as a 64-bit integer
+
+Time
+~~~~
+
+Time supports the same unit resolutions: second, millisecond, microsecond, and
+nanosecond. We represent time as the smallest integer accommodating the
+indicated unit. For second and millisecond: 32-bit, for the others 64-bit.
+
+Dictionary encoding
+-------------------
+
+.. _Flatbuffers: http://github.com/google/flatbuffers

http://git-wip-us.apache.org/repos/asf/arrow-site/blob/62ef7145/docs/latest/_sources/format/README.rst.txt
----------------------------------------------------------------------
diff --git a/docs/latest/_sources/format/README.rst.txt b/docs/latest/_sources/format/README.rst.txt
new file mode 100644
index 0000000..f2f770b
--- /dev/null
+++ b/docs/latest/_sources/format/README.rst.txt
@@ -0,0 +1,53 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+Arrow specification documents
+=============================
+
+Currently, the Arrow specification consists of these pieces:
+
+- Metadata specification (see :doc:`Metadata`)
+- Physical memory layout specification (see :doc:`Layout`)
+- Logical Types, Schemas, and Record Batch Metadata (see Schema.fbs)
+- Encapsulated Messages (see Message.fbs)
+- Mechanics of messaging between Arrow systems (IPC, RPC, etc.) (see :doc:`IPC`)
+- Tensor (Multi-dimensional array) Metadata (see Tensor.fbs)
+
+The metadata currently uses Google's `flatbuffers library`_ for serializing a
+couple of related pieces of information:
+
+- Schemas for tables or record (row) batches. This contains the logical types,
+  field names, and other metadata. Schemas do not contain any information about
+  actual data.
+- *Data headers* for record (row) batches. These must correspond to a known
+  schema, and enable a system to send and receive Arrow row batches in a form
+  that can be precisely disassembled or reconstructed.
+
+Arrow Format Maturity and Stability
+-----------------------------------
+
+We have made significant progress hardening the Arrow in-memory format and
+Flatbuffer metadata since the project started in February 2016. We have
+integration tests which verify binary compatibility between the Java and C++
+implementations, for example.
+
+Major versions may still include breaking changes to the memory format or
+metadata, so it is recommended to use the same released version of all
+libraries in your applications for maximum compatibility. Data stored in the
+Arrow IPC formats should not be used for long term storage.
+
+.. _flatbuffers library: http://github.com/google/flatbuffers

http://git-wip-us.apache.org/repos/asf/arrow-site/blob/62ef7145/docs/latest/_sources/index.rst.txt
----------------------------------------------------------------------
diff --git a/docs/latest/_sources/index.rst.txt b/docs/latest/_sources/index.rst.txt
new file mode 100644
index 0000000..fa6c683
--- /dev/null
+++ b/docs/latest/_sources/index.rst.txt
@@ -0,0 +1,42 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+Apache Arrow
+============
+
+Apache Arrow is a cross-language development platform for in-memory data. It
+specifies a standardized language-independent columnar memory format for flat
+and hierarchical data, organized for efficient analytic operations on modern
+hardware. It also provides computational libraries and zero-copy streaming
+messaging and interprocess communication.
+
+.. toctree::
+   :maxdepth: 1
+   :caption: Memory Format
+
+   format/README
+   format/Guidelines
+   format/Layout
+   format/Metadata
+   format/IPC
+
+.. toctree::
+   :maxdepth: 2
+   :caption: Languages
+
+   cpp/index
+   python/index

http://git-wip-us.apache.org/repos/asf/arrow-site/blob/62ef7145/docs/latest/_sources/python/api.rst.txt
----------------------------------------------------------------------
diff --git a/docs/latest/_sources/python/api.rst.txt b/docs/latest/_sources/python/api.rst.txt
new file mode 100644
index 0000000..0bad76f
--- /dev/null
+++ b/docs/latest/_sources/python/api.rst.txt
@@ -0,0 +1,399 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. currentmodule:: pyarrow
+.. _api:
+
+*************
+API Reference
+*************
+
+.. _api.types:
+
+Type and Schema Factory Functions
+---------------------------------
+
+.. autosummary::
+   :toctree: generated/
+
+   null
+   bool_
+   int8
+   int16
+   int32
+   int64
+   uint8
+   uint16
+   uint32
+   uint64
+   float16
+   float32
+   float64
+   time32
+   time64
+   timestamp
+   date32
+   date64
+   binary
+   string
+   utf8
+   decimal128
+   list_
+   struct
+   dictionary
+   field
+   schema
+   from_numpy_dtype
+
+.. currentmodule:: pyarrow.types
+.. _api.types.checking:
+
+Type checking functions
+-----------------------
+
+.. autosummary::
+   :toctree: generated/
+
+   is_boolean
+   is_integer
+   is_signed_integer
+   is_unsigned_integer
+   is_int8
+   is_int16
+   is_int32
+   is_int64
+   is_uint8
+   is_uint16
+   is_uint32
+   is_uint64
+   is_floating
+   is_float16
+   is_float32
+   is_float64
+   is_decimal
+   is_list
+   is_struct
+   is_union
+   is_nested
+   is_temporal
+   is_timestamp
+   is_date
+   is_date32
+   is_date64
+   is_time
+   is_time32
+   is_time64
+   is_null
+   is_binary
+   is_unicode
+   is_string
+   is_fixed_size_binary
+   is_map
+   is_dictionary
+
+.. currentmodule:: pyarrow
+
+.. _api.value:
+
+Scalar Value Types
+------------------
+
+.. autosummary::
+   :toctree: generated/
+
+   NA
+   Scalar
+   ArrayValue
+   BooleanValue
+   Int8Value
+   Int16Value
+   Int32Value
+   Int64Value
+   UInt8Value
+   UInt16Value
+   UInt32Value
+   UInt64Value
+   FloatValue
+   DoubleValue
+   ListValue
+   BinaryValue
+   StringValue
+   FixedSizeBinaryValue
+   Date32Value
+   Date64Value
+   TimestampValue
+   DecimalValue
+
+.. _api.array:
+
+.. currentmodule:: pyarrow
+
+Array Types
+-----------
+
+.. autosummary::
+   :toctree: generated/
+
+   array
+   Array
+   BooleanArray
+   DictionaryArray
+   FloatingPointArray
+   IntegerArray
+   Int8Array
+   Int16Array
+   Int32Array
+   Int64Array
+   NullArray
+   NumericArray
+   UInt8Array
+   UInt16Array
+   UInt32Array
+   UInt64Array
+   BinaryArray
+   FixedSizeBinaryArray
+   StringArray
+   Time32Array
+   Time64Array
+   Date32Array
+   Date64Array
+   TimestampArray
+   Decimal128Array
+   ListArray
+
+.. _api.table:
+
+.. currentmodule:: pyarrow
+
+Tables and Record Batches
+-------------------------
+
+.. autosummary::
+   :toctree: generated/
+
+   column
+   chunked_array
+   concat_tables
+   ChunkedArray
+   Column
+   RecordBatch
+   Table
+
+.. _api.tensor:
+
+Tensor type and Functions
+-------------------------
+
+.. autosummary::
+   :toctree: generated/
+
+   Tensor
+
+.. _api.io:
+
+In-Memory Buffers
+-----------------
+
+.. autosummary::
+   :toctree: generated/
+
+   allocate_buffer
+   compress
+   decompress
+   py_buffer
+   foreign_buffer
+   Buffer
+   ResizableBuffer
+
+Input / Output and Shared Memory
+--------------------------------
+
+.. autosummary::
+   :toctree: generated/
+
+   input_stream
+   output_stream
+   BufferReader
+   BufferOutputStream
+   FixedSizeBufferWriter
+   NativeFile
+   OSFile
+   MemoryMappedFile
+   CompressedInputStream
+   CompressedOutputStream
+   memory_map
+   create_memory_map
+   PythonFile
+
+File Systems
+------------
+
+.. autosummary::
+   :toctree: generated/
+
+   hdfs.connect
+   LocalFileSystem
+
+.. class:: HadoopFileSystem
+   :noindex:
+
+.. _api.ipc:
+
+Serialization and IPC
+---------------------
+
+.. autosummary::
+   :toctree: generated/
+
+   ipc.open_file
+   ipc.open_stream
+   Message
+   MessageReader
+   RecordBatchFileReader
+   RecordBatchFileWriter
+   RecordBatchStreamReader
+   RecordBatchStreamWriter
+   read_message
+   read_record_batch
+   get_record_batch_size
+   read_tensor
+   write_tensor
+   get_tensor_size
+   serialize
+   serialize_to
+   deserialize
+   deserialize_components
+   deserialize_from
+   read_serialized
+   SerializedPyObject
+   SerializationContext
+
+.. _api.memory_pool:
+
+Memory Pools
+------------
+
+.. currentmodule:: pyarrow
+
+.. autosummary::
+   :toctree: generated/
+
+   MemoryPool
+   default_memory_pool
+   total_allocated_bytes
+   set_memory_pool
+   log_memory_allocations
+
+.. _api.type_classes:
+
+.. currentmodule:: pyarrow
+
+Type Classes
+------------
+
+.. autosummary::
+   :toctree: generated/
+
+   DataType
+   Field
+   Schema
+
+.. currentmodule:: pyarrow.plasma
+
+.. _api.plasma:
+
+Plasma In-Memory Object Store
+-----------------------------
+
+.. autosummary::
+   :toctree: generated/
+
+   ObjectID
+   PlasmaClient
+   PlasmaBuffer
+
+.. currentmodule:: pyarrow.csv
+
+.. _api.csv:
+
+CSV Files
+---------
+
+.. autosummary::
+   :toctree: generated/
+
+   ReadOptions
+   ParseOptions
+   ConvertOptions
+   read_csv
+
+.. _api.feather:
+
+Feather Files
+-------------
+
+.. currentmodule:: pyarrow.feather
+
+.. autosummary::
+   :toctree: generated/
+
+   read_feather
+   write_feather
+
+.. currentmodule:: pyarrow
+
+.. _api.parquet:
+
+Parquet Files
+-------------
+
+.. currentmodule:: pyarrow.parquet
+
+.. autosummary::
+   :toctree: generated/
+
+   ParquetDataset
+   ParquetFile
+   ParquetWriter
+   read_table
+   read_metadata
+   read_pandas
+   read_schema
+   write_metadata
+   write_table
+   write_to_dataset
+
+.. currentmodule:: pyarrow
+
+Multi-Threading
+---------------
+
+.. autosummary::
+   :toctree: generated/
+
+   cpu_count
+   set_cpu_count
+
+Using with C extensions
+-----------------------
+
+.. autosummary::
+   :toctree: generated/
+
+   get_include
+   get_libraries
+   get_library_dirs

http://git-wip-us.apache.org/repos/asf/arrow-site/blob/62ef7145/docs/latest/_sources/python/benchmarks.rst.txt
----------------------------------------------------------------------
diff --git a/docs/latest/_sources/python/benchmarks.rst.txt b/docs/latest/_sources/python/benchmarks.rst.txt
new file mode 100644
index 0000000..6c3144a
--- /dev/null
+++ b/docs/latest/_sources/python/benchmarks.rst.txt
@@ -0,0 +1,53 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+Benchmarks
+==========
+
+The ``pyarrow`` package comes with a suite of benchmarks meant to
+be run with `asv`_, which is also their only requirement.  You'll need
+to install the ``asv`` package first (``pip install asv`` or
+``conda install -c conda-forge asv``).
+
+Running the benchmarks
+----------------------
+
+To run the benchmarks, call ``asv run --python=same``. You cannot use the
+plain ``asv run`` command at the moment as asv cannot handle python packages
+in subdirectories of a repository.
+
+Running with arbitrary revisions
+--------------------------------
+
+ASV allows you to store results and generate graphs of the benchmarks
+over the project's evolution.  For this you need the latest development
+version of ASV:
+
+.. code::
+
+    pip install git+https://github.com/airspeed-velocity/asv
+
+Now you should be ready to run ``asv run`` or whatever other command
+suits your needs.
+
+Compatibility
+-------------
+
+We only expect the benchmarking setup to work with Python 3.6 or later,
+on a Unix-like system.
+
+.. _asv: https://asv.readthedocs.org/

http://git-wip-us.apache.org/repos/asf/arrow-site/blob/62ef7145/docs/latest/_sources/python/csv.rst.txt
----------------------------------------------------------------------
diff --git a/docs/latest/_sources/python/csv.rst.txt b/docs/latest/_sources/python/csv.rst.txt
new file mode 100644
index 0000000..17023b1
--- /dev/null
+++ b/docs/latest/_sources/python/csv.rst.txt
@@ -0,0 +1,92 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. currentmodule:: pyarrow.csv
+.. _csv:
+
+Reading CSV files
+=================
+
+Arrow provides preliminary support for reading data from CSV files.
+The features currently offered are the following:
+
+* multi-threaded or single-threaded reading
+* automatic decompression of input files (based on the filename extension,
+  such as ``my_data.csv.gz``)
+* fetching column names from the first row in the CSV file
+* column-wise type inference and conversion to one of ``null``, ``int64``,
+  ``float64``, ``timestamp[s]``, ``string`` or ``binary`` data
+* detecting various spellings of null values such as ``NaN`` or ``#N/A``
+
+Usage
+-----
+
+CSV reading functionality is available through the :mod:`pyarrow.csv` module.
+In many cases, you will simply call the :func:`read_csv` function
+with the file path you want to read from::
+
+   >>> from pyarrow import csv
+   >>> fn = 'tips.csv.gz'
+   >>> table = csv.read_csv(fn)
+   >>> table
+   pyarrow.Table
+   total_bill: double
+   tip: double
+   sex: string
+   smoker: string
+   day: string
+   time: string
+   size: int64
+   >>> len(table)
+   244
+   >>> df = table.to_pandas()
+   >>> df.head()
+      total_bill   tip     sex smoker  day    time  size
+   0       16.99  1.01  Female     No  Sun  Dinner     2
+   1       10.34  1.66    Male     No  Sun  Dinner     3
+   2       21.01  3.50    Male     No  Sun  Dinner     3
+   3       23.68  3.31    Male     No  Sun  Dinner     2
+   4       24.59  3.61  Female     No  Sun  Dinner     4
+
+Customized parsing
+------------------
+
+To alter the default parsing settings when reading CSV files with an
+unusual structure, you should create a :class:`ParseOptions` instance
+and pass it to :func:`read_csv`.
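+
+For example, to read a semicolon-delimited file (a minimal sketch; the
+file name ``'data.csv'`` is hypothetical)::
+
+   >>> from pyarrow import csv
+   >>> parse_options = csv.ParseOptions(delimiter=';')
+   >>> table = csv.read_csv('data.csv', parse_options=parse_options)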
+
+Customized conversion
+---------------------
+
+To alter how CSV data is converted to Arrow types and data, you should create
+a :class:`ConvertOptions` instance and pass it to :func:`read_csv`.
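+
+For instance, to override type inference for a given column (a minimal
+sketch; the ``column_types`` option and the column name ``'size'`` are
+assumptions)::
+
+   >>> import pyarrow as pa
+   >>> from pyarrow import csv
+   >>> convert_options = csv.ConvertOptions(column_types={'size': pa.int32()})
+   >>> table = csv.read_csv('data.csv', convert_options=convert_options)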
+
+Performance
+-----------
+
+Due to the structure of CSV files, one cannot expect the same levels of
+performance as when reading dedicated binary formats like
+:ref:`Parquet <Parquet>`.  Nevertheless, Arrow strives to reduce the
+overhead of reading CSV files.
+
+Performance options can be controlled through the :class:`ReadOptions` class.
+Multi-threaded reading is the default for highest performance, distributing
+the workload efficiently over all available cores.
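+
+For example, to force single-threaded reading (a minimal sketch; the
+``use_threads`` option is an assumption)::
+
+   >>> from pyarrow import csv
+   >>> read_options = csv.ReadOptions(use_threads=False)
+   >>> table = csv.read_csv('data.csv', read_options=read_options)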
+
+.. note::
+   The number of threads to use concurrently is automatically inferred by Arrow
+   and can be inspected using the :func:`~pyarrow.cpu_count()` function.

http://git-wip-us.apache.org/repos/asf/arrow-site/blob/62ef7145/docs/latest/_sources/python/data.rst.txt
----------------------------------------------------------------------
diff --git a/docs/latest/_sources/python/data.rst.txt b/docs/latest/_sources/python/data.rst.txt
new file mode 100644
index 0000000..3260f6d
--- /dev/null
+++ b/docs/latest/_sources/python/data.rst.txt
@@ -0,0 +1,434 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. currentmodule:: pyarrow
+.. _data:
+
+Data Types and In-Memory Data Model
+===================================
+
+Apache Arrow defines columnar array data structures by composing type metadata
+with memory buffers, like the ones explained in the documentation on
+:ref:`Memory and IO <io>`. These data structures are exposed in Python through
+a series of interrelated classes:
+
+* **Type Metadata**: Instances of ``pyarrow.DataType``, which describe a logical
+  array type
+* **Schemas**: Instances of ``pyarrow.Schema``, which describe a named
+  collection of types. These can be thought of as the column types in a
+  table-like object.
+* **Arrays**: Instances of ``pyarrow.Array``, which are atomic, contiguous
+  columnar data structures composed from Arrow Buffer objects
+* **Record Batches**: Instances of ``pyarrow.RecordBatch``, which are a
+  collection of Array objects with a particular Schema
+* **Tables**: Instances of ``pyarrow.Table``, a logical table data structure in
+  which each column consists of one or more ``pyarrow.Array`` objects of the
+  same type.
+
+We will examine these in the sections below in a series of examples.
+
+.. _data.types:
+
+Type Metadata
+-------------
+
+Apache Arrow defines language agnostic column-oriented data structures for
+array data. These include:
+
+* **Fixed-length primitive types**: numbers, booleans, date and times, fixed
+  size binary, decimals, and other values that fit into a given number of bits
+* **Variable-length primitive types**: binary, string
+* **Nested types**: list, struct, and union
+* **Dictionary type**: An encoded categorical type (more on this later)
+
+Each logical data type in Arrow has a corresponding factory function for
+creating an instance of that type object in Python:
+
+.. ipython:: python
+
+   import pyarrow as pa
+   t1 = pa.int32()
+   t2 = pa.string()
+   t3 = pa.binary()
+   t4 = pa.binary(10)
+   t5 = pa.timestamp('ms')
+
+   t1
+   print(t1)
+   print(t4)
+   print(t5)
+
+We use the name **logical type** because the **physical** storage may be the
+same for one or more types. For example, ``int64``, ``float64``, and
+``timestamp[ms]`` all occupy 64 bits per value.
+
+These objects are `metadata`; they are used for describing the data in arrays,
+schemas, and record batches. In Python, they can be used in functions where the
+input data (e.g. Python objects) may be coerced to more than one Arrow type.
+
+The :class:`~pyarrow.Field` type is a type plus a name and optional
+user-defined metadata:
+
+.. ipython:: python
+
+   f0 = pa.field('int32_field', t1)
+   f0
+   f0.name
+   f0.type
+
+Arrow supports **nested value types** like list, struct, and union. When
+creating these, you must pass types or fields to indicate the data types of the
+types' children. For example, we can define a list of int32 values with:
+
+.. ipython:: python
+
+   t6 = pa.list_(t1)
+   t6
+
+A `struct` is a collection of named fields:
+
+.. ipython:: python
+
+   fields = [
+       pa.field('s0', t1),
+       pa.field('s1', t2),
+       pa.field('s2', t4),
+       pa.field('s3', t6),
+   ]
+
+   t7 = pa.struct(fields)
+   print(t7)
+
+For convenience, you can pass ``(name, type)`` tuples directly instead of
+:class:`~pyarrow.Field` instances:
+
+.. ipython:: python
+
+   t8 = pa.struct([('s0', t1), ('s1', t2), ('s2', t4), ('s3', t6)])
+   print(t8)
+   t8 == t7
+
+
+See :ref:`Data Types API <api.types>` for a full listing of data type
+functions.
+
+.. _data.schema:
+
+Schemas
+-------
+
+The :class:`~pyarrow.Schema` type is similar to the ``struct`` array type; it
+defines the column names and types in a record batch or table data
+structure. The :func:`pyarrow.schema` factory function makes new Schema objects in
+Python:
+
+.. ipython:: python
+
+   my_schema = pa.schema([('field0', t1),
+                          ('field1', t2),
+                          ('field2', t4),
+                          ('field3', t6)])
+   my_schema
+
+In some applications, you will not create schemas directly, but instead
+use the ones embedded in :ref:`IPC messages <ipc>`.
+
+.. _data.array:
+
+Arrays
+------
+
+For each data type, there is an accompanying array data structure for holding
+memory buffers that define a single contiguous chunk of columnar array
+data. When you are using PyArrow, this data may come from IPC tools, though it
+can also be created from various types of Python sequences (lists, NumPy
+arrays, pandas data).
+
+A simple way to create arrays is with ``pyarrow.array``, which is similar to
+the ``numpy.array`` function.  By default PyArrow will infer the data type
+for you:
+
+.. ipython:: python
+
+   arr = pa.array([1, 2, None, 3])
+   arr
+
+But you may also pass a specific data type to override type inference:
+
+.. ipython:: python
+
+   pa.array([1, 2], type=pa.uint16())
+
+The array's ``type`` attribute is the corresponding piece of type metadata:
+
+.. ipython:: python
+
+   arr.type
+
+Each in-memory array has a known length and null count (which will be 0 if
+there are no null values):
+
+.. ipython:: python
+
+   len(arr)
+   arr.null_count
+
+Scalar values can be selected with normal indexing.  ``pyarrow.array`` converts
+``None`` values to Arrow nulls; we return the special ``pyarrow.NA`` value for
+nulls:
+
+.. ipython:: python
+
+   arr[0]
+   arr[2]
+
+Arrow data is immutable, so values can be selected but not assigned.
+
+Arrays can be sliced without copying:
+
+.. ipython:: python
+
+   arr[1:3]
+
+None values and NaN handling
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+As mentioned in the above section, the Python object ``None`` is always
+converted to an Arrow null element when converting to ``pyarrow.Array``. The
+float NaN value, represented by either the Python object ``float('nan')`` or
+``numpy.nan``, is normally converted to a *valid* float value during the
+conversion. If integer input containing ``np.nan`` is supplied to
+``pyarrow.array``, a ``ValueError`` is raised.
+
+For better compatibility with Pandas, we support interpreting NaN values as
+null elements. This is enabled automatically in all ``from_pandas`` functions
+and can be enabled in the other conversion functions by passing
+``from_pandas=True`` as a function parameter.
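+
+For example (a minimal illustrative sketch, assuming NumPy is importable
+as ``np``):
+
+.. ipython:: python
+
+   import numpy as np
+   # by default, NaN is kept as a valid float value
+   pa.array([1.0, np.nan])
+   # with from_pandas=True, NaN is interpreted as a null element
+   pa.array([1.0, np.nan], from_pandas=True)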
+
+List arrays
+~~~~~~~~~~~
+
+``pyarrow.array`` is able to infer the type of simple nested data structures
+like lists:
+
+.. ipython:: python
+
+   nested_arr = pa.array([[], None, [1, 2], [None, 1]])
+   print(nested_arr.type)
+
+Struct arrays
+~~~~~~~~~~~~~
+
+For other kinds of nested arrays, such as struct arrays, you currently need
+to pass the type explicitly.  Struct arrays can be initialized from a
+sequence of Python dicts or tuples:
+
+.. ipython:: python
+
+   ty = pa.struct([('x', pa.int8()),
+                   ('y', pa.bool_())])
+   pa.array([{'x': 1, 'y': True}, {'x': 2, 'y': False}], type=ty)
+   pa.array([(3, True), (4, False)], type=ty)
+
+When initializing a struct array, nulls are allowed both at the struct
+level and at the individual field level.  If initializing from a sequence
+of Python dicts, a missing dict key is handled as a null value:
+
+.. ipython:: python
+
+   pa.array([{'x': 1}, None, {'y': None}], type=ty)
+
+You can also construct a struct array from existing arrays for each of the
+struct's components.  In this case, data storage will be shared with the
+individual arrays, and no copy is involved:
+
+.. ipython:: python
+
+   xs = pa.array([5, 6, 7], type=pa.int16())
+   ys = pa.array([False, True, True])
+   arr = pa.StructArray.from_arrays((xs, ys), names=('x', 'y'))
+   arr.type
+   arr
+
+Union arrays
+~~~~~~~~~~~~
+
+The union type represents a nested array type where each value can be one
+(and only one) of a set of possible types.  There are two possible
+storage types for union arrays: sparse and dense.
+
+In a sparse union array, each of the child arrays has the same length
+as the resulting union array.  They are accompanied by an ``int8`` "types"
+array that tells, for each value, from which child array it must be
+selected:
+
+.. ipython:: python
+
+   xs = pa.array([5, 6, 7])
+   ys = pa.array([False, False, True])
+   types = pa.array([0, 1, 1], type=pa.int8())
+   union_arr = pa.UnionArray.from_sparse(types, [xs, ys])
+   union_arr.type
+   union_arr
+
+In a dense union array, you also pass, in addition to the ``int8`` "types"
+array, an ``int32`` "offsets" array that tells, for each value, at
+which offset in the selected child array it can be found:
+
+.. ipython:: python
+
+   xs = pa.array([5, 6, 7])
+   ys = pa.array([False, True])
+   types = pa.array([0, 1, 1, 0, 0], type=pa.int8())
+   offsets = pa.array([0, 0, 1, 1, 2], type=pa.int32())
+   union_arr = pa.UnionArray.from_dense(types, offsets, [xs, ys])
+   union_arr.type
+   union_arr
+
+
+Dictionary Arrays
+~~~~~~~~~~~~~~~~~
+
+The **Dictionary** type in PyArrow is a special array type that is similar to a
+factor in R or a ``pandas.Categorical``. It enables one or more record batches
+in a file or stream to transmit integer *indices* referencing a shared
+**dictionary** containing the distinct values in the logical array. This is
+often used with strings to save memory and improve performance.
+
+The way that dictionaries are handled in the Apache Arrow format and the way
+they appear in C++ and Python is slightly different. We define a special
+:class:`~.DictionaryArray` type with a corresponding dictionary type. Let's
+consider an example:
+
+.. ipython:: python
+
+   indices = pa.array([0, 1, 0, 1, 2, 0, None, 2])
+   dictionary = pa.array(['foo', 'bar', 'baz'])
+
+   dict_array = pa.DictionaryArray.from_arrays(indices, dictionary)
+   dict_array
+
+Here we have:
+
+.. ipython:: python
+
+   print(dict_array.type)
+   dict_array.indices
+   dict_array.dictionary
+
+When using :class:`~.DictionaryArray` with pandas, the analogue is
+``pandas.Categorical`` (more on this later):
+
+.. ipython:: python
+
+   dict_array.to_pandas()
+
+.. _data.record_batch:
+
+Record Batches
+--------------
+
+A **Record Batch** in Apache Arrow is a collection of equal-length array
+instances. Let's consider a collection of arrays:
+
+.. ipython:: python
+
+   data = [
+       pa.array([1, 2, 3, 4]),
+       pa.array(['foo', 'bar', 'baz', None]),
+       pa.array([True, None, False, True])
+   ]
+
+A record batch can be created from this list of arrays using
+``RecordBatch.from_arrays``:
+
+.. ipython:: python
+
+   batch = pa.RecordBatch.from_arrays(data, ['f0', 'f1', 'f2'])
+   batch.num_columns
+   batch.num_rows
+   batch.schema
+
+   batch[1]
+
+A record batch can be sliced without copying memory like an array:
+
+.. ipython:: python
+
+   batch2 = batch.slice(1, 3)
+   batch2[1]
+
+.. _data.table:
+
+Tables
+------
+
+The PyArrow :class:`~.Table` type is not part of the Apache Arrow
+specification, but is rather a tool to help with wrangling multiple record
+batches and array pieces as a single logical dataset. As a relevant example, we
+may receive multiple small record batches in a socket stream, then need to
+concatenate them into contiguous memory for use in NumPy or pandas. The Table
+object makes this efficient without requiring additional memory copying.
+
+Considering the record batch we created above, we can create a Table containing
+one or more copies of the batch using ``Table.from_batches``:
+
+.. ipython:: python
+
+   batches = [batch] * 5
+   table = pa.Table.from_batches(batches)
+   table
+   table.num_rows
+
+The table's columns are instances of :class:`~.Column`, which is a container
+for one or more arrays of the same type.
+
+.. ipython:: python
+
+   c = table[0]
+   c
+   c.data
+   c.data.num_chunks
+   c.data.chunk(0)
+
+As you'll see in the :ref:`pandas section <pandas_interop>`, we can convert
+these objects to contiguous NumPy arrays for use in pandas:
+
+.. ipython:: python
+
+   c.to_pandas()
+
+Multiple tables can also be concatenated together to form a single table using
+``pyarrow.concat_tables``, if the schemas are equal:
+
+.. ipython:: python
+
+   tables = [table] * 2
+   table_all = pa.concat_tables(tables)
+   table_all.num_rows
+   c = table_all[0]
+   c.data.num_chunks
+
+This is similar to ``Table.from_batches``, but uses tables as input instead of
+record batches. Record batches can be made into tables, but not the other way
+around, so if your data is already in table form, then use
+``pyarrow.concat_tables``.
+
+Custom Schema and Field Metadata
+--------------------------------
+
+TODO