Posted to commits@arrow.apache.org by we...@apache.org on 2016/06/01 01:00:28 UTC
arrow git commit: [Doc] Update Layout.md
Repository: arrow
Updated Branches:
refs/heads/master cd1d770ed -> c8b807881
[Doc] Update Layout.md
For clarity, added references to the official SIMD documentation, to a
description of endianness, and to Parquet. Used Markdown syntax for the
exponent to document the size of the arrays.
Closes PR #82.
Project: http://git-wip-us.apache.org/repos/asf/arrow/repo
Commit: http://git-wip-us.apache.org/repos/asf/arrow/commit/c8b80788
Tree: http://git-wip-us.apache.org/repos/asf/arrow/tree/c8b80788
Diff: http://git-wip-us.apache.org/repos/asf/arrow/diff/c8b80788
Branch: refs/heads/master
Commit: c8b8078810be1d703c0261859b0862d574384600
Parents: cd1d770
Author: Edmon Begoli <eb...@gmail.com>
Authored: Sat May 28 19:11:47 2016 -0400
Committer: Wes McKinney <we...@apache.org>
Committed: Tue May 31 18:00:08 2016 -0700
----------------------------------------------------------------------
format/Layout.md | 18 +++++++++++-------
1 file changed, 11 insertions(+), 7 deletions(-)
----------------------------------------------------------------------
http://git-wip-us.apache.org/repos/asf/arrow/blob/c8b80788/format/Layout.md
----------------------------------------------------------------------
diff --git a/format/Layout.md b/format/Layout.md
index 34eade3..9de0479 100644
--- a/format/Layout.md
+++ b/format/Layout.md
@@ -41,7 +41,7 @@ Base requirements
proprietary systems that utilize the open source components.
* All array slots are accessible in constant time, with complexity growing
linearly in the nesting level
-* Capable of representing fully-materialized and decoded / decompressed Parquet
+* Capable of representing fully-materialized and decoded / decompressed [Parquet][5]
data
* All contiguous memory buffers are aligned at 64-byte boundaries and padded to a multiple of 64 bytes.
* Any relative type can have null slots
@@ -76,7 +76,7 @@ Base requirements
* Any memory management or reference counting subsystem
* To enumerate or specify types of encodings or compression support
-## Byte Order (Endianness)
+## Byte Order ([Endianness][3])
The Arrow format is little endian.
@@ -91,7 +91,7 @@ requirement follows best practices for optimized memory access:
* 64 byte alignment is recommended by the [Intel performance guide][2] for
data-structures over 64 bytes (which will be a common case for Arrow Arrays).
-Requiring padding to a multiple of 64 bytes allows for using SIMD instructions
+Requiring padding to a multiple of 64 bytes allows for using [SIMD][4] instructions
consistently in loops without additional conditional checks.
This should allow for simpler and more efficient code.
The specific padding length was chosen because it matches the largest known
@@ -105,13 +105,13 @@ Unless otherwise noted, padded bytes do not need to have a specific value.
## Array lengths
Any array has a known and fixed length, stored as a 32-bit signed integer, so a
-maximum of 2^31 - 1 elements. We choose a signed int32 for a couple reasons:
+maximum of 2<sup>31</sup> - 1 elements. We choose a signed int32 for a couple reasons:
* Enhance compatibility with Java and client languages which may have varying
quality of support for unsigned integers.
* To encourage developers to compose smaller arrays (each of which contains
contiguous memory in its leaf nodes) to create larger array structures
- possibly exceeding 2^31 - 1 elements, as opposed to allocating very large
+ possibly exceeding 2<sup>31</sup> - 1 elements, as opposed to allocating very large
contiguous memory blocks.
## Null count
@@ -238,7 +238,7 @@ A list-array is represented by the combination of the following:
* A values array, a child array of type T. T may also be a nested type.
* An offsets buffer containing 32-bit signed integers with length equal to the
length of the top-level array plus one. Note that this limits the size of the
- values array to 2^31 -1.
+ values array to 2<sup>31</sup>-1.
The offsets array encodes a start position in the values array, and the length
of the value in each slot is computed using the first difference with the next
@@ -578,7 +578,11 @@ the the types array indicates that a slot contains a different type at the index
## References
-Drill docs https://drill.apache.org/docs/value-vectors/
+Apache Drill Documentation - [Value Vectors][6]
[1]: https://en.wikipedia.org/wiki/Bit_numbering
[2]: https://software.intel.com/en-us/articles/practical-intel-avx-optimization-on-2nd-generation-intel-core-processors
+[3]: https://en.wikipedia.org/wiki/Endianness
+[4]: https://software.intel.com/en-us/node/600110
+[5]: https://parquet.apache.org/documentation/latest/
+[6]: https://drill.apache.org/docs/value-vectors/
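
Note: two of the rules visible in the context lines of this diff (padding buffers to a multiple of 64 bytes, and computing each list slot's length as the first difference of the offsets buffer) can be illustrated with a minimal Python sketch. The helper names `padded_size` and `slot_lengths` are hypothetical and introduced only for this note; they are not part of Arrow or of this commit.

```python
ALIGNMENT = 64          # bytes; the padding multiple stated in Layout.md
INT32_MAX = 2**31 - 1   # largest value a 32-bit signed length or offset can hold


def padded_size(nbytes, alignment=ALIGNMENT):
    """Round nbytes up to the next multiple of alignment (hypothetical helper)."""
    if nbytes < 0:
        raise ValueError("nbytes must be non-negative")
    return (nbytes + alignment - 1) // alignment * alignment


def slot_lengths(offsets):
    """Return the length of each list slot from an offsets buffer.

    The offsets buffer has one more entry than the list array has slots;
    slot i spans values[offsets[i]:offsets[i + 1]].
    """
    if not offsets:
        raise ValueError("offsets must contain at least one entry")
    if any(o < 0 or o > INT32_MAX for o in offsets):
        raise ValueError("offsets must fit in a non-negative 32-bit signed integer")
    # First difference between each offset and the next one.
    return [end - start for start, end in zip(offsets, offsets[1:])]
```

For example, `padded_size(65)` rounds a 65-byte buffer up to 128 bytes, and `slot_lengths([0, 3, 3, 7])` describes three slots, the second of which is empty.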