Posted to commits@arrow.apache.org by ap...@apache.org on 2018/04/09 08:41:51 UTC

arrow-site git commit: Update docs

Repository: arrow-site
Updated Branches:
  refs/heads/asf-site 4c7c2f1b2 -> 301d577e0


Update docs


Project: http://git-wip-us.apache.org/repos/asf/arrow-site/repo
Commit: http://git-wip-us.apache.org/repos/asf/arrow-site/commit/301d577e
Tree: http://git-wip-us.apache.org/repos/asf/arrow-site/tree/301d577e
Diff: http://git-wip-us.apache.org/repos/asf/arrow-site/diff/301d577e

Branch: refs/heads/asf-site
Commit: 301d577e09df11f95edda377da499215a7f15ec7
Parents: 4c7c2f1
Author: Antoine Pitrou <an...@python.org>
Authored: Mon Apr 9 10:41:14 2018 +0200
Committer: Antoine Pitrou <an...@python.org>
Committed: Mon Apr 9 10:41:14 2018 +0200

----------------------------------------------------------------------
 blog/index.html         |  2 +-
 committers/index.html   |  6 ++++++
 docs/ipc.html           | 47 ++++++++++++++++++++++++++++++++++++++++++--
 docs/memory_layout.html | 11 +++++------
 docs/metadata.html      |  3 ++-
 feed.xml                |  4 ++--
 6 files changed, 61 insertions(+), 12 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/arrow-site/blob/301d577e/blog/index.html
----------------------------------------------------------------------
diff --git a/blog/index.html b/blog/index.html
index cfddf8f..ddad423 100644
--- a/blog/index.html
+++ b/blog/index.html
@@ -489,7 +489,7 @@ implementations and bindings to more languages.</p>
   <div class="container">
     <h2>
       Improvements to Java Vector API in Apache Arrow 0.8.0
-      <a href="/blog/2017/12/19/java-vector-improvements/" class="permalink" title="Permalink">∞</a>
+      <a href="/blog/2017/12/18/java-vector-improvements/" class="permalink" title="Permalink">∞</a>
     </h2>
 
     

http://git-wip-us.apache.org/repos/asf/arrow-site/blob/301d577e/committers/index.html
----------------------------------------------------------------------
diff --git a/committers/index.html b/committers/index.html
index 10a508e..9cbf10e 100644
--- a/committers/index.html
+++ b/committers/index.html
@@ -292,6 +292,12 @@
 <td>ptaylor</td>
 <td>Graphistry</td>
 </tr>
+<tr>
+<td>Antoine Pitrou</td>
+<td>Committer</td>
+<td>apitrou</td>
+<td>Independent / Two Sigma</td>
+</tr>
 </tbody></table>
 
     </div> <!-- /container -->

http://git-wip-us.apache.org/repos/asf/arrow-site/blob/301d577e/docs/ipc.html
----------------------------------------------------------------------
diff --git a/docs/ipc.html b/docs/ipc.html
index a825908..5022c80 100644
--- a/docs/ipc.html
+++ b/docs/ipc.html
@@ -146,7 +146,7 @@
 
 <ul>
   <li>A length prefix indicating the metadata size</li>
-  <li>The message metadata as a <a href="https://github.com/google]/flatbuffers">Flatbuffer</a></li>
+  <li>The message metadata as a <a href="https://github.com/google/flatbuffers">Flatbuffer</a></li>
   <li>Padding bytes to an 8-byte boundary</li>
   <li>The message body, which must be a multiple of 8 bytes</li>
 </ul>
@@ -191,7 +191,9 @@ flatbuffer union), and the size of the message body:</p>
 of encapsulated messages, each of which follows the format above. The schema
 comes first in the stream, and it is the same for all of the record batches
 that follow. If any fields in the schema are dictionary-encoded, one or more
-<code class="highlighter-rouge">DictionaryBatch</code> messages will follow the schema.</p>
+<code class="highlighter-rouge">DictionaryBatch</code> messages will be included. <code class="highlighter-rouge">DictionaryBatch</code> and
+<code class="highlighter-rouge">RecordBatch</code> messages may be interleaved, but before any dictionary key is used
+in a <code class="highlighter-rouge">RecordBatch</code> it should be defined in a <code class="highlighter-rouge">DictionaryBatch</code>.</p>
 
 <div class="highlighter-rouge"><pre class="highlight"><code>&lt;SCHEMA&gt;
 &lt;DICTIONARY 0&gt;
@@ -199,6 +201,10 @@ that follow. If any fields in the schema are dictionary-encoded, one or more
 &lt;DICTIONARY k - 1&gt;
 &lt;RECORD BATCH 0&gt;
 ...
+&lt;DICTIONARY x DELTA&gt;
+...
+&lt;DICTIONARY y DELTA&gt;
+...
 &lt;RECORD BATCH n - 1&gt;
 &lt;EOS [optional]: int32&gt;
 </code></pre>
@@ -233,6 +239,10 @@ footer.</p>
 </code></pre>
 </div>
 
+<p>In the file format, there is no requirement that dictionary keys should be
+defined in a <code class="highlighter-rouge">DictionaryBatch</code> before they are used in a <code class="highlighter-rouge">RecordBatch</code>, as long
+as the keys are defined somewhere in the file.</p>
+
 <h3 id="recordbatch-body-structure">RecordBatch body structure</h3>
 
 <p>The <code class="highlighter-rouge">RecordBatch</code> metadata contains a depth-first (pre-order) flattened set of
@@ -306,6 +316,7 @@ the dictionaries can be properly interpreted.</p>
 <div class="highlighter-rouge"><pre class="highlight"><code>table DictionaryBatch {
   id: long;
   data: RecordBatch;
+  isDelta: boolean = false;
 }
 </code></pre>
 </div>
@@ -315,6 +326,38 @@ in the schema, so that dictionaries can even be used for multiple fields. See
 the <a href="https://github.com/apache/arrow/blob/master/format/Layout.md">Physical Layout</a> document for more about the semantics of
 dictionary-encoded data.</p>
 
+<p>The dictionary <code class="highlighter-rouge">isDelta</code> flag allows dictionary batches to be modified
+mid-stream.  A dictionary batch with <code class="highlighter-rouge">isDelta</code> set indicates that its vector
+should be concatenated with those of any previous batches with the same <code class="highlighter-rouge">id</code>. A
+stream which encodes one column, the list of strings
+<code class="highlighter-rouge">["A", "B", "C", "B", "D", "C", "E", "A"]</code>, with a delta dictionary batch could
+take the form:</p>
+
+<div class="highlighter-rouge"><pre class="highlight"><code>&lt;SCHEMA&gt;
+&lt;DICTIONARY 0&gt;
+(0) "A"
+(1) "B"
+(2) "C"
+
+&lt;RECORD BATCH 0&gt;
+0
+1
+2
+1
+
+&lt;DICTIONARY 0 DELTA&gt;
+(3) "D"
+(4) "E"
+
+&lt;RECORD BATCH 1&gt;
+3
+2
+4
+0
+EOS
+</code></pre>
+</div>
+
 <h3 id="tensor-multi-dimensional-array-message-format">Tensor (Multi-dimensional Array) Message Format</h3>
 
 <p>The <code class="highlighter-rouge">Tensor</code> message types provides a way to write a multidimensional array of

http://git-wip-us.apache.org/repos/asf/arrow-site/blob/301d577e/docs/memory_layout.html
----------------------------------------------------------------------
diff --git a/docs/memory_layout.html b/docs/memory_layout.html
index 10fc82c..ff8f9e8 100644
--- a/docs/memory_layout.html
+++ b/docs/memory_layout.html
@@ -162,9 +162,8 @@ from <code class="highlighter-rouge">List&lt;V&gt;</code> iff U and V are differ
 or a fully-specified nested type. When we say slot we mean a relative type
 value, not necessarily any physical storage region.</li>
   <li>Logical type: A data type that is implemented using some relative (physical)
-type. For example, a Decimal value stored in 16 bytes could be stored in a
-primitive array with slot size 16 bytes. Similarly, strings can be stored as
-<code class="highlighter-rouge">List&lt;1-byte&gt;</code>.</li>
+type. For example, Decimal values are stored as 16 bytes in a fixed byte
+size array. Similarly, strings can be stored as <code class="highlighter-rouge">List&lt;1-byte&gt;</code>.</li>
   <li>Parent and child arrays: names to express relationships between physical
 value arrays in a nested type structure. For example, a <code class="highlighter-rouge">List&lt;T&gt;</code>-type parent
 array has a T-type array as its child (see more on lists below).</li>
@@ -753,9 +752,9 @@ the the types array indicates that a slot contains a different type at the index
 <h2 id="dictionary-encoding">Dictionary encoding</h2>
 
 <p>When a field is dictionary encoded, the values are represented by an array of Int32 representing the index of the value in the dictionary.
-The Dictionary is received as a DictionaryBatch whose id is referenced by a dictionary attribute defined in the metadata (<a href="https://github.com/apache/arrow/blob/master/format/Message.fbs">Message.fbs</a>) in the Field table.
-The dictionary has the same layout as the type of the field would dictate. Each entry in the dictionary can be accessed by its index in the DictionaryBatch.
-When a Schema references a Dictionary id, it must send a DictionaryBatch for this id before any RecordBatch.</p>
+The Dictionary is received as one or more DictionaryBatches with the id referenced by a dictionary attribute defined in the metadata (<a href="https://github.com/apache/arrow/blob/master/format/Message.fbs">Message.fbs</a>) in the Field table.
+The dictionary has the same layout as the type of the field would dictate. Each entry in the dictionary can be accessed by its index in the DictionaryBatches.
+When a Schema references a Dictionary id, it must send at least one DictionaryBatch for this id.</p>
 
 <p>As an example, you could have the following data:</p>
 <div class="highlighter-rouge"><pre class="highlight"><code>type: List&lt;String&gt;

http://git-wip-us.apache.org/repos/asf/arrow-site/blob/301d577e/docs/metadata.html
----------------------------------------------------------------------
diff --git a/docs/metadata.html b/docs/metadata.html
index df36202..858f0c0 100644
--- a/docs/metadata.html
+++ b/docs/metadata.html
@@ -531,7 +531,8 @@ logical type, which have no children) and 3 buffers:</p>
 
 <h3 id="decimal">Decimal</h3>
 
-<p>TBD</p>
+<p>Decimals are represented as a 2’s complement 128-bit (16 byte) signed integer
+in little-endian byte order.</p>
 
 <h3 id="timestamp">Timestamp</h3>
 

http://git-wip-us.apache.org/repos/asf/arrow-site/blob/301d577e/feed.xml
----------------------------------------------------------------------
diff --git a/feed.xml b/feed.xml
index e9e13a6..98d48d5 100644
--- a/feed.xml
+++ b/feed.xml
@@ -1,4 +1,4 @@
-<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.4.3">Jekyll</generator><link href="/feed.xml" rel="self" type="application/atom+xml" /><link href="/" rel="alternate" type="text/html" /><updated>2018-03-22T09:11:11-04:00</updated><id>/</id><entry><title type="html">A Native Go Library for Apache Arrow</title><link href="/blog/2018/03/22/go-code-donation/" rel="alternate" type="text/html" title="A Native Go Library for Apache Arrow" /><published>2018-03-22T00:00:00-04:00</published><updated>2018-03-22T00:00:00-04:00</updated><id>/blog/2018/03/22/go-code-donation</id><content type="html" xml:base="/blog/2018/03/22/go-code-donation/">&lt;!--
+<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.4.3">Jekyll</generator><link href="/feed.xml" rel="self" type="application/atom+xml" /><link href="/" rel="alternate" type="text/html" /><updated>2018-04-09T04:33:24-04:00</updated><id>/</id><entry><title type="html">A Native Go Library for Apache Arrow</title><link href="/blog/2018/03/22/go-code-donation/" rel="alternate" type="text/html" title="A Native Go Library for Apache Arrow" /><published>2018-03-22T00:00:00-04:00</published><updated>2018-03-22T00:00:00-04:00</updated><id>/blog/2018/03/22/go-code-donation</id><content type="html" xml:base="/blog/2018/03/22/go-code-donation/">&lt;!--
 
 --&gt;
 
@@ -266,7 +266,7 @@ working to improve and expand the libraries in support of downstream use cases.&
 
 &lt;p&gt;We continue to look for more JavaScript, Julia, R, Rust, and other programming
 language developers to join the project and expand the available
-implementations and bindings to more languages.&lt;/p&gt;</content><author><name>wesm</name></author></entry><entry><title type="html">Improvements to Java Vector API in Apache Arrow 0.8.0</title><link href="/blog/2017/12/19/java-vector-improvements/" rel="alternate" type="text/html" title="Improvements to Java Vector API in Apache Arrow 0.8.0" /><published>2017-12-18T19:00:00-05:00</published><updated>2017-12-18T19:00:00-05:00</updated><id>/blog/2017/12/19/java-vector-improvements</id><content type="html" xml:base="/blog/2017/12/19/java-vector-improvements/">&lt;!--
+implementations and bindings to more languages.&lt;/p&gt;</content><author><name>wesm</name></author></entry><entry><title type="html">Improvements to Java Vector API in Apache Arrow 0.8.0</title><link href="/blog/2017/12/18/java-vector-improvements/" rel="alternate" type="text/html" title="Improvements to Java Vector API in Apache Arrow 0.8.0" /><published>2017-12-18T19:00:00-05:00</published><updated>2017-12-18T19:00:00-05:00</updated><id>/blog/2017/12/18/java-vector-improvements</id><content type="html" xml:base="/blog/2017/12/18/java-vector-improvements/">&lt;!--
 
 --&gt;