Posted to commits@orc.apache.org by do...@apache.org on 2023/05/16 20:25:46 UTC

[orc] branch asf-site updated: ORC-1409: [Docs] Add stream order description in ORC spec

This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/orc.git


The following commit(s) were added to refs/heads/asf-site by this push:
     new f5f7bce50 ORC-1409: [Docs] Add stream order description in ORC spec
f5f7bce50 is described below

commit f5f7bce50c4b52b5873a7e9ede74d0740b636060
Author: Dongjoon Hyun <do...@apache.org>
AuthorDate: Tue May 16 13:25:38 2023 -0700

    ORC-1409: [Docs] Add stream order description in ORC spec
---
 specification/ORCv0/index.html | 34 +++++++++++++++++++++++++++++++++-
 specification/ORCv1/index.html | 33 ++++++++++++++++++++++++++++++++-
 specification/ORCv2/index.html | 33 ++++++++++++++++++++++++++++++++-
 3 files changed, 97 insertions(+), 3 deletions(-)
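
Note for readers of this change: the added text says that the streams field
in the StripeFooter is the single source of truth for a stream's kind, column
id and location. Below is a minimal Python sketch of how a reader might derive
locations from that list. It assumes that each entry carries only a kind, a
column id and a length, and that streams are listed in their physical order,
so each offset is the running sum of the preceding lengths. The Stream class,
the locate_streams helper and the sample numbers are hypothetical and are not
part of any ORC library.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Stream:
    """Hypothetical stand-in for one entry of StripeFooter.streams."""
    kind: str      # e.g. "ROW_INDEX", "PRESENT", "DATA"
    column: int    # column id
    length: int    # length of the stream in bytes


def locate_streams(stripe_offset: int,
                   streams: List[Stream]) -> Dict[Tuple[int, str], Tuple[int, int]]:
    """Map (column, kind) to (absolute offset, length).

    Assumes the footer lists streams in physical order, so the offset of
    each stream is the stripe offset plus the lengths of all earlier streams.
    """
    locations = {}
    offset = stripe_offset
    for stream in streams:
        locations[(stream.column, stream.kind)] = (offset, stream.length)
        offset += stream.length
    return locations


# Made-up footer: the DATA stream of column 1 is written before its
# PRESENT stream, which the spec text above explicitly allows.
footer_streams = [
    Stream("ROW_INDEX", 1, 20),
    Stream("DATA", 1, 100),
    Stream("PRESENT", 1, 10),
]
locations = locate_streams(3, footer_streams)
```

The reader looks streams up by (column, kind) instead of assuming that
PRESENT precedes DATA, which is the behavior the new paragraphs call for.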

diff --git a/specification/ORCv0/index.html b/specification/ORCv0/index.html
index 28412f735..b5124f022 100644
--- a/specification/ORCv0/index.html
+++ b/specification/ORCv0/index.html
@@ -634,6 +634,28 @@ uses three streams PRESENT, DATA, and LENGTH, which stores the length
 of each value. The details of each type will be presented in the
 following subsections.</p>
 
+<p>There is a general order for index and data streams:</p>
+<ul>
+  <li>Index streams are always placed together at the beginning of the stripe.</li>
+  <li>Data streams are placed together after index streams (if any).</li>
+  <li>Within the index streams and within the data streams, unencrypted streams should be
+placed first, followed by the streams grouped by encryption variant.</li>
+</ul>
+
+<p>Within the unencrypted group and within each encryption variant, the
+index and data streams have no fixed order:</p>
+<ul>
+  <li>Different stream kinds of the same column can be placed in any order.</li>
+  <li>Streams from different columns can also be placed in any order.
+To get the precise information (i.e., stream kind, column id, and location) of
+a stream within a stripe, the streams field in the StripeFooter described below
+is the single source of truth.</li>
+</ul>
+
+<p>In the example of the integer column mentioned above, the order of the
+PRESENT stream and the DATA stream cannot be determined in advance.
+The precise information must be read from the <strong>StripeFooter</strong>.</p>
+
 <h2 id="stripe-footer">Stripe Footer</h2>
 
 <p>The stripe footer contains the encoding of each column and the
@@ -696,7 +718,7 @@ further refined as to whether they use RLE v1 or v2.</p>
 }
 </code></pre></div></div>
 
-<h1 id="column-encodings">Column Encodings</h1>
+<h1 id="column-encodings"><a id="column-encoding-section">Column Encodings</a></h1>
 
 <h2 id="smallint-int-and-bigint-columns">SmallInt, Int, and BigInt Columns</h2>
 
@@ -731,6 +753,10 @@ values are included in the data stream.</p>
   </tbody>
 </table>
 
+<blockquote>
+  <p>Note that the order of the streams is not fixed. This also applies to other column types.</p>
+</blockquote>
+
 <h2 id="float-and-double-columns">Float and Double Columns</h2>
 
 <p>Floating point types are stored using IEEE 754 floating point bit
@@ -1213,6 +1239,12 @@ indexes error-prone.</p>
 record for the dictionary and the entire dictionary must be read even
 if only part of a stripe is being read.</p>
 
+<p>Note that for columns with multiple streams, the order of stream
+positions in the RowIndex is <strong>fixed</strong>. It may differ from
+the actual data stream placement, and it matches the order given in the
+<a href="#column-encoding-section">Column Encodings</a> section above.</p>
+
+
       </article>
     </div>
 
diff --git a/specification/ORCv1/index.html b/specification/ORCv1/index.html
index 6ceb89ed8..7c15f3404 100644
--- a/specification/ORCv1/index.html
+++ b/specification/ORCv1/index.html
@@ -1148,6 +1148,28 @@ following subsections.</p>
   <li>stripe footer</li>
 </ul>
 
+<p>There is a general order for index and data streams:</p>
+<ul>
+  <li>Index streams are always placed together at the beginning of the stripe.</li>
+  <li>Data streams are placed together after index streams (if any).</li>
+  <li>Within the index streams and within the data streams, unencrypted streams should be
+placed first, followed by the streams grouped by encryption variant.</li>
+</ul>
+
+<p>Within the unencrypted group and within each encryption variant, the
+index and data streams have no fixed order:</p>
+<ul>
+  <li>Different stream kinds of the same column can be placed in any order.</li>
+  <li>Streams from different columns can also be placed in any order.
+To get the precise information (i.e., stream kind, column id, and location) of
+a stream within a stripe, the streams field in the StripeFooter described below
+is the single source of truth.</li>
+</ul>
+
+<p>In the example of the integer column mentioned above, the order of the
+PRESENT stream and the DATA stream cannot be determined in advance.
+The precise information must be read from the <strong>StripeFooter</strong>.</p>
+
 <h2 id="stripe-footer">Stripe Footer</h2>
 
 <p>The stripe footer contains the encoding of each column and the
@@ -1242,7 +1264,7 @@ further refined as to whether they use RLE v1 or v2.</p>
 }
 </code></pre></div></div>
 
-<h1 id="column-encodings">Column Encodings</h1>
+<h1 id="column-encodings"><a id="column-encoding-section">Column Encodings</a></h1>
 
 <h2 id="smallint-int-and-bigint-columns">SmallInt, Int, and BigInt Columns</h2>
 
@@ -1289,6 +1311,10 @@ values are included in the data stream.</p>
   </tbody>
 </table>
 
+<blockquote>
+  <p>Note that the order of the streams is not fixed. This also applies to other column types.</p>
+</blockquote>
+
 <h2 id="float-and-double-columns">Float and Double Columns</h2>
 
 <p>Floating point types are stored using IEEE 754 floating point bit
@@ -1903,6 +1929,11 @@ indexes error-prone.</p>
 record for the dictionary and the entire dictionary must be read even
 if only part of a stripe is being read.</p>
 
+<p>Note that for columns with multiple streams, the order of stream
+positions in the RowIndex is <strong>fixed</strong>. It may differ from
+the actual data stream placement, and it matches the order given in the
+<a href="#column-encoding-section">Column Encodings</a> section above.</p>
+
 <h2 id="bloom-filter-index">Bloom Filter Index</h2>
 
 <p>Bloom Filters are added to ORC indexes from Hive 1.2.0 onwards.
diff --git a/specification/ORCv2/index.html b/specification/ORCv2/index.html
index b9d1c982f..620e152dc 100644
--- a/specification/ORCv2/index.html
+++ b/specification/ORCv2/index.html
@@ -1172,6 +1172,28 @@ following subsections.</p>
   <li>stripe footer</li>
 </ul>
 
+<p>There is a general order for index and data streams:</p>
+<ul>
+  <li>Index streams are always placed together at the beginning of the stripe.</li>
+  <li>Data streams are placed together after index streams (if any).</li>
+  <li>Within the index streams and within the data streams, unencrypted streams should be
+placed first, followed by the streams grouped by encryption variant.</li>
+</ul>
+
+<p>Within the unencrypted group and within each encryption variant, the
+index and data streams have no fixed order:</p>
+<ul>
+  <li>Different stream kinds of the same column can be placed in any order.</li>
+  <li>Streams from different columns can also be placed in any order.
+To get the precise information (i.e., stream kind, column id, and location) of
+a stream within a stripe, the streams field in the StripeFooter described below
+is the single source of truth.</li>
+</ul>
+
+<p>In the example of the integer column mentioned above, the order of the
+PRESENT stream and the DATA stream cannot be determined in advance.
+The precise information must be read from the <strong>StripeFooter</strong>.</p>
+
 <h2 id="stripe-footer">Stripe Footer</h2>
 
 <p>The stripe footer contains the encoding of each column and the
@@ -1266,7 +1288,7 @@ further refined as to whether they use RLE v1 or v2.</p>
 }
 </code></pre></div></div>
 
-<h1 id="column-encodings">Column Encodings</h1>
+<h1 id="column-encodings"><a id="column-encoding-section">Column Encodings</a></h1>
 
 <h2 id="smallint-int-and-bigint-columns">SmallInt, Int, and BigInt Columns</h2>
 
@@ -1313,6 +1335,10 @@ values are included in the data stream.</p>
   </tbody>
 </table>
 
+<blockquote>
+  <p>Note that the order of the streams is not fixed. This also applies to other column types.</p>
+</blockquote>
+
 <h2 id="float-and-double-columns">Float and Double Columns</h2>
 
 <p>Floating point types are stored using IEEE 754 floating point bit
@@ -1918,6 +1944,11 @@ indexes error-prone.</p>
 record for the dictionary and the entire dictionary must be read even
 if only part of a stripe is being read.</p>
 
+<p>Note that for columns with multiple streams, the order of stream
+positions in the RowIndex is <strong>fixed</strong>. It may differ from
+the actual data stream placement, and it matches the order given in the
+<a href="#column-encoding-section">Column Encodings</a> section above.</p>
+
 <h2 id="bloom-filter-index">Bloom Filter Index</h2>
 
 <p>Bloom Filters are added to ORC indexes from Hive 1.2.0 onwards.