Posted to commits@orc.apache.org by om...@apache.org on 2018/05/17 22:12:56 UTC

[02/11] orc git commit: Deploy docs with CVE.

http://git-wip-us.apache.org/repos/asf/orc/blob/e58d3fc4/specification/ORCv2.html
----------------------------------------------------------------------
diff --git a/specification/ORCv2.html b/specification/ORCv2.html
deleted file mode 100644
index db8aecf..0000000
--- a/specification/ORCv2.html
+++ /dev/null
@@ -1,1760 +0,0 @@
-<!DOCTYPE HTML>
-<html lang="en-US">
-<head>
-  <meta charset="UTF-8">
-  <title>Evolving Draft for ORC Specification v2</title>
-  <meta name="viewport" content="width=device-width,initial-scale=1">
-  <meta name="generator" content="Jekyll v2.4.0">
-  <link rel="stylesheet" href="//fonts.googleapis.com/css?family=Lato:300,300italic,400,400italic,700,700italic,900">
-  <link rel="stylesheet" href="/css/screen.css">
-  <link rel="icon" type="image/x-icon" href="/favicon.ico">
-  <!--[if lt IE 9]>
-  <script src="/js/html5shiv.min.js"></script>
-  <script src="/js/respond.min.js"></script>
-  <![endif]-->
-</head>
-
-
-<body class="wrap">
-  <header role="banner">
-  <nav class="mobile-nav show-on-mobiles">
-    <ul>
-  <li class="">
-    <a href="/">Home</a>
-  </li>
-  <li class="">
-    <a href="/docs/"><span class="show-on-mobiles">Docs</span>
-                     <span class="hide-on-mobiles">Documentation</span></a>
-  </li>
-  <li class="">
-    <a href="/talks/">Talks</a>
-  </li>
-  <li class="">
-    <a href="/news/">News</a>
-  </li>
-  <li class="">
-    <a href="/help/">Help</a>
-  </li>
-  <li class="">
-    <a href="/develop/">Develop</a>
-  </li>
-</ul>
-
-  </nav>
-  <div class="grid">
-    <div class="unit one-third center-on-mobiles">
-      <h1>
-        <a href="/">
-          <span class="sr-only">Apache ORC</span>
-          <img src="/img/logo.png" width="249" height="101" alt="ORC Logo">
-        </a>
-      </h1>
-    </div>
-    <nav class="main-nav unit two-thirds hide-on-mobiles">
-      <ul>
-  <li class="">
-    <a href="/">Home</a>
-  </li>
-  <li class="">
-    <a href="/docs/"><span class="show-on-mobiles">Docs</span>
-                     <span class="hide-on-mobiles">Documentation</span></a>
-  </li>
-  <li class="">
-    <a href="/talks/">Talks</a>
-  </li>
-  <li class="">
-    <a href="/news/">News</a>
-  </li>
-  <li class="">
-    <a href="/help/">Help</a>
-  </li>
-  <li class="">
-    <a href="/develop/">Develop</a>
-  </li>
-</ul>
-
-    </nav>
-  </div>
-</header>
-
-
-  <section class="standalone">
-  <div class="grid">
-
-    <div class="unit whole">
-      <article>
-        <h1>Evolving Draft for ORC Specification v2</h1>
-        <p>This specification is rapidly evolving and should only be used by
-developers on the project.</p>
-
-<h1 id="to-do-items">TO DO items</h1>
-
-<p>The list of things that we plan to change:</p>
-
-<ul>
-  <li>Move decimal encoding to RLEv3 and remove variable length encoding.</li>
-  <li>Create a better float/double encoding that splits mantissa and
-exponent.</li>
-  <li>Create a dictionary encoding for float, double, and decimal.</li>
-  <li>Create RLEv3:
-    <ul>
-      <li>64 and 128 bit variants</li>
-      <li>Zero suppression</li>
-      <li>Evaluate the rle subformats</li>
-    </ul>
-  </li>
-  <li>Group stripe data into stripelets to enable Async IO for reads.</li>
-  <li>Reorder stripe data into (stripe metadata, index, dictionary, data)</li>
-  <li>Stop sorting dictionaries and record the sort order separately in the index.</li>
-  <li>Remove use of RLEv1 and RLEv2.</li>
-  <li>Remove non-utf8 bloom filter.</li>
-  <li>Use numeric value for decimal statistics and bloom filter.</li>
-  <li>Add Zstd with dictionary.</li>
-</ul>
-
-<h1 id="motivation">Motivation</h1>
-
-<p>Hive’s RCFile was the standard format for storing tabular data in
-Hadoop for several years. However, RCFile has limitations because it
-treats each column as a binary blob without semantics. In Hive 0.11 we
-added a new file format named Optimized Row Columnar (ORC) file that
-uses and retains the type information from the table definition. ORC
-uses type specific readers and writers that provide light weight
-compression techniques such as dictionary encoding, bit packing, delta
-encoding, and run length encoding – resulting in dramatically smaller
-files. Additionally, ORC can apply generic compression using zlib or
-Snappy on top of the lightweight compression for even smaller
-files. However, storage savings are only part of the gain. ORC
-supports projection, which selects subsets of the columns for reading,
-so that queries reading only one column read only the required
-bytes. Furthermore, ORC files include light weight indexes that
-include the minimum and maximum values for each column in each set of
-10,000 rows and the entire file. Using pushdown filters from Hive, the
-file reader can skip entire sets of rows that are not relevant to
-the query.</p>
-
-<p><img src="/img/OrcFileLayout.png" alt="ORC file structure" /></p>
-
-<h1 id="file-tail">File Tail</h1>
-
-<p>Since HDFS does not support changing the data in a file after it is
-written, ORC stores the top level index at the end of the file. The
-overall structure of the file is given in the figure above.  The
-file’s tail consists of three parts: the file metadata, the file footer, and
-the postscript.</p>
-
-<p>The metadata for ORC is stored using
-<a href="https://s.apache.org/protobuf_encoding">Protocol Buffers</a>, which provides
-the ability to add new fields without breaking readers. This document
-incorporates the Protobuf definition from the
-<a href="https://s.apache.org/orc_proto">ORC source code</a> and the
-reader is encouraged to review the Protobuf encoding if they need to
-understand the byte-level encoding.</p>
-
-<h2 id="postscript">Postscript</h2>
-
-<p>The Postscript section provides the necessary information to interpret
-the rest of the file including the length of the file’s Footer and
-Metadata sections, the version of the file, and the kind of general
-compression used (e.g. none, zlib, or snappy). The Postscript is never
-compressed and ends one byte before the end of the file. The version
-stored in the Postscript is the lowest version of Hive that is
-guaranteed to be able to read the file and is stored as a sequence of
-the major and minor version. This file version is encoded as [0,12].</p>
-
-<p>The process of reading an ORC file works backwards through the
-file. Rather than making multiple short reads, the ORC reader reads
-the last 16k bytes of the file with the hope that it will contain both
-the Footer and Postscript sections. The final byte of the file
-contains the serialized length of the Postscript, which must be less
-than 256 bytes. Once the Postscript is parsed, the compressed
-serialized length of the Footer is known and it can be decompressed
-and parsed.</p>
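-
-<p>As a minimal sketch in Python (an illustration, not part of the
-specification), the first step of this backwards read can be written as
-follows. The helper name <code>split_tail</code> is hypothetical, and parsing
-the Protobuf-encoded Postscript itself is left out.</p>
-
```python
def split_tail(tail):
    """Split the final bytes of an ORC file into (postscript_length, postscript_bytes).

    `tail` is assumed to hold the last bytes of the file, for example the
    final 16k bytes read in a single request.
    """
    if not tail:
        raise ValueError("empty tail buffer")
    # The final byte of the file is the serialized length of the Postscript.
    ps_len = tail[-1]
    if ps_len + 1 > len(tail):
        raise ValueError("tail buffer too short to hold the Postscript")
    # The Postscript ends one byte before the end of the file.
    return ps_len, tail[-1 - ps_len:-1]


# Fake tail: some footer bytes, a 6-byte stand-in Postscript, then the length byte.
fake_tail = b"footer-bytes" + b"PSDATA" + bytes([6])
ps_len, ps_bytes = split_tail(fake_tail)
```
-
-<p>If the returned Postscript shows that the Footer is longer than the bytes
-already read, a reader would need a second read to fetch the remainder.</p>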
-
-<p><code>message PostScript {
- // the length of the footer section in bytes
- optional uint64 footerLength = 1;
- // the kind of generic compression used
- optional CompressionKind compression = 2;
- // the maximum size of each compression chunk
- optional uint64 compressionBlockSize = 3;
- // the version of the writer
- repeated uint32 version = 4 [packed = true];
- // the length of the metadata section in bytes
- optional uint64 metadataLength = 5;
- // the fixed string "ORC"
- optional string magic = 8000;
-}
-</code></p>
-
-<p><code>enum CompressionKind {
- NONE = 0;
- ZLIB = 1;
- SNAPPY = 2;
- LZO = 3;
- LZ4 = 4;
- ZSTD = 5;
-}
-</code></p>
-
-<h2 id="footer">Footer</h2>
-
-<p>The Footer section contains the layout of the body of the file, the
-type schema information, the number of rows, and the statistics about
-each of the columns.</p>
-
-<p>The file is broken into three parts: Header, Body, and Tail. The
-Header consists of the bytes “ORC” to support tools that want to
-scan the front of the file to determine the type of the file. The Body
-contains the rows and indexes, and the Tail gives the file level
-information as described in this section.</p>
-
-<p><code>message Footer {
- // the length of the file header in bytes (always 3)
- optional uint64 headerLength = 1;
- // the length of the file header and body in bytes
- optional uint64 contentLength = 2;
- // the information about the stripes
- repeated StripeInformation stripes = 3;
- // the schema information
- repeated Type types = 4;
- // the user metadata that was added
- repeated UserMetadataItem metadata = 5;
- // the total number of rows in the file
- optional uint64 numberOfRows = 6;
- // the statistics of each column across the file
- repeated ColumnStatistics statistics = 7;
- // the maximum number of rows in each index entry
- optional uint32 rowIndexStride = 8;
-}
-</code></p>
-
-<h3 id="stripe-information">Stripe Information</h3>
-
-<p>The body of the file is divided into stripes. Each stripe is self
-contained and may be read using only its own bytes combined with the
-file’s Footer and Postscript. Each stripe contains only entire rows so
-that rows never straddle stripe boundaries. Stripes have three
-sections: a set of indexes for the rows within the stripe, the data
-itself, and a stripe footer. Both the indexes and the data sections
-are divided by columns so that only the data for the required columns
-needs to be read.</p>
-
-<p><code>message StripeInformation {
- // the start of the stripe within the file
- optional uint64 offset = 1;
- // the length of the indexes in bytes
- optional uint64 indexLength = 2;
- // the length of the data in bytes
- optional uint64 dataLength = 3;
- // the length of the footer in bytes
- optional uint64 footerLength = 4;
- // the number of rows in the stripe
- optional uint64 numberOfRows = 5;
-}
-</code></p>
-
-<h3 id="type-information">Type Information</h3>
-
-<p>All of the rows in an ORC file must have the same schema. Logically
-the schema is expressed as a tree as in the figure below, where
-the compound types have subcolumns under them.</p>
-
-<p><img src="/img/TreeWriters.png" alt="ORC column structure" /></p>
-
-<p>The equivalent Hive DDL would be:</p>
-
-<p><code>create table Foobar (
- myInt int,
- myMap map&lt;string,
- struct&lt;myString : string,
- myDouble: double&gt;&gt;,
- myTime timestamp
-);
-</code></p>
-
-<p>The type tree is flattened into a list via a pre-order traversal
-where each type is assigned the next id. Clearly the root of the type
-tree is always type id 0. Compound types have a field named subtypes
-that contains the list of their children’s type ids.</p>
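-
-<p>The pre-order numbering can be sketched in Python as below. The tuple
-representation of the schema and the function name <code>flatten</code> are
-assumptions made for this illustration; they are not how the ORC writers
-represent types.</p>
-
```python
def flatten(schema):
    """Assign type ids by pre-order traversal.

    `schema` is a (kind, [children]) tuple. Returns a list of dicts with
    the id, kind, and subtype ids of every node, ordered by id.
    """
    types = []

    def visit(node):
        kind, children = node
        type_id = len(types)          # next unused id
        entry = {"id": type_id, "kind": kind, "subtypes": []}
        types.append(entry)           # parent is numbered before its children
        for child in children:
            entry["subtypes"].append(visit(child))
        return type_id

    visit(schema)
    return types


# The Foobar table from the DDL above; the root struct holds the three columns.
foobar = ("struct", [
    ("int", []),
    ("map", [("string", []),
             ("struct", [("string", []), ("double", [])])]),
    ("timestamp", []),
])
flat = flatten(foobar)
```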
-
-<p><code>message Type {
- enum Kind {
- BOOLEAN = 0;
- BYTE = 1;
- SHORT = 2;
- INT = 3;
- LONG = 4;
- FLOAT = 5;
- DOUBLE = 6;
- STRING = 7;
- BINARY = 8;
- TIMESTAMP = 9;
- LIST = 10;
- MAP = 11;
- STRUCT = 12;
- UNION = 13;
- DECIMAL = 14;
- DATE = 15;
- VARCHAR = 16;
- CHAR = 17;
- }
- // the kind of this type
- required Kind kind = 1;
- // the type ids of any subcolumns for list, map, struct, or union
- repeated uint32 subtypes = 2 [packed=true];
- // the list of field names for struct
- repeated string fieldNames = 3;
- // the maximum length of the type for varchar or char in UTF-8 characters
- optional uint32 maximumLength = 4;
- // the precision and scale for decimal
- optional uint32 precision = 5;
- optional uint32 scale = 6;
-}
-</code></p>
-
-<h3 id="column-statistics">Column Statistics</h3>
-
-<p>For each column, the writer records the count of values and, depending
-on the type, other useful fields. For
-most of the primitive types, it records the minimum and maximum
-values; and for numeric types it additionally stores the sum.
-From Hive 1.1.0 onwards, the column statistics also record whether
-there are any null values within the row group by setting the hasNull flag.
-The hasNull flag is used by ORC’s predicate pushdown to better answer
-‘IS NULL’ queries.</p>
-
-<p><code>message ColumnStatistics {
- // the number of values
- optional uint64 numberOfValues = 1;
- // At most one of these has a value for any column
- optional IntegerStatistics intStatistics = 2;
- optional DoubleStatistics doubleStatistics = 3;
- optional StringStatistics stringStatistics = 4;
- optional BucketStatistics bucketStatistics = 5;
- optional DecimalStatistics decimalStatistics = 6;
- optional DateStatistics dateStatistics = 7;
- optional BinaryStatistics binaryStatistics = 8;
- optional TimestampStatistics timestampStatistics = 9;
- optional bool hasNull = 10;
-}
-</code></p>
-
-<p>For integer types (tinyint, smallint, int, bigint), the column
-statistics include the minimum, maximum, and sum. If the sum
-overflows long at any point during the calculation, no sum is
-recorded.</p>
-
-<p><code>message IntegerStatistics {
- optional sint64 minimum = 1;
- optional sint64 maximum = 2;
- optional sint64 sum = 3;
-}
-</code></p>
-
-<p>For floating point types (float, double), the column statistics
-include the minimum, maximum, and sum. If the sum overflows a double,
-no sum is recorded.</p>
-
-<p><code>message DoubleStatistics {
- optional double minimum = 1;
- optional double maximum = 2;
- optional double sum = 3;
-}
-</code></p>
-
-<p>For strings, the minimum value, maximum value, and the sum of the
-lengths of the values are recorded.</p>
-
-<p><code>message StringStatistics {
- optional string minimum = 1;
- optional string maximum = 2;
- // sum will store the total length of all strings
- optional sint64 sum = 3;
-}
-</code></p>
-
-<p>For booleans, the statistics include the count of false and true values.</p>
-
-<p><code>message BucketStatistics {
- repeated uint64 count = 1 [packed=true];
-}
-</code></p>
-
-<p>For decimals, the minimum, maximum, and sum are stored.</p>
-
-<p><code>message DecimalStatistics {
- optional string minimum = 1;
- optional string maximum = 2;
- optional string sum = 3;
-}
-</code></p>
-
-<p>Date columns record the minimum and maximum values as the number of
-days since the UNIX epoch (1/1/1970).</p>
-
-<p><code>message DateStatistics {
- // min,max values saved as days since epoch
- optional sint32 minimum = 1;
- optional sint32 maximum = 2;
-}
-</code></p>
-
-<p>Timestamp columns record the minimum and maximum values as the number of
-milliseconds since the UNIX epoch (1/1/1970).</p>
-
-<p><code>message TimestampStatistics {
- // min,max values saved as milliseconds since epoch
- optional sint64 minimum = 1;
- optional sint64 maximum = 2;
-}
-</code></p>
-
-<p>Binary columns store the aggregate number of bytes across all of the values.</p>
-
-<p><code>message BinaryStatistics {
- // sum will store the total binary blob length
- optional sint64 sum = 1;
-}
-</code></p>
-
-<h3 id="user-metadata">User Metadata</h3>
-
-<p>The user can add arbitrary key/value pairs to an ORC file as it is
-written. The contents of the keys and values are completely
-application defined, but the key is a string and the value is
-binary. Care should be taken by applications to make sure that their
-keys are unique and in general should be prefixed with an organization
-code.</p>
-
-<p><code>message UserMetadataItem {
- // the user defined key
- required string name = 1;
- // the user defined binary value
- required bytes value = 2;
-}
-</code></p>
-
-<h3 id="file-metadata">File Metadata</h3>
-
-<p>The file Metadata section contains column statistics at the stripe
-level granularity. These statistics enable input split elimination
-based on the predicate push-down evaluated per stripe.</p>
-
-<p><code>message StripeStatistics {
- repeated ColumnStatistics colStats = 1;
-}
-</code></p>
-
-<p><code>message Metadata {
- repeated StripeStatistics stripeStats = 1;
-}
-</code></p>
-
-<h1 id="compression">Compression</h1>
-
-<p>If the ORC file writer selects a generic compression codec (zlib or
-snappy), every part of the ORC file except for the Postscript is
-compressed with that codec. However, one of the requirements for ORC
-is that the reader be able to skip over compressed bytes without
-decompressing the entire stream. To manage this, ORC writes compressed
-streams in chunks with headers as in the figure below.
-To handle incompressible data, if the compressed data is larger than
-the original, the original is stored and the isOriginal flag is
-set. Each header is 3 bytes long with (compressedLength * 2 +
-isOriginal) stored as a little endian value. For example, the header
-for a chunk that compressed to 100,000 bytes would be [0x40, 0x0d,
-0x03]. The header for 5 bytes that did not compress would be [0x0b,
-0x00, 0x00]. Each compression chunk is compressed independently so
-that as long as a decompressor starts at the top of a header, it can
-start decompressing without the previous bytes.</p>
-
-<p><img src="/img/CompressionStream.png" alt="compression streams" /></p>
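-
-<p>The 3 byte chunk header can be sketched in Python as follows. The function
-names are hypothetical and chosen for this illustration only.</p>
-
```python
def encode_chunk_header(length, is_original):
    """Build the 3 byte header: (length * 2 + isOriginal) as a little endian value."""
    value = length * 2 + (1 if is_original else 0)
    if not 0 <= value < (1 << 24):
        raise ValueError("chunk length does not fit in a 3 byte header")
    return value.to_bytes(3, "little")


def decode_chunk_header(header):
    """Return (chunk_length, is_original) from the first 3 bytes of a chunk."""
    value = int.from_bytes(header[:3], "little")
    return value >> 1, bool(value & 1)


compressed_header = encode_chunk_header(100000, False)  # chunk that compressed to 100,000 bytes
original_header = encode_chunk_header(5, True)          # 5 bytes stored without compression
```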
-
-<p>The default compression chunk size is 256K, but writers can choose
-their own value. Larger chunks lead to better compression, but require
-more memory. The chunk size is recorded in the Postscript so that
-readers can allocate appropriately sized buffers. Readers are
-guaranteed that no chunk will expand to more than the compression chunk
-size.</p>
-
-<p>ORC files without generic compression write each stream directly
-with no headers.</p>
-
-<h1 id="run-length-encoding">Run Length Encoding</h1>
-
-<h2 id="base-128-varint">Base 128 Varint</h2>
-
-<p>Variable width integer encodings take advantage of the fact that most
-numbers are small and that having smaller encodings for small numbers
-shrinks the overall size of the data. ORC uses the varint format from
-Protocol Buffers, which writes data in little endian format using the
-low 7 bits of each byte. The high bit in each byte is set if the
-number continues into the next byte.</p>
-
-<table>
-  <thead>
-    <tr>
-      <th style="text-align: left">Unsigned Original</th>
-      <th style="text-align: left">Serialized</th>
-    </tr>
-  </thead>
-  <tbody>
-    <tr>
-      <td style="text-align: left">0</td>
-      <td style="text-align: left">0x00</td>
-    </tr>
-    <tr>
-      <td style="text-align: left">1</td>
-      <td style="text-align: left">0x01</td>
-    </tr>
-    <tr>
-      <td style="text-align: left">127</td>
-      <td style="text-align: left">0x7f</td>
-    </tr>
-    <tr>
-      <td style="text-align: left">128</td>
-      <td style="text-align: left">0x80, 0x01</td>
-    </tr>
-    <tr>
-      <td style="text-align: left">129</td>
-      <td style="text-align: left">0x81, 0x01</td>
-    </tr>
-    <tr>
-      <td style="text-align: left">16,383</td>
-      <td style="text-align: left">0xff, 0x7f</td>
-    </tr>
-    <tr>
-      <td style="text-align: left">16,384</td>
-      <td style="text-align: left">0x80, 0x80, 0x01</td>
-    </tr>
-    <tr>
-      <td style="text-align: left">16,385</td>
-      <td style="text-align: left">0x81, 0x80, 0x01</td>
-    </tr>
-  </tbody>
-</table>
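-
-<p>A short Python sketch of the unsigned base 128 varint follows. The function
-names are assumptions for this illustration; the byte layout follows the table
-above.</p>
-
```python
def encode_varint(n):
    """Serialize a non-negative integer as a base 128 varint."""
    if n < 0:
        raise ValueError("unsigned varint cannot encode a negative number")
    out = bytearray()
    while True:
        low = n & 0x7F            # low 7 bits go out first (little endian)
        n >>= 7
        if n:
            out.append(low | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low)
            return bytes(out)


def decode_varint(data, pos=0):
    """Read one varint starting at `pos`; return (value, next_position)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7
```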
-
-<p>For signed integer types, the number is converted into an unsigned
-number using a zigzag encoding. Zigzag encoding moves the sign bit to
-the least significant bit using the expression (val &lt;&lt; 1) ^ (val &gt;&gt;
-63) and derives its name from the fact that positive and negative
-numbers alternate once encoded. The unsigned number is then serialized
-as above.</p>
-
-<table>
-  <thead>
-    <tr>
-      <th style="text-align: left">Signed Original</th>
-      <th style="text-align: left">Unsigned</th>
-    </tr>
-  </thead>
-  <tbody>
-    <tr>
-      <td style="text-align: left">0</td>
-      <td style="text-align: left">0</td>
-    </tr>
-    <tr>
-      <td style="text-align: left">-1</td>
-      <td style="text-align: left">1</td>
-    </tr>
-    <tr>
-      <td style="text-align: left">1</td>
-      <td style="text-align: left">2</td>
-    </tr>
-    <tr>
-      <td style="text-align: left">-2</td>
-      <td style="text-align: left">3</td>
-    </tr>
-    <tr>
-      <td style="text-align: left">2</td>
-      <td style="text-align: left">4</td>
-    </tr>
-  </tbody>
-</table>
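-
-<p>The zigzag mapping can be sketched in Python as below, assuming 64 bit
-values. Python integers are unbounded, so the encoder masks the result to 64
-bits; the function names are hypothetical.</p>
-
```python
MASK_64 = 0xFFFFFFFFFFFFFFFF


def zigzag_encode(val):
    """Map a signed 64 bit integer to an unsigned one, moving the sign to bit 0."""
    # In Python, >> on a negative int is an arithmetic shift, so (val >> 63)
    # is -1 for negative values and 0 otherwise.
    return ((val << 1) ^ (val >> 63)) & MASK_64


def zigzag_decode(unsigned):
    """Invert zigzag_encode."""
    return (unsigned >> 1) ^ -(unsigned & 1)
```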
-
-<h2 id="byte-run-length-encoding">Byte Run Length Encoding</h2>
-
-<p>For byte streams, ORC uses a very light weight encoding of identical
-values.</p>
-
-<ul>
-  <li>Run - a sequence of at least 3 identical values</li>
-  <li>Literals - a sequence of non-identical values</li>
-</ul>
-
-<p>The first byte of each group of values is a header that determines
-whether it is a run (value between 0 and 127) or literal list (value
-between -128 and -1). For runs, the control byte is the length of the
-run minus the length of the minimal run (3) and the control byte for
-literal lists is the negative length of the list. For example, a
-hundred 0’s is encoded as [0x61, 0x00] and the sequence 0x44, 0x45
-would be encoded as [0xfe, 0x44, 0x45]. The next group can choose
-either of the encodings.</p>
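-
-<p>A decoder for this byte encoding can be sketched in Python as follows. The
-function name is an assumption for this illustration, and the sketch does no
-validation of truncated input.</p>
-
```python
def decode_byte_rle(data):
    """Expand a byte run length encoded buffer into the original bytes."""
    out = bytearray()
    pos = 0
    while pos < len(data):
        control = data[pos]
        pos += 1
        if control < 0x80:
            # Run: control byte is (run length - 3), followed by the repeated value.
            out.extend(data[pos:pos + 1] * (control + 3))
            pos += 1
        else:
            # Literals: control byte is the negative count as a signed byte.
            count = 0x100 - control
            out.extend(data[pos:pos + count])
            pos += count
    return bytes(out)
```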
-
-<h2 id="boolean-run-length-encoding">Boolean Run Length Encoding</h2>
-
-<p>For encoding boolean types, the bits are put in the bytes from most
-significant to least significant. The bytes are encoded using byte run
-length encoding as described in the previous section. For example,
-the byte sequence [0xff, 0x80] would be one true followed by
-seven false values.</p>
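-
-<p>The bit unpacking step can be sketched in Python as below. The sketch
-assumes the byte run length encoding has already been undone, and that the
-number of values is known from elsewhere, since the last byte may be padded.
-The function name is hypothetical.</p>
-
```python
def unpack_booleans(packed, count):
    """Read `count` booleans from bytes, most significant bit first."""
    if count > len(packed) * 8:
        raise ValueError("not enough bytes for the requested number of values")
    return [bool((packed[i >> 3] >> (7 - (i & 7))) & 1) for i in range(count)]


# 0x80 is the literal byte left after decoding [0xff, 0x80] as byte RLE.
bits = unpack_booleans(bytes([0x80]), 8)
```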
-
-<h2 id="integer-run-length-encoding-version-1">Integer Run Length Encoding, version 1</h2>
-
-<p>In Hive 0.11 ORC files used Run Length Encoding version 1 (RLEv1),
-which provides a lightweight compression of signed or unsigned integer
-sequences. RLEv1 has two sub-encodings:</p>
-
-<ul>
-  <li>Run - a sequence of values that differ by a small fixed delta</li>
-  <li>Literals - a sequence of varint encoded values</li>
-</ul>
-
-<p>Runs start with an initial byte of 0x00 to 0x7f, which encodes the
-length of the run - 3. A second byte provides the fixed delta in the
-range of -128 to 127. Finally, the first value of the run is encoded
-as a base 128 varint.</p>
-
-<p>For example, if the sequence is 100 instances of 7 the encoding would
-start with 100 - 3, followed by a delta of 0, and a varint of 7 for
-an encoding of [0x61, 0x00, 0x07]. To encode the sequence of numbers
-running from 100 to 1, the first byte is 100 - 3, the delta is -1,
-and the varint is 100 for an encoding of [0x61, 0xff, 0x64].</p>
-
-<p>Literals start with an initial byte of 0x80 to 0xff, which corresponds
-to the negative of number of literals in the sequence. Following the
-header byte, the list of N varints is encoded. Thus, if there are
-no runs, the overhead is 1 byte for each 128 integers. The first 5
-prime numbers [2, 3, 5, 7, 11] would be encoded as [0xfb, 0x02, 0x03,
-0x05, 0x07, 0x0b].</p>
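-
-<p>An RLEv1 decoder can be sketched in Python as follows. The function names
-are assumptions for this illustration. The sketch treats the delta byte as a
-signed byte and zigzag decodes the varints only when <code>signed</code> is
-true.</p>
-
```python
def _read_varint(data, pos, signed):
    """Read one base 128 varint; zigzag decode it when `signed` is true."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    if signed:
        result = (result >> 1) ^ -(result & 1)
    return result, pos


def decode_rle_v1(data, signed=False):
    """Expand an RLEv1 buffer into a list of integers."""
    out = []
    pos = 0
    while pos < len(data):
        control = data[pos]
        pos += 1
        if control < 0x80:
            # Run: length - 3, then a signed delta byte, then the first value.
            run_length = control + 3
            delta = data[pos] - 0x100 if data[pos] >= 0x80 else data[pos]
            pos += 1
            base, pos = _read_varint(data, pos, signed)
            out.extend(base + i * delta for i in range(run_length))
        else:
            # Literals: negative count, then that many varints.
            for _ in range(0x100 - control):
                value, pos = _read_varint(data, pos, signed)
                out.append(value)
    return out
```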
-
-<h2 id="integer-run-length-encoding-version-2">Integer Run Length Encoding, version 2</h2>
-
-<p>In Hive 0.12, ORC introduced Run Length Encoding version 2 (RLEv2),
-which has improved compression and fixed bit width encodings for
-faster expansion. RLEv2 uses four sub-encodings based on the data:</p>
-
-<ul>
-  <li>Short Repeat - used for short sequences with repeated values</li>
-  <li>Direct - used for random sequences with a fixed bit width</li>
-  <li>Patched Base - used for random sequences with a variable bit width</li>
-  <li>Delta - used for monotonically increasing or decreasing sequences</li>
-</ul>
-
-<h3 id="short-repeat">Short Repeat</h3>
-
-<p>The short repeat encoding is used for short repeating integer
-sequences with the goal of minimizing the overhead of the header. All
-of the bits listed in the header are from the first byte to the last
-and from most significant bit to least significant bit. If the type is
-signed, the value is zigzag encoded.</p>
-
-<ul>
-  <li>1 byte header
-    <ul>
-      <li>2 bits for encoding type (0)</li>
-      <li>3 bits for width (W) of repeating value (1 to 8 bytes)</li>
-      <li>3 bits for repeat count (3 to 10 values)</li>
-    </ul>
-  </li>
-  <li>W bytes in big endian format, which is zigzag encoded if the type
-is signed</li>
-</ul>
-
-<p>The unsigned sequence of [10000, 10000, 10000, 10000, 10000] would be
-serialized with short repeat encoding (0), a width of 2 bytes (1), and
-repeat count of 5 (2) as [0x0a, 0x27, 0x10].</p>
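-
-<p>Decoding a short repeat run can be sketched in Python as below. The
-function name is hypothetical.</p>
-
```python
def decode_short_repeat(data, signed=False):
    """Decode one RLEv2 short repeat run starting at data[0]."""
    header = data[0]
    if header >> 6 != 0:
        raise ValueError("not a short repeat run")
    width = ((header >> 3) & 0x07) + 1   # 3 bits: value width in bytes, minus 1
    count = (header & 0x07) + 3          # 3 bits: repeat count, minus 3
    value = int.from_bytes(data[1:1 + width], "big")
    if signed:
        value = (value >> 1) ^ -(value & 1)  # undo zigzag
    return [value] * count
```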
-
-<h3 id="direct">Direct</h3>
-
-<p>The direct encoding is used for integer sequences whose values have a
-relatively constant bit width. It encodes the values directly using a
-fixed width big endian encoding. The width of the values is encoded
-using the table below.</p>
-
-<p>The 5 bit width encoding table for RLEv2:</p>
-
-<table>
-  <thead>
-    <tr>
-      <th style="text-align: left">Width in Bits</th>
-      <th style="text-align: left">Encoded Value</th>
-      <th style="text-align: left">Notes</th>
-    </tr>
-  </thead>
-  <tbody>
-    <tr>
-      <td style="text-align: left">0</td>
-      <td style="text-align: left">0</td>
-      <td style="text-align: left">for delta encoding</td>
-    </tr>
-    <tr>
-      <td style="text-align: left">1</td>
-      <td style="text-align: left">0</td>
-      <td style="text-align: left">for non-delta encoding</td>
-    </tr>
-    <tr>
-      <td style="text-align: left">2</td>
-      <td style="text-align: left">1</td>
-      <td style="text-align: left"> </td>
-    </tr>
-    <tr>
-      <td style="text-align: left">4</td>
-      <td style="text-align: left">3</td>
-      <td style="text-align: left"> </td>
-    </tr>
-    <tr>
-      <td style="text-align: left">8</td>
-      <td style="text-align: left">7</td>
-      <td style="text-align: left"> </td>
-    </tr>
-    <tr>
-      <td style="text-align: left">16</td>
-      <td style="text-align: left">15</td>
-      <td style="text-align: left"> </td>
-    </tr>
-    <tr>
-      <td style="text-align: left">24</td>
-      <td style="text-align: left">23</td>
-      <td style="text-align: left"> </td>
-    </tr>
-    <tr>
-      <td style="text-align: left">32</td>
-      <td style="text-align: left">27</td>
-      <td style="text-align: left"> </td>
-    </tr>
-    <tr>
-      <td style="text-align: left">40</td>
-      <td style="text-align: left">28</td>
-      <td style="text-align: left"> </td>
-    </tr>
-    <tr>
-      <td style="text-align: left">48</td>
-      <td style="text-align: left">29</td>
-      <td style="text-align: left"> </td>
-    </tr>
-    <tr>
-      <td style="text-align: left">56</td>
-      <td style="text-align: left">30</td>
-      <td style="text-align: left"> </td>
-    </tr>
-    <tr>
-      <td style="text-align: left">64</td>
-      <td style="text-align: left">31</td>
-      <td style="text-align: left"> </td>
-    </tr>
-    <tr>
-      <td style="text-align: left">3</td>
-      <td style="text-align: left">2</td>
-      <td style="text-align: left">deprecated</td>
-    </tr>
-    <tr>
-      <td style="text-align: left">5 &lt;= x &lt;= 7</td>
-      <td style="text-align: left">x - 1</td>
-      <td style="text-align: left">deprecated</td>
-    </tr>
-    <tr>
-      <td style="text-align: left">9 &lt;= x &lt;= 15</td>
-      <td style="text-align: left">x - 1</td>
-      <td style="text-align: left">deprecated</td>
-    </tr>
-    <tr>
-      <td style="text-align: left">17 &lt;= x &lt;= 21</td>
-      <td style="text-align: left">x - 1</td>
-      <td style="text-align: left">deprecated</td>
-    </tr>
-    <tr>
-      <td style="text-align: left">26</td>
-      <td style="text-align: left">24</td>
-      <td style="text-align: left">deprecated</td>
-    </tr>
-    <tr>
-      <td style="text-align: left">28</td>
-      <td style="text-align: left">25</td>
-      <td style="text-align: left">deprecated</td>
-    </tr>
-    <tr>
-      <td style="text-align: left">30</td>
-      <td style="text-align: left">26</td>
-      <td style="text-align: left">deprecated</td>
-    </tr>
-  </tbody>
-</table>
-
-<ul>
-  <li>2 bytes header
-    <ul>
-      <li>2 bits for encoding type (1)</li>
-      <li>5 bits for encoded width (W) of values (1 to 64 bits) using the 5 bit
-width encoding table</li>
-      <li>9 bits for length (L) (1 to 512 values)</li>
-    </ul>
-  </li>
-  <li>W * L bits (padded to the next byte) encoded in big endian format, which is
-zigzag encoded if the type is signed</li>
-</ul>
-
-<p>The unsigned sequence of [23713, 43806, 57005, 48879] would be
-serialized with direct encoding (1), a width of 16 bits (15), and
-length of 4 (3) as [0x5e, 0x03, 0x5c, 0xa1, 0xab, 0x1e, 0xde, 0xad,
-0xbe, 0xef].</p>
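-
-<p>A Python sketch of decoding a direct run follows. The function names are
-hypothetical. The width lookup follows the 5 bit width encoding table; encoded
-values 21 and 22 are not listed there, and mapping them to 22 and 23 bits is an
-assumption of this sketch.</p>
-
```python
# Encoded widths above 23 do not follow the "encoded + 1" pattern.
_WIDE = {24: 26, 25: 28, 26: 30, 27: 32, 28: 40, 29: 48, 30: 56, 31: 64}


def decode_width(encoded):
    """Map a 5 bit encoded width to a width in bits (non-delta encodings)."""
    return encoded + 1 if encoded <= 23 else _WIDE[encoded]


def read_bits(data, bit_pos, width):
    """Read `width` bits, big endian, starting at absolute bit offset `bit_pos`."""
    value = 0
    for i in range(bit_pos, bit_pos + width):
        value = (value << 1) | ((data[i >> 3] >> (7 - (i & 7))) & 1)
    return value


def decode_direct(data, signed=False):
    """Decode one RLEv2 direct run starting at data[0]."""
    if data[0] >> 6 != 1:
        raise ValueError("not a direct run")
    width = decode_width((data[0] >> 1) & 0x1F)
    length = (((data[0] & 1) << 8) | data[1]) + 1   # 9 bits: length minus 1
    values = [read_bits(data, 16 + i * width, width) for i in range(length)]
    if signed:
        values = [(v >> 1) ^ -(v & 1) for v in values]  # undo zigzag
    return values


example = bytes([0x5e, 0x03, 0x5c, 0xa1, 0xab, 0x1e, 0xde, 0xad, 0xbe, 0xef])
decoded = decode_direct(example)
```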
-
-<h3 id="patched-base">Patched Base</h3>
-
-<p>The patched base encoding is used for integer sequences whose bit
-widths vary a lot. The minimum signed value of the sequence is found
-and subtracted from the other values. The bit width of those adjusted
-values is analyzed and the 90th percentile of the bit width is chosen
-as W. The 10% of values larger than W use patches from a patch list
-to set the additional bits. Patches are encoded as a list of gaps in
-the index values and the additional value bits.</p>
-
-<ul>
-  <li>4 bytes header
-    <ul>
-      <li>2 bits for encoding type (2)</li>
-      <li>5 bits for encoded width (W) of values (1 to 64 bits) using the 5 bit
-  width encoding table</li>
-      <li>9 bits for length (L) (1 to 512 values)</li>
-      <li>3 bits for base value width (BW) (1 to 8 bytes)</li>
-      <li>5 bits for patch width (PW) (1 to 64 bits) using  the 5 bit width
-encoding table</li>
-      <li>3 bits for patch gap width (PGW) (1 to 8 bits)</li>
-      <li>5 bits for patch list length (PLL) (0 to 31 patches)</li>
-    </ul>
-  </li>
-  <li>Base value (BW bytes) - The base value is stored as a big endian value
-with negative values marked by the most significant bit set. If that
-bit is set, the entire value is negated.</li>
-  <li>Data values (W * L bits padded to the byte) - A sequence of W bit positive
-values that are added to the base value.</li>
-  <li>Patch list (PLL * (PGW + PW) bytes) - A list of patches for values
-that didn’t fit within W bits. Each entry in the list consists of a
-gap, which is the number of elements skipped from the previous
-patch, and a patch value. Patches are applied by logically or’ing
-the data values with the relevant patch shifted W bits left. If a
-patch is 0, it was introduced to skip over more than 255 items. The
-combined length of each patch (PGW + PW) must be less than or equal to
-64.</li>
-</ul>
-
-<p>The unsigned sequence of [2030, 2000, 2020, 1000000, 2040, 2050, 2060, 2070,
-2080, 2090, 2100, 2110, 2120, 2130, 2140, 2150, 2160, 2170, 2180, 2190]
-has a minimum of 2000, which makes the adjusted
-sequence [30, 0, 20, 998000, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140,
-150, 160, 170, 180, 190]. It has an
-encoding of patched base (2), a bit width of 8 (7), a length of 20
-(19), a base value width of 2 bytes (1), a patch width of 12 bits (11),
-patch gap width of 2 bits (1), and a patch list length of 1 (1). The
-base value is 2000 and the combined result is [0x8e, 0x13, 0x2b, 0x21, 0x07,
-0xd0, 0x1e, 0x00, 0x14, 0x70, 0x28, 0x32, 0x3c, 0x46, 0x50, 0x5a, 0x64, 0x6e,
-0x78, 0x82, 0x8c, 0x96, 0xa0, 0xaa, 0xb4, 0xbe, 0xfc, 0xe8].</p>
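-
-<p>Decoding a patched base run can be sketched in Python as below. The
-function names are hypothetical. The sketch assumes each patch list entry
-occupies exactly PGW + PW bits, which matches the example above; real writers
-may round that width up, so treat this as an illustration rather than a
-reference decoder.</p>
-
```python
_WIDE = {24: 26, 25: 28, 26: 30, 27: 32, 28: 40, 29: 48, 30: 56, 31: 64}


def decode_width(encoded):
    """Map a 5 bit encoded width to a width in bits."""
    return encoded + 1 if encoded <= 23 else _WIDE[encoded]


def read_bits(data, bit_pos, width):
    """Read `width` bits, big endian, starting at absolute bit offset `bit_pos`."""
    value = 0
    for i in range(bit_pos, bit_pos + width):
        value = (value << 1) | ((data[i >> 3] >> (7 - (i & 7))) & 1)
    return value


def decode_patched_base(data):
    """Decode one RLEv2 patched base run starting at data[0]."""
    if data[0] >> 6 != 2:
        raise ValueError("not a patched base run")
    width = decode_width((data[0] >> 1) & 0x1F)        # W
    length = (((data[0] & 1) << 8) | data[1]) + 1      # L
    base_width = (data[2] >> 5) + 1                    # BW, in bytes
    patch_width = decode_width(data[2] & 0x1F)         # PW
    gap_width = (data[3] >> 5) + 1                     # PGW
    patch_count = data[3] & 0x1F                       # PLL

    pos = 4
    base = int.from_bytes(data[pos:pos + base_width], "big")
    sign_bit = 1 << (base_width * 8 - 1)
    if base & sign_bit:                                # sign is the top bit
        base = -(base & (sign_bit - 1))
    pos += base_width

    values = [read_bits(data, pos * 8 + i * width, width) for i in range(length)]
    pos += (width * length + 7) // 8                   # data is padded to a byte

    index = 0
    bit_pos = pos * 8
    for _ in range(patch_count):
        gap = read_bits(data, bit_pos, gap_width)
        patch = read_bits(data, bit_pos + gap_width, patch_width)
        bit_pos += gap_width + patch_width
        index += gap                                   # skip from previous patch
        values[index] |= patch << width                # restore the high bits
    return [base + v for v in values]


example = bytes([
    0x8e, 0x13, 0x2b, 0x21, 0x07, 0xd0, 0x1e, 0x00, 0x14, 0x70,
    0x28, 0x32, 0x3c, 0x46, 0x50, 0x5a, 0x64, 0x6e, 0x78, 0x82,
    0x8c, 0x96, 0xa0, 0xaa, 0xb4, 0xbe, 0xfc, 0xe8,
])
decoded = decode_patched_base(example)
```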
-
-<h3 id="delta">Delta</h3>
-
-<p>The Delta encoding is used for monotonically increasing or decreasing
-sequences. The first two numbers in the sequence cannot be identical,
-because the encoding is using the sign of the first delta to determine
-if the series is increasing or decreasing.</p>
-
-<ul>
-  <li>2 bytes header
-    <ul>
-      <li>2 bits for encoding type (3)</li>
-      <li>5 bits for encoded width (W) of deltas (0 to 64 bits) using the 5 bit
-width encoding table</li>
-      <li>9 bits for run length (L) (1 to 512 values)</li>
-    </ul>
-  </li>
-  <li>Base value - encoded as (signed or unsigned) varint</li>
-  <li>Delta base - encoded as signed varint</li>
-  <li>Delta values (W * (L - 2) bits padded to the byte) - encode each delta after the first
-one. If the delta base is positive, the sequence is increasing and if it is
-negative the sequence is decreasing.</li>
-</ul>
-
-<p>The unsigned sequence of [2, 3, 5, 7, 11, 13, 17, 19, 23, 29] would be
-serialized with delta encoding (3), a width of 4 bits (3), length of
-10 (9), a base of 2 (2), and first delta of 1 (2). The resulting
-sequence is [0xc6, 0x09, 0x02, 0x02, 0x22, 0x42, 0x42, 0x46].</p>
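-
-<p>The layout can be sketched as a small encoder that reproduces the bytes of this
-example. The helper names are hypothetical, the width code is assumed to be
-width - 1 (valid only for widths up to 24 bits), the base is treated as an unsigned
-varint, and the deltas after the first are assumed to be stored as magnitudes packed
-big endian and padded to a whole byte.</p>

```python
def zigzag(n):
    """Map a signed 64-bit value to an unsigned one (0, -1, 1, -2 -> 0, 1, 2, 3)."""
    return (n << 1) ^ (n >> 63)


def varint(u):
    """Encode a non-negative integer as a base 128 varint."""
    out = bytearray()
    while True:
        byte = u & 0x7F
        u >>= 7
        if u:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def encode_delta(values, width):
    """Encode an unsigned monotonic sequence with the Delta sub-encoding."""
    width_code = width - 1  # assumption: holds for widths of 1 to 24 bits
    header = (3 << 14) | (width_code << 9) | (len(values) - 1)
    out = bytearray(header.to_bytes(2, "big"))
    out += varint(values[0])                          # base value
    out += varint(zigzag(values[1] - values[0]))      # delta base, signed
    bits, nbits = 0, 0
    for prev, cur in zip(values[1:], values[2:]):     # deltas after the first
        bits = (bits << width) | abs(cur - prev)
        nbits += width
    pad = (-nbits) % 8                                # pad to a byte boundary
    out += (bits << pad).to_bytes((nbits + pad) // 8, "big")
    return bytes(out)


encoded = encode_delta([2, 3, 5, 7, 11, 13, 17, 19, 23, 29], width=4)
```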
-
-<h1 id="stripes">Stripes</h1>
-
-<p>The body of ORC files consists of a series of stripes. Stripes are
-large (typically ~200MB) and independent of each other and are often
-processed by different tasks. The defining characteristic for columnar
-storage formats is that the data for each column is stored separately
-and that reading data out of the file should be proportional to the
-number of columns read.</p>
-
-<p>In ORC files, each column is stored in several streams that are stored
-next to each other in the file. For example, an integer column is
-represented as two streams: PRESENT, which uses one bit per
-value to record whether the value is non-null, and DATA, which records the
-non-null values. If all of a column’s values in a stripe are non-null,
-the PRESENT stream is omitted from the stripe. For binary data, ORC
-uses three streams: PRESENT, DATA, and LENGTH, which stores the length
-of each value. The details of each type will be presented in the
-following subsections.</p>
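-
-<p>As a rough illustration of the PRESENT and DATA split for an integer column, the
-sketch below works on plain Python lists and ignores the run length encoding that is
-applied to each stream afterwards. The function name <code>split_nullable</code> is
-hypothetical.</p>

```python
def split_nullable(values):
    """Split a column with nulls (None) into PRESENT bits and non-null DATA."""
    present = [v is not None for v in values]
    data = [v for v in values if v is not None]
    if all(present):
        present = None  # the PRESENT stream is omitted when nothing is null
    return present, data


present, data = split_nullable([7, None, 9, None, 11])
no_nulls_present, no_nulls_data = split_nullable([1, 2, 3])
```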
-
-<h2 id="stripe-footer">Stripe Footer</h2>
-
-<p>The stripe footer contains the encoding of each column and the
-directory of the streams including their location.</p>
-
-<p><code>message StripeFooter {
- // the location of each stream
- repeated Stream streams = 1;
- // the encoding of each column
- repeated ColumnEncoding columns = 2;
-}
-</code></p>
-
-<p>To describe each stream, ORC stores the kind of stream, the column id,
-and the stream’s size in bytes. The details of what is stored in each stream
-depend on the type and encoding of the column.</p>
-
-<p><code>message Stream {
- enum Kind {
- // boolean stream of whether the next value is non-null
- PRESENT = 0;
- // the primary data stream
- DATA = 1;
- // the length of each value for variable length data
- LENGTH = 2;
- // the dictionary blob
- DICTIONARY_DATA = 3;
- // deprecated prior to Hive 0.11
- // It was used to store the number of instances of each value in the
- // dictionary
- DICTIONARY_COUNT = 4;
- // a secondary data stream
- SECONDARY = 5;
- // the index for seeking to particular row groups
- ROW_INDEX = 6;
- // original bloom filters used before ORC-101
- BLOOM_FILTER = 7;
- // bloom filters that consistently use utf8
- BLOOM_FILTER_UTF8 = 8;
- }
- required Kind kind = 1;
- // the column id
- optional uint32 column = 2;
- // the number of bytes in the file
- optional uint64 length = 3;
-}
-</code></p>
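-
-<p>Because a Stream entry records only a length, a reader has to derive each stream’s
-location itself. The sketch below assumes that the streams are laid out back to back
-from the start of the stripe in the order they are listed in the footer; the function
-name and the tuple layout are hypothetical.</p>

```python
def stream_offsets(stripe_offset, streams):
    """Return (kind, column, offset, length) for each (kind, column, length)."""
    located = []
    position = stripe_offset
    for kind, column, length in streams:
        located.append((kind, column, position, length))
        position += length  # the next stream starts where this one ends
    return located


located = stream_offsets(3, [
    ("ROW_INDEX", 1, 20),
    ("PRESENT", 1, 5),
    ("DATA", 1, 100),
])
```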
-
-<p>Depending on the column’s type, several encodings are possible. The
-encodings are divided into direct or dictionary-based categories and
-further refined as to whether they use RLE v1 or v2.</p>
-
-<p><code>message ColumnEncoding {
- enum Kind {
- // the encoding is mapped directly to the stream using RLE v1
- DIRECT = 0;
- // the encoding uses a dictionary of unique values using RLE v1
- DICTIONARY = 1;
- // the encoding is direct using RLE v2
- DIRECT_V2 = 2;
- // the encoding is dictionary-based using RLE v2
- DICTIONARY_V2 = 3;
- }
- required Kind kind = 1;
- // for dictionary encodings, record the size of the dictionary
- optional uint32 dictionarySize = 2;
-}
-</code></p>
-
-<h1 id="column-encodings">Column Encodings</h1>
-
-<h2 id="smallint-int-and-bigint-columns">SmallInt, Int, and BigInt Columns</h2>
-
-<p>All of the 16, 32, and 64 bit integer column types use the same set of
-potential encodings, which differ only in whether they use RLE v1 or
-v2. If the PRESENT stream is not included, all of the values are
-present. For values that have false bits in the PRESENT stream, no
-values are included in the DATA stream.</p>
-
-<table>
-  <thead>
-    <tr>
-      <th style="text-align: left">Encoding</th>
-      <th style="text-align: left">Stream Kind</th>
-      <th style="text-align: left">Optional</th>
-      <th style="text-align: left">Contents</th>
-    </tr>
-  </thead>
-  <tbody>
-    <tr>
-      <td style="text-align: left">DIRECT</td>
-      <td style="text-align: left">PRESENT</td>
-      <td style="text-align: left">Yes</td>
-      <td style="text-align: left">Boolean RLE</td>
-    </tr>
-    <tr>
-      <td style="text-align: left"> </td>
-      <td style="text-align: left">DATA</td>
-      <td style="text-align: left">No</td>
-      <td style="text-align: left">Signed Integer RLE v1</td>
-    </tr>
-    <tr>
-      <td style="text-align: left">DIRECT_V2</td>
-      <td style="text-align: left">PRESENT</td>
-      <td style="text-align: left">Yes</td>
-      <td style="text-align: left">Boolean RLE</td>
-    </tr>
-    <tr>
-      <td style="text-align: left"> </td>
-      <td style="text-align: left">DATA</td>
-      <td style="text-align: left">No</td>
-      <td style="text-align: left">Signed Integer RLE v2</td>
-    </tr>
-  </tbody>
-</table>
-
-<h2 id="float-and-double-columns">Float and Double Columns</h2>
-
-<p>Floating point types are stored using IEEE 754 floating point bit
-layout. Float columns use 4 bytes per value and double columns use 8
-bytes.</p>
-
-<table>
-  <thead>
-    <tr>
-      <th style="text-align: left">Encoding</th>
-      <th style="text-align: left">Stream Kind</th>
-      <th style="text-align: left">Optional</th>
-      <th style="text-align: left">Contents</th>
-    </tr>
-  </thead>
-  <tbody>
-    <tr>
-      <td style="text-align: left">DIRECT</td>
-      <td style="text-align: left">PRESENT</td>
-      <td style="text-align: left">Yes</td>
-      <td style="text-align: left">Boolean RLE</td>
-    </tr>
-    <tr>
-      <td style="text-align: left"> </td>
-      <td style="text-align: left">DATA</td>
-      <td style="text-align: left">No</td>
-      <td style="text-align: left">IEEE 754 floating point representation</td>
-    </tr>
-  </tbody>
-</table>
-
-<h2 id="string-char-and-varchar-columns">String, Char, and VarChar Columns</h2>
-
-<p>String, char, and varchar columns may be encoded either using a
-dictionary encoding or a direct encoding. A direct encoding should be
-preferred when there are many distinct values. In all of the
-encodings, the PRESENT stream encodes whether the value is null. The
-Java ORC writer automatically picks the encoding after the first row
-group (10,000 rows).</p>
-
-<p>For direct encoding the UTF-8 bytes are saved in the DATA stream and
-the length of each value is written into the LENGTH stream. In direct
-encoding, if the values were [“Nevada”, “California”], the DATA
-would be “NevadaCalifornia” and the LENGTH would be [6, 10].</p>
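-
-<p>A minimal sketch of the direct encoding, before RLE is applied to the LENGTH
-stream; the function names are hypothetical.</p>

```python
def direct_encode(values):
    """Return the DATA bytes and the LENGTH list for a list of strings."""
    encoded = [v.encode("utf-8") for v in values]
    return b"".join(encoded), [len(e) for e in encoded]


def direct_decode(data, lengths):
    """Rebuild the strings by slicing DATA according to LENGTH."""
    values, pos = [], 0
    for n in lengths:
        values.append(data[pos:pos + n].decode("utf-8"))
        pos += n
    return values


data, lengths = direct_encode(["Nevada", "California"])
```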
-
-<p>For dictionary encodings the dictionary is sorted and UTF-8 bytes of
-each unique value are placed into DICTIONARY_DATA. The length of each
-item in the dictionary is put into the LENGTH stream. The DATA stream
-consists of the sequence of references to the dictionary elements.</p>
-
-<p>In dictionary encoding, if the values were [“Nevada”,
-“California”, “Nevada”, “California”, “Florida”], the
-DICTIONARY_DATA would be “CaliforniaFloridaNevada” and LENGTH would
-be [10, 7, 6]. The DATA would be [2, 0, 2, 0, 1].</p>
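-
-<p>The same values can be run through a small dictionary encoder. This is only a
-sketch: the function name is hypothetical, the dictionary is assumed to be sorted by
-its UTF-8 bytes, and the RLE applied to DATA and LENGTH is left out.</p>

```python
def dictionary_encode(values):
    """Return (DICTIONARY_DATA, LENGTH, DATA) for a list of strings."""
    # Assumption: entries are ordered by their UTF-8 byte representation.
    dictionary = sorted({v.encode("utf-8") for v in values})
    index = {entry: i for i, entry in enumerate(dictionary)}
    dictionary_data = b"".join(dictionary)
    lengths = [len(entry) for entry in dictionary]
    data = [index[v.encode("utf-8")] for v in values]
    return dictionary_data, lengths, data


dictionary_data, lengths, data = dictionary_encode(
    ["Nevada", "California", "Nevada", "California", "Florida"])
```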
-
-<table>
-  <thead>
-    <tr>
-      <th style="text-align: left">Encoding</th>
-      <th style="text-align: left">Stream Kind</th>
-      <th style="text-align: left">Optional</th>
-      <th style="text-align: left">Contents</th>
-    </tr>
-  </thead>
-  <tbody>
-    <tr>
-      <td style="text-align: left">DIRECT</td>
-      <td style="text-align: left">PRESENT</td>
-      <td style="text-align: left">Yes</td>
-      <td style="text-align: left">Boolean RLE</td>
-    </tr>
-    <tr>
-      <td style="text-align: left"> </td>
-      <td style="text-align: left">DATA</td>
-      <td style="text-align: left">No</td>
-      <td style="text-align: left">String contents</td>
-    </tr>
-    <tr>
-      <td style="text-align: left"> </td>
-      <td style="text-align: left">LENGTH</td>
-      <td style="text-align: left">No</td>
-      <td style="text-align: left">Unsigned Integer RLE v1</td>
-    </tr>
-    <tr>
-      <td style="text-align: left">DICTIONARY</td>
-      <td style="text-align: left">PRESENT</td>
-      <td style="text-align: left">Yes</td>
-      <td style="text-align: left">Boolean RLE</td>
-    </tr>
-    <tr>
-      <td style="text-align: left"> </td>
-      <td style="text-align: left">DATA</td>
-      <td style="text-align: left">No</td>
-      <td style="text-align: left">Unsigned Integer RLE v1</td>
-    </tr>
-    <tr>
-      <td style="text-align: left"> </td>
-      <td style="text-align: left">DICTIONARY_DATA</td>
-      <td style="text-align: left">No</td>
-      <td style="text-align: left">String contents</td>
-    </tr>
-    <tr>
-      <td style="text-align: left"> </td>
-      <td style="text-align: left">LENGTH</td>
-      <td style="text-align: left">No</td>
-      <td style="text-align: left">Unsigned Integer RLE v1</td>
-    </tr>
-    <tr>
-      <td style="text-align: left">DIRECT_V2</td>
-      <td style="text-align: left">PRESENT</td>
-      <td style="text-align: left">Yes</td>
-      <td style="text-align: left">Boolean RLE</td>
-    </tr>
-    <tr>
-      <td style="text-align: left"> </td>
-      <td style="text-align: left">DATA</td>
-      <td style="text-align: left">No</td>
-      <td style="text-align: left">String contents</td>
-    </tr>
-    <tr>
-      <td style="text-align: left"> </td>
-      <td style="text-align: left">LENGTH</td>
-      <td style="text-align: left">No</td>
-      <td style="text-align: left">Unsigned Integer RLE v2</td>
-    </tr>
-    <tr>
-      <td style="text-align: left">DICTIONARY_V2</td>
-      <td style="text-align: left">PRESENT</td>
-      <td style="text-align: left">Yes</td>
-      <td style="text-align: left">Boolean RLE</td>
-    </tr>
-    <tr>
-      <td style="text-align: left"> </td>
-      <td style="text-align: left">DATA</td>
-      <td style="text-align: left">No</td>
-      <td style="text-align: left">Unsigned Integer RLE v2</td>
-    </tr>
-    <tr>
-      <td style="text-align: left"> </td>
-      <td style="text-align: left">DICTIONARY_DATA</td>
-      <td style="text-align: left">No</td>
-      <td style="text-align: left">String contents</td>
-    </tr>
-    <tr>
-      <td style="text-align: left"> </td>
-      <td style="text-align: left">LENGTH</td>
-      <td style="text-align: left">No</td>
-      <td style="text-align: left">Unsigned Integer RLE v2</td>
-    </tr>
-  </tbody>
-</table>
-
-<h2 id="boolean-columns">Boolean Columns</h2>
-
-<p>Boolean columns are rare, but have a simple encoding.</p>
-
-<table>
-  <thead>
-    <tr>
-      <th style="text-align: left">Encoding</th>
-      <th style="text-align: left">Stream Kind</th>
-      <th style="text-align: left">Optional</th>
-      <th style="text-align: left">Contents</th>
-    </tr>
-  </thead>
-  <tbody>
-    <tr>
-      <td style="text-align: left">DIRECT</td>
-      <td style="text-align: left">PRESENT</td>
-      <td style="text-align: left">Yes</td>
-      <td style="text-align: left">Boolean RLE</td>
-    </tr>
-    <tr>
-      <td style="text-align: left"> </td>
-      <td style="text-align: left">DATA</td>
-      <td style="text-align: left">No</td>
-      <td style="text-align: left">Boolean RLE</td>
-    </tr>
-  </tbody>
-</table>
-
-<h2 id="tinyint-columns">TinyInt Columns</h2>
-
-<p>TinyInt (byte) columns use byte run length encoding.</p>
-
-<table>
-  <thead>
-    <tr>
-      <th style="text-align: left">Encoding</th>
-      <th style="text-align: left">Stream Kind</th>
-      <th style="text-align: left">Optional</th>
-      <th style="text-align: left">Contents</th>
-    </tr>
-  </thead>
-  <tbody>
-    <tr>
-      <td style="text-align: left">DIRECT</td>
-      <td style="text-align: left">PRESENT</td>
-      <td style="text-align: left">Yes</td>
-      <td style="text-align: left">Boolean RLE</td>
-    </tr>
-    <tr>
-      <td style="text-align: left"> </td>
-      <td style="text-align: left">DATA</td>
-      <td style="text-align: left">No</td>
-      <td style="text-align: left">Byte RLE</td>
-    </tr>
-  </tbody>
-</table>
-
-<h2 id="binary-columns">Binary Columns</h2>
-
-<p>Binary data is encoded with a PRESENT stream, a DATA stream that records
-the contents, and a LENGTH stream that records the number of bytes per
-value.</p>
-
-<table>
-  <thead>
-    <tr>
-      <th style="text-align: left">Encoding</th>
-      <th style="text-align: left">Stream Kind</th>
-      <th style="text-align: left">Optional</th>
-      <th style="text-align: left">Contents</th>
-    </tr>
-  </thead>
-  <tbody>
-    <tr>
-      <td style="text-align: left">DIRECT</td>
-      <td style="text-align: left">PRESENT</td>
-      <td style="text-align: left">Yes</td>
-      <td style="text-align: left">Boolean RLE</td>
-    </tr>
-    <tr>
-      <td style="text-align: left"> </td>
-      <td style="text-align: left">DATA</td>
-      <td style="text-align: left">No</td>
-      <td style="text-align: left">String contents</td>
-    </tr>
-    <tr>
-      <td style="text-align: left"> </td>
-      <td style="text-align: left">LENGTH</td>
-      <td style="text-align: left">No</td>
-      <td style="text-align: left">Unsigned Integer RLE v1</td>
-    </tr>
-    <tr>
-      <td style="text-align: left">DIRECT_V2</td>
-      <td style="text-align: left">PRESENT</td>
-      <td style="text-align: left">Yes</td>
-      <td style="text-align: left">Boolean RLE</td>
-    </tr>
-    <tr>
-      <td style="text-align: left"> </td>
-      <td style="text-align: left">DATA</td>
-      <td style="text-align: left">No</td>
-      <td style="text-align: left">String contents</td>
-    </tr>
-    <tr>
-      <td style="text-align: left"> </td>
-      <td style="text-align: left">LENGTH</td>
-      <td style="text-align: left">No</td>
-      <td style="text-align: left">Unsigned Integer RLE v2</td>
-    </tr>
-  </tbody>
-</table>
-
-<h2 id="decimal-columns">Decimal Columns</h2>
-
-<p>Since Hive 0.13, all decimals have had fixed precision and scale.
-The goal is to use RLEv3 for the value and use the fixed scale from
-the type. As an interim solution, we are using RLE v2 for short decimals
-(precision &lt;= 18) and the old encoding for long decimals.</p>
-
-<table>
-  <thead>
-    <tr>
-      <th style="text-align: left">Encoding</th>
-      <th style="text-align: left">Stream Kind</th>
-      <th style="text-align: left">Optional</th>
-      <th style="text-align: left">Contents</th>
-    </tr>
-  </thead>
-  <tbody>
-    <tr>
-      <td style="text-align: left">DIRECT</td>
-      <td style="text-align: left">PRESENT</td>
-      <td style="text-align: left">Yes</td>
-      <td style="text-align: left">Boolean RLE</td>
-    </tr>
-    <tr>
-      <td style="text-align: left"> </td>
-      <td style="text-align: left">DATA</td>
-      <td style="text-align: left">No</td>
-      <td style="text-align: left">Signed Integer RLE v2</td>
-    </tr>
-    <tr>
-      <td style="text-align: left">DIRECT_V2</td>
-      <td style="text-align: left">PRESENT</td>
-      <td style="text-align: left">Yes</td>
-      <td style="text-align: left">Boolean RLE</td>
-    </tr>
-    <tr>
-      <td style="text-align: left"> </td>
-      <td style="text-align: left">DATA</td>
-      <td style="text-align: left">No</td>
-      <td style="text-align: left">Unbounded base 128 varints</td>
-    </tr>
-    <tr>
-      <td style="text-align: left"> </td>
-      <td style="text-align: left">SECONDARY</td>
-      <td style="text-align: left">No</td>
-      <td style="text-align: left">Unsigned Integer RLE v2</td>
-    </tr>
-  </tbody>
-</table>
-
-<h2 id="date-columns">Date Columns</h2>
-
-<p>Date data is encoded with a PRESENT stream and a DATA stream that records
-the number of days after January 1, 1970 in UTC.</p>
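-
-<p>As a small illustration, the value placed in the DATA stream can be computed as
-a day count from the epoch. The function name is hypothetical and the integer RLE
-step is omitted.</p>

```python
from datetime import date

EPOCH = date(1970, 1, 1)


def days_since_epoch(d):
    """Number of days between January 1, 1970 and the given date."""
    return (d - EPOCH).days


first_day = days_since_epoch(date(1970, 1, 2))
before_epoch = days_since_epoch(date(1969, 12, 31))
start_of_2015 = days_since_epoch(date(2015, 1, 1))
```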
-
-<table>
-  <thead>
-    <tr>
-      <th style="text-align: left">Encoding</th>
-      <th style="text-align: left">Stream Kind</th>
-      <th style="text-align: left">Optional</th>
-      <th style="text-align: left">Contents</th>
-    </tr>
-  </thead>
-  <tbody>
-    <tr>
-      <td style="text-align: left">DIRECT</td>
-      <td style="text-align: left">PRESENT</td>
-      <td style="text-align: left">Yes</td>
-      <td style="text-align: left">Boolean RLE</td>
-    </tr>
-    <tr>
-      <td style="text-align: left"> </td>
-      <td style="text-align: left">DATA</td>
-      <td style="text-align: left">No</td>
-      <td style="text-align: left">Signed Integer RLE v1</td>
-    </tr>
-    <tr>
-      <td style="text-align: left">DIRECT_V2</td>
-      <td style="text-align: left">PRESENT</td>
-      <td style="text-align: left">Yes</td>
-      <td style="text-align: left">Boolean RLE</td>
-    </tr>
-    <tr>
-      <td style="text-align: left"> </td>
-      <td style="text-align: left">DATA</td>
-      <td style="text-align: left">No</td>
-      <td style="text-align: left">Signed Integer RLE v2</td>
-    </tr>
-  </tbody>
-</table>
-
-<h2 id="timestamp-columns">Timestamp Columns</h2>
-
-<p>Timestamp records times down to nanoseconds as a PRESENT stream that
-records non-null values, a DATA stream that records the number of
-seconds after 1 January 2015, and a SECONDARY stream that records the
-number of nanoseconds.</p>
-
-<p>Because the number of nanoseconds often has a large number of trailing
-zeros, the number has trailing decimal zero digits removed when there are
-at least two of them, and the last three bits record how many zeros were
-removed minus one. Thus 1000 nanoseconds would be serialized as 0x0a and
-100000 would be serialized as 0x0c.</p>
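-
-<p>The sketch below follows the behavior of the Java writer as I understand it,
-which is an assumption and not something this section states: zeros are stripped only
-when there are at least two, and the low three bits hold the number of removed zeros
-minus one. Under that reading 1000 nanoseconds serializes as 0x0a and 100000 as
-0x0c. The function names are hypothetical.</p>

```python
def encode_nanos(nanos):
    """Pack a nanosecond count into (value << 3) | zero_code."""
    if nanos == 0:
        return 0
    if nanos % 100 != 0:
        return nanos << 3          # fewer than two trailing zeros: keep as is
    nanos //= 100
    zero_code = 1                  # two zeros removed is stored as 1
    while nanos % 10 == 0 and zero_code < 7:
        nanos //= 10
        zero_code += 1
    return (nanos << 3) | zero_code


def decode_nanos(encoded):
    """Inverse of encode_nanos."""
    zero_code = encoded & 0x07
    value = encoded >> 3
    if zero_code:
        value *= 10 ** (zero_code + 1)
    return value
```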
-
-<table>
-  <thead>
-    <tr>
-      <th style="text-align: left">Encoding</th>
-      <th style="text-align: left">Stream Kind</th>
-      <th style="text-align: left">Optional</th>
-      <th style="text-align: left">Contents</th>
-    </tr>
-  </thead>
-  <tbody>
-    <tr>
-      <td style="text-align: left">DIRECT</td>
-      <td style="text-align: left">PRESENT</td>
-      <td style="text-align: left">Yes</td>
-      <td style="text-align: left">Boolean RLE</td>
-    </tr>
-    <tr>
-      <td style="text-align: left"> </td>
-      <td style="text-align: left">DATA</td>
-      <td style="text-align: left">No</td>
-      <td style="text-align: left">Signed Integer RLE v1</td>
-    </tr>
-    <tr>
-      <td style="text-align: left"> </td>
-      <td style="text-align: left">SECONDARY</td>
-      <td style="text-align: left">No</td>
-      <td style="text-align: left">Unsigned Integer RLE v1</td>
-    </tr>
-    <tr>
-      <td style="text-align: left">DIRECT_V2</td>
-      <td style="text-align: left">PRESENT</td>
-      <td style="text-align: left">Yes</td>
-      <td style="text-align: left">Boolean RLE</td>
-    </tr>
-    <tr>
-      <td style="text-align: left"> </td>
-      <td style="text-align: left">DATA</td>
-      <td style="text-align: left">No</td>
-      <td style="text-align: left">Signed Integer RLE v2</td>
-    </tr>
-    <tr>
-      <td style="text-align: left"> </td>
-      <td style="text-align: left">SECONDARY</td>
-      <td style="text-align: left">No</td>
-      <td style="text-align: left">Unsigned Integer RLE v2</td>
-    </tr>
-  </tbody>
-</table>
-
-<h2 id="struct-columns">Struct Columns</h2>
-
-<p>Structs have no data themselves and delegate everything to their child
-columns except for their PRESENT stream. They have a child column
-for each of the fields.</p>
-
-<table>
-  <thead>
-    <tr>
-      <th style="text-align: left">Encoding</th>
-      <th style="text-align: left">Stream Kind</th>
-      <th style="text-align: left">Optional</th>
-      <th style="text-align: left">Contents</th>
-    </tr>
-  </thead>
-  <tbody>
-    <tr>
-      <td style="text-align: left">DIRECT</td>
-      <td style="text-align: left">PRESENT</td>
-      <td style="text-align: left">Yes</td>
-      <td style="text-align: left">Boolean RLE</td>
-    </tr>
-  </tbody>
-</table>
-
-<h2 id="list-columns">List Columns</h2>
-
-<p>Lists are encoded as the PRESENT stream and a LENGTH stream with the
-number of items in each list. They have a single child column for the
-element values.</p>
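-
-<p>A sketch of how a column of lists breaks down into PRESENT bits, LENGTH values,
-and the flattened child column. The function name is hypothetical, and null lists are
-assumed to contribute nothing to LENGTH, by analogy with the integer columns
-above.</p>

```python
def encode_list_column(rows):
    """Return (present, lengths, child_values) for a column of lists."""
    present = [row is not None for row in rows]
    lengths = [len(row) for row in rows if row is not None]
    child_values = [item for row in rows if row is not None for item in row]
    return present, lengths, child_values


present, lengths, child_values = encode_list_column([[1, 2], None, [], [3]])
```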
-
-<table>
-  <thead>
-    <tr>
-      <th style="text-align: left">Encoding</th>
-      <th style="text-align: left">Stream Kind</th>
-      <th style="text-align: left">Optional</th>
-      <th style="text-align: left">Contents</th>
-    </tr>
-  </thead>
-  <tbody>
-    <tr>
-      <td style="text-align: left">DIRECT</td>
-      <td style="text-align: left">PRESENT</td>
-      <td style="text-align: left">Yes</td>
-      <td style="text-align: left">Boolean RLE</td>
-    </tr>
-    <tr>
-      <td style="text-align: left"> </td>
-      <td style="text-align: left">LENGTH</td>
-      <td style="text-align: left">No</td>
-      <td style="text-align: left">Unsigned Integer RLE v1</td>
-    </tr>
-    <tr>
-      <td style="text-align: left">DIRECT_V2</td>
-      <td style="text-align: left">PRESENT</td>
-      <td style="text-align: left">Yes</td>
-      <td style="text-align: left">Boolean RLE</td>
-    </tr>
-    <tr>
-      <td style="text-align: left"> </td>
-      <td style="text-align: left">LENGTH</td>
-      <td style="text-align: left">No</td>
-      <td style="text-align: left">Unsigned Integer RLE v2</td>
-    </tr>
-  </tbody>
-</table>
-
-<h2 id="map-columns">Map Columns</h2>
-
-<p>Maps are encoded as the PRESENT stream and a LENGTH stream with the number
-of items in each map. They have a child column for the key and
-another child column for the value.</p>
-
-<table>
-  <thead>
-    <tr>
-      <th style="text-align: left">Encoding</th>
-      <th style="text-align: left">Stream Kind</th>
-      <th style="text-align: left">Optional</th>
-      <th style="text-align: left">Contents</th>
-    </tr>
-  </thead>
-  <tbody>
-    <tr>
-      <td style="text-align: left">DIRECT</td>
-      <td style="text-align: left">PRESENT</td>
-      <td style="text-align: left">Yes</td>
-      <td style="text-align: left">Boolean RLE</td>
-    </tr>
-    <tr>
-      <td style="text-align: left"> </td>
-      <td style="text-align: left">LENGTH</td>
-      <td style="text-align: left">No</td>
-      <td style="text-align: left">Unsigned Integer RLE v1</td>
-    </tr>
-    <tr>
-      <td style="text-align: left">DIRECT_V2</td>
-      <td style="text-align: left">PRESENT</td>
-      <td style="text-align: left">Yes</td>
-      <td style="text-align: left">Boolean RLE</td>
-    </tr>
-    <tr>
-      <td style="text-align: left"> </td>
-      <td style="text-align: left">LENGTH</td>
-      <td style="text-align: left">No</td>
-      <td style="text-align: left">Unsigned Integer RLE v2</td>
-    </tr>
-  </tbody>
-</table>
-
-<h2 id="union-columns">Union Columns</h2>
-
-<p>Unions are encoded as the PRESENT stream and a tag stream that controls which
-potential variant is used. They have a child column for each variant of the
-union. Currently ORC union types are limited to 256 variants, which matches
-the Hive type model.</p>
-
-<table>
-  <thead>
-    <tr>
-      <th style="text-align: left">Encoding</th>
-      <th style="text-align: left">Stream Kind</th>
-      <th style="text-align: left">Optional</th>
-      <th style="text-align: left">Contents</th>
-    </tr>
-  </thead>
-  <tbody>
-    <tr>
-      <td style="text-align: left">DIRECT</td>
-      <td style="text-align: left">PRESENT</td>
-      <td style="text-align: left">Yes</td>
-      <td style="text-align: left">Boolean RLE</td>
-    </tr>
-    <tr>
-      <td style="text-align: left"> </td>
-      <td style="text-align: left">DATA</td>
-      <td style="text-align: left">No</td>
-      <td style="text-align: left">Byte RLE</td>
-    </tr>
-  </tbody>
-</table>
-
-<h1 id="indexes">Indexes</h1>
-
-<h2 id="row-group-index">Row Group Index</h2>
-
-<p>The row group indexes consist of a ROW_INDEX stream for each primitive
-column that has an entry for each row group. Row groups are controlled
-by the writer and default to 10,000 rows. Each RowIndexEntry gives the
-position of each stream for the column and the statistics for that row
-group.</p>
-
-<p>The index streams are placed at the front of the stripe, because in
-the default case of streaming they do not need to be read. They are
-only loaded when either predicate push down is being used or the
-reader seeks to a particular row.</p>
-
-<p><code>message RowIndexEntry {
- repeated uint64 positions = 1 [packed=true];
- optional ColumnStatistics statistics = 2;
-}
-</code></p>
-
-<p><code>message RowIndex {
- repeated RowIndexEntry entry = 1;
-}
-</code></p>
-
-<p>To record positions, each stream needs a sequence of numbers. For
-uncompressed streams, the position is the byte offset of the RLE run’s
-start location followed by the number of values that need to be
-consumed from the run. In compressed streams, the first number is the
-start of the compression chunk in the stream, followed by the number
-of decompressed bytes that need to be consumed, and finally the number
-of values consumed in the RLE.</p>
-
-<p>For columns with multiple streams, the sequences of positions in each
-stream are concatenated. That was an unfortunate decision on my part
-that we should fix at some point, because it makes code that uses the
-indexes error-prone.</p>
-
-<p>Because dictionaries are accessed randomly, there is not a position to
-record for the dictionary and the entire dictionary must be read even
-if only part of a stripe is being read.</p>
-
-<h2 id="bloom-filter-index">Bloom Filter Index</h2>
-
-<p>Bloom filters have been part of ORC indexes since Hive 1.2.0.
-Predicate pushdown can use bloom filters to better prune
-the row groups that do not satisfy the filter condition.
-The bloom filter indexes consist of a BLOOM_FILTER stream for each
-column specified through the ‘orc.bloom.filter.columns’ table property.
-A BLOOM_FILTER stream records a bloom filter entry for each row
-group (10,000 rows by default) in a column. Only the row groups that
-satisfy min/max row index evaluation will be evaluated against the
-bloom filter index.</p>
-
-<p>Each BloomFilter entry stores the number of hash functions (‘k’) used
-and the bitset backing the bloom filter. The original encoding (pre
-ORC-101) of bloom filters stored the bitset as a repeating
-sequence of longs in the bitset field with a little endian encoding
-(0x1 is bit 0 and 0x2 is bit 1). After ORC-101, the encoding is a
-sequence of bytes with a little endian encoding in the utf8bitset field.</p>
-
-<p><code>message BloomFilter {
- optional uint32 numHashFunctions = 1;
- repeated fixed64 bitset = 2;
- optional bytes utf8bitset = 3;
-}
-</code></p>
-
-<p><code>message BloomFilterIndex {
- repeated BloomFilter bloomFilter = 1;
-}
-</code></p>
-
-<p>The bloom filter internally uses two different hash functions to map a key
-to a position in the bit set. For tinyint, smallint, int, bigint, float
-and double types, Thomas Wang’s 64-bit integer hash function is used.
-Floats are converted to their IEEE-754 32 bit representation
-(using Java’s Float.floatToIntBits(float)). Similarly, doubles are
-converted to their IEEE-754 64 bit representation (using Java’s
-Double.doubleToLongBits(double)). All these primitive types
-are cast to the long base type before being passed on to the hash function.
-For strings and binary types, the Murmur3 64 bit hash algorithm is used.
-The 64 bit variant of Murmur3 considers only the most significant
-8 bytes of the Murmur3 128-bit algorithm. The 64 bit hashcode generated
-from the above algorithms is used as a base to derive ‘k’ different
-hash functions. We use the idea mentioned in the paper “Less Hashing,
-Same Performance: Building a Better Bloom Filter” by Kirsch et al. to
-quickly compute the k hashcodes.</p>
-
-<p>The algorithm for computing k hashcodes and setting the bit position
-in a bloom filter is as follows:</p>
-
-<ol>
-  <li>Get 64 bit base hash code from Murmur3 or Thomas Wang’s hash algorithm.</li>
-  <li>Split the above hashcode into two 32-bit hashcodes (say hash1 and hash2).</li>
-  <li>k’th hashcode is obtained by (where k &gt; 0):
-    <ul>
-      <li>combinedHash = hash1 + (k * hash2)</li>
-    </ul>
-  </li>
-  <li>If combinedHash is negative flip all the bits:
-    <ul>
-      <li>combinedHash = ~combinedHash</li>
-    </ul>
-  </li>
-  <li>Bit set position is obtained by performing modulo with m:
-    <ul>
-      <li>position = combinedHash % m</li>
-    </ul>
-  </li>
-  <li>Set the position in the bit set. The low 6 bits of the position select the
-bit within a long, in little endian order, and the remaining high bits select the
-long within the bitset.
-    <ul>
-      <li>bitset[position &gt;&gt;&gt; 6] |= (1L &lt;&lt; position);</li>
-    </ul>
-  </li>
-</ol>
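-
-<p>The steps above can be sketched as follows, starting from an already computed
-64 bit base hash. This is an illustration under stated assumptions: the function names
-are hypothetical, hash1 is taken to be the low 32 bits and hash2 the high 32 bits, and
-the arithmetic is wrapped to signed 32 bit values to mimic Java’s int overflow. The
-shift count is masked to its low 6 bits explicitly, which Java’s long shift does
-implicitly.</p>

```python
def to_int32(x):
    """Wrap an integer to a signed 32 bit value, like a Java int."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x


def bloom_positions(hash64, k, m):
    """Return the k bit positions for a 64 bit base hash and m total bits."""
    hash1 = to_int32(hash64)          # assumption: low half is hash1
    hash2 = to_int32(hash64 >> 32)    # assumption: high half is hash2
    positions = []
    for i in range(1, k + 1):
        combined = to_int32(hash1 + i * hash2)
        if combined < 0:
            combined = ~combined      # flip all the bits of a negative hash
        positions.append(combined % m)
    return positions


def set_positions(bitset, positions):
    """Set each position in a list of 64 bit words."""
    for position in positions:
        bitset[position >> 6] |= 1 << (position & 63)


bitset = [0, 0]                       # m = 128 bits
positions = bloom_positions(0x0000000200000001, k=3, m=128)
set_positions(bitset, positions)
```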
-
-<p>Bloom filter streams are interlaced with row group indexes. This placement
-makes it convenient to read the bloom filter stream and row index stream
-together in a single read operation.</p>
-
-<p><img src="/img/BloomFilter.png" alt="bloom filter" /></p>
-
-      </article>
-    </div>
-
-    <div class="clear"></div>
-
-  </div>
-</section>
-
-
-  <footer role="contentinfo">
-  <p>The contents of this website are &copy;&nbsp;2018
-     <a href="https://www.apache.org/">Apache Software Foundation</a>
-     under the terms of the <a
-      href="https://www.apache.org/licenses/LICENSE-2.0.html">
-      Apache&nbsp;License&nbsp;v2</a>. Apache ORC and its logo are trademarks
-      of the Apache Software Foundation.</p>
-</footer>
-
-  <script>
-  var anchorForId = function (id) {
-    var anchor = document.createElement("a");
-    anchor.className = "header-link";
-    anchor.href      = "#" + id;
-    anchor.innerHTML = "<span class=\"sr-only\">Permalink</span><i class=\"fa fa-link\"></i>";
-    anchor.title = "Permalink";
-    return anchor;
-  };
-
-  var linkifyAnchors = function (level, containingElement) {
-    var headers = containingElement.getElementsByTagName("h" + level);
-    for (var h = 0; h < headers.length; h++) {
-      var header = headers[h];
-
-      if (typeof header.id !== "undefined" && header.id !== "") {
-        header.appendChild(anchorForId(header.id));
-      }
-    }
-  };
-
-  document.onreadystatechange = function () {
-    if (this.readyState === "complete") {
-      var contentBlock = document.getElementsByClassName("docs")[0] || document.getElementsByClassName("news")[0];
-      if (!contentBlock) {
-        return;
-      }
-      for (var level = 1; level <= 6; level++) {
-        linkifyAnchors(level, contentBlock);
-      }
-    }
-  };
-</script>
-
-
-</body>
-</html>