Posted to commits@orc.apache.org by om...@apache.org on 2018/04/17 17:49:48 UTC
[1/9] orc git commit: Pushing ORC-339 reorganize the ORC file format
spec.
Repository: orc
Updated Branches:
refs/heads/asf-site c63412b1b -> c6e290902
http://git-wip-us.apache.org/repos/asf/orc/blob/c6e29090/specification/ORCv2.html
----------------------------------------------------------------------
diff --git a/specification/ORCv2.html b/specification/ORCv2.html
new file mode 100644
index 0000000..b78fc0a
--- /dev/null
+++ b/specification/ORCv2.html
@@ -0,0 +1,1769 @@
+<!DOCTYPE HTML>
+<html lang="en-US">
+<head>
+ <meta charset="UTF-8">
+ <title>Evolving Draft for ORC Specification v2</title>
+ <meta name="viewport" content="width=device-width,initial-scale=1">
+ <meta name="generator" content="Jekyll v2.4.0">
+ <link rel="stylesheet" href="//fonts.googleapis.com/css?family=Lato:300,300italic,400,400italic,700,700italic,900">
+ <link rel="stylesheet" href="/css/screen.css">
+ <link rel="icon" type="image/x-icon" href="/favicon.ico">
+ <!--[if lt IE 9]>
+ <script src="/js/html5shiv.min.js"></script>
+ <script src="/js/respond.min.js"></script>
+ <![endif]-->
+</head>
+
+
+<body class="wrap">
+ <header role="banner">
+ <nav class="mobile-nav show-on-mobiles">
+ <ul>
+ <li class="">
+ <a href="/">Home</a>
+ </li>
+ <li class="">
+ <a href="/docs/"><span class="show-on-mobiles">Docs</span>
+ <span class="hide-on-mobiles">Documentation</span></a>
+ </li>
+ <li class="">
+ <a href="/talks/">Talks</a>
+ </li>
+ <li class="">
+ <a href="/news/">News</a>
+ </li>
+ <li class="">
+ <a href="/help/">Help</a>
+ </li>
+ <li class="">
+ <a href="/develop/">Develop</a>
+ </li>
+</ul>
+
+ </nav>
+ <div class="grid">
+ <div class="unit one-third center-on-mobiles">
+ <h1>
+ <a href="/">
+ <span class="sr-only">Apache ORC</span>
+ <img src="/img/logo.png" width="249" height="101" alt="ORC Logo">
+ </a>
+ </h1>
+ </div>
+ <nav class="main-nav unit two-thirds hide-on-mobiles">
+ <ul>
+ <li class="">
+ <a href="/">Home</a>
+ </li>
+ <li class="">
+ <a href="/docs/"><span class="show-on-mobiles">Docs</span>
+ <span class="hide-on-mobiles">Documentation</span></a>
+ </li>
+ <li class="">
+ <a href="/talks/">Talks</a>
+ </li>
+ <li class="">
+ <a href="/news/">News</a>
+ </li>
+ <li class="">
+ <a href="/help/">Help</a>
+ </li>
+ <li class="">
+ <a href="/develop/">Develop</a>
+ </li>
+</ul>
+
+ </nav>
+ </div>
+</header>
+
+
+ <section class="standalone">
+ <div class="grid">
+
+ <div class="unit whole">
+ <article>
+ <h1>Evolving Draft for ORC Specification v2</h1>
+ <p>This specification is rapidly evolving and should only be used by
+developers on the project.</p>
+
+<h1 id="to-do-items">TO DO items</h1>
+
+<p>The list of things that we plan to change:</p>
+
+<ul>
+ <li>Create a decimal representation with fixed scale using RLE.</li>
+ <li>Create a better float/double encoding that splits mantissa and
+exponent.</li>
+ <li>Create a dictionary encoding for float, double, and decimal.</li>
+ <li>Create RLEv3:
+ <ul>
+ <li>64 and 128 bit variants</li>
+ <li>Zero suppression</li>
+ <li>Evaluate the RLE subformats</li>
+ </ul>
+ </li>
+ <li>Group stripe data into stripelets to enable Async IO for reads.</li>
+ <li>Reorder stripe data into (stripe metadata, index, dictionary, data)</li>
+ <li>Stop sorting dictionaries and record the sort order separately in the index.</li>
+ <li>Remove use of RLEv1 and RLEv2.</li>
+ <li>Remove non-utf8 bloom filter.</li>
+ <li>Use numeric value for decimal statistics and bloom filter.</li>
+ <li>Add Zstd with dictionary.</li>
+</ul>
+
+<h1 id="motivation">Motivation</h1>
+
+<p>Hive’s RCFile was the standard format for storing tabular data in
+Hadoop for several years. However, RCFile has limitations because it
+treats each column as a binary blob without semantics. In Hive 0.11 we
+added a new file format named Optimized Row Columnar (ORC) file that
+uses and retains the type information from the table definition. ORC
+uses type specific readers and writers that provide light weight
+compression techniques such as dictionary encoding, bit packing, delta
+encoding, and run length encoding – resulting in dramatically smaller
+files. Additionally, ORC can apply generic compression using zlib, or
+Snappy on top of the lightweight compression for even smaller
+files. However, storage savings are only part of the gain. ORC
+supports projection, which selects subsets of the columns for reading,
+so that queries reading only one column read only the required
+bytes. Furthermore, ORC files include light weight indexes that
+include the minimum and maximum values for each column in each set of
+10,000 rows and the entire file. Using pushdown filters from Hive, the
+file reader can skip entire sets of rows that aren’t important for
+this query.</p>
+
+<p><img src="/img/OrcFileLayout.png" alt="ORC file structure" /></p>
+
+<h1 id="file-tail">File Tail</h1>
+
+<p>Since HDFS does not support changing the data in a file after it is
+written, ORC stores the top level index at the end of the file. The
+overall structure of the file is given in the figure above. The
+file’s tail consists of three parts: the file metadata, the file footer, and
+the postscript.</p>
+
+<p>The metadata for ORC is stored using
+<a href="https://s.apache.org/protobuf_encoding">Protocol Buffers</a>, which provides
+the ability to add new fields without breaking readers. This document
+incorporates the Protobuf definition from the
+<a href="https://s.apache.org/orc_proto">ORC source code</a> and the
+reader is encouraged to review the Protobuf encoding if they need to
+understand the byte-level encoding.</p>
+
+<h2 id="postscript">Postscript</h2>
+
+<p>The Postscript section provides the necessary information to interpret
+the rest of the file including the length of the file’s Footer and
+Metadata sections, the version of the file, and the kind of general
+compression used (e.g. none, zlib, or snappy). The Postscript is never
+compressed and ends one byte before the end of the file. The version
+stored in the Postscript is the lowest version of Hive that is
+guaranteed to be able to read the file and it is stored as a sequence of
+the major and minor version. This file version is encoded as [0,12].</p>
+
+<p>The process of reading an ORC file works backwards through the
+file. Rather than making multiple short reads, the ORC reader reads
+the last 16k bytes of the file with the hope that it will contain both
+the Footer and Postscript sections. The final byte of the file
+contains the serialized length of the Postscript, which must be less
+than 256 bytes. Once the Postscript is parsed, the compressed
+serialized length of the Footer is known and it can be decompressed
+and parsed.</p>
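+
+<p>As a rough illustration of the backwards read, the sketch below slices the
+Postscript and Footer out of the trailing bytes of a file. The helper names
+<code>postscript_bytes</code> and <code>footer_bytes</code> are hypothetical, the
+Postscript is not actually parsed here, and <code>footer_length</code> is assumed
+to come from the <code>footerLength</code> field once the Postscript has been
+decoded with a Protobuf library.</p>

```python
def postscript_bytes(tail: bytes) -> bytes:
    """Return the serialized Postscript from the trailing bytes of an ORC file."""
    ps_length = tail[-1]              # the final byte holds the Postscript length
    return tail[-1 - ps_length:-1]    # the Postscript ends one byte before EOF


def footer_bytes(tail: bytes, footer_length: int) -> bytes:
    """Return the (possibly compressed) Footer that sits just before the Postscript."""
    ps_start = len(tail) - 1 - tail[-1]
    return tail[ps_start - footer_length:ps_start]
```

+<p>If the Footer turns out to be longer than the bytes already read, a reader
+would need a second read that reaches further back into the file.</p>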
+
+<p><code>message PostScript {
+ // the length of the footer section in bytes
+ optional uint64 footerLength = 1;
+ // the kind of generic compression used
+ optional CompressionKind compression = 2;
+ // the maximum size of each compression chunk
+ optional uint64 compressionBlockSize = 3;
+ // the version of the writer
+ repeated uint32 version = 4 [packed = true];
+ // the length of the metadata section in bytes
+ optional uint64 metadataLength = 5;
+ // the fixed string "ORC"
+ optional string magic = 8000;
+}
+</code></p>
+
+<p><code>enum CompressionKind {
+ NONE = 0;
+ ZLIB = 1;
+ SNAPPY = 2;
+ LZO = 3;
+ LZ4 = 4;
+ ZSTD = 5;
+}
+</code></p>
+
+<h2 id="footer">Footer</h2>
+
+<p>The Footer section contains the layout of the body of the file, the
+type schema information, the number of rows, and the statistics about
+each of the columns.</p>
+
+<p>The file is broken into three parts: Header, Body, and Tail. The
+Header consists of the bytes “ORC” to support tools that want to
+scan the front of the file to determine the type of the file. The Body
+contains the rows and indexes, and the Tail gives the file level
+information as described in this section.</p>
+
+<p><code>message Footer {
+ // the length of the file header in bytes (always 3)
+ optional uint64 headerLength = 1;
+ // the length of the file header and body in bytes
+ optional uint64 contentLength = 2;
+ // the information about the stripes
+ repeated StripeInformation stripes = 3;
+ // the schema information
+ repeated Type types = 4;
+ // the user metadata that was added
+ repeated UserMetadataItem metadata = 5;
+ // the total number of rows in the file
+ optional uint64 numberOfRows = 6;
+ // the statistics of each column across the file
+ repeated ColumnStatistics statistics = 7;
+ // the maximum number of rows in each index entry
+ optional uint32 rowIndexStride = 8;
+}
+</code></p>
+
+<h3 id="stripe-information">Stripe Information</h3>
+
+<p>The body of the file is divided into stripes. Each stripe is self
+contained and may be read using only its own bytes combined with the
+file’s Footer and Postscript. Each stripe contains only entire rows so
+that rows never straddle stripe boundaries. Stripes have three
+sections: a set of indexes for the rows within the stripe, the data
+itself, and a stripe footer. Both the indexes and the data sections
+are divided by columns so that only the data for the required columns
+needs to be read.</p>
+
+<p><code>message StripeInformation {
+ // the start of the stripe within the file
+ optional uint64 offset = 1;
+ // the length of the indexes in bytes
+ optional uint64 indexLength = 2;
+ // the length of the data in bytes
+ optional uint64 dataLength = 3;
+ // the length of the footer in bytes
+ optional uint64 footerLength = 4;
+ // the number of rows in the stripe
+ optional uint64 numberOfRows = 5;
+}
+</code></p>
+
+<h3 id="type-information">Type Information</h3>
+
+<p>All of the rows in an ORC file must have the same schema. Logically
+the schema is expressed as a tree as in the figure below, where
+the compound types have subcolumns under them.</p>
+
+<p><img src="/img/TreeWriters.png" alt="ORC column structure" /></p>
+
+<p>The equivalent Hive DDL would be:</p>
+
+<p><code>create table Foobar (
+ myInt int,
+ myMap map<string,
+ struct<myString : string,
+ myDouble: double>>,
+ myTime timestamp
+);
+</code></p>
+
+<p>The type tree is flattened into a list via a pre-order traversal
+where each type is assigned the next id. Clearly the root of the type
+tree is always type id 0. Compound types have a field named subtypes
+that contains the list of their children’s type ids.</p>
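+
+<p>The pre-order numbering can be sketched as follows. The tuple-based schema
+representation and the function name <code>flatten_schema</code> are assumptions
+made for illustration only; they are not part of ORC.</p>

```python
def flatten_schema(root):
    """Flatten a (kind, children) tree into a list of types in pre-order.

    Each entry records its id, its kind, and the ids of its direct children.
    """
    types = []

    def visit(node):
        kind, children = node
        type_id = len(types)          # next free id, assigned before the children
        entry = {"id": type_id, "kind": kind, "subtypes": []}
        types.append(entry)
        for child in children:
            entry["subtypes"].append(visit(child))
        return type_id

    visit(root)
    return types


# The Foobar table from the DDL above, with the implicit root struct.
FOOBAR = ("STRUCT", [
    ("INT", []),
    ("MAP", [("STRING", []),
             ("STRUCT", [("STRING", []), ("DOUBLE", [])])]),
    ("TIMESTAMP", []),
])
```

+<p>For this schema the root struct receives id 0 and its direct children
+receive ids 1, 2, and 7.</p>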
+
+<p><code>message Type {
+ enum Kind {
+ BOOLEAN = 0;
+ BYTE = 1;
+ SHORT = 2;
+ INT = 3;
+ LONG = 4;
+ FLOAT = 5;
+ DOUBLE = 6;
+ STRING = 7;
+ BINARY = 8;
+ TIMESTAMP = 9;
+ LIST = 10;
+ MAP = 11;
+ STRUCT = 12;
+ UNION = 13;
+ DECIMAL = 14;
+ DATE = 15;
+ VARCHAR = 16;
+ CHAR = 17;
+ }
+ // the kind of this type
+ required Kind kind = 1;
+ // the type ids of any subcolumns for list, map, struct, or union
+ repeated uint32 subtypes = 2 [packed=true];
+ // the list of field names for struct
+ repeated string fieldNames = 3;
+ // the maximum length of the type for varchar or char in UTF-8 characters
+ optional uint32 maximumLength = 4;
+ // the precision and scale for decimal
+ optional uint32 precision = 5;
+ optional uint32 scale = 6;
+}
+</code></p>
+
+<h3 id="column-statistics">Column Statistics</h3>
+
+<p>For each column, the writer records the count of values and, depending on
+the type, other useful fields. For
+most of the primitive types, it records the minimum and maximum
+values; for numeric types it additionally stores the sum.
+From Hive 1.1.0 onwards, the column statistics will also record if
+there are any null values within the row group by setting the hasNull flag.
+The hasNull flag is used by ORC’s predicate pushdown to better answer
+‘IS NULL’ queries.</p>
+
+<p><code>message ColumnStatistics {
+ // the number of values
+ optional uint64 numberOfValues = 1;
+ // At most one of these has a value for any column
+ optional IntegerStatistics intStatistics = 2;
+ optional DoubleStatistics doubleStatistics = 3;
+ optional StringStatistics stringStatistics = 4;
+ optional BucketStatistics bucketStatistics = 5;
+ optional DecimalStatistics decimalStatistics = 6;
+ optional DateStatistics dateStatistics = 7;
+ optional BinaryStatistics binaryStatistics = 8;
+ optional TimestampStatistics timestampStatistics = 9;
+ optional bool hasNull = 10;
+}
+</code></p>
+
+<p>For integer types (tinyint, smallint, int, bigint), the column
+statistics includes the minimum, maximum, and sum. If the sum
+overflows long at any point during the calculation, no sum is
+recorded.</p>
+
+<p><code>message IntegerStatistics {
+ optional sint64 minimum = 1;
+ optional sint64 maximum = 2;
+ optional sint64 sum = 3;
+}
+</code></p>
+
+<p>For floating point types (float, double), the column statistics
+include the minimum, maximum, and sum. If the sum overflows a double,
+no sum is recorded.</p>
+
+<p><code>message DoubleStatistics {
+ optional double minimum = 1;
+ optional double maximum = 2;
+ optional double sum = 3;
+}
+</code></p>
+
+<p>For strings, the minimum value, maximum value, and the sum of the
+lengths of the values are recorded.</p>
+
+<p><code>message StringStatistics {
+ optional string minimum = 1;
+ optional string maximum = 2;
+ // sum will store the total length of all strings
+ optional sint64 sum = 3;
+}
+</code></p>
+
+<p>For booleans, the statistics include the count of false and true values.</p>
+
+<p><code>message BucketStatistics {
+ repeated uint64 count = 1 [packed=true];
+}
+</code></p>
+
+<p>For decimals, the minimum, maximum, and sum are stored.</p>
+
+<p><code>message DecimalStatistics {
+ optional string minimum = 1;
+ optional string maximum = 2;
+ optional string sum = 3;
+}
+</code></p>
+
+<p>Date columns record the minimum and maximum values as the number of
+days since the UNIX epoch (1/1/1970).</p>
+
+<p><code>message DateStatistics {
+ // min,max values saved as days since epoch
+ optional sint32 minimum = 1;
+ optional sint32 maximum = 2;
+}
+</code></p>
+
+<p>Timestamp columns record the minimum and maximum values as the number of
+milliseconds since the UNIX epoch (1/1/1970).</p>
+
+<p><code>message TimestampStatistics {
+ // min,max values saved as milliseconds since epoch
+ optional sint64 minimum = 1;
+ optional sint64 maximum = 2;
+}
+</code></p>
+
+<p>Binary columns store the aggregate number of bytes across all of the values.</p>
+
+<p><code>message BinaryStatistics {
+ // sum will store the total binary blob length
+ optional sint64 sum = 1;
+}
+</code></p>
+
+<h3 id="user-metadata">User Metadata</h3>
+
+<p>The user can add arbitrary key/value pairs to an ORC file as it is
+written. The contents of the keys and values are completely
+application defined, but the key is a string and the value is
+binary. Care should be taken by applications to make sure that their
+keys are unique and in general should be prefixed with an organization
+code.</p>
+
+<p><code>message UserMetadataItem {
+ // the user defined key
+ required string name = 1;
+ // the user defined binary value
+ required bytes value = 2;
+}
+</code></p>
+
+<h3 id="file-metadata">File Metadata</h3>
+
+<p>The file Metadata section contains column statistics at the stripe
+level granularity. These statistics enable input split elimination
+based on the predicate push-down evaluated per stripe.</p>
+
+<p><code>message StripeStatistics {
+ repeated ColumnStatistics colStats = 1;
+}
+</code></p>
+
+<p><code>message Metadata {
+ repeated StripeStatistics stripeStats = 1;
+}
+</code></p>
+
+<h1 id="compression">Compression</h1>
+
+<p>If the ORC file writer selects a generic compression codec (zlib or
+snappy), every part of the ORC file except for the Postscript is
+compressed with that codec. However, one of the requirements for ORC
+is that the reader be able to skip over compressed bytes without
+decompressing the entire stream. To manage this, ORC writes compressed
+streams in chunks with headers as in the figure below.
+To handle incompressible data, if the compressed data is larger than
+the original, the original is stored and the isOriginal flag is
+set. Each header is 3 bytes long with (compressedLength * 2 +
+isOriginal) stored as a little endian value. For example, the header
+for a chunk that compressed to 100,000 bytes would be [0x40, 0x0d,
+0x03]. The header for 5 bytes that did not compress would be [0x0b,
+0x00, 0x00]. Each compression chunk is compressed independently so
+that as long as a decompressor starts at the top of a header, it can
+start decompressing without the previous bytes.</p>
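+
+<p>The header arithmetic can be checked with a short sketch. The function names
+<code>encode_chunk_header</code> and <code>decode_chunk_header</code> are
+hypothetical and only model the 3 byte header, not the chunk body.</p>

```python
def encode_chunk_header(length: int, is_original: bool) -> bytes:
    """Build the 3 byte little endian header for one compression chunk."""
    value = length * 2 + (1 if is_original else 0)
    return value.to_bytes(3, "little")


def decode_chunk_header(header: bytes):
    """Return (chunk length in bytes, is_original) from a 3 byte header."""
    value = int.from_bytes(header[:3], "little")
    return value >> 1, bool(value & 1)
```

+<p>Because the low bit carries the isOriginal flag, a 3 byte header leaves 23
+bits for the chunk length.</p>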
+
+<p><img src="/img/CompressionStream.png" alt="compression streams" /></p>
+
+<p>The default compression chunk size is 256K, but writers can choose
+their own value. Larger chunks lead to better compression, but require
+more memory. The chunk size is recorded in the Postscript so that
+readers can allocate appropriately sized buffers. Readers are
+guaranteed that no chunk will expand to more than the compression chunk
+size.</p>
+
+<p>ORC files without generic compression write each stream directly
+with no headers.</p>
+
+<h1 id="run-length-encoding">Run Length Encoding</h1>
+
+<h2 id="base-128-varint">Base 128 Varint</h2>
+
+<p>Variable width integer encodings take advantage of the fact that most
+numbers are small and that having smaller encodings for small numbers
+shrinks the overall size of the data. ORC uses the varint format from
+Protocol Buffers, which writes data in little endian format using the
+low 7 bits of each byte. The high bit in each byte is set if the
+number continues into the next byte.</p>
+
+<table>
+ <thead>
+ <tr>
+ <th style="text-align: left">Unsigned Original</th>
+ <th style="text-align: left">Serialized</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td style="text-align: left">0</td>
+ <td style="text-align: left">0x00</td>
+ </tr>
+ <tr>
+ <td style="text-align: left">1</td>
+ <td style="text-align: left">0x01</td>
+ </tr>
+ <tr>
+ <td style="text-align: left">127</td>
+ <td style="text-align: left">0x7f</td>
+ </tr>
+ <tr>
+ <td style="text-align: left">128</td>
+ <td style="text-align: left">0x80, 0x01</td>
+ </tr>
+ <tr>
+ <td style="text-align: left">129</td>
+ <td style="text-align: left">0x81, 0x01</td>
+ </tr>
+ <tr>
+ <td style="text-align: left">16,383</td>
+ <td style="text-align: left">0xff, 0x7f</td>
+ </tr>
+ <tr>
+ <td style="text-align: left">16,384</td>
+ <td style="text-align: left">0x80, 0x80, 0x01</td>
+ </tr>
+ <tr>
+ <td style="text-align: left">16,385</td>
+ <td style="text-align: left">0x81, 0x80, 0x01</td>
+ </tr>
+ </tbody>
+</table>
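+
+<p>A minimal sketch of the unsigned varint rules follows; the function names are
+illustrative and the code is not taken from the ORC source.</p>

```python
def encode_varint(value: int) -> bytes:
    """Serialize a non-negative integer as a base 128 varint."""
    if value < 0:
        raise ValueError("unsigned varints cannot encode negative numbers")
    out = bytearray()
    while True:
        low = value & 0x7F            # low 7 bits go out first (little endian)
        value >>= 7
        if value:
            out.append(low | 0x80)    # high bit set: more bytes follow
        else:
            out.append(low)
            return bytes(out)


def decode_varint(data: bytes, pos: int = 0):
    """Return (value, next position) for the varint starting at pos."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte < 0x80:
            return result, pos
        shift += 7
```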
+
+<p>For signed integer types, the number is converted into an unsigned
+number using a zigzag encoding. Zigzag encoding moves the sign bit to
+the least significant bit using the expression (val << 1) ^ (val >>
+63) and derives its name from the fact that positive and negative
+numbers alternate once encoded. The unsigned number is then serialized
+as above.</p>
+
+<table>
+ <thead>
+ <tr>
+ <th style="text-align: left">Signed Original</th>
+ <th style="text-align: left">Unsigned</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td style="text-align: left">0</td>
+ <td style="text-align: left">0</td>
+ </tr>
+ <tr>
+ <td style="text-align: left">-1</td>
+ <td style="text-align: left">1</td>
+ </tr>
+ <tr>
+ <td style="text-align: left">1</td>
+ <td style="text-align: left">2</td>
+ </tr>
+ <tr>
+ <td style="text-align: left">-2</td>
+ <td style="text-align: left">3</td>
+ </tr>
+ <tr>
+ <td style="text-align: left">2</td>
+ <td style="text-align: left">4</td>
+ </tr>
+ </tbody>
+</table>
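+
+<p>The zigzag mapping for 64 bit values can be sketched as below. Python integers
+are unbounded, so the mask to 64 bits is an assumption added to mimic a fixed
+width long; the function names are illustrative.</p>

```python
MASK_64 = (1 << 64) - 1


def zigzag_encode(val: int) -> int:
    """Map a signed 64 bit value to unsigned: 0, -1, 1, -2, 2 become 0, 1, 2, 3, 4."""
    return ((val << 1) ^ (val >> 63)) & MASK_64


def zigzag_decode(encoded: int) -> int:
    """Invert zigzag_encode."""
    return (encoded >> 1) ^ -(encoded & 1)
```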
+
+<h2 id="byte-run-length-encoding">Byte Run Length Encoding</h2>
+
+<p>For byte streams, ORC uses a very light weight encoding of identical
+values.</p>
+
+<ul>
+ <li>Run - a sequence of at least 3 identical values</li>
+ <li>Literals - a sequence of non-identical values</li>
+</ul>
+
+<p>The first byte of each group of values is a header that determines
+whether it is a run (value between 0 and 127) or literal list (value
+between -128 and -1). For runs, the control byte is the length of the
+run minus the length of the minimal run (3) and the control byte for
+literal lists is the negative length of the list. For example, a
+hundred 0’s is encoded as [0x61, 0x00] and the sequence 0x44, 0x45
+would be encoded as [0xfe, 0x44, 0x45]. The next group can choose
+either of the encodings.</p>
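+
+<p>A decoder for this scheme fits in a few lines. The sketch below is an
+illustration under the rules just described, with a hypothetical function name,
+and it does no validation of truncated input.</p>

```python
def decode_byte_rle(data: bytes) -> bytes:
    """Expand a byte run length encoded stream."""
    out = bytearray()
    pos = 0
    while pos < len(data):
        control = data[pos]
        pos += 1
        if control < 0x80:
            # Run: control is (run length - 3), followed by the repeated byte.
            out.extend(data[pos:pos + 1] * (control + 3))
            pos += 1
        else:
            # Literals: control is the negative count as a signed byte.
            count = 0x100 - control
            out.extend(data[pos:pos + count])
            pos += count
    return bytes(out)
```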
+
+<h2 id="boolean-run-length-encoding">Boolean Run Length Encoding</h2>
+
+<p>For encoding boolean types, the bits are put in the bytes from most
+significant to least significant. The bytes are encoded using byte run
+length encoding as described in the previous section. For example,
+the byte sequence [0xff, 0x80] would be one true followed by
+seven false values.</p>
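+
+<p>Once the byte run length layer has been decoded, unpacking the bits is
+straightforward. In this sketch the function name and the explicit
+<code>count</code> argument are assumptions: the caller is expected to know how
+many values the stream holds, since the final byte may be padded.</p>

```python
def unpack_booleans(packed: bytes, count: int):
    """Read count booleans from packed bytes, most significant bit first."""
    return [bool((packed[i // 8] >> (7 - i % 8)) & 1) for i in range(count)]
```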
+
+<h2 id="integer-run-length-encoding-version-1">Integer Run Length Encoding, version 1</h2>
+
+<p>In Hive 0.11 ORC files used Run Length Encoding version 1 (RLEv1),
+which provides a lightweight compression of signed or unsigned integer
+sequences. RLEv1 has two sub-encodings:</p>
+
+<ul>
+ <li>Run - a sequence of values that differ by a small fixed delta</li>
+ <li>Literals - a sequence of varint encoded values</li>
+</ul>
+
+<p>Runs start with an initial byte of 0x00 to 0x7f, which encodes the
+length of the run - 3. A second byte provides the fixed delta in the
+range of -128 to 127. Finally, the first value of the run is encoded
+as a base 128 varint.</p>
+
+<p>For example, if the sequence is 100 instances of 7 the encoding would
+start with 100 - 3, followed by a delta of 0, and a varint of 7 for
+an encoding of [0x61, 0x00, 0x07]. To encode the sequence of numbers
+running from 100 to 1, the first byte is 100 - 3, the delta is -1,
+and the varint is 100 for an encoding of [0x61, 0xff, 0x64].</p>
+
+<p>Literals start with an initial byte of 0x80 to 0xff, which corresponds
+to the negative of number of literals in the sequence. Following the
+header byte, the list of N varints is encoded. Thus, if there are
+no runs, the overhead is 1 byte for each 128 integers. The first 5
+prime numbers [2, 3, 5, 7, 11] would be encoded as [0xfb, 0x02, 0x03,
+0x05, 0x07, 0x0b].</p>
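+
+<p>The two RLEv1 sub-encodings can be decoded with the sketch below. It is an
+illustration of the rules in this section, with hypothetical function names, and
+it assumes that signed columns zigzag encode their varints as described earlier.</p>

```python
def read_varint(data, pos, signed):
    """Read one base 128 varint, undoing zigzag for signed columns."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte < 0x80:
            break
        shift += 7
    if signed:
        result = (result >> 1) ^ -(result & 1)
    return result, pos


def decode_rle_v1(data, signed=False):
    """Expand an RLEv1 integer stream into a list of values."""
    values = []
    pos = 0
    while pos < len(data):
        control = data[pos]
        pos += 1
        if control < 0x80:
            # Run: length - 3, then a signed one byte delta, then the first value.
            run_length = control + 3
            delta = data[pos] - 0x100 if data[pos] >= 0x80 else data[pos]
            pos += 1
            first, pos = read_varint(data, pos, signed)
            values.extend(first + i * delta for i in range(run_length))
        else:
            # Literals: negative count, then that many varints.
            for _ in range(0x100 - control):
                value, pos = read_varint(data, pos, signed)
                values.append(value)
    return values
```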
+
+<h2 id="integer-run-length-encoding-version-2">Integer Run Length Encoding, version 2</h2>
+
+<p>In Hive 0.12, ORC introduced Run Length Encoding version 2 (RLEv2),
+which has improved compression and fixed bit width encodings for
+faster expansion. RLEv2 uses four sub-encodings based on the data:</p>
+
+<ul>
+ <li>Short Repeat - used for short sequences with repeated values</li>
+ <li>Direct - used for random sequences with a fixed bit width</li>
+ <li>Patched Base - used for random sequences with a variable bit width</li>
+ <li>Delta - used for monotonically increasing or decreasing sequences</li>
+</ul>
+
+<h3 id="short-repeat">Short Repeat</h3>
+
+<p>The short repeat encoding is used for short repeating integer
+sequences with the goal of minimizing the overhead of the header. All
+of the bits listed in the header are from the first byte to the last
+and from most significant bit to least significant bit. If the type is
+signed, the value is zigzag encoded.</p>
+
+<ul>
+ <li>1 byte header
+ <ul>
+ <li>2 bits for encoding type (0)</li>
+ <li>3 bits for width (W) of repeating value (1 to 8 bytes)</li>
+ <li>3 bits for repeat count (3 to 10 values)</li>
+ </ul>
+ </li>
+ <li>W bytes in big endian format, which is zigzag encoded if the type
+is signed</li>
+</ul>
+
+<p>The unsigned sequence of [10000, 10000, 10000, 10000, 10000] would be
+serialized with short repeat encoding (0), a width of 2 bytes (1), and
+repeat count of 5 (2) as [0x0a, 0x27, 0x10].</p>
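+
+<p>The header layout can be exercised with a small decoder. This is a sketch with
+a hypothetical function name that handles a single short repeat group.</p>

```python
def decode_short_repeat(data: bytes, signed: bool = False):
    """Decode one RLEv2 short repeat group into a list of values."""
    header = data[0]
    if header >> 6 != 0:
        raise ValueError("not a short repeat group")
    width = ((header >> 3) & 0x07) + 1     # value width in bytes, 1 to 8
    count = (header & 0x07) + 3            # repeat count, 3 to 10
    value = int.from_bytes(data[1:1 + width], "big")
    if signed:
        value = (value >> 1) ^ -(value & 1)   # undo zigzag
    return [value] * count
```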
+
+<h3 id="direct">Direct</h3>
+
+<p>The direct encoding is used for integer sequences whose values have a
+relatively constant bit width. It encodes the values directly using a
+fixed width big endian encoding. The width of the values is encoded
+using the table below.</p>
+
+<p>The 5 bit width encoding table for RLEv2:</p>
+
+<table>
+ <thead>
+ <tr>
+ <th style="text-align: left">Width in Bits</th>
+ <th style="text-align: left">Encoded Value</th>
+ <th style="text-align: left">Notes</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td style="text-align: left">0</td>
+ <td style="text-align: left">0</td>
+ <td style="text-align: left">for delta encoding</td>
+ </tr>
+ <tr>
+ <td style="text-align: left">1</td>
+ <td style="text-align: left">0</td>
+ <td style="text-align: left">for non-delta encoding</td>
+ </tr>
+ <tr>
+ <td style="text-align: left">2</td>
+ <td style="text-align: left">1</td>
+ <td style="text-align: left"> </td>
+ </tr>
+ <tr>
+ <td style="text-align: left">4</td>
+ <td style="text-align: left">3</td>
+ <td style="text-align: left"> </td>
+ </tr>
+ <tr>
+ <td style="text-align: left">8</td>
+ <td style="text-align: left">7</td>
+ <td style="text-align: left"> </td>
+ </tr>
+ <tr>
+ <td style="text-align: left">16</td>
+ <td style="text-align: left">15</td>
+ <td style="text-align: left"> </td>
+ </tr>
+ <tr>
+ <td style="text-align: left">24</td>
+ <td style="text-align: left">23</td>
+ <td style="text-align: left"> </td>
+ </tr>
+ <tr>
+ <td style="text-align: left">32</td>
+ <td style="text-align: left">27</td>
+ <td style="text-align: left"> </td>
+ </tr>
+ <tr>
+ <td style="text-align: left">40</td>
+ <td style="text-align: left">28</td>
+ <td style="text-align: left"> </td>
+ </tr>
+ <tr>
+ <td style="text-align: left">48</td>
+ <td style="text-align: left">29</td>
+ <td style="text-align: left"> </td>
+ </tr>
+ <tr>
+ <td style="text-align: left">56</td>
+ <td style="text-align: left">30</td>
+ <td style="text-align: left"> </td>
+ </tr>
+ <tr>
+ <td style="text-align: left">64</td>
+ <td style="text-align: left">31</td>
+ <td style="text-align: left"> </td>
+ </tr>
+ <tr>
+ <td style="text-align: left">3</td>
+ <td style="text-align: left">2</td>
+ <td style="text-align: left">deprecated</td>
+ </tr>
+ <tr>
+ <td style="text-align: left">5 <= x <= 7</td>
+ <td style="text-align: left">x - 1</td>
+ <td style="text-align: left">deprecated</td>
+ </tr>
+ <tr>
+ <td style="text-align: left">9 <= x <= 15</td>
+ <td style="text-align: left">x - 1</td>
+ <td style="text-align: left">deprecated</td>
+ </tr>
+ <tr>
+ <td style="text-align: left">17 <= x <= 21</td>
+ <td style="text-align: left">x - 1</td>
+ <td style="text-align: left">deprecated</td>
+ </tr>
+ <tr>
+ <td style="text-align: left">26</td>
+ <td style="text-align: left">24</td>
+ <td style="text-align: left">deprecated</td>
+ </tr>
+ <tr>
+ <td style="text-align: left">28</td>
+ <td style="text-align: left">25</td>
+ <td style="text-align: left">deprecated</td>
+ </tr>
+ <tr>
+ <td style="text-align: left">30</td>
+ <td style="text-align: left">26</td>
+ <td style="text-align: left">deprecated</td>
+ </tr>
+ </tbody>
+</table>
+
+<ul>
+ <li>2 bytes header
+ <ul>
+ <li>2 bits for encoding type (1)</li>
+ <li>5 bits for encoded width (W) of values (1 to 64 bits) using the 5 bit
+width encoding table</li>
+ <li>9 bits for length (L) (1 to 512 values)</li>
+ </ul>
+ </li>
+ <li>W * L bits (padded to the next byte) encoded in big endian format, which is
+zigzag encoded if the type is signed</li>
+</ul>
+
+<p>The unsigned sequence of [23713, 43806, 57005, 48879] would be
+serialized with direct encoding (1), a width of 16 bits (15), and
+length of 4 (3) as [0x5e, 0x03, 0x5c, 0xa1, 0xab, 0x1e, 0xde, 0xad,
+0xbe, 0xef].</p>
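+
+<p>The following sketch decodes one direct group. The function names are
+hypothetical, and <code>decode_width</code> assumes that every encoded value up to
+23 maps to a width of one more bit, which covers the table rows above but is an
+assumption for the two encoded values the table does not list.</p>

```python
WIDTHS_ABOVE_24 = {24: 26, 25: 28, 26: 30, 27: 32, 28: 40, 29: 48, 30: 56, 31: 64}


def decode_width(code: int) -> int:
    """Translate the 5 bit encoded width into a width in bits (non-delta use)."""
    return code + 1 if code <= 23 else WIDTHS_ABOVE_24[code]


def decode_direct(data: bytes, signed: bool = False):
    """Decode one RLEv2 direct group into a list of values."""
    if data[0] >> 6 != 1:
        raise ValueError("not a direct group")
    width = decode_width((data[0] >> 1) & 0x1F)
    length = (((data[0] & 0x01) << 8) | data[1]) + 1   # 9 bit length, stored minus 1
    nbytes = (width * length + 7) // 8                 # payload is padded to a byte
    bits = int.from_bytes(data[2:2 + nbytes], "big") >> (nbytes * 8 - width * length)
    mask = (1 << width) - 1
    values = [(bits >> (width * (length - 1 - i))) & mask for i in range(length)]
    if signed:
        values = [(v >> 1) ^ -(v & 1) for v in values]   # undo zigzag
    return values
```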
+
+<h3 id="patched-base">Patched Base</h3>
+
+<p>The patched base encoding is used for integer sequences whose bit
+widths vary a lot. The minimum signed value of the sequence is found
+and subtracted from the other values. The bit width of those adjusted
+values is analyzed and the 90th percentile of the bit width is chosen
+as W. The 10% of values wider than W bits use patches from a patch list
+to set the additional bits. Patches are encoded as a list of gaps in
+the index values and the additional value bits.</p>
+
+<ul>
+ <li>4 bytes header
+ <ul>
+ <li>2 bits for encoding type (2)</li>
+ <li>5 bits for encoded width (W) of values (1 to 64 bits) using the 5 bit
+ width encoding table</li>
+ <li>9 bits for length (L) (1 to 512 values)</li>
+ <li>3 bits for base value width (BW) (1 to 8 bytes)</li>
+ <li>5 bits for patch width (PW) (1 to 64 bits) using the 5 bit width
+encoding table</li>
+ <li>3 bits for patch gap width (PGW) (1 to 8 bits)</li>
+ <li>5 bits for patch list length (PLL) (0 to 31 patches)</li>
+ </ul>
+ </li>
+ <li>Base value (BW bytes) - The base value is stored as a big endian value
+with negative values marked by the most significant bit set. If that
+bit is set, the entire value is negated.</li>
+ <li>Data values (W * L bits padded to the byte) - A sequence of W bit positive
+values that are added to the base value.</li>
+ <li>Patch list (PLL * (PGW + PW) bytes) - A list of patches for values
+that didn’t fit within W bits. Each entry in the list consists of a
+gap, which is the number of elements skipped from the previous
+patch, and a patch value. Patches are applied by logically or’ing
+the data values with the relevant patch shifted W bits left. If a
+patch is 0, it was introduced to skip over more than 255 items. The
+combined length of each patch (PGW + PW) must be less than or equal to
+64.</li>
+</ul>
+
+<p>The unsigned sequence of [2030, 2000, 2020, 1000000, 2040, 2050, 2060, 2070,
+2080, 2090, 2100, 2110, 2120, 2130, 2140, 2150, 2160, 2170, 2180, 2190]
+has a minimum of 2000, which makes the adjusted
+sequence [30, 0, 20, 998000, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140,
+150, 160, 170, 180, 190]. It has an
+encoding of patched base (2), a bit width of 8 (7), a length of 20
+(19), a base value width of 2 bytes (1), a patch width of 12 bits (11),
+patch gap width of 2 bits (1), and a patch list length of 1 (1). The
+base value is 2000 and the combined result is [0x8e, 0x13, 0x2b, 0x21, 0x07,
+0xd0, 0x1e, 0x00, 0x14, 0x70, 0x28, 0x32, 0x3c, 0x46, 0x50, 0x5a, 0x64, 0x6e,
+0x78, 0x82, 0x8c, 0x96, 0xa0, 0xaa, 0xb4, 0xbe, 0xfc, 0xe8].</p>
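+
+<p>A decoder for the worked example might look like the sketch below. It is an
+illustration only, with hypothetical function names. It assumes that each patch
+entry occupies exactly PGW + PW bits, that the patch list as a whole is padded to
+a byte, and that gaps count from the previous patch position; a production reader
+may pack patch entries differently for wide patches.</p>

```python
WIDTHS_ABOVE_24 = {24: 26, 25: 28, 26: 30, 27: 32, 28: 40, 29: 48, 30: 56, 31: 64}


def decode_width(code):
    """Translate the 5 bit encoded width into a width in bits."""
    return code + 1 if code <= 23 else WIDTHS_ABOVE_24[code]


def read_bits(data, pos, width, count):
    """Read count big endian fields of width bits; return (fields, next pos)."""
    nbytes = (width * count + 7) // 8
    bits = int.from_bytes(data[pos:pos + nbytes], "big") >> (nbytes * 8 - width * count)
    mask = (1 << width) - 1
    fields = [(bits >> (width * (count - 1 - i))) & mask for i in range(count)]
    return fields, pos + nbytes


def decode_patched_base(data):
    """Decode one RLEv2 patched base group into a list of values."""
    if data[0] >> 6 != 2:
        raise ValueError("not a patched base group")
    width = decode_width((data[0] >> 1) & 0x1F)
    length = (((data[0] & 0x01) << 8) | data[1]) + 1
    base_bytes = (data[2] >> 5) + 1
    patch_width = decode_width(data[2] & 0x1F)
    gap_width = (data[3] >> 5) + 1
    patch_count = data[3] & 0x1F

    pos = 4
    base = int.from_bytes(data[pos:pos + base_bytes], "big")
    pos += base_bytes
    sign_bit = 1 << (base_bytes * 8 - 1)
    if base & sign_bit:                     # top bit marks a negative base
        base = -(base & (sign_bit - 1))

    values, pos = read_bits(data, pos, width, length)
    patches, pos = read_bits(data, pos, gap_width + patch_width, patch_count)

    index = 0
    for entry in patches:
        index += entry >> patch_width                       # gap from previous patch
        patch = entry & ((1 << patch_width) - 1)
        values[index] |= patch << width                     # restore the high bits
    return [base + v for v in values]
```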
+
+<h3 id="delta">Delta</h3>
+
+<p>The Delta encoding is used for monotonically increasing or decreasing
+sequences. The first two numbers in the sequence cannot be identical,
+because the encoding uses the sign of the first delta to determine
+if the series is increasing or decreasing.</p>
+
+<ul>
+ <li>2 bytes header
+ <ul>
+ <li>2 bits for encoding type (3)</li>
+ <li>5 bits for encoded width (W) of deltas (0 to 64 bits) using the 5 bit
+width encoding table</li>
+ <li>9 bits for run length (L) (1 to 512 values)</li>
+ </ul>
+ </li>
+ <li>Base value - encoded as (signed or unsigned) varint</li>
+ <li>Delta base - encoded as signed varint</li>
+ <li>Delta values (W * (L - 2) bits padded to the byte) - encode each delta after the first
+one. If the delta base is positive, the sequence is increasing and if it is
+negative the sequence is decreasing.</li>
+</ul>
+
+<p>The unsigned sequence of [2, 3, 5, 7, 11, 13, 17, 19, 23, 29] would be
+serialized with delta encoding (3), a width of 4 bits (3), length of
+10 (9), a base of 2 (2), and first delta of 1 (2). The resulting
+sequence is [0xc6, 0x09, 0x02, 0x02, 0x22, 0x42, 0x42, 0x46].</p>
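+
+<p>The delta example can be reproduced with the sketch below. The function names
+are hypothetical. The code assumes that an encoded width of 0 means a fixed delta
+with no delta values stored (the "for delta encoding" row of the width table) and
+that the stored deltas are unsigned magnitudes whose direction follows the sign of
+the delta base.</p>

```python
WIDTHS_ABOVE_24 = {24: 26, 25: 28, 26: 30, 27: 32, 28: 40, 29: 48, 30: 56, 31: 64}


def read_varint(data, pos):
    """Read one unsigned base 128 varint; return (value, next pos)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte < 0x80:
            return result, pos
        shift += 7


def unzigzag(value):
    return (value >> 1) ^ -(value & 1)


def decode_delta(data, signed=False):
    """Decode one RLEv2 delta group into a list of values."""
    if data[0] >> 6 != 3:
        raise ValueError("not a delta group")
    code = (data[0] >> 1) & 0x1F
    if code == 0:
        width = 0                           # fixed delta, nothing stored per value
    else:
        width = code + 1 if code <= 23 else WIDTHS_ABOVE_24[code]
    length = (((data[0] & 0x01) << 8) | data[1]) + 1

    base, pos = read_varint(data, 2)
    if signed:
        base = unzigzag(base)
    delta_base, pos = read_varint(data, pos)
    delta_base = unzigzag(delta_base)       # the delta base is always signed

    values = [base]
    if length > 1:
        values.append(base + delta_base)
    remaining = max(length - 2, 0)
    if width == 0:
        for _ in range(remaining):
            values.append(values[-1] + delta_base)
        return values

    nbytes = (width * remaining + 7) // 8
    bits = int.from_bytes(data[pos:pos + nbytes], "big") >> (nbytes * 8 - width * remaining)
    direction = -1 if delta_base < 0 else 1
    for i in range(remaining):
        delta = (bits >> (width * (remaining - 1 - i))) & ((1 << width) - 1)
        values.append(values[-1] + direction * delta)
    return values
```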
+
+<h1 id="stripes">Stripes</h1>
+
+<p>The body of ORC files consists of a series of stripes. Stripes are
+large (typically ~200MB) and independent of each other and are often
+processed by different tasks. The defining characteristic for columnar
+storage formats is that the data for each column is stored separately
+and that reading data out of the file should be proportional to the
+number of columns read.</p>
+
+<p>In ORC files, each column is stored in several streams that are stored
+next to each other in the file. For example, an integer column is
+represented as two streams: PRESENT, which uses one bit per
+value to record whether the value is non-null, and DATA, which records the
+non-null values. If all of a column’s values in a stripe are non-null,
+the PRESENT stream is omitted from the stripe. For binary data, ORC
+uses three streams: PRESENT, DATA, and LENGTH, which stores the length
+of each value. The details of each type will be presented in the
+following subsections.</p>
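+<p>As a rough illustration of the PRESENT and DATA split for an integer
+column, the sketch below separates a list of nullable values into the two
+logical streams before any run length encoding is applied. The function name
+<code>split_column</code> is hypothetical and <code>None</code> stands in
+for a null value.</p>

```python
def split_column(values):
    """Split nullable values into logical PRESENT bits and non-null DATA."""
    present = [value is not None for value in values]
    data = [value for value in values if value is not None]
    if all(present):
        # Every value is non-null, so the PRESENT stream is omitted.
        present = None
    return present, data


print(split_column([7, None, 9]))
print(split_column([1, 2]))
```

+<p>Only the non-null values reach DATA, so a reader has to consult PRESENT
+to know which rows the DATA values belong to.</p>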
+
+<h2 id="stripe-footer">Stripe Footer</h2>
+
+<p>The stripe footer contains the encoding of each column and the
+directory of the streams including their location.</p>
+
+<p><code>message StripeFooter {
+ // the location of each stream
+ repeated Stream streams = 1;
+ // the encoding of each column
+ repeated ColumnEncoding columns = 2;
+}
+</code></p>
+
+<p>To describe each stream, ORC stores the kind of stream, the column id,
+and the stream’s size in bytes. The details of what is stored in each stream
+depend on the type and encoding of the column.</p>
+
+<p><code>message Stream {
+ enum Kind {
+ // boolean stream of whether the next value is non-null
+ PRESENT = 0;
+ // the primary data stream
+ DATA = 1;
+ // the length of each value for variable length data
+ LENGTH = 2;
+ // the dictionary blob
+ DICTIONARY_DATA = 3;
+ // deprecated prior to Hive 0.11
+ // It was used to store the number of instances of each value in the
+ // dictionary
+ DICTIONARY_COUNT = 4;
+ // a secondary data stream
+ SECONDARY = 5;
+ // the index for seeking to particular row groups
+ ROW_INDEX = 6;
+ // original bloom filters used before ORC-101
+ BLOOM_FILTER = 7;
+ // bloom filters that consistently use utf8
+ BLOOM_FILTER_UTF8 = 8;
+ }
+ required Kind kind = 1;
+ // the column id
+ optional uint32 column = 2;
+ // the number of bytes in the file
+ optional uint64 length = 3;
+}
+</code></p>
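+<p>The Stream message records a length but no offset. A minimal sketch of
+how a reader might derive each stream’s location follows, assuming the
+streams are laid out back to back from the start of the stripe in the order
+the footer lists them. The <code>Stream</code> dataclass and
+<code>stream_offsets</code> are hypothetical stand-ins for the protobuf
+message and reader logic.</p>

```python
from dataclasses import dataclass


@dataclass
class Stream:
    kind: str    # e.g. "PRESENT", "DATA", "LENGTH"
    column: int  # the column id
    length: int  # number of bytes in the file


def stream_offsets(stripe_offset, streams):
    """Return (kind, column, absolute offset) by accumulating lengths."""
    offsets = []
    position = stripe_offset
    for stream in streams:
        offsets.append((stream.kind, stream.column, position))
        position += stream.length
    return offsets


streams = [Stream("ROW_INDEX", 1, 20), Stream("PRESENT", 1, 5), Stream("DATA", 1, 100)]
for entry in stream_offsets(1000, streams):
    print(entry)
```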
+
+<p>Depending on the column’s type, several encodings are possible. The
+encodings are divided into direct and dictionary-based categories and
+further refined by whether they use RLE v1 or v2.</p>
+
+<p><code>message ColumnEncoding {
+ enum Kind {
+ // the encoding is mapped directly to the stream using RLE v1
+ DIRECT = 0;
+ // the encoding uses a dictionary of unique values using RLE v1
+ DICTIONARY = 1;
+ // the encoding is direct using RLE v2
+ DIRECT_V2 = 2;
+ // the encoding is dictionary-based using RLE v2
+ DICTIONARY_V2 = 3;
+ }
+ required Kind kind = 1;
+ // for dictionary encodings, record the size of the dictionary
+ optional uint32 dictionarySize = 2;
+}
+</code></p>
+
+<h1 id="column-encodings">Column Encodings</h1>
+
+<h2 id="smallint-int-and-bigint-columns">SmallInt, Int, and BigInt Columns</h2>
+
+<p>All of the 16, 32, and 64 bit integer column types use the same set of
+potential encodings, which differ only in whether they use RLE v1 or
+v2. If the PRESENT stream is not included, all of the values are
+present. For values that have false bits in the PRESENT stream, no
+values are included in the DATA stream.</p>
+
+<table>
+ <thead>
+ <tr>
+ <th style="text-align: left">Encoding</th>
+ <th style="text-align: left">Stream Kind</th>
+ <th style="text-align: left">Optional</th>
+ <th style="text-align: left">Contents</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td style="text-align: left">DIRECT</td>
+ <td style="text-align: left">PRESENT</td>
+ <td style="text-align: left">Yes</td>
+ <td style="text-align: left">Boolean RLE</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"> </td>
+ <td style="text-align: left">DATA</td>
+ <td style="text-align: left">No</td>
+ <td style="text-align: left">Signed Integer RLE v1</td>
+ </tr>
+ <tr>
+ <td style="text-align: left">DIRECT_V2</td>
+ <td style="text-align: left">PRESENT</td>
+ <td style="text-align: left">Yes</td>
+ <td style="text-align: left">Boolean RLE</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"> </td>
+ <td style="text-align: left">DATA</td>
+ <td style="text-align: left">No</td>
+ <td style="text-align: left">Signed Integer RLE v2</td>
+ </tr>
+ </tbody>
+</table>
+
+<h2 id="float-and-double-columns">Float and Double Columns</h2>
+
+<p>Floating point types are stored using IEEE 754 floating point bit
+layout. Float columns use 4 bytes per value and double columns use 8
+bytes.</p>
+
+<table>
+ <thead>
+ <tr>
+ <th style="text-align: left">Encoding</th>
+ <th style="text-align: left">Stream Kind</th>
+ <th style="text-align: left">Optional</th>
+ <th style="text-align: left">Contents</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td style="text-align: left">DIRECT</td>
+ <td style="text-align: left">PRESENT</td>
+ <td style="text-align: left">Yes</td>
+ <td style="text-align: left">Boolean RLE</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"> </td>
+ <td style="text-align: left">DATA</td>
+ <td style="text-align: left">No</td>
+ <td style="text-align: left">IEEE 754 floating point representation</td>
+ </tr>
+ </tbody>
+</table>
+
+<h2 id="string-char-and-varchar-columns">String, Char, and VarChar Columns</h2>
+
+<p>String, char, and varchar columns may be encoded either using a
+dictionary encoding or a direct encoding. A direct encoding should be
+preferred when there are many distinct values. In all of the
+encodings, the PRESENT stream encodes whether the value is null. The
+Java ORC writer automatically picks the encoding after the first row
+group (10,000 rows).</p>
+
+<p>For direct encoding the UTF-8 bytes are saved in the DATA stream and
+the length of each value is written into the LENGTH stream. In direct
+encoding, if the values were [“Nevada”, “California”], the DATA
+would be “NevadaCalifornia” and the LENGTH would be [6, 10].</p>
+
+<p>For dictionary encodings the dictionary is sorted and UTF-8 bytes of
+each unique value are placed into DICTIONARY_DATA. The length of each
+item in the dictionary is put into the LENGTH stream. The DATA stream
+consists of the sequence of references to the dictionary elements.</p>
+
+<p>In dictionary encoding, if the values were [“Nevada”,
+“California”, “Nevada”, “California”, “Florida”], the
+DICTIONARY_DATA would be “CaliforniaFloridaNevada” and LENGTH would
+be [10, 7, 6]. The DATA would be [2, 0, 2, 0, 1].</p>
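+<p>Both string examples can be reproduced with the sketch below. It shows
+only the logical stream contents before run length encoding. The function
+names are hypothetical, and sorting the dictionary by UTF-8 bytes is an
+assumption about the sort order.</p>

```python
def direct_encode(values):
    """Return (DATA bytes, LENGTH values) for direct string encoding."""
    encoded = [value.encode("utf-8") for value in values]
    return b"".join(encoded), [len(item) for item in encoded]


def dictionary_encode(values):
    """Return (DICTIONARY_DATA bytes, LENGTH values, DATA references)."""
    # Assumption: the dictionary is sorted by the UTF-8 bytes of each value.
    dictionary = sorted(set(values), key=lambda value: value.encode("utf-8"))
    index = {value: position for position, value in enumerate(dictionary)}
    encoded = [value.encode("utf-8") for value in dictionary]
    lengths = [len(item) for item in encoded]
    data = [index[value] for value in values]
    return b"".join(encoded), lengths, data


print(direct_encode(["Nevada", "California"]))
print(dictionary_encode(["Nevada", "California", "Nevada", "California", "Florida"]))
```

+<p>In the dictionary case LENGTH describes the dictionary entries, not the
+rows, which is why it has three entries for five values.</p>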
+
+<table>
+ <thead>
+ <tr>
+ <th style="text-align: left">Encoding</th>
+ <th style="text-align: left">Stream Kind</th>
+ <th style="text-align: left">Optional</th>
+ <th style="text-align: left">Contents</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td style="text-align: left">DIRECT</td>
+ <td style="text-align: left">PRESENT</td>
+ <td style="text-align: left">Yes</td>
+ <td style="text-align: left">Boolean RLE</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"> </td>
+ <td style="text-align: left">DATA</td>
+ <td style="text-align: left">No</td>
+ <td style="text-align: left">String contents</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"> </td>
+ <td style="text-align: left">LENGTH</td>
+ <td style="text-align: left">No</td>
+ <td style="text-align: left">Unsigned Integer RLE v1</td>
+ </tr>
+ <tr>
+ <td style="text-align: left">DICTIONARY</td>
+ <td style="text-align: left">PRESENT</td>
+ <td style="text-align: left">Yes</td>
+ <td style="text-align: left">Boolean RLE</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"> </td>
+ <td style="text-align: left">DATA</td>
+ <td style="text-align: left">No</td>
+ <td style="text-align: left">Unsigned Integer RLE v1</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"> </td>
+ <td style="text-align: left">DICTIONARY_DATA</td>
+ <td style="text-align: left">No</td>
+ <td style="text-align: left">String contents</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"> </td>
+ <td style="text-align: left">LENGTH</td>
+ <td style="text-align: left">No</td>
+ <td style="text-align: left">Unsigned Integer RLE v1</td>
+ </tr>
+ <tr>
+ <td style="text-align: left">DIRECT_V2</td>
+ <td style="text-align: left">PRESENT</td>
+ <td style="text-align: left">Yes</td>
+ <td style="text-align: left">Boolean RLE</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"> </td>
+ <td style="text-align: left">DATA</td>
+ <td style="text-align: left">No</td>
+ <td style="text-align: left">String contents</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"> </td>
+ <td style="text-align: left">LENGTH</td>
+ <td style="text-align: left">No</td>
+ <td style="text-align: left">Unsigned Integer RLE v2</td>
+ </tr>
+ <tr>
+ <td style="text-align: left">DICTIONARY_V2</td>
+ <td style="text-align: left">PRESENT</td>
+ <td style="text-align: left">Yes</td>
+ <td style="text-align: left">Boolean RLE</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"> </td>
+ <td style="text-align: left">DATA</td>
+ <td style="text-align: left">No</td>
+ <td style="text-align: left">Unsigned Integer RLE v2</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"> </td>
+ <td style="text-align: left">DICTIONARY_DATA</td>
+ <td style="text-align: left">No</td>
+ <td style="text-align: left">String contents</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"> </td>
+ <td style="text-align: left">LENGTH</td>
+ <td style="text-align: left">No</td>
+ <td style="text-align: left">Unsigned Integer RLE v2</td>
+ </tr>
+ </tbody>
+</table>
+
+<h2 id="boolean-columns">Boolean Columns</h2>
+
+<p>Boolean columns are rare, but have a simple encoding.</p>
+
+<table>
+ <thead>
+ <tr>
+ <th style="text-align: left">Encoding</th>
+ <th style="text-align: left">Stream Kind</th>
+ <th style="text-align: left">Optional</th>
+ <th style="text-align: left">Contents</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td style="text-align: left">DIRECT</td>
+ <td style="text-align: left">PRESENT</td>
+ <td style="text-align: left">Yes</td>
+ <td style="text-align: left">Boolean RLE</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"> </td>
+ <td style="text-align: left">DATA</td>
+ <td style="text-align: left">No</td>
+ <td style="text-align: left">Boolean RLE</td>
+ </tr>
+ </tbody>
+</table>
+
+<h2 id="tinyint-columns">TinyInt Columns</h2>
+
+<p>TinyInt (byte) columns use byte run length encoding.</p>
+
+<table>
+ <thead>
+ <tr>
+ <th style="text-align: left">Encoding</th>
+ <th style="text-align: left">Stream Kind</th>
+ <th style="text-align: left">Optional</th>
+ <th style="text-align: left">Contents</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td style="text-align: left">DIRECT</td>
+ <td style="text-align: left">PRESENT</td>
+ <td style="text-align: left">Yes</td>
+ <td style="text-align: left">Boolean RLE</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"> </td>
+ <td style="text-align: left">DATA</td>
+ <td style="text-align: left">No</td>
+ <td style="text-align: left">Byte RLE</td>
+ </tr>
+ </tbody>
+</table>
+
+<h2 id="binary-columns">Binary Columns</h2>
+
+<p>Binary data is encoded with a PRESENT stream, a DATA stream that records
+the contents, and a LENGTH stream that records the number of bytes per
+value.</p>
+
+<table>
+ <thead>
+ <tr>
+ <th style="text-align: left">Encoding</th>
+ <th style="text-align: left">Stream Kind</th>
+ <th style="text-align: left">Optional</th>
+ <th style="text-align: left">Contents</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td style="text-align: left">DIRECT</td>
+ <td style="text-align: left">PRESENT</td>
+ <td style="text-align: left">Yes</td>
+ <td style="text-align: left">Boolean RLE</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"> </td>
+ <td style="text-align: left">DATA</td>
+ <td style="text-align: left">No</td>
+ <td style="text-align: left">String contents</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"> </td>
+ <td style="text-align: left">LENGTH</td>
+ <td style="text-align: left">No</td>
+ <td style="text-align: left">Unsigned Integer RLE v1</td>
+ </tr>
+ <tr>
+ <td style="text-align: left">DIRECT_V2</td>
+ <td style="text-align: left">PRESENT</td>
+ <td style="text-align: left">Yes</td>
+ <td style="text-align: left">Boolean RLE</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"> </td>
+ <td style="text-align: left">DATA</td>
+ <td style="text-align: left">No</td>
+ <td style="text-align: left">String contents</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"> </td>
+ <td style="text-align: left">LENGTH</td>
+ <td style="text-align: left">No</td>
+ <td style="text-align: left">Unsigned Integer RLE v2</td>
+ </tr>
+ </tbody>
+</table>
+
+<h2 id="decimal-columns">Decimal Columns</h2>
+
+<p>Decimal was introduced in Hive 0.11 with infinite precision (the total
+number of digits). In Hive 0.13, the definition was changed to limit
+the precision to a maximum of 38 digits, which conveniently uses 127
+bits plus a sign bit. The current encoding of decimal columns stores
+the integer representation of the value as an unbounded length zigzag
+encoded base 128 varint. The scale is stored in the SECONDARY stream
+as a signed integer.</p>
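+<p>A small sketch of the decimal layout: the unscaled integer goes to DATA
+as a zigzag encoded base 128 varint and the scale goes to SECONDARY. The
+function names are hypothetical, the scale is returned as a plain integer
+without its RLE encoding, and Python’s <code>decimal.Decimal</code> stands
+in for the column value.</p>

```python
from decimal import Decimal


def zigzag_unbounded(n):
    """Zigzag encode an arbitrarily large signed integer."""
    return (n << 1) if n >= 0 else ((-n) << 1) - 1


def varint(n):
    """Encode a non-negative integer as a base 128 varint of any length."""
    out = bytearray()
    while True:
        low = n & 0x7F
        n >>= 7
        if n:
            out.append(low | 0x80)
        else:
            out.append(low)
            return bytes(out)


def encode_decimal(value):
    """Return (DATA bytes, scale) for one finite Decimal value."""
    sign, digits, exponent = value.as_tuple()
    unscaled = int("".join(str(digit) for digit in digits))
    if sign:
        unscaled = -unscaled
    return varint(zigzag_unbounded(unscaled)), -exponent


data, scale = encode_decimal(Decimal("123.45"))
print(data.hex(), scale)
```

+<p>Here 123.45 has the unscaled value 12345 and a scale of 2. Reading the
+digits from <code>as_tuple()</code> avoids any rounding by the decimal
+context for values with up to 38 digits.</p>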
+
+<table>
+ <thead>
+ <tr>
+ <th style="text-align: left">Encoding</th>
+ <th style="text-align: left">Stream Kind</th>
+ <th style="text-align: left">Optional</th>
+ <th style="text-align: left">Contents</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td style="text-align: left">DIRECT</td>
+ <td style="text-align: left">PRESENT</td>
+ <td style="text-align: left">Yes</td>
+ <td style="text-align: left">Boolean RLE</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"> </td>
+ <td style="text-align: left">DATA</td>
+ <td style="text-align: left">No</td>
+ <td style="text-align: left">Unbounded base 128 varints</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"> </td>
+ <td style="text-align: left">SECONDARY</td>
+ <td style="text-align: left">No</td>
+ <td style="text-align: left">Signed Integer RLE v1</td>
+ </tr>
+ <tr>
+ <td style="text-align: left">DIRECT_V2</td>
+ <td style="text-align: left">PRESENT</td>
+ <td style="text-align: left">Yes</td>
+ <td style="text-align: left">Boolean RLE</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"> </td>
+ <td style="text-align: left">DATA</td>
+ <td style="text-align: left">No</td>
+ <td style="text-align: left">Unbounded base 128 varints</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"> </td>
+ <td style="text-align: left">SECONDARY</td>
+ <td style="text-align: left">No</td>
+ <td style="text-align: left">Signed Integer RLE v2</td>
+ </tr>
+ </tbody>
+</table>
+
+<h2 id="date-columns">Date Columns</h2>
+
+<p>Date data is encoded with a PRESENT stream and a DATA stream that records
+the number of days after January 1, 1970 in UTC.</p>
+
+<table>
+ <thead>
+ <tr>
+ <th style="text-align: left">Encoding</th>
+ <th style="text-align: left">Stream Kind</th>
+ <th style="text-align: left">Optional</th>
+ <th style="text-align: left">Contents</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td style="text-align: left">DIRECT</td>
+ <td style="text-align: left">PRESENT</td>
+ <td style="text-align: left">Yes</td>
+ <td style="text-align: left">Boolean RLE</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"> </td>
+ <td style="text-align: left">DATA</td>
+ <td style="text-align: left">No</td>
+ <td style="text-align: left">Signed Integer RLE v1</td>
+ </tr>
+ <tr>
+ <td style="text-align: left">DIRECT_V2</td>
+ <td style="text-align: left">PRESENT</td>
+ <td style="text-align: left">Yes</td>
+ <td style="text-align: left">Boolean RLE</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"> </td>
+ <td style="text-align: left">DATA</td>
+ <td style="text-align: left">No</td>
+ <td style="text-align: left">Signed Integer RLE v2</td>
+ </tr>
+ </tbody>
+</table>
+
+<h2 id="timestamp-columns">Timestamp Columns</h2>
+
+<p>Timestamp records times down to nanoseconds as a PRESENT stream that
+records non-null values, a DATA stream that records the number of
+seconds after 1 January 2015, and a SECONDARY stream that records the
+number of nanoseconds.</p>
+
+<p>Because the number of nanoseconds often has a large number of trailing
+zeros, the number has trailing decimal zero digits removed when there are
+at least two of them, and the last three bits are used to record how many
+zeros were removed minus one. Thus 1000 nanoseconds would be serialized as
+0x0a and 100000 would be serialized as 0x0c.</p>
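+<p>The nanosecond packing can be sketched as follows. This is based on my
+understanding of the Java writer and should be treated as an assumption
+rather than normative text: zeros are stripped only when at least two are
+present, and the low three bits hold the number of stripped zeros minus one.
+Under that reading 1000 nanoseconds packs to 0x0a. The function names are
+hypothetical.</p>

```python
def format_nanos(nanos):
    """Pack a nanosecond count (0 to 999,999,999) into the SECONDARY value."""
    if nanos == 0:
        return 0
    if nanos % 100 != 0:
        # Fewer than two trailing zeros: store the value unchanged.
        return nanos << 3
    nanos //= 100
    zeros = 1  # low bits hold (zeros removed - 1); two are removed so far
    while nanos % 10 == 0 and zeros < 7:
        nanos //= 10
        zeros += 1
    return (nanos << 3) | zeros


def parse_nanos(serialized):
    """Invert format_nanos."""
    zeros = serialized & 7
    result = serialized >> 3
    if zeros:
        result *= 10 ** (zeros + 1)
    return result


for nanos in (1000, 100000, 123):
    packed = format_nanos(nanos)
    print(nanos, hex(packed), parse_nanos(packed))
```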
+
+<table>
+ <thead>
+ <tr>
+ <th style="text-align: left">Encoding</th>
+ <th style="text-align: left">Stream Kind</th>
+ <th style="text-align: left">Optional</th>
+ <th style="text-align: left">Contents</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td style="text-align: left">DIRECT</td>
+ <td style="text-align: left">PRESENT</td>
+ <td style="text-align: left">Yes</td>
+ <td style="text-align: left">Boolean RLE</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"> </td>
+ <td style="text-align: left">DATA</td>
+ <td style="text-align: left">No</td>
+ <td style="text-align: left">Signed Integer RLE v1</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"> </td>
+ <td style="text-align: left">SECONDARY</td>
+ <td style="text-align: left">No</td>
+ <td style="text-align: left">Unsigned Integer RLE v1</td>
+ </tr>
+ <tr>
+ <td style="text-align: left">DIRECT_V2</td>
+ <td style="text-align: left">PRESENT</td>
+ <td style="text-align: left">Yes</td>
+ <td style="text-align: left">Boolean RLE</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"> </td>
+ <td style="text-align: left">DATA</td>
+ <td style="text-align: left">No</td>
+ <td style="text-align: left">Signed Integer RLE v2</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"> </td>
+ <td style="text-align: left">SECONDARY</td>
+ <td style="text-align: left">No</td>
+ <td style="text-align: left">Unsigned Integer RLE v2</td>
+ </tr>
+ </tbody>
+</table>
+
+<h2 id="struct-columns">Struct Columns</h2>
+
+<p>Structs have no data themselves and delegate everything to their child
+columns except for their PRESENT stream. They have a child column
+for each of the fields.</p>
+
+<table>
+ <thead>
+ <tr>
+ <th style="text-align: left">Encoding</th>
+ <th style="text-align: left">Stream Kind</th>
+ <th style="text-align: left">Optional</th>
+ <th style="text-align: left">Contents</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td style="text-align: left">DIRECT</td>
+ <td style="text-align: left">PRESENT</td>
+ <td style="text-align: left">Yes</td>
+ <td style="text-align: left">Boolean RLE</td>
+ </tr>
+ </tbody>
+</table>
+
+<h2 id="list-columns">List Columns</h2>
+
+<p>Lists are encoded as the PRESENT stream and a LENGTH stream with the
+number of items in each list. They have a single child column for the
+element values.</p>
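+<p>To make the list layout concrete, the sketch below flattens a column of
+lists into its logical PRESENT and LENGTH contents plus the values handed to
+the child column. The function name is hypothetical, and it assumes that
+null lists contribute no LENGTH entry, in the same way that null integers
+contribute no DATA value.</p>

```python
def encode_list_column(rows):
    """Return (PRESENT bits, LENGTH values, child column values)."""
    present = [row is not None for row in rows]
    lengths = [len(row) for row in rows if row is not None]
    child = [item for row in rows if row is not None for item in row]
    return present, lengths, child


print(encode_list_column([[1, 2], [], None, [3]]))
```

+<p>An empty list is present with a length of 0, which distinguishes it from
+a null list.</p>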
+
+<table>
+ <thead>
+ <tr>
+ <th style="text-align: left">Encoding</th>
+ <th style="text-align: left">Stream Kind</th>
+ <th style="text-align: left">Optional</th>
+ <th style="text-align: left">Contents</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td style="text-align: left">DIRECT</td>
+ <td style="text-align: left">PRESENT</td>
+ <td style="text-align: left">Yes</td>
+ <td style="text-align: left">Boolean RLE</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"> </td>
+ <td style="text-align: left">LENGTH</td>
+ <td style="text-align: left">No</td>
+ <td style="text-align: left">Unsigned Integer RLE v1</td>
+ </tr>
+ <tr>
+ <td style="text-align: left">DIRECT_V2</td>
+ <td style="text-align: left">PRESENT</td>
+ <td style="text-align: left">Yes</td>
+ <td style="text-align: left">Boolean RLE</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"> </td>
+ <td style="text-align: left">LENGTH</td>
+ <td style="text-align: left">No</td>
+ <td style="text-align: left">Unsigned Integer RLE v2</td>
+ </tr>
+ </tbody>
+</table>
+
+<h2 id="map-columns">Map Columns</h2>
+
+<p>Maps are encoded as the PRESENT stream and a LENGTH stream with the number
+of items in each map. They have a child column for the key and
+another child column for the value.</p>
+
+<table>
+ <thead>
+ <tr>
+ <th style="text-align: left">Encoding</th>
+ <th style="text-align: left">Stream Kind</th>
+ <th style="text-align: left">Optional</th>
+ <th style="text-align: left">Contents</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td style="text-align: left">DIRECT</td>
+ <td style="text-align: left">PRESENT</td>
+ <td style="text-align: left">Yes</td>
+ <td style="text-align: left">Boolean RLE</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"> </td>
+ <td style="text-align: left">LENGTH</td>
+ <td style="text-align: left">No</td>
+ <td style="text-align: left">Unsigned Integer RLE v1</td>
+ </tr>
+ <tr>
+ <td style="text-align: left">DIRECT_V2</td>
+ <td style="text-align: left">PRESENT</td>
+ <td style="text-align: left">Yes</td>
+ <td style="text-align: left">Boolean RLE</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"> </td>
+ <td style="text-align: left">LENGTH</td>
+ <td style="text-align: left">No</td>
+ <td style="text-align: left">Unsigned Integer RLE v2</td>
+ </tr>
+ </tbody>
+</table>
+
+<h2 id="union-columns">Union Columns</h2>
+
+<p>Unions are encoded as the PRESENT stream and a tag stream that controls which
+potential variant is used. They have a child column for each variant of the
+union. Currently ORC union types are limited to 256 variants, which matches
+the Hive type model.</p>
+
+<table>
+ <thead>
+ <tr>
+ <th style="text-align: left">Encoding</th>
+ <th style="text-align: left">Stream Kind</th>
+ <th style="text-align: left">Optional</th>
+ <th style="text-align: left">Contents</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td style="text-align: left">DIRECT</td>
+ <td style="text-align: left">PRESENT</td>
+ <td style="text-align: left">Yes</td>
+ <td style="text-align: left">Boolean RLE</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"> </td>
+ <td style="text-align: left">DATA</td>
+ <td style="text-align: left">No</td>
+ <td style="text-align: left">Byte RLE</td>
+ </tr>
+ </tbody>
+</table>
+
+<h1 id="indexes">Indexes</h1>
+
+<h2 id="row-group-index">Row Group Index</h2>
+
+<p>The row group indexes consist of a ROW_INDEX stream for each primitive
+column that has an entry for each row group. Row groups are controlled
+by the writer and default to 10,000 rows. Each RowIndexEntry gives the
+position of each stream for the column and the statistics for that row
+group.</p>
+
+<p>The index streams are placed at the front of the stripe, because in
+the default case of streaming they do not need to be read. They are
+only loaded when either predicate push down is being used or the
+reader seeks to a particular row.</p>
+
+<p><code>message RowIndexEntry {
+ repeated uint64 positions = 1 [packed=true];
+ optional ColumnStatistics statistics = 2;
+}
+</code></p>
+
+<p><code>message RowIndex {
+ repeated RowIndexEntry entry = 1;
+}
+</code></p>
+
+<p>To record positions, each stream needs a sequence of numbers. For
+uncompressed streams, the position is the byte offset of the RLE run’s
+start location followed by the number of values that need to be
+consumed from the run. In compressed streams, the first number is the
+start of the compression chunk in the stream, followed by the number
+of decompressed bytes that need to be consumed, and finally the number
+of values consumed in the RLE.</p>
+
+<p>For columns with multiple streams, the sequences of positions in each
+stream are concatenated. That was an unfortunate decision on my part
+that we should fix at some point, because it makes code that uses the
+indexes error-prone.</p>
+
+<p>Because dictionaries are accessed randomly, there is not a position to
+record for the dictionary and the entire dictionary must be read even
+if only part of a stripe is being read.</p>
+
+<h2 id="bloom-filter-index">Bloom Filter Index</h2>
+
+<p>Bloom Filters are added to ORC indexes from Hive 1.2.0 onwards.
+Predicate pushdown can make use of bloom filters to better prune
+the row groups that do not satisfy the filter condition.
+The bloom filter indexes consist of a BLOOM_FILTER stream for each
+column specified through ‘orc.bloom.filter.columns’ table properties.
+A BLOOM_FILTER stream records a bloom filter entry for each row
+group (default to 10,000 rows) in a column. Only the row groups that
+satisfy min/max row index evaluation will be evaluated against the
+bloom filter index.</p>
+
+<p>Each BloomFilterEntry stores the number of hash functions (‘k’) used
+and the bitset backing the bloom filter. The original encoding (pre
+ORC-101) of bloom filters stored the bitset as a repeating sequence of
+longs in the bitset field with a little endian encoding
+(0x1 is bit 0 and 0x2 is bit 1). After ORC-101, the encoding is a
+sequence of bytes with a little endian encoding in the utf8bitset field.</p>
+
+<p><code>message BloomFilter {
+ optional uint32 numHashFunctions = 1;
+ repeated fixed64 bitset = 2;
+ optional bytes utf8bitset = 3;
+}
+</code></p>
+
+<p><code>message BloomFilterIndex {
+ repeated BloomFilter bloomFilter = 1;
+}
+</code></p>
+
+<p>The bloom filter internally uses two different hash functions to map a key
+to a position in the bit set. For tinyint, smallint, int, bigint, float
+and double types, Thomas Wang’s 64-bit integer hash function is used.
+Floats are converted to IEEE-754 32 bit representation
+(using Java’s Float.floatToIntBits(float)). Similarly, doubles are
+converted to IEEE-754 64 bit representation (using Java’s
+Double.doubleToLongBits(double)). All these primitive types
+are cast to long base type before being passed on to the hash function.
+For strings and binary types, Murmur3 64 bit hash algorithm is used.
+The 64 bit variant of Murmur3 considers only the most significant
+8 bytes of Murmur3 128-bit algorithm. The 64 bit hashcode generated
+from the above algorithms is used as a base to derive ‘k’ different
+hash functions. We use the idea mentioned in the paper “Less Hashing,
+Same Performance: Building a Better Bloom Filter” by Kirsch et al. to
+quickly compute the k hashcodes.</p>
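+<p>For readers outside Java, the float and double conversions mentioned
+above can be approximated with the sketch below. The function names are
+hypothetical. Java’s methods also collapse every NaN to one canonical bit
+pattern, which this sketch does not attempt.</p>

```python
import struct


def float_to_int_bits(value):
    """Signed 32 bit integer with the IEEE-754 single precision bits of value."""
    return struct.unpack(">i", struct.pack(">f", value))[0]


def double_to_long_bits(value):
    """Signed 64 bit integer with the IEEE-754 double precision bits of value."""
    return struct.unpack(">q", struct.pack(">d", value))[0]


print(hex(float_to_int_bits(1.0)))
print(hex(double_to_long_bits(1.0)))
```

+<p>The resulting integer is what would be widened to a long and passed to
+the integer hash function.</p>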
+
+<p>The algorithm for computing k hashcodes and setting the bit position
+in a bloom filter is as follows:</p>
+
+<ol>
+ <li>Get 64 bit base hash code from Murmur3 or Thomas Wang’s hash algorithm.</li>
+ <li>Split the above hashcode into two 32-bit hashcodes (say hash1 and hash2).</li>
+ <li>k’th hashcode is obtained by (where k > 0):
+ <ul>
+ <li>combinedHash = hash1 + (k * hash2)</li>
+ </ul>
+ </li>
+ <li>If combinedHash is negative flip all the bits:
+ <ul>
+ <li>combinedHash = ~combinedHash</li>
+ </ul>
+ </li>
+ <li>Bit set position is obtained by performing modulo with m:
+ <ul>
+ <li>position = combinedHash % m</li>
+ </ul>
+ </li>
+ <li>Set the position in bit set. The bits above the 6 least significant bits
+identify the long index within bitset, and the 6 least significant bits give
+the bit position within that long, using little endian order.
+ <ul>
+ <li>bitset[position >>> 6] |= (1L << position);</li>
+ </ul>
+ </li>
+</ol>
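+<p>The steps above can be sketched as follows, starting from an already
+computed 64 bit base hash. The function names are hypothetical. Python
+integers are unbounded, so the sketch emulates Java’s 32 bit signed
+arithmetic explicitly, and taking hash1 from the low 32 bits and hash2 from
+the high 32 bits is an assumption about how the split is done.</p>

```python
def to_int32(x):
    """Wrap an integer to a signed 32 bit value, as Java int arithmetic does."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x


def bloom_positions(hash64, k, m):
    """Return the k bit positions for a 64 bit base hash and m total bits."""
    hash1 = to_int32(hash64)        # assumption: low 32 bits
    hash2 = to_int32(hash64 >> 32)  # assumption: high 32 bits
    positions = []
    for i in range(1, k + 1):
        combined = to_int32(hash1 + i * hash2)
        if combined < 0:
            combined = ~combined    # flip all the bits
        positions.append(combined % m)
    return positions


def set_bits(bitset, positions):
    """Set each position in a list of 64 bit longs."""
    for position in positions:
        bitset[position >> 6] |= 1 << (position & 63)


bitset = [0]
positions = bloom_positions(0x0000000200000001, k=3, m=64)
set_bits(bitset, positions)
print(positions, bitset)
```

+<p>The explicit <code>position &amp; 63</code> mirrors Java’s behaviour of
+using only the low 6 bits of a shift count for a long.</p>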
+
+<p>Bloom filter streams are interlaced with row group indexes. This placement
+makes it convenient to read the bloom filter stream and row index stream
+together in a single read operation.</p>
+
+<p><img src="/img/BloomFilter.png" alt="bloom filter" /></p>
+
+ </article>
+ </div>
+
+ <div class="clear"></div>
+
+ </div>
+</section>
+
+
+ <footer role="contentinfo">
+ <p>The contents of this website are © 2018
+ <a href="https://www.apache.org/">Apache Software Foundation</a>
+ under the terms of the <a
+ href="https://www.apache.org/licenses/LICENSE-2.0.html">
+ Apache License v2</a>. Apache ORC and its logo are trademarks
+ of the Apache Software Foundation.</p>
+</footer>
+
+ <script>
+ var anchorForId = function (id) {
+ var anchor = document.createElement("a");
+ anchor.className = "header-link";
+ anchor.href = "#" + id;
+ anchor.innerHTML = "<span class=\"sr-only\">Permalink</span><i class=\"fa fa-link\"></i>";
+ anchor.title = "Permalink";
+ return anchor;
+ };
+
+ var linkifyAnchors = function (level, containingElement) {
+ var headers = containingElement.getElementsByTagName("h" + level);
+ for (var h = 0; h < headers.length; h++) {
+ var header = headers[h];
+
+ if (typeof header.id !== "undefined" && header.id !== "") {
+ header.appendChild(anchorForId(header.id));
+ }
+ }
+ };
+
+ document.onreadystatechange = function () {
+ if (this.readyState === "complete") {
+ var contentBlock = document.getElementsByClassName("docs")[0] || document.getElementsByClassName("news")[0];
+ if (!contentBlock) {
+ return;
+ }
+ for (var level = 1; level <= 6; level++) {
+ linkifyAnchors(level, contentBlock);
+ }
+ }
+ };
+</script>
+
+
+</body>
+</html>
http://git-wip-us.apache.org/repos/asf/orc/blob/c6e29090/specification/index.html
----------------------------------------------------------------------
diff --git a/specification/index.html b/specification/index.html
new file mode 100644
index 0000000..3c3a5fe
--- /dev/null
+++ b/specification/index.html
@@ -0,0 +1,159 @@
+<!DOCTYPE HTML>
+<html lang="en-US">
+<head>
+ <meta charset="UTF-8">
+ <title>ORC Specification</title>
+ <meta name="viewport" content="width=device-width,initial-scale=1">
+ <meta name="generator" content="Jekyll v2.4.0">
+ <link rel="stylesheet" href="//fonts.googleapis.com/css?family=Lato:300,300italic,400,400italic,700,700italic,900">
+ <link rel="stylesheet" href="/css/screen.css">
+ <link rel="icon" type="image/x-icon" href="/favicon.ico">
+ <!--[if lt IE 9]>
+ <script src="/js/html5shiv.min.js"></script>
+ <script src="/js/respond.min.js"></script>
+ <![endif]-->
+</head>
+
+
+<body class="wrap">
+ <header role="banner">
+ <nav class="mobile-nav show-on-mobiles">
+ <ul>
+ <li class="">
+ <a href="/">Home</a>
+ </li>
+ <li class="">
+ <a href="/docs/"><span class="show-on-mobiles">Docs</span>
+ <span class="hide-on-mobiles">Documentation</span></a>
+ </li>
+ <li class="">
+ <a href="/talks/">Talks</a>
+ </li>
+ <li class="">
+ <a href="/news/">News</a>
+ </li>
+ <li class="">
+ <a href="/help/">Help</a>
+ </li>
+ <li class="">
+ <a href="/develop/">Develop</a>
+ </li>
+</ul>
+
+ </nav>
+ <div class="grid">
+ <div class="unit one-third center-on-mobiles">
+ <h1>
+ <a href="/">
+ <span class="sr-only">Apache ORC</span>
+ <img src="/img/logo.png" width="249" height="101" alt="ORC Logo">
+ </a>
+ </h1>
+ </div>
+ <nav class="main-nav unit two-thirds hide-on-mobiles">
+ <ul>
+ <li class="">
+ <a href="/">Home</a>
+ </li>
+ <li class="">
+ <a href="/docs/"><span class="show-on-mobiles">Docs</span>
+ <span class="hide-on-mobiles">Documentation</span></a>
+ </li>
+ <li class="">
+ <a href="/talks/">Talks</a>
+ </li>
+ <li class="">
+ <a href="/news/">News</a>
+ </li>
+ <li class="">
+ <a href="/help/">Help</a>
+ </li>
+ <li class="">
+ <a href="/develop/">Develop</a>
+ </li>
+</ul>
+
+ </nav>
+ </div>
+</header>
+
+
+ <section class="standalone">
+ <div class="grid">
+
+ <div class="unit whole">
+ <article>
+ <h1>ORC Specification</h1>
+ <p>There have been two released ORC file versions:</p>
+
+<ul>
+ <li><a href="ORCv0.html">ORC v0</a> was released in Hive 0.11.</li>
+ <li><a href="ORCv1.html">ORC v1</a> was released in Hive 0.12 and ORC 1.x.</li>
+</ul>
+
+<p>Each version of the library will detect the format version and use
+the appropriate reader. The library can also write the older versions
+of the file format to ensure that users can write files that all of their
+clusters can read correctly.</p>
+
+<p>We are working on a new version of the file format:</p>
+
+<ul>
+ <li><a href="ORCv2.html">ORC v2</a> is a work in progress and is rapidly evolving.</li>
+</ul>
+
+ </article>
+ </div>
+
+ <div class="clear"></div>
+
+ </div>
+</section>
+
+
+ <footer role="contentinfo">
+ <p>The contents of this website are © 2018
+ <a href="https://www.apache.org/">Apache Software Foundation</a>
+ under the terms of the <a
+ href="https://www.apache.org/licenses/LICENSE-2.0.html">
+ Apache License v2</a>. Apache ORC and its logo are trademarks
+ of the Apache Software Foundation.</p>
+</footer>
+
+ <script>
+ var anchorForId = function (id) {
+ var anchor = document.createElement("a");
+ anchor.className = "header-link";
+ anchor.href = "#" + id;
+ anchor.innerHTML = "<span class=\"sr-only\">Permalink</span><i class=\"fa fa-link\"></i>";
+ anchor.title = "Permalink";
+ return anchor;
+ };
+
+ var linkifyAnchors = function (level, containingElement) {
+ var headers = containingElement.getElementsByTagName("h" + level);
+ for (var h = 0; h < headers.length; h++) {
+ var header = headers[h];
+
+ if (typeof header.id !== "undefined" && header.id !== "") {
+ header.appendChild(anchorForId(header.id));
+ }
+ }
+ };
+
+ document.onreadystatechange = function () {
+ if (this.readyState === "complete") {
+ var contentBlock = document.getElementsByClassName("docs")[0] || document.getElementsByClassName("news")[0];
+ if (!contentBlock) {
+ return;
+ }
+ for (var level = 1; level <= 6; level++) {
+ linkifyAnchors(level, contentBlock);
+ }
+ }
+ };
+</script>
+
+
+</body>
+</html>
[6/9] orc git commit: Pushing ORC-339 reorganize the ORC file format
spec.
Posted by om...@apache.org.
http://git-wip-us.apache.org/repos/asf/orc/blob/c6e29090/docs/hive-ddl.html
----------------------------------------------------------------------
diff --git a/docs/hive-ddl.html b/docs/hive-ddl.html
index 0da9356..8c360d3 100644
--- a/docs/hive-ddl.html
+++ b/docs/hive-ddl.html
@@ -109,12 +109,6 @@
-
-
-
-
-
-
<option value="/docs/index.html">Background</option>
@@ -130,14 +124,6 @@
-
-
-
-
-
-
-
-
@@ -174,20 +160,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
-
-
@@ -221,20 +193,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
-
-
<option value="/docs/types.html">Types</option>
@@ -261,12 +219,6 @@
-
-
-
-
-
-
<option value="/docs/indexes.html">Indexes</option>
@@ -280,14 +232,6 @@
-
-
-
-
-
-
-
-
@@ -324,20 +268,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
-
-
</optgroup>
@@ -381,20 +311,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
-
-
@@ -426,25 +342,11 @@
-
-
-
-
-
-
<option value="/docs/releases.html">Releases</option>
-
-
-
-
-
-
-
-
</optgroup>
@@ -471,12 +373,6 @@
-
-
-
-
-
-
<option value="/docs/hive-ddl.html">Hive DDL</option>
@@ -494,14 +390,6 @@
-
-
-
-
-
-
-
-
@@ -519,12 +407,6 @@
-
-
-
-
-
-
<option value="/docs/hive-config.html">Hive Configuration</option>
@@ -544,14 +426,6 @@
-
-
-
-
-
-
-
-
</optgroup>
@@ -586,12 +460,6 @@
-
-
-
-
-
-
<option value="/docs/mapred.html">Using in MapRed</option>
@@ -601,14 +469,6 @@
-
-
-
-
-
-
-
-
@@ -638,12 +498,6 @@
-
-
-
-
-
-
<option value="/docs/mapreduce.html">Using in MapReduce</option>
@@ -651,14 +505,6 @@
-
-
-
-
-
-
-
-
</optgroup>
@@ -679,8 +525,6 @@
-
-
<option value="/docs/core-java.html">Using Core Java</option>
@@ -704,18 +548,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
@@ -727,8 +559,6 @@
-
-
<option value="/docs/core-cpp.html">Using Core C++</option>
@@ -754,18 +584,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
</optgroup>
@@ -788,8 +606,6 @@
-
-
<option value="/docs/cpp-tools.html">C++ Tools</option>
@@ -811,18 +627,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
@@ -848,12 +652,6 @@
-
-
-
-
-
-
<option value="/docs/java-tools.html">Java Tools</option>
@@ -865,695 +663,104 @@
-
-
-
-
-
-
-
-
</optgroup>
- <optgroup label="Format Specification">
-
+ </select>
+</div>
+
+
+ <div class="unit four-fifths">
+ <article>
+ <h1>Hive DDL</h1>
+ <p>ORC is well integrated into Hive, so storing your istari table as ORC
+is done by adding “STORED AS ORC”.</p>
+
+<p><code>CREATE TABLE istari (
+ name STRING,
+ color STRING
+) STORED AS ORC;
+</code></p>
+
+<p>To modify a table so that new partitions of the istari table are
+stored as ORC files:</p>
+
+<p><code>ALTER TABLE istari SET FILEFORMAT ORC;
+</code></p>
+
+<p>As of Hive 0.14, users can efficiently merge small ORC files
+by issuing a CONCATENATE command on their table or partition. The
+files will be merged at the stripe level without reserialization.</p>
+
+<p><code>ALTER TABLE istari [PARTITION partition_spec] CONCATENATE;
+</code></p>
+
+<p>To get information about an ORC file, use the orcfiledump command.</p>
+
+<p><code>% hive --orcfiledump <path_to_file>
+</code></p>
+
+<p>As of Hive 1.1, to display the data in the ORC file, use:</p>
+
+<p><code>% hive --orcfiledump -d <path_to_file>
+</code></p>
+
+
+
+
+
+
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/spec-intro.html">Introduction</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/file-tail.html">File Tail</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/compression.html">Compression</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/run-length.html">Run Length Encoding</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/stripes.html">Stripes</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/encodings.html">Column Encodings</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/spec-index.html">Indexes</option>
-
-
-
-
-
-
-
-
-
-
- </optgroup>
-
- </select>
-</div>
-
-
- <div class="unit four-fifths">
- <article>
- <h1>Hive DDL</h1>
- <p>ORC is well integrated into Hive, so storing your istari table as ORC
-is done by adding “STORED AS ORC”.</p>
-
-<p><code>CREATE TABLE istari (
- name STRING,
- color STRING
-) STORED AS ORC;
-</code></p>
-
-<p>To modify a table so that new partitions of the istari table are
-stored as ORC files:</p>
-
-<p><code>ALTER TABLE istari SET FILEFORMAT ORC;
-</code></p>
-
-<p>As of Hive 0.14, users can request an efficient merge of small ORC files
-together by issuing a CONCATENATE command on their table or partition. The
-files will be merged at the stripe level without reserialization.</p>
-
-<p><code>ALTER TABLE istari [PARTITION partition_spec] CONCATENATE;
-</code></p>
-
-<p>To get information about an ORC file, use the orcfiledump command.</p>
-
-<p><code>% hive --orcfiledump <path_to_file>
-</code></p>
-
-<p>As of Hive 1.1, to display the data in the ORC file, use:</p>
-
-<p><code>% hive --orcfiledump -d <path_to_file>
-</code></p>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <div class="section-nav">
- <div class="left align-right">
-
-
-
- <a href="/docs/releases.html" class="prev">Back</a>
-
- </div>
- <div class="right align-left">
-
-
-
- <a href="/docs/hive-config.html" class="next">Next</a>
-
- </div>
- </div>
- <div class="clear"></div>
-
-
- </article>
- </div>
-
- <div class="unit one-fifth hide-on-mobiles">
- <aside>
-
- <h4>Overview</h4>
-
-
-<ul>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/index.html">Background</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/adopters.html">ORC Adopters</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/types.html">Types</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/indexes.html">Indexes</a></li>
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/acid.html">ACID support</a></li>
-
-
-
-</ul>
-
-
- <h4>Installing</h4>
-
-
-<ul>
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/building.html">Building ORC</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
+
-
-
+
-
-
+
-
-
+
-
-
+
-
+ <div class="section-nav">
+ <div class="left align-right">
+
+
+
+ <a href="/docs/releases.html" class="prev">Back</a>
+
+ </div>
+ <div class="right align-left">
+
+
+
+ <a href="/docs/hive-config.html" class="next">Next</a>
+
+ </div>
+ </div>
+ <div class="clear"></div>
- <li class=""><a href="/docs/releases.html">Releases</a></li>
-
+ </article>
+ </div>
-</ul>
-
+ <div class="unit one-fifth hide-on-mobiles">
+ <aside>
- <h4>Using in Hive</h4>
+ <h4>Overview</h4>
<ul>
@@ -1582,11 +789,7 @@ files will be merged at the stripe level without reserialization.</p>
-
-
-
-
- <li class="current"><a href="/docs/hive-ddl.html">Hive DDL</a></li>
+ <li class=""><a href="/docs/index.html">Background</a></li>
@@ -1600,34 +803,10 @@ files will be merged at the stripe level without reserialization.</p>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/hive-config.html">Hive Configuration</a></li>
+ <li class=""><a href="/docs/adopters.html">ORC Adopters</a></li>
-</ul>
-
-
- <h4>Using in MapReduce</h4>
-
-
-<ul>
-
@@ -1664,7 +843,7 @@ files will be merged at the stripe level without reserialization.</p>
- <li class=""><a href="/docs/mapred.html">Using in MapRed</a></li>
+ <li class=""><a href="/docs/types.html">Types</a></li>
@@ -1694,49 +873,7 @@ files will be merged at the stripe level without reserialization.</p>
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/mapreduce.html">Using in MapReduce</a></li>
-
-
-
-</ul>
-
-
- <h4>Using ORC Core</h4>
-
-
-<ul>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/core-java.html">Using Core Java</a></li>
+ <li class=""><a href="/docs/indexes.html">Indexes</a></li>
@@ -1748,22 +885,14 @@ files will be merged at the stripe level without reserialization.</p>
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/core-cpp.html">Using Core C++</a></li>
+ <li class=""><a href="/docs/acid.html">ACID support</a></li>
</ul>
- <h4>Tools</h4>
+ <h4>Installing</h4>
<ul>
@@ -1780,15 +909,7 @@ files will be merged at the stripe level without reserialization.</p>
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/cpp-tools.html">C++ Tools</a></li>
+ <li class=""><a href="/docs/building.html">Building ORC</a></li>
@@ -1826,14 +947,14 @@ files will be merged at the stripe level without reserialization.</p>
- <li class=""><a href="/docs/java-tools.html">Java Tools</a></li>
+ <li class=""><a href="/docs/releases.html">Releases</a></li>
</ul>
- <h4>Format Specification</h4>
+ <h4>Using in Hive</h4>
<ul>
@@ -1860,31 +981,7 @@ files will be merged at the stripe level without reserialization.</p>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/spec-intro.html">Introduction</a></li>
+ <li class="current"><a href="/docs/hive-ddl.html">Hive DDL</a></li>
@@ -1908,31 +1005,17 @@ files will be merged at the stripe level without reserialization.</p>
-
-
-
-
- <li class=""><a href="/docs/file-tail.html">File Tail</a></li>
+ <li class=""><a href="/docs/hive-config.html">Hive Configuration</a></li>
-
-
-
-
-
+</ul>
-
-
-
-
-
-
+ <h4>Using in MapReduce</h4>
- <li class=""><a href="/docs/compression.html">Compression</a></li>
-
+<ul>
@@ -1964,19 +1047,7 @@ files will be merged at the stripe level without reserialization.</p>
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/run-length.html">Run Length Encoding</a></li>
+ <li class=""><a href="/docs/mapred.html">Using in MapRed</a></li>
@@ -2012,13 +1083,25 @@ files will be merged at the stripe level without reserialization.</p>
-
+ <li class=""><a href="/docs/mapreduce.html">Using in MapReduce</a></li>
+
+
+
+</ul>
+
-
+ <h4>Using ORC Core</h4>
+
+<ul>
+
+
+
+
+
@@ -2028,7 +1111,7 @@ files will be merged at the stripe level without reserialization.</p>
- <li class=""><a href="/docs/stripes.html">Stripes</a></li>
+ <li class=""><a href="/docs/core-java.html">Using Core Java</a></li>
@@ -2046,17 +1129,17 @@ files will be merged at the stripe level without reserialization.</p>
-
-
-
-
-
+ <li class=""><a href="/docs/core-cpp.html">Using Core C++</a></li>
+
+
+
+</ul>
+
-
+ <h4>Tools</h4>
- <li class=""><a href="/docs/encodings.html">Column Encodings</a></li>
-
+<ul>
@@ -2076,11 +1159,17 @@ files will be merged at the stripe level without reserialization.</p>
+ <li class=""><a href="/docs/cpp-tools.html">C++ Tools</a></li>
+
+
+
-
+
+
+
@@ -2102,7 +1191,7 @@ files will be merged at the stripe level without reserialization.</p>
- <li class=""><a href="/docs/spec-index.html">Indexes</a></li>
+ <li class=""><a href="/docs/java-tools.html">Java Tools</a></li>
http://git-wip-us.apache.org/repos/asf/orc/blob/c6e29090/docs/index.html
----------------------------------------------------------------------
diff --git a/docs/index.html b/docs/index.html
index 6014e66..0d344dc 100644
--- a/docs/index.html
+++ b/docs/index.html
@@ -109,12 +109,6 @@
-
-
-
-
-
-
<option value="/docs/index.html">Background</option>
@@ -130,14 +124,6 @@
-
-
-
-
-
-
-
-
@@ -174,9 +160,9 @@
-
+
-
+
@@ -188,9 +174,9 @@
-
+
-
+
@@ -207,6 +193,12 @@
+ <option value="/docs/types.html">Types</option>
+
+
+
+
+
@@ -227,6 +219,8 @@
+ <option value="/docs/indexes.html">Indexes</option>
+
@@ -235,7 +229,7 @@
- <option value="/docs/types.html">Types</option>
+
@@ -243,7 +237,7 @@
-
+ <option value="/docs/acid.html">ACID support</option>
@@ -267,8 +261,6 @@
- <option value="/docs/indexes.html">Indexes</option>
-
@@ -276,24 +268,35 @@
+
+
+ </optgroup>
+ <optgroup label="Installing">
+
+
+
+
+
+ <option value="/docs/building.html">Building ORC</option>
+
-
+
-
+
- <option value="/docs/acid.html">ACID support</option>
+
@@ -308,6 +311,10 @@
+
+
+
+
@@ -335,6 +342,8 @@
+ <option value="/docs/releases.html">Releases</option>
+
@@ -342,7 +351,7 @@
</optgroup>
- <optgroup label="Installing">
+ <optgroup label="Using in Hive">
@@ -354,8 +363,6 @@
- <option value="/docs/building.html">Building ORC</option>
-
@@ -366,7 +373,7 @@
-
+ <option value="/docs/hive-ddl.html">Hive DDL</option>
@@ -383,7 +390,9 @@
-
+
+
+
@@ -395,9 +404,11 @@
-
+
-
+
+ <option value="/docs/hive-config.html">Hive Configuration</option>
+
@@ -415,7 +426,16 @@
+
+
+ </optgroup>
+ <optgroup label="Using in MapReduce">
+
+
+
+
+
@@ -432,7 +452,7 @@
- <option value="/docs/releases.html">Releases</option>
+
@@ -440,18 +460,15 @@
+ <option value="/docs/mapred.html">Using in MapRed</option>
+
-
-
- </optgroup>
- <optgroup label="Using in Hive">
-
-
+
@@ -477,20 +494,27 @@
- <option value="/docs/hive-ddl.html">Hive DDL</option>
-
-
+ <option value="/docs/mapreduce.html">Using in MapReduce</option>
+
+
+ </optgroup>
+ <optgroup label="Using ORC Core">
+
+
+
+
+
@@ -501,10 +525,12 @@
+ <option value="/docs/core-java.html">Using Core Java</option>
+
-
+
-
+
@@ -522,13 +548,19 @@
+
+
+
+
- <option value="/docs/hive-config.html">Hive Configuration</option>
+
+ <option value="/docs/core-cpp.html">Using Core C++</option>
+
@@ -556,7 +588,7 @@
</optgroup>
- <optgroup label="Using in MapReduce">
+ <optgroup label="Tools">
@@ -574,7 +606,7 @@
-
+ <option value="/docs/cpp-tools.html">C++ Tools</option>
@@ -592,14 +624,12 @@
- <option value="/docs/mapred.html">Using in MapRed</option>
-
-
+
-
+
@@ -609,10 +639,6 @@
-
-
-
-
@@ -626,7 +652,7 @@
-
+ <option value="/docs/java-tools.html">Java Tools</option>
@@ -637,14 +663,94 @@
+
+
+ </optgroup>
+ </select>
+</div>
+
+
+ <div class="unit four-fifths">
+ <article>
+ <h1>Background</h1>
+ <p>Back in January 2013, we created ORC files as part of the initiative
+to massively speed up Apache Hive and improve the storage efficiency
+of data stored in Apache Hadoop. The focus was on enabling high speed
+processing and reducing file sizes.</p>
+
+<p>ORC is a self-describing type-aware columnar file format designed for
+Hadoop workloads. It is optimized for large streaming reads, but with
+integrated support for finding required rows quickly. Storing data in
+a columnar format lets the reader read, decompress, and process only
+the values that are required for the current query. Because ORC files
+are type-aware, the writer chooses the most appropriate encoding for
+the type and builds an internal index as the file is written.</p>
+
+<p>Predicate pushdown uses those indexes to determine which stripes in a
+file need to be read for a particular query and the row indexes can
+narrow the search to a particular set of 10,000 rows. ORC supports the
+complete set of types in Hive, including the complex types: structs,
+lists, maps, and unions.</p>
+
+<p>Many large Hadoop users have adopted ORC. For instance, Facebook uses
+ORC to <a href="https://s.apache.org/fb-scaling-300-pb">save tens of petabytes</a>
+in their data warehouse and demonstrated that ORC is <a href="https://s.apache.org/presto-orc">significantly
+faster</a> than RC File or Parquet. Yahoo
+uses ORC to store their production data and has released some of their
+<a href="https://s.apache.org/yahoo-orc">benchmark results</a>.</p>
+
+<p>ORC files are divided into <em>stripes</em> that are roughly 64MB by
+default. The stripes in a file are independent of each other and form
+the natural unit of distributed work. Within each stripe, the columns
+are separated from each other so the reader can read just the columns
+that are required.</p>
+
+
+
+
+
+
+
+
+ <div class="section-nav">
+ <div class="left align-right">
+
+ <span class="prev disabled">Back</span>
+
+ </div>
+ <div class="right align-left">
+
+
+
+ <a href="/docs/adopters.html" class="next">Next</a>
+
+ </div>
+ </div>
+ <div class="clear"></div>
+
+
+ </article>
+ </div>
+
+ <div class="unit one-fifth hide-on-mobiles">
+ <aside>
+
+ <h4>Overview</h4>
+
+<ul>
+
+
+
+
+
- <option value="/docs/mapreduce.html">Using in MapReduce</option>
+
@@ -659,11 +765,8 @@
-
-
- </optgroup>
- <optgroup label="Using ORC Core">
+ <li class="current"><a href="/docs/index.html">Background</a></li>
@@ -672,19 +775,21 @@
-
+
+ <li class=""><a href="/docs/adopters.html">ORC Adopters</a></li>
+
+
+
-
+
- <option value="/docs/core-java.html">Using Core Java</option>
-
-
+
@@ -715,22 +820,20 @@
-
+ <li class=""><a href="/docs/types.html">Types</a></li>
+
-
-
+
-
+
- <option value="/docs/core-cpp.html">Using Core C++</option>
-
@@ -747,860 +850,26 @@
+ <li class=""><a href="/docs/indexes.html">Indexes</a></li>
+
+
+
-
+
-
+
-
-
-
-
-
-
-
-
-
-
-
-
-
- </optgroup>
-
- <optgroup label="Tools">
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/cpp-tools.html">C++ Tools</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/java-tools.html">Java Tools</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- </optgroup>
-
- <optgroup label="Format Specification">
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/spec-intro.html">Introduction</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/file-tail.html">File Tail</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/compression.html">Compression</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/run-length.html">Run Length Encoding</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/stripes.html">Stripes</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/encodings.html">Column Encodings</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/spec-index.html">Indexes</option>
-
-
-
-
-
-
-
-
-
-
- </optgroup>
-
- </select>
-</div>
-
-
- <div class="unit four-fifths">
- <article>
- <h1>Background</h1>
- <p>Back in January 2013, we created ORC files as part of the initiative
-to massively speed up Apache Hive and improve the storage efficiency
-of data stored in Apache Hadoop. The focus was on enabling high speed
-processing and reducing file sizes.</p>
-
-<p>ORC is a self-describing type-aware columnar file format designed for
-Hadoop workloads. It is optimized for large streaming reads, but with
-integrated support for finding required rows quickly. Storing data in
-a columnar format lets the reader read, decompress, and process only
-the values that are required for the current query. Because ORC files
-are type-aware, the writer chooses the most appropriate encoding for
-the type and builds an internal index as the file is written.</p>
-
-<p>Predicate pushdown uses those indexes to determine which stripes in a
-file need to be read for a particular query and the row indexes can
-narrow the search to a particular set of 10,000 rows. ORC supports the
-complete set of types in Hive, including the complex types: structs,
-lists, maps, and unions.</p>
-
-<p>Many large Hadoop users have adopted ORC. For instance, Facebook uses
-ORC to <a href="https://s.apache.org/fb-scaling-300-pb">save tens of petabytes</a>
-in their data warehouse and demonstrated that ORC is <a href="https://s.apache.org/presto-orc">significantly
-faster</a> than RC File or Parquet. Yahoo
-uses ORC to store their production data and has released some of their
-<a href="https://s.apache.org/yahoo-orc">benchmark results</a>.</p>
-
-<p>ORC files are divided into <em>stripes</em> that are roughly 64MB by
-default. The stripes in a file are independent of each other and form
-the natural unit of distributed work. Within each stripe, the columns
-are separated from each other so the reader can read just the columns
-that are required.</p>
-
-
-
-
-
-
-
-
-
- <div class="section-nav">
- <div class="left align-right">
-
- <span class="prev disabled">Back</span>
-
- </div>
- <div class="right align-left">
-
-
-
- <a href="/docs/adopters.html" class="next">Next</a>
-
- </div>
- </div>
- <div class="clear"></div>
-
-
- </article>
- </div>
-
- <div class="unit one-fifth hide-on-mobiles">
- <aside>
-
- <h4>Overview</h4>
-
-
-<ul>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class="current"><a href="/docs/index.html">Background</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/adopters.html">ORC Adopters</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/types.html">Types</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/indexes.html">Indexes</a></li>
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/acid.html">ACID support</a></li>
-
-
-
-</ul>
-
-
- <h4>Installing</h4>
-
-
-<ul>
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/building.html">Building ORC</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/releases.html">Releases</a></li>
-
-
-
-</ul>
-
-
- <h4>Using in Hive</h4>
-
-
-<ul>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/hive-ddl.html">Hive DDL</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/hive-config.html">Hive Configuration</a></li>
+ <li class=""><a href="/docs/acid.html">ACID support</a></li>
</ul>
- <h4>Using in MapReduce</h4>
+ <h4>Installing</h4>
<ul>
@@ -1617,31 +886,7 @@ that are required.</p>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/mapred.html">Using in MapRed</a></li>
+ <li class=""><a href="/docs/building.html">Building ORC</a></li>
@@ -1679,18 +924,14 @@ that are required.</p>
-
-
-
-
- <li class=""><a href="/docs/mapreduce.html">Using in MapReduce</a></li>
+ <li class=""><a href="/docs/releases.html">Releases</a></li>
</ul>
- <h4>Using ORC Core</h4>
+ <h4>Using in Hive</h4>
<ul>
@@ -1713,59 +954,11 @@ that are required.</p>
- <li class=""><a href="/docs/core-java.html">Using Core Java</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/core-cpp.html">Using Core C++</a></li>
-
-
-
-</ul>
-
-
- <h4>Tools</h4>
-
-
-<ul>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/cpp-tools.html">C++ Tools</a></li>
+ <li class=""><a href="/docs/hive-ddl.html">Hive DDL</a></li>
@@ -1789,28 +982,14 @@ that are required.</p>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/java-tools.html">Java Tools</a></li>
+ <li class=""><a href="/docs/hive-config.html">Hive Configuration</a></li>
</ul>
- <h4>Format Specification</h4>
+ <h4>Using in MapReduce</h4>
<ul>
@@ -1845,23 +1024,7 @@ that are required.</p>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/spec-intro.html">Introduction</a></li>
+ <li class=""><a href="/docs/mapred.html">Using in MapRed</a></li>
@@ -1889,16 +1052,6 @@ that are required.</p>
- <li class=""><a href="/docs/file-tail.html">File Tail</a></li>
-
-
-
-
-
-
-
-
-
@@ -1907,42 +1060,24 @@ that are required.</p>
- <li class=""><a href="/docs/compression.html">Compression</a></li>
+ <li class=""><a href="/docs/mapreduce.html">Using in MapReduce</a></li>
-
-
-
-
-
+</ul>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
+ <h4>Using ORC Core</h4>
+
+<ul>
+
-
+
-
+
@@ -1953,7 +1088,7 @@ that are required.</p>
- <li class=""><a href="/docs/run-length.html">Run Length Encoding</a></li>
+ <li class=""><a href="/docs/core-java.html">Using Core Java</a></li>
@@ -1971,54 +1106,24 @@ that are required.</p>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/stripes.html">Stripes</a></li>
+ <li class=""><a href="/docs/core-cpp.html">Using Core C++</a></li>
-
+</ul>
-
+ <h4>Tools</h4>
+
+
+<ul>
+
-
+
@@ -2031,7 +1136,7 @@ that are required.</p>
- <li class=""><a href="/docs/encodings.html">Column Encodings</a></li>
+ <li class=""><a href="/docs/cpp-tools.html">C++ Tools</a></li>
@@ -2063,23 +1168,7 @@ that are required.</p>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/spec-index.html">Indexes</a></li>
+ <li class=""><a href="/docs/java-tools.html">Java Tools</a></li>
http://git-wip-us.apache.org/repos/asf/orc/blob/c6e29090/docs/indexes.html
----------------------------------------------------------------------
diff --git a/docs/indexes.html b/docs/indexes.html
index 5654a47..0a81f43 100644
--- a/docs/indexes.html
+++ b/docs/indexes.html
@@ -109,12 +109,6 @@
-
-
-
-
-
-
<option value="/docs/index.html">Background</option>
@@ -130,14 +124,6 @@
-
-
-
-
-
-
-
-
@@ -174,20 +160,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
-
-
@@ -221,20 +193,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
-
-
<option value="/docs/types.html">Types</option>
@@ -261,12 +219,6 @@
-
-
-
-
-
-
<option value="/docs/indexes.html">Indexes</option>
@@ -280,14 +232,6 @@
-
-
-
-
-
-
-
-
@@ -324,20 +268,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
-
-
</optgroup>
@@ -381,20 +311,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
-
-
@@ -426,25 +342,11 @@
-
-
-
-
-
-
<option value="/docs/releases.html">Releases</option>
-
-
-
-
-
-
-
-
</optgroup>
@@ -471,12 +373,6 @@
-
-
-
-
-
-
<option value="/docs/hive-ddl.html">Hive DDL</option>
@@ -494,14 +390,6 @@
-
-
-
-
-
-
-
-
@@ -519,12 +407,6 @@
-
-
-
-
-
-
<option value="/docs/hive-config.html">Hive Configuration</option>
@@ -544,14 +426,6 @@
-
-
-
-
-
-
-
-
</optgroup>
@@ -586,12 +460,6 @@
-
-
-
-
-
-
<option value="/docs/mapred.html">Using in MapRed</option>
@@ -601,14 +469,6 @@
-
-
-
-
-
-
-
-
@@ -638,12 +498,6 @@
-
-
-
-
-
-
<option value="/docs/mapreduce.html">Using in MapReduce</option>
@@ -651,14 +505,6 @@
-
-
-
-
-
-
-
-
</optgroup>
@@ -679,8 +525,6 @@
-
-
<option value="/docs/core-java.html">Using Core Java</option>
@@ -704,18 +548,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
@@ -727,8 +559,6 @@
-
-
<option value="/docs/core-cpp.html">Using Core C++</option>
@@ -754,18 +584,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
</optgroup>
@@ -788,8 +606,6 @@
-
-
<option value="/docs/cpp-tools.html">C++ Tools</option>
@@ -811,18 +627,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
@@ -848,12 +652,6 @@
-
-
-
-
-
-
<option value="/docs/java-tools.html">Java Tools</option>
@@ -865,680 +663,89 @@
-
-
-
-
-
-
-
-
</optgroup>
- <optgroup label="Format Specification">
-
+ </select>
+</div>
+
+
+ <div class="unit four-fifths">
+ <article>
+ <h1>Indexes</h1>
+ <p>ORC provides three levels of indexes within each file:</p>
+
+<ul>
+ <li>file level - statistics about the values in each column across the entire
+file</li>
+ <li>stripe level - statistics about the values in each column for each stripe</li>
+ <li>row level - statistics about the values in each column for each set of
+10,000 rows within a stripe</li>
+</ul>
+
+<p>The file and stripe level column statistics are in the file footer so
+that they are easy to access to determine if the rest of the file
+needs to be read at all. Row level indexes include both the column
+statistics for each row group and the position for seeking to the
+start of the row group.</p>
+
+<p>Column statistics always contain the count of values and whether there
+are null values present. Most primitive types also include the
+minimum and maximum values and, for numeric types, the sum. As of Hive
+1.2, the indexes can include bloom filters, which provide a much more
+selective filter.</p>
+
+<p>The reader applies Search ARGuments (SARGs) to the indexes at all
+levels. SARGs are simplified expressions that restrict the
+rows that are of interest. For example, if a query was looking for
+people older than 100 years, the SARG would be “age > 100” and
+only files, stripes, or row groups that had people over 100 years old
+would be read.</p>
+
+
+
+
+
+
-
-
+
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/spec-intro.html">Introduction</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/file-tail.html">File Tail</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/compression.html">Compression</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/run-length.html">Run Length Encoding</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/stripes.html">Stripes</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/encodings.html">Column Encodings</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/spec-index.html">Indexes</option>
-
-
-
-
-
-
-
-
-
-
- </optgroup>
-
- </select>
-</div>
-
-
- <div class="unit four-fifths">
- <article>
- <h1>Indexes</h1>
- <p>ORC provides three levels of indexes within each file:</p>
-
-<ul>
- <li>file level - statistics about the values in each column across the entire
-file</li>
- <li>stripe level - statistics about the values in each column for each stripe</li>
- <li>row level - statistics about the values in each column for each set of
-10,000 rows within a stripe</li>
-</ul>
-
-<p>The file and stripe level column statistics are in the file footer so
-that they are easy to access to determine if the rest of the file
-needs to be read at all. Row level indexes include both the column
-statistics for each row group and the position for seeking to the
-start of the row group.</p>
-
-<p>Column statistics always contain the count of values and whether there
-are null values present. Most other primitive types include the
-minimum and maximum values and for numeric types the sum. As of Hive
-1.2, the indexes can include bloom filters, which provide a much more
-selective filter.</p>
-
-<p>The indexes at all levels are used by the reader using Search
-ARGuments or SARGs, which are simplified expressions that restrict the
-rows that are of interest. For example, if a query was looking for
-people older than 100 years old, the SARG would be “age > 100” and
-only files, stripes, or row groups that had people over 100 years old
-would be read.</p>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <div class="section-nav">
- <div class="left align-right">
-
-
-
- <a href="/docs/types.html" class="prev">Back</a>
-
- </div>
- <div class="right align-left">
-
-
-
- <a href="/docs/acid.html" class="next">Next</a>
-
- </div>
- </div>
- <div class="clear"></div>
-
-
- </article>
- </div>
-
- <div class="unit one-fifth hide-on-mobiles">
- <aside>
-
- <h4>Overview</h4>
-
-
-<ul>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/index.html">Background</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/adopters.html">ORC Adopters</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/types.html">Types</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class="current"><a href="/docs/indexes.html">Indexes</a></li>
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/acid.html">ACID support</a></li>
-
-
-
-</ul>
-
-
- <h4>Installing</h4>
-
-
-<ul>
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/building.html">Building ORC</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
+
-
+ <div class="section-nav">
+ <div class="left align-right">
+
+
+
+ <a href="/docs/types.html" class="prev">Back</a>
+
+ </div>
+ <div class="right align-left">
+
+
+
+ <a href="/docs/acid.html" class="next">Next</a>
+
+ </div>
+ </div>
+ <div class="clear"></div>
- <li class=""><a href="/docs/releases.html">Releases</a></li>
-
+ </article>
+ </div>
-</ul>
-
+ <div class="unit one-fifth hide-on-mobiles">
+ <aside>
- <h4>Using in Hive</h4>
+ <h4>Overview</h4>
<ul>
@@ -1567,11 +774,7 @@ would be read.</p>
-
-
-
-
- <li class=""><a href="/docs/hive-ddl.html">Hive DDL</a></li>
+ <li class=""><a href="/docs/index.html">Background</a></li>
@@ -1585,34 +788,10 @@ would be read.</p>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/hive-config.html">Hive Configuration</a></li>
+ <li class=""><a href="/docs/adopters.html">ORC Adopters</a></li>
-</ul>
-
-
- <h4>Using in MapReduce</h4>
-
-
-<ul>
-
@@ -1649,7 +828,7 @@ would be read.</p>
- <li class=""><a href="/docs/mapred.html">Using in MapRed</a></li>
+ <li class=""><a href="/docs/types.html">Types</a></li>
@@ -1679,49 +858,7 @@ would be read.</p>
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/mapreduce.html">Using in MapReduce</a></li>
-
-
-
-</ul>
-
-
- <h4>Using ORC Core</h4>
-
-
-<ul>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/core-java.html">Using Core Java</a></li>
+ <li class="current"><a href="/docs/indexes.html">Indexes</a></li>
@@ -1733,22 +870,14 @@ would be read.</p>
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/core-cpp.html">Using Core C++</a></li>
+ <li class=""><a href="/docs/acid.html">ACID support</a></li>
</ul>
- <h4>Tools</h4>
+ <h4>Installing</h4>
<ul>
@@ -1765,15 +894,7 @@ would be read.</p>
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/cpp-tools.html">C++ Tools</a></li>
+ <li class=""><a href="/docs/building.html">Building ORC</a></li>
@@ -1811,14 +932,14 @@ would be read.</p>
- <li class=""><a href="/docs/java-tools.html">Java Tools</a></li>
+ <li class=""><a href="/docs/releases.html">Releases</a></li>
</ul>
- <h4>Format Specification</h4>
+ <h4>Using in Hive</h4>
<ul>
@@ -1845,31 +966,7 @@ would be read.</p>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/spec-intro.html">Introduction</a></li>
+ <li class=""><a href="/docs/hive-ddl.html">Hive DDL</a></li>
@@ -1893,31 +990,17 @@ would be read.</p>
-
-
-
-
- <li class=""><a href="/docs/file-tail.html">File Tail</a></li>
+ <li class=""><a href="/docs/hive-config.html">Hive Configuration</a></li>
-
-
-
-
-
+</ul>
-
-
-
-
-
-
+ <h4>Using in MapReduce</h4>
- <li class=""><a href="/docs/compression.html">Compression</a></li>
-
+<ul>
@@ -1949,19 +1032,7 @@ would be read.</p>
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/run-length.html">Run Length Encoding</a></li>
+ <li class=""><a href="/docs/mapred.html">Using in MapRed</a></li>
@@ -1997,13 +1068,25 @@ would be read.</p>
-
+ <li class=""><a href="/docs/mapreduce.html">Using in MapReduce</a></li>
+
+
+
+</ul>
+
-
+ <h4>Using ORC Core</h4>
+
+<ul>
+
+
+
+
+
@@ -2013,7 +1096,7 @@ would be read.</p>
- <li class=""><a href="/docs/stripes.html">Stripes</a></li>
+ <li class=""><a href="/docs/core-java.html">Using Core Java</a></li>
@@ -2031,17 +1114,17 @@ would be read.</p>
-
-
-
-
-
+ <li class=""><a href="/docs/core-cpp.html">Using Core C++</a></li>
+
+
+
+</ul>
+
-
+ <h4>Tools</h4>
- <li class=""><a href="/docs/encodings.html">Column Encodings</a></li>
-
+<ul>
@@ -2061,11 +1144,17 @@ would be read.</p>
+ <li class=""><a href="/docs/cpp-tools.html">C++ Tools</a></li>
+
+
+
-
+
+
+
@@ -2087,7 +1176,7 @@ would be read.</p>
- <li class=""><a href="/docs/spec-index.html">Indexes</a></li>
+ <li class=""><a href="/docs/java-tools.html">Java Tools</a></li>
http://git-wip-us.apache.org/repos/asf/orc/blob/c6e29090/docs/java-tools.html
----------------------------------------------------------------------
diff --git a/docs/java-tools.html b/docs/java-tools.html
index 7d38769..25efb43 100644
--- a/docs/java-tools.html
+++ b/docs/java-tools.html
@@ -109,12 +109,6 @@
-
-
-
-
-
-
<option value="/docs/index.html">Background</option>
@@ -130,14 +124,6 @@
-
-
-
-
-
-
-
-
@@ -174,20 +160,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
-
-
@@ -221,20 +193,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
-
-
<option value="/docs/types.html">Types</option>
@@ -261,12 +219,6 @@
-
-
-
-
-
-
<option value="/docs/indexes.html">Indexes</option>
@@ -280,14 +232,6 @@
-
-
-
-
-
-
-
-
@@ -324,20 +268,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
-
-
</optgroup>
@@ -381,20 +311,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
-
-
@@ -426,25 +342,11 @@
-
-
-
-
-
-
<option value="/docs/releases.html">Releases</option>
-
-
-
-
-
-
-
-
</optgroup>
@@ -471,12 +373,6 @@
-
-
-
-
-
-
<option value="/docs/hive-ddl.html">Hive DDL</option>
@@ -494,14 +390,6 @@
-
-
-
-
-
-
-
-
@@ -519,12 +407,6 @@
-
-
-
-
-
-
<option value="/docs/hive-config.html">Hive Configuration</option>
@@ -544,14 +426,6 @@
-
-
-
-
-
-
-
-
</optgroup>
@@ -586,12 +460,6 @@
-
-
-
-
-
-
<option value="/docs/mapred.html">Using in MapRed</option>
@@ -601,14 +469,6 @@
-
-
-
-
-
-
-
-
@@ -638,12 +498,6 @@
-
-
-
-
-
-
<option value="/docs/mapreduce.html">Using in MapReduce</option>
@@ -651,14 +505,6 @@
-
-
-
-
-
-
-
-
</optgroup>
@@ -679,8 +525,6 @@
-
-
<option value="/docs/core-java.html">Using Core Java</option>
@@ -704,18 +548,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
@@ -727,8 +559,6 @@
-
-
<option value="/docs/core-cpp.html">Using Core C++</option>
@@ -754,18 +584,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
</optgroup>
@@ -788,8 +606,6 @@
-
-
<option value="/docs/cpp-tools.html">C++ Tools</option>
@@ -811,18 +627,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
@@ -848,12 +652,6 @@
-
-
-
-
-
-
<option value="/docs/java-tools.html">Java Tools</option>
@@ -865,992 +663,329 @@
-
-
-
-
-
-
-
-
</optgroup>
- <optgroup label="Format Specification">
-
+ </select>
+</div>
-
+ <div class="unit four-fifths">
+ <article>
+ <h1>Java Tools</h1>
+ <p>In addition to the C++ tools, there is an ORC tools jar that packages
+several useful utilities and the necessary Java dependencies
+(including Hadoop) into a single package. The Java ORC tool jar
+supports both the local file system and HDFS.</p>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/spec-intro.html">Introduction</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/file-tail.html">File Tail</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/compression.html">Compression</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/run-length.html">Run Length Encoding</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/stripes.html">Stripes</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/encodings.html">Column Encodings</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/spec-index.html">Indexes</option>
-
-
-
-
-
-
-
-
-
-
- </optgroup>
-
- </select>
-</div>
-
-
- <div class="unit four-fifths">
- <article>
- <h1>Java Tools</h1>
- <p>In addition to the C++ tools, there is an ORC tools jar that packages
-several useful utilities and the necessary Java dependencies
-(including Hadoop) into a single package. The Java ORC tool jar
-supports both the local file system and HDFS.</p>
-
-<p>The subcommands for the tools are:</p>
-
-<ul>
- <li>meta - print the metadata of an ORC file</li>
- <li>data - print the data of an ORC file</li>
- <li>scan (since ORC 1.3) - scan the data for benchmarking</li>
- <li>convert (since ORC 1.4) - convert JSON files to ORC</li>
- <li>json-schema (since ORC 1.4) - determine the schema of JSON documents</li>
-</ul>
-
-<p>The command line looks like:</p>
-
-<pre><code class="language-shell">% java -jar orc-tools-X.Y.Z-uber.jar <sub-command> <args>
-</code></pre>
-
-<h2 id="java-meta">Java Meta</h2>
-
-<p>The meta command prints the metadata about the given ORC file and is
-equivalent to the Hive ORC File Dump command.</p>
-
-<dl>
- <dt>-j</dt>
- <dd>format the output in JSON</dd>
- <dt>-p</dt>
- <dd>pretty print the output</dd>
- <dt>-t</dt>
- <dd>print the timezone of the writer</dd>
- <dt>--rowindex</dt>
- <dd>print the row indexes for the comma separated list of column ids</dd>
- <dt>--recover</dt>
- <dd>skip over corrupted values in the ORC file</dd>
- <dt>--skip-dump</dt>
- <dd>skip dumping the metadata</dd>
- <dt>--backup-path</dt>
- <dd>when used with --recover specifies the path where the recovered file is written</dd>
-</dl>
-
-<p>An example of the output is given below:</p>
-
-<pre><code class="language-shell">% java -jar orc-tools-X.Y.Z-uber.jar meta examples/TestOrcFile.test1.orc
-Processing data file examples/TestOrcFile.test1.orc [length: 1711]
-Structure for examples/TestOrcFile.test1.orc
-File Version: 0.12 with HIVE_8732
-Rows: 2
-Compression: ZLIB
-Compression size: 10000
-Type: struct<boolean1:boolean,byte1:tinyint,short1:smallint,int1:int,
-long1:bigint,float1:float,double1:double,bytes1:binary,string1:string,
-middle:struct<list:array<struct<int1:int,string1:string>>>,list:array<
-struct<int1:int,string1:string>>,map:map<string,struct<int1:int,string1:
-string>>>
-
-Stripe Statistics:
- Stripe 1:
- Column 0: count: 2 hasNull: false
- Column 1: count: 2 hasNull: false true: 1
- Column 2: count: 2 hasNull: false min: 1 max: 100 sum: 101
- Column 3: count: 2 hasNull: false min: 1024 max: 2048 sum: 3072
- Column 4: count: 2 hasNull: false min: 65536 max: 65536 sum: 131072
- Column 5: count: 2 hasNull: false min: 9223372036854775807 max: 9223372036854775807
- Column 6: count: 2 hasNull: false min: 1.0 max: 2.0 sum: 3.0
- Column 7: count: 2 hasNull: false min: -15.0 max: -5.0 sum: -20.0
- Column 8: count: 2 hasNull: false sum: 5
- Column 9: count: 2 hasNull: false min: bye max: hi sum: 5
- Column 10: count: 2 hasNull: false
- Column 11: count: 2 hasNull: false
- Column 12: count: 4 hasNull: false
- Column 13: count: 4 hasNull: false min: 1 max: 2 sum: 6
- Column 14: count: 4 hasNull: false min: bye max: sigh sum: 14
- Column 15: count: 2 hasNull: false
- Column 16: count: 5 hasNull: false
- Column 17: count: 5 hasNull: false min: -100000 max: 100000000 sum: 99901241
- Column 18: count: 5 hasNull: false min: bad max: in sum: 15
- Column 19: count: 2 hasNull: false
- Column 20: count: 2 hasNull: false min: chani max: mauddib sum: 12
- Column 21: count: 2 hasNull: false
- Column 22: count: 2 hasNull: false min: 1 max: 5 sum: 6
- Column 23: count: 2 hasNull: false min: chani max: mauddib sum: 12
-
-File Statistics:
- Column 0: count: 2 hasNull: false
- Column 1: count: 2 hasNull: false true: 1
- Column 2: count: 2 hasNull: false min: 1 max: 100 sum: 101
- Column 3: count: 2 hasNull: false min: 1024 max: 2048 sum: 3072
- Column 4: count: 2 hasNull: false min: 65536 max: 65536 sum: 131072
- Column 5: count: 2 hasNull: false min: 9223372036854775807 max: 9223372036854775807
- Column 6: count: 2 hasNull: false min: 1.0 max: 2.0 sum: 3.0
- Column 7: count: 2 hasNull: false min: -15.0 max: -5.0 sum: -20.0
- Column 8: count: 2 hasNull: false sum: 5
- Column 9: count: 2 hasNull: false min: bye max: hi sum: 5
- Column 10: count: 2 hasNull: false
- Column 11: count: 2 hasNull: false
- Column 12: count: 4 hasNull: false
- Column 13: count: 4 hasNull: false min: 1 max: 2 sum: 6
- Column 14: count: 4 hasNull: false min: bye max: sigh sum: 14
- Column 15: count: 2 hasNull: false
- Column 16: count: 5 hasNull: false
- Column 17: count: 5 hasNull: false min: -100000 max: 100000000 sum: 99901241
- Column 18: count: 5 hasNull: false min: bad max: in sum: 15
- Column 19: count: 2 hasNull: false
- Column 20: count: 2 hasNull: false min: chani max: mauddib sum: 12
- Column 21: count: 2 hasNull: false
- Column 22: count: 2 hasNull: false min: 1 max: 5 sum: 6
- Column 23: count: 2 hasNull: false min: chani max: mauddib sum: 12
-
-Stripes:
- Stripe: offset: 3 data: 243 rows: 2 tail: 199 index: 570
- Stream: column 0 section ROW_INDEX start: 3 length 11
- Stream: column 1 section ROW_INDEX start: 14 length 22
- Stream: column 2 section ROW_INDEX start: 36 length 26
- Stream: column 3 section ROW_INDEX start: 62 length 27
- Stream: column 4 section ROW_INDEX start: 89 length 30
- Stream: column 5 section ROW_INDEX start: 119 length 28
- Stream: column 6 section ROW_INDEX start: 147 length 34
- Stream: column 7 section ROW_INDEX start: 181 length 34
- Stream: column 8 section ROW_INDEX start: 215 length 21
- Stream: column 9 section ROW_INDEX start: 236 length 30
- Stream: column 10 section ROW_INDEX start: 266 length 11
- Stream: column 11 section ROW_INDEX start: 277 length 16
- Stream: column 12 section ROW_INDEX start: 293 length 11
- Stream: column 13 section ROW_INDEX start: 304 length 24
- Stream: column 14 section ROW_INDEX start: 328 length 31
- Stream: column 15 section ROW_INDEX start: 359 length 16
- Stream: column 16 section ROW_INDEX start: 375 length 11
- Stream: column 17 section ROW_INDEX start: 386 length 32
- Stream: column 18 section ROW_INDEX start: 418 length 30
- Stream: column 19 section ROW_INDEX start: 448 length 16
- Stream: column 20 section ROW_INDEX start: 464 length 37
- Stream: column 21 section ROW_INDEX start: 501 length 11
- Stream: column 22 section ROW_INDEX start: 512 length 24
- Stream: column 23 section ROW_INDEX start: 536 length 37
- Stream: column 1 section DATA start: 573 length 5
- Stream: column 2 section DATA start: 578 length 6
- Stream: column 3 section DATA start: 584 length 9
- Stream: column 4 section DATA start: 593 length 11
- Stream: column 5 section DATA start: 604 length 12
- Stream: column 6 section DATA start: 616 length 11
- Stream: column 7 section DATA start: 627 length 15
- Stream: column 8 section DATA start: 642 length 8
- Stream: column 8 section LENGTH start: 650 length 6
- Stream: column 9 section DATA start: 656 length 8
- Stream: column 9 section LENGTH start: 664 length 6
- Stream: column 11 section LENGTH start: 670 length 6
- Stream: column 13 section DATA start: 676 length 7
- Stream: column 14 section DATA start: 683 length 6
- Stream: column 14 section LENGTH start: 689 length 6
- Stream: column 14 section DICTIONARY_DATA start: 695 length 10
- Stream: column 15 section LENGTH start: 705 length 6
- Stream: column 17 section DATA start: 711 length 25
- Stream: column 18 section DATA start: 736 length 18
- Stream: column 18 section LENGTH start: 754 length 8
- Stream: column 19 section LENGTH start: 762 length 6
- Stream: column 20 section DATA start: 768 length 15
- Stream: column 20 section LENGTH start: 783 length 6
- Stream: column 22 section DATA start: 789 length 6
- Stream: column 23 section DATA start: 795 length 15
- Stream: column 23 section LENGTH start: 810 length 6
- Encoding column 0: DIRECT
- Encoding column 1: DIRECT
- Encoding column 2: DIRECT
- Encoding column 3: DIRECT_V2
- Encoding column 4: DIRECT_V2
- Encoding column 5: DIRECT_V2
- Encoding column 6: DIRECT
- Encoding column 7: DIRECT
- Encoding column 8: DIRECT_V2
- Encoding column 9: DIRECT_V2
- Encoding column 10: DIRECT
- Encoding column 11: DIRECT_V2
- Encoding column 12: DIRECT
- Encoding column 13: DIRECT_V2
- Encoding column 14: DICTIONARY_V2[2]
- Encoding column 15: DIRECT_V2
- Encoding column 16: DIRECT
- Encoding column 17: DIRECT_V2
- Encoding column 18: DIRECT_V2
- Encoding column 19: DIRECT_V2
- Encoding column 20: DIRECT_V2
- Encoding column 21: DIRECT
- Encoding column 22: DIRECT_V2
- Encoding column 23: DIRECT_V2
-
-File length: 1711 bytes
-Padding length: 0 bytes
-Padding ratio: 0%
-______________________________________________________________________
-</code></pre>
-
-<h2 id="java-data">Java Data</h2>
-
-<p>The data command prints the data in an ORC file as a JSON document. Each
-record is printed as a JSON object on a line. Each record is annotated with
-the field names and a JSON representation that depends on the field’s type.</p>
-
-<h2 id="java-scan">Java Scan</h2>
-
-<p>The scan command reads the contents of the file without printing anything. It
-is primarily intended for benchmarking the Java reader without including the
-cost of printing the data out.</p>
-
-<h2 id="java-convert">Java Convert</h2>
-
-<p>The convert command reads several JSON files and converts them into a
-single ORC file.</p>
-
-<dl>
- <dt>-o <filename></filename></dt>
- <dd>Sets the output ORC filename, which defaults to output.orc</dd>
- <dt>-s <schema></schema></dt>
- <dd>Sets the schema for the ORC file. By default, the schema is automatically discovered.</dd>
- <dt>-h</dt>
- <dd>Print help</dd>
-</dl>
-
-<p>The automatic JSON schema discovery is equivalent to the json-schema tool
-below.</p>
-
-<h2 id="java-json-schema">Java JSON Schema</h2>
-
-<p>The JSON Schema discovery tool processes a set of JSON documents and
-produces a schema that encompasses all of the records in all of the
-documents. It works by computing the enclosing type and promoting it
-to include all of the observed values.</p>
-
-<dl>
- <dt>-f</dt>
- <dd>Print the schema as a list of flat types for each subfield</dd>
- <dt>-t</dt>
- <dd>Print the schema as a Hive table declaration</dd>
- <dt>-h</dt>
- <dd>Print help</dd>
-</dl>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <div class="section-nav">
- <div class="left align-right">
-
-
-
- <a href="/docs/cpp-tools.html" class="prev">Back</a>
-
- </div>
- <div class="right align-left">
-
-
-
- <a href="/docs/spec-intro.html" class="next">Next</a>
-
- </div>
- </div>
- <div class="clear"></div>
-
-
- </article>
- </div>
-
- <div class="unit one-fifth hide-on-mobiles">
- <aside>
-
- <h4>Overview</h4>
-
-
-<ul>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/index.html">Background</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/adopters.html">ORC Adopters</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/types.html">Types</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/indexes.html">Indexes</a></li>
-
+<p>The subcommands for the tools are:</p>
+<ul>
+ <li>meta - print the metadata of an ORC file</li>
+ <li>data - print the data of an ORC file</li>
+ <li>scan (since ORC 1.3) - scan the data for benchmarking</li>
+ <li>convert (since ORC 1.4) - convert JSON files to ORC</li>
+ <li>json-schema (since ORC 1.4) - determine the schema of JSON documents</li>
+</ul>
-
+<p>The command line looks like:</p>
-
-
-
+<pre><code class="language-shell">% java -jar orc-tools-X.Y.Z-uber.jar <sub-command> <args>
+</code></pre>
-
-
- <li class=""><a href="/docs/acid.html">ACID support</a></li>
-
+<h2 id="java-meta">Java Meta</h2>
+<p>The meta command prints the metadata about the given ORC file and is
+equivalent to the Hive ORC File Dump command.</p>
-</ul>
+<dl>
+ <dt>-j</dt>
+ <dd>format the output in JSON</dd>
+ <dt>-p</dt>
+ <dd>pretty print the output</dd>
+ <dt>-t</dt>
+ <dd>print the timezone of the writer</dd>
+ <dt>--rowindex</dt>
+ <dd>print the row indexes for the comma separated list of column ids</dd>
+ <dt>--recover</dt>
+ <dd>skip over corrupted values in the ORC file</dd>
+ <dt>--skip-dump</dt>
+ <dd>skip dumping the metadata</dd>
+ <dt>--backup-path</dt>
+ <dd>when used with --recover specifies the path where the recovered file is written</dd>
+</dl>
-
- <h4>Installing</h4>
-
+<p>An example of the output is given below:</p>
-<ul>
+<pre><code class="language-shell">% java -jar orc-tools-X.Y.Z-uber.jar meta examples/TestOrcFile.test1.orc
+Processing data file examples/TestOrcFile.test1.orc [length: 1711]
+Structure for examples/TestOrcFile.test1.orc
+File Version: 0.12 with HIVE_8732
+Rows: 2
+Compression: ZLIB
+Compression size: 10000
+Type: struct<boolean1:boolean,byte1:tinyint,short1:smallint,int1:int,
+long1:bigint,float1:float,double1:double,bytes1:binary,string1:string,
+middle:struct<list:array<struct<int1:int,string1:string>>>,list:array<
+struct<int1:int,string1:string>>,map:map<string,struct<int1:int,string1:
+string>>>
-
+Stripe Statistics:
+ Stripe 1:
+ Column 0: count: 2 hasNull: false
+ Column 1: count: 2 hasNull: false true: 1
+ Column 2: count: 2 hasNull: false min: 1 max: 100 sum: 101
+ Column 3: count: 2 hasNull: false min: 1024 max: 2048 sum: 3072
+ Column 4: count: 2 hasNull: false min: 65536 max: 65536 sum: 131072
+ Column 5: count: 2 hasNull: false min: 9223372036854775807 max: 9223372036854775807
+ Column 6: count: 2 hasNull: false min: 1.0 max: 2.0 sum: 3.0
+ Column 7: count: 2 hasNull: false min: -15.0 max: -5.0 sum: -20.0
+ Column 8: count: 2 hasNull: false sum: 5
+ Column 9: count: 2 hasNull: false min: bye max: hi sum: 5
+ Column 10: count: 2 hasNull: false
+ Column 11: count: 2 hasNull: false
+ Column 12: count: 4 hasNull: false
+ Column 13: count: 4 hasNull: false min: 1 max: 2 sum: 6
+ Column 14: count: 4 hasNull: false min: bye max: sigh sum: 14
+ Column 15: count: 2 hasNull: false
+ Column 16: count: 5 hasNull: false
+ Column 17: count: 5 hasNull: false min: -100000 max: 100000000 sum: 99901241
+ Column 18: count: 5 hasNull: false min: bad max: in sum: 15
+ Column 19: count: 2 hasNull: false
+ Column 20: count: 2 hasNull: false min: chani max: mauddib sum: 12
+ Column 21: count: 2 hasNull: false
+ Column 22: count: 2 hasNull: false min: 1 max: 5 sum: 6
+ Column 23: count: 2 hasNull: false min: chani max: mauddib sum: 12
-
-
-
+File Statistics:
+ Column 0: count: 2 hasNull: false
+ Column 1: count: 2 hasNull: false true: 1
+ Column 2: count: 2 hasNull: false min: 1 max: 100 sum: 101
+ Column 3: count: 2 hasNull: false min: 1024 max: 2048 sum: 3072
+ Column 4: count: 2 hasNull: false min: 65536 max: 65536 sum: 131072
+ Column 5: count: 2 hasNull: false min: 9223372036854775807 max: 9223372036854775807
+ Column 6: count: 2 hasNull: false min: 1.0 max: 2.0 sum: 3.0
+ Column 7: count: 2 hasNull: false min: -15.0 max: -5.0 sum: -20.0
+ Column 8: count: 2 hasNull: false sum: 5
+ Column 9: count: 2 hasNull: false min: bye max: hi sum: 5
+ Column 10: count: 2 hasNull: false
+ Column 11: count: 2 hasNull: false
+ Column 12: count: 4 hasNull: false
+ Column 13: count: 4 hasNull: false min: 1 max: 2 sum: 6
+ Column 14: count: 4 hasNull: false min: bye max: sigh sum: 14
+ Column 15: count: 2 hasNull: false
+ Column 16: count: 5 hasNull: false
+ Column 17: count: 5 hasNull: false min: -100000 max: 100000000 sum: 99901241
+ Column 18: count: 5 hasNull: false min: bad max: in sum: 15
+ Column 19: count: 2 hasNull: false
+ Column 20: count: 2 hasNull: false min: chani max: mauddib sum: 12
+ Column 21: count: 2 hasNull: false
+ Column 22: count: 2 hasNull: false min: 1 max: 5 sum: 6
+ Column 23: count: 2 hasNull: false min: chani max: mauddib sum: 12
-
-
-
-
-
-
- <li class=""><a href="/docs/building.html">Building ORC</a></li>
-
+Stripes:
+ Stripe: offset: 3 data: 243 rows: 2 tail: 199 index: 570
+ Stream: column 0 section ROW_INDEX start: 3 length 11
+ Stream: column 1 section ROW_INDEX start: 14 length 22
+ Stream: column 2 section ROW_INDEX start: 36 length 26
+ Stream: column 3 section ROW_INDEX start: 62 length 27
+ Stream: column 4 section ROW_INDEX start: 89 length 30
+ Stream: column 5 section ROW_INDEX start: 119 length 28
+ Stream: column 6 section ROW_INDEX start: 147 length 34
+ Stream: column 7 section ROW_INDEX start: 181 length 34
+ Stream: column 8 section ROW_INDEX start: 215 length 21
+ Stream: column 9 section ROW_INDEX start: 236 length 30
+ Stream: column 10 section ROW_INDEX start: 266 length 11
+ Stream: column 11 section ROW_INDEX start: 277 length 16
+ Stream: column 12 section ROW_INDEX start: 293 length 11
+ Stream: column 13 section ROW_INDEX start: 304 length 24
+ Stream: column 14 section ROW_INDEX start: 328 length 31
+ Stream: column 15 section ROW_INDEX start: 359 length 16
+ Stream: column 16 section ROW_INDEX start: 375 length 11
+ Stream: column 17 section ROW_INDEX start: 386 length 32
+ Stream: column 18 section ROW_INDEX start: 418 length 30
+ Stream: column 19 section ROW_INDEX start: 448 length 16
+ Stream: column 20 section ROW_INDEX start: 464 length 37
+ Stream: column 21 section ROW_INDEX start: 501 length 11
+ Stream: column 22 section ROW_INDEX start: 512 length 24
+ Stream: column 23 section ROW_INDEX start: 536 length 37
+ Stream: column 1 section DATA start: 573 length 5
+ Stream: column 2 section DATA start: 578 length 6
+ Stream: column 3 section DATA start: 584 length 9
+ Stream: column 4 section DATA start: 593 length 11
+ Stream: column 5 section DATA start: 604 length 12
+ Stream: column 6 section DATA start: 616 length 11
+ Stream: column 7 section DATA start: 627 length 15
+ Stream: column 8 section DATA start: 642 length 8
+ Stream: column 8 section LENGTH start: 650 length 6
+ Stream: column 9 section DATA start: 656 length 8
+ Stream: column 9 section LENGTH start: 664 length 6
+ Stream: column 11 section LENGTH start: 670 length 6
+ Stream: column 13 section DATA start: 676 length 7
+ Stream: column 14 section DATA start: 683 length 6
+ Stream: column 14 section LENGTH start: 689 length 6
+ Stream: column 14 section DICTIONARY_DATA start: 695 length 10
+ Stream: column 15 section LENGTH start: 705 length 6
+ Stream: column 17 section DATA start: 711 length 25
+ Stream: column 18 section DATA start: 736 length 18
+ Stream: column 18 section LENGTH start: 754 length 8
+ Stream: column 19 section LENGTH start: 762 length 6
+ Stream: column 20 section DATA start: 768 length 15
+ Stream: column 20 section LENGTH start: 783 length 6
+ Stream: column 22 section DATA start: 789 length 6
+ Stream: column 23 section DATA start: 795 length 15
+ Stream: column 23 section LENGTH start: 810 length 6
+ Encoding column 0: DIRECT
+ Encoding column 1: DIRECT
+ Encoding column 2: DIRECT
+ Encoding column 3: DIRECT_V2
+ Encoding column 4: DIRECT_V2
+ Encoding column 5: DIRECT_V2
+ Encoding column 6: DIRECT
+ Encoding column 7: DIRECT
+ Encoding column 8: DIRECT_V2
+ Encoding column 9: DIRECT_V2
+ Encoding column 10: DIRECT
+ Encoding column 11: DIRECT_V2
+ Encoding column 12: DIRECT
+ Encoding column 13: DIRECT_V2
+ Encoding column 14: DICTIONARY_V2[2]
+ Encoding column 15: DIRECT_V2
+ Encoding column 16: DIRECT
+ Encoding column 17: DIRECT_V2
+ Encoding column 18: DIRECT_V2
+ Encoding column 19: DIRECT_V2
+ Encoding column 20: DIRECT_V2
+ Encoding column 21: DIRECT
+ Encoding column 22: DIRECT_V2
+ Encoding column 23: DIRECT_V2
+File length: 1711 bytes
+Padding length: 0 bytes
+Padding ratio: 0%
+______________________________________________________________________
+</code></pre>
-
+<h2 id="java-data">Java Data</h2>
-
-
-
+<p>The data command prints the data in an ORC file as a JSON document. Each
+record is printed as a JSON object on a line. Each record is annotated with
+the field names and a JSON representation that depends on the field’s type.</p>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/releases.html">Releases</a></li>
-
+<h2 id="java-scan">Java Scan</h2>
+
+<p>The scan command reads the contents of the file without printing anything. It
+is primarily intended for benchmarking the Java reader without including the
+cost of printing the data out.</p>
+
+<h2 id="java-convert">Java Convert</h2>
+<p>The convert command reads several JSON files and converts them into a
+single ORC file.</p>
+
+<dl>
+ <dt>-o <filename></filename></dt>
+ <dd>Sets the output ORC filename, which defaults to output.orc</dd>
+ <dt>-s <schema></schema></dt>
+ <dd>Sets the schema for the ORC file. By default, the schema is automatically discovered.</dd>
+ <dt>-h</dt>
+ <dd>Print help</dd>
+</dl>
+
+<p>The automatic JSON schema discovery is equivalent to the json-schema tool
+below.</p>
+
+<h2 id="java-json-schema">Java JSON Schema</h2>
+
+<p>The JSON Schema discovery tool processes a set of JSON documents and
+produces a schema that encompasses all of the records in all of the
+documents. It works by computing the enclosing type and promoting it
+to include all of the observed values.</p>
+
+<dl>
+ <dt>-f</dt>
+ <dd>Print the schema as a list of flat types for each subfield</dd>
+ <dt>-t</dt>
+ <dd>Print the schema as a Hive table declaration</dd>
+ <dt>-h</dt>
+ <dd>Print help</dd>
+</dl>
+
+
-</ul>
-
- <h4>Using in Hive</h4>
-
-<ul>
-
-
-
-
+
-
-
+
-
-
+
-
-
+
-
-
+
+
-
- <li class=""><a href="/docs/hive-ddl.html">Hive DDL</a></li>
-
+
+
+
-
-
-
+
-
-
+
-
-
+
-
-
+
-
+ <div class="section-nav">
+ <div class="left align-right">
+
+
+
+ <a href="/docs/cpp-tools.html" class="prev">Back</a>
+
+ </div>
+ <div class="right align-left">
+
+ <span class="next disabled">Next</span>
+
+ </div>
+ </div>
+ <div class="clear"></div>
- <li class=""><a href="/docs/hive-config.html">Hive Configuration</a></li>
-
+ </article>
+ </div>
-</ul>
-
+ <div class="unit one-fifth hide-on-mobiles">
+ <aside>
- <h4>Using in MapReduce</h4>
+ <h4>Overview</h4>
<ul>
@@ -1879,19 +1014,21 @@ to include all of the observed values.</p>
+ <li class=""><a href="/docs/index.html">Background</a></li>
+
+
+
-
-
-
+
-
+
- <li class=""><a href="/docs/mapred.html">Using in MapRed</a></li>
+ <li class=""><a href="/docs/adopters.html">ORC Adopters</a></li>
@@ -1931,20 +1068,10 @@ to include all of the observed values.</p>
-
-
- <li class=""><a href="/docs/mapreduce.html">Using in MapReduce</a></li>
+ <li class=""><a href="/docs/types.html">Types</a></li>
-</ul>
-
-
- <h4>Using ORC Core</h4>
-
-
-<ul>
-
@@ -1963,34 +1090,34 @@ to include all of the observed values.</p>
- <li class=""><a href="/docs/core-java.html">Using Core Java</a></li>
-
-
-
-
-
-
-
+ <li class=""><a href="/docs/indexes.html">Indexes</a></li>
+
+
+
+
+
+
+
- <li class=""><a href="/docs/core-cpp.html">Using Core C++</a></li>
+ <li class=""><a href="/docs/acid.html">ACID support</a></li>
</ul>
- <h4>Tools</h4>
+ <h4>Installing</h4>
<ul>
@@ -2007,15 +1134,7 @@ to include all of the observed values.</p>
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/cpp-tools.html">C++ Tools</a></li>
+ <li class=""><a href="/docs/building.html">Building ORC</a></li>
@@ -2053,14 +1172,14 @@ to include all of the observed values.</p>
- <li class="current"><a href="/docs/java-tools.html">Java Tools</a></li>
+ <li class=""><a href="/docs/releases.html">Releases</a></li>
</ul>
- <h4>Format Specification</h4>
+ <h4>Using in Hive</h4>
<ul>
@@ -2087,31 +1206,7 @@ to include all of the observed values.</p>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/spec-intro.html">Introduction</a></li>
+ <li class=""><a href="/docs/hive-ddl.html">Hive DDL</a></li>
@@ -2135,31 +1230,17 @@ to include all of the observed values.</p>
-
-
-
-
- <li class=""><a href="/docs/file-tail.html">File Tail</a></li>
+ <li class=""><a href="/docs/hive-config.html">Hive Configuration</a></li>
-
-
-
-
-
+</ul>
-
-
-
-
-
-
+ <h4>Using in MapReduce</h4>
- <li class=""><a href="/docs/compression.html">Compression</a></li>
-
+<ul>
@@ -2191,19 +1272,7 @@ to include all of the observed values.</p>
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/run-length.html">Run Length Encoding</a></li>
+ <li class=""><a href="/docs/mapred.html">Using in MapRed</a></li>
@@ -2239,13 +1308,25 @@ to include all of the observed values.</p>
-
+ <li class=""><a href="/docs/mapreduce.html">Using in MapReduce</a></li>
+
+
+
+</ul>
+
-
+ <h4>Using ORC Core</h4>
+
+<ul>
+
+
+
+
+
@@ -2255,7 +1336,7 @@ to include all of the observed values.</p>
- <li class=""><a href="/docs/stripes.html">Stripes</a></li>
+ <li class=""><a href="/docs/core-java.html">Using Core Java</a></li>
@@ -2273,17 +1354,17 @@ to include all of the observed values.</p>
-
-
-
-
-
+ <li class=""><a href="/docs/core-cpp.html">Using Core C++</a></li>
+
+
+
+</ul>
+
-
+ <h4>Tools</h4>
- <li class=""><a href="/docs/encodings.html">Column Encodings</a></li>
-
+<ul>
@@ -2303,11 +1384,17 @@ to include all of the observed values.</p>
+ <li class=""><a href="/docs/cpp-tools.html">C++ Tools</a></li>
+
+
+
-
+
+
+
@@ -2329,7 +1416,7 @@ to include all of the observed values.</p>
- <li class=""><a href="/docs/spec-index.html">Indexes</a></li>
+ <li class="current"><a href="/docs/java-tools.html">Java Tools</a></li>
[2/9] orc git commit: Pushing ORC-339 reorganize the ORC file format
spec.
Posted by om...@apache.org.
http://git-wip-us.apache.org/repos/asf/orc/blob/c6e29090/specification/ORCv1.html
----------------------------------------------------------------------
diff --git a/specification/ORCv1.html b/specification/ORCv1.html
new file mode 100644
index 0000000..e3cad2e
--- /dev/null
+++ b/specification/ORCv1.html
@@ -0,0 +1,1744 @@
+<!DOCTYPE HTML>
+<html lang="en-US">
+<head>
+ <meta charset="UTF-8">
+ <title>ORC Specification v1</title>
+ <meta name="viewport" content="width=device-width,initial-scale=1">
+ <meta name="generator" content="Jekyll v2.4.0">
+ <link rel="stylesheet" href="//fonts.googleapis.com/css?family=Lato:300,300italic,400,400italic,700,700italic,900">
+ <link rel="stylesheet" href="/css/screen.css">
+ <link rel="icon" type="image/x-icon" href="/favicon.ico">
+ <!--[if lt IE 9]>
+ <script src="/js/html5shiv.min.js"></script>
+ <script src="/js/respond.min.js"></script>
+ <![endif]-->
+</head>
+
+
+<body class="wrap">
+ <header role="banner">
+ <nav class="mobile-nav show-on-mobiles">
+ <ul>
+ <li class="">
+ <a href="/">Home</a>
+ </li>
+ <li class="">
+ <a href="/docs/"><span class="show-on-mobiles">Docs</span>
+ <span class="hide-on-mobiles">Documentation</span></a>
+ </li>
+ <li class="">
+ <a href="/talks/">Talks</a>
+ </li>
+ <li class="">
+ <a href="/news/">News</a>
+ </li>
+ <li class="">
+ <a href="/help/">Help</a>
+ </li>
+ <li class="">
+ <a href="/develop/">Develop</a>
+ </li>
+</ul>
+
+ </nav>
+ <div class="grid">
+ <div class="unit one-third center-on-mobiles">
+ <h1>
+ <a href="/">
+ <span class="sr-only">Apache ORC</span>
+ <img src="/img/logo.png" width="249" height="101" alt="ORC Logo">
+ </a>
+ </h1>
+ </div>
+ <nav class="main-nav unit two-thirds hide-on-mobiles">
+ <ul>
+ <li class="">
+ <a href="/">Home</a>
+ </li>
+ <li class="">
+ <a href="/docs/"><span class="show-on-mobiles">Docs</span>
+ <span class="hide-on-mobiles">Documentation</span></a>
+ </li>
+ <li class="">
+ <a href="/talks/">Talks</a>
+ </li>
+ <li class="">
+ <a href="/news/">News</a>
+ </li>
+ <li class="">
+ <a href="/help/">Help</a>
+ </li>
+ <li class="">
+ <a href="/develop/">Develop</a>
+ </li>
+</ul>
+
+ </nav>
+ </div>
+</header>
+
+
+ <section class="standalone">
+ <div class="grid">
+
+ <div class="unit whole">
+ <article>
+ <h1>ORC Specification v1</h1>
+ <p>This version of the file format was originally released as part of
+Hive 0.12.</p>
+
+<h1 id="motivation">Motivation</h1>
+
+<p>Hive’s RCFile was the standard format for storing tabular data in
+Hadoop for several years. However, RCFile has limitations because it
+treats each column as a binary blob without semantics. In Hive 0.11 we
+added a new file format named Optimized Row Columnar (ORC) file that
+uses and retains the type information from the table definition. ORC
+uses type-specific readers and writers that provide lightweight
+compression techniques such as dictionary encoding, bit packing, delta
+encoding, and run length encoding, resulting in dramatically smaller
+files. Additionally, ORC can apply generic compression using zlib or
+Snappy on top of the lightweight compression for even smaller
+files. However, storage savings are only part of the gain. ORC
+supports projection, which selects subsets of the columns for reading,
+so that queries reading only one column read only the required
+bytes. Furthermore, ORC files include lightweight indexes that
+include the minimum and maximum values for each column in each set of
+10,000 rows and the entire file. Using pushdown filters from Hive, the
+file reader can skip entire sets of rows that aren’t important for
+this query.</p>
+
+<p><img src="/img/OrcFileLayout.png" alt="ORC file structure" /></p>
+
+<h1 id="file-tail">File Tail</h1>
+
+<p>Since HDFS does not support changing the data in a file after it is
+written, ORC stores the top level index at the end of the file. The
+overall structure of the file is given in the figure above. The
+file’s tail consists of three parts: the file metadata, the file footer, and
+the postscript.</p>
+
+<p>The metadata for ORC is stored using
+<a href="https://s.apache.org/protobuf_encoding">Protocol Buffers</a>, which provides
+the ability to add new fields without breaking readers. This document
+incorporates the Protobuf definition from the
+<a href="https://s.apache.org/orc_proto">ORC source code</a> and the
+reader is encouraged to review the Protobuf encoding if they need to
+understand the byte-level encoding.</p>
+
+<h2 id="postscript">Postscript</h2>
+
+<p>The Postscript section provides the necessary information to interpret
+the rest of the file including the length of the file’s Footer and
+Metadata sections, the version of the file, and the kind of general
+compression used (e.g., none, zlib, or snappy). The Postscript is never
+compressed and ends one byte before the end of the file. The version
+stored in the Postscript is the lowest version of Hive that is
+guaranteed to be able to read the file and is stored as a sequence of
+the major and minor version. This file version is encoded as [0,12].</p>
+
+<p>The process of reading an ORC file works backwards through the
+file. Rather than making multiple short reads, the ORC reader reads
+the last 16k bytes of the file with the hope that it will contain both
+the Footer and Postscript sections. The final byte of the file
+contains the serialized length of the Postscript, which must be less
+than 256 bytes. Once the Postscript is parsed, the compressed
+serialized length of the Footer is known and it can be decompressed
+and parsed.</p>
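+
+<p>The first step of that backwards read can be sketched as follows. This is
+a minimal illustration in Python, not the ORC reader's code: the function name
+<code>split_tail</code> is hypothetical, and parsing the returned bytes as a
+Protobuf <code>PostScript</code> message is left out.</p>
+
```python
from typing import Tuple


def split_tail(tail: bytes) -> Tuple[int, bytes]:
    """Locate the serialized Postscript inside the last bytes of a file.

    `tail` is whatever the reader fetched from the end of the file
    (for example the last 16k bytes). Hypothetical helper for illustration.
    """
    if not tail:
        raise ValueError("empty file tail")
    ps_len = tail[-1]  # the final byte holds the Postscript length
    if ps_len + 1 > len(tail):
        raise ValueError("tail buffer is too short to hold the Postscript")
    # The Postscript ends one byte before the end of the file.
    postscript = tail[len(tail) - 1 - ps_len:len(tail) - 1]
    return ps_len, postscript
```
+
+<p>Once the returned bytes are parsed, <code>footerLength</code> and
+<code>metadataLength</code> tell the reader how far back the Footer and
+Metadata sections start.</p>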
+
+<p><code>message PostScript {
+ // the length of the footer section in bytes
+ optional uint64 footerLength = 1;
+ // the kind of generic compression used
+ optional CompressionKind compression = 2;
+ // the maximum size of each compression chunk
+ optional uint64 compressionBlockSize = 3;
+ // the version of the writer
+ repeated uint32 version = 4 [packed = true];
+ // the length of the metadata section in bytes
+ optional uint64 metadataLength = 5;
+ // the fixed string "ORC"
+ optional string magic = 8000;
+}
+</code></p>
+
+<p><code>enum CompressionKind {
+ NONE = 0;
+ ZLIB = 1;
+ SNAPPY = 2;
+ LZO = 3;
+ LZ4 = 4;
+ ZSTD = 5;
+}
+</code></p>
+
+<h2 id="footer">Footer</h2>
+
+<p>The Footer section contains the layout of the body of the file, the
+type schema information, the number of rows, and the statistics about
+each of the columns.</p>
+
+<p>The file is broken into three parts: Header, Body, and Tail. The
+Header consists of the bytes “ORC” to support tools that want to
+scan the front of the file to determine the type of the file. The Body
+contains the rows and indexes, and the Tail gives the file level
+information as described in this section.</p>
+
+<p><code>message Footer {
+ // the length of the file header in bytes (always 3)
+ optional uint64 headerLength = 1;
+ // the length of the file header and body in bytes
+ optional uint64 contentLength = 2;
+ // the information about the stripes
+ repeated StripeInformation stripes = 3;
+ // the schema information
+ repeated Type types = 4;
+ // the user metadata that was added
+ repeated UserMetadataItem metadata = 5;
+ // the total number of rows in the file
+ optional uint64 numberOfRows = 6;
+ // the statistics of each column across the file
+ repeated ColumnStatistics statistics = 7;
+ // the maximum number of rows in each index entry
+ optional uint32 rowIndexStride = 8;
+}
+</code></p>
+
+<h3 id="stripe-information">Stripe Information</h3>
+
+<p>The body of the file is divided into stripes. Each stripe is self
+contained and may be read using only its own bytes combined with the
+file’s Footer and Postscript. Each stripe contains only entire rows so
+that rows never straddle stripe boundaries. Stripes have three
+sections: a set of indexes for the rows within the stripe, the data
+itself, and a stripe footer. Both the indexes and the data sections
+are divided by columns so that only the data for the required columns
+needs to be read.</p>
+
+<p><code>message StripeInformation {
+ // the start of the stripe within the file
+ optional uint64 offset = 1;
+ // the length of the indexes in bytes
+ optional uint64 indexLength = 2;
+ // the length of the data in bytes
+ optional uint64 dataLength = 3;
+ // the length of the footer in bytes
+ optional uint64 footerLength = 4;
+ // the number of rows in the stripe
+ optional uint64 numberOfRows = 5;
+}
+</code></p>
+
+<h3 id="type-information">Type Information</h3>
+
+<p>All of the rows in an ORC file must have the same schema. Logically
+the schema is expressed as a tree as in the figure below, where
+the compound types have subcolumns under them.</p>
+
+<p><img src="/img/TreeWriters.png" alt="ORC column structure" /></p>
+
+<p>The equivalent Hive DDL would be:</p>
+
+<p><code>create table Foobar (
+ myInt int,
+ myMap map&lt;string,
+ struct&lt;myString : string,
+ myDouble: double&gt;&gt;,
+ myTime timestamp
+);
+</code></p>
+
+<p>The type tree is flattened into a list via a pre-order traversal
+where each type is assigned the next id. The root of the type
+tree is therefore always type id 0. Compound types have a field named subtypes
+that contains the list of their children’s type ids.</p>
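+
+<p>As a rough sketch of that flattening, the Python below assigns ids by
+pre-order traversal. The nested-tuple schema representation and the name
+<code>flatten</code> are assumptions made for this example only; the schema
+mirrors the Foobar table above.</p>
+
```python
def flatten(node, out=None):
    """Flatten a (kind, [children]) tree into a list indexed by type id.

    Each entry is (kind, subtypes), where subtypes lists the children's ids.
    The tuple-based tree is a hypothetical stand-in for a real schema object.
    """
    if out is None:
        out = []
    kind, children = node
    subtypes = []
    out.append((kind, subtypes))          # this node takes the next free id
    for child in children:
        subtypes.append(len(out))         # the child will take the next free id
        flatten(child, out)
    return out


FOOBAR = ("STRUCT", [
    ("INT", []),                                   # myInt
    ("MAP", [("STRING", []),                       # myMap key
             ("STRUCT", [("STRING", []),           # myString
                         ("DOUBLE", [])])]),       # myDouble
    ("TIMESTAMP", []),                             # myTime
])

types = flatten(FOOBAR)
```
+
+<p>For this schema the root struct gets id 0 with subtypes 1, 2, and 7, and
+the map gets id 2 with subtypes 3 and 4.</p>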
+
+<p><code>message Type {
+ enum Kind {
+ BOOLEAN = 0;
+ BYTE = 1;
+ SHORT = 2;
+ INT = 3;
+ LONG = 4;
+ FLOAT = 5;
+ DOUBLE = 6;
+ STRING = 7;
+ BINARY = 8;
+ TIMESTAMP = 9;
+ LIST = 10;
+ MAP = 11;
+ STRUCT = 12;
+ UNION = 13;
+ DECIMAL = 14;
+ DATE = 15;
+ VARCHAR = 16;
+ CHAR = 17;
+ }
+ // the kind of this type
+ required Kind kind = 1;
+ // the type ids of any subcolumns for list, map, struct, or union
+ repeated uint32 subtypes = 2 [packed=true];
+ // the list of field names for struct
+ repeated string fieldNames = 3;
+ // the maximum length of the type for varchar or char in UTF-8 characters
+ optional uint32 maximumLength = 4;
+ // the precision and scale for decimal
+ optional uint32 precision = 5;
+ optional uint32 scale = 6;
+}
+</code></p>
+
+<h3 id="column-statistics">Column Statistics</h3>
+
+<p>The goal of the column statistics is that for each column, the writer
+records the count and, depending on the type, other useful fields. For
+most of the primitive types, it records the minimum and maximum
+values; and for numeric types it additionally stores the sum.
+From Hive 1.1.0 onwards, the column statistics will also record if
+there are any null values within the row group by setting the hasNull flag.
+The hasNull flag is used by ORC’s predicate pushdown to better answer
+‘IS NULL’ queries.</p>
+
+<p><code>message ColumnStatistics {
+ // the number of values
+ optional uint64 numberOfValues = 1;
+ // At most one of these has a value for any column
+ optional IntegerStatistics intStatistics = 2;
+ optional DoubleStatistics doubleStatistics = 3;
+ optional StringStatistics stringStatistics = 4;
+ optional BucketStatistics bucketStatistics = 5;
+ optional DecimalStatistics decimalStatistics = 6;
+ optional DateStatistics dateStatistics = 7;
+ optional BinaryStatistics binaryStatistics = 8;
+ optional TimestampStatistics timestampStatistics = 9;
+ optional bool hasNull = 10;
+}
+</code></p>
+
+<p>For integer types (tinyint, smallint, int, bigint), the column
+statistics include the minimum, maximum, and sum. If the sum
+overflows long at any point during the calculation, no sum is
+recorded.</p>
+
+<p><code>message IntegerStatistics {
+ optional sint64 minimum = 1;
+ optional sint64 maximum = 2;
+ optional sint64 sum = 3;
+}
+</code></p>
+
+<p>For floating point types (float, double), the column statistics
+include the minimum, maximum, and sum. If the sum overflows a double,
+no sum is recorded.</p>
+
+<p><code>message DoubleStatistics {
+ optional double minimum = 1;
+ optional double maximum = 2;
+ optional double sum = 3;
+}
+</code></p>
+
+<p>For strings, the minimum value, maximum value, and the sum of the
+lengths of the values are recorded.</p>
+
+<p><code>message StringStatistics {
+ optional string minimum = 1;
+ optional string maximum = 2;
+ // sum will store the total length of all strings
+ optional sint64 sum = 3;
+}
+</code></p>
+
+<p>For booleans, the statistics include the count of false and true values.</p>
+
+<p><code>message BucketStatistics {
+ repeated uint64 count = 1 [packed=true];
+}
+</code></p>
+
+<p>For decimals, the minimum, maximum, and sum are stored.</p>
+
+<p><code>message DecimalStatistics {
+ optional string minimum = 1;
+ optional string maximum = 2;
+ optional string sum = 3;
+}
+</code></p>
+
+<p>Date columns record the minimum and maximum values as the number of
+days since the UNIX epoch (1/1/1970).</p>
+
+<p><code>message DateStatistics {
+ // min,max values saved as days since epoch
+ optional sint32 minimum = 1;
+ optional sint32 maximum = 2;
+}
+</code></p>
+
+<p>Timestamp columns record the minimum and maximum values as the number of
+milliseconds since the UNIX epoch (1/1/1970).</p>
+
+<p><code>message TimestampStatistics {
+ // min,max values saved as milliseconds since epoch
+ optional sint64 minimum = 1;
+ optional sint64 maximum = 2;
+}
+</code></p>
+
+<p>Binary columns store the aggregate number of bytes across all of the values.</p>
+
+<p><code>message BinaryStatistics {
+ // sum will store the total binary blob length
+ optional sint64 sum = 1;
+}
+</code></p>
+
+<h3 id="user-metadata">User Metadata</h3>
+
+<p>The user can add arbitrary key/value pairs to an ORC file as it is
+written. The contents of the keys and values are completely
+application defined, but the key is a string and the value is
+binary. Care should be taken by applications to make sure that their
+keys are unique and in general should be prefixed with an organization
+code.</p>
+
+<p><code>message UserMetadataItem {
+ // the user defined key
+ required string name = 1;
+ // the user defined binary value
+ required bytes value = 2;
+}
+</code></p>
+
+<h3 id="file-metadata">File Metadata</h3>
+
+<p>The file Metadata section contains column statistics at the stripe
+level granularity. These statistics enable input split elimination
+based on the predicate push-down evaluated per stripe.</p>
+
+<p><code>message StripeStatistics {
+ repeated ColumnStatistics colStats = 1;
+}
+</code></p>
+
+<p><code>message Metadata {
+ repeated StripeStatistics stripeStats = 1;
+}
+</code></p>
+
+<h1 id="compression">Compression</h1>
+
+<p>If the ORC file writer selects a generic compression codec (zlib or
+snappy), every part of the ORC file except for the Postscript is
+compressed with that codec. However, one of the requirements for ORC
+is that the reader be able to skip over compressed bytes without
+decompressing the entire stream. To manage this, ORC writes compressed
+streams in chunks with headers as in the figure below.
+To handle incompressible data, if the compressed data is larger than
+the original, the original is stored and the isOriginal flag is
+set. Each header is 3 bytes long with (compressedLength * 2 +
+isOriginal) stored as a little endian value. For example, the header
+for a chunk that compressed to 100,000 bytes would be [0x40, 0x0d,
+0x03]. The header for 5 bytes that did not compress would be [0x0b,
+0x00, 0x00]. Each compression chunk is compressed independently so
+that as long as a decompressor starts at the top of a header, it can
+start decompressing without the previous bytes.</p>
+
+<p><img src="/img/CompressionStream.png" alt="compression streams" /></p>
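+
+<p>The 3-byte chunk header described above can be sketched in Python as
+follows. The function names are hypothetical; the arithmetic is the
+(compressedLength * 2 + isOriginal) little endian rule from the text.</p>
+
```python
from typing import Tuple


def encode_chunk_header(length: int, is_original: bool) -> bytes:
    """Build the 3-byte little endian header for one compression chunk."""
    if not 0 <= length < (1 << 23):
        raise ValueError("chunk length does not fit in a 3-byte header")
    return (length * 2 + int(is_original)).to_bytes(3, "little")


def decode_chunk_header(header: bytes) -> Tuple[int, bool]:
    """Return (chunk length in bytes, isOriginal flag) from a 3-byte header."""
    if len(header) < 3:
        raise ValueError("a chunk header is 3 bytes long")
    value = int.from_bytes(header[:3], "little")
    return value >> 1, bool(value & 1)
```
+
+<p>With these helpers, a chunk that compressed to 100,000 bytes yields
+[0x40, 0x0d, 0x03], and 5 bytes stored in their original form yield
+[0x0b, 0x00, 0x00], matching the examples above.</p>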
+
+<p>The default compression chunk size is 256K, but writers can choose
+their own value. Larger chunks lead to better compression, but require
+more memory. The chunk size is recorded in the Postscript so that
+readers can allocate appropriately sized buffers. Readers are
+guaranteed that no chunk will expand to more than the compression chunk
+size.</p>
+
+<p>ORC files without generic compression write each stream directly
+with no headers.</p>
+
+<h1 id="run-length-encoding">Run Length Encoding</h1>
+
+<h2 id="base-128-varint">Base 128 Varint</h2>
+
+<p>Variable width integer encodings take advantage of the fact that most
+numbers are small and that having smaller encodings for small numbers
+shrinks the overall size of the data. ORC uses the varint format from
+Protocol Buffers, which writes data in little endian format using the
+low 7 bits of each byte. The high bit in each byte is set if the
+number continues into the next byte.</p>
+
+<table>
+ <thead>
+ <tr>
+ <th style="text-align: left">Unsigned Original</th>
+ <th style="text-align: left">Serialized</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td style="text-align: left">0</td>
+ <td style="text-align: left">0x00</td>
+ </tr>
+ <tr>
+ <td style="text-align: left">1</td>
+ <td style="text-align: left">0x01</td>
+ </tr>
+ <tr>
+ <td style="text-align: left">127</td>
+ <td style="text-align: left">0x7f</td>
+ </tr>
+ <tr>
+ <td style="text-align: left">128</td>
+ <td style="text-align: left">0x80, 0x01</td>
+ </tr>
+ <tr>
+ <td style="text-align: left">129</td>
+ <td style="text-align: left">0x81, 0x01</td>
+ </tr>
+ <tr>
+ <td style="text-align: left">16,383</td>
+ <td style="text-align: left">0xff, 0x7f</td>
+ </tr>
+ <tr>
+ <td style="text-align: left">16,384</td>
+ <td style="text-align: left">0x80, 0x80, 0x01</td>
+ </tr>
+ <tr>
+ <td style="text-align: left">16,385</td>
+ <td style="text-align: left">0x81, 0x80, 0x01</td>
+ </tr>
+ </tbody>
+</table>
+
+<p>For signed integer types, the number is converted into an unsigned
+number using a zigzag encoding. Zigzag encoding moves the sign bit to
+the least significant bit using the expression (val &lt;&lt; 1) ^ (val &gt;&gt;
+63) and derives its name from the fact that positive and negative
+numbers alternate once encoded. The unsigned number is then serialized
+as above.</p>
+
+<table>
+ <thead>
+ <tr>
+ <th style="text-align: left">Signed Original</th>
+ <th style="text-align: left">Unsigned</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td style="text-align: left">0</td>
+ <td style="text-align: left">0</td>
+ </tr>
+ <tr>
+ <td style="text-align: left">-1</td>
+ <td style="text-align: left">1</td>
+ </tr>
+ <tr>
+ <td style="text-align: left">1</td>
+ <td style="text-align: left">2</td>
+ </tr>
+ <tr>
+ <td style="text-align: left">-2</td>
+ <td style="text-align: left">3</td>
+ </tr>
+ <tr>
+ <td style="text-align: left">2</td>
+ <td style="text-align: left">4</td>
+ </tr>
+ </tbody>
+</table>
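+
+<p>Both tables can be reproduced with a few lines of Python. This is a
+sketch for illustration with hypothetical function names; it assumes 64-bit
+signed inputs for the zigzag step.</p>
+
```python
from typing import Tuple


def zigzag_encode(val: int) -> int:
    """Map a signed 64-bit integer to an unsigned one (sign in the low bit)."""
    return ((val << 1) ^ (val >> 63)) & 0xFFFFFFFFFFFFFFFF


def zigzag_decode(unsigned: int) -> int:
    """Invert zigzag_encode."""
    return (unsigned >> 1) ^ -(unsigned & 1)


def encode_varint(unsigned: int) -> bytes:
    """Serialize an unsigned integer as a base 128 varint."""
    out = bytearray()
    while True:
        low7 = unsigned & 0x7F
        unsigned >>= 7
        if unsigned:
            out.append(low7 | 0x80)   # high bit set: more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def decode_varint(data: bytes, pos: int = 0) -> Tuple[int, int]:
    """Return (value, position after the varint)."""
    result = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7
```
+
+<p>A signed value is written as <code>encode_varint(zigzag_encode(v))</code>
+and read back by applying the two decode functions in the opposite order.</p>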
+
+<h2 id="byte-run-length-encoding">Byte Run Length Encoding</h2>
+
+<p>For byte streams, ORC uses a very lightweight encoding of identical
+values.</p>
+
+<ul>
+ <li>Run - a sequence of at least 3 identical values</li>
+ <li>Literals - a sequence of non-identical values</li>
+</ul>
+
+<p>The first byte of each group of values is a header that determines
+whether it is a run (value between 0 and 127) or literal list (value
+between -128 and -1). For runs, the control byte is the length of the
+run minus the length of the minimal run (3) and the control byte for
+literal lists is the negative length of the list. For example, a
+hundred 0’s is encoded as [0x61, 0x00] and the sequence 0x44, 0x45
+would be encoded as [0xfe, 0x44, 0x45]. The next group can choose
+either of the encodings.</p>
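+
+<p>A decoder for this scheme fits in a few lines. The Python below is a
+sketch with a hypothetical function name; it follows the control byte rules
+given above and does no error handling for truncated input.</p>
+
```python
def decode_byte_rle(data: bytes) -> bytes:
    """Expand a byte run length encoded stream into the original bytes."""
    out = bytearray()
    pos = 0
    while pos < len(data):
        control = data[pos]
        pos += 1
        if control < 0x80:
            # Run: control byte is (run length - 3), followed by the value.
            out += bytes([data[pos]]) * (control + 3)
            pos += 1
        else:
            # Literals: control byte is the negative count as a signed byte.
            count = 0x100 - control
            out += data[pos:pos + count]
            pos += count
    return bytes(out)
```
+
+<p>Decoding [0x61, 0x00] gives one hundred zero bytes, and decoding
+[0xfe, 0x44, 0x45] gives the two literal bytes 0x44 and 0x45.</p>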
+
+<h2 id="boolean-run-length-encoding">Boolean Run Length Encoding</h2>
+
+<p>For encoding boolean types, the bits are put in the bytes from most
+significant to least significant. The bytes are encoded using byte run
+length encoding as described in the previous section. For example,
+the byte sequence [0xff, 0x80] would be one true followed by
+seven false values.</p>
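+
+<p>The bit unpacking step can be sketched as below. The helper name is
+hypothetical, and it takes bytes that have already been through byte run
+length decoding, so for the example above its input is the single byte
+0x80.</p>
+
```python
from typing import List


def unpack_booleans(packed: bytes, count: int) -> List[bool]:
    """Read `count` booleans from bytes, most significant bit first."""
    if count > len(packed) * 8:
        raise ValueError("not enough bytes for the requested value count")
    return [bool((packed[i // 8] >> (7 - i % 8)) & 1) for i in range(count)]
```
+
+<p>The value count has to come from elsewhere (for instance the number of
+rows), because the last byte may contain padding bits.</p>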
+
+<h2 id="integer-run-length-encoding-version-1">Integer Run Length Encoding, version 1</h2>
+
+<p>In Hive 0.11 ORC files used Run Length Encoding version 1 (RLEv1),
+which provides a lightweight compression of signed or unsigned integer
+sequences. RLEv1 has two sub-encodings:</p>
+
+<ul>
+ <li>Run - a sequence of values that differ by a small fixed delta</li>
+ <li>Literals - a sequence of varint encoded values</li>
+</ul>
+
+<p>Runs start with an initial byte of 0x00 to 0x7f, which encodes the
+length of the run - 3. A second byte provides the fixed delta in the
+range of -128 to 127. Finally, the first value of the run is encoded
+as a base 128 varint.</p>
+
+<p>For example, if the sequence is 100 instances of 7 the encoding would
+start with 100 - 3, followed by a delta of 0, and a varint of 7 for
+an encoding of [0x61, 0x00, 0x07]. To encode the sequence of numbers
+running from 100 to 1, the first byte is 100 - 3, the delta is -1,
+and the varint is 100 for an encoding of [0x61, 0xff, 0x64].</p>
+
+<p>Literals start with an initial byte of 0x80 to 0xff, which corresponds
+to the negative of the number of literals in the sequence. Following the
+header byte, the list of N varints is encoded. Thus, if there are
+no runs, the overhead is 1 byte for each 128 integers. The first 5
+prime numbers [2, 3, 5, 7, 11] would be encoded as [0xfb, 0x02, 0x03,
+0x05, 0x07, 0x0b].</p>
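+
+<p>The run and literal rules for RLEv1 can be combined into a small decoder.
+The Python below is a sketch with hypothetical names; it assumes the caller
+knows whether the column is signed, in which case each varint is zigzag
+decoded.</p>
+
```python
from typing import List


def decode_int_rle_v1(data: bytes, signed: bool = False) -> List[int]:
    """Decode an RLEv1 integer stream into a list of integers."""
    out = []
    pos = 0

    def read_varint() -> int:
        nonlocal pos
        result = shift = 0
        while True:
            byte = data[pos]
            pos += 1
            result |= (byte & 0x7F) << shift
            if not byte & 0x80:
                break
            shift += 7
        # Signed columns store zigzag encoded values.
        return (result >> 1) ^ -(result & 1) if signed else result

    while pos < len(data):
        control = data[pos]
        pos += 1
        if control < 0x80:
            length = control + 3                 # run length is stored minus 3
            delta = data[pos] - 256 if data[pos] >= 0x80 else data[pos]
            pos += 1
            base = read_varint()
            out.extend(base + i * delta for i in range(length))
        else:
            for _ in range(0x100 - control):     # negative literal count
                out.append(read_varint())
    return out
```
+
+<p>Decoding [0x61, 0x00, 0x07] as unsigned gives 100 copies of 7, and
+[0x61, 0xff, 0x64] gives the numbers from 100 down to 1.</p>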
+
+<h2 id="integer-run-length-encoding-version-2">Integer Run Length Encoding, version 2</h2>
+
+<p>In Hive 0.12, ORC introduced Run Length Encoding version 2 (RLEv2),
+which has improved compression and fixed bit width encodings for
+faster expansion. RLEv2 uses four sub-encodings based on the data:</p>
+
+<ul>
+ <li>Short Repeat - used for short sequences with repeated values</li>
+ <li>Direct - used for random sequences with a fixed bit width</li>
+ <li>Patched Base - used for random sequences with a variable bit width</li>
+ <li>Delta - used for monotonically increasing or decreasing sequences</li>
+</ul>
+
+<h3 id="short-repeat">Short Repeat</h3>
+
+<p>The short repeat encoding is used for short repeating integer
+sequences with the goal of minimizing the overhead of the header. All
+of the bits listed in the header are from the first byte to the last
+and from most significant bit to least significant bit. If the type is
+signed, the value is zigzag encoded.</p>
+
+<ul>
+ <li>1 byte header
+ <ul>
+ <li>2 bits for encoding type (0)</li>
+ <li>3 bits for width (W) of repeating value (1 to 8 bytes)</li>
+ <li>3 bits for repeat count (3 to 10 values)</li>
+ </ul>
+ </li>
+ <li>W bytes in big endian format, which is zigzag encoded if the type
+is signed</li>
+</ul>
+
+<p>The unsigned sequence of [10000, 10000, 10000, 10000, 10000] would be
+serialized with short repeat encoding (0), a width of 2 bytes (1), and
+repeat count of 5 (2) as [0x0a, 0x27, 0x10].</p>
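+
+<p>The header layout can be checked with a short encoder. This Python is a
+sketch with a hypothetical function name; it handles unsigned values only, so
+a signed column would need its value zigzag encoded first.</p>
+
```python
def encode_short_repeat(value: int, count: int) -> bytes:
    """Encode `count` copies of an unsigned value with RLEv2 short repeat."""
    if not 3 <= count <= 10:
        raise ValueError("short repeat covers 3 to 10 values")
    if value < 0:
        raise ValueError("zigzag encode signed values before calling")
    width = max(1, (value.bit_length() + 7) // 8)   # bytes needed, 1 to 8
    if width > 8:
        raise ValueError("value does not fit in 8 bytes")
    # 2 bits encoding type (0), 3 bits width - 1, 3 bits count - 3
    header = (0 << 6) | ((width - 1) << 3) | (count - 3)
    return bytes([header]) + value.to_bytes(width, "big")
```
+
+<p>Calling <code>encode_short_repeat(10000, 5)</code> produces the three
+bytes 0x0a, 0x27, 0x10 from the example.</p>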
+
+<h3 id="direct">Direct</h3>
+
+<p>The direct encoding is used for integer sequences whose values have a
+relatively constant bit width. It encodes the values directly using a
+fixed width big endian encoding. The width of the values is encoded
+using the table below.</p>
+
+<p>The 5 bit width encoding table for RLEv2:</p>
+
+<table>
+ <thead>
+ <tr>
+ <th style="text-align: left">Width in Bits</th>
+ <th style="text-align: left">Encoded Value</th>
+ <th style="text-align: left">Notes</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td style="text-align: left">0</td>
+ <td style="text-align: left">0</td>
+ <td style="text-align: left">for delta encoding</td>
+ </tr>
+ <tr>
+ <td style="text-align: left">1</td>
+ <td style="text-align: left">0</td>
+ <td style="text-align: left">for non-delta encoding</td>
+ </tr>
+ <tr>
+ <td style="text-align: left">2</td>
+ <td style="text-align: left">1</td>
+ <td style="text-align: left"> </td>
+ </tr>
+ <tr>
+ <td style="text-align: left">4</td>
+ <td style="text-align: left">3</td>
+ <td style="text-align: left"> </td>
+ </tr>
+ <tr>
+ <td style="text-align: left">8</td>
+ <td style="text-align: left">7</td>
+ <td style="text-align: left"> </td>
+ </tr>
+ <tr>
+ <td style="text-align: left">16</td>
+ <td style="text-align: left">15</td>
+ <td style="text-align: left"> </td>
+ </tr>
+ <tr>
+ <td style="text-align: left">24</td>
+ <td style="text-align: left">23</td>
+ <td style="text-align: left"> </td>
+ </tr>
+ <tr>
+ <td style="text-align: left">32</td>
+ <td style="text-align: left">27</td>
+ <td style="text-align: left"> </td>
+ </tr>
+ <tr>
+ <td style="text-align: left">40</td>
+ <td style="text-align: left">28</td>
+ <td style="text-align: left"> </td>
+ </tr>
+ <tr>
+ <td style="text-align: left">48</td>
+ <td style="text-align: left">29</td>
+ <td style="text-align: left"> </td>
+ </tr>
+ <tr>
+ <td style="text-align: left">56</td>
+ <td style="text-align: left">30</td>
+ <td style="text-align: left"> </td>
+ </tr>
+ <tr>
+ <td style="text-align: left">64</td>
+ <td style="text-align: left">31</td>
+ <td style="text-align: left"> </td>
+ </tr>
+ <tr>
+ <td style="text-align: left">3</td>
+ <td style="text-align: left">2</td>
+ <td style="text-align: left">deprecated</td>
+ </tr>
+ <tr>
+ <td style="text-align: left">5 <= x <= 7</td>
+ <td style="text-align: left">x - 1</td>
+ <td style="text-align: left">deprecated</td>
+ </tr>
+ <tr>
+ <td style="text-align: left">9 <= x <= 15</td>
+ <td style="text-align: left">x - 1</td>
+ <td style="text-align: left">deprecated</td>
+ </tr>
+ <tr>
+ <td style="text-align: left">17 <= x <= 21</td>
+ <td style="text-align: left">x - 1</td>
+ <td style="text-align: left">deprecated</td>
+ </tr>
+ <tr>
+ <td style="text-align: left">26</td>
+ <td style="text-align: left">24</td>
+ <td style="text-align: left">deprecated</td>
+ </tr>
+ <tr>
+ <td style="text-align: left">28</td>
+ <td style="text-align: left">25</td>
+ <td style="text-align: left">deprecated</td>
+ </tr>
+ <tr>
+ <td style="text-align: left">30</td>
+ <td style="text-align: left">26</td>
+ <td style="text-align: left">deprecated</td>
+ </tr>
+ </tbody>
+</table>
+
+<ul>
+ <li>2 bytes header
+ <ul>
+ <li>2 bits for encoding type (1)</li>
+ <li>5 bits for encoded width (W) of values (1 to 64 bits) using the 5 bit
+width encoding table</li>
+ <li>9 bits for length (L) (1 to 512 values)</li>
+ </ul>
+ </li>
+ <li>W * L bits (padded to the next byte) encoded in big endian format, which is
+zigzag encoded if the type is signed</li>
+</ul>
+
+<p>The unsigned sequence of [23713, 43806, 57005, 48879] would be
+serialized with direct encoding (1), a width of 16 bits (15), and
+length of 4 (3) as [0x5e, 0x03, 0x5c, 0xa1, 0xab, 0x1e, 0xde, 0xad,
+0xbe, 0xef].</p>
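+
+<p>A sketch of the direct encoding in Python follows. The names are
+hypothetical, the caller picks the bit width, and only the non-deprecated
+rows of the width table are supported.</p>
+
```python
from typing import List

# Width in bits -> 5 bit encoded value (non-deprecated rows of the table).
WIDTH_CODES = {1: 0, 2: 1, 4: 3, 8: 7, 16: 15, 24: 23,
               32: 27, 40: 28, 48: 29, 56: 30, 64: 31}


def pack_bits(values: List[int], width: int) -> bytes:
    """Pack values big endian at `width` bits each, padded to a whole byte."""
    acc = 0
    for value in values:
        acc = (acc << width) | value
    total = width * len(values)
    pad = -total % 8
    return (acc << pad).to_bytes((total + pad) // 8, "big")


def encode_direct(values: List[int], width: int) -> bytes:
    """Encode unsigned values with RLEv2 direct encoding."""
    if not 1 <= len(values) <= 512:
        raise ValueError("direct encoding covers 1 to 512 values")
    if any(v < 0 or v >> width for v in values):
        raise ValueError("a value does not fit in the chosen width")
    # 2 bits encoding type (1), 5 bits encoded width, 9 bits length - 1
    header = (1 << 14) | (WIDTH_CODES[width] << 9) | (len(values) - 1)
    return header.to_bytes(2, "big") + pack_bits(values, width)
```
+
+<p>Encoding [23713, 43806, 57005, 48879] at 16 bits reproduces the ten
+bytes listed in the example.</p>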
+
+<h3 id="patched-base">Patched Base</h3>
+
+<p>The patched base encoding is used for integer sequences whose bit
+widths vary a lot. The minimum signed value of the sequence is found
+and subtracted from the other values. The bit width of those adjusted
+values is analyzed and the 90th percentile of the bit width is chosen
+as W. The 10% of values larger than W use patches from a patch list
+to set the additional bits. Patches are encoded as a list of gaps in
+the index values and the additional value bits.</p>
+
+<ul>
+ <li>4 bytes header
+ <ul>
+ <li>2 bits for encoding type (2)</li>
+ <li>5 bits for encoded width (W) of values (1 to 64 bits) using the 5 bit
+ width encoding table</li>
+ <li>9 bits for length (L) (1 to 512 values)</li>
+ <li>3 bits for base value width (BW) (1 to 8 bytes)</li>
+ <li>5 bits for patch width (PW) (1 to 64 bits) using the 5 bit width
+encoding table</li>
+ <li>3 bits for patch gap width (PGW) (1 to 8 bits)</li>
+ <li>5 bits for patch list length (PLL) (0 to 31 patches)</li>
+ </ul>
+ </li>
+ <li>Base value (BW bytes) - The base value is stored as a big endian value
+with negative values marked by the most significant bit set. If that
+bit is set, the entire value is negated.</li>
+ <li>Data values (W * L bits padded to the byte) - A sequence of W bit positive
+values that are added to the base value.</li>
+ <li>Patch list (PLL * (PGW + PW) bits padded to the byte) - A list of patches for values
+that didn’t fit within W bits. Each entry in the list consists of a
+gap, which is the number of elements skipped from the previous
+patch, and a patch value. Patches are applied by logically or’ing
+the data values with the relevant patch shifted W bits left. If a
+patch is 0, it was introduced to skip over more than 255 items. The
+combined length of each patch (PGW + PW) must be less than or equal to
+64.</li>
+</ul>
+
+<p>The unsigned sequence of [2030, 2000, 2020, 1000000, 2040, 2050, 2060, 2070,
+2080, 2090, 2100, 2110, 2120, 2130, 2140, 2150, 2160, 2170, 2180, 2190]
+has a minimum of 2000, which makes the adjusted
+sequence [30, 0, 20, 998000, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140,
+150, 160, 170, 180, 190]. It has an
+encoding of patched base (2), a bit width of 8 (7), a length of 20
+(19), a base value width of 2 bytes (1), a patch width of 12 bits (11),
+patch gap width of 2 bits (1), and a patch list length of 1 (1). The
+base value is 2000 and the combined result is [0x8e, 0x13, 0x2b, 0x21, 0x07,
+0xd0, 0x1e, 0x00, 0x14, 0x70, 0x28, 0x32, 0x3c, 0x46, 0x50, 0x5a, 0x64, 0x6e,
+0x78, 0x82, 0x8c, 0x96, 0xa0, 0xaa, 0xb4, 0xbe, 0xfc, 0xe8].</p>
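+
+<p>The example can be checked with a decoder that reads the layout
+literally as described in this section. This Python is a sketch under that
+assumption, with hypothetical names; each patch entry is read as PGW gap bits
+followed by PW patch bits, and production readers may pack the patch list
+differently.</p>
+
```python
from typing import List

WIDE_CODES = {24: 26, 25: 28, 26: 30, 27: 32, 28: 40, 29: 48, 30: 56, 31: 64}


def decode_width(code: int) -> int:
    """Map a 5 bit code to a bit width using the table in this section."""
    if code <= 20 or code == 23:
        return code + 1
    return WIDE_CODES[code]          # KeyError for codes the table omits


def read_bits(data: bytes, bit_pos: int, width: int) -> int:
    """Read `width` bits, big endian, starting at absolute bit `bit_pos`."""
    value = 0
    for i in range(bit_pos, bit_pos + width):
        value = (value << 1) | ((data[i // 8] >> (7 - i % 8)) & 1)
    return value


def decode_patched_base(data: bytes) -> List[int]:
    w = decode_width((data[0] >> 1) & 0x1F)
    length = (((data[0] & 1) << 8) | data[1]) + 1
    bw = (data[2] >> 5) + 1                       # base width in bytes
    pw = decode_width(data[2] & 0x1F)             # patch width in bits
    pgw = (data[3] >> 5) + 1                      # patch gap width in bits
    pll = data[3] & 0x1F                          # patch list length
    raw = int.from_bytes(data[4:4 + bw], "big")
    sign_bit = 1 << (bw * 8 - 1)
    base = -(raw & (sign_bit - 1)) if raw & sign_bit else raw
    bit = (4 + bw) * 8
    values = [read_bits(data, bit + i * w, w) for i in range(length)]
    bit += (w * length + 7) // 8 * 8              # data is padded to the byte
    index = 0
    for _ in range(pll):
        index += read_bits(data, bit, pgw)        # gap from the previous patch
        values[index] |= read_bits(data, bit + pgw, pw) << w
        bit += pgw + pw
    return [base + v for v in values]
```
+
+<p>Run on the 28 bytes above, this returns the original 20 values, with the
+patch restoring 1000000 at index 3.</p>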
+
+<h3 id="delta">Delta</h3>
+
+<p>The Delta encoding is used for monotonically increasing or decreasing
+sequences. The first two numbers in the sequence cannot be identical,
+because the encoding uses the sign of the first delta to determine
+if the series is increasing or decreasing.</p>
+
+<ul>
+ <li>2 bytes header
+ <ul>
+ <li>2 bits for encoding type (3)</li>
+ <li>5 bits for encoded width (W) of deltas (0 to 64 bits) using the 5 bit
+width encoding table</li>
+ <li>9 bits for run length (L) (1 to 512 values)</li>
+ </ul>
+ </li>
+ <li>Base value - encoded as (signed or unsigned) varint</li>
+ <li>Delta base - encoded as signed varint</li>
+ <li>Delta values (W * (L - 2) bits padded to the byte) - encode each delta after the first
+one. If the delta base is positive, the sequence is increasing and if it is
+negative the sequence is decreasing.</li>
+</ul>
+
+<p>The unsigned sequence of [2, 3, 5, 7, 11, 13, 17, 19, 23, 29] would be
+serialized with delta encoding (3), a width of 4 bits (3), length of
+10 (9), a base of 2 (2), and first delta of 1 (2). The resulting
+sequence is [0xc6, 0x09, 0x02, 0x02, 0x22, 0x42, 0x42, 0x46].</p>
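+
+<p>A decoder for the delta layout can be sketched as follows. The names are
+hypothetical. The code treats width code 0 as a fixed delta, and it maps
+codes 21 and 22, which the width table does not list, to code + 1 as an
+assumption.</p>
+
```python
from typing import List, Tuple

WIDE_CODES = {24: 26, 25: 28, 26: 30, 27: 32, 28: 40, 29: 48, 30: 56, 31: 64}


def read_varint(data: bytes, pos: int) -> Tuple[int, int]:
    """Return (unsigned value, position after the varint)."""
    result = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def read_bits(data: bytes, bit_pos: int, width: int) -> int:
    """Read `width` bits, big endian, starting at absolute bit `bit_pos`."""
    value = 0
    for i in range(bit_pos, bit_pos + width):
        value = (value << 1) | ((data[i // 8] >> (7 - i % 8)) & 1)
    return value


def decode_delta(data: bytes, signed: bool = False) -> List[int]:
    code = (data[0] >> 1) & 0x1F
    # Code 0 means width 0 here: every delta equals the delta base.
    width = 0 if code == 0 else WIDE_CODES.get(code, code + 1)
    length = (((data[0] & 1) << 8) | data[1]) + 1
    base, pos = read_varint(data, 2)
    if signed:
        base = (base >> 1) ^ -(base & 1)          # zigzag decode
    raw, pos = read_varint(data, pos)
    delta_base = (raw >> 1) ^ -(raw & 1)          # always a signed varint
    sign = -1 if delta_base < 0 else 1
    values = [base, base + delta_base]
    for i in range(length - 2):
        step = read_bits(data, pos * 8 + i * width, width) if width else abs(delta_base)
        values.append(values[-1] + sign * step)
    return values[:length]
```
+
+<p>Decoding [0xc6, 0x09, 0x02, 0x02, 0x22, 0x42, 0x42, 0x46] as unsigned
+returns the ten primes from the example.</p>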
+
+<h1 id="stripes">Stripes</h1>
+
+<p>The body of ORC files consists of a series of stripes. Stripes are
+large (typically ~200MB) and independent of each other and are often
+processed by different tasks. The defining characteristic for columnar
+storage formats is that the data for each column is stored separately
+and that reading data out of the file should be proportional to the
+number of columns read.</p>
+
+<p>In ORC files, each column is stored in several streams that are stored
+next to each other in the file. For example, an integer column is
+represented as two streams: PRESENT, which uses one bit per
+value to record whether the value is non-null, and DATA, which records the
+non-null values. If all of a column’s values in a stripe are non-null,
+the PRESENT stream is omitted from the stripe. For binary data, ORC
+uses three streams: PRESENT, DATA, and LENGTH, which stores the length
+of each value. The details of each type will be presented in the
+following subsections.</p>
+
+<h2 id="stripe-footer">Stripe Footer</h2>
+
+<p>The stripe footer contains the encoding of each column and the
+directory of the streams including their location.</p>
+
+<p><code>message StripeFooter {
+ // the location of each stream
+ repeated Stream streams = 1;
+ // the encoding of each column
+ repeated ColumnEncoding columns = 2;
+}
+</code></p>
+
+<p>To describe each stream, ORC stores the kind of stream, the column id,
+and the stream’s size in bytes. The details of what is stored in each stream
+depend on the type and encoding of the column.</p>
+
+<p><code>message Stream {
+ enum Kind {
+ // boolean stream of whether the next value is non-null
+ PRESENT = 0;
+ // the primary data stream
+ DATA = 1;
+ // the length of each value for variable length data
+ LENGTH = 2;
+ // the dictionary blob
+ DICTIONARY_DATA = 3;
+ // deprecated prior to Hive 0.11
+ // It was used to store the number of instances of each value in the
+ // dictionary
+ DICTIONARY_COUNT = 4;
+ // a secondary data stream
+ SECONDARY = 5;
+ // the index for seeking to particular row groups
+ ROW_INDEX = 6;
+ // original bloom filters used before ORC-101
+ BLOOM_FILTER = 7;
+ // bloom filters that consistently use utf8
+ BLOOM_FILTER_UTF8 = 8;
+ }
+ required Kind kind = 1;
+ // the column id
+ optional uint32 column = 2;
+ // the number of bytes in the file
+ optional uint64 length = 3;
+}
+</code></p>
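+<p>Because a Stream entry records only a length, a reader has to derive each
+stream’s byte range itself. The Python sketch below shows one way to do
+that. It assumes, which the text above does not state, that the streams are
+laid out back to back in directory order starting at the first byte of the
+stripe. The <code>Stream</code> tuple and <code>stream_ranges</code> function
+are hypothetical stand-ins, not part of any ORC library.</p>

```python
from collections import namedtuple

# Hypothetical stand-in for a decoded Stream message.
Stream = namedtuple("Stream", ["kind", "column", "length"])


def stream_ranges(stripe_offset, streams):
    """Yield (stream, start, end) byte ranges, assuming contiguous layout."""
    position = stripe_offset
    for stream in streams:
        yield stream, position, position + stream.length
        position += stream.length


directory = [
    Stream("ROW_INDEX", 1, 40),
    Stream("PRESENT", 1, 12),
    Stream("DATA", 1, 500),
]
ranges = [(s.kind, start, end) for s, start, end in stream_ranges(3, directory)]
print(ranges)
```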
+
+<p>Depending on the column type, several encodings are possible. The
+encodings are divided into direct and dictionary-based categories and
+further refined as to whether they use RLE v1 or v2.</p>
+
+<p><code>message ColumnEncoding {
+ enum Kind {
+ // the encoding is mapped directly to the stream using RLE v1
+ DIRECT = 0;
+ // the encoding uses a dictionary of unique values using RLE v1
+ DICTIONARY = 1;
+ // the encoding is direct using RLE v2
+ DIRECT_V2 = 2;
+ // the encoding is dictionary-based using RLE v2
+ DICTIONARY_V2 = 3;
+ }
+ required Kind kind = 1;
+ // for dictionary encodings, record the size of the dictionary
+ optional uint32 dictionarySize = 2;
+}
+</code></p>
+
+<h1 id="column-encodings">Column Encodings</h1>
+
+<h2 id="smallint-int-and-bigint-columns">SmallInt, Int, and BigInt Columns</h2>
+
+<p>All of the 16, 32, and 64 bit integer column types use the same set of
+potential encodings, which differ only in whether they use RLE v1 or
+v2. If the PRESENT stream is not included, all of the values are
+present. For values whose bit in the PRESENT stream is false, no
+value is included in the DATA stream.</p>
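+<p>As a rough illustration of how a reader might recombine the two streams,
+the Python sketch below takes already decoded PRESENT bits and DATA values
+and rebuilds the column with nulls. The function name
+<code>apply_present</code> is hypothetical, and RLE decoding is assumed to
+have happened beforehand.</p>

```python
def apply_present(present, data):
    """Rebuild a column with nulls from decoded PRESENT bits and DATA values."""
    if present is None:  # stream omitted: every value is non-null
        return list(data)
    values = iter(data)
    # Only rows whose PRESENT bit is true consume a value from DATA.
    return [next(values) if bit else None for bit in present]


with_nulls = apply_present([True, False, True, True, False], [7, 8, 9])
without_present = apply_present(None, [1, 2])
print(with_nulls)
print(without_present)
```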
+
+<table>
+ <thead>
+ <tr>
+ <th style="text-align: left">Encoding</th>
+ <th style="text-align: left">Stream Kind</th>
+ <th style="text-align: left">Optional</th>
+ <th style="text-align: left">Contents</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td style="text-align: left">DIRECT</td>
+ <td style="text-align: left">PRESENT</td>
+ <td style="text-align: left">Yes</td>
+ <td style="text-align: left">Boolean RLE</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"> </td>
+ <td style="text-align: left">DATA</td>
+ <td style="text-align: left">No</td>
+ <td style="text-align: left">Signed Integer RLE v1</td>
+ </tr>
+ <tr>
+ <td style="text-align: left">DIRECT_V2</td>
+ <td style="text-align: left">PRESENT</td>
+ <td style="text-align: left">Yes</td>
+ <td style="text-align: left">Boolean RLE</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"> </td>
+ <td style="text-align: left">DATA</td>
+ <td style="text-align: left">No</td>
+ <td style="text-align: left">Signed Integer RLE v2</td>
+ </tr>
+ </tbody>
+</table>
+
+<h2 id="float-and-double-columns">Float and Double Columns</h2>
+
+<p>Floating point types are stored using IEEE 754 floating point bit
+layout. Float columns use 4 bytes per value and double columns use 8
+bytes.</p>
+
+<table>
+ <thead>
+ <tr>
+ <th style="text-align: left">Encoding</th>
+ <th style="text-align: left">Stream Kind</th>
+ <th style="text-align: left">Optional</th>
+ <th style="text-align: left">Contents</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td style="text-align: left">DIRECT</td>
+ <td style="text-align: left">PRESENT</td>
+ <td style="text-align: left">Yes</td>
+ <td style="text-align: left">Boolean RLE</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"> </td>
+ <td style="text-align: left">DATA</td>
+ <td style="text-align: left">No</td>
+ <td style="text-align: left">IEEE 754 floating point representation</td>
+ </tr>
+ </tbody>
+</table>
+
+<h2 id="string-char-and-varchar-columns">String, Char, and VarChar Columns</h2>
+
+<p>String, char, and varchar columns may be encoded either using a
+dictionary encoding or a direct encoding. A direct encoding should be
+preferred when there are many distinct values. In all of the
+encodings, the PRESENT stream encodes whether the value is null. The
+Java ORC writer automatically picks the encoding after the first row
+group (10,000 rows).</p>
+
+<p>For direct encoding the UTF-8 bytes are saved in the DATA stream and
+the length of each value is written into the LENGTH stream. In direct
+encoding, if the values were [“Nevada”, “California”], the DATA
+would be “NevadaCalifornia” and the LENGTH would be [6, 10].</p>
+
+<p>For dictionary encodings the dictionary is sorted and UTF-8 bytes of
+each unique value are placed into DICTIONARY_DATA. The length of each
+item in the dictionary is put into the LENGTH stream. The DATA stream
+consists of the sequence of references to the dictionary elements.</p>
+
+<p>In dictionary encoding, if the values were [“Nevada”,
+“California”, “Nevada”, “California”, “Florida”], the
+DICTIONARY_DATA would be “CaliforniaFloridaNevada” and LENGTH would
+be [10, 7, 6]. The DATA would be [2, 0, 2, 0, 1].</p>
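+<p>The dictionary example can be sketched in Python as follows. This is
+only an illustration of the three logical streams before RLE is applied; the
+function names are hypothetical, and sorting by the UTF-8 bytes of each value
+is an assumption about the sort order.</p>

```python
def dictionary_encode(values):
    """Return (dictionary_data, lengths, data) for a list of strings."""
    # Assumption: the dictionary is sorted by the UTF-8 bytes of each value.
    entries = sorted({v.encode("utf-8") for v in values})
    index = {entry: i for i, entry in enumerate(entries)}
    dictionary_data = b"".join(entries)
    lengths = [len(entry) for entry in entries]
    data = [index[v.encode("utf-8")] for v in values]
    return dictionary_data, lengths, data


def dictionary_decode(dictionary_data, lengths, data):
    """Invert dictionary_encode by slicing the blob with the lengths."""
    entries, offset = [], 0
    for length in lengths:
        entries.append(dictionary_data[offset:offset + length].decode("utf-8"))
        offset += length
    return [entries[i] for i in data]


states = ["Nevada", "California", "Nevada", "California", "Florida"]
blob, lengths, refs = dictionary_encode(states)
print(blob, lengths, refs)
```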
+
+<table>
+ <thead>
+ <tr>
+ <th style="text-align: left">Encoding</th>
+ <th style="text-align: left">Stream Kind</th>
+ <th style="text-align: left">Optional</th>
+ <th style="text-align: left">Contents</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td style="text-align: left">DIRECT</td>
+ <td style="text-align: left">PRESENT</td>
+ <td style="text-align: left">Yes</td>
+ <td style="text-align: left">Boolean RLE</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"> </td>
+ <td style="text-align: left">DATA</td>
+ <td style="text-align: left">No</td>
+ <td style="text-align: left">String contents</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"> </td>
+ <td style="text-align: left">LENGTH</td>
+ <td style="text-align: left">No</td>
+ <td style="text-align: left">Unsigned Integer RLE v1</td>
+ </tr>
+ <tr>
+ <td style="text-align: left">DICTIONARY</td>
+ <td style="text-align: left">PRESENT</td>
+ <td style="text-align: left">Yes</td>
+ <td style="text-align: left">Boolean RLE</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"> </td>
+ <td style="text-align: left">DATA</td>
+ <td style="text-align: left">No</td>
+ <td style="text-align: left">Unsigned Integer RLE v1</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"> </td>
+ <td style="text-align: left">DICTIONARY_DATA</td>
+ <td style="text-align: left">No</td>
+ <td style="text-align: left">String contents</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"> </td>
+ <td style="text-align: left">LENGTH</td>
+ <td style="text-align: left">No</td>
+ <td style="text-align: left">Unsigned Integer RLE v1</td>
+ </tr>
+ <tr>
+ <td style="text-align: left">DIRECT_V2</td>
+ <td style="text-align: left">PRESENT</td>
+ <td style="text-align: left">Yes</td>
+ <td style="text-align: left">Boolean RLE</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"> </td>
+ <td style="text-align: left">DATA</td>
+ <td style="text-align: left">No</td>
+ <td style="text-align: left">String contents</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"> </td>
+ <td style="text-align: left">LENGTH</td>
+ <td style="text-align: left">No</td>
+ <td style="text-align: left">Unsigned Integer RLE v2</td>
+ </tr>
+ <tr>
+ <td style="text-align: left">DICTIONARY_V2</td>
+ <td style="text-align: left">PRESENT</td>
+ <td style="text-align: left">Yes</td>
+ <td style="text-align: left">Boolean RLE</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"> </td>
+ <td style="text-align: left">DATA</td>
+ <td style="text-align: left">No</td>
+ <td style="text-align: left">Unsigned Integer RLE v2</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"> </td>
+ <td style="text-align: left">DICTIONARY_DATA</td>
+ <td style="text-align: left">No</td>
+ <td style="text-align: left">String contents</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"> </td>
+ <td style="text-align: left">LENGTH</td>
+ <td style="text-align: left">No</td>
+ <td style="text-align: left">Unsigned Integer RLE v2</td>
+ </tr>
+ </tbody>
+</table>
+
+<h2 id="boolean-columns">Boolean Columns</h2>
+
+<p>Boolean columns are rare, but have a simple encoding.</p>
+
+<table>
+ <thead>
+ <tr>
+ <th style="text-align: left">Encoding</th>
+ <th style="text-align: left">Stream Kind</th>
+ <th style="text-align: left">Optional</th>
+ <th style="text-align: left">Contents</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td style="text-align: left">DIRECT</td>
+ <td style="text-align: left">PRESENT</td>
+ <td style="text-align: left">Yes</td>
+ <td style="text-align: left">Boolean RLE</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"> </td>
+ <td style="text-align: left">DATA</td>
+ <td style="text-align: left">No</td>
+ <td style="text-align: left">Boolean RLE</td>
+ </tr>
+ </tbody>
+</table>
+
+<h2 id="tinyint-columns">TinyInt Columns</h2>
+
+<p>TinyInt (byte) columns use byte run length encoding.</p>
+
+<table>
+ <thead>
+ <tr>
+ <th style="text-align: left">Encoding</th>
+ <th style="text-align: left">Stream Kind</th>
+ <th style="text-align: left">Optional</th>
+ <th style="text-align: left">Contents</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td style="text-align: left">DIRECT</td>
+ <td style="text-align: left">PRESENT</td>
+ <td style="text-align: left">Yes</td>
+ <td style="text-align: left">Boolean RLE</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"> </td>
+ <td style="text-align: left">DATA</td>
+ <td style="text-align: left">No</td>
+ <td style="text-align: left">Byte RLE</td>
+ </tr>
+ </tbody>
+</table>
+
+<h2 id="binary-columns">Binary Columns</h2>
+
+<p>Binary data is encoded with a PRESENT stream, a DATA stream that records
+the contents, and a LENGTH stream that records the number of bytes per
+value.</p>
+
+<table>
+ <thead>
+ <tr>
+ <th style="text-align: left">Encoding</th>
+ <th style="text-align: left">Stream Kind</th>
+ <th style="text-align: left">Optional</th>
+ <th style="text-align: left">Contents</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td style="text-align: left">DIRECT</td>
+ <td style="text-align: left">PRESENT</td>
+ <td style="text-align: left">Yes</td>
+ <td style="text-align: left">Boolean RLE</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"> </td>
+ <td style="text-align: left">DATA</td>
+ <td style="text-align: left">No</td>
+ <td style="text-align: left">String contents</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"> </td>
+ <td style="text-align: left">LENGTH</td>
+ <td style="text-align: left">No</td>
+ <td style="text-align: left">Unsigned Integer RLE v1</td>
+ </tr>
+ <tr>
+ <td style="text-align: left">DIRECT_V2</td>
+ <td style="text-align: left">PRESENT</td>
+ <td style="text-align: left">Yes</td>
+ <td style="text-align: left">Boolean RLE</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"> </td>
+ <td style="text-align: left">DATA</td>
+ <td style="text-align: left">No</td>
+ <td style="text-align: left">String contents</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"> </td>
+ <td style="text-align: left">LENGTH</td>
+ <td style="text-align: left">No</td>
+ <td style="text-align: left">Unsigned Integer RLE v2</td>
+ </tr>
+ </tbody>
+</table>
+
+<h2 id="decimal-columns">Decimal Columns</h2>
+
+<p>Decimal was introduced in Hive 0.11 with infinite precision (the total
+number of digits). In Hive 0.13, the definition was changed to limit
+the precision to a maximum of 38 digits, which conveniently uses 127
+bits plus a sign bit. The current encoding of decimal columns stores
+the integer representation of the value as an unbounded length zigzag
+encoded base 128 varint. The scale is stored in the SECONDARY stream
+as a signed integer.</p>
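+<p>A minimal Python sketch of splitting a decimal into the two logical
+values described above: the zigzag encoded varint of the unscaled integer
+(for the DATA stream) and the scale (for the SECONDARY stream, before RLE).
+The function names are hypothetical and only finite decimals are
+handled.</p>

```python
from decimal import Decimal


def zigzag(n):
    """Map a signed integer of any size to a non-negative one."""
    return (n << 1) if n >= 0 else ((-n) << 1) - 1


def varint(n):
    """Encode a non-negative integer as a base 128 varint, low 7 bits first."""
    assert n >= 0
    out = bytearray()
    while True:
        low, n = n & 0x7F, n >> 7
        if n:
            out.append(low | 0x80)  # more bytes follow
        else:
            out.append(low)
            return bytes(out)


def encode_decimal(value):
    """Return (data_bytes, scale) for a finite Decimal."""
    sign, digits, exponent = value.as_tuple()
    unscaled = int("".join(str(d) for d in digits))
    if sign:
        unscaled = -unscaled
    scale = -exponent  # number of digits after the decimal point
    return varint(zigzag(unscaled)), scale


data_bytes, scale = encode_decimal(Decimal("123.45"))
print(data_bytes.hex(), scale)  # f2c001 2
```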
+
+<table>
+ <thead>
+ <tr>
+ <th style="text-align: left">Encoding</th>
+ <th style="text-align: left">Stream Kind</th>
+ <th style="text-align: left">Optional</th>
+ <th style="text-align: left">Contents</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td style="text-align: left">DIRECT</td>
+ <td style="text-align: left">PRESENT</td>
+ <td style="text-align: left">Yes</td>
+ <td style="text-align: left">Boolean RLE</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"> </td>
+ <td style="text-align: left">DATA</td>
+ <td style="text-align: left">No</td>
+ <td style="text-align: left">Unbounded base 128 varints</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"> </td>
+ <td style="text-align: left">SECONDARY</td>
+ <td style="text-align: left">No</td>
+ <td style="text-align: left">Signed Integer RLE v1</td>
+ </tr>
+ <tr>
+ <td style="text-align: left">DIRECT_V2</td>
+ <td style="text-align: left">PRESENT</td>
+ <td style="text-align: left">Yes</td>
+ <td style="text-align: left">Boolean RLE</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"> </td>
+ <td style="text-align: left">DATA</td>
+ <td style="text-align: left">No</td>
+ <td style="text-align: left">Unbounded base 128 varints</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"> </td>
+ <td style="text-align: left">SECONDARY</td>
+ <td style="text-align: left">No</td>
+ <td style="text-align: left">Signed Integer RLE v2</td>
+ </tr>
+ </tbody>
+</table>
+
+<h2 id="date-columns">Date Columns</h2>
+
+<p>Date data is encoded with a PRESENT stream and a DATA stream that records
+the number of days after January 1, 1970 in UTC.</p>
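+<p>For illustration, the day count stored in the DATA stream can be computed
+in Python as below. The helper names are hypothetical; dates before the
+epoch give negative values, which is consistent with the signed RLE used for
+this stream.</p>

```python
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)


def date_to_days(d):
    """Number of days from January 1, 1970 to d (negative before the epoch)."""
    return (d - EPOCH).days


def days_to_date(days):
    """Inverse of date_to_days."""
    return EPOCH + timedelta(days=days)


print(date_to_days(date(2015, 1, 1)))  # 16436
print(days_to_date(-1))                # 1969-12-31
```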
+
+<table>
+ <thead>
+ <tr>
+ <th style="text-align: left">Encoding</th>
+ <th style="text-align: left">Stream Kind</th>
+ <th style="text-align: left">Optional</th>
+ <th style="text-align: left">Contents</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td style="text-align: left">DIRECT</td>
+ <td style="text-align: left">PRESENT</td>
+ <td style="text-align: left">Yes</td>
+ <td style="text-align: left">Boolean RLE</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"> </td>
+ <td style="text-align: left">DATA</td>
+ <td style="text-align: left">No</td>
+ <td style="text-align: left">Signed Integer RLE v1</td>
+ </tr>
+ <tr>
+ <td style="text-align: left">DIRECT_V2</td>
+ <td style="text-align: left">PRESENT</td>
+ <td style="text-align: left">Yes</td>
+ <td style="text-align: left">Boolean RLE</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"> </td>
+ <td style="text-align: left">DATA</td>
+ <td style="text-align: left">No</td>
+ <td style="text-align: left">Signed Integer RLE v2</td>
+ </tr>
+ </tbody>
+</table>
+
+<h2 id="timestamp-columns">Timestamp Columns</h2>
+
+<p>Timestamp records times down to nanoseconds as a PRESENT stream that
+records non-null values, a DATA stream that records the number of
+seconds after 1 January 2015, and a SECONDARY stream that records the
+number of nanoseconds.</p>
+
+<p>Because the number of nanoseconds often has a large number of trailing
+zeros, the number has trailing decimal zero digits removed when there are
+at least two of them, and the last three bits record one less than the
+number of zeros removed. Thus 1000 nanoseconds would be serialized as
+0x0a and 100000 would be serialized as 0x0c.</p>
+
+<table>
+ <thead>
+ <tr>
+ <th style="text-align: left">Encoding</th>
+ <th style="text-align: left">Stream Kind</th>
+ <th style="text-align: left">Optional</th>
+ <th style="text-align: left">Contents</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td style="text-align: left">DIRECT</td>
+ <td style="text-align: left">PRESENT</td>
+ <td style="text-align: left">Yes</td>
+ <td style="text-align: left">Boolean RLE</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"> </td>
+ <td style="text-align: left">DATA</td>
+ <td style="text-align: left">No</td>
+ <td style="text-align: left">Signed Integer RLE v1</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"> </td>
+ <td style="text-align: left">SECONDARY</td>
+ <td style="text-align: left">No</td>
+ <td style="text-align: left">Unsigned Integer RLE v1</td>
+ </tr>
+ <tr>
+ <td style="text-align: left">DIRECT_V2</td>
+ <td style="text-align: left">PRESENT</td>
+ <td style="text-align: left">Yes</td>
+ <td style="text-align: left">Boolean RLE</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"> </td>
+ <td style="text-align: left">DATA</td>
+ <td style="text-align: left">No</td>
+ <td style="text-align: left">Signed Integer RLE v2</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"> </td>
+ <td style="text-align: left">SECONDARY</td>
+ <td style="text-align: left">No</td>
+ <td style="text-align: left">Unsigned Integer RLE v2</td>
+ </tr>
+ </tbody>
+</table>
+
+<h2 id="struct-columns">Struct Columns</h2>
+
+<p>Structs have no data themselves and delegate everything to their child
+columns except for their PRESENT stream. They have a child column
+for each of the fields.</p>
+
+<table>
+ <thead>
+ <tr>
+ <th style="text-align: left">Encoding</th>
+ <th style="text-align: left">Stream Kind</th>
+ <th style="text-align: left">Optional</th>
+ <th style="text-align: left">Contents</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td style="text-align: left">DIRECT</td>
+ <td style="text-align: left">PRESENT</td>
+ <td style="text-align: left">Yes</td>
+ <td style="text-align: left">Boolean RLE</td>
+ </tr>
+ </tbody>
+</table>
+
+<h2 id="list-columns">List Columns</h2>
+
+<p>Lists are encoded as the PRESENT stream and a LENGTH stream with the
+number of items in each list. They have a single child column for the
+element values.</p>
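+<p>The relationship between the LENGTH stream and the child column can be
+sketched in Python as follows. The function name is hypothetical, and the
+sketch assumes that null lists contribute no entry to LENGTH, by analogy
+with how null values are omitted from DATA for integer columns.</p>

```python
from itertools import islice


def rebuild_lists(present, lengths, child_values):
    """Rebuild list rows from decoded PRESENT bits, LENGTH values and children."""
    lengths = iter(lengths)
    child = iter(child_values)
    rows = []
    for bit in present:
        if not bit:
            rows.append(None)  # null list: assumed to have no LENGTH entry
            continue
        count = next(lengths)
        rows.append(list(islice(child, count)))  # take the next count elements
    return rows


rows = rebuild_lists([True, False, True, True], [2, 0, 3], list("abcde"))
print(rows)
```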
+
+<table>
+ <thead>
+ <tr>
+ <th style="text-align: left">Encoding</th>
+ <th style="text-align: left">Stream Kind</th>
+ <th style="text-align: left">Optional</th>
+ <th style="text-align: left">Contents</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td style="text-align: left">DIRECT</td>
+ <td style="text-align: left">PRESENT</td>
+ <td style="text-align: left">Yes</td>
+ <td style="text-align: left">Boolean RLE</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"> </td>
+ <td style="text-align: left">LENGTH</td>
+ <td style="text-align: left">No</td>
+ <td style="text-align: left">Unsigned Integer RLE v1</td>
+ </tr>
+ <tr>
+ <td style="text-align: left">DIRECT_V2</td>
+ <td style="text-align: left">PRESENT</td>
+ <td style="text-align: left">Yes</td>
+ <td style="text-align: left">Boolean RLE</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"> </td>
+ <td style="text-align: left">LENGTH</td>
+ <td style="text-align: left">No</td>
+ <td style="text-align: left">Unsigned Integer RLE v2</td>
+ </tr>
+ </tbody>
+</table>
+
+<h2 id="map-columns">Map Columns</h2>
+
+<p>Maps are encoded as the PRESENT stream and a LENGTH stream with the
+number of items in each map. They have a child column for the key and
+another child column for the value.</p>
+
+<table>
+ <thead>
+ <tr>
+ <th style="text-align: left">Encoding</th>
+ <th style="text-align: left">Stream Kind</th>
+ <th style="text-align: left">Optional</th>
+ <th style="text-align: left">Contents</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td style="text-align: left">DIRECT</td>
+ <td style="text-align: left">PRESENT</td>
+ <td style="text-align: left">Yes</td>
+ <td style="text-align: left">Boolean RLE</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"> </td>
+ <td style="text-align: left">LENGTH</td>
+ <td style="text-align: left">No</td>
+ <td style="text-align: left">Unsigned Integer RLE v1</td>
+ </tr>
+ <tr>
+ <td style="text-align: left">DIRECT_V2</td>
+ <td style="text-align: left">PRESENT</td>
+ <td style="text-align: left">Yes</td>
+ <td style="text-align: left">Boolean RLE</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"> </td>
+ <td style="text-align: left">LENGTH</td>
+ <td style="text-align: left">No</td>
+ <td style="text-align: left">Unsigned Integer RLE v2</td>
+ </tr>
+ </tbody>
+</table>
+
+<h2 id="union-columns">Union Columns</h2>
+
+<p>Unions are encoded as the PRESENT stream and a tag stream that controls which
+potential variant is used. They have a child column for each variant of the
+union. Currently ORC union types are limited to 256 variants, which matches
+the Hive type model.</p>
+
+<table>
+ <thead>
+ <tr>
+ <th style="text-align: left">Encoding</th>
+ <th style="text-align: left">Stream Kind</th>
+ <th style="text-align: left">Optional</th>
+ <th style="text-align: left">Contents</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td style="text-align: left">DIRECT</td>
+ <td style="text-align: left">PRESENT</td>
+ <td style="text-align: left">Yes</td>
+ <td style="text-align: left">Boolean RLE</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"> </td>
+ <td style="text-align: left">DATA</td>
+ <td style="text-align: left">No</td>
+ <td style="text-align: left">Byte RLE</td>
+ </tr>
+ </tbody>
+</table>
+
+<h1 id="indexes">Indexes</h1>
+
+<h2 id="row-group-index">Row Group Index</h2>
+
+<p>The row group indexes consist of a ROW_INDEX stream for each primitive
+column that has an entry for each row group. Row groups are controlled
+by the writer and default to 10,000 rows. Each RowIndexEntry gives the
+position of each stream for the column and the statistics for that row
+group.</p>
+
+<p>The index streams are placed at the front of the stripe, because in
+the default case of streaming they do not need to be read. They are
+only loaded when either predicate push down is being used or the
+reader seeks to a particular row.</p>
+
+<p><code>message RowIndexEntry {
+ repeated uint64 positions = 1 [packed=true];
+ optional ColumnStatistics statistics = 2;
+}
+</code></p>
+
+<p><code>message RowIndex {
+ repeated RowIndexEntry entry = 1;
+}
+</code></p>
+
+<p>To record positions, each stream needs a sequence of numbers. For
+uncompressed streams, the position is the byte offset of the RLE run’s
+start location followed by the number of values that need to be
+consumed from the run. In compressed streams, the first number is the
+start of the compression chunk in the stream, followed by the number
+of decompressed bytes that need to be consumed, and finally the number
+of values consumed in the RLE.</p>
+
+<p>For columns with multiple streams, the sequences of positions in each
+stream are concatenated. That was an unfortunate decision on my part
+that we should fix at some point, because it makes code that uses the
+indexes error-prone.</p>
+
+<p>Because dictionaries are accessed randomly, there is not a position to
+record for the dictionary and the entire dictionary must be read even
+if only part of a stripe is being read.</p>
+
+<h2 id="bloom-filter-index">Bloom Filter Index</h2>
+
+<p>Bloom filters were added to ORC indexes in Hive 1.2.0.
+Predicate pushdown can make use of bloom filters to better prune
+the row groups that do not satisfy the filter condition.
+The bloom filter indexes consist of a BLOOM_FILTER stream for each
+column specified through the ‘orc.bloom.filter.columns’ table property.
+A BLOOM_FILTER stream records a bloom filter entry for each row
+group (10,000 rows by default) in a column. Only the row groups that
+satisfy min/max row index evaluation will be evaluated against the
+bloom filter index.</p>
+
+<p>Each BloomFilter message stores the number of hash functions (‘k’) used
+and the bitset backing the bloom filter. The original encoding (pre
+ORC-101) of bloom filters stored the bitset as a repeated sequence of
+longs in the bitset field with a little endian encoding
+(0x1 is bit 0 and 0x2 is bit 1). After ORC-101, the encoding is a
+sequence of bytes with a little endian encoding in the utf8bitset field.</p>
+
+<p><code>message BloomFilter {
+ optional uint32 numHashFunctions = 1;
+ repeated fixed64 bitset = 2;
+ optional bytes utf8bitset = 3;
+}
+</code></p>
+
+<p><code>message BloomFilterIndex {
+ repeated BloomFilter bloomFilter = 1;
+}
+</code></p>
+
+<p>The bloom filter internally uses two different hash functions to map a key
+to a position in the bit set. For tinyint, smallint, int, bigint, float
+and double types, Thomas Wang’s 64-bit integer hash function is used.
+Floats are converted to their IEEE-754 32 bit representation
+(using Java’s Float.floatToIntBits(float)). Similarly, doubles are
+converted to their IEEE-754 64 bit representation (using Java’s
+Double.doubleToLongBits(double)). All these primitive types
+are cast to the long base type before being passed on to the hash function.
+For strings and binary types, the Murmur3 64 bit hash algorithm is used.
+The 64 bit variant of Murmur3 considers only the most significant
+8 bytes of the Murmur3 128-bit algorithm. The 64 bit hashcode generated
+from the above algorithms is used as a base to derive ‘k’ different
+hash functions. We use the idea mentioned in the paper “Less Hashing,
+Same Performance: Building a Better Bloom Filter” by Kirsch et al. to
+quickly compute the k hashcodes.</p>
+
+<p>The algorithm for computing k hashcodes and setting the bit position
+in a bloom filter is as follows:</p>
+
+<ol>
+ <li>Get 64 bit base hash code from Murmur3 or Thomas Wang’s hash algorithm.</li>
+ <li>Split the above hashcode into two 32-bit hashcodes (say hash1 and hash2).</li>
+ <li>k’th hashcode is obtained by (where k > 0):
+ <ul>
+ <li>combinedHash = hash1 + (k * hash2)</li>
+ </ul>
+ </li>
+ <li>If combinedHash is negative flip all the bits:
+ <ul>
+ <li>combinedHash = ~combinedHash</li>
+ </ul>
+ </li>
+ <li>Bit set position is obtained by performing modulo with m:
+ <ul>
+ <li>position = combinedHash % m</li>
+ </ul>
+ </li>
+ <li>Set the position in bit set. The bits of the position above the
+lowest 6 select the long within the bitset, and the lowest 6 bits select
+the bit within that long, using little endian order.
+ <ul>
+ <li>bitset[position >>> 6] |= (1L << position);</li>
+ </ul>
+ </li>
+</ol>
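+<p>The steps above can be sketched in Python, which needs explicit wrapping
+to imitate 32-bit signed arithmetic. The sketch takes the 64 bit base hash
+as input rather than computing Murmur3 or Thomas Wang’s hash. The function
+names are hypothetical, and treating hash1 as the low half and hash2 as the
+high half of the base hash is an assumption, since the split is not
+specified above.</p>

```python
def to_int32(x):
    """Wrap a Python int to a signed 32-bit value, like a Java int."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x


def bloom_positions(hash64, k, m):
    """Return the k bit positions for one 64-bit base hash in an m-bit filter."""
    # Assumption: hash1 is the low half and hash2 the high half of the hash.
    hash1 = to_int32(hash64)
    hash2 = to_int32(hash64 >> 32)
    positions = []
    for i in range(1, k + 1):
        combined = to_int32(hash1 + i * hash2)
        if combined < 0:
            combined = ~combined  # flip all bits; the result is non-negative
        positions.append(combined % m)
    return positions


def set_bits(bitset, positions):
    """Set each position in a list of 64-bit words (bit 0 is 0x1)."""
    for position in positions:
        bitset[position >> 6] |= 1 << (position & 63)


m = 128
bitset = [0] * (m // 64)
positions = bloom_positions(0x0000000200000001, k=3, m=m)
set_bits(bitset, positions)
print(positions, [hex(word) for word in bitset])  # [3, 5, 7] ['0xa8', '0x0']
```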
+
+<p>Bloom filter streams are interlaced with row group indexes. This placement
+makes it convenient to read the bloom filter stream and row index stream
+together in a single read operation.</p>
+
+<p><img src="/img/BloomFilter.png" alt="bloom filter" /></p>
+
+ </article>
+ </div>
+
+ <div class="clear"></div>
+
+ </div>
+</section>
+
+
+ <footer role="contentinfo">
+ <p>The contents of this website are © 2018
+ <a href="https://www.apache.org/">Apache Software Foundation</a>
+ under the terms of the <a
+ href="https://www.apache.org/licenses/LICENSE-2.0.html">
+ Apache License v2</a>. Apache ORC and its logo are trademarks
+ of the Apache Software Foundation.</p>
+</footer>
+
+ <script>
+ var anchorForId = function (id) {
+ var anchor = document.createElement("a");
+ anchor.className = "header-link";
+ anchor.href = "#" + id;
+ anchor.innerHTML = "<span class=\"sr-only\">Permalink</span><i class=\"fa fa-link\"></i>";
+ anchor.title = "Permalink";
+ return anchor;
+ };
+
+ var linkifyAnchors = function (level, containingElement) {
+ var headers = containingElement.getElementsByTagName("h" + level);
+ for (var h = 0; h < headers.length; h++) {
+ var header = headers[h];
+
+ if (typeof header.id !== "undefined" && header.id !== "") {
+ header.appendChild(anchorForId(header.id));
+ }
+ }
+ };
+
+ document.onreadystatechange = function () {
+ if (this.readyState === "complete") {
+ var contentBlock = document.getElementsByClassName("docs")[0] || document.getElementsByClassName("news")[0];
+ if (!contentBlock) {
+ return;
+ }
+ for (var level = 1; level <= 6; level++) {
+ linkifyAnchors(level, contentBlock);
+ }
+ }
+ };
+</script>
+
+
+</body>
+</html>
[4/9] orc git commit: Pushing ORC-339 reorganize the ORC file format
spec.
Posted by om...@apache.org.
http://git-wip-us.apache.org/repos/asf/orc/blob/c6e29090/docs/spec-index.html
----------------------------------------------------------------------
diff --git a/docs/spec-index.html b/docs/spec-index.html
deleted file mode 100644
index 25ba64d..0000000
--- a/docs/spec-index.html
+++ /dev/null
@@ -1,2298 +0,0 @@
-<!DOCTYPE HTML>
-<html lang="en-US">
-<head>
- <meta charset="UTF-8">
- <title>Indexes</title>
- <meta name="viewport" content="width=device-width,initial-scale=1">
- <meta name="generator" content="Jekyll v2.4.0">
- <link rel="stylesheet" href="//fonts.googleapis.com/css?family=Lato:300,300italic,400,400italic,700,700italic,900">
- <link rel="stylesheet" href="/css/screen.css">
- <link rel="icon" type="image/x-icon" href="/favicon.ico">
- <!--[if lt IE 9]>
- <script src="/js/html5shiv.min.js"></script>
- <script src="/js/respond.min.js"></script>
- <![endif]-->
-</head>
-
-
-<body class="wrap">
- <header role="banner">
- <nav class="mobile-nav show-on-mobiles">
- <ul>
- <li class="">
- <a href="/">Home</a>
- </li>
- <li class="current">
- <a href="/docs/"><span class="show-on-mobiles">Docs</span>
- <span class="hide-on-mobiles">Documentation</span></a>
- </li>
- <li class="">
- <a href="/talks/">Talks</a>
- </li>
- <li class="">
- <a href="/news/">News</a>
- </li>
- <li class="">
- <a href="/help/">Help</a>
- </li>
- <li class="">
- <a href="/develop/">Develop</a>
- </li>
-</ul>
-
- </nav>
- <div class="grid">
- <div class="unit one-third center-on-mobiles">
- <h1>
- <a href="/">
- <span class="sr-only">Apache ORC</span>
- <img src="/img/logo.png" width="249" height="101" alt="ORC Logo">
- </a>
- </h1>
- </div>
- <nav class="main-nav unit two-thirds hide-on-mobiles">
- <ul>
- <li class="">
- <a href="/">Home</a>
- </li>
- <li class="current">
- <a href="/docs/"><span class="show-on-mobiles">Docs</span>
- <span class="hide-on-mobiles">Documentation</span></a>
- </li>
- <li class="">
- <a href="/talks/">Talks</a>
- </li>
- <li class="">
- <a href="/news/">News</a>
- </li>
- <li class="">
- <a href="/help/">Help</a>
- </li>
- <li class="">
- <a href="/develop/">Develop</a>
- </li>
-</ul>
-
- </nav>
- </div>
-</header>
-
-
- <section class="docs">
- <div class="grid">
-
- <div class="docs-nav-mobile unit whole show-on-mobiles">
- <select onchange="if (this.value) window.location.href=this.value">
- <option value="">Navigate the docs…</option>
-
- <optgroup label="Overview">
- <option value="/docs/index.html">Background</option>
- <option value="/docs/adopters.html">ORC Adopters</option>
- <option value="/docs/types.html">Types</option>
- <option value="/docs/indexes.html">Indexes</option>
- <option value="/docs/acid.html">ACID support</option>
- </optgroup>
-
- <optgroup label="Installing">
- <option value="/docs/building.html">Building ORC</option>
- <option value="/docs/releases.html">Releases</option>
- </optgroup>
-
- <optgroup label="Using in Hive">
- <option value="/docs/hive-ddl.html">Hive DDL</option>
- <option value="/docs/hive-config.html">Hive Configuration</option>
- </optgroup>
-
- <optgroup label="Using in MapReduce">
- <option value="/docs/mapred.html">Using in MapRed</option>
- <option value="/docs/mapreduce.html">Using in MapReduce</option>
- </optgroup>
-
- <optgroup label="Using ORC Core">
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/core-java.html">Using Core Java</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/core-cpp.html">Using Core C++</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- </optgroup>
-
- <optgroup label="Tools">
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/cpp-tools.html">C++ Tools</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/java-tools.html">Java Tools</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- </optgroup>
-
- <optgroup label="Format Specification">
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/spec-intro.html">Introduction</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/file-tail.html">File Tail</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/compression.html">Compression</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/run-length.html">Run Length Encoding</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/stripes.html">Stripes</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/encodings.html">Column Encodings</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/spec-index.html">Indexes</option>
-
-
-
-
-
-
-
-
-
-
- </optgroup>
-
- </select>
-</div>
-
-
- <div class="unit four-fifths">
- <article>
- <h1>Indexes</h1>
- <h1 id="row-group-index">Row Group Index</h1>
-
-<p>The row group indexes consist of a ROW_INDEX stream for each primitive
-column that has an entry for each row group. Row groups are controlled
-by the writer and default to 10,000 rows. Each RowIndexEntry gives the
-position of each stream for the column and the statistics for that row
-group.</p>
-
-<p>The index streams are placed at the front of the stripe, because in
-the default case of streaming they do not need to be read. They are
-only loaded when either predicate push down is being used or the
-reader seeks to a particular row.</p>
-
-<p><code>message RowIndexEntry {
- repeated uint64 positions = 1 [packed=true];
- optional ColumnStatistics statistics = 2;
-}
-</code></p>
-
-<p><code>message RowIndex {
- repeated RowIndexEntry entry = 1;
-}
-</code></p>
-
-<p>To record positions, each stream needs a sequence of numbers. For
-uncompressed streams, the position is the byte offset of the RLE run’s
-start location followed by the number of values that need to be
-consumed from the run. In compressed streams, the first number is the
-start of the compression chunk in the stream, followed by the number
-of decompressed bytes that need to be consumed, and finally the number
-of values consumed in the RLE.</p>
-
-<p>For columns with multiple streams, the sequences of positions in each
-stream are concatenated. That was an unfortunate decision on my part
-that we should fix at some point, because it makes code that uses the
-indexes error-prone.</p>
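-<p>Because the positions of all streams share one flat list, a reader has to
-slice that list per stream. The following is a minimal sketch, assuming every
-stream of the column is a plain RLE stream that contributes two numbers when
-uncompressed and three when compressed, as described above. The class and
-method names (PositionSplitter, split) are hypothetical and are not part of
-the ORC code base; real stream kinds may record a different number of
-positions.</p>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class PositionSplitter {
    // Assumption: each stream is a plain RLE stream.
    // Uncompressed: byte offset, values consumed in the run           -> 2 numbers.
    // Compressed:   chunk start, decompressed bytes, values consumed  -> 3 numbers.
    static int positionsPerStream(boolean compressed) {
        return compressed ? 3 : 2;
    }

    // Slice the concatenated positions of one RowIndexEntry into one array per stream.
    static List<long[]> split(long[] positions, int streamCount, boolean compressed) {
        int width = positionsPerStream(compressed);
        if (positions.length != width * streamCount) {
            throw new IllegalArgumentException(
                "expected " + (width * streamCount) + " positions, got " + positions.length);
        }
        List<long[]> perStream = new ArrayList<>();
        for (int s = 0; s < streamCount; s++) {
            perStream.add(Arrays.copyOfRange(positions, s * width, (s + 1) * width));
        }
        return perStream;
    }

    public static void main(String[] args) {
        // Two compressed streams: (chunk start, decompressed bytes, values consumed) each.
        List<long[]> parts = split(new long[] {100, 20, 5, 300, 0, 7}, 2, true);
        for (long[] part : parts) {
            System.out.println(Arrays.toString(part));
        }
    }
}
```

-<p>The length check is the useful part of the sketch: a mismatch between the
-expected and actual number of positions is the kind of error that the
-concatenated layout makes easy to introduce.</p>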
-
-<p>Because dictionaries are accessed randomly, there is not a position to
-record for the dictionary and the entire dictionary must be read even
-if only part of a stripe is being read.</p>
-
-<h1 id="bloom-filter-index">Bloom Filter Index</h1>
-
-<p>Bloom filters have been part of ORC indexes since Hive 1.2.0.
-Predicate pushdown can make use of bloom filters to better prune
-the row groups that do not satisfy the filter condition.
-The bloom filter indexes consist of a BLOOM_FILTER stream for each
-column specified through the ‘orc.bloom.filter.columns’ table property.
-A BLOOM_FILTER stream records a bloom filter entry for each row
-group (10,000 rows by default) in a column. Only the row groups that
-satisfy min/max row index evaluation will be evaluated against the
-bloom filter index.</p>
-
-<p>Each BloomFilter entry stores the number of hash functions (‘k’) used
-and the bitset backing the bloom filter. The original encoding (pre
-ORC-101) of bloom filters stored the bitset as a repeated sequence of
-longs in the bitset field with a little endian encoding
-(0x1 is bit 0 and 0x2 is bit 1). After ORC-101, the encoding is a
-sequence of bytes with a little endian encoding in the utf8bitset field.</p>
-
-<p><code>message BloomFilter {
- optional uint32 numHashFunctions = 1;
- repeated fixed64 bitset = 2;
- optional bytes utf8bitset = 3;
-}
-</code></p>
-
-<p><code>message BloomFilterIndex {
- repeated BloomFilter bloomFilter = 1;
-}
-</code></p>
-
-<p>The bloom filter internally uses two different hash functions to map a key
-to a position in the bit set. For tinyint, smallint, int, bigint, float
-and double types, Thomas Wang’s 64-bit integer hash function is used.
-Floats are converted to their IEEE-754 32 bit representation
-(using Java’s Float.floatToIntBits(float)). Similarly, doubles are
-converted to their IEEE-754 64 bit representation (using Java’s
-Double.doubleToLongBits(double)). All these primitive types
-are cast to the long base type before being passed on to the hash function.
-For string and binary types, the Murmur3 64 bit hash algorithm is used.
-The 64 bit variant of Murmur3 considers only the most significant
-8 bytes of the Murmur3 128-bit algorithm. The 64 bit hashcode generated
-from the above algorithms is used as a base to derive ‘k’ different
-hash functions. We use the idea mentioned in the paper “Less Hashing,
-Same Performance: Building a Better Bloom Filter” by Kirsch et al. to
-quickly compute the k hashcodes.</p>
-
-<p>The algorithm for computing k hashcodes and setting the bit position
-in a bloom filter is as follows:</p>
-
-<ol>
- <li>Get 64 bit base hash code from Murmur3 or Thomas Wang’s hash algorithm.</li>
- <li>Split the above hashcode into two 32-bit hashcodes (say hash1 and hash2).</li>
- <li>k’th hashcode is obtained by (where k > 0):
- <ul>
- <li>combinedHash = hash1 + (k * hash2)</li>
- </ul>
- </li>
- <li>If combinedHash is negative flip all the bits:
- <ul>
- <li>combinedHash = ~combinedHash</li>
- </ul>
- </li>
- <li>Bit set position is obtained by performing modulo with m:
- <ul>
- <li>position = combinedHash % m</li>
- </ul>
- </li>
- <li>Set the position in the bit set. The bits above the lowest 6 select the long
-within the bitset, the lowest 6 bits select the bit within that long, and bit
-positions within the long use little endian order.
- <ul>
- <li>bitset[position >>> 6] |= (1L << position);</li>
- </ul>
- </li>
-</ol>
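-<p>Steps 2 through 6 above can be sketched as follows. This is a minimal
-illustration rather than the ORC implementation: the class name
-BloomFilterSketch and its methods are hypothetical, the 64 bit base hash from
-step 1 is passed in by the caller, and taking hash1 from the low 32 bits and
-hash2 from the high 32 bits is an assumption, since the text does not say
-which half is which.</p>

```java
final class BloomFilterSketch {
    private final long[] bitset;
    private final int m; // number of bits in the filter
    private final int k; // number of hash functions

    BloomFilterSketch(int requestedBits, int k) {
        // Round the bit count up to a whole number of 64-bit longs.
        int words = Math.max(1, (requestedBits + 63) / 64);
        this.bitset = new long[words];
        this.m = words * 64;
        this.k = k;
    }

    // Steps 2-5: derive the i'th bit position (1 <= i <= k) from the 64-bit base hash.
    private int position(long hash64, int i) {
        int hash1 = (int) hash64;          // assumption: low 32 bits
        int hash2 = (int) (hash64 >>> 32); // assumption: high 32 bits
        int combinedHash = hash1 + (i * hash2);
        if (combinedHash < 0) {
            combinedHash = ~combinedHash;  // flip all bits so the value is non-negative
        }
        return combinedHash % m;
    }

    // Step 6: set the bit for each of the k derived positions.
    void addHash(long hash64) {
        for (int i = 1; i <= k; i++) {
            int pos = position(hash64, i);
            bitset[pos >>> 6] |= (1L << pos); // Java uses only the low 6 bits of the shift count
        }
    }

    // A key may be present only if all k bits are set.
    boolean testHash(long hash64) {
        for (int i = 1; i <= k; i++) {
            int pos = position(hash64, i);
            if ((bitset[pos >>> 6] & (1L << pos)) == 0) {
                return false;
            }
        }
        return true;
    }

    // Exposes one backing long, for inspection only.
    long word(int index) {
        return bitset[index];
    }

    public static void main(String[] args) {
        BloomFilterSketch filter = new BloomFilterSketch(128, 3);
        filter.addHash(0x123456789ABCDEF0L);
        System.out.println(filter.testHash(0x123456789ABCDEF0L));
    }
}
```

-<p>A reader would use testHash to skip a row group: a false result means the
-key is definitely absent from that row group, while a true result only means
-it may be present.</p>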
-
-<p>Bloom filter streams are interlaced with row group indexes. This placement
-makes it convenient to read the bloom filter stream and row index stream
-together in a single read operation.</p>
-
-<p><img src="/img/BloomFilter.png" alt="bloom filter" /></p>
-
- <div class="section-nav">
- <div class="left align-right">
-
-
-
- <a href="/docs/encodings.html" class="prev">Back</a>
-
- </div>
- <div class="right align-left">
-
- <span class="next disabled">Next</span>
-
- </div>
- </div>
- <div class="clear"></div>
-
-
- </article>
- </div>
-
-
- <div class="clear"></div>
-
- </div>
- </section>
-
-
- <footer role="contentinfo">
- <p>The contents of this website are © 2018
- <a href="https://www.apache.org/">Apache Software Foundation</a>
- under the terms of the <a
- href="https://www.apache.org/licenses/LICENSE-2.0.html">
- Apache License v2</a>. Apache ORC and its logo are trademarks
- of the Apache Software Foundation.</p>
-</footer>
-
- <script>
- var anchorForId = function (id) {
- var anchor = document.createElement("a");
- anchor.className = "header-link";
- anchor.href = "#" + id;
- anchor.innerHTML = "<span class=\"sr-only\">Permalink</span><i class=\"fa fa-link\"></i>";
- anchor.title = "Permalink";
- return anchor;
- };
-
- var linkifyAnchors = function (level, containingElement) {
- var headers = containingElement.getElementsByTagName("h" + level);
- for (var h = 0; h < headers.length; h++) {
- var header = headers[h];
-
- if (typeof header.id !== "undefined" && header.id !== "") {
- header.appendChild(anchorForId(header.id));
- }
- }
- };
-
- document.onreadystatechange = function () {
- if (this.readyState === "complete") {
- var contentBlock = document.getElementsByClassName("docs")[0] || document.getElementsByClassName("news")[0];
- if (!contentBlock) {
- return;
- }
- for (var level = 1; level <= 6; level++) {
- linkifyAnchors(level, contentBlock);
- }
- }
- };
-</script>
-
-
-</body>
-</html>
http://git-wip-us.apache.org/repos/asf/orc/blob/c6e29090/docs/spec-intro.html
----------------------------------------------------------------------
diff --git a/docs/spec-intro.html b/docs/spec-intro.html
deleted file mode 100644
index 3468dd0..0000000
--- a/docs/spec-intro.html
+++ /dev/null
@@ -1,2180 +0,0 @@
-<!DOCTYPE HTML>
-<html lang="en-US">
-<head>
- <meta charset="UTF-8">
- <title>Introduction</title>
- <meta name="viewport" content="width=device-width,initial-scale=1">
- <meta name="generator" content="Jekyll v2.4.0">
- <link rel="stylesheet" href="//fonts.googleapis.com/css?family=Lato:300,300italic,400,400italic,700,700italic,900">
- <link rel="stylesheet" href="/css/screen.css">
- <link rel="icon" type="image/x-icon" href="/favicon.ico">
- <!--[if lt IE 9]>
- <script src="/js/html5shiv.min.js"></script>
- <script src="/js/respond.min.js"></script>
- <![endif]-->
-</head>
-
-
-<body class="wrap">
- <header role="banner">
- <nav class="mobile-nav show-on-mobiles">
- <ul>
- <li class="">
- <a href="/">Home</a>
- </li>
- <li class="current">
- <a href="/docs/"><span class="show-on-mobiles">Docs</span>
- <span class="hide-on-mobiles">Documentation</span></a>
- </li>
- <li class="">
- <a href="/talks/">Talks</a>
- </li>
- <li class="">
- <a href="/news/">News</a>
- </li>
- <li class="">
- <a href="/help/">Help</a>
- </li>
- <li class="">
- <a href="/develop/">Develop</a>
- </li>
-</ul>
-
- </nav>
- <div class="grid">
- <div class="unit one-third center-on-mobiles">
- <h1>
- <a href="/">
- <span class="sr-only">Apache ORC</span>
- <img src="/img/logo.png" width="249" height="101" alt="ORC Logo">
- </a>
- </h1>
- </div>
- <nav class="main-nav unit two-thirds hide-on-mobiles">
- <ul>
- <li class="">
- <a href="/">Home</a>
- </li>
- <li class="current">
- <a href="/docs/"><span class="show-on-mobiles">Docs</span>
- <span class="hide-on-mobiles">Documentation</span></a>
- </li>
- <li class="">
- <a href="/talks/">Talks</a>
- </li>
- <li class="">
- <a href="/news/">News</a>
- </li>
- <li class="">
- <a href="/help/">Help</a>
- </li>
- <li class="">
- <a href="/develop/">Develop</a>
- </li>
-</ul>
-
- </nav>
- </div>
-</header>
-
-
- <section class="docs">
- <div class="grid">
-
-
- <div class="unit four-fifths">
- <article>
- <h1>Introduction</h1>
- <p>Hive’s RCFile was the standard format for storing tabular data in
-Hadoop for several years. However, RCFile has limitations because it
-treats each column as a binary blob without semantics. In Hive 0.11 we
-added a new file format named Optimized Row Columnar (ORC) file that
-uses and retains the type information from the table definition. ORC
-uses type-specific readers and writers that provide lightweight
-compression techniques such as dictionary encoding, bit packing, delta
-encoding, and run length encoding, resulting in dramatically smaller
-files. Additionally, ORC can apply generic compression using zlib or
-Snappy on top of the lightweight compression for even smaller
-files. However, storage savings are only part of the gain. ORC
-supports projection, which selects subsets of the columns for reading,
-so that queries reading only one column read only the required
-bytes. Furthermore, ORC files include lightweight indexes that
-record the minimum and maximum values for each column in each set of
-10,000 rows and in the entire file. Using pushdown filters from Hive, the
-file reader can skip entire sets of rows that cannot match the
-query.</p>
-
-<p><img src="/img/OrcFileLayout.png" alt="ORC file structure" /></p>
-
- <div class="section-nav">
- <div class="left align-right">
-
-
-
- <a href="/docs/java-tools.html" class="prev">Back</a>
-
- </div>
- <div class="right align-left">
-
-
-
- <a href="/docs/file-tail.html" class="next">Next</a>
-
- </div>
- </div>
- <div class="clear"></div>
-
-
- </article>
- </div>
-
-
- <div class="clear"></div>
-
- </div>
- </section>
-
-
- <footer role="contentinfo">
- <p>The contents of this website are © 2018
- <a href="https://www.apache.org/">Apache Software Foundation</a>
- under the terms of the <a
- href="https://www.apache.org/licenses/LICENSE-2.0.html">
- Apache License v2</a>. Apache ORC and its logo are trademarks
- of the Apache Software Foundation.</p>
-</footer>
-
- <script>
- var anchorForId = function (id) {
- var anchor = document.createElement("a");
- anchor.className = "header-link";
- anchor.href = "#" + id;
- anchor.innerHTML = "<span class=\"sr-only\">Permalink</span><i class=\"fa fa-link\"></i>";
- anchor.title = "Permalink";
- return anchor;
- };
-
- var linkifyAnchors = function (level, containingElement) {
- var headers = containingElement.getElementsByTagName("h" + level);
- for (var h = 0; h < headers.length; h++) {
- var header = headers[h];
-
- if (typeof header.id !== "undefined" && header.id !== "") {
- header.appendChild(anchorForId(header.id));
- }
- }
- };
-
- document.onreadystatechange = function () {
- if (this.readyState === "complete") {
- var contentBlock = document.getElementsByClassName("docs")[0] || document.getElementsByClassName("news")[0];
- if (!contentBlock) {
- return;
- }
- for (var level = 1; level <= 6; level++) {
- linkifyAnchors(level, contentBlock);
- }
- }
- };
-</script>
-
-
-</body>
-</html>
http://git-wip-us.apache.org/repos/asf/orc/blob/c6e29090/docs/stripes.html
----------------------------------------------------------------------
diff --git a/docs/stripes.html b/docs/stripes.html
deleted file mode 100644
index 401c0d9..0000000
--- a/docs/stripes.html
+++ /dev/null
@@ -1,2257 +0,0 @@
-<!DOCTYPE HTML>
-<html lang="en-US">
-<head>
- <meta charset="UTF-8">
- <title>Stripes</title>
- <meta name="viewport" content="width=device-width,initial-scale=1">
- <meta name="generator" content="Jekyll v2.4.0">
- <link rel="stylesheet" href="//fonts.googleapis.com/css?family=Lato:300,300italic,400,400italic,700,700italic,900">
- <link rel="stylesheet" href="/css/screen.css">
- <link rel="icon" type="image/x-icon" href="/favicon.ico">
- <!--[if lt IE 9]>
- <script src="/js/html5shiv.min.js"></script>
- <script src="/js/respond.min.js"></script>
- <![endif]-->
-</head>
-
-
-<body class="wrap">
- <header role="banner">
- <nav class="mobile-nav show-on-mobiles">
- <ul>
- <li class="">
- <a href="/">Home</a>
- </li>
- <li class="current">
- <a href="/docs/"><span class="show-on-mobiles">Docs</span>
- <span class="hide-on-mobiles">Documentation</span></a>
- </li>
- <li class="">
- <a href="/talks/">Talks</a>
- </li>
- <li class="">
- <a href="/news/">News</a>
- </li>
- <li class="">
- <a href="/help/">Help</a>
- </li>
- <li class="">
- <a href="/develop/">Develop</a>
- </li>
-</ul>
-
- </nav>
- <div class="grid">
- <div class="unit one-third center-on-mobiles">
- <h1>
- <a href="/">
- <span class="sr-only">Apache ORC</span>
- <img src="/img/logo.png" width="249" height="101" alt="ORC Logo">
- </a>
- </h1>
- </div>
- <nav class="main-nav unit two-thirds hide-on-mobiles">
- <ul>
- <li class="">
- <a href="/">Home</a>
- </li>
- <li class="current">
- <a href="/docs/"><span class="show-on-mobiles">Docs</span>
- <span class="hide-on-mobiles">Documentation</span></a>
- </li>
- <li class="">
- <a href="/talks/">Talks</a>
- </li>
- <li class="">
- <a href="/news/">News</a>
- </li>
- <li class="">
- <a href="/help/">Help</a>
- </li>
- <li class="">
- <a href="/develop/">Develop</a>
- </li>
-</ul>
-
- </nav>
- </div>
-</header>
-
-
- <section class="docs">
- <div class="grid">
-
-
- <div class="unit four-fifths">
- <article>
- <h1>Stripes</h1>
- <p>The body of ORC files consists of a series of stripes. Stripes are
-large (typically ~200MB) and independent of each other and are often
-processed by different tasks. The defining characteristic of columnar
-storage formats is that the data for each column is stored separately
-and that the cost of reading data out of the file should be proportional to the
-number of columns read.</p>
-
-<p>In ORC files, each column is stored in several streams that are stored
-next to each other in the file. For example, an integer column is
-represented as two streams: PRESENT, which uses one bit per
-value to record whether the value is non-null, and DATA, which records the
-non-null values. If all of a column’s values in a stripe are non-null,
-the PRESENT stream is omitted from the stripe. For binary data, ORC
-uses three streams: PRESENT, DATA, and LENGTH, which stores the length
-of each value. The details of each type will be presented in the
-following subsections.</p>
-
-<h1 id="stripe-footer">Stripe Footer</h1>
-
-<p>The stripe footer contains the encoding of each column and the
-directory of the streams including their location.</p>
-
-<p><code>message StripeFooter {
- // the location of each stream
- repeated Stream streams = 1;
- // the encoding of each column
- repeated ColumnEncoding columns = 2;
-}
-</code></p>
-
-<p>To describe each stream, ORC stores the kind of stream, the column id,
-and the stream’s size in bytes. The details of what is stored in each stream
-depend on the type and encoding of the column.</p>
-
-<p><code>message Stream {
- enum Kind {
- // boolean stream of whether the next value is non-null
- PRESENT = 0;
- // the primary data stream
- DATA = 1;
- // the length of each value for variable length data
- LENGTH = 2;
- // the dictionary blob
- DICTIONARY_DATA = 3;
- // deprecated prior to Hive 0.11
- // It was used to store the number of instances of each value in the
- // dictionary
- DICTIONARY_COUNT = 4;
- // a secondary data stream
- SECONDARY = 5;
- // the index for seeking to particular row groups
- ROW_INDEX = 6;
- // original bloom filters used before ORC-101
- BLOOM_FILTER = 7;
- // bloom filters that consistently use utf8
- BLOOM_FILTER_UTF8 = 8;
- }
- required Kind kind = 1;
- // the column id
- optional uint32 column = 2;
- // the number of bytes in the file
- optional uint64 length = 3;
-}
-</code></p>
-
-<p>Depending on the column’s type, several encodings are possible. The
-encodings are divided into direct and dictionary-based categories and
-further refined as to whether they use RLE v1 or v2.</p>
-
-<p><code>message ColumnEncoding {
- enum Kind {
- // the encoding is mapped directly to the stream using RLE v1
- DIRECT = 0;
- // the encoding uses a dictionary of unique values using RLE v1
- DICTIONARY = 1;
- // the encoding is direct using RLE v2
- DIRECT_V2 = 2;
- // the encoding is dictionary-based using RLE v2
- DICTIONARY_V2 = 3;
- }
- required Kind kind = 1;
- // for dictionary encodings, record the size of the dictionary
- optional uint32 dictionarySize = 2;
-}
-</code></p>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <div class="section-nav">
- <div class="left align-right">
-
-
-
- <a href="/docs/run-length.html" class="prev">Back</a>
-
- </div>
- <div class="right align-left">
-
-
-
- <a href="/docs/encodings.html" class="next">Next</a>
-
- </div>
- </div>
- <div class="clear"></div>
-
-
- </article>
- </div>
-
- <div class="unit one-fifth hide-on-mobiles">
- <aside>
-
- <h4>Overview</h4>
-
-
-<ul>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/index.html">Background</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/adopters.html">ORC Adopters</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/types.html">Types</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/indexes.html">Indexes</a></li>
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/acid.html">ACID support</a></li>
-
-
-
-</ul>
-
-
- <h4>Installing</h4>
-
-
-<ul>
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/building.html">Building ORC</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/releases.html">Releases</a></li>
-
-
-
-</ul>
-
-
- <h4>Using in Hive</h4>
-
-
-<ul>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/hive-ddl.html">Hive DDL</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/hive-config.html">Hive Configuration</a></li>
-
-
-
-</ul>
-
-
- <h4>Using in MapReduce</h4>
-
-
-<ul>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/mapred.html">Using in MapRed</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/mapreduce.html">Using in MapReduce</a></li>
-
-
-
-</ul>
-
-
- <h4>Using ORC Core</h4>
-
-
-<ul>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/core-java.html">Using Core Java</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/core-cpp.html">Using Core C++</a></li>
-
-
-
-</ul>
-
-
- <h4>Tools</h4>
-
-
-<ul>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/cpp-tools.html">C++ Tools</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/java-tools.html">Java Tools</a></li>
-
-
-
-</ul>
-
-
- <h4>Format Specification</h4>
-
-
-<ul>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/spec-intro.html">Introduction</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/file-tail.html">File Tail</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/compression.html">Compression</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/run-length.html">Run Length Encoding</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class="current"><a href="/docs/stripes.html">Stripes</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/encodings.html">Column Encodings</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/spec-index.html">Indexes</a></li>
-
-
-
-</ul>
-
-
- </aside>
-</div>
-
-
- <div class="clear"></div>
-
- </div>
- </section>
-
-
- <footer role="contentinfo">
- <p>The contents of this website are © 2018
- <a href="https://www.apache.org/">Apache Software Foundation</a>
- under the terms of the <a
- href="https://www.apache.org/licenses/LICENSE-2.0.html">
- Apache License v2</a>. Apache ORC and its logo are trademarks
- of the Apache Software Foundation.</p>
-</footer>
-
- <script>
- var anchorForId = function (id) {
- var anchor = document.createElement("a");
- anchor.className = "header-link";
- anchor.href = "#" + id;
- anchor.innerHTML = "<span class=\"sr-only\">Permalink</span><i class=\"fa fa-link\"></i>";
- anchor.title = "Permalink";
- return anchor;
- };
-
- var linkifyAnchors = function (level, containingElement) {
- var headers = containingElement.getElementsByTagName("h" + level);
- for (var h = 0; h < headers.length; h++) {
- var header = headers[h];
-
- if (typeof header.id !== "undefined" && header.id !== "") {
- header.appendChild(anchorForId(header.id));
- }
- }
- };
-
- document.onreadystatechange = function () {
- if (this.readyState === "complete") {
- var contentBlock = document.getElementsByClassName("docs")[0] || document.getElementsByClassName("news")[0];
- if (!contentBlock) {
- return;
- }
- for (var level = 1; level <= 6; level++) {
- linkifyAnchors(level, contentBlock);
- }
- }
- };
-</script>
-
-
-</body>
-</html>
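The Stripe Footer section of stripes.html above says the footer holds a directory of streams, each with a kind, a column id, and a length in bytes. The following is a minimal Python sketch of how a reader might turn such a directory into byte ranges and pick out only the streams for the columns it needs. It assumes the streams are laid out back-to-back in directory order starting at a known offset. The names Stream, stream_ranges, and ranges_for_columns are hypothetical and are not part of any ORC library.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple


@dataclass
class Stream:
    """Hypothetical in-memory mirror of one stream directory entry."""
    kind: str     # e.g. "PRESENT", "DATA", "LENGTH"
    column: int   # column id the stream belongs to
    length: int   # number of bytes the stream occupies in the file


def stream_ranges(start_offset: int,
                  streams: Iterable[Stream]) -> List[Tuple[Stream, int, int]]:
    """Return (stream, begin, end) for each entry, end exclusive.

    Assumption: streams are stored contiguously in directory order,
    so each stream begins where the previous one ended.
    """
    ranges = []
    offset = start_offset
    for stream in streams:
        ranges.append((stream, offset, offset + stream.length))
        offset += stream.length
    return ranges


def ranges_for_columns(start_offset: int,
                       streams: Iterable[Stream],
                       wanted: Set[int]) -> List[Tuple[str, int, int, int]]:
    """Keep only the byte ranges of streams that belong to wanted columns."""
    return [
        (stream.kind, stream.column, begin, end)
        for stream, begin, end in stream_ranges(start_offset, streams)
        if stream.column in wanted
    ]
```

Under these assumptions, a reader that needs a single column only has to fetch the ranges returned for that column id, which is how the amount of data read can stay proportional to the number of columns read.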
http://git-wip-us.apache.org/repos/asf/orc/blob/c6e29090/docs/types.html
----------------------------------------------------------------------
diff --git a/docs/types.html b/docs/types.html
index dda60a4..149fa88 100644
--- a/docs/types.html
+++ b/docs/types.html
@@ -109,12 +109,6 @@
-
-
-
-
-
-
<option value="/docs/index.html">Background</option>
@@ -130,14 +124,6 @@
-
-
-
-
-
-
-
-
@@ -174,20 +160,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
-
-
@@ -221,20 +193,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
-
-
<option value="/docs/types.html">Types</option>
@@ -261,12 +219,6 @@
-
-
-
-
-
-
<option value="/docs/indexes.html">Indexes</option>
@@ -280,14 +232,6 @@
-
-
-
-
-
-
-
-
@@ -324,20 +268,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
-
-
</optgroup>
@@ -381,20 +311,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
-
-
@@ -426,25 +342,11 @@
-
-
-
-
-
-
<option value="/docs/releases.html">Releases</option>
-
-
-
-
-
-
-
-
</optgroup>
@@ -471,12 +373,6 @@
-
-
-
-
-
-
<option value="/docs/hive-ddl.html">Hive DDL</option>
@@ -494,14 +390,6 @@
-
-
-
-
-
-
-
-
@@ -519,12 +407,6 @@
-
-
-
-
-
-
<option value="/docs/hive-config.html">Hive Configuration</option>
@@ -544,14 +426,6 @@
-
-
-
-
-
-
-
-
</optgroup>
@@ -586,12 +460,6 @@
-
-
-
-
-
-
<option value="/docs/mapred.html">Using in MapRed</option>
@@ -601,14 +469,6 @@
-
-
-
-
-
-
-
-
@@ -638,12 +498,6 @@
-
-
-
-
-
-
<option value="/docs/mapreduce.html">Using in MapReduce</option>
@@ -651,14 +505,6 @@
-
-
-
-
-
-
-
-
</optgroup>
@@ -679,8 +525,6 @@
-
-
<option value="/docs/core-java.html">Using Core Java</option>
@@ -704,18 +548,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
@@ -727,8 +559,6 @@
-
-
<option value="/docs/core-cpp.html">Using Core C++</option>
@@ -754,18 +584,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
</optgroup>
@@ -788,8 +606,6 @@
-
-
<option value="/docs/cpp-tools.html">C++ Tools</option>
@@ -811,18 +627,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
@@ -848,12 +652,6 @@
-
-
-
-
-
-
<option value="/docs/java-tools.html">Java Tools</option>
@@ -865,726 +663,135 @@
-
-
-
-
-
-
-
-
</optgroup>
- <optgroup label="Format Specification">
-
+ </select>
+</div>
-
+ <div class="unit four-fifths">
+ <article>
+ <h1>Types</h1>
+ <p>ORC files are completely self-describing and do not depend on the Hive
+Metastore or any other external metadata. The file includes all of the
+type and encoding information for the objects stored in the file. Because the
+file is self-contained, it does not depend on the user’s environment to
+correctly interpret the file’s contents.</p>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/spec-intro.html">Introduction</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/file-tail.html">File Tail</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/compression.html">Compression</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/run-length.html">Run Length Encoding</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/stripes.html">Stripes</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/encodings.html">Column Encodings</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/spec-index.html">Indexes</option>
-
-
-
-
-
-
-
-
-
-
- </optgroup>
-
- </select>
-</div>
-
-
- <div class="unit four-fifths">
- <article>
- <h1>Types</h1>
- <p>ORC files are completely self-describing and do not depend on the Hive
-Metastore or any other external metadata. The file includes all of the
-type and encoding information for the objects stored in the file. Because the
-file is self-contained, it does not depend on the user’s environment to
-correctly interpret the file’s contents.</p>
-
-<p>ORC provides a rich set of scalar and compound types:</p>
-
-<ul>
- <li>Integer
- <ul>
- <li>boolean (1 bit)</li>
- <li>tinyint (8 bit)</li>
- <li>smallint (16 bit)</li>
- <li>int (32 bit)</li>
- <li>bigint (64 bit)</li>
- </ul>
- </li>
- <li>Floating point
- <ul>
- <li>float</li>
- <li>double</li>
- </ul>
- </li>
- <li>String types
- <ul>
- <li>string</li>
- <li>char</li>
- <li>varchar</li>
- </ul>
- </li>
- <li>Binary blobs
- <ul>
- <li>binary</li>
- </ul>
- </li>
- <li>Date/time
- <ul>
- <li>timestamp</li>
- <li>date</li>
- </ul>
- </li>
- <li>Compound types
- <ul>
- <li>struct</li>
- <li>list</li>
- <li>map</li>
- <li>union</li>
- </ul>
- </li>
-</ul>
-
-<p>All ORC files are logically sequences of identically typed objects. Hive
-always uses a struct with a field for each of the top-level columns as
-the root object type, but that is not required. All types in ORC can take
-null values, including the compound types.</p>
-
-<p>Compound types have child columns that hold the values for their
-sub-elements. For example, a struct column has one child column for
-each field of the struct. Lists always have a single child column for
-the element values and maps always have two child columns. Union
-columns have one child column for each of the variants.</p>
-
-<p>Given the following definition of the table Foobar, the columns in the
-file would form the given tree.</p>
-
-<p><code>create table Foobar (
- myInt int,
- myMap map<string,
- struct<myString : string,
- myDouble: double>>,
- myTime timestamp
-);
-</code></p>
-
-<p><img src="/img/TreeWriters.png" alt="ORC column structure" /></p>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <div class="section-nav">
- <div class="left align-right">
-
-
-
- <a href="/docs/adopters.html" class="prev">Back</a>
-
- </div>
- <div class="right align-left">
-
-
-
- <a href="/docs/indexes.html" class="next">Next</a>
-
- </div>
- </div>
- <div class="clear"></div>
-
-
- </article>
- </div>
-
- <div class="unit one-fifth hide-on-mobiles">
- <aside>
-
- <h4>Overview</h4>
-
+<p>ORC provides a rich set of scalar and compound types:</p>
<ul>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/index.html">Background</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/adopters.html">ORC Adopters</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class="current"><a href="/docs/types.html">Types</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/indexes.html">Indexes</a></li>
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/acid.html">ACID support</a></li>
-
-
-
+ <li>Integer
+ <ul>
+ <li>boolean (1 bit)</li>
+ <li>tinyint (8 bit)</li>
+ <li>smallint (16 bit)</li>
+ <li>int (32 bit)</li>
+ <li>bigint (64 bit)</li>
+ </ul>
+ </li>
+ <li>Floating point
+ <ul>
+ <li>float</li>
+ <li>double</li>
+ </ul>
+ </li>
+ <li>String types
+ <ul>
+ <li>string</li>
+ <li>char</li>
+ <li>varchar</li>
+ </ul>
+ </li>
+ <li>Binary blobs
+ <ul>
+ <li>binary</li>
+ </ul>
+ </li>
+ <li>Date/time
+ <ul>
+ <li>timestamp</li>
+ <li>date</li>
+ </ul>
+ </li>
+ <li>Compound types
+ <ul>
+ <li>struct</li>
+ <li>list</li>
+ <li>map</li>
+ <li>union</li>
+ </ul>
+ </li>
</ul>
-
- <h4>Installing</h4>
-
-
-<ul>
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/building.html">Building ORC</a></li>
-
-
-
-
+<p>All ORC files are logically sequences of identically typed objects. Hive
+always uses a struct with a field for each of the top-level columns as
+the root object type, but that is not required. All types in ORC can take
+null values, including the compound types.</p>
+
+<p>Compound types have child columns that hold the values for their
+sub-elements. For example, a struct column has one child column for
+each field of the struct. Lists always have a single child column for
+the element values and maps always have two child columns. Union
+columns have one child column for each of the variants.</p>
+
+<p>Given the following definition of the table Foobar, the columns in the
+file would form the given tree.</p>
+
+<p><code>create table Foobar (
+ myInt int,
+ myMap map<string,
+ struct<myString : string,
+ myDouble: double>>,
+ myTime timestamp
+);
+</code></p>
+
+<p><img src="/img/TreeWriters.png" alt="ORC column structure" /></p>
+
+
+
+
+
+
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
+
-
-
+
-
+ <div class="section-nav">
+ <div class="left align-right">
+
+
+
+ <a href="/docs/adopters.html" class="prev">Back</a>
+
+ </div>
+ <div class="right align-left">
+
+
+
+ <a href="/docs/indexes.html" class="next">Next</a>
+
+ </div>
+ </div>
+ <div class="clear"></div>
- <li class=""><a href="/docs/releases.html">Releases</a></li>
-
+ </article>
+ </div>
-</ul>
-
+ <div class="unit one-fifth hide-on-mobiles">
+ <aside>
- <h4>Using in Hive</h4>
+ <h4>Overview</h4>
<ul>
@@ -1613,11 +820,7 @@ file would form the given tree.</p>
-
-
-
-
- <li class=""><a href="/docs/hive-ddl.html">Hive DDL</a></li>
+ <li class=""><a href="/docs/index.html">Background</a></li>
@@ -1631,34 +834,10 @@ file would form the given tree.</p>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/hive-config.html">Hive Configuration</a></li>
+ <li class=""><a href="/docs/adopters.html">ORC Adopters</a></li>
-</ul>
-
-
- <h4>Using in MapReduce</h4>
-
-
-<ul>
-
@@ -1695,7 +874,7 @@ file would form the given tree.</p>
- <li class=""><a href="/docs/mapred.html">Using in MapRed</a></li>
+ <li class="current"><a href="/docs/types.html">Types</a></li>
@@ -1725,49 +904,7 @@ file would form the given tree.</p>
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/mapreduce.html">Using in MapReduce</a></li>
-
-
-
-</ul>
-
-
- <h4>Using ORC Core</h4>
-
-
-<ul>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/core-java.html">Using Core Java</a></li>
+ <li class=""><a href="/docs/indexes.html">Indexes</a></li>
@@ -1779,22 +916,14 @@ file would form the given tree.</p>
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/core-cpp.html">Using Core C++</a></li>
+ <li class=""><a href="/docs/acid.html">ACID support</a></li>
</ul>
- <h4>Tools</h4>
+ <h4>Installing</h4>
<ul>
@@ -1811,15 +940,7 @@ file would form the given tree.</p>
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/cpp-tools.html">C++ Tools</a></li>
+ <li class=""><a href="/docs/building.html">Building ORC</a></li>
@@ -1857,14 +978,14 @@ file would form the given tree.</p>
- <li class=""><a href="/docs/java-tools.html">Java Tools</a></li>
+ <li class=""><a href="/docs/releases.html">Releases</a></li>
</ul>
- <h4>Format Specification</h4>
+ <h4>Using in Hive</h4>
<ul>
@@ -1891,31 +1012,7 @@ file would form the given tree.</p>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/spec-intro.html">Introduction</a></li>
+ <li class=""><a href="/docs/hive-ddl.html">Hive DDL</a></li>
@@ -1939,31 +1036,17 @@ file would form the given tree.</p>
-
-
-
-
- <li class=""><a href="/docs/file-tail.html">File Tail</a></li>
+ <li class=""><a href="/docs/hive-config.html">Hive Configuration</a></li>
-
-
-
-
-
+</ul>
-
-
-
-
-
-
+ <h4>Using in MapReduce</h4>
- <li class=""><a href="/docs/compression.html">Compression</a></li>
-
+<ul>
@@ -1995,19 +1078,7 @@ file would form the given tree.</p>
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/run-length.html">Run Length Encoding</a></li>
+ <li class=""><a href="/docs/mapred.html">Using in MapRed</a></li>
@@ -2043,13 +1114,25 @@ file would form the given tree.</p>
-
+ <li class=""><a href="/docs/mapreduce.html">Using in MapReduce</a></li>
+
+
+
+</ul>
+
-
+ <h4>Using ORC Core</h4>
+
+<ul>
+
+
+
+
+
@@ -2059,7 +1142,7 @@ file would form the given tree.</p>
- <li class=""><a href="/docs/stripes.html">Stripes</a></li>
+ <li class=""><a href="/docs/core-java.html">Using Core Java</a></li>
@@ -2077,17 +1160,17 @@ file would form the given tree.</p>
-
-
-
-
-
+ <li class=""><a href="/docs/core-cpp.html">Using Core C++</a></li>
+
+
+
+</ul>
+
-
+ <h4>Tools</h4>
- <li class=""><a href="/docs/encodings.html">Column Encodings</a></li>
-
+<ul>
@@ -2107,11 +1190,17 @@ file would form the given tree.</p>
+ <li class=""><a href="/docs/cpp-tools.html">C++ Tools</a></li>
+
+
+
-
+
+
+
@@ -2133,7 +1222,7 @@ file would form the given tree.</p>
- <li class=""><a href="/docs/spec-index.html">Indexes</a></li>
+ <li class=""><a href="/docs/java-tools.html">Java Tools</a></li>
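The Types page in the types.html diff above describes how compound types own child columns and how the Foobar table forms a column tree. The following is a minimal Python sketch of that tree, flattened with a pre-order traversal so that each column gets an id and the root gets id 0. The tuple layout, the flatten helper, and the placeholder field names "root", "key", and "value" are assumptions made for illustration, not ORC APIs.

```python
# Each node is (field name, type kind, list of child nodes).
# "root", "key" and "value" are placeholder names chosen for this sketch.
foobar = ("root", "struct", [
    ("myInt", "int", []),
    ("myMap", "map", [
        ("key", "string", []),
        ("value", "struct", [
            ("myString", "string", []),
            ("myDouble", "double", []),
        ]),
    ]),
    ("myTime", "timestamp", []),
])


def flatten(node, out=None):
    """Pre-order walk: a node's id is the number of nodes visited before it."""
    if out is None:
        out = []
    name, kind, children = node
    out.append((len(out), name, kind))
    for child in children:
        flatten(child, out)
    return out


if __name__ == "__main__":
    for column_id, name, kind in flatten(foobar):
        print(column_id, name, kind)
```

In this sketch the struct has one child per field, the map has exactly two children (key and value), and the nested struct's fields are numbered before the walk returns to myTime, matching the child-column rules stated on the page.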
[5/9] orc git commit: Pushing ORC-339 reorganize the ORC file format
spec.
Posted by om...@apache.org.
http://git-wip-us.apache.org/repos/asf/orc/blob/c6e29090/docs/mapred.html
----------------------------------------------------------------------
diff --git a/docs/mapred.html b/docs/mapred.html
index f0ab622..ab932db 100644
--- a/docs/mapred.html
+++ b/docs/mapred.html
@@ -109,12 +109,6 @@
-
-
-
-
-
-
<option value="/docs/index.html">Background</option>
@@ -130,14 +124,6 @@
-
-
-
-
-
-
-
-
@@ -174,20 +160,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
-
-
@@ -221,20 +193,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
-
-
<option value="/docs/types.html">Types</option>
@@ -261,12 +219,6 @@
-
-
-
-
-
-
<option value="/docs/indexes.html">Indexes</option>
@@ -280,14 +232,6 @@
-
-
-
-
-
-
-
-
@@ -324,20 +268,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
-
-
</optgroup>
@@ -381,20 +311,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
-
-
@@ -426,25 +342,11 @@
-
-
-
-
-
-
<option value="/docs/releases.html">Releases</option>
-
-
-
-
-
-
-
-
</optgroup>
@@ -471,12 +373,6 @@
-
-
-
-
-
-
<option value="/docs/hive-ddl.html">Hive DDL</option>
@@ -494,14 +390,6 @@
-
-
-
-
-
-
-
-
@@ -519,12 +407,6 @@
-
-
-
-
-
-
<option value="/docs/hive-config.html">Hive Configuration</option>
@@ -544,14 +426,6 @@
-
-
-
-
-
-
-
-
</optgroup>
@@ -586,12 +460,6 @@
-
-
-
-
-
-
<option value="/docs/mapred.html">Using in MapRed</option>
@@ -601,14 +469,6 @@
-
-
-
-
-
-
-
-
@@ -638,12 +498,6 @@
-
-
-
-
-
-
<option value="/docs/mapreduce.html">Using in MapReduce</option>
@@ -651,14 +505,6 @@
-
-
-
-
-
-
-
-
</optgroup>
@@ -679,8 +525,6 @@
-
-
<option value="/docs/core-java.html">Using Core Java</option>
@@ -704,18 +548,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
@@ -727,8 +559,6 @@
-
-
<option value="/docs/core-cpp.html">Using Core C++</option>
@@ -754,18 +584,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
</optgroup>
@@ -788,8 +606,6 @@
-
-
<option value="/docs/cpp-tools.html">C++ Tools</option>
@@ -811,18 +627,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
@@ -848,12 +652,6 @@
-
-
-
-
-
-
<option value="/docs/java-tools.html">Java Tools</option>
@@ -865,386 +663,21 @@
-
-
-
-
-
-
-
-
</optgroup>
- <optgroup label="Format Specification">
-
+ </select>
+</div>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/spec-intro.html">Introduction</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/file-tail.html">File Tail</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/compression.html">Compression</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/run-length.html">Run Length Encoding</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/stripes.html">Stripes</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/encodings.html">Column Encodings</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/spec-index.html">Indexes</option>
-
-
-
-
-
-
-
-
-
-
- </optgroup>
-
- </select>
-</div>
-
-
- <div class="unit four-fifths">
- <article>
- <h1>Using in MapRed</h1>
- <p>This page describes how to read and write ORC files from Hadoop’s
-older org.apache.hadoop.mapred MapReduce APIs. If you want to use the
-new org.apache.hadoop.mapreduce API, please look at the <a href="/docs/mapreduce.html">next
-page</a>.</p>
+ <div class="unit four-fifths">
+ <article>
+ <h1>Using in MapRed</h1>
+ <p>This page describes how to read and write ORC files from Hadoop’s
+older org.apache.hadoop.mapred MapReduce APIs. If you want to use the
+new org.apache.hadoop.mapreduce API, please look at the <a href="/docs/mapreduce.html">next
+page</a>.</p>
<h2 id="reading-orc-files">Reading ORC files</h2>
@@ -1506,282 +939,56 @@ OrcKey.key and OrcValue.value fields.</p>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <div class="section-nav">
- <div class="left align-right">
-
-
-
- <a href="/docs/hive-config.html" class="prev">Back</a>
-
- </div>
- <div class="right align-left">
-
-
-
- <a href="/docs/mapreduce.html" class="next">Next</a>
-
- </div>
- </div>
- <div class="clear"></div>
-
-
- </article>
- </div>
-
- <div class="unit one-fifth hide-on-mobiles">
- <aside>
-
- <h4>Overview</h4>
-
-
-<ul>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/index.html">Background</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/adopters.html">ORC Adopters</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/types.html">Types</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/indexes.html">Indexes</a></li>
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/acid.html">ACID support</a></li>
-
-
-
-</ul>
-
-
- <h4>Installing</h4>
-
-
-<ul>
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/building.html">Building ORC</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
+
-
-
+
-
-
+
-
-
+
-
-
+
-
-
+
-
-
+
-
-
+
+
+ <div class="section-nav">
+ <div class="left align-right">
+
+
+
+ <a href="/docs/hive-config.html" class="prev">Back</a>
+
+ </div>
+ <div class="right align-left">
+
+
+
+ <a href="/docs/mapreduce.html" class="next">Next</a>
+
+ </div>
+ </div>
+ <div class="clear"></div>
- <li class=""><a href="/docs/releases.html">Releases</a></li>
-
+ </article>
+ </div>
-</ul>
-
+ <div class="unit one-fifth hide-on-mobiles">
+ <aside>
- <h4>Using in Hive</h4>
+ <h4>Overview</h4>
<ul>
@@ -1810,11 +1017,7 @@ OrcKey.key and OrcValue.value fields.</p>
-
-
-
-
- <li class=""><a href="/docs/hive-ddl.html">Hive DDL</a></li>
+ <li class=""><a href="/docs/index.html">Background</a></li>
@@ -1828,34 +1031,10 @@ OrcKey.key and OrcValue.value fields.</p>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/hive-config.html">Hive Configuration</a></li>
+ <li class=""><a href="/docs/adopters.html">ORC Adopters</a></li>
-</ul>
-
-
- <h4>Using in MapReduce</h4>
-
-
-<ul>
-
@@ -1892,7 +1071,7 @@ OrcKey.key and OrcValue.value fields.</p>
- <li class="current"><a href="/docs/mapred.html">Using in MapRed</a></li>
+ <li class=""><a href="/docs/types.html">Types</a></li>
@@ -1922,49 +1101,7 @@ OrcKey.key and OrcValue.value fields.</p>
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/mapreduce.html">Using in MapReduce</a></li>
-
-
-
-</ul>
-
-
- <h4>Using ORC Core</h4>
-
-
-<ul>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/core-java.html">Using Core Java</a></li>
+ <li class=""><a href="/docs/indexes.html">Indexes</a></li>
@@ -1976,22 +1113,14 @@ OrcKey.key and OrcValue.value fields.</p>
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/core-cpp.html">Using Core C++</a></li>
+ <li class=""><a href="/docs/acid.html">ACID support</a></li>
</ul>
- <h4>Tools</h4>
+ <h4>Installing</h4>
<ul>
@@ -2008,15 +1137,7 @@ OrcKey.key and OrcValue.value fields.</p>
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/cpp-tools.html">C++ Tools</a></li>
+ <li class=""><a href="/docs/building.html">Building ORC</a></li>
@@ -2054,14 +1175,14 @@ OrcKey.key and OrcValue.value fields.</p>
- <li class=""><a href="/docs/java-tools.html">Java Tools</a></li>
+ <li class=""><a href="/docs/releases.html">Releases</a></li>
</ul>
- <h4>Format Specification</h4>
+ <h4>Using in Hive</h4>
<ul>
@@ -2088,31 +1209,7 @@ OrcKey.key and OrcValue.value fields.</p>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/spec-intro.html">Introduction</a></li>
+ <li class=""><a href="/docs/hive-ddl.html">Hive DDL</a></li>
@@ -2136,31 +1233,17 @@ OrcKey.key and OrcValue.value fields.</p>
-
-
-
-
- <li class=""><a href="/docs/file-tail.html">File Tail</a></li>
+ <li class=""><a href="/docs/hive-config.html">Hive Configuration</a></li>
-
-
-
-
-
+</ul>
-
-
-
-
-
-
+ <h4>Using in MapReduce</h4>
- <li class=""><a href="/docs/compression.html">Compression</a></li>
-
+<ul>
@@ -2192,19 +1275,7 @@ OrcKey.key and OrcValue.value fields.</p>
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/run-length.html">Run Length Encoding</a></li>
+ <li class="current"><a href="/docs/mapred.html">Using in MapRed</a></li>
@@ -2240,13 +1311,25 @@ OrcKey.key and OrcValue.value fields.</p>
-
+ <li class=""><a href="/docs/mapreduce.html">Using in MapReduce</a></li>
+
+
+
+</ul>
+
-
+ <h4>Using ORC Core</h4>
+
+<ul>
+
+
+
+
+
@@ -2256,7 +1339,7 @@ OrcKey.key and OrcValue.value fields.</p>
- <li class=""><a href="/docs/stripes.html">Stripes</a></li>
+ <li class=""><a href="/docs/core-java.html">Using Core Java</a></li>
@@ -2274,17 +1357,17 @@ OrcKey.key and OrcValue.value fields.</p>
-
-
-
-
-
+ <li class=""><a href="/docs/core-cpp.html">Using Core C++</a></li>
+
+
+
+</ul>
+
-
+ <h4>Tools</h4>
- <li class=""><a href="/docs/encodings.html">Column Encodings</a></li>
-
+<ul>
@@ -2304,11 +1387,17 @@ OrcKey.key and OrcValue.value fields.</p>
+ <li class=""><a href="/docs/cpp-tools.html">C++ Tools</a></li>
+
+
+
-
+
+
+
@@ -2330,7 +1419,7 @@ OrcKey.key and OrcValue.value fields.</p>
- <li class=""><a href="/docs/spec-index.html">Indexes</a></li>
+ <li class=""><a href="/docs/java-tools.html">Java Tools</a></li>
http://git-wip-us.apache.org/repos/asf/orc/blob/c6e29090/docs/mapreduce.html
----------------------------------------------------------------------
diff --git a/docs/mapreduce.html b/docs/mapreduce.html
index 2423f01..63fcd9c 100644
--- a/docs/mapreduce.html
+++ b/docs/mapreduce.html
@@ -109,12 +109,6 @@
-
-
-
-
-
-
<option value="/docs/index.html">Background</option>
@@ -130,14 +124,6 @@
-
-
-
-
-
-
-
-
@@ -174,20 +160,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
-
-
@@ -221,20 +193,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
-
-
<option value="/docs/types.html">Types</option>
@@ -261,12 +219,6 @@
-
-
-
-
-
-
<option value="/docs/indexes.html">Indexes</option>
@@ -280,14 +232,6 @@
-
-
-
-
-
-
-
-
@@ -324,20 +268,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
-
-
</optgroup>
@@ -381,20 +311,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
-
-
@@ -426,25 +342,11 @@
-
-
-
-
-
-
<option value="/docs/releases.html">Releases</option>
-
-
-
-
-
-
-
-
</optgroup>
@@ -471,12 +373,6 @@
-
-
-
-
-
-
<option value="/docs/hive-ddl.html">Hive DDL</option>
@@ -494,14 +390,6 @@
-
-
-
-
-
-
-
-
@@ -519,12 +407,6 @@
-
-
-
-
-
-
<option value="/docs/hive-config.html">Hive Configuration</option>
@@ -544,14 +426,6 @@
-
-
-
-
-
-
-
-
</optgroup>
@@ -586,12 +460,6 @@
-
-
-
-
-
-
<option value="/docs/mapred.html">Using in MapRed</option>
@@ -601,14 +469,6 @@
-
-
-
-
-
-
-
-
@@ -638,12 +498,6 @@
-
-
-
-
-
-
<option value="/docs/mapreduce.html">Using in MapReduce</option>
@@ -651,14 +505,6 @@
-
-
-
-
-
-
-
-
</optgroup>
@@ -679,8 +525,6 @@
-
-
<option value="/docs/core-java.html">Using Core Java</option>
@@ -704,18 +548,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
@@ -727,8 +559,6 @@
-
-
<option value="/docs/core-cpp.html">Using Core C++</option>
@@ -754,18 +584,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
</optgroup>
@@ -788,8 +606,6 @@
-
-
<option value="/docs/cpp-tools.html">C++ Tools</option>
@@ -811,18 +627,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
@@ -848,12 +652,6 @@
-
-
-
-
-
-
<option value="/docs/java-tools.html">Java Tools</option>
@@ -865,386 +663,21 @@
-
-
-
-
-
-
-
-
</optgroup>
- <optgroup label="Format Specification">
-
+ </select>
+</div>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/spec-intro.html">Introduction</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/file-tail.html">File Tail</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/compression.html">Compression</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/run-length.html">Run Length Encoding</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/stripes.html">Stripes</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/encodings.html">Column Encodings</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/spec-index.html">Indexes</option>
-
-
-
-
-
-
-
-
-
-
- </optgroup>
-
- </select>
-</div>
-
-
- <div class="unit four-fifths">
- <article>
- <h1>Using in MapReduce</h1>
- <p>This page describes how to read and write ORC files from Hadoop’s
-newer org.apache.hadoop.mapreduce MapReduce APIs. If you want to use the
-older org.apache.hadoop.mapred API, please look at the <a href="/docs/mapred.html">previous
-page</a>.</p>
+ <div class="unit four-fifths">
+ <article>
+ <h1>Using in MapReduce</h1>
+ <p>This page describes how to read and write ORC files from Hadoop’s
+newer org.apache.hadoop.mapreduce MapReduce APIs. If you want to use the
+older org.apache.hadoop.mapred API, please look at the <a href="/docs/mapred.html">previous
+page</a>.</p>
<h2 id="reading-orc-files">Reading ORC files</h2>
@@ -1483,289 +916,63 @@ OrcKey.key and OrcValue.value fields.</p>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <div class="section-nav">
- <div class="left align-right">
-
-
-
- <a href="/docs/mapred.html" class="prev">Back</a>
-
- </div>
- <div class="right align-left">
-
-
-
- <a href="/docs/core-java.html" class="next">Next</a>
-
- </div>
- </div>
- <div class="clear"></div>
-
-
- </article>
- </div>
-
- <div class="unit one-fifth hide-on-mobiles">
- <aside>
-
- <h4>Overview</h4>
-
-
-<ul>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/index.html">Background</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/adopters.html">ORC Adopters</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/types.html">Types</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/indexes.html">Indexes</a></li>
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/acid.html">ACID support</a></li>
-
-
-
-</ul>
-
-
- <h4>Installing</h4>
-
-
-<ul>
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/building.html">Building ORC</a></li>
-
-
-
-
+
-
-
-
+
-
-
+
-
-
+
-
-
+
-
-
+
-
-
+
-
-
+
-
-
+
+
+ <div class="section-nav">
+ <div class="left align-right">
+
+
+
+ <a href="/docs/mapred.html" class="prev">Back</a>
+
+ </div>
+ <div class="right align-left">
+
+
+
+ <a href="/docs/core-java.html" class="next">Next</a>
+
+ </div>
+ </div>
+ <div class="clear"></div>
- <li class=""><a href="/docs/releases.html">Releases</a></li>
-
+ </article>
+ </div>
-</ul>
-
+ <div class="unit one-fifth hide-on-mobiles">
+ <aside>
- <h4>Using in Hive</h4>
+ <h4>Overview</h4>
<ul>
@@ -1794,11 +1001,7 @@ OrcKey.key and OrcValue.value fields.</p>
-
-
-
-
- <li class=""><a href="/docs/hive-ddl.html">Hive DDL</a></li>
+ <li class=""><a href="/docs/index.html">Background</a></li>
@@ -1812,34 +1015,10 @@ OrcKey.key and OrcValue.value fields.</p>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/hive-config.html">Hive Configuration</a></li>
+ <li class=""><a href="/docs/adopters.html">ORC Adopters</a></li>
-</ul>
-
-
- <h4>Using in MapReduce</h4>
-
-
-<ul>
-
@@ -1876,7 +1055,7 @@ OrcKey.key and OrcValue.value fields.</p>
- <li class=""><a href="/docs/mapred.html">Using in MapRed</a></li>
+ <li class=""><a href="/docs/types.html">Types</a></li>
@@ -1906,49 +1085,7 @@ OrcKey.key and OrcValue.value fields.</p>
-
-
-
-
-
-
-
-
-
-
-
-
- <li class="current"><a href="/docs/mapreduce.html">Using in MapReduce</a></li>
-
-
-
-</ul>
-
-
- <h4>Using ORC Core</h4>
-
-
-<ul>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/core-java.html">Using Core Java</a></li>
+ <li class=""><a href="/docs/indexes.html">Indexes</a></li>
@@ -1960,22 +1097,14 @@ OrcKey.key and OrcValue.value fields.</p>
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/core-cpp.html">Using Core C++</a></li>
+ <li class=""><a href="/docs/acid.html">ACID support</a></li>
</ul>
- <h4>Tools</h4>
+ <h4>Installing</h4>
<ul>
@@ -1992,15 +1121,7 @@ OrcKey.key and OrcValue.value fields.</p>
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/cpp-tools.html">C++ Tools</a></li>
+ <li class=""><a href="/docs/building.html">Building ORC</a></li>
@@ -2038,14 +1159,14 @@ OrcKey.key and OrcValue.value fields.</p>
- <li class=""><a href="/docs/java-tools.html">Java Tools</a></li>
+ <li class=""><a href="/docs/releases.html">Releases</a></li>
</ul>
- <h4>Format Specification</h4>
+ <h4>Using in Hive</h4>
<ul>
@@ -2072,31 +1193,7 @@ OrcKey.key and OrcValue.value fields.</p>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/spec-intro.html">Introduction</a></li>
+ <li class=""><a href="/docs/hive-ddl.html">Hive DDL</a></li>
@@ -2120,31 +1217,17 @@ OrcKey.key and OrcValue.value fields.</p>
-
-
-
-
- <li class=""><a href="/docs/file-tail.html">File Tail</a></li>
+ <li class=""><a href="/docs/hive-config.html">Hive Configuration</a></li>
-
-
-
-
-
+</ul>
-
-
-
-
-
-
+ <h4>Using in MapReduce</h4>
- <li class=""><a href="/docs/compression.html">Compression</a></li>
-
+<ul>
@@ -2176,19 +1259,7 @@ OrcKey.key and OrcValue.value fields.</p>
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/run-length.html">Run Length Encoding</a></li>
+ <li class=""><a href="/docs/mapred.html">Using in MapRed</a></li>
@@ -2224,13 +1295,25 @@ OrcKey.key and OrcValue.value fields.</p>
-
+ <li class="current"><a href="/docs/mapreduce.html">Using in MapReduce</a></li>
+
+
+
+</ul>
+
-
+ <h4>Using ORC Core</h4>
+
+<ul>
+
+
+
+
+
@@ -2240,7 +1323,7 @@ OrcKey.key and OrcValue.value fields.</p>
- <li class=""><a href="/docs/stripes.html">Stripes</a></li>
+ <li class=""><a href="/docs/core-java.html">Using Core Java</a></li>
@@ -2258,17 +1341,17 @@ OrcKey.key and OrcValue.value fields.</p>
-
-
-
-
-
+ <li class=""><a href="/docs/core-cpp.html">Using Core C++</a></li>
+
+
+
+</ul>
+
-
+ <h4>Tools</h4>
- <li class=""><a href="/docs/encodings.html">Column Encodings</a></li>
-
+<ul>
@@ -2288,11 +1371,17 @@ OrcKey.key and OrcValue.value fields.</p>
+ <li class=""><a href="/docs/cpp-tools.html">C++ Tools</a></li>
+
+
+
-
+
+
+
@@ -2314,7 +1403,7 @@ OrcKey.key and OrcValue.value fields.</p>
- <li class=""><a href="/docs/spec-index.html">Indexes</a></li>
+ <li class=""><a href="/docs/java-tools.html">Java Tools</a></li>
http://git-wip-us.apache.org/repos/asf/orc/blob/c6e29090/docs/releases.html
----------------------------------------------------------------------
diff --git a/docs/releases.html b/docs/releases.html
index 8a2406f..3b96cec 100644
--- a/docs/releases.html
+++ b/docs/releases.html
@@ -109,12 +109,6 @@
-
-
-
-
-
-
<option value="/docs/index.html">Background</option>
@@ -130,14 +124,6 @@
-
-
-
-
-
-
-
-
@@ -174,20 +160,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
-
-
@@ -221,20 +193,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
-
-
<option value="/docs/types.html">Types</option>
@@ -261,12 +219,6 @@
-
-
-
-
-
-
<option value="/docs/indexes.html">Indexes</option>
@@ -280,14 +232,6 @@
-
-
-
-
-
-
-
-
@@ -324,20 +268,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
-
-
</optgroup>
@@ -381,20 +311,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
-
-
@@ -426,25 +342,11 @@
-
-
-
-
-
-
<option value="/docs/releases.html">Releases</option>
-
-
-
-
-
-
-
-
</optgroup>
@@ -471,12 +373,6 @@
-
-
-
-
-
-
<option value="/docs/hive-ddl.html">Hive DDL</option>
@@ -494,14 +390,6 @@
-
-
-
-
-
-
-
-
@@ -519,12 +407,6 @@
-
-
-
-
-
-
<option value="/docs/hive-config.html">Hive Configuration</option>
@@ -544,14 +426,6 @@
-
-
-
-
-
-
-
-
</optgroup>
@@ -586,12 +460,6 @@
-
-
-
-
-
-
<option value="/docs/mapred.html">Using in MapRed</option>
@@ -601,14 +469,6 @@
-
-
-
-
-
-
-
-
@@ -638,12 +498,6 @@
-
-
-
-
-
-
<option value="/docs/mapreduce.html">Using in MapReduce</option>
@@ -651,14 +505,6 @@
-
-
-
-
-
-
-
-
</optgroup>
@@ -679,8 +525,6 @@
-
-
<option value="/docs/core-java.html">Using Core Java</option>
@@ -704,18 +548,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
@@ -727,8 +559,6 @@
-
-
<option value="/docs/core-cpp.html">Using Core C++</option>
@@ -754,18 +584,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
</optgroup>
@@ -788,8 +606,6 @@
-
-
<option value="/docs/cpp-tools.html">C++ Tools</option>
@@ -811,18 +627,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
@@ -848,12 +652,6 @@
-
-
-
-
-
-
<option value="/docs/java-tools.html">Java Tools</option>
@@ -865,384 +663,19 @@
-
-
-
-
-
-
-
-
</optgroup>
- <optgroup label="Format Specification">
-
+ </select>
+</div>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/spec-intro.html">Introduction</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/file-tail.html">File Tail</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/compression.html">Compression</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/run-length.html">Run Length Encoding</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/stripes.html">Stripes</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/encodings.html">Column Encodings</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/spec-index.html">Indexes</option>
-
-
-
-
-
-
-
-
-
-
- </optgroup>
-
- </select>
-</div>
-
-
- <div class="unit four-fifths">
- <article>
- <h1>Releases</h1>
-
-<h2 id="current-release---143">Current Release - 1.4.3:</h2>
+ <div class="unit four-fifths">
+ <article>
+ <h1>Releases</h1>
+
+<h2 id="current-release---143">Current Release - 1.4.3:</h2>
<p>ORC 1.4.3 contains both the Java reader and writer and the C++
reader for ORC files. It also contains tools for working with ORC
@@ -1483,273 +916,47 @@ committers’ <a href="https://dist.apache.org/repos/dist/release/orc/KEYS">key
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <div class="section-nav">
- <div class="left align-right">
-
-
-
- <a href="/docs/building.html" class="prev">Back</a>
-
- </div>
- <div class="right align-left">
-
-
-
- <a href="/docs/hive-ddl.html" class="next">Next</a>
-
- </div>
- </div>
- <div class="clear"></div>
-
-
- </article>
- </div>
-
- <div class="unit one-fifth hide-on-mobiles">
- <aside>
-
- <h4>Overview</h4>
-
-
-<ul>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/index.html">Background</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/adopters.html">ORC Adopters</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/types.html">Types</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/indexes.html">Indexes</a></li>
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/acid.html">ACID support</a></li>
-
-
-
-</ul>
-
-
- <h4>Installing</h4>
-
-
-<ul>
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/building.html">Building ORC</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
+
-
-
+
-
-
+
-
-
+
-
-
+
-
+ <div class="section-nav">
+ <div class="left align-right">
+
+
+
+ <a href="/docs/building.html" class="prev">Back</a>
+
+ </div>
+ <div class="right align-left">
+
+
+
+ <a href="/docs/hive-ddl.html" class="next">Next</a>
+
+ </div>
+ </div>
+ <div class="clear"></div>
- <li class="current"><a href="/docs/releases.html">Releases</a></li>
-
+ </article>
+ </div>
-</ul>
-
+ <div class="unit one-fifth hide-on-mobiles">
+ <aside>
- <h4>Using in Hive</h4>
+ <h4>Overview</h4>
<ul>
@@ -1778,11 +985,7 @@ committers’ <a href="https://dist.apache.org/repos/dist/release/orc/KEYS">key
-
-
-
-
- <li class=""><a href="/docs/hive-ddl.html">Hive DDL</a></li>
+ <li class=""><a href="/docs/index.html">Background</a></li>
@@ -1796,34 +999,10 @@ committers’ <a href="https://dist.apache.org/repos/dist/release/orc/KEYS">key
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/hive-config.html">Hive Configuration</a></li>
+ <li class=""><a href="/docs/adopters.html">ORC Adopters</a></li>
-</ul>
-
-
- <h4>Using in MapReduce</h4>
-
-
-<ul>
-
@@ -1860,7 +1039,7 @@ committers’ <a href="https://dist.apache.org/repos/dist/release/orc/KEYS">key
- <li class=""><a href="/docs/mapred.html">Using in MapRed</a></li>
+ <li class=""><a href="/docs/types.html">Types</a></li>
@@ -1890,49 +1069,7 @@ committers’ <a href="https://dist.apache.org/repos/dist/release/orc/KEYS">key
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/mapreduce.html">Using in MapReduce</a></li>
-
-
-
-</ul>
-
-
- <h4>Using ORC Core</h4>
-
-
-<ul>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/core-java.html">Using Core Java</a></li>
+ <li class=""><a href="/docs/indexes.html">Indexes</a></li>
@@ -1944,22 +1081,14 @@ committers’ <a href="https://dist.apache.org/repos/dist/release/orc/KEYS">key
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/core-cpp.html">Using Core C++</a></li>
+ <li class=""><a href="/docs/acid.html">ACID support</a></li>
</ul>
- <h4>Tools</h4>
+ <h4>Installing</h4>
<ul>
@@ -1976,15 +1105,7 @@ committers’ <a href="https://dist.apache.org/repos/dist/release/orc/KEYS">key
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/cpp-tools.html">C++ Tools</a></li>
+ <li class=""><a href="/docs/building.html">Building ORC</a></li>
@@ -2022,14 +1143,14 @@ committers’ <a href="https://dist.apache.org/repos/dist/release/orc/KEYS">key
- <li class=""><a href="/docs/java-tools.html">Java Tools</a></li>
+ <li class="current"><a href="/docs/releases.html">Releases</a></li>
</ul>
- <h4>Format Specification</h4>
+ <h4>Using in Hive</h4>
<ul>
@@ -2056,31 +1177,7 @@ committers’ <a href="https://dist.apache.org/repos/dist/release/orc/KEYS">key
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/spec-intro.html">Introduction</a></li>
+ <li class=""><a href="/docs/hive-ddl.html">Hive DDL</a></li>
@@ -2104,31 +1201,17 @@ committers’ <a href="https://dist.apache.org/repos/dist/release/orc/KEYS">key
-
-
-
-
- <li class=""><a href="/docs/file-tail.html">File Tail</a></li>
+ <li class=""><a href="/docs/hive-config.html">Hive Configuration</a></li>
-
-
-
-
-
+</ul>
-
-
-
-
-
-
+ <h4>Using in MapReduce</h4>
- <li class=""><a href="/docs/compression.html">Compression</a></li>
-
+<ul>
@@ -2160,19 +1243,7 @@ committers’ <a href="https://dist.apache.org/repos/dist/release/orc/KEYS">key
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/run-length.html">Run Length Encoding</a></li>
+ <li class=""><a href="/docs/mapred.html">Using in MapRed</a></li>
@@ -2208,13 +1279,25 @@ committers’ <a href="https://dist.apache.org/repos/dist/release/orc/KEYS">key
-
+ <li class=""><a href="/docs/mapreduce.html">Using in MapReduce</a></li>
+
+
+
+</ul>
+
-
+ <h4>Using ORC Core</h4>
+
+<ul>
+
+
+
+
+
@@ -2224,7 +1307,7 @@ committers’ <a href="https://dist.apache.org/repos/dist/release/orc/KEYS">key
- <li class=""><a href="/docs/stripes.html">Stripes</a></li>
+ <li class=""><a href="/docs/core-java.html">Using Core Java</a></li>
@@ -2242,17 +1325,17 @@ committers’ <a href="https://dist.apache.org/repos/dist/release/orc/KEYS">key
-
-
-
-
-
+ <li class=""><a href="/docs/core-cpp.html">Using Core C++</a></li>
+
+
+
+</ul>
+
-
+ <h4>Tools</h4>
- <li class=""><a href="/docs/encodings.html">Column Encodings</a></li>
-
+<ul>
@@ -2272,11 +1355,17 @@ committers’ <a href="https://dist.apache.org/repos/dist/release/orc/KEYS">key
+ <li class=""><a href="/docs/cpp-tools.html">C++ Tools</a></li>
+
+
+
-
+
+
+
@@ -2298,7 +1387,7 @@ committers’ <a href="https://dist.apache.org/repos/dist/release/orc/KEYS">key
- <li class=""><a href="/docs/spec-index.html">Indexes</a></li>
+ <li class=""><a href="/docs/java-tools.html">Java Tools</a></li>
http://git-wip-us.apache.org/repos/asf/orc/blob/c6e29090/docs/run-length.html
----------------------------------------------------------------------
diff --git a/docs/run-length.html b/docs/run-length.html
deleted file mode 100644
index 5ca06d6..0000000
--- a/docs/run-length.html
+++ /dev/null
@@ -1,2566 +0,0 @@
-<!DOCTYPE HTML>
-<html lang="en-US">
-<head>
- <meta charset="UTF-8">
- <title>Run Length Encoding</title>
- <meta name="viewport" content="width=device-width,initial-scale=1">
- <meta name="generator" content="Jekyll v2.4.0">
- <link rel="stylesheet" href="//fonts.googleapis.com/css?family=Lato:300,300italic,400,400italic,700,700italic,900">
- <link rel="stylesheet" href="/css/screen.css">
- <link rel="icon" type="image/x-icon" href="/favicon.ico">
- <!--[if lt IE 9]>
- <script src="/js/html5shiv.min.js"></script>
- <script src="/js/respond.min.js"></script>
- <![endif]-->
-</head>
-
-
-<body class="wrap">
- <header role="banner">
- <nav class="mobile-nav show-on-mobiles">
- <ul>
- <li class="">
- <a href="/">Home</a>
- </li>
- <li class="current">
- <a href="/docs/"><span class="show-on-mobiles">Docs</span>
- <span class="hide-on-mobiles">Documentation</span></a>
- </li>
- <li class="">
- <a href="/talks/">Talks</a>
- </li>
- <li class="">
- <a href="/news/">News</a>
- </li>
- <li class="">
- <a href="/help/">Help</a>
- </li>
- <li class="">
- <a href="/develop/">Develop</a>
- </li>
-</ul>
-
- </nav>
- <div class="grid">
- <div class="unit one-third center-on-mobiles">
- <h1>
- <a href="/">
- <span class="sr-only">Apache ORC</span>
- <img src="/img/logo.png" width="249" height="101" alt="ORC Logo">
- </a>
- </h1>
- </div>
- <nav class="main-nav unit two-thirds hide-on-mobiles">
- <ul>
- <li class="">
- <a href="/">Home</a>
- </li>
- <li class="current">
- <a href="/docs/"><span class="show-on-mobiles">Docs</span>
- <span class="hide-on-mobiles">Documentation</span></a>
- </li>
- <li class="">
- <a href="/talks/">Talks</a>
- </li>
- <li class="">
- <a href="/news/">News</a>
- </li>
- <li class="">
- <a href="/help/">Help</a>
- </li>
- <li class="">
- <a href="/develop/">Develop</a>
- </li>
-</ul>
-
- </nav>
- </div>
-</header>
-
-
- <section class="docs">
- <div class="grid">
-
- <div class="docs-nav-mobile unit whole show-on-mobiles">
- <select onchange="if (this.value) window.location.href=this.value">
- <option value="">Navigate the docs…</option>
-
- <optgroup label="Overview">
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/index.html">Background</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/adopters.html">ORC Adopters</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/types.html">Types</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/indexes.html">Indexes</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/acid.html">ACID support</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- </optgroup>
-
- <optgroup label="Installing">
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/building.html">Building ORC</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/releases.html">Releases</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- </optgroup>
-
- <optgroup label="Using in Hive">
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/hive-ddl.html">Hive DDL</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/hive-config.html">Hive Configuration</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- </optgroup>
-
- <optgroup label="Using in MapReduce">
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/mapred.html">Using in MapRed</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/mapreduce.html">Using in MapReduce</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- </optgroup>
-
- <optgroup label="Using ORC Core">
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/core-java.html">Using Core Java</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/core-cpp.html">Using Core C++</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- </optgroup>
-
- <optgroup label="Tools">
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/cpp-tools.html">C++ Tools</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/java-tools.html">Java Tools</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- </optgroup>
-
- <optgroup label="Format Specification">
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/spec-intro.html">Introduction</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/file-tail.html">File Tail</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/compression.html">Compression</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/run-length.html">Run Length Encoding</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/stripes.html">Stripes</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/encodings.html">Column Encodings</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/spec-index.html">Indexes</option>
-
-
-
-
-
-
-
-
-
-
- </optgroup>
-
- </select>
-</div>
-
-
- <div class="unit four-fifths">
- <article>
- <h1>Run Length Encoding</h1>
- <h1 id="base-128-varint">Base 128 Varint</h1>
-
-<p>Variable width integer encodings take advantage of the fact that most
-numbers are small and that having smaller encodings for small numbers
-shrinks the overall size of the data. ORC uses the varint format from
-Protocol Buffers, which writes data in little endian format using the
-low 7 bits of each byte. The high bit in each byte is set if the
-number continues into the next byte.</p>
-
-<table>
- <thead>
- <tr>
- <th style="text-align: left">Unsigned Original</th>
- <th style="text-align: left">Serialized</th>
- </tr>
- </thead>
- <tbody>
- <tr>
- <td style="text-align: left">0</td>
- <td style="text-align: left">0x00</td>
- </tr>
- <tr>
- <td style="text-align: left">1</td>
- <td style="text-align: left">0x01</td>
- </tr>
- <tr>
- <td style="text-align: left">127</td>
- <td style="text-align: left">0x7f</td>
- </tr>
- <tr>
- <td style="text-align: left">128</td>
- <td style="text-align: left">0x80, 0x01</td>
- </tr>
- <tr>
- <td style="text-align: left">129</td>
- <td style="text-align: left">0x81, 0x01</td>
- </tr>
- <tr>
- <td style="text-align: left">16,383</td>
- <td style="text-align: left">0xff, 0x7f</td>
- </tr>
- <tr>
- <td style="text-align: left">16,384</td>
- <td style="text-align: left">0x80, 0x80, 0x01</td>
- </tr>
- <tr>
- <td style="text-align: left">16,385</td>
- <td style="text-align: left">0x81, 0x80, 0x01</td>
- </tr>
- </tbody>
-</table>
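<p>The varint rule above can be sketched in a few lines of Python. This is an illustrative sketch only, not the ORC reader or writer code; the function names <code>encode_varint</code> and <code>decode_varint</code> are hypothetical and chosen for this example.</p>

```python
from typing import Tuple


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base 128 varint (low 7 bits first)."""
    if value < 0:
        raise ValueError("varints hold unsigned values; zigzag-encode signed ones first")
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # high bit set: the number continues
        else:
            out.append(low7)         # high bit clear: last byte
            return bytes(out)


def decode_varint(data: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode one varint starting at pos; return (value, position after it)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7
```

<p>Under these assumptions, <code>encode_varint(16384)</code> yields the bytes 0x80, 0x80, 0x01, matching the table row for 16,384.</p>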
-
-<p>For signed integer types, the number is converted into an unsigned
-number using a zigzag encoding. Zigzag encoding moves the sign bit to
-the least significant bit using the expression (val &lt;&lt; 1) ^ (val &gt;&gt;
-63) and derives its name from the fact that positive and negative
-numbers alternate once encoded. The unsigned number is then serialized
-as above.</p>
-
-<table>
- <thead>
- <tr>
- <th style="text-align: left">Signed Original</th>
- <th style="text-align: left">Unsigned</th>
- </tr>
- </thead>
- <tbody>
- <tr>
- <td style="text-align: left">0</td>
- <td style="text-align: left">0</td>
- </tr>
- <tr>
- <td style="text-align: left">-1</td>
- <td style="text-align: left">1</td>
- </tr>
- <tr>
- <td style="text-align: left">1</td>
- <td style="text-align: left">2</td>
- </tr>
- <tr>
- <td style="text-align: left">-2</td>
- <td style="text-align: left">3</td>
- </tr>
- <tr>
- <td style="text-align: left">2</td>
- <td style="text-align: left">4</td>
- </tr>
- </tbody>
-</table>
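<p>A minimal sketch of the zigzag mapping, assuming 64-bit signed inputs. The names <code>zigzag_encode</code>, <code>zigzag_decode</code>, and <code>MASK64</code> are hypothetical. Python integers are unbounded, so the result is masked to 64 bits to mimic a fixed-width long.</p>

```python
MASK64 = (1 << 64) - 1  # keep results within an unsigned 64-bit range


def zigzag_encode(val: int) -> int:
    """Map a signed 64-bit integer to unsigned: 0, -1, 1, -2, 2 -> 0, 1, 2, 3, 4."""
    # Python's >> on negative ints is an arithmetic shift, so val >> 63
    # is -1 (all ones) for negatives and 0 otherwise.
    return ((val << 1) ^ (val >> 63)) & MASK64


def zigzag_decode(encoded: int) -> int:
    """Invert zigzag_encode."""
    return (encoded >> 1) ^ -(encoded & 1)
```

<p>The encoded value is what would then be written as a base 128 varint.</p>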
-
-<h1 id="byte-run-length-encoding">Byte Run Length Encoding</h1>
-
-<p>For byte streams, ORC uses a very lightweight encoding of identical
-values.</p>
-
-<ul>
- <li>Run - a sequence of at least 3 identical values</li>
- <li>Literals - a sequence of non-identical values</li>
-</ul>
-
-<p>The first byte of each group of values is a header that determines
-whether it is a run (value between 0 and 127) or literal list (value
-between -128 and -1). For runs, the control byte is the length of the
-run minus the length of the minimal run (3) and the control byte for
-literal lists is the negative length of the list. For example, a
-hundred 0’s is encoded as [0x61, 0x00] and the sequence 0x44, 0x45
-would be encoded as [0xfe, 0x44, 0x45]. The next group can choose
-either of the encodings.</p>
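<p>The decoding side of byte run length encoding can be sketched as follows. This is a hedged illustration, not the ORC implementation; <code>decode_byte_rle</code> is a hypothetical name, and the sketch does no validation beyond what indexing gives for free.</p>

```python
def decode_byte_rle(data: bytes) -> bytes:
    """Expand a byte-RLE stream of runs and literal lists."""
    out = bytearray()
    pos = 0
    while pos < len(data):
        control = data[pos]
        pos += 1
        if control < 0x80:
            # Run: control byte is (run length - 3), followed by the repeated value.
            out.extend(bytes([data[pos]]) * (control + 3))
            pos += 1
        else:
            # Literal list: control byte is the negative length as a signed byte.
            count = 0x100 - control
            out.extend(data[pos:pos + count])
            pos += count
    return bytes(out)
```

<p>With this sketch, <code>decode_byte_rle(bytes([0x61, 0x00]))</code> expands to one hundred zero bytes, and <code>decode_byte_rle(bytes([0xfe, 0x44, 0x45]))</code> returns the two literal bytes 0x44, 0x45.</p>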
-
-<h1 id="boolean-run-length-encoding">Boolean Run Length Encoding</h1>
-
-<p>For encoding boolean types, the bits are put in the bytes from most
-significant to least significant. The bytes are encoded using byte run
-length encoding as described in the previous section. For example,
-the byte sequence [0xff, 0x80] would be one true followed by
-seven false values.</p>
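<p>As a sketch of the bit ordering, the helper below unpacks booleans from bytes that have already been through byte run length decoding. The name <code>unpack_booleans</code> and the explicit <code>count</code> parameter are assumptions made for this example.</p>

```python
def unpack_booleans(packed: bytes, count: int) -> list:
    """Unpack `count` booleans, most significant bit of each byte first."""
    bits = []
    for byte in packed:
        for shift in range(7, -1, -1):  # bit 7 down to bit 0
            bits.append(bool((byte >> shift) & 1))
    return bits[:count]
```

<p>In the example above, byte run length decoding of [0xff, 0x80] gives the single byte 0x80, and <code>unpack_booleans(b"\x80", 8)</code> gives one true followed by seven false values.</p>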
-
-<h1 id="integer-run-length-encoding-version-1">Integer Run Length Encoding, version 1</h1>
-
-<p>In Hive 0.11, ORC files used Run Length Encoding version 1 (RLEv1),
-which provides a lightweight compression of signed or unsigned integer
-sequences. RLEv1 has two sub-encodings:</p>
-
-<ul>
- <li>Run - a sequence of values that differ by a small fixed delta</li>
- <li>Literals - a sequence of varint encoded values</li>
-</ul>
-
-<p>Runs start with an initial byte of 0x00 to 0x7f, which encodes the
-length of the run - 3. A second byte provides the fixed delta in the
-range of -128 to 127. Finally, the first value of the run is encoded
-as a base 128 varint.</p>
-
-<p>For example, if the sequence is 100 instances of 7 the encoding would
-start with 100 - 3, followed by a delta of 0, and a varint of 7 for
-an encoding of [0x61, 0x00, 0x07]. To encode the sequence of numbers
-running from 100 to 1, the first byte is 100 - 3, the delta is -1,
-and the varint is 100 for an encoding of [0x61, 0xff, 0x64].</p>
-
-<p>Literals start with an initial byte of 0x80 to 0xff, which corresponds
-to the negative of the number of literals in the sequence. Following the
-header byte, the list of N varints is encoded. Thus, if there are
-no runs, the overhead is 1 byte for each 128 integers. The first 5
-prime numbers [2, 3, 5, 7, 11] would be encoded as [0xfb, 0x02, 0x03,
-0x05, 0x07, 0x0b].</p>
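<p>A decoder for the RLEv1 layout described above might look like the sketch below. It is an illustration under stated assumptions rather than the ORC reader: <code>decode_rle_v1</code> and its <code>signed</code> flag are hypothetical names, and signed values are assumed to be zigzag encoded before being written as varints.</p>

```python
def decode_rle_v1(data: bytes, signed: bool = False) -> list:
    """Decode an RLEv1 integer stream made of runs and literal lists."""

    def read_varint(pos: int):
        # Base 128 varint: low 7 bits first, high bit means "more bytes".
        result, shift = 0, 0
        while True:
            byte = data[pos]
            pos += 1
            result |= (byte & 0x7F) << shift
            if not byte & 0x80:
                break
            shift += 7
        if signed:
            result = (result >> 1) ^ -(result & 1)  # undo zigzag
        return result, pos

    values = []
    pos = 0
    while pos < len(data):
        control = data[pos]
        pos += 1
        if control < 0x80:
            # Run: length - 3, then a signed one-byte delta, then the first value.
            length = control + 3
            delta = data[pos] - 0x100 if data[pos] >= 0x80 else data[pos]
            pos += 1
            base, pos = read_varint(pos)
            values.extend(base + i * delta for i in range(length))
        else:
            # Literals: negative count as a signed byte, then that many varints.
            for _ in range(0x100 - control):
                value, pos = read_varint(pos)
                values.append(value)
    return values
```

<p>With this sketch, [0x61, 0x00, 0x07] decodes to one hundred 7s and [0x61, 0xff, 0x64] decodes to the numbers from 100 down to 1.</p>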
-
-<h1 id="integer-run-length-encoding-version-2">Integer Run Length Encoding, version 2</h1>
-
-<p>In Hive 0.12, ORC introduced Run Length Encoding version 2 (RLEv2),
-which has improved compression and fixed bit width encodings for
-faster expansion. RLEv2 uses four sub-encodings based on the data:</p>
-
-<ul>
- <li>Short Repeat - used for short sequences with repeated values</li>
- <li>Direct - used for random sequences with a fixed bit width</li>
- <li>Patched Base - used for random sequences with a variable bit width</li>
- <li>Delta - used for monotonically increasing or decreasing sequences</li>
-</ul>
-
-<h2 id="short-repeat">Short Repeat</h2>
-
-<p>The short repeat encoding is used for short repeating integer
-sequences with the goal of minimizing the overhead of the header. All
-of the bits listed in the header are from the first byte to the last
-and from most significant bit to least significant bit. If the type is
-signed, the value is zigzag encoded.</p>
-
-<ul>
- <li>1 byte header
- <ul>
- <li>2 bits for encoding type (0)</li>
- <li>3 bits for width (W) of repeating value (1 to 8 bytes)</li>
- <li>3 bits for repeat count (3 to 10 values)</li>
- </ul>
- </li>
-  <li>W bytes in big endian format, which is zigzag encoded if the type
-is signed</li>
-</ul>
-
-<p>The unsigned sequence of [10000, 10000, 10000, 10000, 10000] would be
-serialized with short repeat encoding (0), a width of 2 bytes (1), and
-repeat count of 5 (2) as [0x0a, 0x27, 0x10].</p>
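<p>The header packing for short repeat can be sketched as below. The function name <code>encode_short_repeat</code> is hypothetical, and the sketch assumes the value is already unsigned (that is, zigzag encoded beforehand if the column type is signed).</p>

```python
def encode_short_repeat(value: int, count: int) -> bytes:
    """Encode `count` copies of an unsigned value with RLEv2 short repeat."""
    if not 3 <= count <= 10:
        raise ValueError("short repeat covers 3 to 10 values")
    if not 0 <= value < (1 << 64):
        raise ValueError("value must fit in 8 bytes")
    width = max(1, (value.bit_length() + 7) // 8)  # bytes needed, 1 to 8
    # 2 bits encoding type (0), 3 bits width - 1, 3 bits count - 3.
    header = (0 << 6) | ((width - 1) << 3) | (count - 3)
    return bytes([header]) + value.to_bytes(width, "big")
```

<p>Under these assumptions, <code>encode_short_repeat(10000, 5)</code> produces [0x0a, 0x27, 0x10], the same bytes as the example above.</p>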
-
-<h2 id="direct">Direct</h2>
-
-<p>The direct encoding is used for integer sequences whose values have a
-relatively constant bit width. It encodes the values directly using a
-fixed width big endian encoding. The width of the values is encoded
-using the table below.</p>
-
-<p>The 5 bit width encoding table for RLEv2:</p>
-
-<table>
- <thead>
- <tr>
- <th style="text-align: left">Width in Bits</th>
- <th style="text-align: left">Encoded Value</th>
- <th style="text-align: left">Notes</th>
- </tr>
- </thead>
- <tbody>
- <tr>
- <td style="text-align: left">0</td>
- <td style="text-align: left">0</td>
- <td style="text-align: left">for delta encoding</td>
- </tr>
- <tr>
- <td style="text-align: left">1</td>
- <td style="text-align: left">0</td>
- <td style="text-align: left">for non-delta encoding</td>
- </tr>
- <tr>
- <td style="text-align: left">2</td>
- <td style="text-align: left">1</td>
- <td style="text-align: left"> </td>
- </tr>
- <tr>
- <td style="text-align: left">4</td>
- <td style="text-align: left">3</td>
- <td style="text-align: left"> </td>
- </tr>
- <tr>
- <td style="text-align: left">8</td>
- <td style="text-align: left">7</td>
- <td style="text-align: left"> </td>
- </tr>
- <tr>
- <td style="text-align: left">16</td>
- <td style="text-align: left">15</td>
- <td style="text-align: left"> </td>
- </tr>
- <tr>
- <td style="text-align: left">24</td>
- <td style="text-align: left">23</td>
- <td style="text-align: left"> </td>
- </tr>
- <tr>
- <td style="text-align: left">32</td>
- <td style="text-align: left">27</td>
- <td style="text-align: left"> </td>
- </tr>
- <tr>
- <td style="text-align: left">40</td>
- <td style="text-align: left">28</td>
- <td style="text-align: left"> </td>
- </tr>
- <tr>
- <td style="text-align: left">48</td>
- <td style="text-align: left">29</td>
- <td style="text-align: left"> </td>
- </tr>
- <tr>
- <td style="text-align: left">56</td>
- <td style="text-align: left">30</td>
- <td style="text-align: left"> </td>
- </tr>
- <tr>
- <td style="text-align: left">64</td>
- <td style="text-align: left">31</td>
- <td style="text-align: left"> </td>
- </tr>
- <tr>
- <td style="text-align: left">3</td>
- <td style="text-align: left">2</td>
- <td style="text-align: left">deprecated</td>
- </tr>
- <tr>
- <td style="text-align: left">5 <= x <= 7</td>
- <td style="text-align: left">x - 1</td>
- <td style="text-align: left">deprecated</td>
- </tr>
- <tr>
- <td style="text-align: left">9 <= x <= 15</td>
- <td style="text-align: left">x - 1</td>
- <td style="text-align: left">deprecated</td>
- </tr>
- <tr>
- <td style="text-align: left">17 <= x <= 21</td>
- <td style="text-align: left">x - 1</td>
- <td style="text-align: left">deprecated</td>
- </tr>
- <tr>
- <td style="text-align: left">26</td>
- <td style="text-align: left">24</td>
- <td style="text-align: left">deprecated</td>
- </tr>
- <tr>
- <td style="text-align: left">28</td>
- <td style="text-align: left">25</td>
- <td style="text-align: left">deprecated</td>
- </tr>
- <tr>
- <td style="text-align: left">30</td>
- <td style="text-align: left">26</td>
- <td style="text-align: left">deprecated</td>
- </tr>
- </tbody>
-</table>
-
-<ul>
- <li>2 bytes header
- <ul>
- <li>2 bits for encoding type (1)</li>
- <li>5 bits for encoded width (W) of values (1 to 64 bits) using the 5 bit
-width encoding table</li>
- <li>9 bits for length (L) (1 to 512 values)</li>
- </ul>
- </li>
- <li>W * L bits (padded to the next byte) encoded in big endian format; the
-values are zigzag encoded if the type is signed</li>
-</ul>
-
-<p>The unsigned sequence of [23713, 43806, 57005, 48879] would be
-serialized with direct encoding (1), a width of 16 bits (15), and
-length of 4 (3) as [0x5e, 0x03, 0x5c, 0xa1, 0xab, 0x1e, 0xde, 0xad,
-0xbe, 0xef].</p>
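The direct-encoding example above can be reproduced with a short sketch. This is an illustration only, not the reference writer: `WIDTH_CODE`, `pack_bits`, and `encode_direct` are hypothetical names, the table holds just the non-deprecated rows of the 5 bit width encoding table, and only unsigned input is handled (signed values would need zigzag encoding first).

```python
# Non-deprecated rows of the 5 bit width encoding table: width in bits -> code.
WIDTH_CODE = {1: 0, 2: 1, 4: 3, 8: 7, 16: 15, 24: 23,
              32: 27, 40: 28, 48: 29, 56: 30, 64: 31}


def pack_bits(values, width):
    """Pack each value into `width` bits, big endian, zero-padding to a byte."""
    acc = 0
    for v in values:
        acc = (acc << width) | v
    total = width * len(values)
    pad = -total % 8  # bits needed to reach the next byte boundary
    return (acc << pad).to_bytes((total + pad) // 8, "big")


def encode_direct(values, width):
    """Hypothetical helper: direct-encode unsigned values at a fixed width."""
    assert 1 <= len(values) <= 512
    assert all(0 <= v < (1 << width) for v in values)
    # 2 bits encoding type (1), 5 bits width code, 9 bits (length - 1).
    header = (1 << 14) | (WIDTH_CODE[width] << 9) | (len(values) - 1)
    return header.to_bytes(2, "big") + pack_bits(values, width)


encoded = encode_direct([23713, 43806, 57005, 48879], 16)
print(encoded.hex())  # 5e035ca1ab1edeadbeef
```

The header is built as one 16-bit integer so that the 9-bit length field can straddle the two header bytes without special casing.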
-
-<h2 id="patched-base">Patched Base</h2>
-
-<p>The patched base encoding is used for integer sequences whose bit
-widths vary a lot. The minimum signed value of the sequence is found
-and subtracted from every value. The bit width of those adjusted
-values is analyzed and the 90th percentile of the bit width is chosen
-as W. The 10% of values wider than W bits use patches from a patch list
-to set the additional bits. Patches are encoded as a list of gaps in
-the index values and the additional value bits.</p>
-
-<ul>
- <li>4 bytes header
- <ul>
- <li>2 bits for encoding type (2)</li>
- <li>5 bits for encoded width (W) of values (1 to 64 bits) using the 5 bit
- width encoding table</li>
- <li>9 bits for length (L) (1 to 512 values)</li>
- <li>3 bits for base value width (BW) (1 to 8 bytes)</li>
- <li>5 bits for patch width (PW) (1 to 64 bits) using the 5 bit width
-encoding table</li>
- <li>3 bits for patch gap width (PGW) (1 to 8 bits)</li>
- <li>5 bits for patch list length (PLL) (0 to 31 patches)</li>
- </ul>
- </li>
- <li>Base value (BW bytes) - The base value is stored as a big endian value
-with negative values marked by the most significant bit set. If that
-bit is set, the entire value is negated.</li>
- <li>Data values (W * L bits padded to the byte) - A sequence of W bit positive
-values that are added to the base value.</li>
- <li>Patch list (PLL * (PGW + PW) bits padded to the byte) - A list of
-patches for values that didn’t fit within W bits. Each entry in the list
-consists of a gap, which is the number of elements skipped from the previous
-patch, and a patch value. Patches are applied by logically or’ing
-the data values with the relevant patch shifted W bits left. If a
-patch is 0, it was introduced to skip over more than 255 items. The
-combined length of each patch (PGW + PW) must be less than or equal to
-64.</li>
-</ul>
-
-<p>The unsigned sequence of [2030, 2000, 2020, 1000000, 2040, 2050, 2060, 2070,
-2080, 2090, 2100, 2110, 2120, 2130, 2140, 2150, 2160, 2170, 2180, 2190]
-has a minimum of 2000, which makes the adjusted
-sequence [30, 0, 20, 998000, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140,
-150, 160, 170, 180, 190]. It has an
-encoding of patched base (2), a bit width of 8 (7), a length of 20
-(19), a base value width of 2 bytes (1), a patch width of 12 bits (11),
-patch gap width of 2 bits (1), and a patch list length of 1 (1). The
-base value is 2000 and the combined result is [0x8e, 0x13, 0x2b, 0x21, 0x07,
-0xd0, 0x1e, 0x00, 0x14, 0x70, 0x28, 0x32, 0x3c, 0x46, 0x50, 0x5a, 0x64, 0x6e,
-0x78, 0x82, 0x8c, 0x96, 0xa0, 0xaa, 0xb4, 0xbe, 0xfc, 0xe8].</p>
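To make the layout concrete, the following sketch decodes the example bytes above back into the original sequence. It is a minimal reading of the description, not the reference reader: `CODE_TO_WIDTH`, `read_bits`, and `decode_patched_base` are hypothetical names, the width table is built only from the rows listed in the 5 bit width encoding table, and each patch entry is assumed to occupy exactly PGW + PW bits, as in the example.

```python
# Inverse of the 5 bit width encoding table, built only from the rows listed
# there: codes 0..20 map to widths 1..21, the remaining codes are explicit.
# Codes 21 and 22 are not listed in the table, so they are left out here.
CODE_TO_WIDTH = {code: code + 1 for code in range(21)}
CODE_TO_WIDTH.update({23: 24, 24: 26, 25: 28, 26: 30, 27: 32,
                      28: 40, 29: 48, 30: 56, 31: 64})


def read_bits(buf, bit_pos, width):
    """Read `width` bits starting at absolute bit offset `bit_pos`, big endian."""
    value = 0
    for i in range(bit_pos, bit_pos + width):
        value = (value << 1) | ((buf[i // 8] >> (7 - i % 8)) & 1)
    return value


def decode_patched_base(buf):
    """Hypothetical helper: decode a single patched base run."""
    assert buf[0] >> 6 == 2                       # encoding type
    w = CODE_TO_WIDTH[(buf[0] >> 1) & 0x1F]       # value width in bits
    length = (((buf[0] & 1) << 8) | buf[1]) + 1   # number of values
    bw = (buf[2] >> 5) + 1                        # base width in bytes
    pw = CODE_TO_WIDTH[buf[2] & 0x1F]             # patch width in bits
    pgw = (buf[3] >> 5) + 1                       # patch gap width in bits
    pll = buf[3] & 0x1F                           # patch list length

    pos = 4
    base = int.from_bytes(buf[pos:pos + bw], "big")
    sign_bit = 1 << (bw * 8 - 1)
    if base & sign_bit:                           # sign-magnitude base value
        base = -(base & (sign_bit - 1))
    pos += bw

    data = [read_bits(buf, pos * 8 + i * w, w) for i in range(length)]
    pos += (w * length + 7) // 8                  # data is padded to the byte

    index = 0
    for i in range(pll):
        entry = pos * 8 + i * (pgw + pw)
        index += read_bits(buf, entry, pgw)       # gap since the previous patch
        # A zero patch only advances the index, so or'ing it in is harmless.
        data[index] |= read_bits(buf, entry + pgw, pw) << w
    return [base + v for v in data]


example = bytes([0x8e, 0x13, 0x2b, 0x21, 0x07, 0xd0, 0x1e, 0x00, 0x14, 0x70,
                 0x28, 0x32, 0x3c, 0x46, 0x50, 0x5a, 0x64, 0x6e, 0x78, 0x82,
                 0x8c, 0x96, 0xa0, 0xaa, 0xb4, 0xbe, 0xfc, 0xe8])
decoded = decode_patched_base(example)
print(decoded[:5])  # [2030, 2000, 2020, 1000000, 2040]
```

In the example the single patch entry is `0xfc, 0xe8`: a 2-bit gap of 3 followed by the 12-bit patch `0xf3a`, which is or'ed above the low byte `0x70` to rebuild 998000.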
-
-<h2 id="delta">Delta</h2>
-
-<p>The Delta encoding is used for monotonically increasing or decreasing
-sequences. The first two numbers in the sequence cannot be identical,
-because the encoding uses the sign of the first delta to determine
-if the series is increasing or decreasing.</p>
-
-<ul>
- <li>2 bytes header
- <ul>
- <li>2 bits for encoding type (3)</li>
- <li>5 bits for encoded width (W) of deltas (0 to 64 bits) using the 5 bit
-width encoding table</li>
- <li>9 bits for run length (L) (1 to 512 values)</li>
- </ul>
- </li>
- <li>Base value - encoded as (signed or unsigned) varint</li>
- <li>Delta base - encoded as signed varint</li>
- <li>Delta values (W * (L - 2) bits padded to the byte) - encode each delta
-after the first one. If the delta base is positive, the sequence is
-increasing and if it is negative the sequence is decreasing.</li>
-</ul>
-
-<p>The unsigned sequence of [2, 3, 5, 7, 11, 13, 17, 19, 23, 29] would be
-serialized with delta encoding (3), a width of 4 bits (3), length of
-10 (9), a base of 2 (2), and first delta of 1 (2). The resulting
-sequence is [0xc6, 0x09, 0x02, 0x02, 0x22, 0x42, 0x42, 0x46].</p>
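The delta example above can be checked with a small decoder sketch. The names `read_varint`, `unzigzag`, `read_bits`, and `decode_delta` are hypothetical, and two points are assumptions rather than statements from this section: varints are taken to be base 128 with the least significant group first, and a width code of 0 is taken to mean that every delta equals the delta base.

```python
# Inverse of the 5 bit width encoding table, built only from the listed rows.
CODE_TO_WIDTH = {code: code + 1 for code in range(21)}
CODE_TO_WIDTH.update({23: 24, 24: 26, 25: 28, 26: 30, 27: 32,
                      28: 40, 29: 48, 30: 56, 31: 64})


def read_varint(buf, pos):
    """Read an unsigned base 128 varint; return (value, next position)."""
    result, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte < 0x80:          # high bit clear marks the last group
            return result, pos
        shift += 7


def unzigzag(n):
    """Undo zigzag encoding: 0, 1, 2, 3 -> 0, -1, 1, -2."""
    return (n >> 1) ^ -(n & 1)


def read_bits(buf, bit_pos, width):
    """Read `width` bits starting at absolute bit offset `bit_pos`, big endian."""
    value = 0
    for i in range(bit_pos, bit_pos + width):
        value = (value << 1) | ((buf[i // 8] >> (7 - i % 8)) & 1)
    return value


def decode_delta(buf, signed=False):
    """Hypothetical helper: decode a single delta run."""
    assert buf[0] >> 6 == 3                        # encoding type
    code = (buf[0] >> 1) & 0x1F
    width = 0 if code == 0 else CODE_TO_WIDTH[code]  # code 0 means width 0 here
    length = (((buf[0] & 1) << 8) | buf[1]) + 1

    base, pos = read_varint(buf, 2)
    if signed:
        base = unzigzag(base)
    raw, pos = read_varint(buf, pos)
    delta_base = unzigzag(raw)                     # always a signed varint

    sign = -1 if delta_base < 0 else 1
    values = [base, base + delta_base]
    for i in range(length - 2):
        if width == 0:
            step = abs(delta_base)                 # assumption: fixed delta
        else:
            step = read_bits(buf, pos * 8 + i * width, width)
        values.append(values[-1] + sign * step)
    return values[:length]


primes = decode_delta(bytes([0xc6, 0x09, 0x02, 0x02, 0x22, 0x42, 0x42, 0x46]))
print(primes)  # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```

The stored deltas are unsigned magnitudes; the direction comes only from the sign of the delta base, which is why the first two values of a run must differ.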
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <div class="section-nav">
- <div class="left align-right">
-
-
-
- <a href="/docs/compression.html" class="prev">Back</a>
-
- </div>
- <div class="right align-left">
-
-
-
- <a href="/docs/stripes.html" class="next">Next</a>
-
- </div>
- </div>
- <div class="clear"></div>
-
-
- </article>
- </div>
-
- <div class="unit one-fifth hide-on-mobiles">
- <aside>
-
- <h4>Overview</h4>
-
-
-<ul>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/index.html">Background</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/adopters.html">ORC Adopters</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/types.html">Types</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/indexes.html">Indexes</a></li>
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/acid.html">ACID support</a></li>
-
-
-
-</ul>
-
-
- <h4>Installing</h4>
-
-
-<ul>
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/building.html">Building ORC</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/releases.html">Releases</a></li>
-
-
-
-</ul>
-
-
- <h4>Using in Hive</h4>
-
-
-<ul>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/hive-ddl.html">Hive DDL</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/hive-config.html">Hive Configuration</a></li>
-
-
-
-</ul>
-
-
- <h4>Using in MapReduce</h4>
-
-
-<ul>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/mapred.html">Using in MapRed</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/mapreduce.html">Using in MapReduce</a></li>
-
-
-
-</ul>
-
-
- <h4>Using ORC Core</h4>
-
-
-<ul>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/core-java.html">Using Core Java</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/core-cpp.html">Using Core C++</a></li>
-
-
-
-</ul>
-
-
- <h4>Tools</h4>
-
-
-<ul>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/cpp-tools.html">C++ Tools</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/java-tools.html">Java Tools</a></li>
-
-
-
-</ul>
-
-
- <h4>Format Specification</h4>
-
-
-<ul>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/spec-intro.html">Introduction</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/file-tail.html">File Tail</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/compression.html">Compression</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class="current"><a href="/docs/run-length.html">Run Length Encoding</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/stripes.html">Stripes</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/encodings.html">Column Encodings</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/spec-index.html">Indexes</a></li>
-
-
-
-</ul>
-
-
- </aside>
-</div>
-
-
- <div class="clear"></div>
-
- </div>
- </section>
-
-
- <footer role="contentinfo">
- <p>The contents of this website are © 2018
- <a href="https://www.apache.org/">Apache Software Foundation</a>
- under the terms of the <a
- href="https://www.apache.org/licenses/LICENSE-2.0.html">
- Apache License v2</a>. Apache ORC and its logo are trademarks
- of the Apache Software Foundation.</p>
-</footer>
-
- <script>
- var anchorForId = function (id) {
- var anchor = document.createElement("a");
- anchor.className = "header-link";
- anchor.href = "#" + id;
- anchor.innerHTML = "<span class=\"sr-only\">Permalink</span><i class=\"fa fa-link\"></i>";
- anchor.title = "Permalink";
- return anchor;
- };
-
- var linkifyAnchors = function (level, containingElement) {
- var headers = containingElement.getElementsByTagName("h" + level);
- for (var h = 0; h < headers.length; h++) {
- var header = headers[h];
-
- if (typeof header.id !== "undefined" && header.id !== "") {
- header.appendChild(anchorForId(header.id));
- }
- }
- };
-
- document.onreadystatechange = function () {
- if (this.readyState === "complete") {
- var contentBlock = document.getElementsByClassName("docs")[0] || document.getElementsByClassName("news")[0];
- if (!contentBlock) {
- return;
- }
- for (var level = 1; level <= 6; level++) {
- linkifyAnchors(level, contentBlock);
- }
- }
- };
-</script>
-
-
-</body>
-</html>
[9/9] orc git commit: Pushing ORC-339 reorganize the ORC file format
spec.
Posted by om...@apache.org.
Pushing ORC-339 reorganize the ORC file format spec.
Signed-off-by: Owen O'Malley <om...@apache.org>
Project: http://git-wip-us.apache.org/repos/asf/orc/repo
Commit: http://git-wip-us.apache.org/repos/asf/orc/commit/c6e29090
Tree: http://git-wip-us.apache.org/repos/asf/orc/tree/c6e29090
Diff: http://git-wip-us.apache.org/repos/asf/orc/diff/c6e29090
Branch: refs/heads/asf-site
Commit: c6e2909025381446398961f4ac1da61550cd13b5
Parents: c63412b
Author: Owen O'Malley <om...@apache.org>
Authored: Tue Apr 17 10:49:12 2018 -0700
Committer: Owen O'Malley <om...@apache.org>
Committed: Tue Apr 17 10:49:12 2018 -0700
----------------------------------------------------------------------
develop/index.html | 3 +
docs/acid.html | 1073 ++--------------
docs/adopters.html | 1185 ++---------------
docs/building.html | 1071 ++--------------
docs/compression.html | 2193 --------------------------------
docs/core-cpp.html | 1429 ++++-----------------
docs/core-java.html | 1083 ++--------------
docs/cpp-tools.html | 1523 +++++-----------------
docs/encodings.html | 2790 -----------------------------------------
docs/file-tail.html | 2477 ------------------------------------
docs/hive-config.html | 1075 ++--------------
docs/hive-ddl.html | 1145 ++---------------
docs/index.html | 1329 +++-----------------
docs/indexes.html | 1133 ++---------------
docs/java-tools.html | 1549 +++++------------------
docs/mapred.html | 1083 ++--------------
docs/mapreduce.html | 1085 ++--------------
docs/releases.html | 1071 ++--------------
docs/run-length.html | 2566 -------------------------------------
docs/spec-index.html | 2298 ---------------------------------
docs/spec-intro.html | 2180 --------------------------------
docs/stripes.html | 2257 ---------------------------------
docs/types.html | 1215 +++---------------
specification/ORCv0.html | 1260 +++++++++++++++++++
specification/ORCv1.html | 1744 ++++++++++++++++++++++++++
specification/ORCv2.html | 1769 ++++++++++++++++++++++++++
specification/index.html | 159 +++
27 files changed, 7126 insertions(+), 32619 deletions(-)
----------------------------------------------------------------------
http://git-wip-us.apache.org/repos/asf/orc/blob/c6e29090/develop/index.html
----------------------------------------------------------------------
diff --git a/develop/index.html b/develop/index.html
index e920320..d9224d0 100644
--- a/develop/index.html
+++ b/develop/index.html
@@ -87,6 +87,9 @@
<p>Information about the ORC project that is most important for
developers working on the project.</p>
+<p>The <a href="/specification">ORC format specification</a> defines the format
+to promote compatibility between implementations.</p>
+
<h2 id="development-community">Development community</h2>
<p>We have committers from many different companies. The full
http://git-wip-us.apache.org/repos/asf/orc/blob/c6e29090/docs/acid.html
----------------------------------------------------------------------
diff --git a/docs/acid.html b/docs/acid.html
index c460d41..71c980c 100644
--- a/docs/acid.html
+++ b/docs/acid.html
@@ -109,12 +109,6 @@
-
-
-
-
-
-
<option value="/docs/index.html">Background</option>
@@ -130,14 +124,6 @@
-
-
-
-
-
-
-
-
@@ -174,20 +160,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
-
-
@@ -221,20 +193,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
-
-
<option value="/docs/types.html">Types</option>
@@ -261,12 +219,6 @@
-
-
-
-
-
-
<option value="/docs/indexes.html">Indexes</option>
@@ -280,14 +232,6 @@
-
-
-
-
-
-
-
-
@@ -324,20 +268,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
-
-
</optgroup>
@@ -381,20 +311,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
-
-
@@ -426,25 +342,11 @@
-
-
-
-
-
-
<option value="/docs/releases.html">Releases</option>
-
-
-
-
-
-
-
-
</optgroup>
@@ -471,12 +373,6 @@
-
-
-
-
-
-
<option value="/docs/hive-ddl.html">Hive DDL</option>
@@ -494,14 +390,6 @@
-
-
-
-
-
-
-
-
@@ -519,12 +407,6 @@
-
-
-
-
-
-
<option value="/docs/hive-config.html">Hive Configuration</option>
@@ -544,14 +426,6 @@
-
-
-
-
-
-
-
-
</optgroup>
@@ -586,12 +460,6 @@
-
-
-
-
-
-
<option value="/docs/mapred.html">Using in MapRed</option>
@@ -601,14 +469,6 @@
-
-
-
-
-
-
-
-
@@ -638,12 +498,6 @@
-
-
-
-
-
-
<option value="/docs/mapreduce.html">Using in MapReduce</option>
@@ -651,14 +505,6 @@
-
-
-
-
-
-
-
-
</optgroup>
@@ -679,8 +525,6 @@
-
-
<option value="/docs/core-java.html">Using Core Java</option>
@@ -704,18 +548,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
@@ -727,8 +559,6 @@
-
-
<option value="/docs/core-cpp.html">Using Core C++</option>
@@ -754,18 +584,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
</optgroup>
@@ -788,8 +606,6 @@
-
-
<option value="/docs/cpp-tools.html">C++ Tools</option>
@@ -811,18 +627,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
@@ -848,12 +652,6 @@
-
-
-
-
-
-
<option value="/docs/java-tools.html">Java Tools</option>
@@ -865,386 +663,21 @@
-
-
-
-
-
-
-
-
</optgroup>
- <optgroup label="Format Specification">
-
+ </select>
+</div>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/spec-intro.html">Introduction</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/file-tail.html">File Tail</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/compression.html">Compression</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/run-length.html">Run Length Encoding</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/stripes.html">Stripes</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/encodings.html">Column Encodings</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/spec-index.html">Indexes</option>
-
-
-
-
-
-
-
-
-
-
- </optgroup>
-
- </select>
-</div>
-
-
- <div class="unit four-fifths">
- <article>
- <h1>ACID support</h1>
- <p>Historically, the only way to atomically add data to a table in Hive
-was to add a new partition. Updating or deleting data in a partition
-required removing the old partition and adding it back with the new
-data and it wasn’t possible to do atomically.</p>
+ <div class="unit four-fifths">
+ <article>
+ <h1>ACID support</h1>
+ <p>Historically, the only way to atomically add data to a table in Hive
+was to add a new partition. Updating or deleting data in a partition
+required removing the old partition and adding it back with the new
+data and it wasn’t possible to do atomically.</p>
<p>However, user’s data is continually changing and as Hive matured,
users required reliability guarantees despite the churning data
@@ -1465,270 +898,44 @@ file that don’t need to be read in this task.</p>
-
-
-
-
-
-
-
-
-
-
-
-
- <div class="section-nav">
- <div class="left align-right">
-
-
-
- <a href="/docs/indexes.html" class="prev">Back</a>
-
- </div>
- <div class="right align-left">
-
-
-
- <a href="/docs/building.html" class="next">Next</a>
-
- </div>
- </div>
- <div class="clear"></div>
-
-
- </article>
- </div>
-
- <div class="unit one-fifth hide-on-mobiles">
- <aside>
-
- <h4>Overview</h4>
-
-
-<ul>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/index.html">Background</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/adopters.html">ORC Adopters</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/types.html">Types</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/indexes.html">Indexes</a></li>
-
-
-
-
-
-
-
-
-
-
-
- <li class="current"><a href="/docs/acid.html">ACID support</a></li>
-
-
-
-</ul>
-
-
- <h4>Installing</h4>
-
-
-<ul>
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/building.html">Building ORC</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
+
-
-
+
-
-
+
-
-
+
-
+ <div class="section-nav">
+ <div class="left align-right">
+
+
+
+ <a href="/docs/indexes.html" class="prev">Back</a>
+
+ </div>
+ <div class="right align-left">
+
+
+
+ <a href="/docs/building.html" class="next">Next</a>
+
+ </div>
+ </div>
+ <div class="clear"></div>
- <li class=""><a href="/docs/releases.html">Releases</a></li>
-
+ </article>
+ </div>
-</ul>
-
+ <div class="unit one-fifth hide-on-mobiles">
+ <aside>
- <h4>Using in Hive</h4>
+ <h4>Overview</h4>
<ul>
@@ -1757,11 +964,7 @@ file that don’t need to be read in this task.</p>
-
-
-
-
- <li class=""><a href="/docs/hive-ddl.html">Hive DDL</a></li>
+ <li class=""><a href="/docs/index.html">Background</a></li>
@@ -1775,34 +978,10 @@ file that don’t need to be read in this task.</p>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/hive-config.html">Hive Configuration</a></li>
+ <li class=""><a href="/docs/adopters.html">ORC Adopters</a></li>
-</ul>
-
-
- <h4>Using in MapReduce</h4>
-
-
-<ul>
-
@@ -1839,7 +1018,7 @@ file that don’t need to be read in this task.</p>
- <li class=""><a href="/docs/mapred.html">Using in MapRed</a></li>
+ <li class=""><a href="/docs/types.html">Types</a></li>
@@ -1869,49 +1048,7 @@ file that don’t need to be read in this task.</p>
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/mapreduce.html">Using in MapReduce</a></li>
-
-
-
-</ul>
-
-
- <h4>Using ORC Core</h4>
-
-
-<ul>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/core-java.html">Using Core Java</a></li>
+ <li class=""><a href="/docs/indexes.html">Indexes</a></li>
@@ -1923,22 +1060,14 @@ file that don’t need to be read in this task.</p>
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/core-cpp.html">Using Core C++</a></li>
+ <li class="current"><a href="/docs/acid.html">ACID support</a></li>
</ul>
- <h4>Tools</h4>
+ <h4>Installing</h4>
<ul>
@@ -1955,15 +1084,7 @@ file that don’t need to be read in this task.</p>
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/cpp-tools.html">C++ Tools</a></li>
+ <li class=""><a href="/docs/building.html">Building ORC</a></li>
@@ -2001,14 +1122,14 @@ file that don’t need to be read in this task.</p>
- <li class=""><a href="/docs/java-tools.html">Java Tools</a></li>
+ <li class=""><a href="/docs/releases.html">Releases</a></li>
</ul>
- <h4>Format Specification</h4>
+ <h4>Using in Hive</h4>
<ul>
@@ -2035,31 +1156,7 @@ file that don’t need to be read in this task.</p>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/spec-intro.html">Introduction</a></li>
+ <li class=""><a href="/docs/hive-ddl.html">Hive DDL</a></li>
@@ -2083,31 +1180,17 @@ file that don’t need to be read in this task.</p>
-
-
-
-
- <li class=""><a href="/docs/file-tail.html">File Tail</a></li>
+ <li class=""><a href="/docs/hive-config.html">Hive Configuration</a></li>
-
-
-
-
-
+</ul>
-
-
-
-
-
-
+ <h4>Using in MapReduce</h4>
- <li class=""><a href="/docs/compression.html">Compression</a></li>
-
+<ul>
@@ -2139,19 +1222,7 @@ file that don’t need to be read in this task.</p>
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/run-length.html">Run Length Encoding</a></li>
+ <li class=""><a href="/docs/mapred.html">Using in MapRed</a></li>
@@ -2187,13 +1258,25 @@ file that don’t need to be read in this task.</p>
-
+ <li class=""><a href="/docs/mapreduce.html">Using in MapReduce</a></li>
+
+
+
+</ul>
+
-
+ <h4>Using ORC Core</h4>
+
+<ul>
+
+
+
+
+
@@ -2203,7 +1286,7 @@ file that don’t need to be read in this task.</p>
- <li class=""><a href="/docs/stripes.html">Stripes</a></li>
+ <li class=""><a href="/docs/core-java.html">Using Core Java</a></li>
@@ -2221,17 +1304,17 @@ file that don’t need to be read in this task.</p>
-
-
-
-
-
+ <li class=""><a href="/docs/core-cpp.html">Using Core C++</a></li>
+
+
+
+</ul>
+
-
+ <h4>Tools</h4>
- <li class=""><a href="/docs/encodings.html">Column Encodings</a></li>
-
+<ul>
@@ -2251,11 +1334,17 @@ file that don’t need to be read in this task.</p>
+ <li class=""><a href="/docs/cpp-tools.html">C++ Tools</a></li>
+
+
+
-
+
+
+
@@ -2277,7 +1366,7 @@ file that don’t need to be read in this task.</p>
- <li class=""><a href="/docs/spec-index.html">Indexes</a></li>
+ <li class=""><a href="/docs/java-tools.html">Java Tools</a></li>
http://git-wip-us.apache.org/repos/asf/orc/blob/c6e29090/docs/adopters.html
----------------------------------------------------------------------
diff --git a/docs/adopters.html b/docs/adopters.html
index b30ef6e..7ee402b 100644
--- a/docs/adopters.html
+++ b/docs/adopters.html
@@ -109,12 +109,6 @@
-
-
-
-
-
-
<option value="/docs/index.html">Background</option>
@@ -130,14 +124,6 @@
-
-
-
-
-
-
-
-
@@ -174,20 +160,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
-
-
@@ -221,20 +193,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
-
-
<option value="/docs/types.html">Types</option>
@@ -261,12 +219,6 @@
-
-
-
-
-
-
<option value="/docs/indexes.html">Indexes</option>
@@ -280,14 +232,6 @@
-
-
-
-
-
-
-
-
@@ -324,20 +268,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
-
-
</optgroup>
@@ -381,20 +311,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
-
-
@@ -426,25 +342,11 @@
-
-
-
-
-
-
<option value="/docs/releases.html">Releases</option>
-
-
-
-
-
-
-
-
</optgroup>
@@ -471,12 +373,6 @@
-
-
-
-
-
-
<option value="/docs/hive-ddl.html">Hive DDL</option>
@@ -494,14 +390,6 @@
-
-
-
-
-
-
-
-
@@ -519,12 +407,6 @@
-
-
-
-
-
-
<option value="/docs/hive-config.html">Hive Configuration</option>
@@ -544,14 +426,6 @@
-
-
-
-
-
-
-
-
</optgroup>
@@ -586,12 +460,6 @@
-
-
-
-
-
-
<option value="/docs/mapred.html">Using in MapRed</option>
@@ -601,14 +469,6 @@
-
-
-
-
-
-
-
-
@@ -638,12 +498,6 @@
-
-
-
-
-
-
<option value="/docs/mapreduce.html">Using in MapReduce</option>
@@ -651,14 +505,6 @@
-
-
-
-
-
-
-
-
</optgroup>
@@ -679,8 +525,6 @@
-
-
<option value="/docs/core-java.html">Using Core Java</option>
@@ -704,18 +548,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
@@ -727,8 +559,6 @@
-
-
<option value="/docs/core-cpp.html">Using Core C++</option>
@@ -754,18 +584,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
</optgroup>
@@ -788,8 +606,6 @@
-
-
<option value="/docs/cpp-tools.html">C++ Tools</option>
@@ -811,18 +627,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
@@ -848,12 +652,6 @@
-
-
-
-
-
-
<option value="/docs/java-tools.html">Java Tools</option>
@@ -865,717 +663,126 @@
-
-
-
-
-
-
-
-
</optgroup>
- <optgroup label="Format Specification">
-
+ </select>
+</div>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/spec-intro.html">Introduction</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/file-tail.html">File Tail</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/compression.html">Compression</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/run-length.html">Run Length Encoding</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/stripes.html">Stripes</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/encodings.html">Column Encodings</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/spec-index.html">Indexes</option>
-
-
-
-
-
-
-
-
-
-
- </optgroup>
-
- </select>
-</div>
-
-
- <div class="unit four-fifths">
- <article>
- <h1>ORC Adopters</h1>
- <p>If your company or tool uses ORC, please let us know so that we can update
-this page.</p>
-
-<h3 id="apache-hadoophttpshadoopapacheorg"><a href="https://hadoop.apache.org/">Apache Hadoop</a></h3>
-
-<p>ORC files have always supported reading and writing from Hadoop’s MapReduce,
-but with the ORC 1.1.0 release it is now easier than ever without pulling in
-Hive’s exec jar and all of its dependencies. OrcStruct now also implements
-WritableComparable and can be serialized through the MapReduce shuffle.</p>
-
-<h3 id="apache-hivehttpshiveapacheorg"><a href="https://hive.apache.org/">Apache Hive</a></h3>
-
-<p>Apache Hive was the original use case and home for ORC. ORC’s strong
-type system, advanced compression, column projection, predicate push
-down, and vectorization support make Hive <a href="https://hortonworks.com/blog/orcfile-in-hdp-2-better-compression-better-performance/">perform
-better</a>
-than any other format for your data.</p>
-
-<h3 id="apache-nifihttpsnifiapacheorg"><a href="https://nifi.apache.org/">Apache Nifi</a></h3>
-
-<p>Apache Nifi is <a href="https://issues.apache.org/jira/browse/NIFI-1663">adding
-support</a> for writing
-ORC files.</p>
-
-<h3 id="apache-pighttpspigapacheorg"><a href="https://pig.apache.org/">Apache Pig</a></h3>
-
-<p>Apache Pig added support for reading and writing ORC files in <a href="https://hortonworks.com/blog/announcing-apache-pig-0-14-0/">Pig
-14.0</a>.</p>
-
-<h3 id="apache-sparkhttpssparkapacheorg"><a href="https://spark.apache.org/">Apache Spark</a></h3>
-
-<p>Apache Spark has <a href="https://hortonworks.com/blog/bringing-orc-support-into-apache-spark/">added
-support</a>
-for reading and writing ORC files with support for column projection and
-predicate push down.</p>
-
-<h3 id="eelhttpsgithubcom51zeroeel-sdk"><a href="https://github.com/51zero/eel-sdk">EEL</a></h3>
-
-<p>EEL is a Scala BigData API that supports reading and writing data for
-various file formats and storage systems including to and from ORC. It
-is designed as an in-process low-level API for manipulating data. Data
-is lazily streamed from source to sink and using standard Scala
-operations such as map, flatMap and filter, it is especially suited
-for ETL style applications. EEL supports ORC predicate and projection
-pushdowns and correctly handles conversions from other formats including
-complex types such as maps, lists or nested structs. A typical use
-case would be to extract data from JDBC to ORC files housed in HDFS,
-or directly into Hive tables backed by an ORC file format.</p>
-
-<h3 id="facebookhttpsfacebookcom"><a href="https://facebook.com">Facebook</a></h3>
-
-<p>With more than 300 PB of data, Facebook was an <a href="https://code.facebook.com/posts/229861827208629/scaling-the-facebook-data-warehouse-to-300-pb/">early adopter of
-ORC</a> and quickly put it into production.</p>
-
-<h3 id="prestohttpsprestodbio"><a href="https://prestodb.io/">Presto</a></h3>
-
-<p>The Presto team has done a lot of work <a href="https://code.facebook.com/posts/370832626374903/even-faster-data-at-the-speed-of-presto-orc/">integrating
-ORC</a> into their SQL engine.</p>
-
-<h3 id="timberhttpstimberio"><a href="https://timber.io/">Timber</a></h3>
-
-<p>Timber adopted ORC for its S3-based logging platform that stores
-petabytes of log data. ORC has been key in ensuring a fast,
-cost-effective strategy for persisting and querying that data.</p>
-
-<h3 id="verticahttpwww8hpcomusensoftware-solutionsadvanced-sql-big-data-analytics"><a href="http://www8.hp.com/us/en/software-solutions/advanced-sql-big-data-analytics/">Vertica</a></h3>
-
-<p>HPE Vertica has contributed significantly to the ORC C++ library. ORC
-is a significant part of Vertica SQL-on-Hadoop (VSQLoH) which brings
-the performance, reliability and standards compliance of the Vertica
-Analytic Database to the Hadoop ecosystem.</p>
-
-
-
-
-
-
-
-
-
-
-
-
- <div class="section-nav">
- <div class="left align-right">
-
-
-
- <a href="/docs/index.html" class="prev">Back</a>
-
- </div>
- <div class="right align-left">
-
-
-
- <a href="/docs/types.html" class="next">Next</a>
-
- </div>
- </div>
- <div class="clear"></div>
-
-
- </article>
- </div>
-
- <div class="unit one-fifth hide-on-mobiles">
- <aside>
-
- <h4>Overview</h4>
-
-
-<ul>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/index.html">Background</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class="current"><a href="/docs/adopters.html">ORC Adopters</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/types.html">Types</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/indexes.html">Indexes</a></li>
-
+ <div class="unit four-fifths">
+ <article>
+ <h1>ORC Adopters</h1>
+ <p>If your company or tool uses ORC, please let us know so that we can update
+this page.</p>
+<h3 id="apache-hadoophttpshadoopapacheorg"><a href="https://hadoop.apache.org/">Apache Hadoop</a></h3>
-
+<p>ORC files have always supported reading and writing from Hadoop’s MapReduce,
+but with the ORC 1.1.0 release it is now easier than ever without pulling in
+Hive’s exec jar and all of its dependencies. OrcStruct now also implements
+WritableComparable and can be serialized through the MapReduce shuffle.</p>
-
-
-
+<h3 id="apache-hivehttpshiveapacheorg"><a href="https://hive.apache.org/">Apache Hive</a></h3>
-
-
- <li class=""><a href="/docs/acid.html">ACID support</a></li>
-
+<p>Apache Hive was the original use case and home for ORC. ORC’s strong
+type system, advanced compression, column projection, predicate push
+down, and vectorization support make Hive <a href="https://hortonworks.com/blog/orcfile-in-hdp-2-better-compression-better-performance/">perform
+better</a>
+than any other format for your data.</p>
+<h3 id="apache-nifihttpsnifiapacheorg"><a href="https://nifi.apache.org/">Apache Nifi</a></h3>
-</ul>
+<p>Apache Nifi is <a href="https://issues.apache.org/jira/browse/NIFI-1663">adding
+support</a> for writing
+ORC files.</p>
-
- <h4>Installing</h4>
-
+<h3 id="apache-pighttpspigapacheorg"><a href="https://pig.apache.org/">Apache Pig</a></h3>
-<ul>
+<p>Apache Pig added support for reading and writing ORC files in <a href="https://hortonworks.com/blog/announcing-apache-pig-0-14-0/">Pig
+0.14.0</a>.</p>
-
+<h3 id="apache-sparkhttpssparkapacheorg"><a href="https://spark.apache.org/">Apache Spark</a></h3>
-
-
-
+<p>Apache Spark has <a href="https://hortonworks.com/blog/bringing-orc-support-into-apache-spark/">added
+support</a>
+for reading and writing ORC files with support for column projection and
+predicate push down.</p>
+
+<h3 id="eelhttpsgithubcom51zeroeel-sdk"><a href="https://github.com/51zero/eel-sdk">EEL</a></h3>
+
+<p>EEL is a Scala BigData API that supports reading and writing data for
+various file formats and storage systems including to and from ORC. It
+is designed as an in-process, low-level API for manipulating data. Data
+is lazily streamed from source to sink and, because it uses standard Scala
+operations such as map, flatMap and filter, it is especially suited
+for ETL-style applications. EEL supports ORC predicate and projection
+pushdowns and correctly handles conversions from other formats including
+complex types such as maps, lists or nested structs. A typical use
+case would be to extract data from JDBC to ORC files housed in HDFS,
+or directly into Hive tables backed by an ORC file format.</p>
+
+<h3 id="facebookhttpsfacebookcom"><a href="https://facebook.com">Facebook</a></h3>
+
+<p>With more than 300 PB of data, Facebook was an <a href="https://code.facebook.com/posts/229861827208629/scaling-the-facebook-data-warehouse-to-300-pb/">early adopter of
+ORC</a> and quickly put it into production.</p>
+
+<h3 id="prestohttpsprestodbio"><a href="https://prestodb.io/">Presto</a></h3>
+
+<p>The Presto team has done a lot of work <a href="https://code.facebook.com/posts/370832626374903/even-faster-data-at-the-speed-of-presto-orc/">integrating
+ORC</a> into their SQL engine.</p>
+
+<h3 id="timberhttpstimberio"><a href="https://timber.io/">Timber</a></h3>
+
+<p>Timber adopted ORC for its S3-based logging platform that stores
+petabytes of log data. ORC has been key in ensuring a fast,
+cost-effective strategy for persisting and querying that data.</p>
+
+<h3 id="verticahttpwww8hpcomusensoftware-solutionsadvanced-sql-big-data-analytics"><a href="http://www8.hp.com/us/en/software-solutions/advanced-sql-big-data-analytics/">Vertica</a></h3>
+
+<p>HPE Vertica has contributed significantly to the ORC C++ library. ORC
+is a significant part of Vertica SQL-on-Hadoop (VSQLoH) which brings
+the performance, reliability and standards compliance of the Vertica
+Analytic Database to the Hadoop ecosystem.</p>
+
+
-
-
-
-
-
-
- <li class=""><a href="/docs/building.html">Building ORC</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
+
-
+ <div class="section-nav">
+ <div class="left align-right">
+
+
+
+ <a href="/docs/index.html" class="prev">Back</a>
+
+ </div>
+ <div class="right align-left">
+
+
+
+ <a href="/docs/types.html" class="next">Next</a>
+
+ </div>
+ </div>
+ <div class="clear"></div>
- <li class=""><a href="/docs/releases.html">Releases</a></li>
-
+ </article>
+ </div>
-</ul>
-
+ <div class="unit one-fifth hide-on-mobiles">
+ <aside>
- <h4>Using in Hive</h4>
+ <h4>Overview</h4>
<ul>
@@ -1604,11 +811,7 @@ Analytic Database to the Hadoop ecosystem.</p>
-
-
-
-
- <li class=""><a href="/docs/hive-ddl.html">Hive DDL</a></li>
+ <li class=""><a href="/docs/index.html">Background</a></li>
@@ -1622,34 +825,10 @@ Analytic Database to the Hadoop ecosystem.</p>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/hive-config.html">Hive Configuration</a></li>
+ <li class="current"><a href="/docs/adopters.html">ORC Adopters</a></li>
-</ul>
-
-
- <h4>Using in MapReduce</h4>
-
-
-<ul>
-
@@ -1686,7 +865,7 @@ Analytic Database to the Hadoop ecosystem.</p>
- <li class=""><a href="/docs/mapred.html">Using in MapRed</a></li>
+ <li class=""><a href="/docs/types.html">Types</a></li>
@@ -1716,49 +895,7 @@ Analytic Database to the Hadoop ecosystem.</p>
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/mapreduce.html">Using in MapReduce</a></li>
-
-
-
-</ul>
-
-
- <h4>Using ORC Core</h4>
-
-
-<ul>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/core-java.html">Using Core Java</a></li>
+ <li class=""><a href="/docs/indexes.html">Indexes</a></li>
@@ -1770,22 +907,14 @@ Analytic Database to the Hadoop ecosystem.</p>
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/core-cpp.html">Using Core C++</a></li>
+ <li class=""><a href="/docs/acid.html">ACID support</a></li>
</ul>
- <h4>Tools</h4>
+ <h4>Installing</h4>
<ul>
@@ -1802,15 +931,7 @@ Analytic Database to the Hadoop ecosystem.</p>
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/cpp-tools.html">C++ Tools</a></li>
+ <li class=""><a href="/docs/building.html">Building ORC</a></li>
@@ -1848,14 +969,14 @@ Analytic Database to the Hadoop ecosystem.</p>
- <li class=""><a href="/docs/java-tools.html">Java Tools</a></li>
+ <li class=""><a href="/docs/releases.html">Releases</a></li>
</ul>
- <h4>Format Specification</h4>
+ <h4>Using in Hive</h4>
<ul>
@@ -1882,31 +1003,7 @@ Analytic Database to the Hadoop ecosystem.</p>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/spec-intro.html">Introduction</a></li>
+ <li class=""><a href="/docs/hive-ddl.html">Hive DDL</a></li>
@@ -1930,31 +1027,17 @@ Analytic Database to the Hadoop ecosystem.</p>
-
-
-
-
- <li class=""><a href="/docs/file-tail.html">File Tail</a></li>
+ <li class=""><a href="/docs/hive-config.html">Hive Configuration</a></li>
-
-
-
-
-
+</ul>
-
-
-
-
-
-
+ <h4>Using in MapReduce</h4>
- <li class=""><a href="/docs/compression.html">Compression</a></li>
-
+<ul>
@@ -1986,19 +1069,7 @@ Analytic Database to the Hadoop ecosystem.</p>
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/run-length.html">Run Length Encoding</a></li>
+ <li class=""><a href="/docs/mapred.html">Using in MapRed</a></li>
@@ -2034,13 +1105,25 @@ Analytic Database to the Hadoop ecosystem.</p>
-
+ <li class=""><a href="/docs/mapreduce.html">Using in MapReduce</a></li>
+
+
+
+</ul>
+
-
+ <h4>Using ORC Core</h4>
+
+<ul>
+
+
+
+
+
@@ -2050,7 +1133,7 @@ Analytic Database to the Hadoop ecosystem.</p>
- <li class=""><a href="/docs/stripes.html">Stripes</a></li>
+ <li class=""><a href="/docs/core-java.html">Using Core Java</a></li>
@@ -2068,17 +1151,17 @@ Analytic Database to the Hadoop ecosystem.</p>
-
-
-
-
-
+ <li class=""><a href="/docs/core-cpp.html">Using Core C++</a></li>
+
+
+
+</ul>
+
-
+ <h4>Tools</h4>
- <li class=""><a href="/docs/encodings.html">Column Encodings</a></li>
-
+<ul>
@@ -2098,11 +1181,17 @@ Analytic Database to the Hadoop ecosystem.</p>
+ <li class=""><a href="/docs/cpp-tools.html">C++ Tools</a></li>
+
+
+
-
+
+
+
@@ -2124,7 +1213,7 @@ Analytic Database to the Hadoop ecosystem.</p>
- <li class=""><a href="/docs/spec-index.html">Indexes</a></li>
+ <li class=""><a href="/docs/java-tools.html">Java Tools</a></li>
http://git-wip-us.apache.org/repos/asf/orc/blob/c6e29090/docs/building.html
----------------------------------------------------------------------
diff --git a/docs/building.html b/docs/building.html
index bbe1ec4..378f541 100644
--- a/docs/building.html
+++ b/docs/building.html
@@ -109,12 +109,6 @@
-
-
-
-
-
-
<option value="/docs/index.html">Background</option>
@@ -130,14 +124,6 @@
-
-
-
-
-
-
-
-
@@ -174,20 +160,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
-
-
@@ -221,20 +193,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
-
-
<option value="/docs/types.html">Types</option>
@@ -261,12 +219,6 @@
-
-
-
-
-
-
<option value="/docs/indexes.html">Indexes</option>
@@ -280,14 +232,6 @@
-
-
-
-
-
-
-
-
@@ -324,20 +268,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
-
-
</optgroup>
@@ -381,20 +311,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
-
-
@@ -426,25 +342,11 @@
-
-
-
-
-
-
<option value="/docs/releases.html">Releases</option>
-
-
-
-
-
-
-
-
</optgroup>
@@ -471,12 +373,6 @@
-
-
-
-
-
-
<option value="/docs/hive-ddl.html">Hive DDL</option>
@@ -494,14 +390,6 @@
-
-
-
-
-
-
-
-
@@ -519,12 +407,6 @@
-
-
-
-
-
-
<option value="/docs/hive-config.html">Hive Configuration</option>
@@ -544,14 +426,6 @@
-
-
-
-
-
-
-
-
</optgroup>
@@ -586,12 +460,6 @@
-
-
-
-
-
-
<option value="/docs/mapred.html">Using in MapRed</option>
@@ -601,14 +469,6 @@
-
-
-
-
-
-
-
-
@@ -638,12 +498,6 @@
-
-
-
-
-
-
<option value="/docs/mapreduce.html">Using in MapReduce</option>
@@ -651,14 +505,6 @@
-
-
-
-
-
-
-
-
</optgroup>
@@ -679,8 +525,6 @@
-
-
<option value="/docs/core-java.html">Using Core Java</option>
@@ -704,18 +548,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
@@ -727,8 +559,6 @@
-
-
<option value="/docs/core-cpp.html">Using Core C++</option>
@@ -754,18 +584,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
</optgroup>
@@ -788,8 +606,6 @@
-
-
<option value="/docs/cpp-tools.html">C++ Tools</option>
@@ -811,18 +627,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
@@ -848,12 +652,6 @@
-
-
-
-
-
-
<option value="/docs/java-tools.html">Java Tools</option>
@@ -865,383 +663,18 @@
-
-
-
-
-
-
-
-
</optgroup>
- <optgroup label="Format Specification">
-
+ </select>
+</div>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/spec-intro.html">Introduction</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/file-tail.html">File Tail</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/compression.html">Compression</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/run-length.html">Run Length Encoding</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/stripes.html">Stripes</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/encodings.html">Column Encodings</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/spec-index.html">Indexes</option>
-
-
-
-
-
-
-
-
-
-
- </optgroup>
-
- </select>
-</div>
-
-
- <div class="unit four-fifths">
- <article>
- <h1>Building ORC</h1>
- <h2 id="building-both-c-and-java">Building both C++ and Java</h2>
+ <div class="unit four-fifths">
+ <article>
+ <h1>Building ORC</h1>
+ <h2 id="building-both-c-and-java">Building both C++ and Java</h2>
<p>The C++ library is supported on the following operating systems:</p>
@@ -1338,276 +771,50 @@ is invoking:</p>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <div class="section-nav">
- <div class="left align-right">
-
-
-
- <a href="/docs/acid.html" class="prev">Back</a>
-
- </div>
- <div class="right align-left">
-
-
-
- <a href="/docs/releases.html" class="next">Next</a>
-
- </div>
- </div>
- <div class="clear"></div>
-
-
- </article>
- </div>
-
- <div class="unit one-fifth hide-on-mobiles">
- <aside>
-
- <h4>Overview</h4>
-
-
-<ul>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/index.html">Background</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/adopters.html">ORC Adopters</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/types.html">Types</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/indexes.html">Indexes</a></li>
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/acid.html">ACID support</a></li>
-
-
-
-</ul>
-
-
- <h4>Installing</h4>
-
-
-<ul>
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class="current"><a href="/docs/building.html">Building ORC</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
+
-
-
+
-
-
+
-
-
+
-
-
+
-
-
+
-
+ <div class="section-nav">
+ <div class="left align-right">
+
+
+
+ <a href="/docs/acid.html" class="prev">Back</a>
+
+ </div>
+ <div class="right align-left">
+
+
+
+ <a href="/docs/releases.html" class="next">Next</a>
+
+ </div>
+ </div>
+ <div class="clear"></div>
- <li class=""><a href="/docs/releases.html">Releases</a></li>
-
+ </article>
+ </div>
-</ul>
-
+ <div class="unit one-fifth hide-on-mobiles">
+ <aside>
- <h4>Using in Hive</h4>
+ <h4>Overview</h4>
<ul>
@@ -1636,11 +843,7 @@ is invoking:</p>
-
-
-
-
- <li class=""><a href="/docs/hive-ddl.html">Hive DDL</a></li>
+ <li class=""><a href="/docs/index.html">Background</a></li>
@@ -1654,34 +857,10 @@ is invoking:</p>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/hive-config.html">Hive Configuration</a></li>
+ <li class=""><a href="/docs/adopters.html">ORC Adopters</a></li>
-</ul>
-
-
- <h4>Using in MapReduce</h4>
-
-
-<ul>
-
@@ -1718,7 +897,7 @@ is invoking:</p>
- <li class=""><a href="/docs/mapred.html">Using in MapRed</a></li>
+ <li class=""><a href="/docs/types.html">Types</a></li>
@@ -1748,49 +927,7 @@ is invoking:</p>
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/mapreduce.html">Using in MapReduce</a></li>
-
-
-
-</ul>
-
-
- <h4>Using ORC Core</h4>
-
-
-<ul>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/core-java.html">Using Core Java</a></li>
+ <li class=""><a href="/docs/indexes.html">Indexes</a></li>
@@ -1802,22 +939,14 @@ is invoking:</p>
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/core-cpp.html">Using Core C++</a></li>
+ <li class=""><a href="/docs/acid.html">ACID support</a></li>
</ul>
- <h4>Tools</h4>
+ <h4>Installing</h4>
<ul>
@@ -1834,15 +963,7 @@ is invoking:</p>
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/cpp-tools.html">C++ Tools</a></li>
+ <li class="current"><a href="/docs/building.html">Building ORC</a></li>
@@ -1880,14 +1001,14 @@ is invoking:</p>
- <li class=""><a href="/docs/java-tools.html">Java Tools</a></li>
+ <li class=""><a href="/docs/releases.html">Releases</a></li>
</ul>
- <h4>Format Specification</h4>
+ <h4>Using in Hive</h4>
<ul>
@@ -1914,31 +1035,7 @@ is invoking:</p>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/spec-intro.html">Introduction</a></li>
+ <li class=""><a href="/docs/hive-ddl.html">Hive DDL</a></li>
@@ -1962,31 +1059,17 @@ is invoking:</p>
-
-
-
-
- <li class=""><a href="/docs/file-tail.html">File Tail</a></li>
+ <li class=""><a href="/docs/hive-config.html">Hive Configuration</a></li>
-
-
-
-
-
+</ul>
-
-
-
-
-
-
+ <h4>Using in MapReduce</h4>
- <li class=""><a href="/docs/compression.html">Compression</a></li>
-
+<ul>
@@ -2018,19 +1101,7 @@ is invoking:</p>
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/run-length.html">Run Length Encoding</a></li>
+ <li class=""><a href="/docs/mapred.html">Using in MapRed</a></li>
@@ -2066,13 +1137,25 @@ is invoking:</p>
-
+ <li class=""><a href="/docs/mapreduce.html">Using in MapReduce</a></li>
+
+
+
+</ul>
+
-
+ <h4>Using ORC Core</h4>
+
+<ul>
+
+
+
+
+
@@ -2082,7 +1165,7 @@ is invoking:</p>
- <li class=""><a href="/docs/stripes.html">Stripes</a></li>
+ <li class=""><a href="/docs/core-java.html">Using Core Java</a></li>
@@ -2100,17 +1183,17 @@ is invoking:</p>
-
-
-
-
-
+ <li class=""><a href="/docs/core-cpp.html">Using Core C++</a></li>
+
+
+
+</ul>
+
-
+ <h4>Tools</h4>
- <li class=""><a href="/docs/encodings.html">Column Encodings</a></li>
-
+<ul>
@@ -2130,11 +1213,17 @@ is invoking:</p>
+ <li class=""><a href="/docs/cpp-tools.html">C++ Tools</a></li>
+
+
+
-
+
+
+
@@ -2156,7 +1245,7 @@ is invoking:</p>
- <li class=""><a href="/docs/spec-index.html">Indexes</a></li>
+ <li class=""><a href="/docs/java-tools.html">Java Tools</a></li>
http://git-wip-us.apache.org/repos/asf/orc/blob/c6e29090/docs/compression.html
----------------------------------------------------------------------
diff --git a/docs/compression.html b/docs/compression.html
deleted file mode 100644
index 2c70cb8..0000000
--- a/docs/compression.html
+++ /dev/null
@@ -1,2193 +0,0 @@
-<!DOCTYPE HTML>
-<html lang="en-US">
-<head>
- <meta charset="UTF-8">
- <title>Compression</title>
- <meta name="viewport" content="width=device-width,initial-scale=1">
- <meta name="generator" content="Jekyll v2.4.0">
- <link rel="stylesheet" href="//fonts.googleapis.com/css?family=Lato:300,300italic,400,400italic,700,700italic,900">
- <link rel="stylesheet" href="/css/screen.css">
- <link rel="icon" type="image/x-icon" href="/favicon.ico">
- <!--[if lt IE 9]>
- <script src="/js/html5shiv.min.js"></script>
- <script src="/js/respond.min.js"></script>
- <![endif]-->
-</head>
-
-
-<body class="wrap">
- <header role="banner">
- <nav class="mobile-nav show-on-mobiles">
- <ul>
- <li class="">
- <a href="/">Home</a>
- </li>
- <li class="current">
- <a href="/docs/"><span class="show-on-mobiles">Docs</span>
- <span class="hide-on-mobiles">Documentation</span></a>
- </li>
- <li class="">
- <a href="/talks/">Talks</a>
- </li>
- <li class="">
- <a href="/news/">News</a>
- </li>
- <li class="">
- <a href="/help/">Help</a>
- </li>
- <li class="">
- <a href="/develop/">Develop</a>
- </li>
-</ul>
-
- </nav>
- <div class="grid">
- <div class="unit one-third center-on-mobiles">
- <h1>
- <a href="/">
- <span class="sr-only">Apache ORC</span>
- <img src="/img/logo.png" width="249" height="101" alt="ORC Logo">
- </a>
- </h1>
- </div>
- <nav class="main-nav unit two-thirds hide-on-mobiles">
- <ul>
- <li class="">
- <a href="/">Home</a>
- </li>
- <li class="current">
- <a href="/docs/"><span class="show-on-mobiles">Docs</span>
- <span class="hide-on-mobiles">Documentation</span></a>
- </li>
- <li class="">
- <a href="/talks/">Talks</a>
- </li>
- <li class="">
- <a href="/news/">News</a>
- </li>
- <li class="">
- <a href="/help/">Help</a>
- </li>
- <li class="">
- <a href="/develop/">Develop</a>
- </li>
-</ul>
-
- </nav>
- </div>
-</header>
-
-
- <section class="docs">
- <div class="grid">
-
- <div class="docs-nav-mobile unit whole show-on-mobiles">
- <select onchange="if (this.value) window.location.href=this.value">
- <option value="">Navigate the docs…</option>
-
- <optgroup label="Overview">
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/index.html">Background</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/adopters.html">ORC Adopters</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/types.html">Types</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/indexes.html">Indexes</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/acid.html">ACID support</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- </optgroup>
-
- <optgroup label="Installing">
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/building.html">Building ORC</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/releases.html">Releases</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- </optgroup>
-
- <optgroup label="Using in Hive">
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/hive-ddl.html">Hive DDL</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/hive-config.html">Hive Configuration</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- </optgroup>
-
- <optgroup label="Using in MapReduce">
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/mapred.html">Using in MapRed</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/mapreduce.html">Using in MapReduce</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- </optgroup>
-
- <optgroup label="Using ORC Core">
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/core-java.html">Using Core Java</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/core-cpp.html">Using Core C++</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- </optgroup>
-
- <optgroup label="Tools">
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/cpp-tools.html">C++ Tools</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/java-tools.html">Java Tools</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- </optgroup>
-
- <optgroup label="Format Specification">
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/spec-intro.html">Introduction</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/file-tail.html">File Tail</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/compression.html">Compression</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/run-length.html">Run Length Encoding</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/stripes.html">Stripes</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/encodings.html">Column Encodings</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/spec-index.html">Indexes</option>
-
-
-
-
-
-
-
-
-
-
- </optgroup>
-
- </select>
-</div>
-
-
- <div class="unit four-fifths">
- <article>
- <h1>Compression</h1>
- <p>If the ORC file writer selects a generic compression codec (zlib or
-snappy), every part of the ORC file except for the Postscript is
-compressed with that codec. However, one of the requirements for ORC
-is that the reader be able to skip over compressed bytes without
-decompressing the entire stream. To manage this, ORC writes compressed
-streams in chunks with headers as in the figure below.
-To handle incompressible data, if the compressed data is larger than
-the original, the original is stored and the isOriginal flag is
-set. Each header is 3 bytes long with (compressedLength * 2 +
-isOriginal) stored as a little endian value. For example, the header
-for a chunk that compressed to 100,000 bytes would be [0x40, 0x0d,
-0x03]. The header for 5 bytes that did not compress would be [0x0b,
-0x00, 0x00]. Each compression chunk is compressed independently so
-that as long as a decompressor starts at the top of a header, it can
-start decompressing without the previous bytes.</p>
-
-<p><img src="/img/CompressionStream.png" alt="compression streams" /></p>
-
-<p>The default compression chunk size is 256K, but writers can choose
-their own value. Larger chunks lead to better compression, but require
-more memory. The chunk size is recorded in the Postscript so that
-readers can allocate appropriately sized buffers. Readers are
-guaranteed that no chunk will expand to more than the compression chunk
-size.</p>
-
-<p>ORC files without generic compression write each stream directly
-with no headers.</p>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <div class="section-nav">
- <div class="left align-right">
-
-
-
- <a href="/docs/file-tail.html" class="prev">Back</a>
-
- </div>
- <div class="right align-left">
-
-
-
- <a href="/docs/run-length.html" class="next">Next</a>
-
- </div>
- </div>
- <div class="clear"></div>
-
-
- </article>
- </div>
-
- <div class="unit one-fifth hide-on-mobiles">
- <aside>
-
- <h4>Overview</h4>
-
-
-<ul>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/index.html">Background</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/adopters.html">ORC Adopters</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/types.html">Types</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/indexes.html">Indexes</a></li>
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/acid.html">ACID support</a></li>
-
-
-
-</ul>
-
-
- <h4>Installing</h4>
-
-
-<ul>
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/building.html">Building ORC</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/releases.html">Releases</a></li>
-
-
-
-</ul>
-
-
- <h4>Using in Hive</h4>
-
-
-<ul>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/hive-ddl.html">Hive DDL</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/hive-config.html">Hive Configuration</a></li>
-
-
-
-</ul>
-
-
- <h4>Using in MapReduce</h4>
-
-
-<ul>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/mapred.html">Using in MapRed</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/mapreduce.html">Using in MapReduce</a></li>
-
-
-
-</ul>
-
-
- <h4>Using ORC Core</h4>
-
-
-<ul>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/core-java.html">Using Core Java</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/core-cpp.html">Using Core C++</a></li>
-
-
-
-</ul>
-
-
- <h4>Tools</h4>
-
-
-<ul>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/cpp-tools.html">C++ Tools</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/java-tools.html">Java Tools</a></li>
-
-
-
-</ul>
-
-
- <h4>Format Specification</h4>
-
-
-<ul>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/spec-intro.html">Introduction</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/file-tail.html">File Tail</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class="current"><a href="/docs/compression.html">Compression</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/run-length.html">Run Length Encoding</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/stripes.html">Stripes</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/encodings.html">Column Encodings</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/spec-index.html">Indexes</a></li>
-
-
-
-</ul>
-
-
- </aside>
-</div>
-
-
- <div class="clear"></div>
-
- </div>
- </section>
-
-
- <footer role="contentinfo">
- <p>The contents of this website are © 2018
- <a href="https://www.apache.org/">Apache Software Foundation</a>
- under the terms of the <a
- href="https://www.apache.org/licenses/LICENSE-2.0.html">
- Apache License v2</a>. Apache ORC and its logo are trademarks
- of the Apache Software Foundation.</p>
-</footer>
-
- <script>
- var anchorForId = function (id) {
- var anchor = document.createElement("a");
- anchor.className = "header-link";
- anchor.href = "#" + id;
- anchor.innerHTML = "<span class=\"sr-only\">Permalink</span><i class=\"fa fa-link\"></i>";
- anchor.title = "Permalink";
- return anchor;
- };
-
- var linkifyAnchors = function (level, containingElement) {
- var headers = containingElement.getElementsByTagName("h" + level);
- for (var h = 0; h < headers.length; h++) {
- var header = headers[h];
-
- if (typeof header.id !== "undefined" && header.id !== "") {
- header.appendChild(anchorForId(header.id));
- }
- }
- };
-
- document.onreadystatechange = function () {
- if (this.readyState === "complete") {
- var contentBlock = document.getElementsByClassName("docs")[0] || document.getElementsByClassName("news")[0];
- if (!contentBlock) {
- return;
- }
- for (var level = 1; level <= 6; level++) {
- linkifyAnchors(level, contentBlock);
- }
- }
- };
-</script>
-
-
-</body>
-</html>
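
The deleted Compression page above specifies that each compression chunk starts with a 3-byte header holding (compressedLength * 2 + isOriginal) as a little-endian value. The following is a minimal sketch of that arithmetic in Python, not the ORC library's own code; the function names `encode_chunk_header` and `decode_chunk_header` are hypothetical and chosen only for illustration.

```python
from typing import Tuple

# Largest length that fits: 24 header bits, one of which is the isOriginal flag.
MAX_CHUNK_LENGTH = (1 << 23) - 1


def encode_chunk_header(length: int, is_original: bool) -> bytes:
    """Build the 3-byte header for a chunk whose stored body is `length` bytes."""
    if not 0 <= length <= MAX_CHUNK_LENGTH:
        raise ValueError("length does not fit in a 3-byte chunk header")
    value = length * 2 + (1 if is_original else 0)
    return value.to_bytes(3, "little")


def decode_chunk_header(header: bytes) -> Tuple[int, bool]:
    """Return (length, is_original) from a 3-byte chunk header."""
    if len(header) != 3:
        raise ValueError("a chunk header is exactly 3 bytes")
    value = int.from_bytes(header, "little")
    # The low bit is the isOriginal flag; the remaining bits are the length.
    return value >> 1, bool(value & 1)


if __name__ == "__main__":
    # The two worked examples from the page text.
    print(encode_chunk_header(100000, False).hex())  # 400d03
    print(encode_chunk_header(5, True).hex())        # 0b0000
```

A reader that wants to skip a chunk only needs the decoded length: it reads 3 bytes, decodes them, and seeks forward by that many bytes to reach the next header.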
[7/9] orc git commit: Pushing ORC-339 reorganize the ORC file format
spec.
Posted by om...@apache.org.
http://git-wip-us.apache.org/repos/asf/orc/blob/c6e29090/docs/encodings.html
----------------------------------------------------------------------
diff --git a/docs/encodings.html b/docs/encodings.html
deleted file mode 100644
index 0a2a3f7..0000000
--- a/docs/encodings.html
+++ /dev/null
@@ -1,2790 +0,0 @@
-<!DOCTYPE HTML>
-<html lang="en-US">
-<head>
- <meta charset="UTF-8">
- <title>Column Encodings</title>
- <meta name="viewport" content="width=device-width,initial-scale=1">
- <meta name="generator" content="Jekyll v2.4.0">
- <link rel="stylesheet" href="//fonts.googleapis.com/css?family=Lato:300,300italic,400,400italic,700,700italic,900">
- <link rel="stylesheet" href="/css/screen.css">
- <link rel="icon" type="image/x-icon" href="/favicon.ico">
- <!--[if lt IE 9]>
- <script src="/js/html5shiv.min.js"></script>
- <script src="/js/respond.min.js"></script>
- <![endif]-->
-</head>
-
-
-<body class="wrap">
- <header role="banner">
- <nav class="mobile-nav show-on-mobiles">
- <ul>
- <li class="">
- <a href="/">Home</a>
- </li>
- <li class="current">
- <a href="/docs/"><span class="show-on-mobiles">Docs</span>
- <span class="hide-on-mobiles">Documentation</span></a>
- </li>
- <li class="">
- <a href="/talks/">Talks</a>
- </li>
- <li class="">
- <a href="/news/">News</a>
- </li>
- <li class="">
- <a href="/help/">Help</a>
- </li>
- <li class="">
- <a href="/develop/">Develop</a>
- </li>
-</ul>
-
- </nav>
- <div class="grid">
- <div class="unit one-third center-on-mobiles">
- <h1>
- <a href="/">
- <span class="sr-only">Apache ORC</span>
- <img src="/img/logo.png" width="249" height="101" alt="ORC Logo">
- </a>
- </h1>
- </div>
- <nav class="main-nav unit two-thirds hide-on-mobiles">
- <ul>
- <li class="">
- <a href="/">Home</a>
- </li>
- <li class="current">
- <a href="/docs/"><span class="show-on-mobiles">Docs</span>
- <span class="hide-on-mobiles">Documentation</span></a>
- </li>
- <li class="">
- <a href="/talks/">Talks</a>
- </li>
- <li class="">
- <a href="/news/">News</a>
- </li>
- <li class="">
- <a href="/help/">Help</a>
- </li>
- <li class="">
- <a href="/develop/">Develop</a>
- </li>
-</ul>
-
- </nav>
- </div>
-</header>
-
-
- <section class="docs">
- <div class="grid">
-
-
-
- <div class="unit four-fifths">
- <article>
- <h1>Column Encodings</h1>
- <h2 id="smallint-int-and-bigint-columns">SmallInt, Int, and BigInt Columns</h2>
-
-<p>The 16-, 32-, and 64-bit integer column types share the same set of
-potential encodings, which differ only in whether the DATA stream uses
-RLE v1 or RLE v2. If the PRESENT stream is not included, all of the
-values are present. Rows whose bit in the PRESENT stream is false are
-null and contribute no value to the DATA stream.</p>
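-
-<p>The relationship between the PRESENT and DATA streams can be sketched
-as follows. This is only an illustration that works on already-decoded
-streams; the function name <code>apply_present</code> and its arguments
-are hypothetical and are not part of any ORC API.</p>

```python
from typing import List, Optional


def apply_present(present: Optional[List[bool]],
                  data: List[int],
                  row_count: int) -> List[Optional[int]]:
    """Rebuild a column from decoded PRESENT bits and decoded DATA values.

    Hypothetical helper for illustration; it assumes both streams have
    already been run-length decoded.
    """
    if present is None:
        # No PRESENT stream: every row has a value.
        return list(data[:row_count])
    values = iter(data)
    # A false bit marks a null row and consumes nothing from DATA.
    return [next(values) if bit else None for bit in present[:row_count]]


print(apply_present([True, False, True], [7, 9], 3))
```

-<p>Note that the DATA stream holds only two values for the three rows,
-because the null row is not stored.</p>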
-
-<table>
- <thead>
- <tr>
- <th style="text-align: left">Encoding</th>
- <th style="text-align: left">Stream Kind</th>
- <th style="text-align: left">Optional</th>
- <th style="text-align: left">Contents</th>
- </tr>
- </thead>
- <tbody>
- <tr>
- <td style="text-align: left">DIRECT</td>
- <td style="text-align: left">PRESENT</td>
- <td style="text-align: left">Yes</td>
- <td style="text-align: left">Boolean RLE</td>
- </tr>
- <tr>
- <td style="text-align: left"> </td>
- <td style="text-align: left">DATA</td>
- <td style="text-align: left">No</td>
- <td style="text-align: left">Signed Integer RLE v1</td>
- </tr>
- <tr>
- <td style="text-align: left">DIRECT_V2</td>
- <td style="text-align: left">PRESENT</td>
- <td style="text-align: left">Yes</td>
- <td style="text-align: left">Boolean RLE</td>
- </tr>
- <tr>
- <td style="text-align: left"> </td>
- <td style="text-align: left">DATA</td>
- <td style="text-align: left">No</td>
- <td style="text-align: left">Signed Integer RLE v2</td>
- </tr>
- </tbody>
-</table>
-
-<h2 id="float-and-double-columns">Float and Double Columns</h2>
-
-<p>Floating point types are stored using IEEE 754 floating point bit
-layout. Float columns use 4 bytes per value and double columns use 8
-bytes.</p>
-
-<table>
- <thead>
- <tr>
- <th style="text-align: left">Encoding</th>
- <th style="text-align: left">Stream Kind</th>
- <th style="text-align: left">Optional</th>
- <th style="text-align: left">Contents</th>
- </tr>
- </thead>
- <tbody>
- <tr>
- <td style="text-align: left">DIRECT</td>
- <td style="text-align: left">PRESENT</td>
- <td style="text-align: left">Yes</td>
- <td style="text-align: left">Boolean RLE</td>
- </tr>
- <tr>
- <td style="text-align: left"> </td>
- <td style="text-align: left">DATA</td>
- <td style="text-align: left">No</td>
- <td style="text-align: left">IEEE 754 floating point representation</td>
- </tr>
- </tbody>
-</table>
-
-<h2 id="string-char-and-varchar-columns">String, Char, and VarChar Columns</h2>
-
-<p>String, char, and varchar columns may be encoded either using a
-dictionary encoding or a direct encoding. A direct encoding should be
-preferred when there are many distinct values. In all of the
-encodings, the PRESENT stream encodes whether the value is null. The
-Java ORC writer automatically picks the encoding after the first row
-group (10,000 rows).</p>
-
-<p>For direct encoding the UTF-8 bytes are saved in the DATA stream and
-the length of each value is written into the LENGTH stream. For
-example, if the values were [“Nevada”, “California”], the DATA stream
-would be “NevadaCalifornia” and the LENGTH stream would be [6, 10].</p>
-
-<p>For dictionary encodings the dictionary is sorted and UTF-8 bytes of
-each unique value are placed into DICTIONARY_DATA. The length of each
-item in the dictionary is put into the LENGTH stream. The DATA stream
-consists of the sequence of references to the dictionary elements.</p>
-
-<p>In dictionary encoding, if the values were [“Nevada”,
-“California”, “Nevada”, “California”, “Florida”], the
-DICTIONARY_DATA would be “CaliforniaFloridaNevada” and LENGTH would
-be [10, 7, 6]. The DATA would be [2, 0, 2, 0, 1].</p>
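-
-<p>The two string layouts described above can be reproduced with a short
-sketch. It produces the logical stream contents only, before any run
-length encoding or compression is applied. The function names are
-hypothetical, and sorting the dictionary by UTF-8 byte order is an
-assumption; the text only says that the dictionary is sorted.</p>

```python
from typing import List, Tuple


def encode_direct(values: List[str]) -> Tuple[bytes, List[int]]:
    """Return (DATA, LENGTH) for the direct layout."""
    encoded = [v.encode("utf-8") for v in values]
    return b"".join(encoded), [len(e) for e in encoded]


def encode_dictionary(values: List[str]) -> Tuple[bytes, List[int], List[int]]:
    """Return (DICTIONARY_DATA, LENGTH, DATA) for the dictionary layout."""
    # Assumption: unique values are sorted by their UTF-8 bytes.
    entries = sorted({v.encode("utf-8") for v in values})
    index = {entry: i for i, entry in enumerate(entries)}
    dictionary_data = b"".join(entries)
    lengths = [len(entry) for entry in entries]
    # DATA holds one dictionary reference per non-null row.
    data = [index[v.encode("utf-8")] for v in values]
    return dictionary_data, lengths, data


print(encode_direct(["Nevada", "California"]))
print(encode_dictionary(
    ["Nevada", "California", "Nevada", "California", "Florida"]))
```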
-
-<table>
- <thead>
- <tr>
- <th style="text-align: left">Encoding</th>
- <th style="text-align: left">Stream Kind</th>
- <th style="text-align: left">Optional</th>
- <th style="text-align: left">Contents</th>
- </tr>
- </thead>
- <tbody>
- <tr>
- <td style="text-align: left">DIRECT</td>
- <td style="text-align: left">PRESENT</td>
- <td style="text-align: left">Yes</td>
- <td style="text-align: left">Boolean RLE</td>
- </tr>
- <tr>
- <td style="text-align: left"> </td>
- <td style="text-align: left">DATA</td>
- <td style="text-align: left">No</td>
- <td style="text-align: left">String contents</td>
- </tr>
- <tr>
- <td style="text-align: left"> </td>
- <td style="text-align: left">LENGTH</td>
- <td style="text-align: left">No</td>
- <td style="text-align: left">Unsigned Integer RLE v1</td>
- </tr>
- <tr>
- <td style="text-align: left">DICTIONARY</td>
- <td style="text-align: left">PRESENT</td>
- <td style="text-align: left">Yes</td>
- <td style="text-align: left">Boolean RLE</td>
- </tr>
- <tr>
- <td style="text-align: left"> </td>
- <td style="text-align: left">DATA</td>
- <td style="text-align: left">No</td>
- <td style="text-align: left">Unsigned Integer RLE v1</td>
- </tr>
- <tr>
- <td style="text-align: left"> </td>
- <td style="text-align: left">DICTIONARY_DATA</td>
- <td style="text-align: left">No</td>
- <td style="text-align: left">String contents</td>
- </tr>
- <tr>
- <td style="text-align: left"> </td>
- <td style="text-align: left">LENGTH</td>
- <td style="text-align: left">No</td>
- <td style="text-align: left">Unsigned Integer RLE v1</td>
- </tr>
- <tr>
- <td style="text-align: left">DIRECT_V2</td>
- <td style="text-align: left">PRESENT</td>
- <td style="text-align: left">Yes</td>
- <td style="text-align: left">Boolean RLE</td>
- </tr>
- <tr>
- <td style="text-align: left"> </td>
- <td style="text-align: left">DATA</td>
- <td style="text-align: left">No</td>
- <td style="text-align: left">String contents</td>
- </tr>
- <tr>
- <td style="text-align: left"> </td>
- <td style="text-align: left">LENGTH</td>
- <td style="text-align: left">No</td>
- <td style="text-align: left">Unsigned Integer RLE v2</td>
- </tr>
- <tr>
- <td style="text-align: left">DICTIONARY_V2</td>
- <td style="text-align: left">PRESENT</td>
- <td style="text-align: left">Yes</td>
- <td style="text-align: left">Boolean RLE</td>
- </tr>
- <tr>
- <td style="text-align: left"> </td>
- <td style="text-align: left">DATA</td>
- <td style="text-align: left">No</td>
- <td style="text-align: left">Unsigned Integer RLE v2</td>
- </tr>
- <tr>
- <td style="text-align: left"> </td>
- <td style="text-align: left">DICTIONARY_DATA</td>
- <td style="text-align: left">No</td>
- <td style="text-align: left">String contents</td>
- </tr>
- <tr>
- <td style="text-align: left"> </td>
- <td style="text-align: left">LENGTH</td>
- <td style="text-align: left">No</td>
- <td style="text-align: left">Unsigned Integer RLE v2</td>
- </tr>
- </tbody>
-</table>
-
-<h2 id="boolean-columns">Boolean Columns</h2>
-
-<p>Boolean columns are rare, but have a simple encoding.</p>
-
-<table>
- <thead>
- <tr>
- <th style="text-align: left">Encoding</th>
- <th style="text-align: left">Stream Kind</th>
- <th style="text-align: left">Optional</th>
- <th style="text-align: left">Contents</th>
- </tr>
- </thead>
- <tbody>
- <tr>
- <td style="text-align: left">DIRECT</td>
- <td style="text-align: left">PRESENT</td>
- <td style="text-align: left">Yes</td>
- <td style="text-align: left">Boolean RLE</td>
- </tr>
- <tr>
- <td style="text-align: left"> </td>
- <td style="text-align: left">DATA</td>
- <td style="text-align: left">No</td>
- <td style="text-align: left">Boolean RLE</td>
- </tr>
- </tbody>
-</table>
-
-<h2 id="tinyint-columns">TinyInt Columns</h2>
-
-<p>TinyInt (byte) columns use byte run length encoding.</p>
-
-<table>
- <thead>
- <tr>
- <th style="text-align: left">Encoding</th>
- <th style="text-align: left">Stream Kind</th>
- <th style="text-align: left">Optional</th>
- <th style="text-align: left">Contents</th>
- </tr>
- </thead>
- <tbody>
- <tr>
- <td style="text-align: left">DIRECT</td>
- <td style="text-align: left">PRESENT</td>
- <td style="text-align: left">Yes</td>
- <td style="text-align: left">Boolean RLE</td>
- </tr>
- <tr>
- <td style="text-align: left"> </td>
- <td style="text-align: left">DATA</td>
- <td style="text-align: left">No</td>
- <td style="text-align: left">Byte RLE</td>
- </tr>
- </tbody>
-</table>
-
-<h2 id="binary-columns">Binary Columns</h2>
-
-<p>Binary data is encoded with a PRESENT stream, a DATA stream that records
-the contents, and a LENGTH stream that records the number of bytes in each
-value.</p>
-
-<table>
- <thead>
- <tr>
- <th style="text-align: left">Encoding</th>
- <th style="text-align: left">Stream Kind</th>
- <th style="text-align: left">Optional</th>
- <th style="text-align: left">Contents</th>
- </tr>
- </thead>
- <tbody>
- <tr>
- <td style="text-align: left">DIRECT</td>
- <td style="text-align: left">PRESENT</td>
- <td style="text-align: left">Yes</td>
- <td style="text-align: left">Boolean RLE</td>
- </tr>
- <tr>
- <td style="text-align: left"> </td>
- <td style="text-align: left">DATA</td>
- <td style="text-align: left">No</td>
- <td style="text-align: left">String contents</td>
- </tr>
- <tr>
- <td style="text-align: left"> </td>
- <td style="text-align: left">LENGTH</td>
- <td style="text-align: left">No</td>
- <td style="text-align: left">Unsigned Integer RLE v1</td>
- </tr>
- <tr>
- <td style="text-align: left">DIRECT_V2</td>
- <td style="text-align: left">PRESENT</td>
- <td style="text-align: left">Yes</td>
- <td style="text-align: left">Boolean RLE</td>
- </tr>
- <tr>
- <td style="text-align: left"> </td>
- <td style="text-align: left">DATA</td>
- <td style="text-align: left">No</td>
- <td style="text-align: left">String contents</td>
- </tr>
- <tr>
- <td style="text-align: left"> </td>
- <td style="text-align: left">LENGTH</td>
- <td style="text-align: left">No</td>
- <td style="text-align: left">Unsigned Integer RLE v2</td>
- </tr>
- </tbody>
-</table>
-
-<h2 id="decimal-columns">Decimal Columns</h2>
-
-<p>Decimal was introduced in Hive 0.11 with infinite precision (the total
-number of digits). In Hive 0.13, the definition was changed to limit
-the precision to a maximum of 38 digits, which conveniently uses 127
-bits plus a sign bit. The current encoding of decimal columns stores
-the integer representation of the value as an unbounded length zigzag
-encoded base 128 varint. The scale is stored in the SECONDARY stream
-as a signed integer.</p>
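-
-<p>A zigzag encoded base 128 varint can be sketched as below. The sketch
-assumes the usual varint convention of emitting the least significant
-7-bit group first with the high bit set on every byte except the last,
-and it takes the unscaled integer as input (for example 12345 for the
-decimal 123.45 with scale 2). The function name is hypothetical.</p>

```python
def zigzag_varint(unscaled: int) -> bytes:
    """Encode a signed integer of any size as a zigzag base 128 varint."""
    # Zigzag maps 0, -1, 1, -2, 2, ... to 0, 1, 2, 3, 4, ...
    zz = unscaled * 2 if unscaled >= 0 else -unscaled * 2 - 1
    out = bytearray()
    while True:
        low = zz & 0x7F
        zz >>= 7
        if zz:
            # More groups follow: set the continuation bit.
            out.append(low | 0x80)
        else:
            out.append(low)
            return bytes(out)


for value in (0, -1, 1, 64, 12345):
    print(value, zigzag_varint(value).hex())
```

-<p>Python integers are unbounded, so the same loop handles values that
-need the full 127 bits.</p>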
-
-<table>
- <thead>
- <tr>
- <th style="text-align: left">Encoding</th>
- <th style="text-align: left">Stream Kind</th>
- <th style="text-align: left">Optional</th>
- <th style="text-align: left">Contents</th>
- </tr>
- </thead>
- <tbody>
- <tr>
- <td style="text-align: left">DIRECT</td>
- <td style="text-align: left">PRESENT</td>
- <td style="text-align: left">Yes</td>
- <td style="text-align: left">Boolean RLE</td>
- </tr>
- <tr>
- <td style="text-align: left"> </td>
- <td style="text-align: left">DATA</td>
- <td style="text-align: left">No</td>
- <td style="text-align: left">Unbounded base 128 varints</td>
- </tr>
- <tr>
- <td style="text-align: left"> </td>
- <td style="text-align: left">SECONDARY</td>
- <td style="text-align: left">No</td>
- <td style="text-align: left">Signed Integer RLE v1</td>
- </tr>
- <tr>
- <td style="text-align: left">DIRECT_V2</td>
- <td style="text-align: left">PRESENT</td>
- <td style="text-align: left">Yes</td>
- <td style="text-align: left">Boolean RLE</td>
- </tr>
- <tr>
- <td style="text-align: left"> </td>
- <td style="text-align: left">DATA</td>
- <td style="text-align: left">No</td>
- <td style="text-align: left">Unbounded base 128 varints</td>
- </tr>
- <tr>
- <td style="text-align: left"> </td>
- <td style="text-align: left">SECONDARY</td>
- <td style="text-align: left">No</td>
- <td style="text-align: left">Signed Integer RLE v2</td>
- </tr>
- </tbody>
-</table>
-
-<h2 id="date-columns">Date Columns</h2>
-
-<p>Date data is encoded with a PRESENT stream and a DATA stream that
-records the number of days since January 1, 1970 in UTC.</p>
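-
-<p>The day count stored in the DATA stream can be computed as in this
-small sketch. The helper names are hypothetical, and negative results
-for dates before 1970 are what the signed integer encoding would carry.</p>

```python
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)


def date_to_days(d: date) -> int:
    """Number of days between the epoch and d (negative before 1970)."""
    return (d - EPOCH).days


def days_to_date(days: int) -> date:
    """Inverse of date_to_days."""
    return EPOCH + timedelta(days=days)


print(date_to_days(date(2018, 1, 1)))
print(days_to_date(-1))
```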
-
-<table>
- <thead>
- <tr>
- <th style="text-align: left">Encoding</th>
- <th style="text-align: left">Stream Kind</th>
- <th style="text-align: left">Optional</th>
- <th style="text-align: left">Contents</th>
- </tr>
- </thead>
- <tbody>
- <tr>
- <td style="text-align: left">DIRECT</td>
- <td style="text-align: left">PRESENT</td>
- <td style="text-align: left">Yes</td>
- <td style="text-align: left">Boolean RLE</td>
- </tr>
- <tr>
- <td style="text-align: left"> </td>
- <td style="text-align: left">DATA</td>
- <td style="text-align: left">No</td>
- <td style="text-align: left">Signed Integer RLE v1</td>
- </tr>
- <tr>
- <td style="text-align: left">DIRECT_V2</td>
- <td style="text-align: left">PRESENT</td>
- <td style="text-align: left">Yes</td>
- <td style="text-align: left">Boolean RLE</td>
- </tr>
- <tr>
- <td style="text-align: left"> </td>
- <td style="text-align: left">DATA</td>
- <td style="text-align: left">No</td>
- <td style="text-align: left">Signed Integer RLE v2</td>
- </tr>
- </tbody>
-</table>
-
-<h2 id="timestamp-columns">Timestamp Columns</h2>
-
-<p>Timestamp records times down to nanoseconds as a PRESENT stream that
-records non-null values, a DATA stream that records the number of
-seconds after 1 January 2015, and a SECONDARY stream that records the
-number of nanoseconds.</p>
-
-<p>Because the number of nanoseconds often has a large number of trailing
-zeros, the number has trailing decimal zero digits removed and the
-last three bits are used to record how many zeros were removed. Zeros
-are removed only when there are at least two of them, and the three
-bits then hold the number of removed zeros minus one. Thus 1000
-nanoseconds would be serialized as 0x0a and 100000 would be serialized
-as 0x0c.</p>
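-
-<p>The trailing-zero packing can be sketched as below. This is a minimal
-sketch that assumes the low three bits hold zero when nothing was
-removed and otherwise the number of removed zeros minus one, with
-removal applied only when at least two trailing zeros exist, which is
-how the reference Java writer is understood to behave. Under that
-assumption 1000 encodes to 0x0a and 100000 encodes to 0x0c. The
-function names are hypothetical.</p>

```python
def encode_nanos(nanos: int) -> int:
    """Pack a nanosecond count into the SECONDARY stream representation."""
    if nanos == 0:
        return 0
    if nanos % 100 != 0:
        # Fewer than two trailing zeros: store the value unchanged.
        return nanos << 3
    nanos //= 100
    stored = 1  # removed zeros minus one
    while nanos % 10 == 0 and stored < 7:
        nanos //= 10
        stored += 1
    return (nanos << 3) | stored


def decode_nanos(serialized: int) -> int:
    """Inverse of encode_nanos."""
    stored = serialized & 0x07
    value = serialized >> 3
    if stored:
        value *= 10 ** (stored + 1)
    return value


for n in (1000, 100000, 123, 10):
    print(n, hex(encode_nanos(n)), decode_nanos(encode_nanos(n)))
```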
-
-<table>
- <thead>
- <tr>
- <th style="text-align: left">Encoding</th>
- <th style="text-align: left">Stream Kind</th>
- <th style="text-align: left">Optional</th>
- <th style="text-align: left">Contents</th>
- </tr>
- </thead>
- <tbody>
- <tr>
- <td style="text-align: left">DIRECT</td>
- <td style="text-align: left">PRESENT</td>
- <td style="text-align: left">Yes</td>
- <td style="text-align: left">Boolean RLE</td>
- </tr>
- <tr>
- <td style="text-align: left"> </td>
- <td style="text-align: left">DATA</td>
- <td style="text-align: left">No</td>
- <td style="text-align: left">Signed Integer RLE v1</td>
- </tr>
- <tr>
- <td style="text-align: left"> </td>
- <td style="text-align: left">SECONDARY</td>
- <td style="text-align: left">No</td>
- <td style="text-align: left">Unsigned Integer RLE v1</td>
- </tr>
- <tr>
- <td style="text-align: left">DIRECT_V2</td>
- <td style="text-align: left">PRESENT</td>
- <td style="text-align: left">Yes</td>
- <td style="text-align: left">Boolean RLE</td>
- </tr>
- <tr>
- <td style="text-align: left"> </td>
- <td style="text-align: left">DATA</td>
- <td style="text-align: left">No</td>
- <td style="text-align: left">Signed Integer RLE v2</td>
- </tr>
- <tr>
- <td style="text-align: left"> </td>
- <td style="text-align: left">SECONDARY</td>
- <td style="text-align: left">No</td>
- <td style="text-align: left">Unsigned Integer RLE v2</td>
- </tr>
- </tbody>
-</table>
-
-<h2 id="struct-columns">Struct Columns</h2>
-
-<p>Structs have no data themselves and delegate everything to their child
-columns except for their PRESENT stream. They have a child column
-for each of the fields.</p>
-
-<table>
- <thead>
- <tr>
- <th style="text-align: left">Encoding</th>
- <th style="text-align: left">Stream Kind</th>
- <th style="text-align: left">Optional</th>
- <th style="text-align: left">Contents</th>
- </tr>
- </thead>
- <tbody>
- <tr>
- <td style="text-align: left">DIRECT</td>
- <td style="text-align: left">PRESENT</td>
- <td style="text-align: left">Yes</td>
- <td style="text-align: left">Boolean RLE</td>
- </tr>
- </tbody>
-</table>
-
-<h2 id="list-columns">List Columns</h2>
-
-<p>Lists are encoded as a PRESENT stream and a LENGTH stream with the
-number of items in each list. They have a single child column for the
-element values.</p>
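-
-<p>Reading a list column back can be sketched as follows, working on
-already-decoded streams. The sketch assumes that null lists have no
-entry in the LENGTH stream, by analogy with the integer columns where
-null rows contribute no value. The function name is hypothetical.</p>

```python
from typing import Any, List, Optional


def rebuild_lists(present: List[bool],
                  lengths: List[int],
                  child_values: List[Any]) -> List[Optional[List[Any]]]:
    """Rebuild list rows from PRESENT bits, LENGTH values and the child column."""
    length_iter = iter(lengths)
    children = iter(child_values)
    rows: List[Optional[List[Any]]] = []
    for bit in present:
        if not bit:
            # Assumption: a null list consumes no LENGTH entry.
            rows.append(None)
            continue
        count = next(length_iter)
        # Take the next `count` elements from the single child column.
        rows.append([next(children) for _ in range(count)])
    return rows


print(rebuild_lists([True, False, True, True], [2, 0, 1], ["a", "b", "c"]))
```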
-
-<table>
- <thead>
- <tr>
- <th style="text-align: left">Encoding</th>
- <th style="text-align: left">Stream Kind</th>
- <th style="text-align: left">Optional</th>
- <th style="text-align: left">Contents</th>
- </tr>
- </thead>
- <tbody>
- <tr>
- <td style="text-align: left">DIRECT</td>
- <td style="text-align: left">PRESENT</td>
- <td style="text-align: left">Yes</td>
- <td style="text-align: left">Boolean RLE</td>
- </tr>
- <tr>
- <td style="text-align: left"> </td>
- <td style="text-align: left">LENGTH</td>
- <td style="text-align: left">No</td>
- <td style="text-align: left">Unsigned Integer RLE v1</td>
- </tr>
- <tr>
- <td style="text-align: left">DIRECT_V2</td>
- <td style="text-align: left">PRESENT</td>
- <td style="text-align: left">Yes</td>
- <td style="text-align: left">Boolean RLE</td>
- </tr>
- <tr>
- <td style="text-align: left"> </td>
- <td style="text-align: left">LENGTH</td>
- <td style="text-align: left">No</td>
- <td style="text-align: left">Unsigned Integer RLE v2</td>
- </tr>
- </tbody>
-</table>
-
-<h2 id="map-columns">Map Columns</h2>
-
-<p>Maps are encoded as a PRESENT stream and a LENGTH stream with the
-number of items in each map. They have a child column for the key and
-another child column for the value.</p>
-
-<table>
- <thead>
- <tr>
- <th style="text-align: left">Encoding</th>
- <th style="text-align: left">Stream Kind</th>
- <th style="text-align: left">Optional</th>
- <th style="text-align: left">Contents</th>
- </tr>
- </thead>
- <tbody>
- <tr>
- <td style="text-align: left">DIRECT</td>
- <td style="text-align: left">PRESENT</td>
- <td style="text-align: left">Yes</td>
- <td style="text-align: left">Boolean RLE</td>
- </tr>
- <tr>
- <td style="text-align: left"> </td>
- <td style="text-align: left">LENGTH</td>
- <td style="text-align: left">No</td>
- <td style="text-align: left">Unsigned Integer RLE v1</td>
- </tr>
- <tr>
- <td style="text-align: left">DIRECT_V2</td>
- <td style="text-align: left">PRESENT</td>
- <td style="text-align: left">Yes</td>
- <td style="text-align: left">Boolean RLE</td>
- </tr>
- <tr>
- <td style="text-align: left"> </td>
- <td style="text-align: left">LENGTH</td>
- <td style="text-align: left">No</td>
- <td style="text-align: left">Unsigned Integer RLE v2</td>
- </tr>
- </tbody>
-</table>
-
-<h2 id="union-columns">Union Columns</h2>
-
-<p>Unions are encoded as the PRESENT stream and a tag stream that controls which
-potential variant is used. They have a child column for each variant of the
-union. Currently ORC union types are limited to 256 variants, which matches
-the Hive type model.</p>
-
-<table>
- <thead>
- <tr>
- <th style="text-align: left">Encoding</th>
- <th style="text-align: left">Stream Kind</th>
- <th style="text-align: left">Optional</th>
- <th style="text-align: left">Contents</th>
- </tr>
- </thead>
- <tbody>
- <tr>
- <td style="text-align: left">DIRECT</td>
- <td style="text-align: left">PRESENT</td>
- <td style="text-align: left">Yes</td>
- <td style="text-align: left">Boolean RLE</td>
- </tr>
- <tr>
- <td style="text-align: left"> </td>
- <td style="text-align: left">DATA</td>
- <td style="text-align: left">No</td>
- <td style="text-align: left">Byte RLE</td>
- </tr>
- </tbody>
-</table>
-
- <div class="clear"></div>
-
-
- </article>
- </div>
-
-
-
- <div class="clear"></div>
-
- </div>
- </section>
-
-
- <footer role="contentinfo">
- <p>The contents of this website are © 2018
- <a href="https://www.apache.org/">Apache Software Foundation</a>
- under the terms of the <a
- href="https://www.apache.org/licenses/LICENSE-2.0.html">
- Apache License v2</a>. Apache ORC and its logo are trademarks
- of the Apache Software Foundation.</p>
-</footer>
-
- <script>
- var anchorForId = function (id) {
- var anchor = document.createElement("a");
- anchor.className = "header-link";
- anchor.href = "#" + id;
- anchor.innerHTML = "<span class=\"sr-only\">Permalink</span><i class=\"fa fa-link\"></i>";
- anchor.title = "Permalink";
- return anchor;
- };
-
- var linkifyAnchors = function (level, containingElement) {
- var headers = containingElement.getElementsByTagName("h" + level);
- for (var h = 0; h < headers.length; h++) {
- var header = headers[h];
-
- if (typeof header.id !== "undefined" && header.id !== "") {
- header.appendChild(anchorForId(header.id));
- }
- }
- };
-
- document.onreadystatechange = function () {
- if (this.readyState === "complete") {
- var contentBlock = document.getElementsByClassName("docs")[0] || document.getElementsByClassName("news")[0];
- if (!contentBlock) {
- return;
- }
- for (var level = 1; level <= 6; level++) {
- linkifyAnchors(level, contentBlock);
- }
- }
- };
-</script>
-
-
-</body>
-</html>
http://git-wip-us.apache.org/repos/asf/orc/blob/c6e29090/docs/file-tail.html
----------------------------------------------------------------------
diff --git a/docs/file-tail.html b/docs/file-tail.html
deleted file mode 100644
index 3e4c9a4..0000000
--- a/docs/file-tail.html
+++ /dev/null
@@ -1,2477 +0,0 @@
-<!DOCTYPE HTML>
-<html lang="en-US">
-<head>
- <meta charset="UTF-8">
- <title>File Tail</title>
- <meta name="viewport" content="width=device-width,initial-scale=1">
- <meta name="generator" content="Jekyll v2.4.0">
- <link rel="stylesheet" href="//fonts.googleapis.com/css?family=Lato:300,300italic,400,400italic,700,700italic,900">
- <link rel="stylesheet" href="/css/screen.css">
- <link rel="icon" type="image/x-icon" href="/favicon.ico">
- <!--[if lt IE 9]>
- <script src="/js/html5shiv.min.js"></script>
- <script src="/js/respond.min.js"></script>
- <![endif]-->
-</head>
-
-
-<body class="wrap">
- <header role="banner">
- <nav class="mobile-nav show-on-mobiles">
- <ul>
- <li class="">
- <a href="/">Home</a>
- </li>
- <li class="current">
- <a href="/docs/"><span class="show-on-mobiles">Docs</span>
- <span class="hide-on-mobiles">Documentation</span></a>
- </li>
- <li class="">
- <a href="/talks/">Talks</a>
- </li>
- <li class="">
- <a href="/news/">News</a>
- </li>
- <li class="">
- <a href="/help/">Help</a>
- </li>
- <li class="">
- <a href="/develop/">Develop</a>
- </li>
-</ul>
-
- </nav>
- <div class="grid">
- <div class="unit one-third center-on-mobiles">
- <h1>
- <a href="/">
- <span class="sr-only">Apache ORC</span>
- <img src="/img/logo.png" width="249" height="101" alt="ORC Logo">
- </a>
- </h1>
- </div>
- <nav class="main-nav unit two-thirds hide-on-mobiles">
- <ul>
- <li class="">
- <a href="/">Home</a>
- </li>
- <li class="current">
- <a href="/docs/"><span class="show-on-mobiles">Docs</span>
- <span class="hide-on-mobiles">Documentation</span></a>
- </li>
- <li class="">
- <a href="/talks/">Talks</a>
- </li>
- <li class="">
- <a href="/news/">News</a>
- </li>
- <li class="">
- <a href="/help/">Help</a>
- </li>
- <li class="">
- <a href="/develop/">Develop</a>
- </li>
-</ul>
-
- </nav>
- </div>
-</header>
-
-
- <section class="docs">
- <div class="grid">
-
-
-
- <div class="unit four-fifths">
- <article>
- <h1>File Tail</h1>
- <p>Since HDFS does not support changing the data in a file after it is
-written, ORC stores the top level index at the end of the file. The
-overall structure of the file is given in the figure above. The
-file’s tail consists of three parts: the file metadata, the file footer, and
-the postscript.</p>
-
-<p>The metadata for ORC is stored using
-<a href="https://s.apache.org/protobuf_encoding">Protocol Buffers</a>, which provides
-the ability to add new fields without breaking readers. This document
-incorporates the Protobuf definition from the
-<a href="https://s.apache.org/orc_proto">ORC source code</a> and the
-reader is encouraged to review the Protobuf encoding if they need to
-understand the byte-level encoding.</p>
-
-<h1 id="postscript">Postscript</h1>
-
-<p>The Postscript section provides the necessary information to interpret
-the rest of the file including the length of the file’s Footer and
-Metadata sections, the version of the file, and the kind of general
-compression used (e.g., none, zlib, or snappy). The Postscript is never
-compressed and ends one byte before the end of the file. The version
-stored in the Postscript is the lowest version of Hive that is
-guaranteed to be able to read the file, and it is stored as a sequence of
-the major and minor version. There are currently two versions that are
-used: [0,11] for Hive 0.11, and [0,12] for Hive 0.12 or later.</p>
-
-<p>The process of reading an ORC file works backwards through the
-file. Rather than making multiple short reads, the ORC reader reads
-the last 16k bytes of the file with the hope that it will contain both
-the Footer and Postscript sections. The final byte of the file
-contains the serialized length of the Postscript, which must be less
-than 256 bytes. Once the Postscript is parsed, the compressed
-serialized length of the Footer is known and it can be decompressed
-and parsed.</p>
-
-<p><code>message PostScript {
- // the length of the footer section in bytes
- optional uint64 footerLength = 1;
- // the kind of generic compression used
- optional CompressionKind compression = 2;
- // the maximum size of each compression chunk
- optional uint64 compressionBlockSize = 3;
- // the version of the writer
- repeated uint32 version = 4 [packed = true];
- // the length of the metadata section in bytes
- optional uint64 metadataLength = 5;
- // the fixed string "ORC"
- optional string magic = 8000;
-}
-</code></p>
-
-<p><code>enum CompressionKind {
- NONE = 0;
- ZLIB = 1;
- SNAPPY = 2;
- LZO = 3;
- LZ4 = 4;
- ZSTD = 5;
-}
-</code></p>
-
-<h1 id="footer">Footer</h1>
-
-<p>The Footer section contains the layout of the body of the file, the
-type schema information, the number of rows, and the statistics about
-each of the columns.</p>
-
-<p>The file is broken into three parts: Header, Body, and Tail. The
-Header consists of the bytes “ORC” to support tools that want to
-scan the front of the file to determine the type of the file. The Body
-contains the rows and indexes, and the Tail gives the file level
-information as described in this section.</p>
-
-<p><code>message Footer {
- // the length of the file header in bytes (always 3)
- optional uint64 headerLength = 1;
- // the length of the file header and body in bytes
- optional uint64 contentLength = 2;
- // the information about the stripes
- repeated StripeInformation stripes = 3;
- // the schema information
- repeated Type types = 4;
- // the user metadata that was added
- repeated UserMetadataItem metadata = 5;
- // the total number of rows in the file
- optional uint64 numberOfRows = 6;
- // the statistics of each column across the file
- repeated ColumnStatistics statistics = 7;
- // the maximum number of rows in each index entry
- optional uint32 rowIndexStride = 8;
-}
-</code></p>
-
-<h2 id="stripe-information">Stripe Information</h2>
-
-<p>The body of the file is divided into stripes. Each stripe is
-self-contained and may be read using only its own bytes combined with the
-file’s Footer and Postscript. Each stripe contains only entire rows so
-that rows never straddle stripe boundaries. Stripes have three
-sections: a set of indexes for the rows within the stripe, the data
-itself, and a stripe footer. Both the indexes and the data sections
-are divided by columns so that only the data for the required columns
-needs to be read.</p>
-
-<p><code>message StripeInformation {
- // the start of the stripe within the file
- optional uint64 offset = 1;
- // the length of the indexes in bytes
- optional uint64 indexLength = 2;
- // the length of the data in bytes
- optional uint64 dataLength = 3;
- // the length of the footer in bytes
- optional uint64 footerLength = 4;
- // the number of rows in the stripe
- optional uint64 numberOfRows = 5;
-}
-</code></p>
-
-<h2 id="type-information">Type Information</h2>
-
-<p>All of the rows in an ORC file must have the same schema. Logically
-the schema is expressed as a tree as in the figure below, where
-the compound types have subcolumns under them.</p>
-
-<p><img src="/img/TreeWriters.png" alt="ORC column structure" /></p>
-
-<p>The equivalent Hive DDL would be:</p>
-
-<p><code>create table Foobar (
- myInt int,
- myMap map<string,
- struct<myString : string,
- myDouble: double>>,
- myTime timestamp
-);
-</code></p>
-
-<p>The type tree is flattened into a list via a pre-order traversal
-in which each type is assigned the next id. The root of the type
-tree is therefore always type id 0. Compound types have a field named subtypes
-that contains the list of their children’s type ids.</p>
-
-<p><code>message Type {
- enum Kind {
- BOOLEAN = 0;
- BYTE = 1;
- SHORT = 2;
- INT = 3;
- LONG = 4;
- FLOAT = 5;
- DOUBLE = 6;
- STRING = 7;
- BINARY = 8;
- TIMESTAMP = 9;
- LIST = 10;
- MAP = 11;
- STRUCT = 12;
- UNION = 13;
- DECIMAL = 14;
- DATE = 15;
- VARCHAR = 16;
- CHAR = 17;
- }
- // the kind of this type
- required Kind kind = 1;
- // the type ids of any subcolumns for list, map, struct, or union
- repeated uint32 subtypes = 2 [packed=true];
- // the list of field names for struct
- repeated string fieldNames = 3;
- // the maximum length of the type for varchar or char in UTF-8 characters
- optional uint32 maximumLength = 4;
- // the precision and scale for decimal
- optional uint32 precision = 5;
- optional uint32 scale = 6;
-}
-</code></p>
-
-<h2 id="column-statistics">Column Statistics</h2>
-
-<p>For each column, the writer records the count of values and, depending
-on the type, other useful fields. For most of the primitive types, it
-records the minimum and maximum values; for numeric types it additionally
-stores the sum. From Hive 1.1.0 onwards, the column statistics also record
-whether there are any null values within the row group by setting the
-hasNull flag. The hasNull flag is used by ORC’s predicate pushdown to
-better answer ‘IS NULL’ queries.</p>
-
-<p><code>message ColumnStatistics {
- // the number of values
- optional uint64 numberOfValues = 1;
- // At most one of these has a value for any column
- optional IntegerStatistics intStatistics = 2;
- optional DoubleStatistics doubleStatistics = 3;
- optional StringStatistics stringStatistics = 4;
- optional BucketStatistics bucketStatistics = 5;
- optional DecimalStatistics decimalStatistics = 6;
- optional DateStatistics dateStatistics = 7;
- optional BinaryStatistics binaryStatistics = 8;
- optional TimestampStatistics timestampStatistics = 9;
- optional bool hasNull = 10;
-}
-</code></p>
-
-<p>For integer types (tinyint, smallint, int, bigint), the column
-statistics include the minimum, maximum, and sum. If the sum
-overflows long at any point during the calculation, no sum is
-recorded.</p>
-
-<p><code>message IntegerStatistics {
- optional sint64 minimum = 1;
- optional sint64 maximum = 2;
- optional sint64 sum = 3;
-}
-</code></p>
-
-<p>For floating point types (float, double), the column statistics
-include the minimum, maximum, and sum. If the sum overflows a double,
-no sum is recorded.</p>
-
-<p><code>message DoubleStatistics {
- optional double minimum = 1;
- optional double maximum = 2;
- optional double sum = 3;
-}
-</code></p>
-
-<p>For strings, the minimum value, maximum value, and the sum of the
-lengths of the values are recorded.</p>
-
-<p><code>message StringStatistics {
- optional string minimum = 1;
- optional string maximum = 2;
- // sum will store the total length of all strings
- optional sint64 sum = 3;
-}
-</code></p>
-
-<p>For booleans, the statistics include the count of false and true values.</p>
-
-<p><code>message BucketStatistics {
- repeated uint64 count = 1 [packed=true];
-}
-</code></p>
-
-<p>For decimals, the minimum, maximum, and sum are stored.</p>
-
-<p><code>message DecimalStatistics {
- optional string minimum = 1;
- optional string maximum = 2;
- optional string sum = 3;
-}
-</code></p>
-
-<p>Date columns record the minimum and maximum values as the number of
-days since the UNIX epoch (1/1/1970).</p>
-
-<p><code>message DateStatistics {
- // min,max values saved as days since epoch
- optional sint32 minimum = 1;
- optional sint32 maximum = 2;
-}
-</code></p>
-
-<p>Timestamp columns record the minimum and maximum values as the number of
-milliseconds since the UNIX epoch (1/1/1970).</p>
-
-<p><code>message TimestampStatistics {
- // min,max values saved as milliseconds since epoch
- optional sint64 minimum = 1;
- optional sint64 maximum = 2;
-}
-</code></p>
-
-<p>Binary columns store the aggregate number of bytes across all of the values.</p>
-
-<p><code>message BinaryStatistics {
- // sum will store the total binary blob length
- optional sint64 sum = 1;
-}
-</code></p>
-
-<h2 id="user-metadata">User Metadata</h2>
-
-<p>The user can add arbitrary key/value pairs to an ORC file as it is
-written. The contents of the keys and values are completely
-application defined, but the key is a string and the value is
-binary. Care should be taken by applications to make sure that their
-keys are unique and in general should be prefixed with an organization
-code.</p>
-
-<p><code>message UserMetadataItem {
- // the user defined key
- required string name = 1;
- // the user defined binary value
- required bytes value = 2;
-}
-</code></p>
-
-<h2 id="file-metadata">File Metadata</h2>
-
-<p>The file Metadata section contains column statistics at
-stripe-level granularity. These statistics enable input split elimination
-based on predicate push-down evaluated per stripe.</p>
-
-<p><code>message StripeStatistics {
- repeated ColumnStatistics colStats = 1;
-}
-</code></p>
-
-<p><code>message Metadata {
- repeated StripeStatistics stripeStats = 1;
-}
-</code></p>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <div class="section-nav">
- <div class="left align-right">
-
-
-
- <a href="/docs/spec-intro.html" class="prev">Back</a>
-
- </div>
- <div class="right align-left">
-
-
-
- <a href="/docs/compression.html" class="next">Next</a>
-
- </div>
- </div>
- <div class="clear"></div>
-
-
- </article>
- </div>
-
- <div class="unit one-fifth hide-on-mobiles">
- <aside>
-
- <h4>Overview</h4>
-
-
-<ul>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/index.html">Background</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/adopters.html">ORC Adopters</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/types.html">Types</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/indexes.html">Indexes</a></li>
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/acid.html">ACID support</a></li>
-
-
-
-</ul>
-
-
- <h4>Installing</h4>
-
-
-<ul>
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/building.html">Building ORC</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/releases.html">Releases</a></li>
-
-
-
-</ul>
-
-
- <h4>Using in Hive</h4>
-
-
-<ul>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/hive-ddl.html">Hive DDL</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/hive-config.html">Hive Configuration</a></li>
-
-
-
-</ul>
-
-
- <h4>Using in MapReduce</h4>
-
-
-<ul>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/mapred.html">Using in MapRed</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/mapreduce.html">Using in MapReduce</a></li>
-
-
-
-</ul>
-
-
- <h4>Using ORC Core</h4>
-
-
-<ul>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/core-java.html">Using Core Java</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/core-cpp.html">Using Core C++</a></li>
-
-
-
-</ul>
-
-
- <h4>Tools</h4>
-
-
-<ul>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/cpp-tools.html">C++ Tools</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/java-tools.html">Java Tools</a></li>
-
-
-
-</ul>
-
-
- <h4>Format Specification</h4>
-
-
-<ul>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/spec-intro.html">Introduction</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class="current"><a href="/docs/file-tail.html">File Tail</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/compression.html">Compression</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/run-length.html">Run Length Encoding</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/stripes.html">Stripes</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/encodings.html">Column Encodings</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/spec-index.html">Indexes</a></li>
-
-
-
-</ul>
-
-
- </aside>
-</div>
-
-
- <div class="clear"></div>
-
- </div>
- </section>
-
-
- <footer role="contentinfo">
- <p>The contents of this website are © 2018
- <a href="https://www.apache.org/">Apache Software Foundation</a>
- under the terms of the <a
- href="https://www.apache.org/licenses/LICENSE-2.0.html">
- Apache License v2</a>. Apache ORC and its logo are trademarks
- of the Apache Software Foundation.</p>
-</footer>
-
- <script>
- var anchorForId = function (id) {
- var anchor = document.createElement("a");
- anchor.className = "header-link";
- anchor.href = "#" + id;
- anchor.innerHTML = "<span class=\"sr-only\">Permalink</span><i class=\"fa fa-link\"></i>";
- anchor.title = "Permalink";
- return anchor;
- };
-
- var linkifyAnchors = function (level, containingElement) {
- var headers = containingElement.getElementsByTagName("h" + level);
- for (var h = 0; h < headers.length; h++) {
- var header = headers[h];
-
- if (typeof header.id !== "undefined" && header.id !== "") {
- header.appendChild(anchorForId(header.id));
- }
- }
- };
-
- document.onreadystatechange = function () {
- if (this.readyState === "complete") {
- var contentBlock = document.getElementsByClassName("docs")[0] || document.getElementsByClassName("news")[0];
- if (!contentBlock) {
- return;
- }
- for (var level = 1; level <= 6; level++) {
- linkifyAnchors(level, contentBlock);
- }
- }
- };
-</script>
-
-
-</body>
-</html>
http://git-wip-us.apache.org/repos/asf/orc/blob/c6e29090/docs/hive-config.html
----------------------------------------------------------------------
diff --git a/docs/hive-config.html b/docs/hive-config.html
index 6fe958c..bc2f68c 100644
--- a/docs/hive-config.html
+++ b/docs/hive-config.html
@@ -109,12 +109,6 @@
-
-
-
-
-
-
<option value="/docs/index.html">Background</option>
@@ -130,14 +124,6 @@
-
-
-
-
-
-
-
-
@@ -174,20 +160,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
-
-
@@ -221,20 +193,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
-
-
<option value="/docs/types.html">Types</option>
@@ -261,12 +219,6 @@
-
-
-
-
-
-
<option value="/docs/indexes.html">Indexes</option>
@@ -280,14 +232,6 @@
-
-
-
-
-
-
-
-
@@ -324,20 +268,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
-
-
</optgroup>
@@ -381,20 +311,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
-
-
@@ -426,25 +342,11 @@
-
-
-
-
-
-
<option value="/docs/releases.html">Releases</option>
-
-
-
-
-
-
-
-
</optgroup>
@@ -471,12 +373,6 @@
-
-
-
-
-
-
<option value="/docs/hive-ddl.html">Hive DDL</option>
@@ -494,14 +390,6 @@
-
-
-
-
-
-
-
-
@@ -519,12 +407,6 @@
-
-
-
-
-
-
<option value="/docs/hive-config.html">Hive Configuration</option>
@@ -544,14 +426,6 @@
-
-
-
-
-
-
-
-
</optgroup>
@@ -586,12 +460,6 @@
-
-
-
-
-
-
<option value="/docs/mapred.html">Using in MapRed</option>
@@ -601,14 +469,6 @@
-
-
-
-
-
-
-
-
@@ -638,12 +498,6 @@
-
-
-
-
-
-
<option value="/docs/mapreduce.html">Using in MapReduce</option>
@@ -651,14 +505,6 @@
-
-
-
-
-
-
-
-
</optgroup>
@@ -679,8 +525,6 @@
-
-
<option value="/docs/core-java.html">Using Core Java</option>
@@ -704,18 +548,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
@@ -727,8 +559,6 @@
-
-
<option value="/docs/core-cpp.html">Using Core C++</option>
@@ -754,18 +584,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
</optgroup>
@@ -788,8 +606,6 @@
-
-
<option value="/docs/cpp-tools.html">C++ Tools</option>
@@ -811,18 +627,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
@@ -848,12 +652,6 @@
-
-
-
-
-
-
<option value="/docs/java-tools.html">Java Tools</option>
@@ -865,383 +663,18 @@
-
-
-
-
-
-
-
-
</optgroup>
- <optgroup label="Format Specification">
-
+ </select>
+</div>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/spec-intro.html">Introduction</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/file-tail.html">File Tail</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/compression.html">Compression</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/run-length.html">Run Length Encoding</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/stripes.html">Stripes</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/encodings.html">Column Encodings</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/spec-index.html">Indexes</option>
-
-
-
-
-
-
-
-
-
-
- </optgroup>
-
- </select>
-</div>
-
-
- <div class="unit four-fifths">
- <article>
- <h1>Hive Configuration</h1>
- <h2 id="table-properties">Table properties</h2>
+ <div class="unit four-fifths">
+ <article>
+ <h1>Hive Configuration</h1>
+ <h2 id="table-properties">Table properties</h2>
<p>Tables stored as ORC files use table properties to control their behavior. By
using table properties, the table owner ensures that all clients store data
@@ -1460,286 +893,60 @@ with the same options.</p>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <div class="section-nav">
- <div class="left align-right">
-
-
-
- <a href="/docs/hive-ddl.html" class="prev">Back</a>
-
- </div>
- <div class="right align-left">
-
-
-
- <a href="/docs/mapred.html" class="next">Next</a>
-
- </div>
- </div>
- <div class="clear"></div>
-
-
- </article>
- </div>
-
- <div class="unit one-fifth hide-on-mobiles">
- <aside>
-
- <h4>Overview</h4>
-
-
-<ul>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/index.html">Background</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/adopters.html">ORC Adopters</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/types.html">Types</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/indexes.html">Indexes</a></li>
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/acid.html">ACID support</a></li>
-
-
-
-</ul>
-
-
- <h4>Installing</h4>
-
-
-<ul>
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/building.html">Building ORC</a></li>
-
-
-
-
+
-
-
-
-
-
+
-
-
+
-
-
+
-
-
+
-
-
+
-
-
+
-
-
+
-
+ <div class="section-nav">
+ <div class="left align-right">
+
+
+
+ <a href="/docs/hive-ddl.html" class="prev">Back</a>
+
+ </div>
+ <div class="right align-left">
+
+
+
+ <a href="/docs/mapred.html" class="next">Next</a>
+
+ </div>
+ </div>
+ <div class="clear"></div>
- <li class=""><a href="/docs/releases.html">Releases</a></li>
-
+ </article>
+ </div>
-</ul>
-
+ <div class="unit one-fifth hide-on-mobiles">
+ <aside>
- <h4>Using in Hive</h4>
+ <h4>Overview</h4>
<ul>
@@ -1768,11 +975,7 @@ with the same options.</p>
-
-
-
-
- <li class=""><a href="/docs/hive-ddl.html">Hive DDL</a></li>
+ <li class=""><a href="/docs/index.html">Background</a></li>
@@ -1786,34 +989,10 @@ with the same options.</p>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class="current"><a href="/docs/hive-config.html">Hive Configuration</a></li>
+ <li class=""><a href="/docs/adopters.html">ORC Adopters</a></li>
-</ul>
-
-
- <h4>Using in MapReduce</h4>
-
-
-<ul>
-
@@ -1850,7 +1029,7 @@ with the same options.</p>
- <li class=""><a href="/docs/mapred.html">Using in MapRed</a></li>
+ <li class=""><a href="/docs/types.html">Types</a></li>
@@ -1880,49 +1059,7 @@ with the same options.</p>
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/mapreduce.html">Using in MapReduce</a></li>
-
-
-
-</ul>
-
-
- <h4>Using ORC Core</h4>
-
-
-<ul>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/core-java.html">Using Core Java</a></li>
+ <li class=""><a href="/docs/indexes.html">Indexes</a></li>
@@ -1934,22 +1071,14 @@ with the same options.</p>
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/core-cpp.html">Using Core C++</a></li>
+ <li class=""><a href="/docs/acid.html">ACID support</a></li>
</ul>
- <h4>Tools</h4>
+ <h4>Installing</h4>
<ul>
@@ -1966,15 +1095,7 @@ with the same options.</p>
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/cpp-tools.html">C++ Tools</a></li>
+ <li class=""><a href="/docs/building.html">Building ORC</a></li>
@@ -2012,14 +1133,14 @@ with the same options.</p>
- <li class=""><a href="/docs/java-tools.html">Java Tools</a></li>
+ <li class=""><a href="/docs/releases.html">Releases</a></li>
</ul>
- <h4>Format Specification</h4>
+ <h4>Using in Hive</h4>
<ul>
@@ -2046,31 +1167,7 @@ with the same options.</p>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/spec-intro.html">Introduction</a></li>
+ <li class=""><a href="/docs/hive-ddl.html">Hive DDL</a></li>
@@ -2094,31 +1191,17 @@ with the same options.</p>
-
-
-
-
- <li class=""><a href="/docs/file-tail.html">File Tail</a></li>
+ <li class="current"><a href="/docs/hive-config.html">Hive Configuration</a></li>
-
-
-
-
-
+</ul>
-
-
-
-
-
-
+ <h4>Using in MapReduce</h4>
- <li class=""><a href="/docs/compression.html">Compression</a></li>
-
+<ul>
@@ -2150,19 +1233,7 @@ with the same options.</p>
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/run-length.html">Run Length Encoding</a></li>
+ <li class=""><a href="/docs/mapred.html">Using in MapRed</a></li>
@@ -2198,13 +1269,25 @@ with the same options.</p>
-
+ <li class=""><a href="/docs/mapreduce.html">Using in MapReduce</a></li>
+
+
+
+</ul>
+
-
+ <h4>Using ORC Core</h4>
+
+<ul>
+
+
+
+
+
@@ -2214,7 +1297,7 @@ with the same options.</p>
- <li class=""><a href="/docs/stripes.html">Stripes</a></li>
+ <li class=""><a href="/docs/core-java.html">Using Core Java</a></li>
@@ -2232,17 +1315,17 @@ with the same options.</p>
-
-
-
-
-
+ <li class=""><a href="/docs/core-cpp.html">Using Core C++</a></li>
+
+
+
+</ul>
+
-
+ <h4>Tools</h4>
- <li class=""><a href="/docs/encodings.html">Column Encodings</a></li>
-
+<ul>
@@ -2262,11 +1345,17 @@ with the same options.</p>
+ <li class=""><a href="/docs/cpp-tools.html">C++ Tools</a></li>
+
+
+
-
+
+
+
@@ -2288,7 +1377,7 @@ with the same options.</p>
- <li class=""><a href="/docs/spec-index.html">Indexes</a></li>
+ <li class=""><a href="/docs/java-tools.html">Java Tools</a></li>
[3/9] orc git commit: Pushing ORC-339 reorganize the ORC file format
spec.
Posted by om...@apache.org.
http://git-wip-us.apache.org/repos/asf/orc/blob/c6e29090/specification/ORCv0.html
----------------------------------------------------------------------
diff --git a/specification/ORCv0.html b/specification/ORCv0.html
new file mode 100644
index 0000000..ecf335a
--- /dev/null
+++ b/specification/ORCv0.html
@@ -0,0 +1,1260 @@
+<!DOCTYPE HTML>
+<html lang="en-US">
+<head>
+ <meta charset="UTF-8">
+ <title>ORC Specification v0</title>
+ <meta name="viewport" content="width=device-width,initial-scale=1">
+ <meta name="generator" content="Jekyll v2.4.0">
+ <link rel="stylesheet" href="//fonts.googleapis.com/css?family=Lato:300,300italic,400,400italic,700,700italic,900">
+ <link rel="stylesheet" href="/css/screen.css">
+ <link rel="icon" type="image/x-icon" href="/favicon.ico">
+ <!--[if lt IE 9]>
+ <script src="/js/html5shiv.min.js"></script>
+ <script src="/js/respond.min.js"></script>
+ <![endif]-->
+</head>
+
+
+<body class="wrap">
+ <header role="banner">
+ <nav class="mobile-nav show-on-mobiles">
+ <ul>
+ <li class="">
+ <a href="/">Home</a>
+ </li>
+ <li class="">
+ <a href="/docs/"><span class="show-on-mobiles">Docs</span>
+ <span class="hide-on-mobiles">Documentation</span></a>
+ </li>
+ <li class="">
+ <a href="/talks/">Talks</a>
+ </li>
+ <li class="">
+ <a href="/news/">News</a>
+ </li>
+ <li class="">
+ <a href="/help/">Help</a>
+ </li>
+ <li class="">
+ <a href="/develop/">Develop</a>
+ </li>
+</ul>
+
+ </nav>
+ <div class="grid">
+ <div class="unit one-third center-on-mobiles">
+ <h1>
+ <a href="/">
+ <span class="sr-only">Apache ORC</span>
+ <img src="/img/logo.png" width="249" height="101" alt="ORC Logo">
+ </a>
+ </h1>
+ </div>
+ <nav class="main-nav unit two-thirds hide-on-mobiles">
+ <ul>
+ <li class="">
+ <a href="/">Home</a>
+ </li>
+ <li class="">
+ <a href="/docs/"><span class="show-on-mobiles">Docs</span>
+ <span class="hide-on-mobiles">Documentation</span></a>
+ </li>
+ <li class="">
+ <a href="/talks/">Talks</a>
+ </li>
+ <li class="">
+ <a href="/news/">News</a>
+ </li>
+ <li class="">
+ <a href="/help/">Help</a>
+ </li>
+ <li class="">
+ <a href="/develop/">Develop</a>
+ </li>
+</ul>
+
+ </nav>
+ </div>
+</header>
+
+
+ <section class="standalone">
+ <div class="grid">
+
+ <div class="unit whole">
+ <article>
+ <h1>ORC Specification v0</h1>
+ <p>This version of the file format was originally released as part of
+Hive 0.11.</p>
+
+<h1 id="motivation">Motivation</h1>
+
+<p>Hive’s RCFile was the standard format for storing tabular data in
+Hadoop for several years. However, RCFile has limitations because it
+treats each column as a binary blob without semantics. In Hive 0.11 we
+added a new file format named Optimized Row Columnar (ORC) file that
+uses and retains the type information from the table definition. ORC
+uses type-specific readers and writers that provide lightweight
+compression techniques such as dictionary encoding, bit packing, delta
+encoding, and run-length encoding, resulting in dramatically smaller
+files. Additionally, ORC can apply generic compression using zlib or
+Snappy on top of the lightweight compression for even smaller
+files. However, storage savings are only part of the gain. ORC
+supports projection, which selects subsets of the columns for reading,
+so that queries reading only one column read only the required
+bytes. Furthermore, ORC files include lightweight indexes that
+include the minimum and maximum values for each column in each set of
+10,000 rows and in the entire file. Using pushdown filters from Hive, the
+file reader can skip entire sets of rows that aren’t important for
+this query.</p>
+
+<p><img src="/img/OrcFileLayout.png" alt="ORC file structure" /></p>
+
+<h1 id="file-tail">File Tail</h1>
+
+<p>Since HDFS does not support changing the data in a file after it is
+written, ORC stores the top level index at the end of the file. The
+overall structure of the file is given in the figure above. The
+file’s tail consists of three parts: the file metadata, the file footer, and
+the postscript.</p>
+
+<p>The metadata for ORC is stored using
+<a href="https://s.apache.org/protobuf_encoding">Protocol Buffers</a>, which provides
+the ability to add new fields without breaking readers. This document
+incorporates the Protobuf definition from the
+<a href="https://s.apache.org/orc_proto">ORC source code</a> and the
+reader is encouraged to review the Protobuf encoding if they need to
+understand the byte-level encoding.</p>
+
+<h2 id="postscript">Postscript</h2>
+
+<p>The Postscript section provides the necessary information to interpret
+the rest of the file including the length of the file’s Footer and
+Metadata sections, the version of the file, and the kind of general
+compression used (e.g., none, zlib, or snappy). The Postscript is never
+compressed and ends one byte before the end of the file. The version
+stored in the Postscript is the lowest version of Hive that is
+guaranteed to be able to read the file, and it is stored as a sequence of
+the major and minor version. This version is stored as [0, 11].</p>
+
+<p>The process of reading an ORC file works backwards through the
+file. Rather than making multiple short reads, the ORC reader reads
+the last 16k bytes of the file with the hope that it will contain both
+the Footer and Postscript sections. The final byte of the file
+contains the serialized length of the Postscript, which must be less
+than 256 bytes. Once the Postscript is parsed, the compressed
+serialized length of the Footer is known and it can be decompressed
+and parsed.</p>
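+<p>The backwards read described above can be sketched roughly as follows.
+This is a minimal illustration, not the ORC reader's actual code: the
+helper names <code>read_varint</code>, <code>parse_fields</code>, and
+<code>locate_footer</code> are hypothetical, the Postscript is decoded with
+a hand-written Protobuf wire-format parser that only handles varint and
+length-delimited fields, and the Footer bytes are returned as stored
+(still compressed if the file uses compression).</p>

```python
def read_varint(buf: bytes, pos: int):
    """Decode one Protobuf base-128 varint at pos; return (value, next_pos)."""
    value = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte < 0x80:  # high bit clear: last byte of the varint
            return value, pos
        shift += 7


def parse_fields(message: bytes) -> dict:
    """Decode top-level fields into {field_number: int or bytes}.

    Only wire type 0 (varint) and 2 (length-delimited) are handled,
    which is all the PostScript message shown in this section uses.
    """
    fields = {}
    pos = 0
    while pos < len(message):
        tag, pos = read_varint(message, pos)
        number, wire_type = tag >> 3, tag & 0x07
        if wire_type == 0:
            value, pos = read_varint(message, pos)
        elif wire_type == 2:
            length, pos = read_varint(message, pos)
            value = message[pos:pos + length]
            pos += length
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        fields[number] = value
    return fields


def locate_footer(tail: bytes):
    """Given the last bytes of a file, return (postscript_fields, footer_bytes)."""
    ps_len = tail[-1]                 # final byte: serialized Postscript length
    ps_end = len(tail) - 1            # Postscript ends one byte before EOF
    ps_start = ps_end - ps_len
    if ps_start < 0:
        raise ValueError("tail buffer does not contain the whole Postscript")
    fields = parse_fields(tail[ps_start:ps_end])
    footer_len = fields.get(1, 0)     # field 1 = footerLength
    footer_start = ps_start - footer_len
    if footer_start < 0:
        raise ValueError("tail buffer too short; read more of the file")
    return fields, tail[footer_start:ps_start]
```

+<p>In this sketch, a <code>ValueError</code> from
+<code>locate_footer</code> stands in for the case where the initial read
+did not capture the whole Footer and a second, larger read is needed.</p>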
+
+<p><code>message PostScript {
+ // the length of the footer section in bytes
+ optional uint64 footerLength = 1;
+ // the kind of generic compression used
+ optional CompressionKind compression = 2;
+ // the maximum size of each compression chunk
+ optional uint64 compressionBlockSize = 3;
+ // the version of the writer
+ repeated uint32 version = 4 [packed = true];
+ // the length of the metadata section in bytes
+ optional uint64 metadataLength = 5;
+ // the fixed string "ORC"
+ optional string magic = 8000;
+}
+</code></p>
+
+<p><code>enum CompressionKind {
+ NONE = 0;
+ ZLIB = 1;
+ SNAPPY = 2;
+ LZO = 3;
+ LZ4 = 4;
+ ZSTD = 5;
+}
+</code></p>
+
+<h2 id="footer">Footer</h2>
+
+<p>The Footer section contains the layout of the body of the file, the
+type schema information, the number of rows, and the statistics about
+each of the columns.</p>
+
+<p>The file is broken into three parts: Header, Body, and Tail. The
+Header consists of the bytes “ORC” to support tools that want to
+scan the front of the file to determine the type of the file. The Body
+contains the rows and indexes, and the Tail gives the file level
+information as described in this section.</p>
+
+<p><code>message Footer {
+ // the length of the file header in bytes (always 3)
+ optional uint64 headerLength = 1;
+ // the length of the file header and body in bytes
+ optional uint64 contentLength = 2;
+ // the information about the stripes
+ repeated StripeInformation stripes = 3;
+ // the schema information
+ repeated Type types = 4;
+ // the user metadata that was added
+ repeated UserMetadataItem metadata = 5;
+ // the total number of rows in the file
+ optional uint64 numberOfRows = 6;
+ // the statistics of each column across the file
+ repeated ColumnStatistics statistics = 7;
+ // the maximum number of rows in each index entry
+ optional uint32 rowIndexStride = 8;
+}
+</code></p>
+
+<h3 id="stripe-information">Stripe Information</h3>
+
+<p>The body of the file is divided into stripes. Each stripe is
+self-contained and may be read using only its own bytes combined with the
+file’s Footer and Postscript. Each stripe contains only entire rows so
+that rows never straddle stripe boundaries. Stripes have three
+sections: a set of indexes for the rows within the stripe, the data
+itself, and a stripe footer. Both the indexes and the data sections
+are divided by columns so that only the data for the required columns
+needs to be read.</p>
+
+<p><code>message StripeInformation {
+ // the start of the stripe within the file
+ optional uint64 offset = 1;
+ // the length of the indexes in bytes
+ optional uint64 indexLength = 2;
+ // the length of the data in bytes
+ optional uint64 dataLength = 3;
+ // the length of the footer in bytes
+ optional uint64 footerLength = 4;
+ // the number of rows in the stripe
+ optional uint64 numberOfRows = 5;
+}
+</code></p>
+
+<h3 id="type-information">Type Information</h3>
+
+<p>All of the rows in an ORC file must have the same schema. Logically
+the schema is expressed as a tree as in the figure below, where
+the compound types have subcolumns under them.</p>
+
+<p><img src="/img/TreeWriters.png" alt="ORC column structure" /></p>
+
+<p>The equivalent Hive DDL would be:</p>
+
+<p><code>create table Foobar (
+ myInt int,
+ myMap map&lt;string,
+ struct&lt;myString : string,
+ myDouble: double&gt;&gt;,
+ myTime timestamp
+);
+</code></p>
+
+<p>The type tree is flattened into a list via a pre-order traversal
+where each type is assigned the next id. Because the traversal starts at
+the root, the root of the type tree is always type id 0. Compound types
+have a field named subtypes that contains the list of their children’s
+type ids.</p>
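+
+<p>As an illustration, the pre-order numbering can be sketched in Python.
+The nested-tuple representation of the schema and the names
+<code>flatten</code> and <code>FOOBAR</code> are assumptions made for this
+example. They are not part of the ORC API.</p>
+
```python
def flatten(schema):
    """Flatten a nested schema into a list of type entries.

    Each node is (kind, [(field_name, child_node), ...]). The index of an
    entry in the returned list is its type id (pre-order numbering).
    """
    types = []

    def visit(node):
        kind, children = node
        type_id = len(types)  # the next free id goes to this node
        entry = {"kind": kind, "subtypes": [], "fieldNames": []}
        types.append(entry)
        for name, child in children:
            entry["subtypes"].append(visit(child))
            if kind == "STRUCT":  # only structs record field names
                entry["fieldNames"].append(name)
        return type_id

    visit(schema)
    return types


# The Foobar table from the DDL above, written as nested tuples.
FOOBAR = ("STRUCT", [
    ("myInt", ("INT", [])),
    ("myMap", ("MAP", [
        ("key", ("STRING", [])),
        ("value", ("STRUCT", [
            ("myString", ("STRING", [])),
            ("myDouble", ("DOUBLE", [])),
        ])),
    ])),
    ("myTime", ("TIMESTAMP", [])),
])
```
+
+<p>For this schema the root struct is type id 0 with subtypes 1, 2, and 7,
+the map is type id 2 with subtypes 3 and 4, and the inner struct is type id 4
+with subtypes 5 and 6.</p>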
+
+<p><code>message Type {
+ enum Kind {
+ BOOLEAN = 0;
+ BYTE = 1;
+ SHORT = 2;
+ INT = 3;
+ LONG = 4;
+ FLOAT = 5;
+ DOUBLE = 6;
+ STRING = 7;
+ BINARY = 8;
+ TIMESTAMP = 9;
+ LIST = 10;
+ MAP = 11;
+ STRUCT = 12;
+ UNION = 13;
+ DECIMAL = 14;
+ DATE = 15;
+ VARCHAR = 16;
+ CHAR = 17;
+ }
+ // the kind of this type
+ required Kind kind = 1;
+ // the type ids of any subcolumns for list, map, struct, or union
+ repeated uint32 subtypes = 2 [packed=true];
+ // the list of field names for struct
+ repeated string fieldNames = 3;
+ // the maximum length of the type for varchar or char in UTF-8 characters
+ optional uint32 maximumLength = 4;
+ // the precision and scale for decimal
+ optional uint32 precision = 5;
+ optional uint32 scale = 6;
+}
+</code></p>
+
+<h3 id="column-statistics">Column Statistics</h3>
+
+<p>For each column, the writer records the count of values and, depending
+on the type, other useful fields. For most of the primitive types, it
+records the minimum and maximum values, and for numeric types it
+additionally stores the sum. From Hive 1.1.0 onwards, the column
+statistics also record whether there are any null values within the row
+group by setting the hasNull flag. The hasNull flag is used by ORC’s
+predicate pushdown to better answer ‘IS NULL’ queries.</p>
+
+<p><code>message ColumnStatistics {
+ // the number of values
+ optional uint64 numberOfValues = 1;
+ // At most one of these has a value for any column
+ optional IntegerStatistics intStatistics = 2;
+ optional DoubleStatistics doubleStatistics = 3;
+ optional StringStatistics stringStatistics = 4;
+ optional BucketStatistics bucketStatistics = 5;
+ optional DecimalStatistics decimalStatistics = 6;
+ optional DateStatistics dateStatistics = 7;
+ optional BinaryStatistics binaryStatistics = 8;
+ optional TimestampStatistics timestampStatistics = 9;
+ optional bool hasNull = 10;
+}
+</code></p>
+
+<p>For integer types (tinyint, smallint, int, bigint), the column
+statistics include the minimum, maximum, and sum. If the sum
+overflows long at any point during the calculation, no sum is
+recorded.</p>
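+
+<p>A minimal sketch of that rule in Python, where integers do not overflow
+and the 64-bit range therefore has to be checked explicitly. The function
+name <code>integer_statistics</code> and the dictionary it returns are
+illustrative assumptions, not the writer’s actual code.</p>
+
```python
INT64_MIN = -(1 << 63)
INT64_MAX = (1 << 63) - 1


def integer_statistics(values):
    """Compute minimum, maximum, and sum for a sequence of integers.

    The sum is dropped (None) as soon as the running total leaves the
    signed 64-bit range, even if later values would bring it back.
    """
    minimum = maximum = None
    total = 0
    for value in values:
        minimum = value if minimum is None else min(minimum, value)
        maximum = value if maximum is None else max(maximum, value)
        if total is not None:
            total += value
            if not INT64_MIN <= total <= INT64_MAX:
                total = None  # overflowed at this point: record no sum
    return {"minimum": minimum, "maximum": maximum, "sum": total}
```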
+
+<p><code>message IntegerStatistics {
+ optional sint64 minimum = 1;
+ optional sint64 maximum = 2;
+ optional sint64 sum = 3;
+}
+</code></p>
+
+<p>For floating point types (float, double), the column statistics
+include the minimum, maximum, and sum. If the sum overflows a double,
+no sum is recorded.</p>
+
+<p><code>message DoubleStatistics {
+ optional double minimum = 1;
+ optional double maximum = 2;
+ optional double sum = 3;
+}
+</code></p>
+
+<p>For strings, the minimum value, maximum value, and the sum of the
+lengths of the values are recorded.</p>
+
+<p><code>message StringStatistics {
+ optional string minimum = 1;
+ optional string maximum = 2;
+ // sum will store the total length of all strings
+ optional sint64 sum = 3;
+}
+</code></p>
+
+<p>For booleans, the statistics include the count of false and true values.</p>
+
+<p><code>message BucketStatistics {
+ repeated uint64 count = 1 [packed=true];
+}
+</code></p>
+
+<p>For decimals, the minimum, maximum, and sum are stored.</p>
+
+<p><code>message DecimalStatistics {
+ optional string minimum = 1;
+ optional string maximum = 2;
+ optional string sum = 3;
+}
+</code></p>
+
+<p>Date columns record the minimum and maximum values as the number of
+days since the UNIX epoch (1/1/1970 in UTC).</p>
+
+<p><code>message DateStatistics {
+ // min,max values saved as days since epoch
+ optional sint32 minimum = 1;
+ optional sint32 maximum = 2;
+}
+</code></p>
+
+<p>Timestamp columns record the minimum and maximum values as the number of
+milliseconds since the UNIX epoch (1/1/1970 00:00:00).</p>
+
+<p><code>message TimestampStatistics {
+ // min,max values saved as milliseconds since epoch
+ optional sint64 minimum = 1;
+ optional sint64 maximum = 2;
+}
+</code></p>
+
+<p>Binary columns store the aggregate number of bytes across all of the values.</p>
+
+<p><code>message BinaryStatistics {
+ // sum will store the total binary blob length
+ optional sint64 sum = 1;
+}
+</code></p>
+
+<h3 id="user-metadata">User Metadata</h3>
+
+<p>The user can add arbitrary key/value pairs to an ORC file as it is
+written. The contents of the keys and values are completely
+application defined, but the key is a string and the value is
+binary. Care should be taken by applications to make sure that their
+keys are unique and in general should be prefixed with an organization
+code.</p>
+
+<p><code>message UserMetadataItem {
+ // the user defined key
+ required string name = 1;
+ // the user defined binary value
+ required bytes value = 2;
+}
+</code></p>
+
+<h3 id="file-metadata">File Metadata</h3>
+
+<p>The file Metadata section contains column statistics at the stripe
+level granularity. These statistics enable input split elimination
+based on the predicate push-down evaluated per stripe.</p>
+
+<p><code>message StripeStatistics {
+ repeated ColumnStatistics colStats = 1;
+}
+</code></p>
+
+<p><code>message Metadata {
+ repeated StripeStatistics stripeStats = 1;
+}
+</code></p>
+
+<h1 id="compression">Compression</h1>
+
+<p>If the ORC file writer selects a generic compression codec (zlib or
+snappy), every part of the ORC file except for the Postscript is
+compressed with that codec. However, one of the requirements for ORC
+is that the reader be able to skip over compressed bytes without
+decompressing the entire stream. To manage this, ORC writes compressed
+streams in chunks with headers as in the figure below.
+To handle incompressible data, if the compressed data is larger than
+the original, the original is stored and the isOriginal flag is
+set. Each header is 3 bytes long with (compressedLength * 2 +
+isOriginal) stored as a little endian value. For example, the header
+for a chunk that compressed to 100,000 bytes would be [0x40, 0x0d,
+0x03]. The header for 5 bytes that did not compress would be [0x0b,
+0x00, 0x00]. Each compression chunk is compressed independently so
+that as long as a decompressor starts at the top of a header, it can
+start decompressing without the previous bytes.</p>
+
+<p><img src="/img/CompressionStream.png" alt="compression streams" /></p>
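+
+<p>The 3-byte chunk header can be sketched in Python as follows. The
+function names are hypothetical and the code handles only the header, not
+the codec itself.</p>
+
```python
def encode_chunk_header(length: int, is_original: bool) -> bytes:
    """Build the 3-byte little endian header for one compression chunk."""
    value = length * 2 + (1 if is_original else 0)
    if not 0 <= value < (1 << 24):
        raise ValueError("chunk length does not fit in a 3-byte header")
    return value.to_bytes(3, "little")


def decode_chunk_header(header: bytes):
    """Return (chunk length in bytes, isOriginal flag) from a header."""
    value = int.from_bytes(header[:3], "little")
    return value >> 1, bool(value & 1)
```
+
+<p>A reader that wants to skip a chunk decodes the header and advances by
+the returned length without touching the chunk’s contents.</p>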
+
+<p>The default compression chunk size is 256K, but writers can choose
+their own value. Larger chunks lead to better compression, but require
+more memory. The chunk size is recorded in the Postscript so that
+readers can allocate appropriately sized buffers. Readers are
+guaranteed that no chunk will expand to more than the compression chunk
+size.</p>
+
+<p>ORC files without generic compression write each stream directly
+with no headers.</p>
+
+<h1 id="run-length-encoding">Run Length Encoding</h1>
+
+<h2 id="base-128-varint">Base 128 Varint</h2>
+
+<p>Variable width integer encodings take advantage of the fact that most
+numbers are small and that having smaller encodings for small numbers
+shrinks the overall size of the data. ORC uses the varint format from
+Protocol Buffers, which writes data in little endian format using the
+low 7 bits of each byte. The high bit in each byte is set if the
+number continues into the next byte.</p>
+
+<table>
+ <thead>
+ <tr>
+ <th style="text-align: left">Unsigned Original</th>
+ <th style="text-align: left">Serialized</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td style="text-align: left">0</td>
+ <td style="text-align: left">0x00</td>
+ </tr>
+ <tr>
+ <td style="text-align: left">1</td>
+ <td style="text-align: left">0x01</td>
+ </tr>
+ <tr>
+ <td style="text-align: left">127</td>
+ <td style="text-align: left">0x7f</td>
+ </tr>
+ <tr>
+ <td style="text-align: left">128</td>
+ <td style="text-align: left">0x80, 0x01</td>
+ </tr>
+ <tr>
+ <td style="text-align: left">129</td>
+ <td style="text-align: left">0x81, 0x01</td>
+ </tr>
+ <tr>
+ <td style="text-align: left">16,383</td>
+ <td style="text-align: left">0xff, 0x7f</td>
+ </tr>
+ <tr>
+ <td style="text-align: left">16,384</td>
+ <td style="text-align: left">0x80, 0x80, 0x01</td>
+ </tr>
+ <tr>
+ <td style="text-align: left">16,385</td>
+ <td style="text-align: left">0x81, 0x80, 0x01</td>
+ </tr>
+ </tbody>
+</table>
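+
+<p>A short Python sketch of the unsigned encoding, with hypothetical
+function names, reproduces the rows of the table above.</p>
+
```python
def encode_varint(value: int) -> bytes:
    """Serialize a non-negative integer as a base 128 varint."""
    if value < 0:
        raise ValueError("varints encode unsigned values only")
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def decode_varint(data: bytes, pos: int = 0):
    """Read one varint starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7
```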
+
+<p>For signed integer types, the number is converted into an unsigned
+number using a zigzag encoding. Zigzag encoding moves the sign bit to
+the least significant bit using the expression
+(val &lt;&lt; 1) ^ (val &gt;&gt; 63) and derives its name from the fact that positive and negative
+numbers alternate once encoded. The unsigned number is then serialized
+as above.</p>
+
+<table>
+ <thead>
+ <tr>
+ <th style="text-align: left">Signed Original</th>
+ <th style="text-align: left">Unsigned</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td style="text-align: left">0</td>
+ <td style="text-align: left">0</td>
+ </tr>
+ <tr>
+ <td style="text-align: left">-1</td>
+ <td style="text-align: left">1</td>
+ </tr>
+ <tr>
+ <td style="text-align: left">1</td>
+ <td style="text-align: left">2</td>
+ </tr>
+ <tr>
+ <td style="text-align: left">-2</td>
+ <td style="text-align: left">3</td>
+ </tr>
+ <tr>
+ <td style="text-align: left">2</td>
+ <td style="text-align: left">4</td>
+ </tr>
+ </tbody>
+</table>
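+
+<p>The mapping in the table can be reproduced with a small Python sketch.
+The function names are hypothetical, and the mask is needed only because
+Python integers are not fixed at 64 bits.</p>
+
```python
MASK64 = (1 << 64) - 1


def zigzag_encode(value: int) -> int:
    """Map a signed 64-bit integer to an unsigned 64-bit integer."""
    if not -(1 << 63) <= value < (1 << 63):
        raise ValueError("value does not fit in a signed 64-bit integer")
    # >> on a negative Python int is an arithmetic shift, so value >> 63
    # is 0 for non-negative values and -1 (all ones) for negative values.
    return ((value << 1) ^ (value >> 63)) & MASK64


def zigzag_decode(encoded: int) -> int:
    """Invert zigzag_encode."""
    return (encoded >> 1) ^ -(encoded & 1)
```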
+
+<h2 id="byte-run-length-encoding">Byte Run Length Encoding</h2>
+
+<p>For byte streams, ORC uses a very lightweight encoding of identical
+values.</p>
+
+<ul>
+ <li>Run - a sequence of at least 3 identical values</li>
+ <li>Literals - a sequence of non-identical values</li>
+</ul>
+
+<p>The first byte of each group of values is a header that determines
+whether it is a run (value between 0 and 127) or literal list (value
+between -128 and -1). For runs, the control byte is the length of the
+run minus the length of the minimal run (3) and the control byte for
+literal lists is the negative length of the list. For example, a
+hundred 0’s is encoded as [0x61, 0x00] and the sequence 0x44, 0x45
+would be encoded as [0xfe, 0x44, 0x45]. The next group can choose
+either of the encodings.</p>
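+
+<p>A decoder for this scheme might look like the following Python sketch.
+The function name is hypothetical and, for brevity, the code does not
+validate truncated input.</p>
+
```python
def decode_byte_rle(data: bytes) -> bytes:
    """Expand a byte run length encoded buffer into the original bytes."""
    out = bytearray()
    pos = 0
    while pos < len(data):
        control = data[pos]
        pos += 1
        if control < 0x80:
            # Run: control is (run length - 3), followed by the repeated byte.
            out.extend(data[pos:pos + 1] * (control + 3))
            pos += 1
        else:
            # Literals: control is the negative count as a signed byte.
            count = 0x100 - control
            out.extend(data[pos:pos + count])
            pos += count
    return bytes(out)
```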
+
+<h2 id="boolean-run-length-encoding">Boolean Run Length Encoding</h2>
+
+<p>For encoding boolean types, the bits are put in the bytes from most
+significant to least significant. The bytes are encoded using byte run
+length encoding as described in the previous section. For example,
+the byte sequence [0xff, 0x80] would be one true followed by
+seven false values.</p>
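+
+<p>The bit-unpacking step can be sketched in Python as below. The function
+name is hypothetical, and its input is assumed to be the output of the byte
+run length decoder, so for the example above it receives the single byte
+0x80.</p>
+
```python
def unpack_bits(decoded: bytes, count: int):
    """Turn byte-RLE-decoded bytes into `count` booleans, MSB first."""
    bits = []
    for byte in decoded:
        for shift in range(7, -1, -1):  # most significant bit comes first
            bits.append(bool((byte >> shift) & 1))
    # The last byte may carry padding bits beyond the number of values.
    return bits[:count]
```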
+
+<h2 id="integer-run-length-encoding-version-1">Integer Run Length Encoding, version 1</h2>
+
+<p>ORC v0 files use Run Length Encoding version 1 (RLEv1),
+which provides a lightweight compression of signed or unsigned integer
+sequences. RLEv1 has two sub-encodings:</p>
+
+<ul>
+ <li>Run - a sequence of values that differ by a small fixed delta</li>
+ <li>Literals - a sequence of varint encoded values</li>
+</ul>
+
+<p>Runs start with an initial byte of 0x00 to 0x7f, which encodes the
+length of the run - 3. A second byte provides the fixed delta in the
+range of -128 to 127. Finally, the first value of the run is encoded
+as a base 128 varint.</p>
+
+<p>For example, if the sequence is 100 instances of 7 the encoding would
+start with 100 - 3, followed by a delta of 0, and a varint of 7 for
+an encoding of [0x61, 0x00, 0x07]. To encode the sequence of numbers
+running from 100 to 1, the first byte is 100 - 3, the delta is -1,
+and the varint is 100 for an encoding of [0x61, 0xff, 0x64].</p>
+
+<p>Literals start with an initial byte of 0x80 to 0xff, which corresponds
+to the negative of the number of literals in the sequence. Following the
+header byte, the list of N varints is encoded. Thus, if there are
+no runs, the overhead is 1 byte for each 128 integers. The first 5
+prime numbers [2, 3, 5, 7, 11] would be encoded as [0xfb, 0x02, 0x03,
+0x05, 0x07, 0x0b].</p>
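+
+<p>Both sub-encodings can be decoded with a short Python sketch. The
+function name and the <code>signed</code> parameter are assumptions for
+this illustration. When <code>signed</code> is true, each varint is treated
+as zigzag encoded.</p>
+
```python
def decode_rle_v1(data: bytes, signed: bool = False):
    """Decode an RLEv1 buffer into a list of integers."""

    def read_varint(pos):
        result = 0
        shift = 0
        while True:
            byte = data[pos]
            pos += 1
            result |= (byte & 0x7F) << shift
            if not byte & 0x80:
                break
            shift += 7
        if signed:
            result = (result >> 1) ^ -(result & 1)  # undo zigzag
        return result, pos

    values = []
    pos = 0
    while pos < len(data):
        control = data[pos]
        pos += 1
        if control < 0x80:
            # Run: length - 3, then a signed one-byte delta, then the base.
            length = control + 3
            delta = data[pos] - 0x100 if data[pos] >= 0x80 else data[pos]
            pos += 1
            base, pos = read_varint(pos)
            values.extend(base + i * delta for i in range(length))
        else:
            # Literals: the control byte is the negative count.
            for _ in range(0x100 - control):
                value, pos = read_varint(pos)
                values.append(value)
    return values
```
+
+<p>With the unsigned setting, [0x61, 0x00, 0x07] decodes to one hundred 7s
+and [0x61, 0xff, 0x64] decodes to the numbers from 100 down to 1.</p>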
+
+<h1 id="stripes">Stripes</h1>
+
+<p>The body of ORC files consists of a series of stripes. Stripes are
+large (typically ~200MB) and independent of each other and are often
+processed by different tasks. The defining characteristic for columnar
+storage formats is that the data for each column is stored separately
+and that reading data out of the file should be proportional to the
+number of columns read.</p>
+
+<p>In ORC files, each column is stored in several streams that are stored
+next to each other in the file. For example, an integer column is
+represented as two streams: PRESENT, which uses one bit per value to
+record whether the value is non-null, and DATA, which records the
+non-null values. If all of a column’s values in a stripe are non-null,
+the PRESENT stream is omitted from the stripe. For binary data, ORC
+uses three streams: PRESENT, DATA, and LENGTH, which stores the length
+of each value. The details of each type will be presented in the
+following subsections.</p>
+
+<h2 id="stripe-footer">Stripe Footer</h2>
+
+<p>The stripe footer contains the encoding of each column and the
+directory of the streams including their location.</p>
+
+<p><code>message StripeFooter {
+ // the location of each stream
+ repeated Stream streams = 1;
+ // the encoding of each column
+ repeated ColumnEncoding columns = 2;
+}
+</code></p>
+
+<p>To describe each stream, ORC stores the kind of stream, the column id,
+and the stream’s size in bytes. The details of what is stored in each stream
+depend on the type and encoding of the column.</p>
+
+<p><code>message Stream {
+ enum Kind {
+ // boolean stream of whether the next value is non-null
+ PRESENT = 0;
+ // the primary data stream
+ DATA = 1;
+ // the length of each value for variable length data
+ LENGTH = 2;
+ // the dictionary blob
+ DICTIONARY_DATA = 3;
+ // deprecated prior to Hive 0.11
+ // It was used to store the number of instances of each value in the
+ // dictionary
+ DICTIONARY_COUNT = 4;
+ // a secondary data stream
+ SECONDARY = 5;
+ // the index for seeking to particular row groups
+ ROW_INDEX = 6;
+ }
+ required Kind kind = 1;
+ // the column id
+ optional uint32 column = 2;
+ // the number of bytes in the file
+ optional uint64 length = 3;
+}
+</code></p>
+
+<p>Depending on the column type, several encodings are possible. The
+encodings are divided into direct or dictionary-based categories and
+further refined as to whether they use RLE v1 or v2.</p>
+
+<p><code>message ColumnEncoding {
+ enum Kind {
+ // the encoding is mapped directly to the stream using RLE v1
+ DIRECT = 0;
+ // the encoding uses a dictionary of unique values using RLE v1
+ DICTIONARY = 1;
+ // the encoding is direct using RLE v2
+ }
+ required Kind kind = 1;
+ // for dictionary encodings, record the size of the dictionary
+ optional uint32 dictionarySize = 2;
+}
+</code></p>
+
+<h1 id="column-encodings">Column Encodings</h1>
+
+<h2 id="smallint-int-and-bigint-columns">SmallInt, Int, and BigInt Columns</h2>
+
+<p>All of the 16, 32, and 64 bit integer column types use the same set of
+potential encodings, which is basically whether they use RLE v1 or
+v2. If the PRESENT stream is not included, all of the values are
+present. For values that have false bits in the present stream, no
+values are included in the data stream.</p>
+
+<table>
+ <thead>
+ <tr>
+ <th style="text-align: left">Encoding</th>
+ <th style="text-align: left">Stream Kind</th>
+ <th style="text-align: left">Optional</th>
+ <th style="text-align: left">Contents</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td style="text-align: left">DIRECT</td>
+ <td style="text-align: left">PRESENT</td>
+ <td style="text-align: left">Yes</td>
+ <td style="text-align: left">Boolean RLE</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"> </td>
+ <td style="text-align: left">DATA</td>
+ <td style="text-align: left">No</td>
+ <td style="text-align: left">Signed Integer RLE v1</td>
+ </tr>
+ </tbody>
+</table>
+
+<h2 id="float-and-double-columns">Float and Double Columns</h2>
+
+<p>Floating point types are stored using IEEE 754 floating point bit
+layout. Float columns use 4 bytes per value and double columns use 8
+bytes.</p>
+
+<table>
+ <thead>
+ <tr>
+ <th style="text-align: left">Encoding</th>
+ <th style="text-align: left">Stream Kind</th>
+ <th style="text-align: left">Optional</th>
+ <th style="text-align: left">Contents</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td style="text-align: left">DIRECT</td>
+ <td style="text-align: left">PRESENT</td>
+ <td style="text-align: left">Yes</td>
+ <td style="text-align: left">Boolean RLE</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"> </td>
+ <td style="text-align: left">DATA</td>
+ <td style="text-align: left">No</td>
+ <td style="text-align: left">IEEE 754 floating point representation</td>
+ </tr>
+ </tbody>
+</table>
+
+<h2 id="string-char-and-varchar-columns">String, Char, and VarChar Columns</h2>
+
+<p>String, char, and varchar columns may be encoded either using a
+dictionary encoding or a direct encoding. A direct encoding should be
+preferred when there are many distinct values. In all of the
+encodings, the PRESENT stream encodes whether the value is null. The
+Java ORC writer automatically picks the encoding after the first row
+group (10,000 rows).</p>
+
+<p>For direct encoding the UTF-8 bytes are saved in the DATA stream and
+the length of each value is written into the LENGTH stream. In direct
+encoding, if the values were [“Nevada”, “California”], the DATA
+would be “NevadaCalifornia” and the LENGTH would be [6, 10].</p>
+
+<p>For dictionary encodings the dictionary is sorted and UTF-8 bytes of
+each unique value are placed into DICTIONARY_DATA. The length of each
+item in the dictionary is put into the LENGTH stream. The DATA stream
+consists of the sequence of references to the dictionary elements.</p>
+
+<p>In dictionary encoding, if the values were [“Nevada”,
+“California”, “Nevada”, “California”, “Florida”], the
+DICTIONARY_DATA would be “CaliforniaFloridaNevada” and LENGTH would
+be [10, 7, 6]. The DATA would be [2, 0, 2, 0, 1].</p>
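+
+<p>The three dictionary streams can be derived with a few lines of Python.
+This sketch uses a hypothetical function name, assumes the dictionary is
+sorted by the UTF-8 bytes of each value, and shows the logical stream
+contents before any run length encoding is applied.</p>
+
```python
def dictionary_encode(values):
    """Build the logical contents of the dictionary-encoded streams."""
    encoded = [value.encode("utf-8") for value in values]
    dictionary = sorted(set(encoded))  # unique values in sorted order
    index = {entry: i for i, entry in enumerate(dictionary)}
    return {
        "DICTIONARY_DATA": b"".join(dictionary),
        "LENGTH": [len(entry) for entry in dictionary],
        "DATA": [index[entry] for entry in encoded],
    }
```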
+
+<table>
+ <thead>
+ <tr>
+ <th style="text-align: left">Encoding</th>
+ <th style="text-align: left">Stream Kind</th>
+ <th style="text-align: left">Optional</th>
+ <th style="text-align: left">Contents</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td style="text-align: left">DIRECT</td>
+ <td style="text-align: left">PRESENT</td>
+ <td style="text-align: left">Yes</td>
+ <td style="text-align: left">Boolean RLE</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"> </td>
+ <td style="text-align: left">DATA</td>
+ <td style="text-align: left">No</td>
+ <td style="text-align: left">String contents</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"> </td>
+ <td style="text-align: left">LENGTH</td>
+ <td style="text-align: left">No</td>
+ <td style="text-align: left">Unsigned Integer RLE v1</td>
+ </tr>
+ <tr>
+ <td style="text-align: left">DICTIONARY</td>
+ <td style="text-align: left">PRESENT</td>
+ <td style="text-align: left">Yes</td>
+ <td style="text-align: left">Boolean RLE</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"> </td>
+ <td style="text-align: left">DATA</td>
+ <td style="text-align: left">No</td>
+ <td style="text-align: left">Unsigned Integer RLE v1</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"> </td>
+ <td style="text-align: left">DICTIONARY_DATA</td>
+ <td style="text-align: left">No</td>
+ <td style="text-align: left">String contents</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"> </td>
+ <td style="text-align: left">LENGTH</td>
+ <td style="text-align: left">No</td>
+ <td style="text-align: left">Unsigned Integer RLE v1</td>
+ </tr>
+ </tbody>
+</table>
+
+<h2 id="boolean-columns">Boolean Columns</h2>
+
+<p>Boolean columns are rare, but have a simple encoding.</p>
+
+<table>
+ <thead>
+ <tr>
+ <th style="text-align: left">Encoding</th>
+ <th style="text-align: left">Stream Kind</th>
+ <th style="text-align: left">Optional</th>
+ <th style="text-align: left">Contents</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td style="text-align: left">DIRECT</td>
+ <td style="text-align: left">PRESENT</td>
+ <td style="text-align: left">Yes</td>
+ <td style="text-align: left">Boolean RLE</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"> </td>
+ <td style="text-align: left">DATA</td>
+ <td style="text-align: left">No</td>
+ <td style="text-align: left">Boolean RLE</td>
+ </tr>
+ </tbody>
+</table>
+
+<h2 id="tinyint-columns">TinyInt Columns</h2>
+
+<p>TinyInt (byte) columns use byte run length encoding.</p>
+
+<table>
+ <thead>
+ <tr>
+ <th style="text-align: left">Encoding</th>
+ <th style="text-align: left">Stream Kind</th>
+ <th style="text-align: left">Optional</th>
+ <th style="text-align: left">Contents</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td style="text-align: left">DIRECT</td>
+ <td style="text-align: left">PRESENT</td>
+ <td style="text-align: left">Yes</td>
+ <td style="text-align: left">Boolean RLE</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"> </td>
+ <td style="text-align: left">DATA</td>
+ <td style="text-align: left">No</td>
+ <td style="text-align: left">Byte RLE</td>
+ </tr>
+ </tbody>
+</table>
+
+<h2 id="binary-columns">Binary Columns</h2>
+
+<p>Binary data is encoded with a PRESENT stream, a DATA stream that records
+the contents, and a LENGTH stream that records the number of bytes per
+value.</p>
+
+<table>
+ <thead>
+ <tr>
+ <th style="text-align: left">Encoding</th>
+ <th style="text-align: left">Stream Kind</th>
+ <th style="text-align: left">Optional</th>
+ <th style="text-align: left">Contents</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td style="text-align: left">DIRECT</td>
+ <td style="text-align: left">PRESENT</td>
+ <td style="text-align: left">Yes</td>
+ <td style="text-align: left">Boolean RLE</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"> </td>
+ <td style="text-align: left">DATA</td>
+ <td style="text-align: left">No</td>
+ <td style="text-align: left">String contents</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"> </td>
+ <td style="text-align: left">LENGTH</td>
+ <td style="text-align: left">No</td>
+ <td style="text-align: left">Unsigned Integer RLE v1</td>
+ </tr>
+ </tbody>
+</table>
+
+<h2 id="decimal-columns">Decimal Columns</h2>
+
+<p>Decimal was introduced in Hive 0.11 with infinite precision (the total
+number of digits). In Hive 0.13, the definition was changed to limit
+the precision to a maximum of 38 digits, which conveniently uses 127
+bits plus a sign bit. The current encoding of decimal columns stores
+the integer representation of the value as an unbounded length zigzag
+encoded base 128 varint. The scale is stored in the SECONDARY stream
+as a signed integer.</p>
+
+<table>
+ <thead>
+ <tr>
+ <th style="text-align: left">Encoding</th>
+ <th style="text-align: left">Stream Kind</th>
+ <th style="text-align: left">Optional</th>
+ <th style="text-align: left">Contents</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td style="text-align: left">DIRECT</td>
+ <td style="text-align: left">PRESENT</td>
+ <td style="text-align: left">Yes</td>
+ <td style="text-align: left">Boolean RLE</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"> </td>
+ <td style="text-align: left">DATA</td>
+ <td style="text-align: left">No</td>
+ <td style="text-align: left">Unbounded base 128 varints</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"> </td>
+ <td style="text-align: left">SECONDARY</td>
+ <td style="text-align: left">No</td>
+ <td style="text-align: left">Unsigned Integer RLE v1</td>
+ </tr>
+ </tbody>
+</table>
+
+<h2 id="date-columns">Date Columns</h2>
+
+<p>Date data is encoded with a PRESENT stream and a DATA stream that records
+the number of days after January 1, 1970 in UTC.</p>
+
+<table>
+ <thead>
+ <tr>
+ <th style="text-align: left">Encoding</th>
+ <th style="text-align: left">Stream Kind</th>
+ <th style="text-align: left">Optional</th>
+ <th style="text-align: left">Contents</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td style="text-align: left">DIRECT</td>
+ <td style="text-align: left">PRESENT</td>
+ <td style="text-align: left">Yes</td>
+ <td style="text-align: left">Boolean RLE</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"> </td>
+ <td style="text-align: left">DATA</td>
+ <td style="text-align: left">No</td>
+ <td style="text-align: left">Signed Integer RLE v1</td>
+ </tr>
+ </tbody>
+</table>
+
+<h2 id="timestamp-columns">Timestamp Columns</h2>
+
+<p>Timestamp records times down to nanoseconds as a PRESENT stream that
+records non-null values, a DATA stream that records the number of
+seconds after 1 January 2015, and a SECONDARY stream that records the
+number of nanoseconds.</p>
+
+<p>Because the number of nanoseconds often has a large number of trailing
+zeros, the number has trailing decimal zero digits removed when there
+are at least two of them, and the last three bits record one less than
+the number of zeros removed. Thus 1000 nanoseconds would be serialized
+as 0x0a and 100000 would be serialized as 0x0c.</p>
+
+<table>
+ <thead>
+ <tr>
+ <th style="text-align: left">Encoding</th>
+ <th style="text-align: left">Stream Kind</th>
+ <th style="text-align: left">Optional</th>
+ <th style="text-align: left">Contents</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td style="text-align: left">DIRECT</td>
+ <td style="text-align: left">PRESENT</td>
+ <td style="text-align: left">Yes</td>
+ <td style="text-align: left">Boolean RLE</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"> </td>
+ <td style="text-align: left">DATA</td>
+ <td style="text-align: left">No</td>
+ <td style="text-align: left">Signed Integer RLE v1</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"> </td>
+ <td style="text-align: left">SECONDARY</td>
+ <td style="text-align: left">No</td>
+ <td style="text-align: left">Unsigned Integer RLE v1</td>
+ </tr>
+ </tbody>
+</table>
+
+<h2 id="struct-columns">Struct Columns</h2>
+
+<p>Structs have no data themselves and delegate everything to their child
+columns except for their PRESENT stream. They have a child column
+for each of the fields.</p>
+
+<table>
+ <thead>
+ <tr>
+ <th style="text-align: left">Encoding</th>
+ <th style="text-align: left">Stream Kind</th>
+ <th style="text-align: left">Optional</th>
+ <th style="text-align: left">Contents</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td style="text-align: left">DIRECT</td>
+ <td style="text-align: left">PRESENT</td>
+ <td style="text-align: left">Yes</td>
+ <td style="text-align: left">Boolean RLE</td>
+ </tr>
+ </tbody>
+</table>
+
+<h2 id="list-columns">List Columns</h2>
+
+<p>Lists are encoded as the PRESENT stream and a length stream with
+number of items in each list. They have a single child column for the
+element values.</p>
+
+<table>
+ <thead>
+ <tr>
+ <th style="text-align: left">Encoding</th>
+ <th style="text-align: left">Stream Kind</th>
+ <th style="text-align: left">Optional</th>
+ <th style="text-align: left">Contents</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td style="text-align: left">DIRECT</td>
+ <td style="text-align: left">PRESENT</td>
+ <td style="text-align: left">Yes</td>
+ <td style="text-align: left">Boolean RLE</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"> </td>
+ <td style="text-align: left">LENGTH</td>
+ <td style="text-align: left">No</td>
+ <td style="text-align: left">Unsigned Integer RLE v1</td>
+ </tr>
+ </tbody>
+</table>
+
+<h2 id="map-columns">Map Columns</h2>
+
+<p>Maps are encoded as the PRESENT stream and a length stream with number
+of items in each map. They have a child column for the key and
+another child column for the value.</p>
+
+<table>
+ <thead>
+ <tr>
+ <th style="text-align: left">Encoding</th>
+ <th style="text-align: left">Stream Kind</th>
+ <th style="text-align: left">Optional</th>
+ <th style="text-align: left">Contents</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td style="text-align: left">DIRECT</td>
+ <td style="text-align: left">PRESENT</td>
+ <td style="text-align: left">Yes</td>
+ <td style="text-align: left">Boolean RLE</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"> </td>
+ <td style="text-align: left">LENGTH</td>
+ <td style="text-align: left">No</td>
+ <td style="text-align: left">Unsigned Integer RLE v1</td>
+ </tr>
+ </tbody>
+</table>
+
+<h2 id="union-columns">Union Columns</h2>
+
+<p>Unions are encoded as the PRESENT stream and a tag stream that controls which
+potential variant is used. They have a child column for each variant of the
+union. Currently ORC union types are limited to 256 variants, which matches
+the Hive type model.</p>
+
+<table>
+ <thead>
+ <tr>
+ <th style="text-align: left">Encoding</th>
+ <th style="text-align: left">Stream Kind</th>
+ <th style="text-align: left">Optional</th>
+ <th style="text-align: left">Contents</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td style="text-align: left">DIRECT</td>
+ <td style="text-align: left">PRESENT</td>
+ <td style="text-align: left">Yes</td>
+ <td style="text-align: left">Boolean RLE</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"> </td>
+ <td style="text-align: left">DATA</td>
+ <td style="text-align: left">No</td>
+ <td style="text-align: left">Byte RLE</td>
+ </tr>
+ </tbody>
+</table>
+
+<h1 id="indexes">Indexes</h1>
+
+<h2 id="row-group-index">Row Group Index</h2>
+
+<p>The row group indexes consist of a ROW_INDEX stream for each primitive
+column that has an entry for each row group. Row groups are controlled
+by the writer and default to 10,000 rows. Each RowIndexEntry gives the
+position of each stream for the column and the statistics for that row
+group.</p>
+
+<p>The index streams are placed at the front of the stripe, because in
+the default case of streaming they do not need to be read. They are
+only loaded when either predicate push down is being used or the
+reader seeks to a particular row.</p>
+
+<p><code>message RowIndexEntry {
+ repeated uint64 positions = 1 [packed=true];
+ optional ColumnStatistics statistics = 2;
+}
+</code></p>
+
+<p><code>message RowIndex {
+ repeated RowIndexEntry entry = 1;
+}
+</code></p>
+
+<p>To record positions, each stream needs a sequence of numbers. For
+uncompressed streams, the position is the byte offset of the RLE run’s
+start location followed by the number of values that need to be
+consumed from the run. In compressed streams, the first number is the
+start of the compression chunk in the stream, followed by the number
+of decompressed bytes that need to be consumed, and finally the number
+of values consumed in the RLE.</p>
+
+<p>For columns with multiple streams, the sequences of positions in each
+stream are concatenated. That was an unfortunate decision on my part
+that we should fix at some point, because it makes code that uses the
+indexes error-prone.</p>
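The position layout described above can be sketched as follows. This is only an illustration for a single RLE-encoded stream: the struct StreamPosition and the function readRlePosition are hypothetical names, and the sketch assumes the positions list has already been decoded from a RowIndexEntry.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical holder for where to resume reading one RLE-encoded stream.
struct StreamPosition {
  uint64_t chunkStart = 0;    // compressed streams only: start of the compression chunk
  uint64_t byteOffset = 0;    // uncompressed: byte offset of the run's start;
                              // compressed: decompressed bytes to consume
  uint64_t valuesToSkip = 0;  // values to consume from the RLE run
};

// Reads the numbers for one stream starting at positions[index] and advances
// index past them, so that the next stream's numbers can be read afterwards
// from the same concatenated list.
StreamPosition readRlePosition(const std::vector<uint64_t>& positions,
                               std::size_t& index, bool compressed) {
  const std::size_t needed = compressed ? 3 : 2;
  if (index > positions.size() || positions.size() - index < needed) {
    throw std::out_of_range("not enough position entries for this stream");
  }
  StreamPosition p;
  if (compressed) {
    p.chunkStart = positions[index++];
  }
  p.byteOffset = positions[index++];
  p.valuesToSkip = positions[index++];
  return p;
}
```

Passing the index by reference keeps the caller responsible for walking the streams of a column in the order their positions were written, which is the part the text above calls error-prone.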
+
+<p>Because dictionaries are accessed randomly, there is not a position to
+record for the dictionary and the entire dictionary must be read even
+if only part of a stripe is being read.</p>
+
+
+ </article>
+ </div>
+
+ <div class="clear"></div>
+
+ </div>
+</section>
+
+
+ <footer role="contentinfo">
+ <p>The contents of this website are © 2018
+ <a href="https://www.apache.org/">Apache Software Foundation</a>
+ under the terms of the <a
+ href="https://www.apache.org/licenses/LICENSE-2.0.html">
+ Apache License v2</a>. Apache ORC and its logo are trademarks
+ of the Apache Software Foundation.</p>
+</footer>
+
+ <script>
+ var anchorForId = function (id) {
+ var anchor = document.createElement("a");
+ anchor.className = "header-link";
+ anchor.href = "#" + id;
+ anchor.innerHTML = "<span class=\"sr-only\">Permalink</span><i class=\"fa fa-link\"></i>";
+ anchor.title = "Permalink";
+ return anchor;
+ };
+
+ var linkifyAnchors = function (level, containingElement) {
+ var headers = containingElement.getElementsByTagName("h" + level);
+ for (var h = 0; h < headers.length; h++) {
+ var header = headers[h];
+
+ if (typeof header.id !== "undefined" && header.id !== "") {
+ header.appendChild(anchorForId(header.id));
+ }
+ }
+ };
+
+ document.onreadystatechange = function () {
+ if (this.readyState === "complete") {
+ var contentBlock = document.getElementsByClassName("docs")[0] || document.getElementsByClassName("news")[0];
+ if (!contentBlock) {
+ return;
+ }
+ for (var level = 1; level <= 6; level++) {
+ linkifyAnchors(level, contentBlock);
+ }
+ }
+ };
+</script>
+
+
+</body>
+</html>
[8/9] orc git commit: Pushing ORC-339 reorganize the ORC file format
spec.
Posted by om...@apache.org.
http://git-wip-us.apache.org/repos/asf/orc/blob/c6e29090/docs/core-cpp.html
----------------------------------------------------------------------
diff --git a/docs/core-cpp.html b/docs/core-cpp.html
index 130d019..ec31d6f 100644
--- a/docs/core-cpp.html
+++ b/docs/core-cpp.html
@@ -109,12 +109,6 @@
-
-
-
-
-
-
<option value="/docs/index.html">Background</option>
@@ -130,14 +124,6 @@
-
-
-
-
-
-
-
-
@@ -174,20 +160,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
-
-
@@ -221,20 +193,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
-
-
<option value="/docs/types.html">Types</option>
@@ -261,12 +219,6 @@
-
-
-
-
-
-
<option value="/docs/indexes.html">Indexes</option>
@@ -280,14 +232,6 @@
-
-
-
-
-
-
-
-
@@ -324,20 +268,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
-
-
</optgroup>
@@ -381,20 +311,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
-
-
@@ -426,25 +342,11 @@
-
-
-
-
-
-
<option value="/docs/releases.html">Releases</option>
-
-
-
-
-
-
-
-
</optgroup>
@@ -471,12 +373,6 @@
-
-
-
-
-
-
<option value="/docs/hive-ddl.html">Hive DDL</option>
@@ -494,14 +390,6 @@
-
-
-
-
-
-
-
-
@@ -519,12 +407,6 @@
-
-
-
-
-
-
<option value="/docs/hive-config.html">Hive Configuration</option>
@@ -544,14 +426,6 @@
-
-
-
-
-
-
-
-
</optgroup>
@@ -586,12 +460,6 @@
-
-
-
-
-
-
<option value="/docs/mapred.html">Using in MapRed</option>
@@ -601,14 +469,6 @@
-
-
-
-
-
-
-
-
@@ -638,12 +498,6 @@
-
-
-
-
-
-
<option value="/docs/mapreduce.html">Using in MapReduce</option>
@@ -651,14 +505,6 @@
-
-
-
-
-
-
-
-
</optgroup>
@@ -679,8 +525,6 @@
-
-
<option value="/docs/core-java.html">Using Core Java</option>
@@ -704,18 +548,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
@@ -727,8 +559,6 @@
-
-
<option value="/docs/core-cpp.html">Using Core C++</option>
@@ -754,18 +584,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
</optgroup>
@@ -788,8 +606,6 @@
-
-
<option value="/docs/cpp-tools.html">C++ Tools</option>
@@ -811,18 +627,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
@@ -848,12 +652,6 @@
-
-
-
-
-
-
<option value="/docs/java-tools.html">Java Tools</option>
@@ -865,384 +663,19 @@
-
-
-
-
-
-
-
-
</optgroup>
- <optgroup label="Format Specification">
-
+ </select>
+</div>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/spec-intro.html">Introduction</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/file-tail.html">File Tail</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/compression.html">Compression</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/run-length.html">Run Length Encoding</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/stripes.html">Stripes</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/encodings.html">Column Encodings</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/spec-index.html">Indexes</option>
-
-
-
-
-
-
-
-
-
-
- </optgroup>
-
- </select>
-</div>
-
-
- <div class="unit four-fifths">
- <article>
- <h1>Using Core C++</h1>
- <p>The C++ Core ORC API reads and writes ORC files into its own
-orc::ColumnVectorBatch vectorized classes.</p>
+ <div class="unit four-fifths">
+ <article>
+ <h1>Using Core C++</h1>
+ <p>The C++ Core ORC API reads and writes ORC files into its own
+orc::ColumnVectorBatch vectorized classes.</p>
<h2 id="vectorized-row-batch">Vectorized Row Batch</h2>
@@ -1345,576 +778,280 @@ value is null.</p>
<td>UnionVectorBatch</td>
</tr>
<tr>
- <td>varchar</td>
- <td>StringVectorBatch</td>
- </tr>
- </tbody>
-</table>
-
-<p>LongVectorBatch handles all of the integer types (boolean, bigint,
-date, int, smallint, and tinyint). The data is represented as a
-buffer of int64_t where each value is sign-extended as necessary.</p>
-
-<pre><code class="language-cpp"> struct LongVectorBatch: public ColumnVectorBatch {
- DataBuffer<int64_t> data;
- ...
- };
-</code></pre>
-
-<p>TimestampVectorBatch handles timestamp values. The data is
-represented as two buffers of int64_t for seconds and nanoseconds
-respectively. Note that we always assume data is in GMT timezone;
-therefore it is the user’s responsibility to convert wall clock time
-from local timezone to GMT.</p>
-
-<pre><code class="language-cpp"> struct TimestampVectorBatch: public ColumnVectorBatch {
- DataBuffer<int64_t> data;
- DataBuffer<int64_t> nanoseconds;
- ...
- };
-</code></pre>
-
-<p>DoubleVectorBatch handles all of the floating point types
-(double and float). The data is represented as a buffer of doubles.</p>
-
-<pre><code class="language-cpp"> struct DoubleVectorBatch: public ColumnVectorBatch {
- DataBuffer<double> data;
- ...
- };
-</code></pre>
-
-<p>Decimal64VectorBatch handles decimal columns with precision no
-greater than 18. Decimal128VectorBatch handles the others. The data
-is represented as a buffer of int64_t and orc::Int128 respectively.</p>
-
-<pre><code class="language-cpp"> struct Decimal64VectorBatch: public ColumnVectorBatch {
- DataBuffer<int64_t> values;
- ...
- };
-
- struct Decimal128VectorBatch: public ColumnVectorBatch {
- DataBuffer<Int128> values;
- ...
- };
-</code></pre>
-
-<p>StringVectorBatch handles all of the binary types (binary,
-char, string, and varchar). The data is represented as a char* buffer,
-and a length buffer.</p>
-
-<pre><code class="language-cpp"> struct StringVectorBatch: public ColumnVectorBatch {
- DataBuffer<char*> data;
- DataBuffer<int64_t> length;
- ...
- };
-</code></pre>
-
-<p>StructVectorBatch handles the struct columns and represents
-the data as a buffer of <code>ColumnVectorBatch</code>.</p>
-
-<pre><code class="language-cpp"> struct StructVectorBatch: public ColumnVectorBatch {
- std::vector<ColumnVectorBatch*> fields;
- ...
- };
-</code></pre>
+ <td>varchar</td>
+ <td>StringVectorBatch</td>
+ </tr>
+ </tbody>
+</table>
-<p>UnionVectorBatch handles the union columns. It uses <code>tags</code>
-to indicate which subtype has the value and <code>offsets</code> indicates
-the offset in the child batch of that subtype. An individual
-<code>ColumnVectorBatch</code> is used for each subtype.</p>
+<p>LongVectorBatch handles all of the integer types (boolean, bigint,
+date, int, smallint, and tinyint). The data is represented as a
+buffer of int64_t where each value is sign-extended as necessary.</p>
-<pre><code class="language-cpp"> struct UnionVectorBatch: public ColumnVectorBatch {
- DataBuffer<unsigned char> tags;
- DataBuffer<uint64_t> offsets;
- std::vector<ColumnVectorBatch*> children;
+<pre><code class="language-cpp"> struct LongVectorBatch: public ColumnVectorBatch {
+ DataBuffer<int64_t> data;
...
};
</code></pre>
-<p>ListVectorBatch handles the array columns and represents
-the data as a buffer of integers for the offsets and a
-<code>ColumnVectorBatch</code> for the children values.</p>
+<p>TimestampVectorBatch handles timestamp values. The data is
+represented as two buffers of int64_t for seconds and nanoseconds
+respectively. Note that we always assume data is in GMT timezone;
+therefore it is the user’s responsibility to convert wall clock time
+from local timezone to GMT.</p>
-<pre><code class="language-cpp"> struct ListVectorBatch: public ColumnVectorBatch {
- DataBuffer<int64_t> offsets;
- ORC_UNIQUE_PTR<ColumnVectorBatch> elements;
+<pre><code class="language-cpp"> struct TimestampVectorBatch: public ColumnVectorBatch {
+ DataBuffer<int64_t> data;
+ DataBuffer<int64_t> nanoseconds;
...
};
</code></pre>
-<p>MapVectorBatch handles the map columns and represents the data
-as a buffer of integers for the offsets and two <code>ColumnVectorBatch</code>es
-for the keys and values.</p>
+<p>DoubleVectorBatch handles all of the floating point types
+(double and float). The data is represented as a buffer of doubles.</p>
-<pre><code class="language-cpp"> struct MapVectorBatch: public ColumnVectorBatch {
- DataBuffer<int64_t> offsets;
- ORC_UNIQUE_PTR<ColumnVectorBatch> keys;
- ORC_UNIQUE_PTR<ColumnVectorBatch> elements;
+<pre><code class="language-cpp"> struct DoubleVectorBatch: public ColumnVectorBatch {
+ DataBuffer<double> data;
...
};
</code></pre>
-<h2 id="writing-orc-files">Writing ORC Files</h2>
-
-<p>To write an ORC file, you need to include <code>OrcFile.hh</code> and define
-the schema; then use <code>orc::OutputStream</code> and <code>orc::WriterOptions</code>
-to create an <code>orc::Writer</code> with the desired filename. This example
-sets the required schema parameter, but there are many other
-options to control the ORC writer.</p>
-
-<pre><code class="language-cpp">ORC_UNIQUE_PTR<OutputStream> outStream =
- writeLocalFile("my-file.orc");
-ORC_UNIQUE_PTR<Type> schema(
- Type::buildTypeFromString("struct<x:int,y:int>"));
-WriterOptions options;
-ORC_UNIQUE_PTR<Writer> writer =
- createWriter(*schema, outStream.get(), options);
-</code></pre>
-
-<p>Now you need to create a row batch, set the data, and write it to the file
-as the batch fills up. When the file is done, close the <code>Writer</code>.</p>
-
-<pre><code class="language-cpp">uint64_t batchSize = 1024, rowCount = 10000;
-ORC_UNIQUE_PTR<ColumnVectorBatch> batch =
- writer->createRowBatch(batchSize);
-StructVectorBatch *root =
- dynamic_cast<StructVectorBatch *>(batch.get());
-LongVectorBatch *x =
- dynamic_cast<LongVectorBatch *>(root->fields[0]);
-LongVectorBatch *y =
- dynamic_cast<LongVectorBatch *>(root->fields[1]);
-
-uint64_t rows = 0;
-for (uint64_t i = 0; i < rowCount; ++i) {
- x->data[rows] = i;
- y->data[rows] = i * 3;
- rows++;
-
- if (rows == batchSize) {
- root->numElements = rows;
- x->numElements = rows;
- y->numElements = rows;
-
- writer->add(*batch);
- rows = 0;
- }
-}
-
-if (rows != 0) {
- root->numElements = rows;
- x->numElements = rows;
- y->numElements = rows;
-
- writer->add(*batch);
- rows = 0;
-}
-
-writer->close();
-</code></pre>
-
-<h2 id="reading-orc-files">Reading ORC Files</h2>
-
-<p>To read ORC files, include the <code>OrcFile.hh</code> header and create an <code>orc::Reader</code>
-that contains the metadata about the file. There are a few options to
-the <code>orc::Reader</code>, but far fewer than the writer and none of them are
-required. The reader has methods for getting the number of rows,
-schema, compression, etc. from the file.</p>
-
-<pre><code class="language-cpp">ORC_UNIQUE_PTR<InputStream> inStream =
- readLocalFile("my-file.orc");
-ReaderOptions options;
-ORC_UNIQUE_PTR<Reader> reader =
- createReader(std::move(inStream), options);
-</code></pre>
-
-<p>To get the data, create an <code>orc::RowReader</code> object. By default,
-the RowReader reads all rows and all columns, but there are
-options to control the data that is read.</p>
-
-<pre><code class="language-cpp">RowReaderOptions rowReaderOptions;
-ORC_UNIQUE_PTR<RowReader> rowReader =
- reader->createRowReader(rowReaderOptions);
-ORC_UNIQUE_PTR<ColumnVectorBatch> batch =
- rowReader->createRowBatch(1024);
-</code></pre>
-
-<p>With an <code>orc::RowReader</code> the user can ask for the next batch until there
-are no more left. The reader will stop the batch at certain boundaries,
-so the returned batch may not be full, but it will always contain some rows.</p>
-
-<pre><code class="language-cpp">while (rowReader->next(*batch)) {
- for (uint64_t r = 0; r < batch->numElements; ++r) {
- // ... process row r from batch
- }
-}
-</code></pre>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <div class="section-nav">
- <div class="left align-right">
-
-
-
- <a href="/docs/core-java.html" class="prev">Back</a>
-
- </div>
- <div class="right align-left">
-
-
-
- <a href="/docs/cpp-tools.html" class="next">Next</a>
-
- </div>
- </div>
- <div class="clear"></div>
-
-
- </article>
- </div>
-
- <div class="unit one-fifth hide-on-mobiles">
- <aside>
-
- <h4>Overview</h4>
-
-
-<ul>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/index.html">Background</a></li>
-
-
-
-
+<p>Decimal64VectorBatch handles decimal columns with precision no
+greater than 18. Decimal128VectorBatch handles the others. The data
+is represented as a buffer of int64_t and orc::Int128 respectively.</p>
-
-
-
+<pre><code class="language-cpp"> struct Decimal64VectorBatch: public ColumnVectorBatch {
+ DataBuffer<int64_t> values;
+ ...
+ };
-
-
-
-
- <li class=""><a href="/docs/adopters.html">ORC Adopters</a></li>
-
+ struct Decimal128VectorBatch: public ColumnVectorBatch {
+ DataBuffer<Int128> values;
+ ...
+ };
+</code></pre>
+<p>StringVectorBatch handles all of the binary types (binary,
+char, string, and varchar). The data is represented as a char* buffer,
+and a length buffer.</p>
-
+<pre><code class="language-cpp"> struct StringVectorBatch: public ColumnVectorBatch {
+ DataBuffer<char*> data;
+ DataBuffer<int64_t> length;
+ ...
+ };
+</code></pre>
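A minimal sketch of reading one value from buffers shaped like the data and length members above. The function stringAt is a hypothetical helper that works on plain arrays rather than the real StringVectorBatch, and it assumes the bytes should not be treated as null-terminated, so the length buffer is always used.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: copy the value for one row out of a pointer buffer and
// a length buffer laid out like StringVectorBatch::data and ::length.
// The explicit length is used instead of relying on a terminating '\0'.
std::string stringAt(char* const* data, const int64_t* length, std::size_t row) {
  return std::string(data[row], static_cast<std::size_t>(length[row]));
}
```

Because the pointers may refer to memory owned by the batch, copying into a std::string like this is a safe default when the value must outlive the batch.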
-
-
-
+<p>StructVectorBatch handles the struct columns and represents
+the data as a buffer of <code>ColumnVectorBatch</code>.</p>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/types.html">Types</a></li>
-
+<pre><code class="language-cpp"> struct StructVectorBatch: public ColumnVectorBatch {
+ std::vector<ColumnVectorBatch*> fields;
+ ...
+ };
+</code></pre>
+<p>UnionVectorBatch handles the union columns. It uses <code>tags</code>
+to indicate which subtype has the value and <code>offsets</code> indicates
+the offset in the child batch of that subtype. An individual
+<code>ColumnVectorBatch</code> is used for each subtype.</p>
-
+<pre><code class="language-cpp"> struct UnionVectorBatch: public ColumnVectorBatch {
+ DataBuffer<unsigned char> tags;
+ DataBuffer<uint64_t> offsets;
+ std::vector<ColumnVectorBatch*> children;
+ ...
+ };
+</code></pre>
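To make the relationship between tags and offsets concrete, the sketch below derives per-row offsets from a tag buffer. The helper offsetsFromTags is hypothetical and uses std::vector in place of DataBuffer; it assumes each child batch is filled densely in row order, which is one possible layout rather than a requirement stated here.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical helper: for each row, tags[r] selects the child batch and the
// returned offsets[r] is the index of that row's value inside the child.
// Assumes every child is packed densely in row order.
std::vector<uint64_t> offsetsFromTags(const std::vector<unsigned char>& tags,
                                      std::size_t numChildren) {
  std::vector<uint64_t> nextSlot(numChildren, 0);  // next free index per child
  std::vector<uint64_t> offsets;
  offsets.reserve(tags.size());
  for (unsigned char tag : tags) {
    if (tag >= numChildren) {
      throw std::out_of_range("tag has no matching child");
    }
    offsets.push_back(nextSlot[tag]++);
  }
  return offsets;
}
```

A reader would then look up row r as the element at offsets[r] in the child selected by tags[r].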
-
-
-
+<p>ListVectorBatch handles the array columns and represents
+the data as a buffer of integers for the offsets and a
+<code>ColumnVectorBatch</code> for the children values.</p>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/indexes.html">Indexes</a></li>
-
+<pre><code class="language-cpp"> struct ListVectorBatch: public ColumnVectorBatch {
+ DataBuffer<int64_t> offsets;
+ ORC_UNIQUE_PTR<ColumnVectorBatch> elements;
+ ...
+ };
+</code></pre>
+<p>MapVectorBatch handles the map columns and represents the data
+as a buffer of integers for the offsets and two <code>ColumnVectorBatch</code>es
+for the keys and values.</p>
-
+<pre><code class="language-cpp"> struct MapVectorBatch: public ColumnVectorBatch {
+ DataBuffer<int64_t> offsets;
+ ORC_UNIQUE_PTR<ColumnVectorBatch> keys;
+ ORC_UNIQUE_PTR<ColumnVectorBatch> elements;
+ ...
+ };
+</code></pre>
-
-
-
+<h2 id="writing-orc-files">Writing ORC Files</h2>
-
-
- <li class=""><a href="/docs/acid.html">ACID support</a></li>
-
+<p>To write an ORC file, you need to include <code>OrcFile.hh</code> and define
+the schema; then use <code>orc::OutputStream</code> and <code>orc::WriterOptions</code>
+to create an <code>orc::Writer</code> with the desired filename. This example
+sets the required schema parameter, but there are many other
+options to control the ORC writer.</p>
+<pre><code class="language-cpp">ORC_UNIQUE_PTR<OutputStream> outStream =
+ writeLocalFile("my-file.orc");
+ORC_UNIQUE_PTR<Type> schema(
+ Type::buildTypeFromString("struct<x:int,y:int>"));
+WriterOptions options;
+ORC_UNIQUE_PTR<Writer> writer =
+ createWriter(*schema, outStream.get(), options);
+</code></pre>
-</ul>
+<p>Now you need to create a row batch, set the data, and write it to the file
+as the batch fills up. When the file is done, close the <code>Writer</code>.</p>
-
- <h4>Installing</h4>
-
+<pre><code class="language-cpp">uint64_t batchSize = 1024, rowCount = 10000;
+ORC_UNIQUE_PTR<ColumnVectorBatch> batch =
+ writer->createRowBatch(batchSize);
+StructVectorBatch *root =
+ dynamic_cast<StructVectorBatch *>(batch.get());
+LongVectorBatch *x =
+ dynamic_cast<LongVectorBatch *>(root->fields[0]);
+LongVectorBatch *y =
+ dynamic_cast<LongVectorBatch *>(root->fields[1]);
-<ul>
+uint64_t rows = 0;
+for (uint64_t i = 0; i < rowCount; ++i) {
+ x->data[rows] = i;
+ y->data[rows] = i * 3;
+ rows++;
-
+ if (rows == batchSize) {
+ root->numElements = rows;
+ x->numElements = rows;
+ y->numElements = rows;
-
-
-
+ writer->add(*batch);
+ rows = 0;
+ }
+}
-
-
-
-
-
-
- <li class=""><a href="/docs/building.html">Building ORC</a></li>
-
+if (rows != 0) {
+ root->numElements = rows;
+ x->numElements = rows;
+ y->numElements = rows;
+ writer->add(*batch);
+ rows = 0;
+}
-
+writer->close();
+</code></pre>
-
-
-
+<h2 id="reading-orc-files">Reading ORC Files</h2>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/releases.html">Releases</a></li>
-
+<p>To read ORC files, include the <code>OrcFile.hh</code> header and create an <code>orc::Reader</code>
+that contains the metadata about the file. There are a few options to
+the <code>orc::Reader</code>, but far fewer than the writer and none of them are
+required. The reader has methods for getting the number of rows,
+schema, compression, etc. from the file.</p>
+<pre><code class="language-cpp">ORC_UNIQUE_PTR<InputStream> inStream =
+ readLocalFile("my-file.orc");
+ReaderOptions options;
+ORC_UNIQUE_PTR<Reader> reader =
+ createReader(std::move(inStream), options);
+</code></pre>
+
+<p>To get the data, create an <code>orc::RowReader</code> object. By default,
+the RowReader reads all rows and all columns, but there are
+options to control the data that is read.</p>
+
+<pre><code class="language-cpp">RowReaderOptions rowReaderOptions;
+ORC_UNIQUE_PTR<RowReader> rowReader =
+ reader->createRowReader(rowReaderOptions);
+ORC_UNIQUE_PTR<ColumnVectorBatch> batch =
+ rowReader->createRowBatch(1024);
+</code></pre>
+
+<p>With an <code>orc::RowReader</code> the user can ask for the next batch until there
+are no more left. The reader will stop the batch at certain boundaries,
+so the returned batch may not be full, but it will always contain some rows.</p>
+
+<pre><code class="language-cpp">while (rowReader->next(*batch)) {
+ for (uint64_t r = 0; r < batch->numElements; ++r) {
+ // ... process row r from batch
+ }
+}
+</code></pre>
+
+
-</ul>
-
- <h4>Using in Hive</h4>
-
-<ul>
-
-
-
-
-
-
+
-
-
+
-
-
+
-
-
+
-
-
- <li class=""><a href="/docs/hive-ddl.html">Hive DDL</a></li>
-
-
+
-
-
-
+
-
-
+
-
-
+
-
-
+
-
+ <div class="section-nav">
+ <div class="left align-right">
+
+
+
+ <a href="/docs/core-java.html" class="prev">Back</a>
+
+ </div>
+ <div class="right align-left">
+
+
+
+ <a href="/docs/cpp-tools.html" class="next">Next</a>
+
+ </div>
+ </div>
+ <div class="clear"></div>
- <li class=""><a href="/docs/hive-config.html">Hive Configuration</a></li>
-
+ </article>
+ </div>
-</ul>
-
+ <div class="unit one-fifth hide-on-mobiles">
+ <aside>
- <h4>Using in MapReduce</h4>
+ <h4>Overview</h4>
<ul>
@@ -1943,19 +1080,21 @@ so the returned batch may not be full, but it will always contain some rows.</p>
+ <li class=""><a href="/docs/index.html">Background</a></li>
+
+
+
-
-
-
+
-
+
- <li class=""><a href="/docs/mapred.html">Using in MapRed</a></li>
+ <li class=""><a href="/docs/adopters.html">ORC Adopters</a></li>
@@ -1995,20 +1134,10 @@ so the returned batch may not be full, but it will always contain some rows.</p>
-
-
- <li class=""><a href="/docs/mapreduce.html">Using in MapReduce</a></li>
+ <li class=""><a href="/docs/types.html">Types</a></li>
-</ul>
-
-
- <h4>Using ORC Core</h4>
-
-
-<ul>
-
@@ -2027,34 +1156,34 @@ so the returned batch may not be full, but it will always contain some rows.</p>
- <li class=""><a href="/docs/core-java.html">Using Core Java</a></li>
-
-
-
-
-
-
-
+ <li class=""><a href="/docs/indexes.html">Indexes</a></li>
+
+
+
+
+
+
+
- <li class="current"><a href="/docs/core-cpp.html">Using Core C++</a></li>
+ <li class=""><a href="/docs/acid.html">ACID support</a></li>
</ul>
- <h4>Tools</h4>
+ <h4>Installing</h4>
<ul>
@@ -2071,15 +1200,7 @@ so the returned batch may not be full, but it will always contain some rows.</p>
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/cpp-tools.html">C++ Tools</a></li>
+ <li class=""><a href="/docs/building.html">Building ORC</a></li>
@@ -2117,14 +1238,14 @@ so the returned batch may not be full, but it will always contain some rows.</p>
- <li class=""><a href="/docs/java-tools.html">Java Tools</a></li>
+ <li class=""><a href="/docs/releases.html">Releases</a></li>
</ul>
- <h4>Format Specification</h4>
+ <h4>Using in Hive</h4>
<ul>
@@ -2151,31 +1272,7 @@ so the returned batch may not be full, but it will always contain some rows.</p>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/spec-intro.html">Introduction</a></li>
+ <li class=""><a href="/docs/hive-ddl.html">Hive DDL</a></li>
@@ -2199,31 +1296,17 @@ so the returned batch may not be full, but it will always contain some rows.</p>
-
-
-
-
- <li class=""><a href="/docs/file-tail.html">File Tail</a></li>
+ <li class=""><a href="/docs/hive-config.html">Hive Configuration</a></li>
-
-
-
-
-
+</ul>
-
-
-
-
-
-
+ <h4>Using in MapReduce</h4>
- <li class=""><a href="/docs/compression.html">Compression</a></li>
-
+<ul>
@@ -2255,19 +1338,7 @@ so the returned batch may not be full, but it will always contain some rows.</p>
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/run-length.html">Run Length Encoding</a></li>
+ <li class=""><a href="/docs/mapred.html">Using in MapRed</a></li>
@@ -2303,13 +1374,25 @@ so the returned batch may not be full, but it will always contain some rows.</p>
-
+ <li class=""><a href="/docs/mapreduce.html">Using in MapReduce</a></li>
+
+
+
+</ul>
+
-
+ <h4>Using ORC Core</h4>
+
+<ul>
+
+
+
+
+
@@ -2319,7 +1402,7 @@ so the returned batch may not be full, but it will always contain some rows.</p>
- <li class=""><a href="/docs/stripes.html">Stripes</a></li>
+ <li class=""><a href="/docs/core-java.html">Using Core Java</a></li>
@@ -2337,17 +1420,17 @@ so the returned batch may not be full, but it will always contain some rows.</p>
-
-
-
-
-
+ <li class="current"><a href="/docs/core-cpp.html">Using Core C++</a></li>
+
+
+
+</ul>
+
-
+ <h4>Tools</h4>
- <li class=""><a href="/docs/encodings.html">Column Encodings</a></li>
-
+<ul>
@@ -2367,11 +1450,17 @@ so the returned batch may not be full, but it will always contain some rows.</p>
+ <li class=""><a href="/docs/cpp-tools.html">C++ Tools</a></li>
+
+
+
-
+
+
+
@@ -2393,7 +1482,7 @@ so the returned batch may not be full, but it will always contain some rows.</p>
- <li class=""><a href="/docs/spec-index.html">Indexes</a></li>
+ <li class=""><a href="/docs/java-tools.html">Java Tools</a></li>
http://git-wip-us.apache.org/repos/asf/orc/blob/c6e29090/docs/core-java.html
----------------------------------------------------------------------
diff --git a/docs/core-java.html b/docs/core-java.html
index 196bf0d..ca4e99b 100644
--- a/docs/core-java.html
+++ b/docs/core-java.html
@@ -109,12 +109,6 @@
-
-
-
-
-
-
<option value="/docs/index.html">Background</option>
@@ -130,14 +124,6 @@
-
-
-
-
-
-
-
-
@@ -174,20 +160,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
-
-
@@ -221,20 +193,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
-
-
<option value="/docs/types.html">Types</option>
@@ -261,12 +219,6 @@
-
-
-
-
-
-
<option value="/docs/indexes.html">Indexes</option>
@@ -280,14 +232,6 @@
-
-
-
-
-
-
-
-
@@ -324,20 +268,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
-
-
</optgroup>
@@ -381,20 +311,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
-
-
@@ -426,25 +342,11 @@
-
-
-
-
-
-
<option value="/docs/releases.html">Releases</option>
-
-
-
-
-
-
-
-
</optgroup>
@@ -471,12 +373,6 @@
-
-
-
-
-
-
<option value="/docs/hive-ddl.html">Hive DDL</option>
@@ -494,14 +390,6 @@
-
-
-
-
-
-
-
-
@@ -519,12 +407,6 @@
-
-
-
-
-
-
<option value="/docs/hive-config.html">Hive Configuration</option>
@@ -544,14 +426,6 @@
-
-
-
-
-
-
-
-
</optgroup>
@@ -586,12 +460,6 @@
-
-
-
-
-
-
<option value="/docs/mapred.html">Using in MapRed</option>
@@ -601,14 +469,6 @@
-
-
-
-
-
-
-
-
@@ -638,12 +498,6 @@
-
-
-
-
-
-
<option value="/docs/mapreduce.html">Using in MapReduce</option>
@@ -651,14 +505,6 @@
-
-
-
-
-
-
-
-
</optgroup>
@@ -679,8 +525,6 @@
-
-
<option value="/docs/core-java.html">Using Core Java</option>
@@ -704,18 +548,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
@@ -727,8 +559,6 @@
-
-
<option value="/docs/core-cpp.html">Using Core C++</option>
@@ -754,18 +584,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
</optgroup>
@@ -788,8 +606,6 @@
-
-
<option value="/docs/cpp-tools.html">C++ Tools</option>
@@ -811,18 +627,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
@@ -848,12 +652,6 @@
-
-
-
-
-
-
<option value="/docs/java-tools.html">Java Tools</option>
@@ -865,385 +663,20 @@
-
-
-
-
-
-
-
-
</optgroup>
- <optgroup label="Format Specification">
-
+ </select>
+</div>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/spec-intro.html">Introduction</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/file-tail.html">File Tail</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/compression.html">Compression</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/run-length.html">Run Length Encoding</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/stripes.html">Stripes</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/encodings.html">Column Encodings</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/spec-index.html">Indexes</option>
-
-
-
-
-
-
-
-
-
-
- </optgroup>
-
- </select>
-</div>
-
-
- <div class="unit four-fifths">
- <article>
- <h1>Using Core Java</h1>
- <p>The Core ORC API reads and writes ORC files into Hive’s storage-api
-vectorized classes. Both Hive and MapReduce use the Core API to actually
-read and write the data.</p>
+ <div class="unit four-fifths">
+ <article>
+ <h1>Using Core Java</h1>
+ <p>The Core ORC API reads and writes ORC files into Hive’s storage-api
+vectorized classes. Both Hive and MapReduce use the Core API to actually
+read and write the data.</p>
<h2 id="vectorized-row-batch">Vectorized Row Batch</h2>
@@ -1646,289 +1079,63 @@ rows.close();
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <div class="section-nav">
- <div class="left align-right">
-
-
-
- <a href="/docs/mapreduce.html" class="prev">Back</a>
-
- </div>
- <div class="right align-left">
-
-
-
- <a href="/docs/core-cpp.html" class="next">Next</a>
-
- </div>
- </div>
- <div class="clear"></div>
-
-
- </article>
- </div>
-
- <div class="unit one-fifth hide-on-mobiles">
- <aside>
-
- <h4>Overview</h4>
-
-
-<ul>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/index.html">Background</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/adopters.html">ORC Adopters</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/types.html">Types</a></li>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/indexes.html">Indexes</a></li>
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/acid.html">ACID support</a></li>
-
-
-
-</ul>
-
-
- <h4>Installing</h4>
-
-
-<ul>
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/building.html">Building ORC</a></li>
-
-
-
-
+
-
-
-
+
-
-
+
-
-
+
-
-
+
-
-
+
-
-
+
-
-
+
-
-
+
+
+ <div class="section-nav">
+ <div class="left align-right">
+
+
+
+ <a href="/docs/mapreduce.html" class="prev">Back</a>
+
+ </div>
+ <div class="right align-left">
+
+
+
+ <a href="/docs/core-cpp.html" class="next">Next</a>
+
+ </div>
+ </div>
+ <div class="clear"></div>
- <li class=""><a href="/docs/releases.html">Releases</a></li>
-
+ </article>
+ </div>
-</ul>
-
+ <div class="unit one-fifth hide-on-mobiles">
+ <aside>
- <h4>Using in Hive</h4>
+ <h4>Overview</h4>
<ul>
@@ -1957,11 +1164,7 @@ rows.close();
-
-
-
-
- <li class=""><a href="/docs/hive-ddl.html">Hive DDL</a></li>
+ <li class=""><a href="/docs/index.html">Background</a></li>
@@ -1975,34 +1178,10 @@ rows.close();
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/hive-config.html">Hive Configuration</a></li>
+ <li class=""><a href="/docs/adopters.html">ORC Adopters</a></li>
-</ul>
-
-
- <h4>Using in MapReduce</h4>
-
-
-<ul>
-
@@ -2039,7 +1218,7 @@ rows.close();
- <li class=""><a href="/docs/mapred.html">Using in MapRed</a></li>
+ <li class=""><a href="/docs/types.html">Types</a></li>
@@ -2069,49 +1248,7 @@ rows.close();
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/mapreduce.html">Using in MapReduce</a></li>
-
-
-
-</ul>
-
-
- <h4>Using ORC Core</h4>
-
-
-<ul>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class="current"><a href="/docs/core-java.html">Using Core Java</a></li>
+ <li class=""><a href="/docs/indexes.html">Indexes</a></li>
@@ -2123,22 +1260,14 @@ rows.close();
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/core-cpp.html">Using Core C++</a></li>
+ <li class=""><a href="/docs/acid.html">ACID support</a></li>
</ul>
- <h4>Tools</h4>
+ <h4>Installing</h4>
<ul>
@@ -2155,15 +1284,7 @@ rows.close();
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/cpp-tools.html">C++ Tools</a></li>
+ <li class=""><a href="/docs/building.html">Building ORC</a></li>
@@ -2201,14 +1322,14 @@ rows.close();
- <li class=""><a href="/docs/java-tools.html">Java Tools</a></li>
+ <li class=""><a href="/docs/releases.html">Releases</a></li>
</ul>
- <h4>Format Specification</h4>
+ <h4>Using in Hive</h4>
<ul>
@@ -2235,31 +1356,7 @@ rows.close();
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/spec-intro.html">Introduction</a></li>
+ <li class=""><a href="/docs/hive-ddl.html">Hive DDL</a></li>
@@ -2283,31 +1380,17 @@ rows.close();
-
-
-
-
- <li class=""><a href="/docs/file-tail.html">File Tail</a></li>
+ <li class=""><a href="/docs/hive-config.html">Hive Configuration</a></li>
-
-
-
-
-
+</ul>
-
-
-
-
-
-
+ <h4>Using in MapReduce</h4>
- <li class=""><a href="/docs/compression.html">Compression</a></li>
-
+<ul>
@@ -2339,19 +1422,7 @@ rows.close();
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/run-length.html">Run Length Encoding</a></li>
+ <li class=""><a href="/docs/mapred.html">Using in MapRed</a></li>
@@ -2387,13 +1458,25 @@ rows.close();
-
+ <li class=""><a href="/docs/mapreduce.html">Using in MapReduce</a></li>
+
+
+
+</ul>
+
-
+ <h4>Using ORC Core</h4>
+
+<ul>
+
+
+
+
+
@@ -2403,7 +1486,7 @@ rows.close();
- <li class=""><a href="/docs/stripes.html">Stripes</a></li>
+ <li class="current"><a href="/docs/core-java.html">Using Core Java</a></li>
@@ -2421,17 +1504,17 @@ rows.close();
-
-
-
-
-
+ <li class=""><a href="/docs/core-cpp.html">Using Core C++</a></li>
+
+
+
+</ul>
+
-
+ <h4>Tools</h4>
- <li class=""><a href="/docs/encodings.html">Column Encodings</a></li>
-
+<ul>
@@ -2451,11 +1534,17 @@ rows.close();
+ <li class=""><a href="/docs/cpp-tools.html">C++ Tools</a></li>
+
+
+
-
+
+
+
@@ -2477,7 +1566,7 @@ rows.close();
- <li class=""><a href="/docs/spec-index.html">Indexes</a></li>
+ <li class=""><a href="/docs/java-tools.html">Java Tools</a></li>
http://git-wip-us.apache.org/repos/asf/orc/blob/c6e29090/docs/cpp-tools.html
----------------------------------------------------------------------
diff --git a/docs/cpp-tools.html b/docs/cpp-tools.html
index 171dc0d..abe6e2e 100644
--- a/docs/cpp-tools.html
+++ b/docs/cpp-tools.html
@@ -109,12 +109,6 @@
-
-
-
-
-
-
<option value="/docs/index.html">Background</option>
@@ -130,14 +124,6 @@
-
-
-
-
-
-
-
-
@@ -174,20 +160,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
-
-
@@ -221,20 +193,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
-
-
<option value="/docs/types.html">Types</option>
@@ -261,12 +219,6 @@
-
-
-
-
-
-
<option value="/docs/indexes.html">Indexes</option>
@@ -280,14 +232,6 @@
-
-
-
-
-
-
-
-
@@ -324,20 +268,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
-
-
</optgroup>
@@ -381,20 +311,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
-
-
@@ -426,25 +342,11 @@
-
-
-
-
-
-
<option value="/docs/releases.html">Releases</option>
-
-
-
-
-
-
-
-
</optgroup>
@@ -471,12 +373,6 @@
-
-
-
-
-
-
<option value="/docs/hive-ddl.html">Hive DDL</option>
@@ -494,14 +390,6 @@
-
-
-
-
-
-
-
-
@@ -519,12 +407,6 @@
-
-
-
-
-
-
<option value="/docs/hive-config.html">Hive Configuration</option>
@@ -544,14 +426,6 @@
-
-
-
-
-
-
-
-
</optgroup>
@@ -586,12 +460,6 @@
-
-
-
-
-
-
<option value="/docs/mapred.html">Using in MapRed</option>
@@ -601,14 +469,6 @@
-
-
-
-
-
-
-
-
@@ -638,12 +498,6 @@
-
-
-
-
-
-
<option value="/docs/mapreduce.html">Using in MapReduce</option>
@@ -651,14 +505,6 @@
-
-
-
-
-
-
-
-
</optgroup>
@@ -679,8 +525,6 @@
-
-
<option value="/docs/core-java.html">Using Core Java</option>
@@ -704,18 +548,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
@@ -727,8 +559,6 @@
-
-
<option value="/docs/core-cpp.html">Using Core C++</option>
@@ -754,18 +584,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
</optgroup>
@@ -788,8 +606,6 @@
-
-
<option value="/docs/cpp-tools.html">C++ Tools</option>
@@ -811,18 +627,6 @@
-
-
-
-
-
-
-
-
-
-
-
-
@@ -848,12 +652,6 @@
-
-
-
-
-
-
<option value="/docs/java-tools.html">Java Tools</option>
@@ -865,1004 +663,343 @@
-
-
-
-
-
-
-
-
</optgroup>
- <optgroup label="Format Specification">
-
+ </select>
+</div>
-
+ <div class="unit four-fifths">
+ <article>
+ <h1>C++ Tools</h1>
+ <h2 id="orc-contents">orc-contents</h2>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/spec-intro.html">Introduction</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/file-tail.html">File Tail</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/compression.html">Compression</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/run-length.html">Run Length Encoding</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/stripes.html">Stripes</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/encodings.html">Column Encodings</option>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <option value="/docs/spec-index.html">Indexes</option>
-
-
-
-
-
-
-
-
-
-
- </optgroup>
-
- </select>
-</div>
-
-
- <div class="unit four-fifths">
- <article>
- <h1>C++ Tools</h1>
- <h2 id="orc-contents">orc-contents</h2>
-
-<p>Displays the contents of the ORC file as a JSON document. With the
-<code>columns</code> argument only the selected columns are printed.</p>
-
-<pre><code class="language-shell">% orc-contents [--columns=1,2,...] <filename>
-</code></pre>
-
-<p>If you run it on the example file TestOrcFile.test1.orc, you’ll see (without
-the line breaks within each record):</p>
-
-<pre><code class="language-shell">% orc-contents examples/TestOrcFile.test1.orc
-{"boolean1": false, "byte1": 1, "short1": 1024, "int1": 65536, \\
- "long1": 9223372036854775807, "float1": 1, "double1": -15, \\
- "bytes1": [0, 1, 2, 3, 4], "string1": "hi", "middle": \\
- {"list": [{"int1": 1, "string1": "bye"}, \\
- {"int1": 2, "string1": "sigh"}]}, \\
- "list": [{"int1": 3, "string1": "good"}, \\
- {"int1": 4, "string1": "bad"}], \\
- "map": []}
-{"boolean1": true, "byte1": 100, "short1": 2048, "int1": 65536, \\
- "long1": 9223372036854775807, "float1": 2, "double1": -5, \\
- "bytes1": [], "string1": "bye", \\
- "middle": {"list": [{"int1": 1, "string1": "bye"}, \\
- {"int1": 2, "string1": "sigh"}]}, \\
- "list": [{"int1": 100000000, "string1": "cat"}, \\
- {"int1": -100000, "string1": "in"}, \\
- {"int1": 1234, "string1": "hat"}], \\
- "map": [{"key": "chani", "value": {"int1": 5, "string1": "chani"}}, \\
- {"key": "mauddib", \\
- "value": {"int1": 1, "string1": "mauddib"}}]}
-</code></pre>
-
-<h2 id="orc-metadata">orc-metadata</h2>
-
-<p>Displays the metadata of the ORC file as a JSON document. With the
-<code>verbose</code> option additional information about the layout of the file
-is also printed.</p>
-
-<p>When diagnosing problems, use the <code>--raw</code> option, which
-prints the protocol buffers from the ORC file directly rather than
-interpreting them.</p>
-
-<pre><code class="language-shell">% orc-metadata [-v] [--raw] <filename>
-</code></pre>
-
-<p>If you run it on the example file TestOrcFile.test1.orc, you’ll see:</p>
-
-<pre><code class="language-shell">% orc-metadata examples/TestOrcFile.test1.orc
-{ "name": "../examples/TestOrcFile.test1.orc",
- "type": "struct<boolean1:boolean,byte1:tinyint,short1:smallint,
-int1:int,long1:bigint,float1:float,double1:double,bytes1:binary,
-string1:string,middle:struct<list:array<struct<int1:int,string1:
-string>>>,list:array<struct<int1:int,string1:string>>,map:map<
-string,struct<int1:int,string1:string>>>",
- "rows": 2,
- "stripe count": 1,
- "format": "0.12", "writer version": "HIVE-8732",
- "compression": "zlib", "compression block": 10000,
- "file length": 1711,
- "content": 1015, "stripe stats": 250, "footer": 421, "postscript": 24,
- "row index stride": 10000,
- "user metadata": {
- },
- "stripes": [
- { "stripe": 0, "rows": 2,
- "offset": 3, "length": 1012,
- "index": 570, "data": 243, "footer": 199
- }
- ]
-}
-</code></pre>
-
-<h2 id="csv-import">csv-import</h2>
-
-<p>Imports a CSV file into an ORC file using the specified schema.
-Compound types are not yet supported. The <code>delimiter</code> option sets
-the delimiter of the input CSV file; the default is <code>,</code>. The <code>stripe</code>
-option sets the stripe size, which is 128MB by default. The <code>block</code>
-option sets the compression block size, which is 64KB by default. The <code>batch</code>
-option sets the number of rows per batch, which is 1024 by default.</p>
-
-<pre><code class="language-shell">% csv-import [--delimiter=<character>] [--stripe=<size>]
- [--block=<size>] [--batch=<size>]
- <schema> <inputCSVFile> <outputORCFile>
-</code></pre>
-
-<p>If you run it on the example file TestCSVFileImport.test10rows.csv,
-you’ll see:</p>
-
-<pre><code class="language-shell">% csv-import "struct<a:bigint,b:string,c:double>"
- examples/TestCSVFileImport.test10rows.csv /tmp/test.orc
-[2018-04-11 11:12:16] Start importing Orc file...
-[2018-04-11 11:12:16] Finish importing Orc file.
-[2018-04-11 11:12:16] Total writer elasped time: 0.001352s.
-[2018-04-11 11:12:16] Total writer CPU time: 0.001339s.
-</code></pre>
-
-<h2 id="orc-scan">orc-scan</h2>
-
-<p>Scans the ORC file and displays its row count. The <code>batch</code> option
-sets the batch size, which is 1024 rows by default. It is useful for checking
-whether the ORC file is damaged.</p>
-
-<pre><code class="language-shell">% orc-scan [--batch=<size>] <filename>
-</code></pre>
-
-<p>If you run it on the example file TestOrcFile.test1.orc, you’ll see:</p>
-
-<pre><code class="language-shell">% orc-scan examples/TestOrcFile.test1.orc
-Rows: 2
-Batches: 1
-</code></pre>
-
-<h2 id="orc-statistics">orc-statistics</h2>
-
-<p>Displays the file-level and stripe-level column statistics of the ORC file.
-With the <code>withIndex</code> option, the column statistics of each row group are also included.</p>
-
-<pre><code class="language-shell">% orc-statistics [--withIndex] <filename>
-</code></pre>
-
-<p>If you run it on the example file TestOrcFile.columnProjection.orc,
-you’ll see:</p>
-
-<pre><code class="language-shell">% orc-statistics examples/TestOrcFile.columnProjection.orc
-File examples/TestOrcFile.columnProjection.orc has 3 columns
-*** Column 0 ***
-Column has 21000 values and has null value: no
-
-*** Column 1 ***
-Data type: Integer
-Values: 21000
-Has null: no
-Minimum: -2147439072
-Maximum: 2147257982
-Sum: 268482658568
-
-*** Column 2 ***
-Data type: String
-Values: 21000
-Has null: no
-Minimum: 100119c272d7db89
-Maximum: fffe9f6f23b287f3
-Total length: 334559
-
-File examples/TestOrcFile.columnProjection.orc has 5 stripes
-*** Stripe 0 ***
-
---- Column 0 ---
-Column has 5000 values and has null value: no
-
---- Column 1 ---
-Data type: Integer
-Values: 5000
-Has null: no
-Minimum: -2145365268
-Maximum: 2147025027
-Sum: -29841423854
-
---- Column 2 ---
-Data type: String
-Values: 5000
-Has null: no
-Minimum: 1005350489418be2
-Maximum: fffbb8718c92b09f
-Total length: 79644
-
-*** Stripe 1 ***
-
---- Column 0 ---
-Column has 5000 values and has null value: no
-
---- Column 1 ---
-Data type: Integer
-Values: 5000
-Has null: no
-Minimum: -2147115959
-Maximum: 2147257982
-Sum: 108604887785
-
---- Column 2 ---
-Data type: String
-Values: 5000
-Has null: no
-Minimum: 100119c272d7db89
-Maximum: fff0ae41d41e6afc
-Total length: 79640
-
-*** Stripe 2 ***
-
---- Column 0 ---
-Column has 5000 values and has null value: no
-
---- Column 1 ---
-Data type: Integer
-Values: 5000
-Has null: no
-Minimum: -2145932387
-Maximum: 2145877119
-Sum: 70064190848
-
---- Column 2 ---
-Data type: String
-Values: 5000
-Has null: no
-Minimum: 10130af874ae036c
-Maximum: fffe9f6f23b287f3
-Total length: 79645
-
-*** Stripe 3 ***
-
---- Column 0 ---
-Column has 5000 values and has null value: no
-
---- Column 1 ---
-Data type: Integer
-Values: 5000
-Has null: no
-Minimum: -2147439072
-Maximum: 2147074354
-Sum: 104681356482
-
---- Column 2 ---
-Data type: String
-Values: 5000
-Has null: no
-Minimum: 102547d48ed06518
-Maximum: fffa47c57dc7b69a
-Total length: 79689
-
-*** Stripe 4 ***
-
---- Column 0 ---
-Column has 1000 values and has null value: no
-
---- Column 1 ---
-Data type: Integer
-Values: 1000
-Has null: no
-Minimum: -2141222223
-Maximum: 2145816096
-Sum: 14973647307
+<p>Displays the contents of the ORC file as a JSON document. With the
+<code>columns</code> argument only the selected columns are printed.</p>
---- Column 2 ---
-Data type: String
-Values: 1000
-Has null: no
-Minimum: 1059d81c9025a217
-Maximum: ffc17f0e35e1a6c0
-Total length: 15941
+<pre><code class="language-shell">% orc-contents [--columns=1,2,...] <filename>
</code></pre>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
+<p>If you run it on the example file TestOrcFile.test1.orc, you’ll see (without
+the line breaks within each record):</p>
-
-
+<pre><code class="language-shell">% orc-contents examples/TestOrcFile.test1.orc
+{"boolean1": false, "byte1": 1, "short1": 1024, "int1": 65536, \\
+ "long1": 9223372036854775807, "float1": 1, "double1": -15, \\
+ "bytes1": [0, 1, 2, 3, 4], "string1": "hi", "middle": \\
+ {"list": [{"int1": 1, "string1": "bye"}, \\
+ {"int1": 2, "string1": "sigh"}]}, \\
+ "list": [{"int1": 3, "string1": "good"}, \\
+ {"int1": 4, "string1": "bad"}], \\
+ "map": []}
+{"boolean1": true, "byte1": 100, "short1": 2048, "int1": 65536, \\
+ "long1": 9223372036854775807, "float1": 2, "double1": -5, \\
+ "bytes1": [], "string1": "bye", \\
+ "middle": {"list": [{"int1": 1, "string1": "bye"}, \\
+ {"int1": 2, "string1": "sigh"}]}, \\
+ "list": [{"int1": 100000000, "string1": "cat"}, \\
+ {"int1": -100000, "string1": "in"}, \\
+ {"int1": 1234, "string1": "hat"}], \\
+ "map": [{"key": "chani", "value": {"int1": 5, "string1": "chani"}}, \\
+ {"key": "mauddib", \\
+ "value": {"int1": 1, "string1": "mauddib"}}]}
+</code></pre>
-
-
+<h2 id="orc-metadata">orc-metadata</h2>
-
-
+<p>Displays the metadata of the ORC file as a JSON document. With the
+<code>verbose</code> option additional information about the layout of the file
+is also printed.</p>
-
-
+<p>When diagnosing problems, use the <code>--raw</code> option, which
+prints the protocol buffers from the ORC file directly rather than
+interpreting them.</p>
-
-
+<pre><code class="language-shell">% orc-metadata [-v] [--raw] <filename>
+</code></pre>
-
-
- <div class="section-nav">
- <div class="left align-right">
-
-
-
- <a href="/docs/core-cpp.html" class="prev">Back</a>
-
- </div>
- <div class="right align-left">
-
-
-
- <a href="/docs/java-tools.html" class="next">Next</a>
-
- </div>
- </div>
- <div class="clear"></div>
-
+<p>If you run it on the example file TestOrcFile.test1.orc, you’ll see:</p>
- </article>
- </div>
+<pre><code class="language-shell">% orc-metadata examples/TestOrcFile.test1.orc
+{ "name": "../examples/TestOrcFile.test1.orc",
+ "type": "struct<boolean1:boolean,byte1:tinyint,short1:smallint,
+int1:int,long1:bigint,float1:float,double1:double,bytes1:binary,
+string1:string,middle:struct<list:array<struct<int1:int,string1:
+string>>>,list:array<struct<int1:int,string1:string>>,map:map<
+string,struct<int1:int,string1:string>>>",
+ "rows": 2,
+ "stripe count": 1,
+ "format": "0.12", "writer version": "HIVE-8732",
+ "compression": "zlib", "compression block": 10000,
+ "file length": 1711,
+ "content": 1015, "stripe stats": 250, "footer": 421, "postscript": 24,
+ "row index stride": 10000,
+ "user metadata": {
+ },
+ "stripes": [
+ { "stripe": 0, "rows": 2,
+ "offset": 3, "length": 1012,
+ "index": 570, "data": 243, "footer": 199
+ }
+ ]
+}
+</code></pre>
- <div class="unit one-fifth hide-on-mobiles">
- <aside>
-
- <h4>Overview</h4>
-
+<h2 id="csv-import">csv-import</h2>
-<ul>
+<p>Imports a CSV file into an ORC file using the specified schema.
+Compound types are not yet supported. The <code>delimiter</code> option sets
+the delimiter of the input CSV file; the default is <code>,</code>. The <code>stripe</code>
+option sets the stripe size, which is 128MB by default. The <code>block</code>
+option sets the compression block size, which is 64KB by default. The <code>batch</code>
+option sets the number of rows per batch, which is 1024 by default.</p>
-
+<pre><code class="language-shell">% csv-import [--delimiter=<character>] [--stripe=<size>]
+ [--block=<size>] [--batch=<size>]
+ <schema> <inputCSVFile> <outputORCFile>
+</code></pre>
-
-
-
+<p>If you run it on the example file TestCSVFileImport.test10rows.csv,
+you’ll see:</p>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/index.html">Background</a></li>
-
+<pre><code class="language-shell">% csv-import "struct<a:bigint,b:string,c:double>"
+ examples/TestCSVFileImport.test10rows.csv /tmp/test.orc
+[2018-04-11 11:12:16] Start importing Orc file...
+[2018-04-11 11:12:16] Finish importing Orc file.
+[2018-04-11 11:12:16] Total writer elasped time: 0.001352s.
+[2018-04-11 11:12:16] Total writer CPU time: 0.001339s.
+</code></pre>
+<h2 id="orc-scan">orc-scan</h2>
-
+<p>Scans the ORC file and displays its row count. The <code>batch</code> option
+sets the batch size, which is 1024 rows by default. It is useful for checking
+whether the ORC file is damaged.</p>
-
-
-
+<pre><code class="language-shell">% orc-scan [--batch=<size>] <filename>
+</code></pre>
-
-
-
-
- <li class=""><a href="/docs/adopters.html">ORC Adopters</a></li>
-
+<p>If you run it on the example file TestOrcFile.test1.orc, you’ll see:</p>
+<pre><code class="language-shell">% orc-scan examples/TestOrcFile.test1.orc
+Rows: 2
+Batches: 1
+</code></pre>
-
+<h2 id="orc-statistics">orc-statistics</h2>
-
-
-
+<p>Displays the file-level and stripe-level column statistics of the ORC file.
+With the <code>withIndex</code> option, the column statistics of each row group are also included.</p>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/types.html">Types</a></li>
-
+<pre><code class="language-shell">% orc-statistics [--withIndex] <filename>
+</code></pre>
+<p>If you run it on the example file TestOrcFile.columnProjection.orc,
+you’ll see:</p>
-
+<pre><code class="language-shell">% orc-statistics examples/TestOrcFile.columnProjection.orc
+File examples/TestOrcFile.columnProjection.orc has 3 columns
+*** Column 0 ***
+Column has 21000 values and has null value: no
-
-
-
+*** Column 1 ***
+Data type: Integer
+Values: 21000
+Has null: no
+Minimum: -2147439072
+Maximum: 2147257982
+Sum: 268482658568
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/indexes.html">Indexes</a></li>
-
+*** Column 2 ***
+Data type: String
+Values: 21000
+Has null: no
+Minimum: 100119c272d7db89
+Maximum: fffe9f6f23b287f3
+Total length: 334559
+File examples/TestOrcFile.columnProjection.orc has 5 stripes
+*** Stripe 0 ***
-
+--- Column 0 ---
+Column has 5000 values and has null value: no
-
-
-
+--- Column 1 ---
+Data type: Integer
+Values: 5000
+Has null: no
+Minimum: -2145365268
+Maximum: 2147025027
+Sum: -29841423854
-
-
- <li class=""><a href="/docs/acid.html">ACID support</a></li>
-
+--- Column 2 ---
+Data type: String
+Values: 5000
+Has null: no
+Minimum: 1005350489418be2
+Maximum: fffbb8718c92b09f
+Total length: 79644
+*** Stripe 1 ***
-</ul>
+--- Column 0 ---
+Column has 5000 values and has null value: no
-
- <h4>Installing</h4>
-
+--- Column 1 ---
+Data type: Integer
+Values: 5000
+Has null: no
+Minimum: -2147115959
+Maximum: 2147257982
+Sum: 108604887785
-<ul>
+--- Column 2 ---
+Data type: String
+Values: 5000
+Has null: no
+Minimum: 100119c272d7db89
+Maximum: fff0ae41d41e6afc
+Total length: 79640
-
+*** Stripe 2 ***
-
-
-
+--- Column 0 ---
+Column has 5000 values and has null value: no
-
-
-
-
-
-
- <li class=""><a href="/docs/building.html">Building ORC</a></li>
-
+--- Column 1 ---
+Data type: Integer
+Values: 5000
+Has null: no
+Minimum: -2145932387
+Maximum: 2145877119
+Sum: 70064190848
+--- Column 2 ---
+Data type: String
+Values: 5000
+Has null: no
+Minimum: 10130af874ae036c
+Maximum: fffe9f6f23b287f3
+Total length: 79645
-
+*** Stripe 3 ***
-
-
-
+--- Column 0 ---
+Column has 5000 values and has null value: no
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/releases.html">Releases</a></li>
-
+--- Column 1 ---
+Data type: Integer
+Values: 5000
+Has null: no
+Minimum: -2147439072
+Maximum: 2147074354
+Sum: 104681356482
+--- Column 2 ---
+Data type: String
+Values: 5000
+Has null: no
+Minimum: 102547d48ed06518
+Maximum: fffa47c57dc7b69a
+Total length: 79689
+
+*** Stripe 4 ***
+
+--- Column 0 ---
+Column has 1000 values and has null value: no
+
+--- Column 1 ---
+Data type: Integer
+Values: 1000
+Has null: no
+Minimum: -2141222223
+Maximum: 2145816096
+Sum: 14973647307
+
+--- Column 2 ---
+Data type: String
+Values: 1000
+Has null: no
+Minimum: 1059d81c9025a217
+Maximum: ffc17f0e35e1a6c0
+Total length: 15941
+</code></pre>
+
+
-</ul>
-
- <h4>Using in Hive</h4>
-
-<ul>
-
-
-
-
-
-
+
-
-
+
-
-
+
-
-
+
-
-
- <li class=""><a href="/docs/hive-ddl.html">Hive DDL</a></li>
-
+
+
+
-
-
-
+
-
-
+
-
-
+
-
-
+
-
+ <div class="section-nav">
+ <div class="left align-right">
+
+
+
+ <a href="/docs/core-cpp.html" class="prev">Back</a>
+
+ </div>
+ <div class="right align-left">
+
+
+
+ <a href="/docs/java-tools.html" class="next">Next</a>
+
+ </div>
+ </div>
+ <div class="clear"></div>
- <li class=""><a href="/docs/hive-config.html">Hive Configuration</a></li>
-
+ </article>
+ </div>
-</ul>
-
+ <div class="unit one-fifth hide-on-mobiles">
+ <aside>
- <h4>Using in MapReduce</h4>
+ <h4>Overview</h4>
<ul>
@@ -1891,19 +1028,21 @@ Total length: 15941
+ <li class=""><a href="/docs/index.html">Background</a></li>
+
+
+
-
-
-
+
-
+
- <li class=""><a href="/docs/mapred.html">Using in MapRed</a></li>
+ <li class=""><a href="/docs/adopters.html">ORC Adopters</a></li>
@@ -1943,20 +1082,10 @@ Total length: 15941
-
-
- <li class=""><a href="/docs/mapreduce.html">Using in MapReduce</a></li>
+ <li class=""><a href="/docs/types.html">Types</a></li>
-</ul>
-
-
- <h4>Using ORC Core</h4>
-
-
-<ul>
-
@@ -1975,34 +1104,34 @@ Total length: 15941
- <li class=""><a href="/docs/core-java.html">Using Core Java</a></li>
-
-
-
-
-
-
-
+ <li class=""><a href="/docs/indexes.html">Indexes</a></li>
+
+
+
+
+
+
+
- <li class=""><a href="/docs/core-cpp.html">Using Core C++</a></li>
+ <li class=""><a href="/docs/acid.html">ACID support</a></li>
</ul>
- <h4>Tools</h4>
+ <h4>Installing</h4>
<ul>
@@ -2019,15 +1148,7 @@ Total length: 15941
-
-
-
-
-
-
-
-
- <li class="current"><a href="/docs/cpp-tools.html">C++ Tools</a></li>
+ <li class=""><a href="/docs/building.html">Building ORC</a></li>
@@ -2065,14 +1186,14 @@ Total length: 15941
- <li class=""><a href="/docs/java-tools.html">Java Tools</a></li>
+ <li class=""><a href="/docs/releases.html">Releases</a></li>
</ul>
- <h4>Format Specification</h4>
+ <h4>Using in Hive</h4>
<ul>
@@ -2099,31 +1220,7 @@ Total length: 15941
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/spec-intro.html">Introduction</a></li>
+ <li class=""><a href="/docs/hive-ddl.html">Hive DDL</a></li>
@@ -2147,31 +1244,17 @@ Total length: 15941
-
-
-
-
- <li class=""><a href="/docs/file-tail.html">File Tail</a></li>
+ <li class=""><a href="/docs/hive-config.html">Hive Configuration</a></li>
-
-
-
-
-
+</ul>
-
-
-
-
-
-
+ <h4>Using in MapReduce</h4>
- <li class=""><a href="/docs/compression.html">Compression</a></li>
-
+<ul>
@@ -2203,19 +1286,7 @@ Total length: 15941
-
-
-
-
-
-
-
-
-
-
-
-
- <li class=""><a href="/docs/run-length.html">Run Length Encoding</a></li>
+ <li class=""><a href="/docs/mapred.html">Using in MapRed</a></li>
@@ -2251,13 +1322,25 @@ Total length: 15941
-
+ <li class=""><a href="/docs/mapreduce.html">Using in MapReduce</a></li>
+
+
+
+</ul>
+
-
+ <h4>Using ORC Core</h4>
+
+<ul>
+
+
+
+
+
@@ -2267,7 +1350,7 @@ Total length: 15941
- <li class=""><a href="/docs/stripes.html">Stripes</a></li>
+ <li class=""><a href="/docs/core-java.html">Using Core Java</a></li>
@@ -2285,17 +1368,17 @@ Total length: 15941
-
-
-
-
-
+ <li class=""><a href="/docs/core-cpp.html">Using Core C++</a></li>
+
+
+
+</ul>
+
-
+ <h4>Tools</h4>
- <li class=""><a href="/docs/encodings.html">Column Encodings</a></li>
-
+<ul>
@@ -2315,11 +1398,17 @@ Total length: 15941
+ <li class="current"><a href="/docs/cpp-tools.html">C++ Tools</a></li>
+
+
+
-
+
+
+
@@ -2341,7 +1430,7 @@ Total length: 15941
- <li class=""><a href="/docs/spec-index.html">Indexes</a></li>
+ <li class=""><a href="/docs/java-tools.html">Java Tools</a></li>