You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@orc.apache.org by om...@apache.org on 2017/07/24 17:49:49 UTC

[47/51] [partial] orc git commit: ORC-204 Update and use CMake External Project to build C++ compression libraries.

http://git-wip-us.apache.org/repos/asf/orc/blob/590245a0/c++/libs/lz4-r131/lz4_Frame_format.md
----------------------------------------------------------------------
diff --git a/c++/libs/lz4-r131/lz4_Frame_format.md b/c++/libs/lz4-r131/lz4_Frame_format.md
deleted file mode 100644
index 73d3cba..0000000
--- a/c++/libs/lz4-r131/lz4_Frame_format.md
+++ /dev/null
@@ -1,385 +0,0 @@
-LZ4 Frame Format Description
-============================
-
-###Notices
-
-Copyright (c) 2013-2015 Yann Collet
-
-Permission is granted to copy and distribute this document 
-for any  purpose and without charge, 
-including translations into other  languages 
-and incorporation into compilations, 
-provided that the copyright notice and this notice are preserved, 
-and that any substantive changes or deletions from the original 
-are clearly marked.
-Distribution of this document is unlimited.
-
-###Version
-
-1.5.1 (31/03/2015)
-
-
-Introduction
-------------
-
-The purpose of this document is to define a lossless compressed data format, 
-that is independent of CPU type, operating system, 
-file system and character set, suitable for 
-File compression, Pipe and streaming compression 
-using the [LZ4 algorithm](http://www.lz4.info).
-
-The data can be produced or consumed, 
-even for an arbitrarily long sequentially presented input data stream,
-using only an a priori bounded amount of intermediate storage,
-and hence can be used in data communications.
-The format uses the LZ4 compression method,
-and optional [xxHash-32 checksum method](https://github.com/Cyan4973/xxHash),
-for detection of data corruption.
-
-The data format defined by this specification 
-does not attempt to allow random access to compressed data.
-
-This specification is intended for use by implementers of software
-to compress data into LZ4 format and/or decompress data from LZ4 format.
-The text of the specification assumes a basic background in programming
-at the level of bits and other primitive data representations.
-
-Unless otherwise indicated below,
-a compliant compressor must produce data sets
-that conform to the specifications presented here.
-It doesn’t need to support all options though.
-
-A compliant decompressor must be able to decompress
-at least one working set of parameters
-that conforms to the specifications presented here.
-It may also ignore checksums.
-Whenever it does not support a specific parameter within the compressed stream,
-it must produce a non-ambiguous error code
-and associated error message explaining which parameter is unsupported.
-
-
-General Structure of LZ4 Frame format
--------------------------------------
-
-| MagicNb | F. Descriptor | Block | (...) | EndMark | C. Checksum |
-|:-------:|:-------------:| ----- | ----- | ------- | ----------- |
-| 4 bytes |  3-11 bytes   |       |       | 4 bytes |   4 bytes   | 
-
-__Magic Number__
-
-4 Bytes, Little endian format.
-Value : 0x184D2204
-
-__Frame Descriptor__
-
-3 to 11 Bytes, to be detailed in the next part.
-Most important part of the spec.
-
-__Data Blocks__
-
-To be detailed later on.
-That’s where compressed data is stored.
-
-__EndMark__
-
-The flow of blocks ends when the last data block has a size of “0”.
-The size is expressed as a 32-bits value.
-
-__Content Checksum__
-
-Content Checksum verify that the full content has been decoded correctly.
-The content checksum is the result 
-of [xxh32() hash function](https://github.com/Cyan4973/xxHash)
-digesting the original (decoded) data as input, and a seed of zero.
-Content checksum is only present when its associated flag
-is set in the frame descriptor. 
-Content Checksum validates the result,
-that all blocks were fully transmitted in the correct order and without error,
-and also that the encoding/decoding process itself generated no distortion.
-Its usage is recommended.
-
-__Frame Concatenation__
-
-In some circumstances, it may be preferable to append multiple frames,
-for example in order to add new data to an existing compressed file
-without re-framing it.
-
-In such case, each frame has its own set of descriptor flags.
-Each frame is considered independent.
-The only relation between frames is their sequential order.
-
-The ability to decode multiple concatenated frames 
-within a single stream or file
-is left outside of this specification. 
-As an example, the reference lz4 command line utility behavior is
-to decode all concatenated frames in their sequential order.
-
- 
-Frame Descriptor
-----------------
-
-| FLG     | BD      | (Content Size) | HC      |
-| ------- | ------- |:--------------:| ------- |
-| 1 byte  | 1 byte  |  0 - 8 bytes   | 1 byte  | 
-
-The descriptor uses a minimum of 3 bytes,
-and up to 11 bytes depending on optional parameters.
-
-__FLG byte__
-
-|  BitNb  |   7-6   |    5    |     4     |   3     |     2     |    1-0   |
-| ------- | ------- | ------- | --------- | ------- | --------- | -------- |
-|FieldName| Version | B.Indep | B.Checksum| C.Size  | C.Checksum|*Reserved*|
-
-
-__BD byte__
-
-|  BitNb  |     7    |     6-5-4    |  3-2-1-0 |
-| ------- | -------- | ------------ | -------- |
-|FieldName|*Reserved*| Block MaxSize|*Reserved*|
-
-In the tables, bit 7 is highest bit, while bit 0 is lowest.
-
-__Version Number__
-
-2-bits field, must be set to “01”.
-Any other value cannot be decoded by this version of the specification.
-Other version numbers will use different flag layouts.
-
-__Block Independence flag__
-
-If this flag is set to “1”, blocks are independent. 
-If this flag is set to “0”, each block depends on previous ones
-(up to LZ4 window size, which is 64 KB).
-In such case, it’s necessary to decode all blocks in sequence.
-
-Block dependency improves compression ratio, especially for small blocks.
-On the other hand, it makes direct jumps or multi-threaded decoding impossible.
-
-__Block checksum flag__
-
-If this flag is set, each data block will be followed by a 4-bytes checksum,
-calculated by using the xxHash-32 algorithm on the raw (compressed) data block.
-The intention is to detect data corruption (storage or transmission errors) 
-immediately, before decoding.
-Block checksum usage is optional.
-
-__Content Size flag__
-
-If this flag is set, the uncompressed size of data included within the frame 
-will be present as an 8 bytes unsigned little endian value, after the flags.
-Content Size usage is optional.
-
-__Content checksum flag__
-
-If this flag is set, a content checksum will be appended after the EndMark.
-
-Recommended value : “1” (content checksum is present)
-
-__Block Maximum Size__
-
-This information is intended to help the decoder allocate memory.
-Size here refers to the original (uncompressed) data size.
-Block Maximum Size is one value among the following table :
-
-|  0  |  1  |  2  |  3  |   4   |   5    |  6   |  7   | 
-| --- | --- | --- | --- | ----- | ------ | ---- | ---- | 
-| N/A | N/A | N/A | N/A | 64 KB | 256 KB | 1 MB | 4 MB | 
-
-The decoder may refuse to allocate block sizes above a (system-specific) size.
-Unused values may be used in a future revision of the spec.
-A decoder conformant to the current version of the spec
-is only able to decode blocksizes defined in this spec.
-
-__Reserved bits__
-
-Value of reserved bits **must** be 0 (zero).
-Reserved bit might be used in a future version of the specification,
-typically enabling new optional features.
-If this happens, a decoder respecting the current version of the specification
-shall not be able to decode such a frame.
-
-__Content Size__
-
-This is the original (uncompressed) size.
-This information is optional, and only present if the associated flag is set.
-Content size is provided using unsigned 8 Bytes, for a maximum of 16 HexaBytes.
-Format is Little endian.
-This value is informational, typically for display or memory allocation.
-It can be skipped by a decoder, or used to validate content correctness.
-
-__Header Checksum__
-
-One-byte checksum of combined descriptor fields, including optional ones.
-The value is the second byte of xxh32() : ` (xxh32()>>8) & 0xFF `
-using zero as a seed,
-and the full Frame Descriptor as an input
-(including optional fields when they are present).
-A wrong checksum indicates an error in the descriptor.
-Header checksum is informational and can be skipped.
-
-
-Data Blocks
------------
-
-| Block Size |  data  | (Block Checksum) |
-|:----------:| ------ |:----------------:|
-|  4 bytes   |        |   0 - 4 bytes    | 
-
-
-__Block Size__
-
-This field uses 4-bytes, format is little-endian.
-
-The highest bit is “1” if data in the block is uncompressed.
-
-The highest bit is “0” if data in the block is compressed by LZ4.
-
-All other bits give the size, in bytes, of the following data block
-(the size does not include the block checksum if present).
-
-Block Size shall never be larger than Block Maximum Size.
-Such a thing could happen for incompressible source data. 
-In such case, such a data block shall be passed in uncompressed format.
-
-__Data__
-
-Where the actual data to decode stands.
-It might be compressed or not, depending on previous field indications.
-Uncompressed size of Data can be any size, up to “block maximum size”.
-Note that data block is not necessarily full : 
-an arbitrary “flush” may happen anytime. Any block can be “partially filled”.
-
-__Block checksum__
-
-Only present if the associated flag is set.
-This is a 4-bytes checksum value, in little endian format,
-calculated by using the xxHash-32 algorithm on the raw (undecoded) data block,
-and a seed of zero.
-The intention is to detect data corruption (storage or transmission errors) 
-before decoding.
-
-Block checksum is cumulative with Content checksum.
-
-
-Skippable Frames
-----------------
-
-| Magic Number | Frame Size | User Data |
-|:------------:|:----------:| --------- |
-|   4 bytes    |  4 bytes   |           | 
-
-Skippable frames allow the integration of user-defined data
-into a flow of concatenated frames.
-Its design is pretty straightforward,
-with the sole objective to allow the decoder to quickly skip 
-over user-defined data and continue decoding.
-
-For the purpose of facilitating identification,
-it is discouraged to start a flow of concatenated frames with a skippable frame.
-If there is a need to start such a flow with some user data
-encapsulated into a skippable frame,
-it’s recommended to start with a zero-byte LZ4 frame
-followed by a skippable frame.
-This will make it easier for file type identifiers.
-
- 
-__Magic Number__
-
-4 Bytes, Little endian format.
-Value : 0x184D2A5X, which means any value from 0x184D2A50 to 0x184D2A5F.
-All 16 values are valid to identify a skippable frame.
-
-__Frame Size__ 
-
-This is the size, in bytes, of the following User Data
-(without including the magic number nor the size field itself).
-4 Bytes, Little endian format, unsigned 32-bits.
-This means User Data can’t be bigger than (2^32-1) Bytes.
-
-__User Data__
-
-User Data can be anything. Data will just be skipped by the decoder.
-
-
-Legacy frame
-------------
-
-The Legacy frame format was defined into the initial versions of “LZ4Demo”.
-Newer compressors should not use this format anymore, as it is too restrictive.
-
-Main characteristics of the legacy format :
-
-- Fixed block size : 8 MB.
-- All blocks must be completely filled, except the last one.
-- All blocks are always compressed, even when compression is detrimental.
-- The last block is detected either because 
-  it is followed by the “EOF” (End of File) mark,
-  or because it is followed by a known Frame Magic Number.
-- No checksum
-- Convention is Little endian
-
-| MagicNb | B.CSize | CData | B.CSize | CData |  (...)  | EndMark |
-| ------- | ------- | ----- | ------- | ----- | ------- | ------- |
-| 4 bytes | 4 bytes | CSize | 4 bytes | CSize | x times |   EOF   |
-
-
-__Magic Number__
-
-4 Bytes, Little endian format.
-Value : 0x184C2102
-
-__Block Compressed Size__
-
-This is the size, in bytes, of the following compressed data block.
-4 Bytes, Little endian format.
-
-__Data__
-
-Where the actual compressed data stands.
-Data is always compressed, even when compression is detrimental.
-
-__EndMark__
-
-End of legacy frame is implicit only.
-It must be followed by a standard EOF (End Of File) signal,
-wether it is a file or a stream.
-
-Alternatively, if the frame is followed by a valid Frame Magic Number,
-it is considered completed.
-It makes legacy frames compatible with frame concatenation.
-
-Any other value will be interpreted as a block size,
-and trigger an error if it does not fit within acceptable range.
-
-
-Version changes
----------------
-
-1.5.1 : changed format to MarkDown compatible
-
-1.5 : removed Dictionary ID from specification
-
-1.4.1 : changed wording from “stream” to “frame”
-
-1.4 : added skippable streams, re-added stream checksum
-
-1.3 : modified header checksum
-
-1.2 : reduced choice of “block size”, to postpone decision on “dynamic size of BlockSize Field”.
-
-1.1 : optional fields are now part of the descriptor
-
-1.0 : changed “block size” specification, adding a compressed/uncompressed flag
-
-0.9 : reduced scale of “block maximum size” table
-
-0.8 : removed : high compression flag
-
-0.7 : removed : stream checksum
-
-0.6 : settled : stream size uses 8 bytes, endian convention is little endian
-
-0.5: added copyright notice
-
-0.4 : changed format to Google Doc compatible OpenDocument
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/orc/blob/590245a0/c++/libs/snappy-1.1.2/AUTHORS
----------------------------------------------------------------------
diff --git a/c++/libs/snappy-1.1.2/AUTHORS b/c++/libs/snappy-1.1.2/AUTHORS
deleted file mode 100644
index 4858b37..0000000
--- a/c++/libs/snappy-1.1.2/AUTHORS
+++ /dev/null
@@ -1 +0,0 @@
-opensource@google.com

http://git-wip-us.apache.org/repos/asf/orc/blob/590245a0/c++/libs/snappy-1.1.2/CMakeLists.txt
----------------------------------------------------------------------
diff --git a/c++/libs/snappy-1.1.2/CMakeLists.txt b/c++/libs/snappy-1.1.2/CMakeLists.txt
deleted file mode 100644
index 684b064..0000000
--- a/c++/libs/snappy-1.1.2/CMakeLists.txt
+++ /dev/null
@@ -1,20 +0,0 @@
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-# 
-#     http://www.apache.org/licenses/LICENSE-2.0
-# 
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-add_library (snappy STATIC
-  snappy.cc
-  snappy-sinksource.cc
-  snappy-stubs-internal.cc
-  snappy-c.cc
-  )
-
-install (TARGETS snappy DESTINATION lib)

http://git-wip-us.apache.org/repos/asf/orc/blob/590245a0/c++/libs/snappy-1.1.2/COPYING
----------------------------------------------------------------------
diff --git a/c++/libs/snappy-1.1.2/COPYING b/c++/libs/snappy-1.1.2/COPYING
deleted file mode 100644
index 4816c43..0000000
--- a/c++/libs/snappy-1.1.2/COPYING
+++ /dev/null
@@ -1,54 +0,0 @@
-Copyright 2011, Google Inc.
-All rights reserved.
-
-Redistribution and use in source and binary forms, with or without
-modification, are permitted provided that the following conditions are
-met:
-
-    * Redistributions of source code must retain the above copyright
-notice, this list of conditions and the following disclaimer.
-    * Redistributions in binary form must reproduce the above
-copyright notice, this list of conditions and the following disclaimer
-in the documentation and/or other materials provided with the
-distribution.
-    * Neither the name of Google Inc. nor the names of its
-contributors may be used to endorse or promote products derived from
-this software without specific prior written permission.
-
-THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
-"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
-LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
-A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
-OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
-SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
-LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
-DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
-THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
-(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
-OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-
-===
-
-Some of the benchmark data in util/zippy/testdata is licensed differently:
-
- - fireworks.jpeg is Copyright 2013 Steinar H. Gunderson, and
-   is licensed under the Creative Commons Attribution 3.0 license
-   (CC-BY-3.0). See https://creativecommons.org/licenses/by/3.0/
-   for more information.
-
- - kppkn.gtb is taken from the Gaviota chess tablebase set, and
-   is licensed under the MIT License. See
-   https://sites.google.com/site/gaviotachessengine/Home/endgame-tablebases-1
-   for more information.
-
- - paper-100k.pdf is an excerpt (bytes 92160 to 194560) from the paper
-   “Combinatorial Modeling of Chromatin Features Quantitatively Predicts DNA
-   Replication Timing in _Drosophila_” by Federico Comoglio and Renato Paro,
-   which is licensed under the CC-BY license. See
-   http://www.ploscompbiol.org/static/license for more ifnormation.
-
- - alice29.txt, asyoulik.txt, plrabn12.txt and lcet10.txt are from Project
-   Gutenberg. The first three have expired copyrights and are in the public
-   domain; the latter does not have expired copyright, but is still in the
-   public domain according to the license information
-   (http://www.gutenberg.org/ebooks/53).