Posted to commits@orc.apache.org by om...@apache.org on 2017/06/19 22:31:41 UTC
orc git commit: Fix the documentation issues that Dain brought up.
Repository: orc
Updated Branches:
refs/heads/master 54c54775a -> cdfc1ea47
Fix the documentation issues that Dain brought up.
Fixes #133
Signed-off-by: Owen O'Malley <om...@apache.org>
Project: http://git-wip-us.apache.org/repos/asf/orc/repo
Commit: http://git-wip-us.apache.org/repos/asf/orc/commit/cdfc1ea4
Tree: http://git-wip-us.apache.org/repos/asf/orc/tree/cdfc1ea4
Diff: http://git-wip-us.apache.org/repos/asf/orc/diff/cdfc1ea4
Branch: refs/heads/master
Commit: cdfc1ea47584d5aee2e2dc3dcca597d53ba5527a
Parents: 54c5477
Author: Owen O'Malley <om...@apache.org>
Authored: Mon Jun 19 13:18:43 2017 -0700
Committer: Owen O'Malley <om...@apache.org>
Committed: Mon Jun 19 15:30:40 2017 -0700
----------------------------------------------------------------------
site/_docs/compression.md | 9 +++++----
site/_docs/encodings.md | 9 ++++++---
site/_docs/file-tail.md | 2 +-
3 files changed, 12 insertions(+), 8 deletions(-)
----------------------------------------------------------------------
http://git-wip-us.apache.org/repos/asf/orc/blob/cdfc1ea4/site/_docs/compression.md
----------------------------------------------------------------------
diff --git a/site/_docs/compression.md b/site/_docs/compression.md
index 62cc199..aee2640 100644
--- a/site/_docs/compression.md
+++ b/site/_docs/compression.md
@@ -23,10 +23,11 @@ start decompressing without the previous bytes.
![compression streams]({{ site.url }}/img/CompressionStream.png)
The default compression chunk size is 256K, but writers can choose
-their own value less than 2^23. Larger chunks lead to better
-compression, but require more memory. The chunk size is recorded in
-the Postscript so that readers can allocate appropriately sized
-buffers.
+their own value. Larger chunks lead to better compression, but require
+more memory. The chunk size is recorded in the Postscript so that
+readers can allocate appropriately sized buffers. Readers are
+guaranteed that no chunk will expand to more than the compression chunk
+size.
ORC files without generic compression write each stream directly
with no headers.
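
Note on the compression.md hunk above: the following is a minimal sketch of how a reader might use the recorded chunk size. It assumes the 3-byte little-endian chunk header described in the ORC specification, where the stored value is the chunk length times two plus an "is original" flag. The function names and the DEFAULT_CHUNK_SIZE constant are hypothetical and are not part of any ORC library.

```python
from typing import Tuple

# Default compression chunk size mentioned in the documentation (256K).
DEFAULT_CHUNK_SIZE = 256 * 1024


def parse_chunk_header(header: bytes) -> Tuple[int, bool]:
    """Decode a 3-byte chunk header into (length, is_original).

    Assumption: the header is little-endian and stores
    (length * 2 + is_original), as in the ORC specification.
    """
    if len(header) != 3:
        raise ValueError("chunk header must be exactly 3 bytes")
    value = int.from_bytes(header, "little")
    return value >> 1, bool(value & 1)


def allocate_chunk_buffer(chunk_size: int = DEFAULT_CHUNK_SIZE) -> bytearray:
    """Allocate one reusable output buffer.

    Because no chunk expands beyond the compression chunk size,
    a buffer of exactly chunk_size bytes is enough for any chunk.
    """
    return bytearray(chunk_size)


length, is_original = parse_chunk_header(bytes([0x40, 0x0D, 0x03]))
buffer = allocate_chunk_buffer()
```

Since the header is only 3 bytes with one bit used as a flag, the length field holds 23 bits, which is where the old "less than 2^23" limit in the removed text came from.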
http://git-wip-us.apache.org/repos/asf/orc/blob/cdfc1ea4/site/_docs/encodings.md
----------------------------------------------------------------------
diff --git a/site/_docs/encodings.md b/site/_docs/encodings.md
index 285ca71..9c565dc 100644
--- a/site/_docs/encodings.md
+++ b/site/_docs/encodings.md
@@ -32,9 +32,12 @@ DIRECT | PRESENT | Yes | Boolean RLE
## String, Char, and VarChar Columns
-String columns are adaptively encoded based on whether the first
-10,000 values are sufficiently distinct. In all of the encodings, the
-PRESENT stream encodes whether the value is null.
+String, char, and varchar columns may be encoded either using a
+dictionary encoding or a direct encoding. A direct encoding should be
+preferred when there are many distinct values. In all of the
+encodings, the PRESENT stream encodes whether the value is null. The
+Java ORC writer automatically picks the encoding after the first row
+group (10,000 rows).
For direct encoding the UTF-8 bytes are saved in the DATA stream and
the length of each value is written into the LENGTH stream. In direct
http://git-wip-us.apache.org/repos/asf/orc/blob/cdfc1ea4/site/_docs/file-tail.md
----------------------------------------------------------------------
diff --git a/site/_docs/file-tail.md b/site/_docs/file-tail.md
index d2700bb..316c001 100644
--- a/site/_docs/file-tail.md
+++ b/site/_docs/file-tail.md
@@ -173,7 +173,7 @@ that contains the list of their children's type ids.
repeated uint32 subtypes = 2 [packed=true];
// the list of field names for struct
repeated string fieldNames = 3;
- // the maximum length of the type for varchar or char
+ // the maximum length of the type for varchar or char in UTF-8 characters
optional uint32 maximumLength = 4;
// the precision and scale for decimal
optional uint32 precision = 5;
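
Note on the file-tail.md hunk above: the amended comment says maximumLength is counted in characters rather than bytes. The sketch below illustrates the difference, assuming that "characters" means Unicode code points (which is what Python's len() counts for str). The helper names are hypothetical and do not come from any ORC library.

```python
def fits_maximum_length(value: str, maximum_length: int) -> bool:
    """Return True if value has at most maximum_length code points."""
    return len(value) <= maximum_length


def truncate_to_maximum_length(value: str, maximum_length: int) -> str:
    """Keep at most maximum_length code points of value."""
    return value[:maximum_length]


# "na\u00efve" has 5 code points, but its UTF-8 encoding is 6 bytes long,
# because U+00EF is encoded as two bytes.
sample = "na\u00efve"
char_count = len(sample)
byte_count = len(sample.encode("utf-8"))
```

Under this assumption, the sample fits in a varchar(5) column even though its UTF-8 form occupies 6 bytes in the DATA stream.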