Posted to dev@orc.apache.org by Dain Sundstrom <da...@iq80.com> on 2017/06/16 19:19:31 UTC

Documentation issues

Recently I have been working on a custom writer for Presto and during this I kept notes on sections of the documentation that might have problems.  Some of these may have already been addressed:

## Compression
see https://orc.apache.org/docs/compression.html

I think the hex sequence for 100000 compressed is [0x41 0x0D 0x03].  Also, it is not clear if compressed length is 2 bytes, or .
```
Each header is 3 bytes long with (compressedLength * 2 + isOriginal) stored as a little endian value.   For example, the header for a chunk that compressed to 100,000 bytes would be [0x40, 0x0d, 0x03]. The header for 5 bytes that did not compress would be [0x0b, 0x00, 0x00]. 
```
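For reference, the header scheme quoted above can be sketched as a small helper. This is an illustrative Python sketch, not code from the ORC project:

```python
def encode_chunk_header(compressed_length: int, is_original: bool) -> bytes:
    """ORC compression chunk header: 3 little-endian bytes holding
    (compressedLength * 2 + isOriginal)."""
    value = compressed_length * 2 + (1 if is_original else 0)
    return value.to_bytes(3, "little")

# A chunk that compressed to 100,000 bytes: 2 * 100000 = 0x30d40
encode_chunk_header(100_000, False)   # b'\x40\x0d\x03'

# 5 bytes stored uncompressed (isOriginal = 1): 2 * 5 + 1 = 0x0b
encode_chunk_header(5, True)          # b'\x0b\x00\x00'
```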

This section is not clear:
```
The default compression chunk size is 256K, but writers can choose their own value less than 223.
```
Should that be 223K?  If so, that seems strange, since I would assume any value smaller than 256K is legit.
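For what it's worth, the "223" reads like a lost superscript for 2^23: the 3-byte header holds 24 bits, one of which is the isOriginal flag, leaving 23 bits for the length. A quick sanity check under that assumption:

```python
HEADER_BITS = 24                  # 3-byte chunk header
LENGTH_BITS = HEADER_BITS - 1     # one bit is spent on the isOriginal flag
MAX_CHUNK = 1 << LENGTH_BITS      # 2**23 = 8,388,608 bytes
DEFAULT_CHUNK = 256 * 1024        # the 256K default

# The default fits comfortably under the presumed limit.
assert DEFAULT_CHUNK < MAX_CHUNK
```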


## String encodings
see https://orc.apache.org/docs/encodings.html#string-char-and-varchar-columns

This first sentence seems to be describing a heuristic used by the default implementation.

## File tail
The docs should make it clear that the maximum length stored for varchar and char is the maximum number of Unicode characters, specifically not a byte count and not UTF-16 code units (like Java uses by default).
```
// the maximum length of the type for varchar or char
 optional uint32 maximumLength = 4;
```
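To make the distinction concrete, here is an illustrative Python snippet (not ORC code) counting the same string three ways; maximumLength is the code-point count:

```python
s = "a\u00e9\U0001F600"   # 'a', 'e with acute accent', and an emoji

code_points = len(s)                           # 3: what maximumLength counts
utf8_bytes = len(s.encode("utf-8"))            # 7: the byte count differs
utf16_units = len(s.encode("utf-16-le")) // 2  # 4: Java's String.length()
```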


Re: Documentation issues

Posted by Owen O'Malley <ow...@gmail.com>.
Ok, I just put in a pull request for this:

https://github.com/apache/orc/pull/133

Let me know if anything is still unclear.

Thanks,
    Owen

On Fri, Jun 16, 2017 at 12:19 PM, Dain Sundstrom <da...@iq80.com> wrote:

> Recently I have been working on a custom writer for Presto and during this
> I kept notes on sections of the documentation that might have problems.
> Some of these may have already been addressed:
>
> ## Compression
> see https://orc.apache.org/docs/compression.html
>
> I think the hex sequence for 100000 compressed is [0x41 0x0D 0x03].


No, if it is compressed the low bit is 0. It ends up with:

2 * 100,000 + 0 = 0x30d40
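Spelling that arithmetic out as an illustrative Python sketch (not ORC code): the three header bytes read back little endian give 0x30d40, and the low bit distinguishes compressed chunks from original ones.

```python
def decode_chunk_header(header: bytes):
    """Read back a 3-byte little-endian ORC chunk header."""
    value = int.from_bytes(header[:3], "little")
    return value >> 1, bool(value & 1)   # (chunk length, isOriginal)

decode_chunk_header(bytes([0x40, 0x0d, 0x03]))   # (100000, False)
```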


>   Also, it is not clear if compressed length is 2 bytes, or .
>

The header is always 3 bytes. I thought about adding a special case if the
chunk size was less than 32k, but didn't.


> ```
> Each header is 3 bytes long with (compressedLength * 2 + isOriginal)
> stored as a little endian value.   For example, the header for a chunk that
> compressed to 100,000 bytes would be [0x40, 0x0d, 0x03]. The header for 5
> bytes that did not compress would be [0x0b, 0x00, 0x00].
> ```
>
> This section is not clear:
> ```
> The default compression chunk size is 256K, but writers can choose their
> own value less than 223.
> ```
> Should that be 223K?  If so, that seems strange, since I would assume
> any value smaller than 256K is legit.
>
>
> ## String encodings
> see https://orc.apache.org/docs/encodings.html#string-char-
> and-varchar-columns
>
> This first sentence seems to be describing a heuristic used by the default
> implementation.
>
> ## File tail
> The docs should make it clear that the maximum length stored for varchar
> and char is the maximum number of Unicode characters, specifically not a
> byte count and not UTF-16 code units (like Java uses by default).
> ```
> // the maximum length of the type for varchar or char
>  optional uint32 maximumLength = 4;
> ```
>
>