Posted to dev@orc.apache.org by "Ruochen Zou (Jira)" <ji...@apache.org> on 2020/03/29 18:23:00 UTC

[jira] [Created] (ORC-616) In Patched Base encoding, the value of headerThirdByte goes beyond the range of byte

Ruochen Zou created ORC-616:
-------------------------------

             Summary: In Patched Base encoding, the value of headerThirdByte goes beyond the range of byte
                 Key: ORC-616
                 URL: https://issues.apache.org/jira/browse/ORC-616
             Project: ORC
          Issue Type: Bug
          Components: Java
    Affects Versions: master
            Reporter: Ruochen Zou


In Patched Base encoding, the first three bits of headerThirdByte represent the base value width. If Math.abs(min) is greater than or equal to 1L << 56, baseBytes is 9, so bb = (9 - 1) << 5 = 256, which is beyond the range of a byte.

{code:java}
final boolean isNegative = min < 0 ? true : false;
if (isNegative) {
  min = -min;
}
// find the number of bytes required for base and shift it by 5 bits
// to accommodate patch width. The additional bit is used to store the sign
// of the base value.
final int baseWidth = utils.findClosestNumBits(min) + 1;
final int baseBytes = baseWidth % 8 == 0 ? baseWidth / 8 : (baseWidth / 8) + 1;
final int bb = (baseBytes - 1) << 5;

// if the base value is negative then set MSB to 1
if (isNegative) {
  min |= (1L << ((baseBytes * 8) - 1));
}

// third byte contains 3 bits for number of bytes occupied by base
// and 5 bits for patchWidth
final int headerThirdByte = bb | utils.encodeBitWidth(patchWidth);
{code}
Only the eight low-order bits of headerThirdByte are written, so the base width read by RunLengthIntegerReaderV2 is incorrect and the decoded column data is wrong.

{code:java}
// extract the number of bytes occupied by base
int thirdByte = input.read();
int bw = (thirdByte >>> 5) & 0x07;
// base width is one off
bw += 1;
{code}
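The overflow can be reproduced in isolation. The following is a minimal sketch, not ORC's actual code: the class PatchedBaseHeaderSketch and its methods are hypothetical, and findClosestNumBits is a stand-in whose rounding table is an assumption about how the real utils.findClosestNumBits behaves. The rest mirrors the writer and reader arithmetic quoted above.

```java
// Sketch only: reproduces the header arithmetic from this report,
// without depending on ORC's real classes.
class PatchedBaseHeaderSketch {

  // Hypothetical stand-in for utils.findClosestNumBits(long): counts the
  // bits needed for a non-negative value, then rounds up to an allowed
  // width. The rounding table is an assumption about ORC's behavior.
  static int findClosestNumBits(long value) {
    int n = 64 - Long.numberOfLeadingZeros(value);
    if (n == 0) return 1;
    if (n <= 24) return n;
    if (n <= 26) return 26;
    if (n <= 28) return 28;
    if (n <= 30) return 30;
    if (n <= 32) return 32;
    if (n <= 40) return 40;
    if (n <= 48) return 48;
    if (n <= 56) return 56;
    return 64;
  }

  // Writer side: bytes needed for the base value plus its sign bit.
  static int baseBytes(long absMin) {
    int baseWidth = findClosestNumBits(absMin) + 1;
    return baseWidth % 8 == 0 ? baseWidth / 8 : (baseWidth / 8) + 1;
  }

  // Writer side: third header byte as an int, before it is truncated.
  // patchWidthCode stands for the 5-bit result of encodeBitWidth(patchWidth).
  static int headerThirdByte(long absMin, int patchWidthCode) {
    int bb = (baseBytes(absMin) - 1) << 5;
    return bb | patchWidthCode;
  }

  // Reader side: only the low 8 bits of the header value reach the stream.
  static int decodedBaseBytes(int headerThirdByte) {
    int thirdByte = headerThirdByte & 0xff;
    return ((thirdByte >>> 5) & 0x07) + 1;
  }

  public static void main(String[] args) {
    long[] samples = {(1L << 56) - 1, 1L << 56};
    for (long absMin : samples) {
      int header = headerThirdByte(absMin, 0);
      System.out.println("absMin=" + absMin
          + " baseBytes=" + baseBytes(absMin)
          + " header=" + header
          + " decodedBaseBytes=" + decodedBaseBytes(header));
    }
  }
}
```

Under these assumptions, a base of (1L << 56) - 1 still round-trips as 8 bytes, while a base of 1L << 56 needs 9 bytes on the writer side but is decoded as 1 byte by the reader, because bit 8 of the header value is dropped.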
In some cases, RunLengthIntegerReaderV2 fails with an EOFException.

{code:java}
Caused by: java.io.EOFException: Read past end of RLE integer from compressed stream Stream for column 2 kind DATA position: 3213835 length: 3213835 range: 0 offset: 3217373 limit: 3217373 range 0 = 0 to 3213835 uncompressed: 184478 to 184478
        at org.apache.orc.impl.RunLengthIntegerReaderV2.readValues(RunLengthIntegerReaderV2.java:61)
        at org.apache.orc.impl.RunLengthIntegerReaderV2.next(RunLengthIntegerReaderV2.java:323)
        at org.apache.orc.impl.RunLengthIntegerReaderV2.nextVector(RunLengthIntegerReaderV2.java:369)
        at org.apache.orc.impl.TreeReaderFactory$LongTreeReader.nextVector(TreeReaderFactory.java:587)
        at org.apache.orc.impl.TreeReaderFactory$StructTreeReader.nextBatch(TreeReaderFactory.java:1815)
        at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1184)
        ... 20 more
{code}
For example, consider the following sequence:

{code:java}
long data[] = {-9007199254740992l,-8725724278030337l,-1125762467889153l,-1l,-9007199254740992l,-9007199254740992l,-497l,127l,-1l,-72057594037927936l,-4194304l,-9007199254740992l,-4503599593816065l,-4194304l,-8936830510563329l,-9007199254740992l, -1l, -70334384439312l,-4063233l, -6755399441973249l};
{code}
The min value is -72057594037927936 (-1L << 56). RLEv2 writes this sequence with Patched Base encoding, and the data read back by RunLengthIntegerReaderV2 is:

{code:java}
[281474976710656, 36275087623585792, 247390116249599, 72053196528287743, 72057594037927935, 72022409665839104, 246290604621824, -71776119061217282, 4222124650659840, 36028797018963967, 71776119061217280, 281474976694272, 246290604621824, 263882790797311, 72057594037911552, 246565482528767, 72022409665839104, 281474976710655, 72057319294238719, 67835469387252223]
{code}




