Posted to issues@orc.apache.org by "Yu-Wen Lai (Jira)" <ji...@apache.org> on 2022/01/07 01:53:00 UTC

[jira] [Updated] (ORC-1078) Row group end offset doesn't accommodate all the blocks

     [ https://issues.apache.org/jira/browse/ORC-1078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yu-Wen Lai updated ORC-1078:
----------------------------
    Description: 
The error message on current master:
{code:java}
java.lang.IllegalArgumentException
    at java.nio.Buffer.position(Buffer.java:244)
    at org.apache.orc.impl.InStream$CompressedStream.setCurrent(InStream.java:453)
    at org.apache.orc.impl.InStream$CompressedStream.readHeaderByte(InStream.java:462)
    at org.apache.orc.impl.InStream$CompressedStream.readHeader(InStream.java:474)
    at org.apache.orc.impl.InStream$CompressedStream.ensureUncompressed(InStream.java:528)
    at org.apache.orc.impl.InStream$CompressedStream.read(InStream.java:515){code}
The same error can appear a little differently in older versions:
{code:java}
java.io.IOException: Seek outside of data in compressed stream Stream for column 15 kind DATA position: 111956 length: 383146 range: 0 offset: 36674 limit: 36674 range 0 = 75
282 to 36674; range 1 = 151666 to 40267; range 2 = 228805 to 41623 uncompressed: 1024 to 1024 to 111956{code}
Here is the info extracted from the problematic ORC file:
{code:java}
Compression: ZLIB
Compression size: 1024
Calendar: Julian/Gregorian
Type: struct<col:timestamp>

Row group indices:
      Entry 0: count: 10000 hasNull: false min: 2010-01-01 00:00:00.0 max: 2021-12-06 21:09:56.000999999 positions: 0,0,0,0,0,0
      Entry 1: count: 10000 hasNull: false min: 2016-03-09 11:36:36.0 max: 2021-12-06 21:53:33.000999999 positions: 37509,683,98,2630,296,3
      Entry 2: count: 10000 hasNull: false min: 2010-01-01 00:00:00.0 max: 2026-05-29 00:53:10.000999999 positions: 75265,256,229,5902,370,8
      Entry 3: count: 10000 hasNull: false min: 2010-01-01 00:00:00.0 max: 2021-12-06 18:14:52.000999999 positions: 109907,934,398,8581,322,18{code}
The issue happens when entry 2 is selected and read, because the end offset estimated for this row group is incorrect. More specifically, when the compression size is smaller than 2048, there is an edge case in which a slop factor of 2 cannot accommodate all the blocks (please see the code snippet below).
{code:java}
public static long estimateRgEndOffset(boolean isCompressed,
    int bufferSize,
    boolean isLast,
    long nextGroupOffset,
    long streamLength) {
  // figure out the worst case last location
  // if adjacent groups have the same compressed block offset then stretch the slop
  // by factor of 2 to safely accommodate the next compression block.
  // One for the current compression block and another for the next compression block.
  long slop = isCompressed ?
      2 * (OutStream.HEADER_SIZE + bufferSize) : WORST_UNCOMPRESSED_SLOP;
  return isLast ? streamLength : Math.min(streamLength, nextGroupOffset + slop);
}{code}
In our case, we need slop > 934 (uncompressed buffer offset) + 398 * 4 + header bytes, i.e. more than 2529, but slop = 1027 * 2 = 2054. That causes the seek outside of the range.
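Plugging the numbers from entry 3's index positions into estimateRgEndOffset makes the overrun concrete. This is a minimal arithmetic sketch, not ORC code; the 3-byte header size is an assumption mirroring OutStream.HEADER_SIZE:

```java
// Sketch of the overrun, using entry 3's positions from the index above
// (934 = uncompressed offset into the block, 398 = value offset).
public class SlopOverrun {
    public static void main(String[] args) {
        int headerSize = 3;                          // assumed OutStream.HEADER_SIZE
        int bufferSize = 1024;                       // compression size of the file
        long needed = 934 + 398L * 4 + headerSize;   // bytes the reader may touch
        long slop = 2L * (headerSize + bufferSize);  // what estimateRgEndOffset allows
        System.out.println(needed + " > " + slop);   // prints "2529 > 2054"
    }
}
```

Since 2529 exceeds the 2054-byte slop, the reader's position lands past the end of the range it fetched, which is exactly the IllegalArgumentException from Buffer.position above.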

In terms of the worst case, we might have an uncompressed block in a compressed stream. Suppose the compression size is C; then the factor should be 1 (for the current block) plus (511 * 4 + header bytes) / C, rounded up:
C = 1024 -> factor should be 3
C = 512 -> factor should be 5 ... and so forth.
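The factor formula above can be sketched as a small helper. This is an illustration of the proposed fix, not the actual patch; the 3-byte header size and the 511 * 4 worst-case trailing bytes are assumptions carried over from the description:

```java
// Worst-case slop factor: one chunk for the current compression block,
// plus enough whole chunks to cover the trailing bytes, rounded up.
public class SlopFactor {
    static final int HEADER_BYTES = 3; // assumed OutStream.HEADER_SIZE

    static int worstCaseFactor(int compressionSize) {
        int worstTrailing = 511 * 4 + HEADER_BYTES;
        return 1 + (worstTrailing + compressionSize - 1) / compressionSize;
    }

    public static void main(String[] args) {
        System.out.println(worstCaseFactor(1024)); // 3
        System.out.println(worstCaseFactor(512));  // 5
    }
}
```

For C >= 2048 the helper degenerates to the existing factor of 2, which matches the observation that the bug only bites when the compression size is smaller than 2048.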


> Row group end offset doesn't accommodate all the blocks
> -------------------------------------------------------
>
>                 Key: ORC-1078
>                 URL: https://issues.apache.org/jira/browse/ORC-1078
>             Project: ORC
>          Issue Type: Bug
>            Reporter: Yu-Wen Lai
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.1#820001)