Posted to issues@orc.apache.org by "Quanlong Huang (Jira)" <ji...@apache.org> on 2022/01/11 03:07:00 UTC

[jira] [Assigned] (ORC-1087) Seek overflow in an uncompressed chunk

     [ https://issues.apache.org/jira/browse/ORC-1087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Quanlong Huang reassigned ORC-1087:
-----------------------------------


> Seek overflow in an uncompressed chunk
> --------------------------------------
>
>                 Key: ORC-1087
>                 URL: https://issues.apache.org/jira/browse/ORC-1087
>             Project: ORC
>          Issue Type: Bug
>          Components: C++
>    Affects Versions: 1.7.2, 1.7.1, 1.7.0
>            Reporter: Quanlong Huang
>            Assignee: Quanlong Huang
>            Priority: Critical
>         Attachments: scan_with_sarg.cc, seek-issue-snappy-500k.orc
>
>
> Reading the attached ORC file with SearchArgument "{{sr_return_amt > 10000}}" using the C++ reader will fail with
> {code:java}
> Corrupt PATCHED_BASE encoded data (pl==0)!{code}
> Reading it without the SearchArgument works, and the Java reader can read it with the same SearchArgument.
> The attached source file (scan_with_sarg.cc) reproduces the issue. Build the ORC library, then compile the file with
> {code:bash}
> g++ scan_with_sarg.cc -o scan_with_sarg -I../c++/include -Ic++/include -Lc++/src/ -Lsnappy_ep-prefix/src/snappy_ep-build/ -Llz4_ep-prefix/src/lz4_ep-build/ -Lzlib_ep-prefix/src/zlib_ep-build/ -Lzstd_ep-prefix/src/zstd_ep-build/lib/ -Lprotobuf_ep-prefix/src/protobuf_ep-build/ -lorc -lz -lsnappy -llz4 -lzstd -lprotobuf
> {code}
> Run it as
> {code:bash}
> $ LD_LIBRARY_PATH="$LD_LIBRARY_PATH:zstd_ep-prefix/src/zstd_ep-build/lib/" ./scan_with_sarg 
> leaf-0 = (column(id=17) <= 10000), expr = (not leaf-0)
> terminate called after throwing an instance of 'orc::ParseError'
>   what():  Corrupt PATCHED_BASE encoded data (pl==0)!
> Aborted (core dumped)
> {code}
> *RCA*
> The sarg introduces a seek to RowGroup 42. The following code in {{DecompressionStream::seek}} doesn't handle the case where uncompressedBufferLength < posInChunk, so it seeks to an illegal position and the length computation overflows.
> {code:cpp}
> if (headerPosition == seekedPosition
>     && inputBufferStartPosition <= headerPosition + 3 && inputBufferStart) {
>   position.next(); // Skip the input level position.
>   size_t posInChunk = position.next(); // Chunk level position.
>   // Overflow here! uncompressedBufferLength=30950, posInChunk=39498
>   outputBufferLength = uncompressedBufferLength - posInChunk;
>   outputBuffer = outputBufferStart + posInChunk;
>   return;
> }{code}
> That chunk is an uncompressed chunk, and it is read in pieces rather than as a whole. The bytes at the target position (posInChunk) haven't been read yet, so this case needs to be handled.
> I think this only happens with uncompressed chunks. Compressed chunks are decompressed as a whole, so posInChunk is always a valid offset into the output buffer.
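
For illustration only, here is a minimal standalone sketch of the arithmetic described in the RCA. It is not the ORC code and not the actual fix: the function names (uncheckedRemaining, checkedRemaining, demo) and their parameters are hypothetical. It assumes only that the length is computed with unsigned size_t arithmetic, as the declaration of posInChunk in the quoted snippet suggests, and it reuses the values reported above (30950 and 39498).

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper mirroring the unchecked subtraction in the quoted
// snippet. size_t is unsigned, so when posInChunk exceeds the number of
// bytes buffered so far the result wraps around to a huge value.
std::size_t uncheckedRemaining(std::size_t bufferedLength,
                               std::size_t posInChunk) {
  return bufferedLength - posInChunk;
}

// Hypothetical guarded variant: succeeds only when posInChunk lies within
// the bytes already buffered. A false return tells the caller that the
// target position has not been read yet and needs separate handling.
bool checkedRemaining(std::size_t bufferedLength, std::size_t posInChunk,
                      std::size_t* remaining) {
  if (posInChunk > bufferedLength) {
    return false;
  }
  *remaining = bufferedLength - posInChunk;
  return true;
}

// Exercises both helpers with the values from the report.
void demo() {
  const std::size_t buffered = 30950;
  const std::size_t pos = 39498;

  // The unchecked length is far larger than the buffer itself.
  assert(uncheckedRemaining(buffered, pos) > buffered);

  // The guarded variant rejects the out-of-range position.
  std::size_t remaining = 0;
  assert(!checkedRemaining(buffered, pos, &remaining));

  // A position inside the buffered bytes is accepted.
  assert(checkedRemaining(buffered, 950, &remaining));
  assert(remaining == 30000);
}
```

What the caller should do when the check fails (for example, falling back to a regular seek that reads forward to the position) is left open here; the sketch only shows why the comparison is needed before the subtraction.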



--
This message was sent by Atlassian Jira
(v8.20.1#820001)