Posted to issues@orc.apache.org by "Dongjoon Hyun (Jira)" <ji...@apache.org> on 2021/06/04 04:33:00 UTC

[jira] [Updated] (ORC-614) Implement efficient seek() in decompression streams

     [ https://issues.apache.org/jira/browse/ORC-614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated ORC-614:
------------------------------
    Affects Version/s: 1.7.0

> Implement efficient seek() in decompression streams
> ---------------------------------------------------
>
>                 Key: ORC-614
>                 URL: https://issues.apache.org/jira/browse/ORC-614
>             Project: ORC
>          Issue Type: Improvement
>          Components: C++
>    Affects Versions: 1.7.0
>            Reporter: Csaba Ringhofer
>            Assignee: Gang Wu
>            Priority: Major
>             Fix For: 1.7.0
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> The current implementation of ZlibDecompressionStream/BlockDecompressionStream::seek resets the state of the decompressor and the underlying file reader and throws away their buffers. The buffers can still have usable data in the following cases:
> 1. If the new row group's start position is in the same compressed chunk we were reading, then the seek only jumps to another position within the same uncompressed buffer, so both the original compressed buffer and the decompressed buffer can be reused. This is a very common scenario with the default ORC configs of unaligned chunks of at most 256KB and 10K-row row groups, e.g. a chunk can contain 3 full row groups of 8-byte ints without any encoding.
> 2. If the new row group's start position is in another compressed chunk, but that chunk starts in the current compressed buffer (because we read ahead during file reading), then the compressed buffer can be kept and only the uncompressed buffer needs to be dropped. This is the usual case in Apache Impala, which uses an 8 MB block size, so the whole stream is read into the buffer for typical columns.
> The lack of these optimizations led to a regression during the testing of https://github.com/apache/orc/pull/476, which uses seek() when a row group is skipped due to predicate pushdown, as every seek caused the whole stream to be read again.
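
The two cases in the description can be sketched as a small decision function. This is only an illustration under stated assumptions, not the ORC C++ implementation: StreamState, SeekAction and classifySeek are hypothetical names, and the sketch assumes that a seek target is identified by the offset of its compressed chunk within the stream and that the reader tracks which byte range of the stream its compressed buffer currently holds.

```cpp
#include <cstdint>

// Hypothetical types for illustration; none of these names exist in ORC.
enum class SeekAction {
  ReuseBoth,            // case 1: target lies in the chunk already being read
  ReuseCompressedOnly,  // case 2: target chunk starts inside the compressed buffer
  ResetAll              // otherwise: drop both buffers and re-read from the file
};

struct StreamState {
  bool hasChunk;                // true once a chunk has been opened
  uint64_t chunkStart;          // stream offset of the current compressed chunk
  uint64_t compressedBufStart;  // stream offset of the first buffered compressed byte
  uint64_t compressedBufEnd;    // one past the last buffered compressed byte
};

// Decide how much buffered data survives a seek to the chunk that starts
// at targetChunkStart (an offset in the compressed stream).
SeekAction classifySeek(const StreamState& s, uint64_t targetChunkStart) {
  // Case 1: same chunk, so only the read position inside the
  // uncompressed buffer has to move.
  if (s.hasChunk && targetChunkStart == s.chunkStart) {
    return SeekAction::ReuseBoth;
  }
  // Case 2: a different chunk whose first byte is already in the
  // compressed buffer thanks to read-ahead.
  if (targetChunkStart >= s.compressedBufStart &&
      targetChunkStart < s.compressedBufEnd) {
    return SeekAction::ReuseCompressedOnly;
  }
  // Neither buffer helps; fall back to a full reset.
  return SeekAction::ResetAll;
}
```

The sketch only classifies the seek. Moving the position within the uncompressed buffer and restarting decompression at the new chunk are left out.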



--
This message was sent by Atlassian Jira
(v8.3.4#803005)