Posted to issues@orc.apache.org by "Dongjoon Hyun (Jira)" <ji...@apache.org> on 2021/12/25 04:02:00 UTC

[jira] [Resolved] (ORC-1060) batch read with Java interface uses high memory when reading ORC string dictionary encoding column

     [ https://issues.apache.org/jira/browse/ORC-1060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved ORC-1060.
--------------------------------
    Fix Version/s: 1.8.0
       Resolution: Fixed

This is resolved via https://github.com/apache/orc/pull/971

> batch read with Java interface uses high memory when reading ORC string dictionary encoding column
> --------------------------------------------------------------------------------------------------
>
>                 Key: ORC-1060
>                 URL: https://issues.apache.org/jira/browse/ORC-1060
>             Project: ORC
>          Issue Type: Improvement
>          Components: Java, Reader
>    Affects Versions: 1.5.13
>            Reporter: xiaoli
>            Priority: Minor
>             Fix For: 1.8.0
>
>
> We are upgrading Spark from 2.2 to 3.0. During this work, we found that Spark 3.0 uses more memory than Spark 2.2 when reading an ORC string column with dictionary encoding.
> The reason is:
> Spark 2.2 uses Hive's library to read ORC [https://github.com/aixuebo/hive1.2.1.ql/blob/master/java/org/apache/hadoop/hive/ql/io/orc/TreeReaderFactory.java]. In this code, the StringDictionaryTreeReader class, used through the row read interface, holds only one string dictionary in memory when reading across multiple file stripes.
> Spark 3.0 uses the ORC library to read ORC
> [https://github.com/apache/orc/blob/main/java/core/src/java/org/apache/orc/impl/TreeReaderFactory.java]. In this code, the StringDictionaryTreeReader class, used through the batch read interface, can hold three string dictionaries in memory when reading across multiple file stripes: two copies of the current stripe's dictionary data (the dictionaryBuffer and dictionaryBufferInBytesCache variables) and one copy of the next stripe's dictionary data (the dictionaryBuffer variable, when the advanceToNextRow method is called in the RecordReaderImpl class's nextBatch method).
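
The memory difference described in the report can be illustrated with a toy accounting model. This is only a sketch under stated assumptions, not ORC or Hive code: the class name, method names, and stripe sizes are hypothetical, and the model simply counts dictionary copies the way the report describes them (one copy for the row reader; two copies of the current stripe's dictionary plus one copy of the next stripe's dictionary for the batch reader).

```java
/**
 * Toy model (hypothetical, not ORC code): estimates the peak number of bytes
 * held in string dictionaries while scanning consecutive stripes.
 */
public class DictionaryFootprintSketch {

    /** Row-style reader as described in the report: one dictionary alive at a time. */
    static long peakBytesSingleDictionary(long[] stripeDictBytes) {
        long peak = 0;
        for (long size : stripeDictBytes) {
            // The previous stripe's dictionary is replaced, so only one is live.
            peak = Math.max(peak, size);
        }
        return peak;
    }

    /**
     * Batch-style reader as described in the report: two copies of the current
     * stripe's dictionary, plus the next stripe's dictionary once the reader
     * advances past the end of the current stripe.
     */
    static long peakBytesBatchReader(long[] stripeDictBytes) {
        long peak = 0;
        for (int i = 0; i < stripeDictBytes.length; i++) {
            // Assumed: buffer copy + byte-array cache copy of the current dictionary.
            long live = 2 * stripeDictBytes[i];
            if (i + 1 < stripeDictBytes.length) {
                // Assumed: the next stripe's dictionary is loaded while the
                // current batch still references the current one.
                live += stripeDictBytes[i + 1];
            }
            peak = Math.max(peak, live);
        }
        return peak;
    }

    public static void main(String[] args) {
        // Hypothetical dictionary sizes in MB for three stripes.
        long[] sizes = {40L, 64L, 32L};
        System.out.println("row reader peak (MB):   " + peakBytesSingleDictionary(sizes));
        System.out.println("batch reader peak (MB): " + peakBytesBatchReader(sizes));
    }
}
```

With the hypothetical sizes above, the model gives a peak of 64 MB for the single-dictionary case and 160 MB for the batch case (2 x 64 MB for the second stripe plus 32 MB for the third). Real heap usage depends on the actual reader implementation and is not predicted by this model.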



--
This message was sent by Atlassian Jira
(v8.20.1#820001)