Posted to issues@orc.apache.org by "Dongjoon Hyun (Jira)" <ji...@apache.org> on 2021/12/25 04:02:00 UTC
[jira] [Resolved] (ORC-1060) batch read with Java interface uses high memory when reading ORC string dictionary encoding column
[ https://issues.apache.org/jira/browse/ORC-1060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dongjoon Hyun resolved ORC-1060.
--------------------------------
Fix Version/s: 1.8.0
Resolution: Fixed
This is resolved via https://github.com/apache/orc/pull/971
> batch read with Java interface uses high memory when reading ORC string dictionary encoding column
> --------------------------------------------------------------------------------------------------
>
> Key: ORC-1060
> URL: https://issues.apache.org/jira/browse/ORC-1060
> Project: ORC
> Issue Type: Improvement
> Components: Java, Reader
> Affects Versions: 1.5.13
> Reporter: xiaoli
> Priority: Minor
> Fix For: 1.8.0
>
>
> We are upgrading Spark from 2.2 to 3.0. During this work, we found that Spark 3.0 uses more memory than Spark 2.2 when reading an ORC string column that uses dictionary encoding.
> The reason is:
> Spark 2.2 uses Hive's library to read ORC: [https://github.com/aixuebo/hive1.2.1.ql/blob/master/java/org/apache/hadoop/hive/ql/io/orc/TreeReaderFactory.java] In this code, the StringDictionaryTreeReader class, which uses the row read interface, holds only one string dictionary in memory when reading across multiple file stripes.
> Spark 3.0 uses the ORC library to read ORC:
> [https://github.com/apache/orc/blob/main/java/core/src/java/org/apache/orc/impl/TreeReaderFactory.java] In this code, the StringDictionaryTreeReader class, which uses the batch read interface, can hold three string dictionaries in memory when reading across multiple file stripes: two copies of the current stripe's dictionary data (the dictionaryBuffer and dictionaryBufferInBytesCache variables) and one copy of the next stripe's dictionary data (the dictionaryBuffer variable, populated when advanceToNextRow is called from the nextBatch method of the RecordReaderImpl class).
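To make the reported difference concrete, here is a rough back-of-the-envelope model in Java. It is only a sketch of the behaviour described above, not code from ORC or Hive: the class and method names are hypothetical, and it assumes the peak footprint is simply the sum of the dictionary copies resident at one time (one copy for the row-style reader; two copies of the current stripe plus one of the next stripe for the batch-style reader).

```java
// Simplified model only: these names are hypothetical and are not ORC or Hive APIs.
final class DictionaryFootprintSketch {

    private DictionaryFootprintSketch() {}

    /** Row-style reader as described in the report: one dictionary resident at a time. */
    static long peakRowStyle(long[] stripeDictBytes) {
        long peak = 0;
        for (long size : stripeDictBytes) {
            peak = Math.max(peak, size);
        }
        return peak;
    }

    /**
     * Batch-style reader as described in the report: two copies of the current
     * stripe's dictionary plus one copy of the next stripe's dictionary.
     */
    static long peakBatchStyle(long[] stripeDictBytes) {
        long peak = 0;
        for (int i = 0; i < stripeDictBytes.length; i++) {
            long current = stripeDictBytes[i];
            // The last stripe has no successor, so nothing is preloaded for it.
            long next = (i + 1 < stripeDictBytes.length) ? stripeDictBytes[i + 1] : 0;
            peak = Math.max(peak, 2 * current + next);
        }
        return peak;
    }
}
```

Under these assumptions, three stripes with dictionaries of 100, 200 and 50 bytes give a peak of 200 bytes for the row-style reader and 450 bytes for the batch-style reader (2 x 200 for the second stripe plus 50 for the third).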