Posted to issues@orc.apache.org by "Dongjoon Hyun (Jira)" <ji...@apache.org> on 2024/03/08 17:34:01 UTC

[jira] [Closed] (ORC-1553) Reading information from Row group, where there are 0 records of SArg column

     [ https://issues.apache.org/jira/browse/ORC-1553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun closed ORC-1553.
------------------------------

> Reading information from Row group, where there are 0 records of SArg column
> ----------------------------------------------------------------------------
>
>                 Key: ORC-1553
>                 URL: https://issues.apache.org/jira/browse/ORC-1553
>             Project: ORC
>          Issue Type: Bug
>    Affects Versions: 1.9.2
>            Reporter: Alexander Petrossian (PAF)
>            Assignee: Yiqun Zhang
>            Priority: Major
>             Fix For: 2.0.0, 1.9.3
>
>         Attachments: MAJOR-2023-11-21.orc, Снимок экрана 2023-12-21 в 10.00.23.png
>
>
> We created an .orc file using the Apache ORC library; I cannot provide a reproducible way to create such a file.
> Statistics are present for 100% of the row groups (checked with orc dump).
> But when we search that file, we see very strange behavior:
> {code}
> TRACE org.apache.orc.impl.RecordReaderImpl: Stats = numberOfValues: 0
> stringStatistics {
> }
> hasNull: false
> TRACE org.apache.orc.impl.RecordReaderImpl: Setting (EQUALS value 71231231212) to YES_NO_NULL
> DEBUG org.apache.orc.impl.RecordReaderImpl: Row group 340000 to 349999 is included.
> {code}
> If the existing statistics report 0 values, there is obviously no need to read that row group.
> And yet we get a YES_NO_NULL decision, which forces inclusion of that row group in the subsequent read; this is pointless and bad for performance.
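
The pruning the reporter expects can be sketched as follows. This is a minimal, self-contained illustration, not the ORC implementation and not the fix that was committed for this issue. RowGroupPruningSketch, StringStats, evaluateEquals and mustRead are hypothetical names; only the truth-value constant names are taken from the trace above. The sketch assumes that numberOfValues counts non-null values and that strings are ordered by String.compareTo.

```java
/** Hypothetical, simplified stand-ins for illustration; these are NOT the real ORC classes. */
final class RowGroupPruningSketch {

    /** Constant names follow the truth values that appear in the trace. */
    enum TruthValue { YES, NO, NULL, YES_NULL, NO_NULL, YES_NO, YES_NO_NULL }

    /** Simplified string statistics for one row group (assumed shape). */
    static final class StringStats {
        final long numberOfValues; // assumed: number of non-null values
        final boolean hasNull;
        final String minimum;      // null when the statistics carry no minimum
        final String maximum;      // null when the statistics carry no maximum

        StringStats(long numberOfValues, boolean hasNull, String minimum, String maximum) {
            this.numberOfValues = numberOfValues;
            this.hasNull = hasNull;
            this.minimum = minimum;
            this.maximum = maximum;
        }
    }

    /** Evaluates "column = literal" against the statistics of a single row group. */
    static TruthValue evaluateEquals(StringStats stats, String literal) {
        if (stats.numberOfValues == 0) {
            // No non-null values at all: the equality can never be true here.
            return stats.hasNull ? TruthValue.NULL : TruthValue.NO;
        }
        if (stats.minimum == null || stats.maximum == null) {
            // Values exist but the range is unknown: nothing can be ruled out.
            return stats.hasNull ? TruthValue.YES_NO_NULL : TruthValue.YES_NO;
        }
        if (literal.compareTo(stats.minimum) < 0 || literal.compareTo(stats.maximum) > 0) {
            // The literal lies outside [minimum, maximum].
            return stats.hasNull ? TruthValue.NO_NULL : TruthValue.NO;
        }
        if (stats.minimum.equals(stats.maximum)) {
            // Every non-null value equals the literal.
            return stats.hasNull ? TruthValue.YES_NULL : TruthValue.YES;
        }
        return stats.hasNull ? TruthValue.YES_NO_NULL : TruthValue.YES_NO;
    }

    /** A row group has to be read only if the predicate might be true for some row. */
    static boolean mustRead(TruthValue value) {
        switch (value) {
            case YES:
            case YES_NULL:
            case YES_NO:
            case YES_NO_NULL:
                return true;
            default:
                return false;
        }
    }
}
```

Under these assumptions, the statistics from the trace (numberOfValues 0, hasNull false, no minimum or maximum) evaluate to NO for the literal 71231231212, so mustRead returns false and the row group is skipped.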



--
This message was sent by Atlassian Jira
(v8.20.10#820010)