
Posted to issues@orc.apache.org by "Yiqun Zhang (Jira)" <ji...@apache.org> on 2023/08/17 03:54:00 UTC

[jira] [Commented] (ORC-1482) RecordReaderImpl.evaluatePredicateProto assumes floating point stats are always present

    [ https://issues.apache.org/jira/browse/ORC-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17755347#comment-17755347 ] 

Yiqun Zhang commented on ORC-1482:
----------------------------------

[~jlowe] Thanks for reporting this issue. Do you have an ORC file that reproduces it?

For the official ORC writer, I think a DOUBLE or FLOAT column must always be written with DoubleColumnStatistics:
{code:java}
public static ColumnStatisticsImpl create(TypeDescription schema,
                                          boolean convertToProleptic) {
  switch (schema.getCategory()) {
    case BOOLEAN:
      return new BooleanStatisticsImpl();
    case BYTE:
    case SHORT:
    case INT:
    case LONG:
      return new IntegerStatisticsImpl();
    case LIST:
    case MAP:
      return new CollectionColumnStatisticsImpl();
    case FLOAT:
    case DOUBLE:
      return new DoubleStatisticsImpl();
    // ...
{code}
So I think the problem may be caused by the implementation of other writers. Maybe we need to state this requirement explicitly in the spec.

Correct me if I'm wrong, thanks!

> RecordReaderImpl.evaluatePredicateProto assumes floating point stats are always present
> ---------------------------------------------------------------------------------------
>
>                 Key: ORC-1482
>                 URL: https://issues.apache.org/jira/browse/ORC-1482
>             Project: ORC
>          Issue Type: Bug
>    Affects Versions: 1.7.4
>            Reporter: Jason Darrell Lowe
>            Priority: Major
>
> ORC-629 added custom handling of predicate pushdown on doubles, but the code that was added blindly assumes that double statistics were present in the file which may not have been the case.  Here's the snippet of code in question:
> {code:java}
>      } else if (category == TypeDescription.Category.DOUBLE ||
>         category == TypeDescription.Category.FLOAT) {
>       DoubleColumnStatistics dstas = (DoubleColumnStatistics) cs;
> {code}
>  
> Elsewhere in the code, there's a type check on the result of statistics deserialization before casting, but here the type checks are missing.  It appears this should be checking for DoubleColumnStatistics before assuming the cast will succeed, and if the expected statistics type is not present then the predicate should not be evaluated on that column.
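
The type check described in the issue could look roughly like the sketch below. It is a minimal illustration only, not the actual RecordReaderImpl code: the ColumnStatistics and DoubleColumnStatistics interfaces here are simplified stand-ins named after the ORC ones, and the class name, the canSkip method, and the equality-only predicate are assumptions made for the example. When the statistics are not of the expected type, the sketch conservatively declines to skip any rows.

```java
class StatsTypeCheckSketch {
  // Hypothetical stand-ins for ORC's statistics interfaces so the sketch compiles on its own.
  interface ColumnStatistics {}

  interface DoubleColumnStatistics extends ColumnStatistics {
    double getMinimum();
    double getMaximum();
  }

  // Simplified stand-in for TypeDescription.Category.
  enum Category { FLOAT, DOUBLE, LONG }

  // Returns true only when the statistics prove that no row can equal the literal.
  static boolean canSkip(Category category, ColumnStatistics cs, double literal) {
    if (category == Category.DOUBLE || category == Category.FLOAT) {
      // Check the runtime type before casting: a file produced by another
      // writer may carry generic statistics for a floating point column.
      if (!(cs instanceof DoubleColumnStatistics)) {
        return false; // predicate cannot be evaluated, so keep the rows
      }
      DoubleColumnStatistics dstats = (DoubleColumnStatistics) cs;
      return literal < dstats.getMinimum() || literal > dstats.getMaximum();
    }
    return false; // other categories are out of scope for this sketch
  }

  // Helper that builds double statistics with a fixed range (illustration only).
  static DoubleColumnStatistics doubleStats(final double min, final double max) {
    return new DoubleColumnStatistics() {
      @Override public double getMinimum() { return min; }
      @Override public double getMaximum() { return max; }
    };
  }

  public static void main(String[] args) {
    ColumnStatistics generic = new ColumnStatistics() {};
    // Generic statistics: no ClassCastException, rows are simply kept.
    System.out.println(canSkip(Category.DOUBLE, generic, 5.0));
    // Proper double statistics: 5.0 lies outside [1.0, 2.0], so rows can be skipped.
    System.out.println(canSkip(Category.DOUBLE, doubleStats(1.0, 2.0), 5.0));
  }
}
```

With generic statistics the unchecked cast from the quoted snippet would throw a ClassCastException, whereas the guarded version falls back to reading the rows.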


