Posted to dev@orc.apache.org by "Csaba Ringhofer (Jira)" <ji...@apache.org> on 2020/03/16 21:14:00 UTC

[jira] [Created] (ORC-612) Improve CHAR(N)/VARCHAR(N) support in predicate push down

Csaba Ringhofer created ORC-612:
-----------------------------------

             Summary: Improve CHAR(N)/VARCHAR(N) support in predicate push down
                 Key: ORC-612
                 URL: https://issues.apache.org/jira/browse/ORC-612
             Project: ORC
          Issue Type: Improvement
          Components: C++
            Reporter: Csaba Ringhofer


This came up during the implementation of min/max filters in Apache Impala (https://gerrit.cloudera.org/#/c/15403/, by [~norbertluksa]).

Impala reads CHAR(N)/VARCHAR(N) the following way:
0. push down min/max predicates (under review in the change above)
1. read the value as STRING from ORC
2. truncate the value if it is longer than N, or pad it with spaces if the type is CHAR and the value is shorter than N (sketched below)
3. evaluate the predicates once all columns are read
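
For illustration only, a rough C++ sketch of what step 2 does, assuming byte-length semantics (the helper name is made up, not Impala's actual code):

    #include <cstddef>
    #include <string>

    // Made-up helper mirroring step 2: truncate values longer than N, and for
    // CHAR(N) pad shorter values with spaces up to N (lengths are byte lengths).
    std::string AdjustToDeclaredLength(const std::string& value, std::size_t n, bool is_char) {
      if (value.size() > n) return value.substr(0, n);      // truncate to N bytes
      if (is_char && value.size() < n)
        return value + std::string(n - value.size(), ' ');  // space-pad CHAR(N)
      return value;                                         // short VARCHAR stays as-is
    }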

It is possible that a value from ORC does not satisfy the predicates before truncation/padding, but does satisfy them afterwards. For example:
a single column: s VARCHAR(1)
a single value in the ORC file: "aa"
a predicate: s="a"
"aa" does not pass the predicate, but after truncation it becomes "a", which passes.

Currently it is tricky to push this predicate down, as simply passing s="a" would skip the file because min="aa" > "a". (What could work is pushing s>="a" AND s<"b" instead, as every value that truncates to "a" is < "b".)
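
A made-up sketch of that workaround, assuming byte-wise comparison and a non-empty literal whose last byte is not 0xFF (the function name is invented):

    #include <string>
    #include <utility>

    // Rewrite s = lit on a truncating column into the range [lit, upper), which
    // covers every stored value that truncates to lit. upper is lit with its last
    // byte incremented; a real implementation would need a fallback when lit is
    // empty or its last byte is 0xFF.
    std::pair<std::string, std::string> EqualityToRange(const std::string& lit) {
      std::string upper = lit;
      upper.back() = static_cast<char>(static_cast<unsigned char>(upper.back()) + 1);
      return {lit, upper};  // push s >= lit AND s < upper instead of s = lit
    }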

It would be much simpler (for us at least) if we could pass the maximum and minimum length to the SARG interface, and it would apply truncation/padding to the min/max statistics before comparing them to the literal we provided. So in the example above min=max="aa" would become "a", which satisfies the pushed down s="a".
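
Only to illustrate the proposed behavior (this is not an existing ORC API; the name and signature are invented), the stat-side normalization could look roughly like this:

    #include <cstddef>
    #include <string>

    // Normalize the row group's min/max string statistics with the same
    // truncation/padding the reader will apply (max_len = N, padding only for
    // CHAR) before checking whether s = lit can possibly match.
    bool MightContainEqual(std::string min_stat, std::string max_stat,
                           const std::string& lit, std::size_t max_len, bool is_char) {
      auto normalize = [&](std::string& s) {
        if (s.size() > max_len) s.resize(max_len);
        if (is_char && s.size() < max_len) s.append(max_len - s.size(), ' ');
      };
      normalize(min_stat);
      normalize(max_stat);
      // In the example above min = max = "aa" both become "a", so the pushed down
      // predicate s = "a" no longer eliminates the row group.
      return min_stat <= lit && lit <= max_stat;
    }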

Note that Impala doesn't care about encoding, so the length is byte length. Other clients may need UTF-8 length instead.
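
To make the distinction concrete, a small illustration-only sketch that counts UTF-8 code points instead of bytes:

    #include <cstddef>
    #include <string>

    // Byte length vs. UTF-8 character length of the same value: UTF-8 continuation
    // bytes have the bit pattern 10xxxxxx, so counting the other bytes gives the
    // number of code points.
    std::size_t Utf8Length(const std::string& s) {
      std::size_t count = 0;
      for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) ++count;
      }
      return count;
    }
    // "é" is stored as the two bytes 0xC3 0xA9: s.size() == 2, but Utf8Length(s) == 1.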

Apart from min/max stats, CHAR/VARCHAR are also problematic for bloom filters: in the example above, "aa"'s hash is probably different from "a"'s, so looking up "a" could fail.
Bloom filters could only be used if we are sure that there won't be any truncation/padding (which is actually quite likely if the schema didn't change, as the DB system enforces the length during writing). If there were statistics about the min/max length of strings, it would be possible to verify this during predicate push down and use bloom filters when it is safe.
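
A hypothetical safety check, assuming such min/max length statistics existed (names are invented):

    #include <cstddef>

    // The bloom filter may only be consulted when no stored value would be changed
    // by the reader's truncation/padding, i.e. the stored bytes are exactly the
    // bytes that will be looked up.
    bool BloomFilterIsSafe(std::size_t stat_min_len, std::size_t stat_max_len,
                           std::size_t type_len, bool is_char) {
      if (stat_max_len > type_len) return false;             // some value would be truncated
      if (is_char && stat_min_len < type_len) return false;  // some value would be padded
      return true;
    }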


