Posted to issues@orc.apache.org by "Gang Wu (Jira)" <ji...@apache.org> on 2020/03/19 07:44:00 UTC

[jira] [Comment Edited] (ORC-612) Improve CHAR(N)/VARCHAR(N) support in predicate push down

    [ https://issues.apache.org/jira/browse/ORC-612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17062351#comment-17062351 ] 

Gang Wu edited comment on ORC-612 at 3/19/20, 7:43 AM:
-------------------------------------------------------

Are you able to reproduce the issue in your example? Both the C++ and Java ORC writers are implemented to strictly pad and truncate CHAR and VARCHAR types, so I don't think the issue above will happen unless there are bugs :( . The same applies to bloom filters for CHAR/VARCHAR.

In addition, both the C++ and Java ORC writers assume that string-type input is already UTF-8 encoded. [~csringhofer]
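
For reference, a minimal sketch of that pad/truncate rule in Java. enforceLength is a hypothetical helper for illustration only, not the actual writer code, and it counts Java characters rather than bytes:

    // Sketch of the rule the writers enforce for CHAR(N)/VARCHAR(N).
    static String enforceLength(String value, int n, boolean isChar) {
        if (value.length() > n) {
            return value.substring(0, n);                // truncate to N
        }
        if (isChar && value.length() < n) {
            return String.format("%-" + n + "s", value); // right-pad CHAR with spaces
        }
        return value;                                    // short VARCHAR stays as-is
    }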

 

 



> Improve CHAR(N)/VARCHAR(N) support in predicate push down
> ---------------------------------------------------------
>
>                 Key: ORC-612
>                 URL: https://issues.apache.org/jira/browse/ORC-612
>             Project: ORC
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Csaba Ringhofer
>            Priority: Major
>
> This came up during the implementation of min/max filters in Apache Impala: https://gerrit.cloudera.org/#/c/15403/ by [~norbertluksa]
> Impala reads CHAR(N)/VARCHAR(N) as follows:
> 0. push down min/max predicates (under review)
> 1. read the value as STRING from ORC
> 2. truncate the value if it is longer than N, or pad it with spaces if the type is CHAR and the value is shorter than N
> 3. evaluate the predicates once all columns are read
> It is possible that a value from ORC does not satisfy the predicates before truncation/padding, but it does afterwards. For example:
> a single column: "s VARCHAR(1)"
> a single value in the ORC file : "aa"
> a predicate: s="a".
> "aa" does not pass the predicate, but after truncation it becomes "a", which passes.
> Currently it is tricky to push this predicate down, as simply passing s="a" would skip the file because min="aa" > "a". (What could work is pushing s>="a" AND s<"b" instead, since every value that truncates to "a" starts with "a" and is therefore >= "a" and < "b".)
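> A hedged sketch of that range rewrite with the Hive SearchArgument builder (the SARG API ORC's Java reader consumes; the column name "s" is from the example above). The builder has no greaterThanEquals, so s>="a" is expressed as NOT(s<"a"):
>
>     import org.apache.hadoop.hive.ql.io.sarg.PredicateLeaf;
>     import org.apache.hadoop.hive.ql.io.sarg.SearchArgument;
>     import org.apache.hadoop.hive.ql.io.sarg.SearchArgumentFactory;
>
>     // s >= "a" AND s < "b": covers every string that truncates to "a".
>     SearchArgument sarg = SearchArgumentFactory.newBuilder()
>         .startAnd()
>           .startNot().lessThan("s", PredicateLeaf.Type.STRING, "a").end() // s >= "a"
>           .lessThan("s", PredicateLeaf.Type.STRING, "b")                  // s <  "b"
>         .end()
>         .build();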
> It would be much simpler (for us at least) if we could pass the max and min length to the SARG interface, and it would apply truncation/padding to the min/max statistics before comparing them to the literal we provided. So in the example above min=max="aa" would become "a", which would satisfy the pushed-down s="a".
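> A hypothetical sketch of that proposal (none of these names exist in ORC today; minStat/maxStat stand for the row-group or stripe statistics, n and isChar for the pushed-down type info):
>
>     // Normalize a stat value the same way the engine normalizes data values,
>     // then run the usual range check against the literal.
>     static String normalize(String v, int n, boolean isChar) {
>         if (v.length() > n) return v.substring(0, n);          // truncate to N
>         return isChar ? String.format("%-" + n + "s", v) : v;  // pad CHAR with spaces
>     }
>
>     static boolean rangeMightMatch(String minStat, String maxStat,
>                                    String literal, int n, boolean isChar) {
>         return normalize(minStat, n, isChar).compareTo(literal) <= 0
>             && normalize(maxStat, n, isChar).compareTo(literal) >= 0;
>     }
>
> With min=max="aa" and n=1, both stats normalize to "a", so rangeMightMatch("aa", "aa", "a", 1, false) returns true and the row group is kept.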
> Note that Impala doesn't care about encoding, so the length here is the byte length. Other clients may need the UTF-8 character (code point) length instead.
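> For illustration, the two lengths diverge as soon as the data is non-ASCII:
>
>     import java.nio.charset.StandardCharsets;
>
>     String s = "é";                                           // one character
>     int byteLen = s.getBytes(StandardCharsets.UTF_8).length;  // 2 bytes
>     int charLen = s.codePointCount(0, s.length());            // 1 code point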
> Apart from min/max stats, CHAR/VARCHAR are also problematic for bloom filters - in the example above, the hash of "aa" is probably different from that of "a", so looking up "a" could fail.
> Bloom filters could only work if we are sure that there won't be any truncation/padding (which is actually quite likely if the schema didn't change, as the DB system enforces this during writing). If there were statistics about the min/max length of strings, it would be possible to verify this during predicate push-down and use bloom filters when it is safe.
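> A hypothetical sketch of that safety check (ORC string statistics do not record min/max length today, so minLenStat/maxLenStat are assumed inputs):
>
>     // Bloom filters are safe only when no stored value needed truncation or
>     // padding: nothing longer than N, and for CHAR nothing shorter than N.
>     static boolean bloomFilterSafe(long minLenStat, long maxLenStat,
>                                    int n, boolean isChar) {
>         boolean noTruncation = maxLenStat <= n;
>         boolean noPadding = !isChar || minLenStat >= n;
>         return noTruncation && noPadding;
>     }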



--
This message was sent by Atlassian Jira
(v8.3.4#803005)