Posted to issues@orc.apache.org by "Quanlong Huang (Jira)" <ji...@apache.org> on 2021/08/27 05:57:00 UTC

[jira] [Updated] (ORC-968) [C++] Column names used to build SearchArgument should be full path names

     [ https://issues.apache.org/jira/browse/ORC-968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Quanlong Huang updated ORC-968:
-------------------------------
    Summary: [C++] Column names used to build SearchArgument should be full path names  (was: Column names used to build SearchArgument should be full path names)

> [C++] Column names used to build SearchArgument should be full path names
> -------------------------------------------------------------------------
>
>                 Key: ORC-968
>                 URL: https://issues.apache.org/jira/browse/ORC-968
>             Project: ORC
>          Issue Type: Bug
>          Components: C++
>    Affects Versions: 1.7.0
>            Reporter: Quanlong Huang
>            Priority: Blocker
>
> The C++ reader provides the same interfaces as those in hive-storage-api for building SearchArguments, e.g.
> {code:cpp}
> virtual SearchArgumentBuilder& lessThan(const std::string& column,
>                                         PredicateDataType type,
>                                         Literal literal) = 0;
> {code}
> C++: [https://github.com/apache/orc/blob/2143841e24abb2e0fef1a3396376682fc3bb6fea/c%2B%2B/include/orc/sargs/SearchArgument.hh#L90-L92]
>  Java: [https://github.com/apache/hive/blob/620b2b197269041d7f508bd0e4564ed8e5edfcfd/storage-api/src/java/org/apache/hadoop/hive/ql/io/sarg/SearchArgument.java#L219-L220]
> However, the Java code treats the column strings as full path names rather than leaf field names. For example, for a nested column 'id' inside a struct column 's', the name string is expected to be 's.id', not 'id'.
> More details about the string format:
> {quote}Find a subtype of this schema by name. If the name is a simple integer, it will be used as a column number. Otherwise, this routine will recursively search for the name.
>  * Struct fields are selected by name.
>  * List children are selected by "_elem".
>  * Map children are selected by "_key" or "_value".
>  * Union children are selected by number starting at 0.
> Names are separated by '.'.
> {quote}
> [https://javadoc.io/doc/org.apache.orc/orc-core/1.6.10/org/apache/orc/TypeDescription.html#findSubtype-java.lang.String-]
> The C++ library currently treats the name strings as struct field names and searches for them recursively in SargsApplier::findColumn:
>  [https://github.com/apache/orc/blob/2143841e24abb2e0fef1a3396376682fc3bb6fea/c%2B%2B/src/sargs/SargsApplier.cc#L25-L39]
> To be consistent with the Java code and to avoid ambiguous selection (when different struct columns contain identically named sub-fields), the C++ library should treat name strings as full path names. It should also handle the special names "_elem", "_key", and "_value".
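
The full-path lookup proposed above could be sketched roughly as follows. This is only an illustration under stated assumptions: `Node`, `Kind`, `makeNode`, `addField`, `buildExampleSchema`, and `findByPath` are hypothetical names invented for this example, standing in for the real orc::Type and SargsApplier code, and the "simple integer means column number" rule from the quoted Javadoc is left out for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-in for orc::Type (not the real ORC class).
enum class Kind { STRUCT, LIST, MAP, UNION, PRIMITIVE };

struct Node {
  Kind kind = Kind::PRIMITIVE;
  uint64_t columnId = 0;
  std::vector<std::string> fieldNames;  // populated for STRUCT nodes only
  std::vector<std::unique_ptr<Node>> children;
};

std::unique_ptr<Node> makeNode(Kind kind, uint64_t id) {
  std::unique_ptr<Node> node(new Node());
  node->kind = kind;
  node->columnId = id;
  return node;
}

void addField(Node& parent, const std::string& name,
              std::unique_ptr<Node> child) {
  parent.fieldNames.push_back(name);
  parent.children.push_back(std::move(child));
}

// Resolves a '.'-separated path such as "s1.id" or "m._value".
// Returns nullptr when any path component cannot be matched.
const Node* findByPath(const Node& root, const std::string& path) {
  const Node* current = &root;
  std::istringstream parts(path);
  std::string part;
  while (std::getline(parts, part, '.')) {
    const Node* next = nullptr;
    switch (current->kind) {
      case Kind::STRUCT:  // struct fields are selected by name
        for (size_t i = 0; i < current->fieldNames.size(); ++i) {
          if (current->fieldNames[i] == part) {
            next = current->children[i].get();
            break;  // leaves the for loop only
          }
        }
        break;
      case Kind::LIST:  // list children are selected by "_elem"
        assert(current->children.size() == 1);
        if (part == "_elem") next = current->children[0].get();
        break;
      case Kind::MAP:  // map children are selected by "_key" or "_value"
        assert(current->children.size() == 2);
        if (part == "_key") next = current->children[0].get();
        else if (part == "_value") next = current->children[1].get();
        break;
      case Kind::UNION:  // union children are selected by number from 0
        if (!part.empty() && part.size() <= 9 &&
            part.find_first_not_of("0123456789") == std::string::npos) {
          size_t index = std::stoul(part);
          if (index < current->children.size()) {
            next = current->children[index].get();
          }
        }
        break;
      case Kind::PRIMITIVE:  // primitives have no children to descend into
        break;
    }
    if (next == nullptr) return nullptr;
    current = next;
  }
  return current;
}

// struct<s1:struct<id:int>,s2:struct<id:int>,m:map<string,int>>
// with column ids assigned in pre-order, starting at 0 for the root.
std::unique_ptr<Node> buildExampleSchema() {
  auto root = makeNode(Kind::STRUCT, 0);
  auto s1 = makeNode(Kind::STRUCT, 1);
  addField(*s1, "id", makeNode(Kind::PRIMITIVE, 2));
  auto s2 = makeNode(Kind::STRUCT, 3);
  addField(*s2, "id", makeNode(Kind::PRIMITIVE, 4));
  auto m = makeNode(Kind::MAP, 5);
  m->children.push_back(makeNode(Kind::PRIMITIVE, 6));  // _key
  m->children.push_back(makeNode(Kind::PRIMITIVE, 7));  // _value
  addField(*root, "s1", std::move(s1));
  addField(*root, "s2", std::move(s2));
  addField(*root, "m", std::move(m));
  return root;
}
```

With the example schema, findByPath(*schema, "s1.id") and findByPath(*schema, "s2.id") resolve to two distinct columns, whereas the bare leaf name "id" matches nothing at the top level. A recursive leaf-name search would instead have to pick one of the two 'id' fields, which is the ambiguity described above.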


