Posted to issues@orc.apache.org by "Chunyang Wen (JIRA)" <ji...@apache.org> on 2016/08/29 02:09:20 UTC

[jira] [Commented] (ORC-97) Support column name selection in ReaderOptions

    [ https://issues.apache.org/jira/browse/ORC-97?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15444533#comment-15444533 ] 

Chunyang Wen commented on ORC-97:
---------------------------------

In Parquet, there is a class called ColumnPath which represents a nested column as a dot-separated string.

I plan to first build a map from each column path (a dot-separated string) to its type id, so that users can specify nested columns with dot-separated names such as a.b.c.

When ReaderOptions receives a request to include columns by name, we can translate each name into its type id and then call includeTypes on ReaderOptions (omalley has already committed includeTypes).

The cost is an in-memory map, but it simplifies the implementation of include_name for nested column names. Also, we do not need to add any new public API, and the change remains compatible.
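
To make the idea concrete, here is a minimal sketch of the name-to-id map for struct-only schemas. It is not the ORC implementation: TypeNode, assignIds, buildNameMap, namesToIds and makeExampleSchema are hypothetical names made up for illustration, and the sketch assumes type ids are assigned in pre-order starting from 0 at the root. The resulting id list is what would be handed to includeTypes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <list>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for a schema node; this is NOT the ORC Type class.
struct TypeNode {
  uint64_t id = 0;                      // type id (assumed pre-order numbering)
  std::vector<std::string> fieldNames;  // one name per child, struct nodes only
  std::vector<TypeNode> children;       // empty for primitive types
};

// Assign ids in pre-order starting at `next`; returns the next unused id.
uint64_t assignIds(TypeNode& node, uint64_t next = 0) {
  node.id = next++;
  for (TypeNode& child : node.children) {
    next = assignIds(child, next);
  }
  return next;
}

// Record "a.b.c" -> id for every struct field reachable from `node`.
// Only struct nesting is handled; map, list and union naming is left open.
void buildNameMap(const TypeNode& node, const std::string& prefix,
                  std::map<std::string, uint64_t>& out) {
  assert(node.fieldNames.size() == node.children.size());
  for (std::size_t i = 0; i < node.children.size(); ++i) {
    const std::string path =
        prefix.empty() ? node.fieldNames[i] : prefix + "." + node.fieldNames[i];
    out[path] = node.children[i].id;
    buildNameMap(node.children[i], path, out);
  }
}

// Translate requested column names into type ids; unknown names are an error.
std::list<uint64_t> namesToIds(const std::map<std::string, uint64_t>& nameToId,
                               const std::list<std::string>& names) {
  std::list<uint64_t> ids;
  for (const std::string& name : names) {
    auto it = nameToId.find(name);
    if (it == nameToId.end()) {
      throw std::invalid_argument("unknown column name: " + name);
    }
    ids.push_back(it->second);
  }
  return ids;
}

// Builds struct<s1:struct<s2:struct<int1:int>>> with ids already assigned.
TypeNode makeExampleSchema() {
  TypeNode int1;  // primitive leaf
  TypeNode s2;
  s2.fieldNames = {"int1"};
  s2.children.push_back(int1);
  TypeNode s1;
  s1.fieldNames = {"s2"};
  s1.children.push_back(s2);
  TypeNode root;
  root.fieldNames = {"s1"};
  root.children.push_back(s1);
  assignIds(root);
  return root;
}
```

With the example schema, the map would hold the three paths s1, s1.s2 and s1.s2.int1, and a request for s1.s2.int1 would be turned into a single type id before the call is forwarded.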

For struct types this is easy to understand: we just append the field name, separated by a dot, to the column path. For other non-primitive types such as map, union, and list, we have to define clearly how they are specified, for example:

struct <m:map<string, primitive_type>>
struct <m:map<string, non_primitive_type>>


> Support column name selection in ReaderOptions
> ----------------------------------------------
>
>                 Key: ORC-97
>                 URL: https://issues.apache.org/jira/browse/ORC-97
>             Project: Orc
>          Issue Type: New Feature
>          Components: C++
>    Affects Versions: 1.2.0
>            Reporter: Chunyang Wen
>            Assignee: Chunyang Wen
>
> After the ORC-92 patch, selection by column id is supported, but selecting a sub-type by name is more useful.
> In my project, we use a period (.) to separate nested field names. For example, given
> <s1:struct<s2:struct<int1: int>>>
> we select int1 with s1.s2.int1, which is passed to include(std::list<std::string>).
> In my implementation, I first build a map from name to column id and then forward the call to includeTypes. If this is an acceptable solution, I will provide a patch for review soon.
> When a sub-type is selected, all of its child types should be selected as well, as O'Malley pointed out in ORC-92.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)