Posted to issues@hive.apache.org by "Vihang Karajgaonkar (JIRA)" <ji...@apache.org> on 2018/08/03 18:30:01 UTC

[jira] [Commented] (HIVE-19715) Consolidated and flexible API for fetching partition metadata from HMS

    [ https://issues.apache.org/jira/browse/HIVE-19715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568590#comment-16568590 ] 

Vihang Karajgaonkar commented on HIVE-19715:
--------------------------------------------

While working on this I realized a few things that could change the design. By default, Thrift fields have "default requiredness" [https://thrift.apache.org/docs/idl#field-requiredness], which is like a hybrid of {{optional}} and {{required}}. So in the write path Thrift attempts to write such fields whenever possible (null fields cannot be written, IIUC). On the read side, the reader always checks whether the field is set. This is really the behavior we want and, fortunately, the fields in the Partition Thrift definition use either default requiredness or {{optional}}, which works well for partially filled partitions. So in theory I could just return a List<Partition> for this API, but I think using PartitionSpec still makes a lot of sense since it groups the partitions according to the {{table location, fieldSchema, deserializer class}}. For non-standard partition locations, I think there is no harm in grouping them together, especially when there are a lot of such non-standard partitions.

I am planning to use {{PropertyUtils}} from the {{commons-beanutils}} package, which is already on the metastore classpath via the {{apache-commons}} dependency. It provides the {{setNestedProperty}} method, which can be used to set the fields. All the fields defined in Thrift have setter methods, so this should not cause any problems.
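
To illustrate what setting a field by its dotted path involves, here is a minimal sketch that uses plain reflection as a stand-in for {{PropertyUtils.setNestedProperty}}, so that it runs without the {{commons-beanutils}} dependency. The {{Partition}}, {{StorageDescriptor}} and {{SerDeInfo}} classes below are hypothetical, stripped-down stand-ins for the Thrift-generated classes, and {{NestedSetter}} is an invented helper name; none of this is the actual metastore code.

```java
import java.lang.reflect.Method;

// Hypothetical, minimal stand-ins for the Thrift-generated metastore classes.
class SerDeInfo {
    private String serializationClass;
    public String getSerializationClass() { return serializationClass; }
    public void setSerializationClass(String v) { serializationClass = v; }
}

class StorageDescriptor {
    private String location;
    private SerDeInfo serdeInfo = new SerDeInfo();
    public String getLocation() { return location; }
    public void setLocation(String v) { location = v; }
    public SerDeInfo getSerdeInfo() { return serdeInfo; }
    public void setSerdeInfo(SerDeInfo v) { serdeInfo = v; }
}

class Partition {
    private StorageDescriptor sd = new StorageDescriptor();
    public StorageDescriptor getSd() { return sd; }
    public void setSd(StorageDescriptor v) { sd = v; }
}

// Hypothetical helper: walks the getters along a dotted path, then calls the final setter.
class NestedSetter {
    static void setNested(Object bean, String path, Object value) {
        try {
            String[] parts = path.split("\\.");
            Object target = bean;
            // Follow getters for every path element except the last one.
            for (int i = 0; i < parts.length - 1; i++) {
                target = target.getClass().getMethod("get" + capitalize(parts[i])).invoke(target);
                if (target == null) {
                    throw new IllegalArgumentException("Intermediate property is null: " + parts[i]);
                }
            }
            // Find a one-argument setter for the last path element and invoke it.
            String setter = "set" + capitalize(parts[parts.length - 1]);
            for (Method m : target.getClass().getMethods()) {
                if (m.getName().equals(setter) && m.getParameterCount() == 1) {
                    m.invoke(target, value);
                    return;
                }
            }
            throw new IllegalArgumentException("No setter found for: " + path);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Could not set " + path, e);
        }
    }

    private static String capitalize(String s) {
        return Character.toUpperCase(s.charAt(0)) + s.substring(1);
    }
}
```

With this sketch, {{NestedSetter.setNested(partition, "sd.location", "/some/path")}} sets only the location on an otherwise empty partition. Note that the sketch assumes the intermediate objects ({{sd}}, {{serdeInfo}}) already exist; a real implementation would need to decide how to create them when they are null.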

For setting the projected fields, in the case of JDO we cannot set multi-valued fields in the {{setResult}} clause, which is a JDO limitation. In such a case the JDO version of the API will fall back to retrieving the full partitions. The directSQL version of the API, however, should be able to parse and set multi-valued fields like it does currently. I am currently looking at the directSQL implementation of setting partition fields, trying to come up with a more maintainable way to selectively fire the correct queries based on the projection field list, instead of introducing a bunch of if/else or case statements in that code. I am thinking of creating a PartitionFieldParser class which will emit the right queries for the given list of fields. We will have to take care of optimizing the field list as well. It should remove redundant fields, e.g. if {{sd}} is present we can safely remove the redundant {{sd.location}} or {{sd.serdeInfo.serializationClass}}. Similarly, if all the nested fields of {{sd}} are present individually, we can combine them into the single field {{sd}}. I am currently treating these as optional improvements which I will fix later as needed.
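
The redundant-field removal described above could look roughly like the following sketch. {{ProjectionFields}} and {{pruneRedundant}} are hypothetical names used only for illustration, not part of any existing Hive class, and the sketch covers only the removal of nested fields whose parent is already requested (not the combining of sibling fields into their parent).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper illustrating projection field-list pruning.
class ProjectionFields {

    // Returns the requested fields without duplicates and without any field
    // whose ancestor (e.g. "sd" for "sd.location") is also requested.
    // The result is in sorted order, not in the order of the input list.
    static List<String> pruneRedundant(List<String> fields) {
        TreeSet<String> requested = new TreeSet<>(fields);
        List<String> result = new ArrayList<>();
        for (String field : requested) {
            if (!hasRequestedAncestor(field, requested)) {
                result.add(field);
            }
        }
        return result;
    }

    // Checks every dotted prefix of the field ("a.b.c" -> "a.b", then "a").
    private static boolean hasRequestedAncestor(String field, Set<String> requested) {
        String prefix = field;
        int dot;
        while ((dot = prefix.lastIndexOf('.')) > 0) {
            prefix = prefix.substring(0, dot);
            if (requested.contains(prefix)) {
                return true;
            }
        }
        return false;
    }
}
```

For a request containing {{sd.location}}, {{sd}}, {{values}} and {{sd.serdeInfo.serializationClass}}, this keeps only {{sd}} and {{values}}. Matching on whole dotted prefixes rather than on string prefixes means a field such as {{sdx}} is not mistaken for a child of {{sd}}.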

I plan to divide the work into sub-tasks, since each one of these could be a considerable code change.
 1. Expose thrift API with the support for projected fields
 2. Add support for filters
 3. Add support for pagination

I will update the design doc based on the above modifications once I am close to completing sub-task 1, just in case there are more puzzles to solve.

> Consolidated and flexible API for fetching partition metadata from HMS
> ----------------------------------------------------------------------
>
>                 Key: HIVE-19715
>                 URL: https://issues.apache.org/jira/browse/HIVE-19715
>             Project: Hive
>          Issue Type: New Feature
>          Components: Standalone Metastore
>            Reporter: Todd Lipcon
>            Assignee: Vihang Karajgaonkar
>            Priority: Major
>         Attachments: HIVE-19715-design-doc.pdf
>
>
> Currently, the HMS thrift API exposes 17 different APIs for fetching partition-related information. There is somewhat of a combinatorial explosion going on, where each API has variants with and without "auth" info, by pspecs vs names, by filters, by exprs, etc. Having all of these separate APIs long term is a maintenance burden and also more confusing for consumers.
> Additionally, even with all of these APIs, there is a lack of granularity in fetching only the information needed for a particular use case. For example, in some use cases it may be beneficial to only fetch the partition locations without wasting effort fetching statistics, etc.
> This JIRA proposes that we add a new "one API to rule them all" for fetching partition info. The request and response would be encapsulated in structs. Some desirable properties:
> - the request should be able to specify which pieces of information are required (eg location, properties, etc)
> - in the case of partition parameters, the request should be able to do either whitelisting or blacklisting (eg to exclude large incremental column stats HLL dumped in there by Impala)
> - the request should optionally specify auth info (to encompass the "with_auth" variants)
> - the request should be able to designate the set of partitions to access through one of several different methods (eg "all", list<name>, expr, part_vals, etc) 
> - the struct should be easily evolvable so that new pieces of info can be added
> - the response should be designed in such a way as to avoid transferring redundant information for common cases (eg simple "dictionary coding" of strings like parameter names, etc)
> - the API should support some form of pagination for tables with large partition counts



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)