Posted to dev@hive.apache.org by "Charles Pritchard (JIRA)" <ji...@apache.org> on 2016/01/26 23:17:39 UTC

[jira] [Created] (HIVE-12936) Better support for multimap semantics

Charles Pritchard created HIVE-12936:
----------------------------------------

             Summary: Better support for multimap semantics
                 Key: HIVE-12936
                 URL: https://issues.apache.org/jira/browse/HIVE-12936
             Project: Hive
          Issue Type: Improvement
            Reporter: Charles Pritchard


Working with data of the form array<struct<key:string,value:string>> is currently difficult.

When processing incoming files, both the struct type and the simpler map<string,string> are well supported. If the incoming data has duplicate keys, the array-of-structs form must be used or data will be lost. Once the data is in that form, however, it is very difficult to write reasonable queries against it.
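
To make the data-loss point concrete, here is a minimal Python sketch (not Hive code) of the two shapes. The helper names to_map and to_struct_array are hypothetical and exist only for this illustration. Which duplicate survives in a map depends on the implementation; in Python's dict the last value wins.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def to_map(pairs: List[Pair]) -> Dict[str, str]:
    """Model map<string,string>: one value per key, so duplicates collapse."""
    return dict(pairs)


def to_struct_array(pairs: List[Pair]) -> List[Dict[str, str]]:
    """Model array<struct<key:string,value:string>>: every pair is kept."""
    return [{"key": k, "value": v} for k, v in pairs]


if __name__ == "__main__":
    pairs = [("k", "v"), ("k", "v2")]
    print(to_map(pairs))           # a single entry for "k"
    print(to_struct_array(pairs))  # both entries for "k"
```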

There are various UDF features I'd like to see, as well as SerDe support for TextInputFormat.

Examples:
UDF:
- str_to_map - provide an equivalent, str_to_structarray, that returns an array of structs.
- array_struct_indexof - search the array of structs and return the first matching offset. This is very difficult to do in a reasonable manner using straight SQL, as I believe it needs a combination of LATERAL VIEW OUTER, inline, and PARTITION BY with OVER. I need to be able to say str_to_structarray("k=v,k=v2", "key", "value") to get array(struct(key,value)). And I need to be able to run array_struct_indexof(array(struct), "key", "k") to get an offset of 0, so I can reasonably select the value.
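
The requested semantics of the two UDFs can be sketched in Python as follows. This is only an illustration of the intended behavior, not a Hive UDF implementation. The argument order mirrors the calls quoted above; the "," and "=" delimiters are taken from the example string, and returning -1 when no element matches is an assumption, since the request does not say what should happen in that case.

```python
from typing import Dict, List


def str_to_structarray(
    text: str,
    key_field: str,
    value_field: str,
    pair_delim: str = ",",  # assumed from the "k=v,k=v2" example
    kv_delim: str = "=",    # assumed from the "k=v,k=v2" example
) -> List[Dict[str, str]]:
    """Parse delimited text into a list of two-field records, keeping duplicates."""
    if not text:
        return []
    records = []
    for pair in text.split(pair_delim):
        # partition() splits on the first delimiter only; value is "" if absent.
        key, _, value = pair.partition(kv_delim)
        records.append({key_field: key, value_field: value})
    return records


def array_struct_indexof(records: List[Dict[str, str]], field: str, needle: str) -> int:
    """Return the offset of the first record whose field equals needle.

    Returns -1 when nothing matches (an assumption, not part of the request).
    """
    for offset, record in enumerate(records):
        if record.get(field) == needle:
            return offset
    return -1


if __name__ == "__main__":
    rows = str_to_structarray("k=v,k=v2", "key", "value")
    first = array_struct_indexof(rows, "key", "k")
    print(first, rows[first]["value"])
```

With the offset in hand, selecting the value is a plain index into the array, which is the part that is awkward to express in straight SQL today.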

For TextInputFormat, I'd like to be able to process map<string,array<string>>. This would simply collect all values for a key instead of keeping only one value when there are duplicate keys.
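
As a rough sketch of that collecting behavior, again in Python rather than as an actual SerDe: the function name parse_multimap and the "," and "=" delimiters are assumptions made for the illustration.

```python
from collections import defaultdict
from typing import Dict, List


def parse_multimap(
    text: str,
    pair_delim: str = ",",  # assumed delimiter between pairs
    kv_delim: str = "=",    # assumed delimiter between key and value
) -> Dict[str, List[str]]:
    """Collect every value seen for a key, in input order, instead of keeping one."""
    collected: Dict[str, List[str]] = defaultdict(list)
    if text:
        for pair in text.split(pair_delim):
            key, _, value = pair.partition(kv_delim)
            collected[key].append(value)
    return dict(collected)


if __name__ == "__main__":
    print(parse_multimap("k=v,k=v2,a=b"))
```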


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)