Posted to user@hive.apache.org by Sadananda Hegde <sa...@gmail.com> on 2012/10/01 21:26:58 UTC

Re: Defining collection items terminated by for a nested data type

I tested the nesting with the following DDL.

CREATE TABLE test_tbl (col1 STRING, col2 INT, col3 MAP<STRING,
ARRAY<STRING>>)
ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '|'
    COLLECTION ITEMS TERMINATED BY ','
    MAP KEYS TERMINATED BY ':'
    LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

I was able to load and query the test_tbl with the following data
text1|1|key1:value1a^Dvalue1b^Dvalue1c, key2:value2a^Dvalue2b

That is: '|' as the field delimiter
         ':' as the map key/value separator
         ',' as the item separator between key/value pairs (i.e. key/value
             pair 1 and key/value pair 2 are separated by ',')
         '^D' as the item separator for the elements of the inner array
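
To make the layering concrete, here is a minimal Python sketch that splits a row using the delimiters listed above. It only illustrates the nesting order and is not how Hive itself parses the file. The names parse_row, FIELD_SEP, ENTRY_SEP, KEY_SEP and ARRAY_SEP are made up for this example, ^D is written as the control character \x04, and the sample row is written without the space after the comma, since in this sketch a space would become part of the second key.

```python
# Delimiters as described in the message (names are illustrative only).
FIELD_SEP = "|"      # between col1, col2 and col3
ENTRY_SEP = ","      # between key/value pairs of the map
KEY_SEP = ":"        # between a map key and its value
ARRAY_SEP = "\x04"   # ^D, between elements of the inner array


def parse_row(line):
    """Split one text row into (col1, col2, col3) following the nesting order."""
    col1, col2, col3_raw = line.rstrip("\n").split(FIELD_SEP)
    col3 = {}
    for entry in col3_raw.split(ENTRY_SEP):
        key, value = entry.split(KEY_SEP, 1)
        col3[key] = value.split(ARRAY_SEP)
    return col1, int(col2), col3


row = "text1|1|key1:value1a\x04value1b\x04value1c,key2:value2a\x04value2b"
print(parse_row(row))
```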

As you can see, the delimiters defined in the DDL work fine for level 1, but
Hive falls back to its default separators (such as ^D in this case) for
levels 2, 3, etc. The control characters are hard to read and much more
difficult to produce in the data file. What is the best way to specify
readable characters for all levels?
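
For what it's worth, producing the control character from a program is not hard, even if it is unreadable in an editor. The following Python sketch builds the sample row above from a nested structure, writing ^D as \x04. The function name format_row and the constant names are made up for this example, and it assumes the same delimiters as in the DDL and data shown earlier.

```python
# Same delimiters as in the DDL and sample data (names are illustrative only).
FIELD_SEP = "|"
ENTRY_SEP = ","
KEY_SEP = ":"
ARRAY_SEP = "\x04"   # ^D, Hive's default separator seen for the inner array


def format_row(col1, col2, col3):
    """Serialize (STRING, INT, MAP<STRING, ARRAY<STRING>>) into one text row."""
    entries = [key + KEY_SEP + ARRAY_SEP.join(values) for key, values in col3.items()]
    return FIELD_SEP.join([col1, str(col2), ENTRY_SEP.join(entries)])


line = format_row(
    "text1",
    1,
    {"key1": ["value1a", "value1b", "value1c"], "key2": ["value2a", "value2b"]},
)
print(repr(line))  # repr() makes the \x04 characters visible
```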

Is it possible to handle this kind of nested structure using a SERDE row
format? Can someone please help me with the actual CREATE TABLE syntax
with ROW FORMAT SERDE for the example above?

Thanks in advance for your help.

Sadu


On Fri, Sep 28, 2012 at 9:27 AM, Sadananda Hegde <sa...@gmail.com> wrote:

> Thanks Manish.
>
> It's a good article, but it's still not clear to me how you define the
> delimiters when the column is of a nested type (like array of maps, map of
> arrays, etc.).
>
> Just a clarification on item 2 below.
>
> 2. What would be the separator for map elements?
>
> For Map element separator is “=”
>
>
> '=' is the MAP key separator. What I mean is the item separator when the
> map contains multiple key/value pairs, like
>
>    (Key1=Value1; Key2=Value2; Key3=Value3....)
>
>
> Here '=' is the key separator and ';' is the item separator.
>
>
> I can handle the above example with COLLECTION ITEMS TERMINATED BY ';'
> and MAP KEYS TERMINATED BY '=' if the element is of type MAP. COLLECTION
> ITEMS TERMINATED BY ',' works on all three data types (maps, arrays,
> structs) when they are by themselves. The problem is defining them for
> nested structures, because we need multiple separators: one separator for
> array items and a different separator for map items defined within that
> array, etc.
>
>
> The default Hive delimiters work just fine. In that case the delimiters
> will be '^A' for level 1, '^B' for level 2, '^C' for level 3, etc. What I am
> trying to do is to define them explicitly. The COLLECTION ITEMS
> TERMINATED BY ',' statement addresses the first level (^A), but I don't know
> how to define the separators for the other levels (to use instead of ^B, ^C,
> etc.).
>
>
> Thanks,
>
> Sadu
>
>
>
> On Fri, Sep 28, 2012 at 1:28 AM, Manish.Bhoge <Ma...@target.com> wrote:
>
>> Hi Sadu,
>>
>> See my answer below.
>>
>> Also this will help you to understand in detail about collection, MAP and
>> Array.
>>
>> http://datumengineering.wordpress.com/2012/09/27/agility-in-hive-map-array-score-for-hive/
>>
>> *From:* Sadananda Hegde [mailto:saduhegde@gmail.com]
>> *Sent:* Friday, September 28, 2012 10:31 AM
>> *To:* user@hive.apache.org
>> *Subject:* Defining collection items terminated by for a nested data type
>>
>> How does "collection items terminated by" work on a nested structure?
>> Say the table is created with the DDL:
>>
>> CREATE TABLE table_1(f1 int, f2 string, f3 array<struct<a string, b
>> int, c map<string, string>>>)
>> ROW FORMAT DELIMITED
>> FIELDS TERMINATED BY '|'
>> COLLECTION ITEMS TERMINATED BY ','
>> MAP KEYS TERMINATED BY '='
>> LINES TERMINATED BY '\n'
>> STORED AS TEXTFILE;
>>
>> I guess the comma separator will be used for the items in the outermost
>> structure (i.e. the array). Is that true?
>>
>> Yes. Right, comma is a separator for array.
>>
>> 1. What would be the separator character between a, b and c
>> (struct elements)?
>>
>> I think it is \n. Not very sure about this.
>>
>> 2. What would be the separator for map elements?
>>
>> For Map element separator is “=”
>>
>> 3. Is there a way to explicitly specify those ITEMS separators rather
>> than using the default ones like ^B, ^C, etc. (like multiple collection
>> items)?
>>
>> You can define the custom separator. But multiple collection seems
>> infeasible.
>>
>> The original data is in XML format (a complex one with many nested levels)
>> and we are planning to parse that XML using a Java parser into a delimited
>> text file which can be used to load the Hive table. My question is:
>>
>>      "How should we be representing the f3-like structures in the data
>> file?"
>>
>> The actual file has many more fields, with quite a few complex types like
>> f3 above, but I guess the logic would be the same.
>>
>> --- For this either you need to write custom input reader in MAP-REDUCE
>> or use custom serde.
>>
>> Thanks for your help.
>>
>> Regards,
>>
>> Sadu
>>
>
>