
Posted to user@hive.apache.org by 周梦想 <ab...@gmail.com> on 2013/03/11 05:04:53 UTC

how to handle variable format data of text file?

I have files like this:
03/11/13 10:59:52 00000ec0 1009 180538126 92041 2300 0 0 7 21|47|20|33|11 0:2775
03/11/13 10:59:52 00000744 1010 178343610 92042 350 1 0 -1 NULL NULL 22 45
Fields are separated by a single space:
date time threadid gid userid [variable-format data, itself made up of
space-separated fields]

I'd like to create a table like:

hive> create external table handresult (hdate string, htime string, thid
string, gid int, userid string, ldata string) row format delimited fields
terminated by " ";
OK

But the table above only contains part of each line:
select * from handresult;
03/11/13 10:59:52 00000ec0 1009 180538126 92041
03/11/13 10:59:52 00000744 1010 178343610 92042

The remaining data, such as "2300 0 0 7 21|47|20|33|11 0:2775", is lost.

ldata varies in length and format: it can be a run of values separated by
" " or contain an array, and we need to parse it differently for each gid.

How can I do this?

Thanks,
Andy Zhou

Re: how to handle variable format data of text file?

Posted by Ramki Palle <ra...@gmail.com>.

One thing you can try is to make ldata a map field, since it contains
variably formatted data, and write a UDF to extract whatever information
you need.
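
For simple cases, Hive's built-in split() may be enough before reaching for
a custom UDF. The query below is only a sketch of that lighter alternative,
not the map-plus-UDF approach itself. It assumes ldata already holds the
whole remainder of each line as one string (the delimited table in the
original post does not give you that), and it assumes, purely for
illustration, that for gid 1009 the sixth space-separated token of ldata is
the "|"-separated list. The column aliases are made up.

```sql
-- Sketch only: the token positions are assumptions based on the sample line
--   92041 2300 0 0 7 21|47|20|33|11 0:2775
-- split() takes a regular expression, so the pipe has to be escaped.
SELECT userid,
       split(ldata, ' ')[1]               AS second_token,
       split(split(ldata, ' ')[5], '\\|') AS value_list
FROM handresult
WHERE cast(gid AS int) = 1009;
```

Each gid would get its own query (or its own branch in a CASE expression)
with the token positions that apply to that record type.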

Regards,
Ramki.



On Mon, Mar 18, 2013 at 1:23 AM, Zhiwen Sun <pe...@gmail.com> wrote:

> Your CREATE TABLE statement declares the fields as delimited by a blank
> space, so the rest of each line, beyond the declared columns, is dropped.
>
> If you want to keep the rest of the line, I suggest using the
> org.apache.hadoop.hive.contrib.serde2.RegexSerDe row format instead of the
> default delimited format.
>
>
> Zhiwen Sun

Re: how to handle variable format data of text file?

Posted by Zhiwen Sun <pe...@gmail.com>.

Your CREATE TABLE statement declares the fields as delimited by a blank
space, so the rest of each line, beyond the declared columns, is dropped.

If you want to keep the rest of the line, I suggest using the
org.apache.hadoop.hive.contrib.serde2.RegexSerDe row format instead of the
default delimited format.
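
A minimal sketch of such a table follows. The jar path and the LOCATION are
placeholders, not values from this thread. The contrib RegexSerDe accepts
only string columns, so gid is declared as string here and can be cast in
queries. The regex has one capture group per column; the last group, (.*),
takes everything after the fifth space, so ldata receives the whole
remainder of the line.

```sql
-- Placeholder path: point this at the hive-contrib jar of your installation.
ADD JAR /path/to/hive-contrib.jar;

-- Hypothetical LOCATION; all columns are STRING because the contrib
-- RegexSerDe rejects other column types.
CREATE EXTERNAL TABLE handresult (
  hdate  STRING,
  htime  STRING,
  thid   STRING,
  gid    STRING,
  userid STRING,
  ldata  STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) (.*)"
)
STORED AS TEXTFILE
LOCATION '/data/handresult';
```

The regex has to match the entire line, so a line with fewer than six
space-separated tokens will not match and should come back as NULLs.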


Zhiwen Sun


