You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@hive.apache.org by Mayuran Yogarajah <ma...@casalemedia.com> on 2009/09/10 06:15:04 UTC

Custom serde for parsing

I have a file in HDFS which has the following format:
c1<space>c2<space>c3<space>c4<tab>c5<space>c6<space>c7

where cX represents column X.

Can someone please show me how I can create a table in Hive for this?

I tried the following but it gave an error:
CREATE TABLE test (
c1 STRING,
c2 STRING,
c3 STRING,
c4 STRING,
c5 STRING,
c6 STRING,
c7 STRING )
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES ('regex'='(\w+) (\w+) (\w+) (\w+)\t(\w+) (\w+) (\w+)')
STORED AS TEXTFILE;

hive> load data inpath '/user/hadoop/test' into table test;

hive> select * from test;
OK
Failed with exception 
java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException: This 
table does not have serde property "input.regex"!

Thank you very much =)

Re: Custom serde for parsing

Posted by Zheng Shao <zs...@gmail.com>.

You can write your own specified SerDe to make it more efficient.

Basically, copy and paste RegexSerde, and:
1. use your own string scan instead of Regex Match,
2. return org.apache.hadoop.io.Text instead of java.lang.String (and reuse
the same Text for the same field in different rows)

Zheng

On Thu, Sep 10, 2009 at 9:05 PM, Mayuran Yogarajah <
mayuran.yogarajah@casalemedia.com> wrote:

> Zheng Shao wrote:
>
>> 1. Yes the performance will be affected, especially we are doing one regex
>> match per row, as well as creating a lot of String objects. If we define
>> them as int and uses the default row format, we won't create those String
>> objects.
>>
>>  Is there anything I can do to alleviate this without reformatting the
> data ?
>
> thanks
>

-- 
Yours,
Zheng

Re: Custom serde for parsing

Posted by Mayuran Yogarajah <ma...@casalemedia.com>.

Zheng Shao wrote:
> 1. Yes the performance will be affected, especially we are doing one 
> regex match per row, as well as creating a lot of String objects. If 
> we define them as int and uses the default row format, we won't create 
> those String objects.
>
Is there anything I can do to alleviate this without reformatting the data ?

thanks

Re: Custom serde for parsing

Posted by Zheng Shao <zs...@gmail.com>.

1. Yes the performance will be affected, especially we are doing one regex
match per row, as well as creating a lot of String objects. If we define
them as int and uses the default row format, we won't create those String
objects.

2. Yes that's expected. You can do the following if you don't want to move
the original files:

CREATE EXTERTAL TABLE mytable (...) ... LOCATION "hdfs://.../"



Zheng

On Thu, Sep 10, 2009 at 11:48 AM, Mayuran Yogarajah <
mayuran.yogarajah@casalemedia.com> wrote:

> Zheng Shao wrote:
>
>> WITH SERDEPROPERTIES ('input.regex'='(\\w+) (\\w+) (\\w+) (\\w+)\\t(\\w+)
>> (\\w+) (\\w+)')
>>
>> The reason for double backslash is that Hive string constant will take one
>> level of escaping, and the regular expression will take another level.
>>
>> Please let us know where you see the 'regex'='...' syntax. It's outdated.
>> We need to update it.
>>
>> Zheng
>>
>>  I really wish I had just made all the columns tab separated =/
> Thanks for the regexp, it now correctly parses the file.
>
> Question, RegexSerDe only works with string columns so I had to use strings
> for my columns
> even though they are all numbers.  What are the implications of this? Is
> there a performance
> penalty?
>
> Also I notice that the file used to load the table in HDFS no longer exists
> after I issue the
> LOAD DATA INPATH command.  Is this expected? Is there some way to get
> around this?
>
> thanks
>
>
>  On Wed, Sep 9, 2009 at 9:15 PM, Mayuran Yogarajah <
>> mayuran.yogarajah@casalemedia.com <mailto:
>> mayuran.yogarajah@casalemedia.com>> wrote:
>>
>>    I have a file in HDFS which has the following format:
>>    c1<space>c2<space>c3<space>c4<tab>c5<space>c6<space>c7
>>
>>    where cX represents column X.
>>
>>    Can someone please show me how I can create a table in Hive for this?
>>
>>    I tried the following but it gave an error:
>>    CREATE TABLE test (
>>    c1 STRING,
>>    c2 STRING,
>>    c3 STRING,
>>    c4 STRING,
>>    c5 STRING,
>>    c6 STRING,
>>    c7 STRING )
>>    ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
>>    WITH SERDEPROPERTIES ('regex'='(\w+) (\w+) (\w+) (\w+)\t(\w+)
>>    (\w+) (\w+)')
>>    STORED AS TEXTFILE;
>>
>>    hive> load data inpath '/user/hadoop/test' into table test;
>>
>>    hive> select * from test;
>>    OK
>>    Failed with exception
>>    java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException:
>>    This table does not have serde property "input.regex"!
>>
>>    Thank you very much =)
>>
>>
>>
>>
>> --
>> Yours,
>> Zheng
>>
>
>


-- 
Yours,
Zheng

Re: Custom serde for parsing

Posted by Mayuran Yogarajah <ma...@casalemedia.com>.

Zheng Shao wrote:
> WITH SERDEPROPERTIES ('input.regex'='(\\w+) (\\w+) (\\w+) 
> (\\w+)\\t(\\w+) (\\w+) (\\w+)')
>
> The reason for double backslash is that Hive string constant will take 
> one level of escaping, and the regular expression will take another level.
>
> Please let us know where you see the 'regex'='...' syntax. It's 
> outdated. We need to update it.
>
> Zheng
>
I really wish I had just made all the columns tab separated =/
Thanks for the regexp, it now correctly parses the file.

Question, RegexSerDe only works with string columns so I had to use 
strings for my columns
even though they are all numbers.  What are the implications of this? Is 
there a performance
penalty?

Also I notice that the file used to load the table in HDFS no longer 
exists after I issue the
LOAD DATA INPATH command.  Is this expected? Is there some way to get 
around this?

thanks

> On Wed, Sep 9, 2009 at 9:15 PM, Mayuran Yogarajah 
> <mayuran.yogarajah@casalemedia.com 
> <ma...@casalemedia.com>> wrote:
>
>     I have a file in HDFS which has the following format:
>     c1<space>c2<space>c3<space>c4<tab>c5<space>c6<space>c7
>
>     where cX represents column X.
>
>     Can someone please show me how I can create a table in Hive for this?
>
>     I tried the following but it gave an error:
>     CREATE TABLE test (
>     c1 STRING,
>     c2 STRING,
>     c3 STRING,
>     c4 STRING,
>     c5 STRING,
>     c6 STRING,
>     c7 STRING )
>     ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
>     WITH SERDEPROPERTIES ('regex'='(\w+) (\w+) (\w+) (\w+)\t(\w+)
>     (\w+) (\w+)')
>     STORED AS TEXTFILE;
>
>     hive> load data inpath '/user/hadoop/test' into table test;
>
>     hive> select * from test;
>     OK
>     Failed with exception
>     java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException:
>     This table does not have serde property "input.regex"!
>
>     Thank you very much =)
>
>
>
>
> -- 
> Yours,
> Zheng

Re: Custom serde for parsing

Posted by Zheng Shao <zs...@gmail.com>.

WITH SERDEPROPERTIES ('input.regex'='(\\w+) (\\w+) (\\w+) (\\w+)\\t(\\w+)
(\\w+) (\\w+)')

The reason for double backslash is that Hive string constant will take one
level of escaping, and the regular expression will take another level.

Please let us know where you see the 'regex'='...' syntax. It's outdated. We
need to update it.

Zheng

On Wed, Sep 9, 2009 at 9:15 PM, Mayuran Yogarajah <
mayuran.yogarajah@casalemedia.com> wrote:

> I have a file in HDFS which has the following format:
> c1<space>c2<space>c3<space>c4<tab>c5<space>c6<space>c7
>
> where cX represents column X.
>
> Can someone please show me how I can create a table in Hive for this?
>
> I tried the following but it gave an error:
> CREATE TABLE test (
> c1 STRING,
> c2 STRING,
> c3 STRING,
> c4 STRING,
> c5 STRING,
> c6 STRING,
> c7 STRING )
> ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
> WITH SERDEPROPERTIES ('regex'='(\w+) (\w+) (\w+) (\w+)\t(\w+) (\w+) (\w+)')
> STORED AS TEXTFILE;
>
> hive> load data inpath '/user/hadoop/test' into table test;
>
> hive> select * from test;
> OK
> Failed with exception
> java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException: This table
> does not have serde property "input.regex"!
>
> Thank you very much =)
>
>


-- 
Yours,
Zheng