Posted to user@hive.apache.org by java8964 java8964 <ja...@hotmail.com> on 2012/04/03 19:47:30 UTC

Question about org.apache.hadoop.hive.contrib.serde2.RegexSerDe

Hi, 
I have a question about the behavior of the class org.apache.hadoop.hive.contrib.serde2.RegexSerDe. Here is an example I tested using the Cloudera hive-0.7.1-cdh3u3 release. The class did NOT do what I expected. Does anyone know the reason?
user:~/tmp> more Test.java
import java.io.*;
import java.text.*;

class Test {
    public static void main (String[] argv) throws Exception
    {
        String line = "aaa,\"bbb\",\"cc,c\"";
        String[] tokens = line.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)");
        int i = 1;
        for(String t : tokens) {
            System.out.println(i + "> "+t);
            i++;
        }
    }
}

:~/tmp> java Test
1> aaa
2> "bbb"
3> "cc,c"

As you can see, the Java regular expression ",(?=([^\"]*\"[^\"]*\")*[^\"]*$)" did what I wanted when passed to String.split(): it split the string aaa,"bbb","cc,c" into 3 tokens: (aaa), ("bbb"), and ("cc,c"). So the regular expression works fine as a delimiter pattern.
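As a cross-check, the sketch below applies the same pattern in two different ways. String.split() treats the pattern as a delimiter, while Matcher.matches() requires the pattern to match the entire line. Whether RegexSerDe applies "input.regex" in the second way, reading one capture group per column, is an assumption here and is not confirmed anywhere in this message. The class name SplitVsMatch and the pattern FULL_LINE_REGEX are hypothetical and exist only for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SplitVsMatch {
    // The delimiter-style pattern from the test above.
    static final String DELIMITER_REGEX = ",(?=([^\"]*\"[^\"]*\")*[^\"]*$)";

    // Hypothetical whole-line pattern with one capture group per column.
    // It only covers rows shaped like: unquoted,"quoted","quoted"
    static final String FULL_LINE_REGEX = "([^,]*),(\"[^\"]*\"),(\"[^\"]*\")";

    // True only if the regex matches the entire line, not just part of it.
    static boolean matchesWholeLine(String regex, String line) {
        return Pattern.compile(regex).matcher(line).matches();
    }

    // Returns the capture groups of a whole-line match, or null if the
    // regex does not match the entire line.
    static String[] groups(String regex, String line) {
        Matcher m = Pattern.compile(regex).matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] out = new String[m.groupCount()];
        for (int i = 0; i < out.length; i++) {
            out[i] = m.group(i + 1);
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "aaa,\"bbb\",\"cc,c\"";

        // Used as a delimiter, the pattern yields 3 tokens.
        System.out.println(line.split(DELIMITER_REGEX).length);

        // Used as a whole-line match, the same pattern fails, because the
        // line does not consist of a single comma.
        System.out.println(matchesWholeLine(DELIMITER_REGEX, line));

        // The whole-line pattern exposes each field as a capture group.
        for (String g : groups(FULL_LINE_REGEX, line)) {
            System.out.println(g);
        }
    }
}
```

If the SerDe does match whole rows this way, a delimiter-style pattern would never match a row, and a pattern with one capture group per column would be needed instead.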
Now in Hive:
:~> more test.txt
aaa,"bbb","cc,c"
:~> hive
Hive history file=/tmp/user/hive_job_log_user_201204031242_591028210.txt
hive> create table test(
    >  c1 string,
    >  c2 string,
    >  c3 string
    > )
    > row format
    > SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
    > WITH SERDEPROPERTIES (
    > "input.regex" = ",(?=([^\"]*\"[^\"]*\")*[^\"]*$)"
    > )
    > STORED AS TEXTFILE;
OK
Time taken: 0.401 seconds
hive> load data local inpath 'test.txt' overwrite into table test;
Copying data from file:/home/user/test.txt
Copying file: file:/home/user/test.txt
Loading data to table dev.test
Deleted hdfs://host/user/hive/warehouse/dev.db/test
OK
Time taken: 0.282 seconds
hive> select * from test;
OK
NULL    NULL    NULL

When I query this table, I don't get what I expected. I expected the output to be the 3 strings, like this:
aaa    "bbb"    "cc,c"
Why does the query return 3 NULLs instead?
Thanks for your help.