You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by Rahul Channe <dr...@googlemail.com> on 2014/08/25 19:24:32 UTC

Column value incorrect in hive xml table

Hi All,

I am using hivexmlserde-1.0.5.1.jar to load xml data into hive table.
Following is the table DDL

create external table h_xml (member_id string,personal_identity
map<string,string>,personal_contact map<string,string>,Enrollment
map<string,string> )
row format serde 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
with SERDEPROPERTIES (
"column.xpath.member_id"="/PayLoad/Members/Member/MemberID/text()",
"column.xpath.personal_identity"="/PayLoad/Members/Member/Demographics/PersonIdentity",
"column.xpath.personal_contact"="/PayLoad/Members/Member/Demographics/PersonContactInformation",
"column.xpath.Enrollment"="/PayLoad/Members/Member/Enrollment"
) STORED AS
INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
TBLPROPERTIES
(
"xmlinput.start"="<PayLoad",
"xmlinput.end"="</PayLoad>"
)



the sample xml is attached in the email.

Problem:- When I am executing the below query
select * from h_xml limit 1 ;

the result shows me member_id of multiple nodes in the same row

hive> select * from h_xml  ;
OK
<string>02573767-0502573768-01</string> {"PersonIdentity":"<string><SSN>

in fact they are two different memberId and should show up in 2 rows.

Re: Column value incorrect in hive xml table

Posted by Rahul Channe <dr...@googlemail.com>.
Hi Dmitry,

Thank you, it worked.
Any reason why we have to specify the lowest possible node to make it work.

Regards,
Rahul

On Monday, August 25, 2014, Dmitry Vasilenko <dv...@gmail.com> wrote:

> I would try something like this:
>
> CREATE TABLE h_xml(member_id STRING , personal_identity
> map<string,string>,enrollment map<string,string>)
>      ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
> WITH SERDEPROPERTIES (
> "column.xpath.member_id"="/Member/MemberID/text()",
> "column.xpath.personal_identity"="/Member/Demographics/PersonIdentity/*",
> "column.xpath.enrollment"="/Member/Enrollment/*"
> )
> STORED AS
>      INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
>      OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
>      TBLPROPERTIES (
>      "xmlinput.start"="<Member>",
>      "xmlinput.end"="</Member>"
>      );
>
> load data local inpath '/home/dvasilen/Downloads/few_sample_h.xml'
> OVERWRITE into table h_xml;
>
> select * from h_xml;
>
> Here is the output:
>
>     > select * from h_xml;
> OK
> 02573767-05
>  {"LastName":"ZZZZ","Gender":"F","DateOfBirth":"9999-01-01","Firstname":"YYYY","SSN":"XXXXXX"}
>  {"ResponsiblePartyRelationshipCode":"DEPENDANT","IsPrimary":"true","GroupID":"9898989","GroupName":"PPPPPPPP"}
> 02573768-01
>  {"LastName":"PPPPP","Gender":"F","DateOfBirth":"1999-01-01","Firstname":"TTTTT","SSN":"XXXXXXXX"}
>
>  {"ResponsiblePartyRelationshipCode":"SELF","IsPrimary":"true","GroupID":"11111","GroupName":"PPPPPPP"}
> Time taken: 0.067 seconds, Fetched: 2 row(s)
> hive>
>
>
>
>
>
> On 8/25/14, Rahul Channe <drahulc@googlemail.com <javascript:;>> wrote:
> > Hi All,
> >
> > I am using hivexmlserde-1.0.5.1.jar to load xml data into hive table.
> > Following is the table DDL
> >
> > create external table h_xml (member_id string,personal_identity
> > map<string,string>,personal_contact map<string,string>,Enrollment
> > map<string,string> )
> > row format serde 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
> > with SERDEPROPERTIES (
> > "column.xpath.member_id"="/PayLoad/Members/Member/MemberID/text()",
> >
> "column.xpath.personal_identity"="/PayLoad/Members/Member/Demographics/PersonIdentity",
> >
> "column.xpath.personal_contact"="/PayLoad/Members/Member/Demographics/PersonContactInformation",
> > "column.xpath.Enrollment"="/PayLoad/Members/Member/Enrollment"
> > ) STORED AS
> > INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
> > OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
> > TBLPROPERTIES
> > (
> > "xmlinput.start"="<PayLoad",
> > "xmlinput.end"="</PayLoad>"
> > )
> >
> >
> >
> > the sample xml is attached in the email.
> >
> > Problem:- When I am executing the below query
> > select * from h_xml limit 1 ;
> >
> > the result shows me member_id of multiple nodes in the same row
> >
> > hive> select * from h_xml  ;
> > OK
> > <string>02573767-0502573768-01</string> {"PersonIdentity":"<string><SSN>
> >
> > in fact they are two different memberId and should show up in 2 rows.
> >
>

Re: Column value incorrect in hive xml table

Posted by Dmitry Vasilenko <dv...@gmail.com>.
I would try something like this:

CREATE TABLE h_xml(member_id STRING , personal_identity
map<string,string>,enrollment map<string,string>)
     ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
"column.xpath.member_id"="/Member/MemberID/text()",
"column.xpath.personal_identity"="/Member/Demographics/PersonIdentity/*",
"column.xpath.enrollment"="/Member/Enrollment/*"
)
STORED AS
     INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
     OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
     TBLPROPERTIES (
     "xmlinput.start"="<Member>",
     "xmlinput.end"="</Member>"
     );

load data local inpath '/home/dvasilen/Downloads/few_sample_h.xml'
OVERWRITE into table h_xml;

select * from h_xml;

Here is the output:

    > select * from h_xml;
OK
02573767-05	{"LastName":"ZZZZ","Gender":"F","DateOfBirth":"9999-01-01","Firstname":"YYYY","SSN":"XXXXXX"}	{"ResponsiblePartyRelationshipCode":"DEPENDANT","IsPrimary":"true","GroupID":"9898989","GroupName":"PPPPPPPP"}
02573768-01	{"LastName":"PPPPP","Gender":"F","DateOfBirth":"1999-01-01","Firstname":"TTTTT","SSN":"XXXXXXXX"}	{"ResponsiblePartyRelationshipCode":"SELF","IsPrimary":"true","GroupID":"11111","GroupName":"PPPPPPP"}
Time taken: 0.067 seconds, Fetched: 2 row(s)
hive>





On 8/25/14, Rahul Channe <dr...@googlemail.com> wrote:
> Hi All,
>
> I am using hivexmlserde-1.0.5.1.jar to load xml data into hive table.
> Following is the table DDL
>
> create external table h_xml (member_id string,personal_identity
> map<string,string>,personal_contact map<string,string>,Enrollment
> map<string,string> )
> row format serde 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
> with SERDEPROPERTIES (
> "column.xpath.member_id"="/PayLoad/Members/Member/MemberID/text()",
> "column.xpath.personal_identity"="/PayLoad/Members/Member/Demographics/PersonIdentity",
> "column.xpath.personal_contact"="/PayLoad/Members/Member/Demographics/PersonContactInformation",
> "column.xpath.Enrollment"="/PayLoad/Members/Member/Enrollment"
> ) STORED AS
> INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
> OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
> TBLPROPERTIES
> (
> "xmlinput.start"="<PayLoad",
> "xmlinput.end"="</PayLoad>"
> )
>
>
>
> the sample xml is attached in the email.
>
> Problem:- When I am executing the below query
> select * from h_xml limit 1 ;
>
> the result shows me member_id of multiple nodes in the same row
>
> hive> select * from h_xml  ;
> OK
> <string>02573767-0502573768-01</string> {"PersonIdentity":"<string><SSN>
>
> in fact they are two different memberId and should show up in 2 rows.
>