Posted to user@pig.apache.org by krishnan N <kr...@gmail.com> on 2012/04/20 02:27:57 UTC
Parsing XML using PIG
Hi All,
I am trying to parse XML using Pig. Below is the code, which uses the
XMLLoader class. I am trying to convert XML to a text file, with the attribute
names as columns and the corresponding values as the column values.
register /usr/lib/pig/contrib/piggybank/java/piggybank.jar;
xml_file = LOAD '/home/test2.xml' using
org.apache.pig.piggybank.storage.XMLLoader('field') as (doc:chararray);
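-- project the declared column (doc); 'field' is only the tag name passed to XMLLoader
loof_file = foreach xml_file generate doc;
-- STORE is a statement and cannot be assigned to an alias
store loof_file into '/home/xml2_to_text.dat';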
XMLLoader picks out only the ‘tag’ supplied as the input parameter and
returns the result below for that tag alone. Is there any way to get the
attribute values?
<field id="productId">
<value>12354678</value>
</field>
<field id="AckLevel">
<value>LEVEL2</value>
</field>
<field id="AckDate">
<value>2012-02-29T16:21:54</value>
</field>
<field id="Success">
<value>true</value>
</field>
Required Output :
Product_Id | AckLevel | AckDate             | Success
12354678   | LEVEL2   | 2012-02-29T16:21:54 | true
Thanks
Krishnan
RE: Parsing XML using PIG
Posted by wi...@thomsonreuters.com.
I just use XMLLoader to break the input XML into records, then stream those records through an XML parser to pull out what I need into the fields of a relation for subsequent Pig processing. For example:
-- The analyze_src_recs.py script reads input xml from stdin, and writes to
-- stdout for each relevant part of each input record:
-- citeddocid,citingdocid,collection,seqno,year,[....]
--
define analyze_src `analyze_src_recs.py`
input (stdin)
output (stdout USING PigStreaming(','))
ship ('$scriptDir/analyze_src_recs.py');
SrcLines = load '$src_xml/*.xml*'
using org.apache.pig.piggybank.storage.XMLLoader('REC')
as (doc:chararray);
ParseOut = stream SrcLines through analyze_src
as (rec_type : int,
citeddocid : int,
citingdocid: int,
col : chararray,
seq : chararray,
[....]
);
-- rec_type determines which of two kinds of records the UDF streaming
-- function analyze_src_recs.py has generated
split ParseOut into
ParseOutCitation if rec_type == 0,
ParseOutSrc if rec_type == 1;
[...]
HTH,
Will
William F Dowling
Senior Technologist
Thomson Reuters
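For the XML in the original question, the parser side of this streaming approach might look roughly like the Python sketch below. It is not the analyze_src_recs.py script mentioned above, whose contents are not shown in the thread. It assumes that the <field> elements of one record sit inside an enclosing element, here called <record> (a hypothetical name, since the post does not show the parent tag), and that each closing </record> ends a line. The function names and the COLUMNS list are also illustrative; the column order is taken from the required output in the question.

```python
import xml.etree.ElementTree as ET

# Assumption: "record" is a hypothetical enclosing tag; the original post
# does not show which element wraps the <field> elements.
RECORD_TAG = "record"
# Column order taken from the "Required Output" in the question.
COLUMNS = ["productId", "AckLevel", "AckDate", "Success"]


def record_to_row(xml_text, columns=COLUMNS, sep="|"):
    """Parse one record and return its <value> texts, ordered by field id."""
    root = ET.fromstring(xml_text)
    values = {}
    for field in root.iter("field"):
        # The id attribute names the column; the nested <value> holds the data.
        values[field.get("id")] = (field.findtext("value") or "").strip()
    # Missing fields become empty columns instead of raising an error.
    return sep.join(values.get(name, "") for name in columns)


def iter_records(lines, tag=RECORD_TAG):
    """Yield complete records, even when one record spans several lines."""
    closing = "</%s>" % tag
    buffer = []
    for line in lines:
        if not buffer and not line.strip():
            continue  # skip blank lines between records
        buffer.append(line)
        if closing in line:
            yield "".join(buffer)
            buffer = []


def rows_from_lines(lines):
    """Turn an iterable of text lines into delimited output rows."""
    for chunk in iter_records(lines):
        yield record_to_row(chunk)


# Small demonstration using the sample data from the question.
sample_lines = [
    "<record>\n",
    '<field id="productId">\n', "<value>12354678</value>\n", "</field>\n",
    '<field id="AckLevel">\n', "<value>LEVEL2</value>\n", "</field>\n",
    '<field id="AckDate">\n', "<value>2012-02-29T16:21:54</value>\n", "</field>\n",
    '<field id="Success">\n', "<value>true</value>\n", "</field>\n",
    "</record>\n",
]
for row in rows_from_lines(sample_lines):
    print(row)  # 12354678|LEVEL2|2012-02-29T16:21:54|true
```

In a script shipped to the cluster with the define ... ship(...) pattern shown above, rows_from_lines would be fed sys.stdin and each row written to stdout, and XMLLoader would be given the enclosing tag name rather than 'field', so that all four fields of a record arrive together. The delimiter in the output clause of the define statement would need to match the separator the script writes.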