Posted to dev@pig.apache.org by Debashish Dhar <de...@yahoo.com> on 2014/03/28 22:43:07 UTC

Pig Issue - seeking clarification

Hello,
I have a Pig script that extracts sequence files using the SequenceFileLoader() function. I can extract the XML, but when I try to parse it using the ElementTree.py or minidom.py scripts I get an error stating 'an internal error occurred inside the function while returning'. My question is: can the output of SequenceFileLoader be parsed by feeding it directly to a UDF, or does the string need to be formatted before it is passed as an argument? One option is to store the output to HDFS as an .xml file and then parse it with the XMLLoader function in Pig, but I want to do it on the fly, bypassing the store step.

register /use/lib/pig/piggybank.jar;
register /use/lib64/python2.6/XML/etree/ElementTree.py using jython as myudf;
Define SequenceFileLoader org.apache.pig.piggybank.storage.SequenceFileLoader();
a = LOAD '/data/appl/20142803/hq.seq' using SequenceFileLoader('/u001') as (key:chararray, value:chararray);
b = Filter a by key == 'crt.xml';
c = Foreach b Generate myudf.fromstring(value);
dump c;
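
One possible cause, offered as an assumption rather than a confirmed diagnosis: ElementTree's fromstring() returns an Element object, and a Jython UDF is expected to return a value Pig can map to one of its own types (chararray, tuple, bag and so on). A minimal sketch of a small wrapper UDF that parses the string and returns only a chararray is shown here. The function name root_tag, the schema string, and the fallback decorator are all hypothetical, and it assumes the Jython bundled with Pig can import xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET

# Pig makes outputSchema available when it loads a Jython UDF file.
# This fallback (an assumption, only for running the file outside Pig)
# replaces it with a decorator that does nothing.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def wrap(func):
            return func
        return wrap


@outputSchema("root_tag:chararray")
def root_tag(value):
    """Return the tag name of the XML root element.

    Returns None when the input is null or cannot be parsed, so a bad
    record shows up as a null field rather than an exception in the UDF.
    """
    if value is None:
        return None
    try:
        # Parse the XML string and hand back a plain string, not the
        # Element object, so Pig can treat the result as a chararray.
        return ET.fromstring(value).tag
    except Exception:
        return None
```

If this were saved under a hypothetical name such as xml_udf.py, it could be registered with "register 'xml_udf.py' using jython as myudf;" and called as "myudf.root_tag(value)" in place of myudf.fromstring(value). The same pattern should extend to returning element text or attributes, as long as the returned value is a string, number, tuple, or list that matches the declared schema.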

Please let me know whether the parsing can be done on the fly as above.

Thank you in advance for your help in this regard.

Thanks,
Debashish Dhar
