Posted to user@pig.apache.org by Russell Jurney <ru...@gmail.com> on 2012/12/24 08:24:02 UTC

XML -> Pig UDF

I want to extend the existing XMLLoader to go beyond capturing the text
inside a tag and to actually create a Pig mapping of the Document Object
Model the XML represents. This would be similar to elephant-bird's
JsonLoader.

For instance, check this example: https://gist.github.com/4368194
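
The gist is not reproduced here. As a rough illustration of the kind of mapping described above, here is a minimal Python sketch that walks an XML tree and builds nested dicts and lists, standing in for Pig maps and bags. The function name `element_to_record`, the `@` prefix for attributes, and the `#text` key are assumptions made for this example; they are not taken from XMLLoader or elephant-bird.

```python
import xml.etree.ElementTree as ET


def element_to_record(elem):
    """Convert an Element into nested dicts/lists (stand-ins for Pig maps/bags)."""
    record = {}
    # Attributes become map entries; the '@' prefix is an arbitrary convention.
    for name, value in elem.attrib.items():
        record["@" + name] = value
    # Children sharing a tag are collected into a list, much like a Pig bag.
    for child in elem:
        record.setdefault(child.tag, []).append(element_to_record(child))
    # Non-empty text content is kept under a hypothetical '#text' key.
    text = (elem.text or "").strip()
    if text:
        record["#text"] = text
    return record


doc = ET.fromstring(
    '<book id="1"><title>Pig</title><author>A</author><author>B</author></book>'
)
record = element_to_record(doc)
```

Grouping every child tag into a list keeps the shape predictable when an element appears once in one record and several times in another, which is one of the ways semi-structured data tends to vary.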

Semi-structured data can vary, so this behavior can be risky, but I want
people to be able to load JSON and XML data easily in their first session
with Pig.

Russell Jurney http://datasyndrome.com

Re: XML -> Pig UDF

Posted by Vitalii Tymchyshyn <ti...@gmail.com>.
Nope, sorry, I wish I could open source this. I did some patches to the
loader (e.g. it did not like empty tags) - those are submitted as pull
requests.

Some more hints:
1) I found a Pig-style concat function very useful - mine could take
any input, skip nulls, and flatten bags and tuples.

2) I had to introduce a custom type. Pig does not like top-level custom
types, but works OK with tuples of custom types.
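
The original concat function was not published, so the following is only a sketch of the behaviour described in hint 1, written in Python rather than as a Pig UDF. Lists and tuples stand in for Pig bags and tuples, and the name `concat` and the `sep` parameter are assumptions for this example.

```python
def concat(*values, sep=""):
    """Join any inputs into one string, skipping None and flattening nesting."""
    parts = []

    def walk(value):
        if value is None:
            return  # nulls are skipped instead of nulling the whole result
        if isinstance(value, (list, tuple)):
            for item in value:  # bags/tuples are flattened recursively
                walk(item)
        else:
            parts.append(str(value))

    for value in values:
        walk(value)
    return sep.join(parts)


result = concat("a", None, ("b", None, ["c", 1]), [("d",)])
```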
On 24 Dec 2012 at 10:13, "Russell Jurney" <ru...@gmail.com> wrote:

> Thanks - any chance of contributing some of that code? :)
>
> I have thought of a similar approach: starting with an XMLToPig
> EvalFunc that takes the output of the existing XMLLoader and converts
> it to tuple/bag/map form. It is easier to baby-step it that way: extending
> the loader is then just a matter of plugging that code into the XML slice
> trimmed by XMLLoader, which is much easier once the EvalFunc works.
>
> Russell Jurney http://datasyndrome.com

Re: XML -> Pig UDF

Posted by Russell Jurney <ru...@gmail.com>.
Thanks - any chance of contributing some of that code? :)

I have thought of a similar approach: starting with an XMLToPig
EvalFunc that takes the output of the existing XMLLoader and converts
it to tuple/bag/map form. It is easier to baby-step it that way: extending
the loader is then just a matter of plugging that code into the XML slice
trimmed by XMLLoader, which is much easier once the EvalFunc works.
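
To make the two-step idea concrete, here is a small Python sketch, not Pig code. `load_slices` is a hypothetical stand-in for the loader step that hands back each matching element as a raw XML string, and `xml_to_map` is a hypothetical stand-in for the eval step that converts one slice into a map. Both names and the flat child-tag-to-text mapping are assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def load_slices(document, tag):
    """Loader stand-in: yield each <tag> element as a raw XML string."""
    root = ET.fromstring(document)
    for elem in root.iter(tag):
        yield ET.tostring(elem, encoding="unicode")


def xml_to_map(xml_slice):
    """Eval stand-in: parse one slice into a flat map of child tag -> text."""
    elem = ET.fromstring(xml_slice)
    return {child.tag: (child.text or "").strip() for child in elem}


document = (
    "<people>"
    "<person><name>Ann</name><city>Oslo</city></person>"
    "<person><name>Bob</name><city>Rome</city></person>"
    "</people>"
)
rows = [xml_to_map(s) for s in load_slices(document, "person")]
```

Because the conversion only ever sees one string slice at a time, it can be developed and tested on its own before being moved into the loader.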

Russell Jurney http://datasyndrome.com

On Dec 24, 2012, at 12:10 AM, Vitalii Tymchyshyn <ti...@gmail.com> wrote:

> I did something like this in a previous project, but I parsed on demand.
> What I mean is that I created a set of XML-processing functions, each of
> which can take a string or a DOM as input, plus an explicit parse function.
> I did this because I was usually doing concatenation/grouping on parsed
> input files, and processing was done only after that. Alternatively,
> processing can be done in another MR step, and serializing a string is much
> easier than serializing a DOM.

Re: XML -> Pig UDF

Posted by Vitalii Tymchyshyn <ti...@gmail.com>.
I did something like this in a previous project, but I parsed on demand.
What I mean is that I created a set of XML-processing functions, each of
which can take a string or a DOM as input, plus an explicit parse function.
I did this because I was usually doing concatenation/grouping on parsed
input files, and processing was done only after that. Alternatively,
processing can be done in another MR step, and serializing a string is much
easier than serializing a DOM.
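
A minimal Python sketch of the parse-on-demand pattern described above, assuming hypothetical helper names (`parse`, `as_element`, `find_text`): each processing function accepts either a raw XML string or an already-parsed element, and there is a separate explicit parse step for callers who want to parse once and reuse the result.

```python
import xml.etree.ElementTree as ET


def parse(value):
    """Explicit parse step: turn an XML string into an Element."""
    return ET.fromstring(value)


def as_element(value):
    """Accept either an XML string or an already-parsed Element."""
    return parse(value) if isinstance(value, str) else value


def find_text(value, path):
    """Return the text of the first match for path, or None if absent."""
    elem = as_element(value).find(path)
    return elem.text if elem is not None else None


raw = "<order><id>7</id><item>tea</item></order>"
parsed = parse(raw)                       # parse once, reuse many times
from_string = find_text(raw, "id")        # string input is parsed on demand
from_parsed = find_text(parsed, "item")   # parsed input is used as-is
```

Keeping the data as a string until a function actually needs the tree means intermediate steps only have to move plain strings around.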
On 24 Dec 2012 at 09:24, "Russell Jurney" <ru...@gmail.com> wrote:

> I want to extend the existing XMLLoader to go beyond capturing the text
> inside a tag and to actually create a Pig mapping of the Document Object
> Model the XML represents. This would be similar to elephant-bird's
> JsonLoader.
>
> For instance, check this example: https://gist.github.com/4368194
>
> Semi-structured data can vary, so this behavior can be risky, but I want
> people to be able to load JSON and XML data easily in their first session
> with Pig.
>
> Russell Jurney http://datasyndrome.com
>