Posted to dev@hive.apache.org by Sumit Kumar <ks...@gmail.com> on 2012/06/24 19:09:13 UTC

XML Serde

Hi,

I looked for a generic approach to handling XML files in Hive but found
none, so I thought I could reuse the concepts from json-serde (
http://code.google.com/p/hive-json-serde/) to create a generic XML SerDe.
XPath came to mind immediately, and it should work for an XML SerDe in the
same way that JSON works for json-serde. The problem is the use case where
a single XML file contains multiple rows of interest, as in the example
below.

<root>
 <book> ... </book>
 <book> ... </book>
 <book> ... </book>
</root>

In this case, the SerDe is supposed to generate three rows, one for each
book node. I looked at the json-serde implementation, but there the
deserialize step returns an ArrayList instance with the column values set
at its indices, and this one instance maps to one row. I do see that the
deserialize step can return any Java Object, but I'm not sure what the
appropriate way would be to return multiple rows, one per book node. I'm
going to give it a shot anyway, but I thought I'd ask the community in case
somebody has already tried this or has a better approach. I would really
appreciate any input. If I succeed, I will share my code; if not, I will
come back anyway :-)
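
To make the one-row-per-node idea concrete, here is a minimal sketch that uses only the JDK's javax.xml.xpath API and does not touch the Hive SerDe interface. The class name XmlRowSketch, the method toRows, and the title/price child elements are made up for illustration (the example above elides the contents of each book). It evaluates one XPath to select the book nodes, then evaluates the column XPaths relative to each node, so every book yields one list of column values in the same shape json-serde returns for a single row.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XmlRowSketch {

    // Hypothetical helper: returns one row per node matched by rowPath.
    // Each row holds the string value of every column XPath, evaluated
    // relative to that node, at the index of the corresponding column.
    public static List<List<Object>> toRows(String xml, String rowPath, String[] columnPaths)
            throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(
                rowPath, new InputSource(new StringReader(xml)), XPathConstants.NODESET);

        List<List<Object>> rows = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            Node node = nodes.item(i);
            List<Object> row = new ArrayList<>(columnPaths.length);
            for (String columnPath : columnPaths) {
                // A missing child element evaluates to the empty string.
                row.add(xpath.evaluate(columnPath, node));
            }
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) throws Exception {
        // Assumed sample data; the element names inside <book> are invented.
        String xml = "<root>"
                + "<book><title>A</title><price>10</price></book>"
                + "<book><title>B</title><price>20</price></book>"
                + "<book><title>C</title><price>30</price></book>"
                + "</root>";
        for (List<Object> row : toRows(xml, "/root/book", new String[] {"title", "price"})) {
            System.out.println(row);
        }
    }
}
```

Because one deserialize call maps to one row, logic like this would probably have to sit upstream of the SerDe, where the input is split into records, rather than inside deserialize itself. That placement is an assumption and is not confirmed anywhere in this thread.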

Thanks in advance.
-Sumit

Re: XML Serde

Posted by Sumit Kumar <ks...@gmail.com>.
I found an earlier discussion of this topic:
http://mail-archives.apache.org/mod_mbox/hive-user/201006.mbox/%3CAANLkTikYL3HinOwFO36YEYId9VOJyH_6pe3slORHyKWI@mail.gmail.com%3E
It makes more sense now. I will post my final resolution.

On Sun, Jun 24, 2012 at 10:39 PM, Sumit Kumar <ks...@gmail.com> wrote:

> Hi,
>
> So i looked for a generic approach for handling xml files in hive but
> found none and thought i could use the concepts from json-serde (
> http://code.google.com/p/hive-json-serde/) in creating a generic xml
> serde. XPath was something that came immediately in my mind and should work
> in the same way that json works for json-serde. The problem is with the use
> case that one xml file could contain multiple rows of interest in a single
> xml file. Example shown below.
>
> <root>
>  <book> ... </book>
>  <book> ... </book>
>  <book> ... </book>
> </root>
>
> In this case, serde is supposed to generate three rows for each book node.
> I looked at json-serde implementation but there the deserialize step
> returns an ArrayList instance with column values set in indices of the
> ArrayList; and this one instance maps to one row. I do see that deserialize
> step can return any java Object but not sure what would be the appropriate
> way to return multiple rows corresponding to each book node. I'm going to
> give it a shot anyway but thought to seek help from the community if
> somebody has already tried this or has a better approach. Would really
> appreciate any input, if i succeed, i will share my code; if not, i will
> anyway come back :-)
>
> Thanks in advance.
> -Sumit
>
