Posted to user@spark.apache.org by Sreekanth Jella <sr...@gmail.com> on 2016/08/13 00:33:48 UTC
Flattening XML in a DataFrame
Hi Folks,
I am trying to flatten a variety of XMLs using DataFrames. I'm using the spark-xml
package, which automatically infers my schema and creates a DataFrame.
I do not want to hard-code any column names in the DataFrame, as I have many
varieties of XML documents, and each may have much deeper nesting of child nodes.
I simply want to flatten any type of XML and then write the output data to a
Hive table. Can you please give some expert advice on this?
Example XML and expected output is given below.
Sample XML:
<emplist>
<emp>
<manager>
<id>1</id>
<name>foo</name>
<subordinates>
<clerk>
<cid>1</cid>
<cname>foo</cname>
</clerk>
<clerk>
<cid>1</cid>
<cname>foo</cname>
</clerk>
</subordinates>
</manager>
</emp>
</emplist>
Expected output:
id, name, clerk.cid, clerk.cname
1, foo, 2, cname2
1, foo, 3, cname3
Thanks,
Sreekanth Jella
RE: Flattening XML in a DataFrame
Posted by sr...@gmail.com.
Hi Hyukjin,
I have created the issue below.
https://github.com/databricks/spark-xml/issues/155
Re: Flattening XML in a DataFrame
Posted by Hyukjin Kwon <gu...@gmail.com>.
Sorry for the late reply.
Currently, the library only supports loading XML documents as they are.
Do you mind opening an issue with some more explanation here:
https://github.com/databricks/spark-xml/issues?
RE: Flattening XML in a DataFrame
Posted by Sreekanth Jella <sr...@gmail.com>.
Hi Experts,
Please suggest. Thanks in advance.
Thanks,
Sreekanth
Re: Flattening XML in a DataFrame
Posted by Sreekanth Jella <sr...@gmail.com>.
Hi Hyukjin Kwon,
Thank you for the reply.
There are several types of XML documents with different schemas that need to be parsed, and the tag names are not known in advance. All we know is the XSD for each given XML.
Is it possible to get the same results even when we do not know the XML tags, like manager.id or manager.name? Or is it possible to read the tag names from the XSD and use them?
Thanks,
Sreekanth
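The fully generic flattening asked about here can be prototyped independently of Spark. Below is a minimal stdlib-only Python sketch of the idea: walk the nested structure recursively, prefix each column name with its path, and multiply rows for repeated (array-like) elements the way explode() does. The function name and dict-per-row representation are illustrative, not part of spark-xml, and a single-element array would be mistaken for a scalar in this sketch (which is exactly where an XSD would help):

```python
import xml.etree.ElementTree as ET

SAMPLE = """<emplist><emp><manager>
<id>1</id><name>foo</name>
<subordinates>
<clerk><cid>1</cid><cname>foo</cname></clerk>
<clerk><cid>1</cid><cname>foo</cname></clerk>
</subordinates>
</manager></emp></emplist>"""

def flatten(elem, prefix=""):
    # One dict per output row; starts as a single empty row.
    rows = [{}]
    # Group children by tag so repeated tags are detected as arrays.
    by_tag = {}
    for child in elem:
        by_tag.setdefault(child.tag, []).append(child)
    for tag, children in by_tag.items():
        key = prefix + tag
        if len(children) == 1 and len(children[0]) == 0:
            # Scalar leaf: the same value lands in every row built so far.
            for row in rows:
                row[key] = children[0].text
        else:
            # Nested struct or repeated element: recurse, then cross-join
            # the child rows with the rows built so far (this multiplication
            # is what explode() does for arrays in Spark).
            child_rows = []
            for c in children:
                child_rows.extend(flatten(c, prefix=key + "."))
            rows = [{**r, **cr} for r in rows for cr in child_rows]
    return rows

rows = flatten(ET.fromstring(SAMPLE))
# Two rows, one per <clerk>, with path-prefixed column names such as
# "emp.manager.subordinates.clerk.cid".
```

In a real Spark job the same walk would be driven by the inferred df.schema rather than by the raw elements, but the row-multiplication logic is the same.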
Re: Flattening XML in a DataFrame
Posted by Hyukjin Kwon <gu...@gmail.com>.
Hi Sreekanth,
Assuming you are using Spark 1.x,
I believe the code below:
sqlContext.read.format("com.databricks.spark.xml")
    .option("rowTag", "emp")
    .load("/tmp/sample.xml")
    .selectExpr("manager.id", "manager.name",
                "explode(manager.subordinates.clerk) as clerk")
    .selectExpr("id", "name", "clerk.cid", "clerk.cname")
    .show()
would print the results below as you want:
+---+----+---+-----+
| id|name|cid|cname|
+---+----+---+-----+
| 1| foo| 1| foo|
| 1| foo| 1| foo|
+---+----+---+-----+
I hope this is helpful.
Thanks!
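The two selectExpr calls above hard-code the column paths; generating them from the inferred schema is mechanical, which is how the hard-coding can be avoided. A stdlib-only sketch of the path generation, with the schema modeled as nested dicts (a dict is a struct, a one-element list is an array to explode, None is a leaf; the helper name is hypothetical):

```python
def select_paths(schema, prefix=""):
    """Yield (path, is_array) for each column a flattening pass should select.

    schema: dict = struct, [inner] = array (stop and explode), None = leaf."""
    for name, typ in schema.items():
        path = name if not prefix else prefix + "." + name
        if typ is None:
            yield path, False
        elif isinstance(typ, list):
            # Array: select explode(path) now, recurse into the result later.
            yield path, True
        else:
            yield from select_paths(typ, path)

# The structure spark-xml would infer for rowTag "emp" in the sample document.
schema = {"manager": {"id": None, "name": None,
                      "subordinates": {"clerk": [{"cid": None, "cname": None}]}}}

exprs = [f"explode({p}) as {p.rsplit('.', 1)[-1]}" if is_arr else p
         for p, is_arr in select_paths(schema)]
# exprs == ["manager.id", "manager.name",
#           "explode(manager.subordinates.clerk) as clerk"]
```

In a real job the same walk would run over df.schema (StructType / ArrayType / atomic types) and be repeated once per explode until no arrays remain, reproducing the two-step selectExpr chain shown above for arbitrarily nested documents.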