Posted to user@spark.apache.org by "Talap, Amol" <am...@capgemini.com> on 2017/06/29 04:30:13 UTC

SparkSQL to read XML Blob data to create multiple rows

Hi:

We are trying to parse XML data to get the output below from the given input sample.
Can someone suggest a way to pass one DataFrame's output into the load() function, or any other alternative, to get this output?

Input Data from Oracle Table XMLBlob:

SequenceID | Name    | City      | XMLComment
1          | Amol    | Kolhapur  | <books><Comments><Comment><Title>Title1.1</Title><Description>Description_1.1</Description></Comment><Comment><Title>Title1.2</Title><Description>Description_1.2</Description></Comment><Comment><Title>Title1.3</Title><Description>Description_1.3</Description></Comment></Comments></books>
2          | Suresh  | Mumbai    | <books><Comments><Comment><Title>Title2</Title><Description>Description_2</Description></Comment></Comments></books>
3          | Vishal  | Delhi     | <books><Comments><Comment><Title>Title3</Title><Description>Description_3</Description></Comment></Comments></books>
4          | Swastik | Bangalore | <books><Comments><Comment><Title>Title4</Title><Description>Description_4</Description></Comment></Comments></books>


Output Data Expected using Spark SQL:

SequenceID | Name    | City      | Title    | Description
1          | Amol    | Kolhapur  | Title1.1 | Description_1.1
1          | Amol    | Kolhapur  | Title1.2 | Description_1.2
1          | Amol    | Kolhapur  | Title1.3 | Description_1.3
2          | Suresh  | Mumbai    | Title2   | Description_2
3          | Vishal  | Delhi     | Title3   | Description_3
4          | Swastik | Bangalore | Title4   | Description_4


I am able to parse a single XML document in spark-shell using the approach below, taken from the example at
https://community.hortonworks.com/questions/71538/parsing-xml-in-spark-rdd.html, but how do we apply the same parsing to every row?

val dfX = sqlContext.read.format("com.databricks.spark.xml").option("rowTag", "book").load("books.xml")

dfX.registerTempTable("books")

dfX.printSchema()

val books_inexp = sqlContext.sql("select title, author from books where price < 10")

books_inexp.show


Regards,
Amol
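
One possible way to apply the same parsing to every row is a UDF built on scala.xml that turns each XMLComment string into (Title, Description) pairs, which can then be exploded. The sketch below is illustrative only: it assumes a DataFrame named oracleDF holding the four input columns above, and a spark-shell session where spark.implicits._ is in scope.

import org.apache.spark.sql.functions.{explode, udf}
import scala.xml.XML

// Parse one XML blob into (Title, Description) pairs; runs on the executors.
val parseComments = udf { (xml: String) =>
  val root = XML.loadString(xml)
  (root \\ "Comment").map(c => ((c \ "Title").text, (c \ "Description").text))
}

// One output row per <Comment>, keeping the key columns alongside.
val flattened = oracleDF
  .withColumn("comment", explode(parseComments($"XMLComment")))
  .select($"SequenceID", $"Name", $"City",
    $"comment._1".as("Title"), $"comment._2".as("Description"))
flattened.show(false)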

Re: SparkSQL to read XML Blob data to create multiple rows

Posted by Amol Talap <am...@gmail.com>.
Hi Zhang & All,

Thanks so much for your earlier response. I am trying to get to the final
solution. We could parse the data successfully; however, I am getting a
NullPointerException while mapping it back. Can you please comment on the
findings below?

spark-shell --packages com.databricks:spark-xml_2.11:0.4.1
//----------------------------------
scala> spark.version
res19: String = 2.0.0

scala> def xml2DF(xml: String) = {
     |   val xRDD = sc.parallelize(Seq(xml))
     |   val df = new XmlReader().xmlRdd(spark.sqlContext, xRDD)
     |   val new_df = df.withColumn("comment", explode(df("Comments.Comment")))
     |                  .select($"comment.Description", $"comment.Title")
     |   new_df.collectAsList
     | }
xml2DF: (xml: String)java.util.List[org.apache.spark.sql.Row]
//-----------------------------------
scala> val xml3="<books><Comments><Comment><Title>Title3.1</Title><Description>Descript3.1</Description></Comment><Comment><Title>Title3333.1</Title><Description>Descript3333.1</Description></Comment></Comments></books>"

scala> xml2DF(xml3)
res18: java.util.List[org.apache.spark.sql.Row] =
[[Descript3.1,Title3.1], [Descript3333.1,Title3333.1]]
//xml2DF works as expected
//-----------------------------------
val rdd = sc.textFile("file:///home/spark/XML_Project/data.txt")
val xml_pRDDs = rdd.map(x=>(x.split(',')(0).toInt, x.split(',')(3)))
scala> xml_pRDDs.map(x=>(x._1,"call xml2DF "+x._2)).collect
res17: Array[(Int, String)] = Array((1,call xml2DF
<books><Comments><Comment><Title>Title1.1</Title><Description>Descript1.1</D..
//-----------------------------------
xml_pRDDs.map(x => (x._1, xml2DF(x._2))).collect
// This gives a NullPointerException; I was expecting a result like:
// Array[(Int, java.util.List[org.apache.spark.sql.Row])] =
//   Array((1,[[Descript3.1,Title3.1], [Descript3333.1,Title3333.1]]), ...
====================
Below is the content of the data.txt file.
1,Amol,Kolhapur,<books><Comments><Comment><Title>Title1.1</Title><Description>Descript1.1</Description></Comment><Comment><Title>Title1.2</Title><Description>Descript1.2</Description></Comment><Comment><Title>Title1.3</Title><Description>Descript1.3</Description></Comment></Comments></books>
2,Ameet,Bangalore,<books><Comments><Comment><Title>Title2.1</Title><Description>Descript2.1</Description></Comment><Comment><Title>Title2.2</Title><Description>Descript2.2</Description></Comment></Comments></books>
3,Rajesh,Jaipur,<books><Comments><Comment><Title>Title3.1</Title><Description>Descript3.1</Description></Comment><Comment><Title>Title3.2</Title><Description>Descript3.2</Description></Comment><Comment><Title>Title3.3</Title><Description>Descript3.3</Description></Comment><Comment><Title>Title3.4</Title><Description>Descript3.4</Description></Comment></Comments></books>

Regards,
Amol
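
The NullPointerException above is the usual symptom of calling back into Spark from inside a transformation: sc and spark exist only on the driver, so xml2DF cannot run inside rdd.map on the executors. One workaround, sketched below under the assumption that the rdd over data.txt defined above is still in scope, is to parse with plain scala.xml inside the map instead:

import scala.xml.XML

// Parse each line of data.txt on the executors; no SparkContext needed here.
val parsed = rdd.map { line =>
  val cols = line.split(',') // SequenceID, Name, City, XML blob
  val root = XML.loadString(cols(3))
  val comments = (root \\ "Comment").map(c =>
    ((c \ "Title").text, (c \ "Description").text))
  (cols(0).toInt, comments)
}
parsed.collect
// Expected shape: Array((1, List((Title1.1,Descript1.1), ...)), (2, ...), ...)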





RE: SparkSQL to read XML Blob data to create multiple rows

Posted by "Talap, Amol" <am...@capgemini.com>.
Thanks so much Zhang. This definitely helps.


Re: SparkSQL to read XML Blob data to create multiple rows

Posted by Yong Zhang <ja...@hotmail.com>.
scala>spark.version
res6: String = 2.1.1

scala>val rdd  = sc.parallelize(Seq("""<books><Comments><Comment><Title>Title1.1</Title><Description>Description_1.1</Description></Comment>
<Comment><Title>Title1.2</Title><Description>Description_1.2</Description></Comment>
<Comment><Title>Title1.3</Title><Description>Description_1.3</Description></Comment>
</Comments></books>"""))
rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala>import com.databricks.spark.xml.XmlReader

scala>val df = new XmlReader().xmlRdd(spark.sqlContext, rdd)
df: org.apache.spark.sql.DataFrame = [Comments: struct<Comment: array<struct<Description:string,Title:string>>>]

scala>df.printSchema
root
 |-- Comments: struct (nullable = true)
 |    |-- Comment: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- Description: string (nullable = true)
 |    |    |    |-- Title: string (nullable = true)

scala>df.show(false)
+--------------------------------------------------------------------------------------------------+
|Comments                                                                                          |
+--------------------------------------------------------------------------------------------------+
|[WrappedArray([Description_1.1,Title1.1], [Description_1.2,Title1.2], [Description_1.3,Title1.3])]|
+--------------------------------------------------------------------------------------------------+


scala>df.withColumn("comment", explode(df("Comments.Comment"))).select($"comment.Description", $"comment.Title").show
+---------------+--------+
|    Description|   Title|
+---------------+--------+
|Description_1.1|Title1.1|
|Description_1.2|Title1.2|
|Description_1.3|Title1.3|
+---------------+--------+




Re: SparkSQL to read XML Blob data to create multiple rows

Posted by Judit Planas <ju...@epfl.ch>.
Hi Ayan,

I'm afraid I can't help you here. I'm a beginner in Spark with 0 
experience in Scala (always used PySpark).

In my case, I had a list of numbers, and the explode function was 
exactly what I was looking for.

Sorry about that, hope you can solve it.

Best,
Judit



RE: SparkSQL to read XML Blob data to create multiple rows

Posted by "Talap, Amol" <am...@capgemini.com>.
Thanks Judit, Ayan
Judit,
You almost got it: explode might help here.
But when I tried it, load() did not read from the xmlcomment column of oracle_data.

scala> val xmlDF = sqlContext.sql("SELECT * FROM oracle_data")
17/06/29 18:31:58 INFO parse.ParseDriver: Parsing command: SELECT * FROM oracle_data
17/06/29 18:31:58 INFO parse.ParseDriver: Parse Completed
…
scala> val xmlDF_flattened = xmlDF.withColumn("xmlcomment",explode(sqlContext.read.format("com.databricks.spark.xml").option("rowTag","book").load($"xmlcomment")))
<console>:22: error: overloaded method value load with alternatives:
  ()org.apache.spark.sql.DataFrame <and>
  (path: String)org.apache.spark.sql.DataFrame
cannot be applied to (org.apache.spark.sql.ColumnName)
       val xmlDF_flattened = xmlDF.withColumn("xmlcomment",explode(sqlContext.read.format("com.databricks.spark.xml").option("rowTag","book").load($"xmlcomment")))

Ayan,
Output of books_inexp.show was as below:

title           | author
Midnight Rain   | Ralls, Kim
Maeve Ascendant | Corets, Eva

Regards,
Amol
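
For reference, load() fails above because, as the two overloads in the error message show, DataFrameReader.load takes a file path (or nothing), not a Column. With the spark-xml 0.4.x API used in this thread, one alternative is to pull the XML strings out as an RDD[String] and hand them to XmlReader. The sketch below is illustrative only and, on its own, does not keep the other columns of oracle_data:

import com.databricks.spark.xml.XmlReader

// Extract the XML strings from the column and parse them with XmlReader.
val xmlRDD = xmlDF.select("xmlcomment").rdd.map(_.getString(0))
val commentsDF = new XmlReader().xmlRdd(sqlContext, xmlRDD)
commentsDF.printSchema()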

Re: SparkSQL to read XML Blob data to create multiple rows

Posted by Judit Planas <ju...@epfl.ch>.
Hi Amol,

Not sure I completely understand your question, but the SQL function
"explode" may help you:
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.explode

Here you can find a nice example:
https://stackoverflow.com/questions/38210507/explode-in-pyspark

HTH,
Judit
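
As a minimal spark-shell illustration of explode on an array column (a sketch; spark.implicits._ is assumed to be in scope, as it is in the shell):

import org.apache.spark.sql.functions.explode

// One input row with an array column becomes one output row per element.
val df = Seq((1, Seq("a", "b", "c"))).toDF("id", "letters")
df.withColumn("letter", explode($"letters")).show()
// +---+---------+------+
// | id|  letters|letter|
// +---+---------+------+
// |  1|[a, b, c]|     a|
// |  1|[a, b, c]|     b|
// |  1|[a, b, c]|     c|
// +---+---------+------+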



Re: SparkSQL to read XML Blob data to create multiple rows

Posted by ayan guha <gu...@gmail.com>.
Hi

Not sure if I follow your issue. Can you please post output of
books_inexp.show()?




-- 
Best Regards,
Ayan Guha