Posted to user@spark.apache.org by Prathamesh Dharangutte <pr...@gmail.com> on 2016/02/21 14:19:04 UTC
spark-xml can't recognize schema
I am trying to parse an XML file using spark-xml, but when I print the
schema it shows only root instead of the hierarchy. I am using SQLContext
to read the data, following this video:
https://www.youtube.com/watch?v=NemEp53yGbI
The structure of the XML file is roughly like this:
<books>
  <book>
    <name></name>
    <price></price>
    <orderId></orderId>
  </book>
  <book>
    //Some more data
  </book>
</books>
For some books there are multiple orders (a large number of orderId
elements), while for others orderId occurs just once and is empty. I set
the "rowTag" option to book. How do I proceed, or is there another way to
tackle this problem? Help would be much appreciated. Thank you.
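One possible way to handle an element that is empty in some rows and repeated in others is to supply the schema up front rather than rely on inference. The following is only a sketch under assumptions: the object name BookSchema, the choice of string types, and the idea of declaring orderId as an array are illustrative and not taken from the thread, and whether spark-xml 0.3.2 honours such a schema for repeated elements would need to be verified against that version.

```scala
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.types.{ArrayType, StringType, StructField, StructType}

object BookSchema {
  // Hypothetical schema for the <book> rows described above. Every field is a
  // nullable string, and orderId is declared as an array so that books with
  // many <orderId> elements and books with a single empty one share a type.
  val schema: StructType = StructType(Seq(
    StructField("name", StringType, nullable = true),
    StructField("price", StringType, nullable = true),
    StructField("orderId", ArrayType(StringType), nullable = true)
  ))

  // Reads the file with the schema supplied explicitly instead of inferred.
  def load(sqlContext: SQLContext, path: String): DataFrame =
    sqlContext.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "book")
      .schema(schema)
      .load(path)
}
```

Calling BookSchema.load(sqlContext, path).printSchema() should then print the three declared fields regardless of what inference would have produced, which also helps separate schema-inference problems from file-reading problems.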
Re: spark-xml can't recognize schema
Posted by Sebastian Piu <se...@gmail.com>.
No, because you didn't say that explicitly. Can you share a sample file too?
On Sun, 21 Feb 2016, 14:34 Prathamesh Dharangutte <pr...@gmail.com>
wrote:
> I am using Spark 1.4.0 with Scala 2.10.4 and version 0.3.2 of spark-xml.
> orderId is empty for some books and has multiple entries for other
> books. Did you include that in your XML file?
>
> *From: *Sebastian Piu
> *Sent: *Sunday, 21 February 2016 20:00
> *To: *Prathamesh Dharangutte
> *Cc: *user@spark.apache.org
> *Subject: *Re: spark-xml can't recognize schema
>
> Just ran that code and it works fine, here is the output:
>
> What version are you using?
>
> val ctx = SQLContext.getOrCreate(sc)
> val df = ctx.read.format("com.databricks.spark.xml").option("rowTag", "book").load("file:///tmp/sample.xml")
> df.printSchema()
>
> root
> |-- name: long (nullable = true)
> |-- orderId: long (nullable = true)
> |-- price: long (nullable = true)
Re: spark-xml can't recognize schema
Posted by Sebastian Piu <se...@gmail.com>.
I just ran that code and it works fine. What version are you using?
Here is what I ran and the output:
val ctx = SQLContext.getOrCreate(sc)
val df = ctx.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "book")
  .load("file:///tmp/sample.xml")
df.printSchema()

root
 |-- name: long (nullable = true)
 |-- orderId: long (nullable = true)
 |-- price: long (nullable = true)
Re: spark-xml can't recognize schema
Posted by Dave Moyers <da...@icloud.com>.
Make sure the XML input file is well-formed (check your end tags).
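A quick way to act on this suggestion is to run the input through the JDK's own XML parser before handing it to Spark. This is a minimal sketch, not something from the thread: the object name WellFormedCheck and its isWellFormed helper are hypothetical, and the check only covers well-formedness (balanced, properly nested tags), not whether the content matches what spark-xml expects.

```scala
import java.io.StringReader
import javax.xml.parsers.DocumentBuilderFactory
import org.xml.sax.InputSource
import scala.util.Try

object WellFormedCheck {
  // Returns true when the JDK parser accepts the text without a fatal error.
  // On malformed input the default error handler may also print a
  // "[Fatal Error]" line to stderr before the exception is caught here.
  def isWellFormed(xml: String): Boolean = Try {
    val builder = DocumentBuilderFactory.newInstance().newDocumentBuilder()
    builder.parse(new InputSource(new StringReader(xml)))
  }.isSuccess
}
```

For a file on disk, the same builder can be given a java.io.File instead of an InputSource. A mismatched end tag such as <name>a</book> makes the check return false.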
Re: spark-xml can't recognize schema
Posted by Prathamesh Dharangutte <pr...@gmail.com>.
This is the code I am using to parse the XML file:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}
import com.databricks.spark.xml

object XmlProcessing {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("XmlProcessing")
      .setMaster("local")

    val sc = new SparkContext(conf)
    val sqlContext: SQLContext = new SQLContext(sc)

    loadXMLdata(sqlContext)
  }

  // Reads the file with each <book> element as one row and prints the inferred schema.
  def loadXMLdata(sqlContext: SQLContext): Unit = {
    val df: DataFrame = sqlContext.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "book")
      .load("/home/prathamsh/Workspace/Xml/datafiles/sample.xml")

    df.printSchema()
  }
}
Re: spark-xml can't recognize schema
Posted by Sebastian Piu <se...@gmail.com>.
Can you paste the code you are using?