Posted to user@spark.apache.org by Ashish Soni <as...@gmail.com> on 2015/07/19 19:38:18 UTC

XML Parsing

Hi All,

I have an XML file with the same tag repeated multiple times, as below. Please
suggest what would be the best way to process this data inside Spark as ...

How can I extract each opening/closing tag pair and process it, or how can I
combine multiple lines into a single line?

<review>
</review>
<review>
</review>
...
..
..

Thanks,

Re: XML Parsing

Posted by Ram Sriharsha <sr...@gmail.com>.
You would need an XML input format that splits the file into records based on
start/end tags instead of into lines.
Mahout has an XmlInputFormat implementation you should be able to import:
https://github.com/apache/mahout/blob/master/integration/src/main/java/org/apache/mahout/text/wikipedia/XmlInputFormat.java
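
To make "records based on start/end tags" concrete, here is a minimal sketch in plain Scala of the same idea on an in-memory string. TagRecords, extract and toSingleLine are hypothetical names made up for illustration; they are not part of Spark or Mahout, and a real input format also has to deal with file splits, which this ignores.

```scala
// Hypothetical helper, not part of Spark or Mahout: it mimics, on an
// in-memory string, what a tag-delimited record reader does on a file.
object TagRecords {

  /** Returns every substring running from startTag through the next endTag. */
  def extract(text: String, startTag: String, endTag: String): List[String] = {
    @annotation.tailrec
    def loop(from: Int, acc: List[String]): List[String] = {
      val start = text.indexOf(startTag, from)
      if (start < 0) acc.reverse
      else {
        val end = text.indexOf(endTag, start + startTag.length)
        if (end < 0) acc.reverse // unterminated record: drop it
        else {
          val stop = end + endTag.length
          loop(stop, text.substring(start, stop) :: acc)
        }
      }
    }
    loop(0, Nil)
  }

  /** Collapses a multi-line record onto one line, joining trimmed lines with a space. */
  def toSingleLine(record: String): String =
    record.split("\\r?\\n").map(_.trim).filter(_.nonEmpty).mkString(" ")
}
```

Calling TagRecords.extract(xml, "<review>", "</review>") gives one string per review element, and toSingleLine flattens each one if you want a single line per record.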

Once you have such a format, you can use Spark's Hadoop API to read each XML
record into a String:

sc.newAPIHadoopFile(path, classOf[XmlInputFormat], classOf[LongWritable], classOf[Text])
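
A fuller sketch of that call, in case it helps. It assumes Spark and Mahout's integration jar are on the classpath, that the Mahout class is the XmlInputFormat at the link above (which, as far as I can tell, extends TextInputFormat and so emits LongWritable keys and Text values), and that the start and end tags are passed through the Hadoop configuration. ReviewReader and readReviews are hypothetical names, and the "<review>" tags come from the example in the question.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.mahout.text.wikipedia.XmlInputFormat
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical wrapper for illustration only.
object ReviewReader {

  /** Reads each <review>...</review> block in the file at `path` as one String. */
  def readReviews(sc: SparkContext, path: String): RDD[String] = {
    // Copy the context's Hadoop settings so the shared configuration is not mutated.
    val conf = new Configuration(sc.hadoopConfiguration)
    // Tell the input format which tags delimit a record.
    conf.set(XmlInputFormat.START_TAG_KEY, "<review>")
    conf.set(XmlInputFormat.END_TAG_KEY, "</review>")

    sc.newAPIHadoopFile(path, classOf[XmlInputFormat],
        classOf[LongWritable], classOf[Text], conf)
      // Hadoop reuses Writable objects, so convert to String right away.
      .map { case (_, value) => value.toString }
  }
}
```

Each element of the resulting RDD is then one complete review, which you can hand to whatever XML parser you prefer inside a map.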

Ram

