Posted to user@spark.apache.org by Ashish Soni <as...@gmail.com> on 2015/07/19 19:38:18 UTC
XML Parsing
Hi All,
I have an XML file with the same tag repeated multiple times, as shown below. Please
suggest the best way to process this data inside Spark as ...
How can I extract each opening/closing tag pair and process it, or how can I
combine multiple lines into a single line?
<review>
</review>
<review>
</review>
...
..
..
Thanks,
Re: XML Parsing
Posted by Ram Sriharsha <sr...@gmail.com>.
You would need to write an XML input format that splits the file into records
based on start/end tags, so that each <review>...</review> block becomes one record.
Mahout has an XmlInputFormat implementation you should be able to import:
https://github.com/apache/mahout/blob/master/integration/src/main/java/org/apache/mahout/text/wikipedia/XmlInputFormat.java
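To make the idea of splitting on start/end tags concrete, here is a minimal plain-Scala sketch that needs neither Spark nor Hadoop. It is not Mahout's code: `TagSplitSketch` and `splitRecords` are hypothetical names used only for illustration, and it assumes the whole input fits in one string and that the tags are matched literally (so `<review id="1">` would not match `<review>`).

```scala
object TagSplitSketch {
  // Hypothetical helper: returns every startTag...endTag span found in `text`, in order.
  def splitRecords(text: String, startTag: String, endTag: String): List[String] = {
    @annotation.tailrec
    def loop(from: Int, acc: List[String]): List[String] = {
      val start = text.indexOf(startTag, from)
      if (start < 0) acc.reverse
      else {
        val end = text.indexOf(endTag, start + startTag.length)
        if (end < 0) acc.reverse // unterminated record at end of input: drop it
        else {
          val stop = end + endTag.length
          loop(stop, text.substring(start, stop) :: acc)
        }
      }
    }
    loop(0, Nil)
  }

  def main(args: Array[String]): Unit = {
    val xml = "<reviews>\n<review>\n  good\n</review>\n<review>\n  bad\n</review>\n</reviews>"
    // Collapse each multi-line record into a single line before printing.
    splitRecords(xml, "<review>", "</review>")
      .map(_.replaceAll("\\s*\\n\\s*", " "))
      .foreach(println)
  }
}
```

A real input format does the same scan over a byte stream and also has to handle records that straddle split boundaries, which this sketch ignores.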
Once you have such a format, you can use Spark's Hadoop API to read each XML
record as a String:
sc.newAPIHadoopFile(path, classOf[XmlInputFormat], classOf[LongWritable], classOf[Text])
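A fuller sketch of that call might look like the following. It assumes the Mahout integration jar is on the classpath, that the format reads its tags from the `xmlinput.start` and `xmlinput.end` configuration keys, and that it emits `LongWritable` keys (it extends Hadoop's `TextInputFormat`); please check these against the Mahout source linked above. `ReviewXmlJob`, the app name, and the input path argument are placeholders.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.mahout.text.wikipedia.XmlInputFormat
import org.apache.spark.{SparkConf, SparkContext}

object ReviewXmlJob {
  def main(args: Array[String]): Unit = {
    val path = args(0) // placeholder: location of the XML file
    val sc = new SparkContext(new SparkConf().setAppName("review-xml"))

    // Copy the Hadoop configuration and tell the input format which tags delimit a record.
    // Assumption: these are the keys Mahout's XmlInputFormat looks up.
    val conf = new Configuration(sc.hadoopConfiguration)
    conf.set("xmlinput.start", "<review>")
    conf.set("xmlinput.end", "</review>")

    val reviews = sc
      .newAPIHadoopFile(path, classOf[XmlInputFormat], classOf[LongWritable], classOf[Text], conf)
      // Hadoop reuses Writable objects, so copy each value out as a String right away.
      .map { case (_, text) => text.toString }

    // Each element is one <review>...</review> block; collapse it to a single line.
    val oneLine = reviews.map(_.replaceAll("\\s*\\n\\s*", " "))
    oneLine.take(5).foreach(println)

    sc.stop()
  }
}
```

From here each record can be handed to an XML parser of your choice inside a `map`.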
Ram