Posted to user@spark.apache.org by rakesh sharma <ra...@hotmail.com> on 2015/08/31 13:40:24 UTC

Reading xml in java using spark

I want to parse an XML file in Spark, but the examples I have found read it as a text file, and mapping that text to XML will be a tedious job. How can I find the number of elements of a particular type that way? Any help in Java/Scala code is also welcome.

thanks
rakesh

Re: Reading xml in java using spark

Posted by Darin McBeath <dd...@yahoo.com.INVALID>.
Another option might be to leverage spark-xml-utils (https://github.com/dmcbeath/spark-xml-utils)

This is a collection of xml utilities that I've recently revamped that make it relatively easy to use xpath, xslt, or xquery within the context of a Spark application (or at least I think so).  My previous attempt was not overly friendly, but as I've learned more about Spark (and needed easier to use xml utilities) I've hopefully made this easier to use and understand.  I hope others find it useful.

Back to your problem.  Assuming you have a bunch of xml records in an RDD, you should be able to do something like the following to count the number of elements for that type.  In the example below, I'm counting the number of references in documents.  The xmlKeyPair is an RDD of type (String,String) where the first item is the 'key' and the second item is the xml record.  The xpath expression identifies the 'reference' element I want to count.

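import com.elsevier.spark_xml_utils.xpath.XPathProcessor
import scala.collection.JavaConverters._
import java.util.HashMap

// Build one XPathProcessor per partition and reuse it for every record in it.
xmlKeyPair.mapPartitions(recsIter => {
  // count() returns the number of ref-info elements in one document
  val xpath = "count(/xocs:doc/xocs:meta/xocs:references/xocs:ref-info)"
  // Map the xocs prefix used in the xpath to its namespace URI
  val namespaces = new HashMap[String,String](Map(
    "xocs" -> "http://www.elsevier.com/xml/xocs/dtd"
  ).asJava)
  val proc = XPathProcessor.getInstance(xpath,namespaces)
  // rec._2 is the xml record; the count comes back as a string
  recsIter.map(rec => proc.evaluateString(rec._2).toInt)
}).sum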
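
For anyone who wants the same per-record count in plain Java, here is a minimal sketch that uses only the JDK's javax.xml.xpath API rather than spark-xml-utils. The class name ElementCounter, the sample records, and the un-namespaced xpath are all made up for illustration; namespaced documents such as the xocs ones above would additionally need a NamespaceContext. The idea is the same as in the Scala snippet: compile the expression once, then evaluate it against each record and add up the results.

```java
import java.io.StringReader;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper: counts nodes in one XML record with a JDK XPath expression.
final class ElementCounter {
    private final XPathExpression countExpr;

    // Compile the count(...) expression once so it can be reused for many records.
    ElementCounter(String countXPath) {
        try {
            this.countExpr = XPathFactory.newInstance().newXPath().compile(countXPath);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Invalid XPath: " + countXPath, e);
        }
    }

    // Parse a single XML record and return the value of the count(...) expression.
    int count(String xmlRecord) {
        try {
            Object result = countExpr.evaluate(
                new InputSource(new StringReader(xmlRecord)), XPathConstants.NUMBER);
            return ((Number) result).intValue();
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("Could not evaluate record", e);
        }
    }

    public static void main(String[] args) {
        ElementCounter counter = new ElementCounter("count(/doc/references/ref)");
        String[] records = {
            "<doc><references><ref/><ref/><ref/></references></doc>",
            "<doc><references><ref/></references></doc>"
        };
        int total = 0;
        for (String record : records) {
            total += counter.count(record);
        }
        System.out.println(total); // prints 4
    }
}
```

In a Spark job, one ElementCounter would typically be created per partition (inside mapPartitions) instead of being shipped from the driver, since XPathExpression objects are neither serializable nor thread-safe.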


There is more documentation on the spark-xml-utils github site.  Let me know if the documentation is not clear or if you have any questions. 

Darin.




Re: Reading xml in java using spark

Posted by Rick Hillegas <ri...@gmail.com>.
Hi Rakesh,

You might also take a look at the Derby code. 
org.apache.derby.vti.XmlVTI provides a number of static methods for 
turning an XML resource into a JDBC ResultSet.

Thanks,
-Rick

On 8/31/15 4:44 AM, Sonal Goyal wrote:
>
> I think the Mahout project had an XmlInputFormat which you can leverage.
>
> On Aug 31, 2015 5:10 PM, "rakesh sharma" <ra...@hotmail.com> wrote:
>
>     I want to parse an xml file in spark
>     But as far as example is concerned it reads it as text file. The
>     mapping to xml will be a tedious job.
>     How can I find the number of elements of a particular type using
>     that? Any help in java/scala code is also welcome
>
>     thanks
>     rakesh
>


Re: Reading xml in java using spark

Posted by Sonal Goyal <so...@gmail.com>.
I think the Mahout project had an XmlInputFormat which you can leverage.
On Aug 31, 2015 5:10 PM, "rakesh sharma" <ra...@hotmail.com> wrote:

> I want to parse an xml file in spark
> But as far as example is concerned it reads it as text file. The mapping to
> xml will be a tedious job.
> How can I find the number of elements of a particular type using that? Any
> help in java/scala code is also welcome
>
> thanks
> rakesh
>
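
To illustrate the general idea behind a tag-delimited input format like the one suggested above, here is a rough plain-Java sketch. It is not Mahout's code: XmlRecordSplitter and its split method are hypothetical names, and it works on an in-memory string instead of a Hadoop input split. It just shows how text read as a plain file can be cut into one record per element by scanning for a start tag and the matching end tag, after which counting elements of that type is a matter of counting records.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: cuts raw text into records delimited by a start and end tag.
final class XmlRecordSplitter {

    // Returns every substring that begins with startTag and ends with the next endTag.
    // Assumption: elements of this type are not nested inside each other, and the
    // start tag is written exactly as given (for example, without attributes).
    static List<String> split(String text, String startTag, String endTag) {
        List<String> records = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = text.indexOf(startTag, from);
            if (start < 0) {
                break; // no further records
            }
            int end = text.indexOf(endTag, start + startTag.length());
            if (end < 0) {
                break; // unterminated record, ignore the tail
            }
            end += endTag.length();
            records.add(text.substring(start, end));
            from = end;
        }
        return records;
    }

    public static void main(String[] args) {
        String text = "<refs><ref>a</ref>\n<ref>b</ref></refs>";
        List<String> records = split(text, "<ref>", "</ref>");
        System.out.println(records.size()); // prints 2
        System.out.println(records.get(0)); // prints <ref>a</ref>
    }
}
```

A real input format also has to deal with records that straddle block boundaries, which this sketch ignores.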