Posted to dev@spark.apache.org by Ajay Nair <pr...@gmail.com> on 2014/04/26 23:20:03 UTC

Parsing Wikipedia XML data in Spark

Is there a way in Spark to parse the Wikipedia XML dump? It seems like the
Freebase dump is no longer available. Also, does the Spark shell support the
SAX-based XML loadFile parser that is present in Scala?

Thanks
AJ

Re: Parsing Wikipedia XML data in Spark

Posted by Geoffroy Fouquier <ge...@exensa.com>.
We did it using Scala XML with Spark.

We start by creating an RDD in which each page is stored as a single line:
   - split the XML dump with xml_split
   - process each split with a shell script that removes the "xml_split" tags
and the siteinfo section, and puts each page on a single line (a sketch of
this step follows the list)
   - copy the resulting files to HDFS
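
For illustration, here is a minimal stand-alone Scala sketch of that
flattening step. It is not the script we used: it assumes the dump is
formatted the usual way (each <page> tag on its own line), keeps only
<page>...</page> blocks, which drops the siteinfo section and the xml_split
wrapper tags, and writes each page on a single line. The object name and
argument convention are made up for the example.

  import scala.io.Source
  import java.io.PrintWriter

  // Hypothetical flattener for one xml_split output file:
  //   FlattenSplit <input-split.xml> <output.txt>
  object FlattenSplit {
    def main(args: Array[String]): Unit = {
      val out = new PrintWriter(args(1))
      val page = new StringBuilder
      var inPage = false
      for (line <- Source.fromFile(args(0)).getLines()) {
        val t = line.trim
        if (t.startsWith("<page")) { inPage = true; page.clear() }
        if (inPage) page.append(t).append(' ')  // collapse the page onto one line
        if (t.endsWith("</page>")) { inPage = false; out.println(page.toString.trim) }
      }
      out.close()
    }
  }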

Then the dataset may be loaded as a text file and processed:

  val rawDataset = sparkContext.textFile(input)
  val allDocuments = rawDataset.map { document =>
    // each input line holds one complete <page> element
    val page = scala.xml.XML.loadString(document)
    val pageTitle = (page \ "title").text
    [...]
  }
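
For example, to pull out both the title and the raw wikitext body (a sketch,
assuming rawDataset is defined as above; the element names follow the
MediaWiki export schema, where the article body sits in <revision><text>):

  val titleAndText = rawDataset.map { document =>
    val page = scala.xml.XML.loadString(document)
    val title = (page \ "title").text
    // the article body is wikitext markup inside <revision><text>
    val text = (page \ "revision" \ "text").text
    (title, text)
  }

The result is an RDD of (title, wikitext) pairs that can be cached or written
back to HDFS for further processing.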

We created a demo using this dataset: http://wikinsights.org

On 26/04/2014 23:20, Ajay Nair wrote:
> Is there a way in Spark to parse the Wikipedia XML dump? It seems like the
> Freebase dump is no longer available. Also, does the Spark shell support the
> SAX-based XML loadFile parser that is present in Scala?
>
> Thanks
> AJ
>

Geoffroy Fouquier
http://eXenSa.com