
Posted to user@spark.apache.org by Niko Stahl <r....@gmail.com> on 2014/03/27 14:45:46 UTC

WikipediaPageRank Data Set

Hello,

I would like to run the
WikipediaPageRank <https://github.com/amplab/graphx/blob/f8544981a6d05687fa950639cb1eb3c31e9b6bf5/examples/src/main/scala/org/apache/spark/examples/bagel/WikipediaPageRank.scala> example,
but the Wikipedia dump XML files are no longer available on
Freebase. Does anyone know an alternative source for the data?

Thanks,
Niko

Re: WikipediaPageRank Data Set

Posted by Ankur Dave <an...@gmail.com>.

In particular, we are using this dataset:
http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
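
The file name indicates a bzip2-compressed XML file, so it can be read as a stream of lines without first decompressing it to disk. A minimal sketch in Python, assuming the dump has already been downloaded locally; the function name `iter_dump_lines` and any path passed to it are hypothetical, not part of Spark or GraphX:

```python
import bz2
from typing import Iterator


def iter_dump_lines(path: str) -> Iterator[str]:
    """Yield decoded lines from a bzip2-compressed text file, one at a time."""
    # bz2.open in text mode decompresses incrementally, so the whole dump
    # is never held in memory at once.
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            # Drop only the trailing newline; keep leading indentation.
            yield line.rstrip("\n")
```

Calling `iter_dump_lines("enwiki-latest-pages-articles.xml.bz2")` would then produce the raw XML lines lazily, which is useful for inspecting the format before setting up a distributed job.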

Ankur <http://www.ankurdave.com/>


On Sun, Mar 30, 2014 at 12:45 AM, Ankur Dave <an...@gmail.com> wrote:

> The GraphX team has been using Wikipedia dumps from
> http://dumps.wikimedia.org/enwiki/. Unfortunately, these are in a less
> convenient format than the Freebase dumps. In particular, an article may
> span multiple lines, so more involved input parsing is required.
>
> Dan Crankshaw (cc'd) wrote a driver that uses a Hadoop InputFormat XML
> parser from Mahout: see WikiPipelineBenchmark.scala <https://github.com/amplab/graphx/blob/860918486a81cb4c88a056a9b64b1f7d8b0ed5ff/graphx/src/main/scala/org/apache/spark/graphx/WikiPipelineBenchmark.scala#L157> and
> WikiArticle.scala <https://github.com/amplab/graphx/blob/860918486a81cb4c88a056a9b64b1f7d8b0ed5ff/graphx/src/main/scala/org/apache/spark/graphx/WikiArticle.scala>.
>
> However, we plan to upload a parsed version of this dataset to S3 for
> easier access from Spark and GraphX.
>
> Ankur <http://www.ankurdave.com/>
>
> On 27 Mar, 2014, at 9:45 pm, Niko Stahl <r....@gmail.com> wrote:
>
>> I would like to run the WikipediaPageRank <https://github.com/amplab/graphx/blob/f8544981a6d05687fa950639cb1eb3c31e9b6bf5/examples/src/main/scala/org/apache/spark/examples/bagel/WikipediaPageRank.scala> example, but the Wikipedia dump XML files are no longer available on
>> Freebase. Does anyone know an alternative source for the data?

Re: WikipediaPageRank Data Set

Posted by Ankur Dave <an...@gmail.com>.

The GraphX team has been using Wikipedia dumps from
http://dumps.wikimedia.org/enwiki/. Unfortunately, these are in a less
convenient format than the Freebase dumps. In particular, an article may
span multiple lines, so more involved input parsing is required.

Dan Crankshaw (cc'd) wrote a driver that uses a Hadoop InputFormat XML
parser from Mahout: see
WikiPipelineBenchmark.scala <https://github.com/amplab/graphx/blob/860918486a81cb4c88a056a9b64b1f7d8b0ed5ff/graphx/src/main/scala/org/apache/spark/graphx/WikiPipelineBenchmark.scala#L157> and
WikiArticle.scala <https://github.com/amplab/graphx/blob/860918486a81cb4c88a056a9b64b1f7d8b0ed5ff/graphx/src/main/scala/org/apache/spark/graphx/WikiArticle.scala>.

However, we plan to upload a parsed version of this dataset to S3 for
easier access from Spark and GraphX.
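
To illustrate why line-by-line input is not enough for these dumps, here is a minimal sketch in Python that groups consecutive lines into one record per article. It is a simplified local stand-in, not the Mahout XML InputFormat mentioned above, and it assumes that each `<page>` opening tag and `</page>` closing tag sits on its own line. The function name `iter_pages` is hypothetical.

```python
from typing import Iterable, Iterator, List


def iter_pages(lines: Iterable[str]) -> Iterator[str]:
    """Group lines into one string per <page>...</page> element."""
    buffer: List[str] = []
    inside = False
    for line in lines:
        stripped = line.strip()
        # Start collecting when an opening tag is seen (assumed to be on
        # its own line).
        if not inside and stripped.startswith("<page>"):
            inside = True
            buffer = []
        if inside:
            buffer.append(line)
            # Emit the record once the closing tag is reached.
            if stripped.endswith("</page>"):
                yield "\n".join(buffer)
                inside = False


sample = [
    "<mediawiki>",
    "  <page>",
    "    <title>A</title>",
    "    <text>line one",
    "line two</text>",
    "  </page>",
    "  <page>",
    "    <title>B</title>",
    "  </page>",
    "</mediawiki>",
]
pages = list(iter_pages(sample))
```

Each element of `pages` holds one complete article, including text that spans several lines. In a distributed setting the same grouping has to happen where the input is split, which is why a record-aware input format is needed there.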

Ankur <http://www.ankurdave.com/>

On 27 Mar, 2014, at 9:45 pm, Niko Stahl <r....@gmail.com> wrote:

> I would like to run the
> WikipediaPageRank <https://github.com/amplab/graphx/blob/f8544981a6d05687fa950639cb1eb3c31e9b6bf5/examples/src/main/scala/org/apache/spark/examples/bagel/WikipediaPageRank.scala> example,
> but the Wikipedia dump XML files are no longer available on
> Freebase. Does anyone know an alternative source for the data?

Re: WikipediaPageRank Data Set

Posted by Tsai Li Ming <ma...@ltsai.com>.

I’m interested in obtaining the data set too.

Thanks!

On 27 Mar, 2014, at 9:45 pm, Niko Stahl <r....@gmail.com> wrote:

> Hello,
> 
> I would like to run the WikipediaPageRank example, but the Wikipedia dump XML files are no longer available on Freebase. Does anyone know an alternative source for the data?
> 
> Thanks,
> Niko