Posted to dev@spark.apache.org by Ajay Nair <pr...@gmail.com> on 2014/05/05 10:40:48 UTC

Apache spark on 27gb wikipedia data

Hi,

I am using 1 master and 3 slave workers to process 27 GB of Wikipedia
data. The data is tab separated, and each line contains the title of a
page followed by the page contents. I am using the regular expression to
extract links, as described on the site below:
http://ampcamp.berkeley.edu/big-data-mini-course/graph-analytics-with-graphx.html#running-pagerank-on-wikipedia
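For reference, the extraction step looks roughly like this (a minimal
sketch, not the exact tutorial code; the HDFS path and the [[...]] link
pattern here are placeholders):

// Minimal sketch of the extraction step (the path and the link pattern
// are placeholders, not the exact tutorial code).
val linkPattern = """\[\[([^\]|]+)""".r

val articles = sc.textFile("hdfs:///wiki/articles.tsv")  // placeholder path
val titleLinkPairs = articles.flatMap { line =>
  line.split('\t') match {
    case Array(title, body, _*) =>
      // one (title, linkTarget) pair per [[link]] found in the body
      linkPattern.findAllMatchIn(body).map(m => (title, m.group(1))).toList
    case _ => Nil  // skip malformed lines
  }
}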

Although it runs fine on a data set of around 300 MB, it runs into issues
when I try to execute the same code on the 27 GB data from HDFS.
The error thrown is given below:
14/05/05 07:15:22 WARN scheduler.TaskSetManager: Loss was due to
java.lang.OutOfMemoryError
java.lang.OutOfMemoryError: GC overhead limit exceeded
	at java.util.regex.Matcher.<init>(Matcher.java:224)

Is there any way to overcome this issue?

My cluster nodes are EC2 m3.large machines.

Thanks
Ajay




Re: Apache spark on 27gb wikipedia data

Posted by Prashant Sharma <sc...@gmail.com>.
Try tuning options such as spark.storage.memoryFraction and
spark.executor.memory, described here:
http://spark.apache.org/docs/latest/configuration.html
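For example, something like this on a SparkConf (the values and the app
name are placeholders to adjust for your instance size, not
recommendations):

// Illustrative settings for a standalone app; values are placeholders.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("WikipediaPageRank")             // hypothetical app name
  .set("spark.executor.memory", "6g")          // heap available per executor
  .set("spark.storage.memoryFraction", "0.3")  // leave more heap for task data
val sc = new SparkContext(conf)

For spark-shell, the configuration page linked above describes how to pass
the same properties at launch.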

Thanks

Prashant Sharma


On Mon, May 5, 2014 at 9:34 PM, Ajay Nair <pr...@gmail.com> wrote:

> Hi,
>
> Is there any way to overcome this error? I am running this from the
> spark-shell; is that a cause for concern?

Re: Apache spark on 27gb wikipedia data

Posted by Ajay Nair <pr...@gmail.com>.
Hi,

Is there any way to overcome this error? I am running this from the
spark-shell; is that a cause for concern?




Re: Apache spark on 27gb wikipedia data

Posted by Prashant Sharma <sc...@gmail.com>.
I was just thinking we could emit a warning whenever that error comes up,
telling the user to tune either the memoryFraction or executor memory
options. The warning would be displayed when the TaskSetManager receives
task failures due to OOM.
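Something along these lines, purely as an illustration (the method name and
message below are hypothetical, not actual Spark internals):

// Hypothetical sketch of the proposed check, not actual Spark source.
def warnIfOutOfMemory(failure: Throwable): Unit = failure match {
  case _: OutOfMemoryError =>
    // point the user at the two tuning knobs mentioned above
    System.err.println(
      "WARN: task lost due to java.lang.OutOfMemoryError; consider " +
      "increasing spark.executor.memory or lowering " +
      "spark.storage.memoryFraction")
  case _ => ()  // other failure causes: no memory-tuning hint
}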

Prashant Sharma


On Mon, May 5, 2014 at 2:10 PM, Ajay Nair <pr...@gmail.com> wrote:

> Hi,
>
> I am using 1 master and 3 slave workers to process 27 GB of tab-separated
> Wikipedia data, extracting links with the regular expression described at
> http://ampcamp.berkeley.edu/big-data-mini-course/graph-analytics-with-graphx.html#running-pagerank-on-wikipedia
>
> Although it runs fine on a data set of around 300 MB, it fails on the
> 27 GB data from HDFS with:
> java.lang.OutOfMemoryError: GC overhead limit exceeded
>
> Is there any way to overcome this issue?
>
> Thanks
> Ajay
>
>