Posted to user@spark.apache.org by Thodoris Zois <th...@gmail.com> on 2018/10/28 20:54:00 UTC

[GraphX] - OOM Java Heap Space

Hello,

I have the edges of a graph stored as Parquet files (about 3 GB). I am loading the graph and trying to compute the total number of triplets and triangles. Here is my code:

import org.apache.spark.graphx.{Edge, Graph, PartitionStrategy}
import org.apache.spark.rdd.RDD

// Load the edge list for the given year and build the graph
val edges_parq = sqlContext.read.parquet(args(0) + "/year=" + year)
val edges: RDD[Edge[Int]] = edges_parq.rdd.map(row => Edge(row(0).asInstanceOf[Int], row(1).asInstanceOf[Int]))
val graph = Graph.fromEdges(edges, 1).partitionBy(PartitionStrategy.RandomVertexCut)

// The actual computation
// Count all triplets
val numberOfTriplets = graph.triplets.count

// Per-vertex triangle counts, keeping only vertices that belong to at least one triangle
val tmp = graph.triangleCount().vertices.filter { case (vid, count) => count > 0 }

// Sum of the per-vertex counts (each triangle is counted once at each of its three vertices)
val numberOfTriangles = tmp.map(_._2).sum()

Even though the triplet count completes, I cannot get the triangle count to finish. Every time, some executors throw an OOM (Java heap space) exception and the application fails.
I am using 100 executors (1 core and 6 GB each). I have also tried setting 'hdfsConf.set("mapreduce.input.fileinputformat.split.maxsize", "33554432")' in the code, but it made no difference.
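
For reference, this is roughly how I applied that setting (a sketch; I'm assuming the relevant Hadoop configuration is the one attached to the SparkContext, set before the Parquet read):

// Sketch: split-size hint on the SparkContext's Hadoop configuration,
// applied before reading the Parquet files (assumes `sc` is the active SparkContext)
sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.split.maxsize", "33554432")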

Here are some of my configurations:
--conf spark.driver.memory=20G 
--conf spark.driver.maxResultSize=20G 
--conf spark.yarn.executor.memoryOverhead=6144 
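
For completeness, the same resource settings written out as a SparkConf sketch (the executor count and sizes are the values mentioned above; in practice I pass everything as spark-submit flags):

// Sketch only: the settings above expressed programmatically
val conf = new org.apache.spark.SparkConf()
  .set("spark.driver.memory", "20g")
  .set("spark.driver.maxResultSize", "20g")
  .set("spark.yarn.executor.memoryOverhead", "6144")
  .set("spark.executor.instances", "100")   // 100 executors
  .set("spark.executor.cores", "1")         // 1 core each
  .set("spark.executor.memory", "6g")       // 6 GB each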

- Thodoris