Posted to issues@spark.apache.org by "Steven Ruppert (JIRA)" <ji...@apache.org> on 2017/01/06 03:19:58 UTC

[jira] [Created] (SPARK-19098) Shuffled data leak/size doubling in ConnectedComponents/Pregel iterations

Steven Ruppert created SPARK-19098:
--------------------------------------

             Summary: Shuffled data leak/size doubling in ConnectedComponents/Pregel iterations
                 Key: SPARK-19098
                 URL: https://issues.apache.org/jira/browse/SPARK-19098
             Project: Spark
          Issue Type: Bug
          Components: GraphX
    Affects Versions: 2.1.0
         Environment: Linux x64
Cloudera CDH 5.8.0 hadoop (roughly hadoop 2.7.0)
Spark on YARN, dynamic allocation with shuffle service
Input/Output data on HDFS
kryo serialization turned on
checkpointing directory set on HDFS
            Reporter: Steven Ruppert
            Priority: Critical


I'm seeing a strange memory-leak-but-not-really problem in a pretty vanilla use of ConnectedComponents, one that works fine with identical code on Spark 2.0.1 but not on 2.1.0.

I unfortunately haven't narrowed this down to a test case yet, nor do I have access to the original logs, so this initial report will be a little vague. However, the behavior as described might ring a bell for somebody.

Roughly: 

{noformat}
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.graphx.lib.ConnectedComponents
import org.apache.spark.rdd.RDD

type ComponentId = VertexId // ConnectedComponents labels each vertex with the smallest id in its component

val edges: RDD[Edge[Int]] = ???          // loaded from file
val vertices: RDD[(VertexId, Int)] = ??? // loaded from file
val graph = Graph(vertices, edges)

val components: RDD[(VertexId, ComponentId)] = ConnectedComponents
  .run(graph, 10) // cap at 10 Pregel iterations
  .vertices
{noformat}

Running this against my input of ~5B edges and ~3B vertices leads to a strange doubling of shuffle traffic in each round of Pregel (inside ConnectedComponents), increasing from the actual data size of ~50 GB, to 100 GB, to 200 GB, all the way to around 40 TB before I killed the job. The data being shuffled was apparently an RDD of ShippableVertexPartition.
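
For what it's worth, here is a minimal sketch of how the per-iteration growth could be checked at a smaller scale. It is hypothetical: it substitutes a synthetic log-normal graph from GraphGenerators for my real input and simply forces each run, so the shuffle write sizes of the Pregel stages can be compared between runs in the Spark UI.

{noformat}
import org.apache.spark.sql.SparkSession
import org.apache.spark.graphx.lib.ConnectedComponents
import org.apache.spark.graphx.util.GraphGenerators

// Hypothetical small-scale stand-in for the real ~5B-edge input.
val spark = SparkSession.builder().appName("cc-shuffle-growth").getOrCreate()
val sc = spark.sparkContext
val graph = GraphGenerators.logNormalGraph(sc, numVertices = 1000000)

// Run ConnectedComponents with an increasing iteration cap, forcing evaluation each time;
// the shuffle written per Pregel round should stay roughly flat rather than double.
for (maxIters <- 1 to 5) {
  ConnectedComponents.run(graph, maxIters).vertices.count()
}
{noformat}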

Oddly enough, only the kryo-serialized shuffle data doubled in size. The heap usage of the executors themselves remained stable, or at least did not account one-to-one for the 40 TB of shuffled data; I certainly do not have 40 TB of RAM. Furthermore, kryo reference tracking is still turned on, so whatever is leaking somehow gets around that.
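
To be explicit about the serializer setup (since the growth only shows up in the kryo-serialized shuffle data), this is roughly what the relevant configuration looks like. It is a sketch reconstructed from the environment notes above rather than my exact submit config, and the HDFS checkpoint path is elided.

{noformat}
import org.apache.spark.SparkConf
import org.apache.spark.graphx.GraphXUtils

// Reconstructed sketch of the relevant settings, not the exact configuration used.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.referenceTracking", "true")  // explicitly left at the default
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")
GraphXUtils.registerKryoClasses(conf)           // registers GraphX's internal classes with kryo

// Checkpoint directory on HDFS (actual path elided):
// sc.setCheckpointDir("hdfs://...")
{noformat}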

I'll update this ticket once I have more details, unless somebody else with the same problem reports back first.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org