Posted to issues@spark.apache.org by "Iulian Dragos (JIRA)" <ji...@apache.org> on 2016/02/03 15:47:39 UTC

[jira] [Created] (SPARK-13159) External shuffle service broken w/ Mesos

Iulian Dragos created SPARK-13159:
-------------------------------------

             Summary: External shuffle service broken w/ Mesos
                 Key: SPARK-13159
                 URL: https://issues.apache.org/jira/browse/SPARK-13159
             Project: Spark
          Issue Type: Bug
          Components: Mesos
    Affects Versions: 2.0.0
            Reporter: Iulian Dragos


Dynamic allocation and the external shuffle service do not work together on Mesos for applications that run longer than {{spark.network.timeout}}.

After two minutes (the default value of {{spark.network.timeout}}), I see a lot of {{FileNotFoundException}}s and Spark jobs just fail.

{code}
16/02/03 15:26:51 WARN TaskSetManager: Lost task 728.0 in stage 3.0 (TID 2755, 10.0.1.208): java.io.FileNotFoundException: /tmp/blockmgr-ea5b2392-626a-4278-8ae3-fb2c4262d758/02/shuffle_1_728_0.data.57efd66e-7662-4810-a5b1-56d7e2d7a9f0 (No such file or directory)
	at java.io.FileOutputStream.open(Native Method)
	at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
	at org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:88)
	at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:181)
	at org.apache.spark.util.collection.WritablePartitionedPairCollection$$anon$1.writeNext(WritablePartitionedPairCollection.scala:56)
	at org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:661)
	at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:71)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:77)
...
{code}
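
For context, the failing setup is dynamic allocation combined with the external shuffle service; a submission along these lines reproduces it (the master URL, class, and jar are illustrative values, not copied from the original report):

{code}
spark-submit \
  --master mesos://zk://master:2181/mesos \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.network.timeout=120s \
  --class com.example.MyJob myjob.jar
{code}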

h3. Analysis

The Mesos external shuffle service needs a way to know when it is safe to delete the shuffle files for a given application. The current solution (which seemed to work fine while the RPC transport was based on Akka) was to open a TCP connection between the driver and each external shuffle service. Once the driver went down (gracefully or by crashing), the shuffle service would eventually get a notification from the network layer and delete the corresponding files.
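
The mechanism can be sketched as follows (a minimal, self-contained illustration; the class name, handshake, and directory layout are hypothetical, not Spark's actual implementation):

{code}
import java.io.{BufferedReader, File, InputStreamReader}
import java.net.ServerSocket

// Sketch: the shuffle service holds one TCP connection per driver and treats
// EOF on that connection as "the application has exited, delete its files".
object ShuffleCleanupSketch {
  def main(args: Array[String]): Unit = {
    val server = new ServerSocket(7337) // default external shuffle service port
    while (true) {
      val driverConn = server.accept()
      new Thread(new Runnable {
        def run(): Unit = {
          val in = new BufferedReader(
            new InputStreamReader(driverConn.getInputStream))
          // Hypothetical handshake: the driver sends its application id first.
          val appId = in.readLine()
          // Block until the driver closes the connection; readLine() returns
          // null on EOF. While the connection is open but idle, we wait here.
          while (in.readLine() != null) {}
          // The driver went away, gracefully or not: remove its shuffle files.
          deleteRecursively(new File(s"/tmp/shuffle/$appId"))
          driverConn.close()
        }
      }).start()
    }
  }

  private def deleteRecursively(f: File): Unit = {
    Option(f.listFiles()).foreach(_.foreach(deleteRecursively))
    f.delete()
  }
}
{code}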

This solution stopped working because it relies on an idle connection staying open indefinitely, while the new Netty-based RPC layer closes connections that have been idle for {{spark.network.timeout}}.
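
In other words, nothing ever travels over this connection, so to the transport layer it is indistinguishable from a dead one and gets reaped after the idle timeout. A driver-side keep-alive that periodically generates traffic would defeat that timeout; a hypothetical sketch follows (host, port, and wire format are assumptions, and this is not Spark's actual fix):

{code}
import java.net.Socket
import java.util.concurrent.{Executors, TimeUnit}

// Hypothetical driver-side keep-alive: write a no-op byte on a schedule so the
// connection to the shuffle service never looks idle to the transport layer.
object DriverKeepAliveSketch {
  def main(args: Array[String]): Unit = {
    val shuffleService = new Socket("shuffle-service-host", 7337) // assumed host
    val out = shuffleService.getOutputStream
    val scheduler = Executors.newSingleThreadScheduledExecutor()
    val heartbeat = new Runnable {
      def run(): Unit = {
        out.write(0) // the only purpose of this byte is to generate traffic
        out.flush()
      }
    }
    // Fire well within the 120s default of spark.network.timeout.
    scheduler.scheduleAtFixedRate(heartbeat, 0L, 60L, TimeUnit.SECONDS)
  }
}
{code}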


