Posted to issues@spark.apache.org by "Patrick Wendell (JIRA)" <ji...@apache.org> on 2014/10/26 19:08:33 UTC

[jira] [Commented] (SPARK-2532) Fix issues with consolidated shuffle

    [ https://issues.apache.org/jira/browse/SPARK-2532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14184576#comment-14184576 ] 

Patrick Wendell commented on SPARK-2532:
----------------------------------------

Hey [~matei] - you created some sub-tasks here that are pretty tersely described... would you mind looking through them and deciding whether these are still relevant? Not sure whether we can close this.

> Fix issues with consolidated shuffle
> ------------------------------------
>
>                 Key: SPARK-2532
>                 URL: https://issues.apache.org/jira/browse/SPARK-2532
>             Project: Spark
>          Issue Type: Bug
>          Components: Shuffle, Spark Core
>    Affects Versions: 1.1.0
>         Environment: All
>            Reporter: Mridul Muralidharan
>            Assignee: Mridul Muralidharan
>            Priority: Critical
>
> Will file a PR with the changes as soon as the merge is done (the earlier merge became outdated within 2 weeks, unfortunately :) ).
> Consolidated shuffle is broken in multiple ways in Spark:
> a) Task failure(s) can cause the state to become inconsistent.
> b) Multiple reverts, or a combination of close/revert/close, can leave the state inconsistent (as part of exception/error handling).
> c) Some of the API in the block writer causes implementation issues - for example, a revert is always followed by a close, but the implementation tries to keep them separate, creating extra surface for errors.
> d) Fetching data from consolidated shuffle files can go badly wrong if the file is being actively written to: the segment length is computed by subtracting the current offset from the next offset (or from the file length if this is the last offset); the latter fails when a fetch happens in parallel with a write (see the sketch below).
> Note that this happens even if there are no task failures of any kind!
> This usually results in stream corruption or decompression errors.
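> A minimal Scala sketch of the offset arithmetic described in (d). The names (FileGroup, segmentLength) are illustrative assumptions, not Spark's actual shuffle classes; the point is only how computing the last segment's length from the file length goes wrong while a writer is still appending.
>
> // Hypothetical model of one consolidated shuffle file holding several segments.
> object ConsolidatedSegmentSketch {
>   // offsets(i) is where map task i's output begins inside the file.
>   final case class FileGroup(offsets: IndexedSeq[Long], fileLength: Long) {
>     // Segment length = next offset - current offset, or
>     // file length - current offset for the last recorded segment.
>     def segmentLength(i: Int): Long =
>       if (i < offsets.length - 1) offsets(i + 1) - offsets(i)
>       else fileLength - offsets(i)
>   }
>
>   def main(args: Array[String]): Unit = {
>     // Two completed segments of 100 and 50 bytes.
>     val completed = FileGroup(offsets = Vector(0L, 100L), fileLength = 150L)
>     println(completed.segmentLength(1)) // 50, correct
>
>     // A third segment is now being appended: the file length already includes
>     // 30 partially written bytes, so the last segment's length is overestimated
>     // and a fetch running in parallel with the write reads past the real data,
>     // which shows up as stream corruption or decompression errors.
>     val beingWritten = completed.copy(fileLength = 180L)
>     println(beingWritten.segmentLength(1)) // 80, wrong
>   }
> }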



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org