Posted to user@spark.apache.org by Ji Yan <ji...@drive.ai> on 2017/02/16 18:34:14 UTC

Will Spark ever run the same task at the same time

Dear spark users,

Is there any mechanism in Spark that does not guarantee idempotent
execution? For example, for stranglers, the framework might start another
task, assuming the strangler is slow, while the strangler is still running.
This would be annoying sometimes when, say, the task is writing to a file,
and having the same task running twice at the same time may corrupt the
file. From the documentation page, I know that Spark's speculative
execution mode is turned off by default. Does anyone know of any other
mechanism in Spark that may cause problems in a scenario like this?

Thanks
Ji


Re: Will Spark ever run the same task at the same time

Posted by Steve Loughran <st...@hortonworks.com>.
> On 16 Feb 2017, at 18:34, Ji Yan <ji...@drive.ai> wrote:
> 
> Dear spark users,
> 
> Is there any mechanism in Spark that does not guarantee idempotent execution? For example, for stranglers, the framework might start another task, assuming the strangler is slow, while the strangler is still running. This would be annoying sometimes when, say, the task is writing to a file, and having the same task running twice at the same time may corrupt the file. From the documentation page, I know that Spark's speculative execution mode is turned off by default. Does anyone know of any other mechanism in Spark that may cause problems in a scenario like this?

It's not so much "two tasks writing to the same file" as "two tasks writing to different places, with the work renamed into place at the end".

Speculation is the key case where there's more than one writer, though they do write to different directories; the Spark commit protocol guarantees that only the committed task gets its work into the final output.
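
For reference, a minimal sketch of turning speculation on and tuning when a second attempt gets launched (these are standard Spark config keys; the values shown are just examples):

    // Speculation is off by default. Turning it on means a straggling task
    // can get a second attempt launched while the first is still running.
    val conf = new org.apache.spark.SparkConf()
      .set("spark.speculation", "true")            // default: false
      .set("spark.speculation.interval", "100ms")  // how often to check for stragglers
      .set("spark.speculation.multiplier", "1.5")  // how much slower than the median counts as straggling
      .set("spark.speculation.quantile", "0.75")   // fraction of tasks that must finish before checking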

Some failure modes *may* have more than one executor running the same work, right up to the point where the task commit operation is started. More specifically, a network partition may cause an executor to lose touch with the driver, and the driver to pass the same task on to another executor while the existing executor keeps going. It's when that first executor tries to commit the data that you get the guarantee that the work doesn't get committed (no connectivity => no commit; connectivity resumed => the driver will tell the executor it's been aborted).

If you are working with files outside of the task's working directory, then the outcome of a failure will be "undefined". The FileCommitProtocol lets you ask for a temp file which is rename()d to the destination in the commit. Use this and the files will only appear once the task is committed. Even then, there is a small but non-zero chance that the commit may fail partway through, in which case the outcome is, as they say, "undefined". Avoid that today by not manually adding custom partitions to data sources in your Hive metastore. 
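
As a rough sketch of that write-to-a-temp-location-then-rename pattern, using the Hadoop FileSystem API directly rather than Spark's internal FileCommitProtocol (the paths here are made up for illustration):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    val fs = FileSystem.get(new Configuration())

    // Each task attempt writes to its own attempt-specific temp file...
    val tmp  = new Path("/output/_temporary/attempt_000_0/part-00000")
    val dest = new Path("/output/part-00000")
    val out  = fs.create(tmp, true)   // overwrite any stale attempt of the same task
    out.write("task output\n".getBytes("UTF-8"))
    out.close()

    // ...and only the attempt that wins the commit renames it into place.
    // A speculative or partitioned-off attempt never gets to run this rename.
    fs.rename(tmp, dest)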

Steve




---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Fwd: Will Spark ever run the same task at the same time

Posted by Mark Hamstra <ma...@clearstorydata.com>.
First, the word you are looking for is "straggler", not "strangler" -- very
different words. Second, "idempotent" doesn't mean "only happens once", but
rather "if it does happen more than once, the effect is no different than
if it only happened once".
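
To make that concrete with a small, made-up file-writing example:

    import java.nio.file.{Files, Paths, StandardOpenOption}

    // Idempotent: overwriting a fixed path twice leaves the same state as doing it once.
    Files.write(Paths.get("/tmp/example-result"), "final value\n".getBytes("UTF-8"))

    // Not idempotent: appending twice leaves two records instead of one.
    Files.write(Paths.get("/tmp/example-log"), "one more record\n".getBytes("UTF-8"),
      StandardOpenOption.CREATE, StandardOpenOption.APPEND)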

It is possible to insert a nearly limitless variety of side-effecting code
into Spark Tasks, and there is no guarantee from Spark that such code will
execute idempotently. Speculation is one way that a Task can run more than
once, but it is not the only way. A simple FetchFailure (from a lost
Executor or another reason) will mean that a Task has to be re-run in order
to re-compute the missing outputs from a prior execution. In general, Spark
will run a Task as many times as needed to satisfy the requirements of the
Jobs it is requested to fulfill, and you can assume neither that a Task
will run only once nor that it will execute idempotently (unless, of
course, it is side-effect free). Guaranteeing idempotency requires a higher
level coordinator with access to information on all Task executions. The
OutputCommitCoordinator handles that guarantee for HDFS writes, and the
JIRA discussion associated with the introduction of
the OutputCommitCoordinator covers most of the design issues:
https://issues.apache.org/jira/browse/SPARK-4879
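
A minimal sketch of that difference, with made-up names and paths (the point is where the write happens, not the particular API):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("idempotency-sketch").getOrCreate()
    import spark.implicits._

    val nums = spark.sparkContext.parallelize(1 to 100)

    // Side-effecting task: if this task is re-run (speculation, FetchFailure,
    // lost executor, ...), the external system sees the same rows again.
    nums.foreachPartition { iter =>
      iter.foreach(n => println(s"pretend POST of $n to an external service"))
    }

    // Writing through the committed output path instead lets the commit
    // protocol / OutputCommitCoordinator decide which attempt's output
    // becomes visible, so re-runs don't leave duplicate output behind.
    nums.toDF("n").write.mode("overwrite").parquet("/tmp/idempotency-sketch")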

On Thu, Feb 16, 2017 at 10:34 AM, Ji Yan <ji...@drive.ai> wrote:

> Dear spark users,
>
> Is there any mechanism in Spark that does not guarantee idempotent
> execution? For example, for stranglers, the framework might start another
> task, assuming the strangler is slow, while the strangler is still running.
> This would be annoying sometimes when, say, the task is writing to a file,
> and having the same task running twice at the same time may corrupt the
> file. From the documentation page, I know that Spark's speculative
> execution mode is turned off by default. Does anyone know of any other
> mechanism in Spark that may cause problems in a scenario like this?
>
> Thanks
> Ji
>