Posted to issues@spark.apache.org by "Enrico Minack (Jira)" <ji...@apache.org> on 2020/01/28 19:26:00 UTC

[jira] [Updated] (SPARK-30666) Reliable single-stage accumulators

     [ https://issues.apache.org/jira/browse/SPARK-30666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Enrico Minack updated SPARK-30666:
----------------------------------
    Description: 
This proposes a pragmatic improvement to allow for reliable single-stage accumulators. Under the assumption that a given stage / partition / rdd produces identical results, non-deterministic code incrementing accumulators also produces identical accumulator increments on success. Rerunning partitions for any reason should always produce the same increments on success.

With this pragmatic approach, increments from individual partitions / tasks are compared to earlier increments. Depending on how a new increment replaces or combines with an earlier increment from the same partition, different accumulator semantics (here called accumulator modes) can be implemented:
 - ALL sums over all increments of each partition: this represents the current implementation of accumulators
 - MAX over all increments of each partition: assuming accumulators only increment while a partition is processed, a successful task provides an accumulator value that is always larger than any value of failed tasks, hence it supersedes any failed task's value. This produces reliable accumulator values. This should only be used in a single stage.
 - LAST increment: retrieves only the latest increment for each partition.

The implementation for MAX and LAST requires extra memory that scales with the number of partitions. The current ALL implementation does not require extra memory.
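
The three modes can be illustrated with a small stand-alone sketch. This is not Spark code and does not reflect the actual patch: the names AccumulatorMode and merge_increments are hypothetical, and the model simply assumes that every task attempt reports one (partition id, increment) pair in arrival order.

```python
from enum import Enum
from typing import Dict, Iterable, Tuple


class AccumulatorMode(Enum):
    """Hypothetical names for the three modes described in this issue."""
    ALL = "all"    # sum every reported increment (current behaviour)
    MAX = "max"    # keep the largest increment seen per partition
    LAST = "last"  # keep the most recent increment per partition


def merge_increments(mode: AccumulatorMode,
                     updates: Iterable[Tuple[int, int]]) -> int:
    """Combine (partition_id, increment) pairs, given in arrival order."""
    if mode is AccumulatorMode.ALL:
        # No per-partition state is needed: everything is summed.
        return sum(increment for _, increment in updates)

    # MAX and LAST keep one value per partition, so memory grows
    # with the number of partitions.
    per_partition: Dict[int, int] = {}
    for partition_id, increment in updates:
        if mode is AccumulatorMode.MAX:
            previous = per_partition.get(partition_id, increment)
            per_partition[partition_id] = max(previous, increment)
        else:  # LAST: a newer increment overwrites the earlier one
            per_partition[partition_id] = increment
    return sum(per_partition.values())


# Assumed scenario: partition 0 fails after counting 3 and its rerun counts 5;
# partition 1 succeeds with 4, then a later attempt fails after counting 2.
updates = [(0, 3), (0, 5), (1, 4), (1, 2)]

print(merge_increments(AccumulatorMode.ALL, updates))   # 14
print(merge_increments(AccumulatorMode.MAX, updates))   # 9
print(merge_increments(AccumulatorMode.LAST, updates))  # 7
```

In this scenario the successful attempts counted 5 and 4, so only MAX yields the expected total of 9. ALL double-counts the failed attempts, and LAST depends on which attempt happened to report last.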

  was:
This proposes a pragmatic improvement to allow for reliable single-stage accumulators. Under the assumption that a given stage / partition / rdd produces identical results, non-deterministic code incrementing accumulators also produces identical accumulator increments on success. Rerunning partitions for any reason should always produce the same increments on success.

With this pragmatic approach, increments from individual partitions / tasks are compared to earlier increments. Depending on how a new increment replaces or combines with an earlier increment from the same partition, different accumulator semantics (here called accumulator modes) can be implemented:
 - SUM over all increments of each partition: this represents the current implementation of accumulators
 - MAX over all increments of each partition: assuming accumulators only increment while a partition is processed, a successful task provides an accumulator value that is always larger than any value of failed tasks, hence it supersedes any failed task's value. This produces reliable accumulator values. This should only be used in a single stage.
 - LAST increment: retrieves only the latest increment for each partition.

The implementation for MAX and LAST requires extra memory that scales with the number of partitions. The current SUM implementation does not require extra memory.


> Reliable single-stage accumulators
> ----------------------------------
>
>                 Key: SPARK-30666
>                 URL: https://issues.apache.org/jira/browse/SPARK-30666
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Enrico Minack
>            Priority: Major
>
> This proposes a pragmatic improvement to allow for reliable single-stage accumulators. Under the assumption that a given stage / partition / rdd produces identical results, non-deterministic code incrementing accumulators also produces identical accumulator increments on success. Rerunning partitions for any reason should always produce the same increments on success.
> With this pragmatic approach, increments from individual partitions / tasks are compared to earlier increments. Depending on how a new increment replaces or combines with an earlier increment from the same partition, different accumulator semantics (here called accumulator modes) can be implemented:
>  - ALL sums over all increments of each partition: this represents the current implementation of accumulators
>  - MAX over all increments of each partition: assuming accumulators only increment while a partition is processed, a successful task provides an accumulator value that is always larger than any value of failed tasks, hence it supersedes any failed task's value. This produces reliable accumulator values. This should only be used in a single stage.
>  - LAST increment: retrieves only the latest increment for each partition.
> The implementation for MAX and LAST requires extra memory that scales with the number of partitions. The current ALL implementation does not require extra memory.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
