You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by "JoshRosen (via GitHub)" <gi...@apache.org> on 2023/09/28 21:32:17 UTC

[GitHub] [spark] JoshRosen opened a new pull request, #39767: [SPARK-42205][CORE] Don't log accumulator values in stage / task start event logs

JoshRosen opened a new pull request, #39767:
URL: https://github.com/apache/spark/pull/39767

### What changes were proposed in this pull request?

This PR modifies JsonProtocol in order to skip logging of accumulator values in the logs for SparkListenerTaskStart and SparkListenerStageSubmitted events.

The SparkListenerTaskStart and SparkListenerStageSubmitted events contain mutable TaskInfo and StageInfo objects, which in turn contain Accumulables fields. When a task or stage is submitted, Accumulables is initially empty. When the task or stage finishes, this field is updated with values from the task.

If a task or stage finishes _before_ the start event has been logged by the event logging listener then the start event will contain the Accumulable values from the task or stage end event.

This PR updates JsonProtocol to log an empty Accumulables value for stage and task start events.

I considered and rejected an alternative approach where the listener event itself would contain an immutable snapshot of the TaskInfo or StageInfo, as this will increase memory pressure on the driver during periods of heavy event logging.

Those accumulables values in the start events are not used: I confirmed this by checking AppStatusListener and SQLAppStatusListener code.

I have deliberately chosen to **not** drop the field for _job_ start events because it is technically possible (but rare) for a job to reference stages that are completed at the time that the job is submitted (a state can technically belong to multiple jobs) and in that case it seems consistent to have the StageInfo accurately reflect all of the information about the already-completed stage.

### Why are the changes needed?

This information isn't used by the History Server and contributes to wasteful bloat in event log sizes. In one real-world log, I found that ~10% of the uncompressed log size was due to these redundant Accumulable fields in stage and task start events.

I don't think that we need to worry about backwards-compatibility here because the old behavior was non-deterministic: whether or not a start event log contained accumulator updates was a function of the relative speed of task completion and the processing rate of the event logging listener; it seems unlikely that any third-party event log consumers would be relying on such an inconsistently present value when they could instead rely on the values in the corresponding end events.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New and updated tests in JsonProtocolSuite.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] github-actions[bot] commented on pull request #39767: [SPARK-42205][CORE] Don't log accumulator values in stage / task start event logs

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.

github-actions[bot] commented on PR #39767:
URL: https://github.com/apache/spark/pull/39767#issuecomment-1537580440

   We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] JoshRosen commented on pull request #39767: [SPARK-42205][CORE] Don't log accumulator values in stage / task start and getting result event logs

Posted by "JoshRosen (via GitHub)" <gi...@apache.org>.

JoshRosen commented on PR #39767:
URL: https://github.com/apache/spark/pull/39767#issuecomment-1741321217

   Merged for Spark 4.0.0. Thanks everyone!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] github-actions[bot] closed pull request #39767: [SPARK-42205][CORE] Don't log accumulator values in stage / task start event logs

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.

github-actions[bot] closed pull request #39767: [SPARK-42205][CORE] Don't log accumulator values in stage / task start event logs
URL: https://github.com/apache/spark/pull/39767


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] mridulm commented on a diff in pull request #39767: [SPARK-42205][CORE] Don't log accumulator values in stage / task start event logs

Posted by "mridulm (via GitHub)" <gi...@apache.org>.

mridulm commented on code in PR #39767:
URL: https://github.com/apache/spark/pull/39767#discussion_r1089412418


##########
core/src/main/scala/org/apache/spark/util/JsonProtocol.scala:
##########
@@ -163,7 +165,7 @@ private[spark] object JsonProtocol {
     g.writeStartObject()
     g.writeStringField("Event", SPARK_LISTENER_EVENT_FORMATTED_CLASS_NAMES.taskGettingResult)
     g.writeFieldName("Task Info")
-    taskInfoToJson(taskInfo, g)
+    taskInfoToJson(taskInfo, g, includeAccumulables = true)

Review Comment:
   Why is this `true` ? task end would be fired which would have accumulables



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] JoshRosen closed pull request #39767: [SPARK-42205][CORE] Don't log accumulator values in stage / task start and getting result event logs

Posted by "JoshRosen (via GitHub)" <gi...@apache.org>.

JoshRosen closed pull request #39767: [SPARK-42205][CORE] Don't log accumulator values in stage / task start and getting result event logs
URL: https://github.com/apache/spark/pull/39767


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] JoshRosen commented on a diff in pull request #39767: [SPARK-42205][CORE] Don't log accumulator values in stage / task start event logs

Posted by "JoshRosen (via GitHub)" <gi...@apache.org>.

JoshRosen commented on code in PR #39767:
URL: https://github.com/apache/spark/pull/39767#discussion_r1340709730


##########
core/src/main/scala/org/apache/spark/util/JsonProtocol.scala:
##########
@@ -163,7 +165,7 @@ private[spark] object JsonProtocol {
     g.writeStartObject()
     g.writeStringField("Event", SPARK_LISTENER_EVENT_FORMATTED_CLASS_NAMES.taskGettingResult)
     g.writeFieldName("Task Info")
-    taskInfoToJson(taskInfo, g)
+    taskInfoToJson(taskInfo, g, includeAccumulables = true)

Review Comment:
   Good catch:
   
   I think that this should actually be `false` because the presence of non-empty accumulables in `SparkListenerTaskGettingResult` is prone to the same non-deterministic race as in `SparkListenerTaskStart`.
   
   Tracing through the call chain that leads to the event being posted:
   
   - [In TaskResultGetter](https://github.com/apache/spark/blob/b3d5bc0c10908aa66510844eaabc43b6764dd7c0/core/src/main/scala/org/apache/spark/scheduler/TaskResultGetter.scala#L87) `TaskScheduler.handleTaksGettingResult` is called when an `IndirectTaskResult` fetch begins.
   - [The TaskSchedulerImpl calls the handler on the TaskSetManager](https://github.com/apache/spark/blob/b3d5bc0c10908aa66510844eaabc43b6764dd7c0/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L924-L926)
   - [The TaskSetManager enqueues an event on the DAGScheduler event loop](https://github.com/apache/spark/blob/b3d5bc0c10908aa66510844eaabc43b6764dd7c0/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L753-L760)
   - [The DAGScheduler loop enqueues the listener event](https://github.com/apache/spark/blob/b3d5bc0c10908aa66510844eaabc43b6764dd7c0/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1262-L1264)
   
   These steps always take place prior to the publishing of the SparkListenerTaskEnd event because that event is published after `TaskScheduler.handleSuccessfulTask` or `TaskScheduler.handleFailedTask` are called, both of which take place after the indirect task result fetching starts.
   
   Given this, I think we should exclude this for `SparkListenerTaskGettingResult` as well.
   
   `SparkListenerTaskGettingResult` is somewhat rare event compared to task starts and ends because most tasks don't use the indirect task result path. However, there are rare cases where tasks can have such large accumulator updates that we can wind up with many / most tasks emitting IndirectTaskResults. In those cases, dropping the accumulables could be a size win.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org