
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/08/25 16:23:34 UTC

[GitHub] [spark] Ngone51 opened a new pull request #25577: [WIP][CORE][SPARK-28867] InMemoryStore checkpoint to speed up replay log file in HistoryServer

Ngone51 opened a new pull request #25577: [WIP][CORE][SPARK-28867] InMemoryStore checkpoint to speed up replay log file in HistoryServer
URL: https://github.com/apache/spark/pull/25577

### What changes were proposed in this pull request?

This PR aims to improve replay performance in HistoryServer by periodically checkpointing the InMemoryStore of an in-progress application, which enables incremental replay. The main idea is that, for an in-progress application, we periodically (normally every N events) checkpoint the InMemoryStore, together with the number of processed events (X), into the event log directory. HistoryServer then reconstructs the InMemoryStore from the checkpoint file and reads X, so it can skip the first X events when replaying the log file and continue from the partial InMemoryStore. Note that the live entities in AppStatusListener must also be recovered from the partial InMemoryStore for incremental replay to work. For a completed application, HistoryServer can simply reconstruct the InMemoryStore and does not need to replay at all.
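
The checkpoint-and-skip idea described above can be illustrated with a minimal Python sketch. It is not the PR's implementation: `Store`, `write_checkpoint`, and `replay` are hypothetical names, the store is reduced to a per-event-type counter, JSON stands in for whatever serialization the PR uses, and the recovery of live entities in AppStatusListener is not modeled.

```python
import json
from typing import List, Optional, Tuple


class Store:
    """Hypothetical stand-in for InMemoryStore: counts events per type."""

    def __init__(self, counts: Optional[dict] = None) -> None:
        self.counts = dict(counts or {})

    def apply(self, event: str) -> None:
        self.counts[event] = self.counts.get(event, 0) + 1


def write_checkpoint(store: Store, processed: int) -> str:
    # Persist the store together with X, the number of events already applied.
    return json.dumps({"processed": processed, "counts": store.counts})


def replay(events: List[str], checkpoint: Optional[str] = None) -> Tuple[Store, int]:
    """Replay events, skipping the first X when a checkpoint is supplied.

    Returns the resulting store and the number of events actually replayed.
    """
    if checkpoint is None:
        store, skip = Store(), 0
    else:
        data = json.loads(checkpoint)
        store, skip = Store(data["counts"]), data["processed"]
    replayed = 0
    for event in events[skip:]:
        store.apply(event)
        replayed += 1
    return store, replayed


# Illustrative event log; the event names are made up for this sketch.
events = ["jobStart", "taskEnd", "taskEnd", "jobEnd", "taskEnd"]

# The running application checkpoints after its first 3 events.
live = Store()
for e in events[:3]:
    live.apply(e)
ckpt = write_checkpoint(live, 3)

full_store, full_replayed = replay(events)        # full replay
incr_store, incr_replayed = replay(events, ckpt)  # incremental replay
```

In this sketch both paths end in the same store state, but the incremental path only processes the events that arrived after the checkpoint.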

In this PR, we focus only on handling InMemoryStore in HistoryServer; LevelDB is planned to be handled in a similar way in a follow-up PR.

A basic experiment on a completed application with 20055 events shows the improvement from this optimization:

without optimization | with optimization (deserialization time in parentheses)
:-: | :-:
4343 | 78(70)
4512 | 92(85)
4475 | 74(68)
4254 | 93(78)
4126 | 81(71)
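
To put the table in perspective, the following small Python snippet averages the five runs and derives the speedup ratio. The numbers are copied from the table; the variable names are illustrative, and because the time unit is not stated here, only ratios are computed. It also assumes the parenthesized value is the part of the optimized time spent on deserialization.

```python
from statistics import mean

# (without optimization, with optimization, deserialization part)
runs = [
    (4343, 78, 70),
    (4512, 92, 85),
    (4475, 74, 68),
    (4254, 93, 78),
    (4126, 81, 71),
]

baseline = mean(r[0] for r in runs)   # average time without the optimization
optimized = mean(r[1] for r in runs)  # average time with the optimization
deser = mean(r[2] for r in runs)      # average deserialization time

speedup = baseline / optimized        # how many times faster on average
deser_share = deser / optimized       # fraction of optimized time spent deserializing

print(f"speedup: {speedup:.1f}x, deserialization share: {deser_share:.0%}")
```

On these five runs the optimized path is roughly 50 times faster on average, and deserialization accounts for most of the remaining time.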

Work TODO

- [ ] Compression support when checkpointing InMemoryStore
- [ ] More accurate conversion from wrapper data to live entities
- [ ] Checkpoint file cleanup in HistoryServer
- [ ] Overcome the frequent StackOverflowError during deserialization
- [ ] Unit tests

### Why are the changes needed?

This change is needed because HistoryServer can currently be very slow when replaying a large log file for the first time, and it re-replays an in-progress log file in full each time the file changes, which is inefficient.

### Does this PR introduce any user-facing change?

Yes, but only for users who opt in to this optimization through several new configurations.

### How was this patch tested?

Only tested manually so far; this is still a work in progress.

