Posted to dev@nemo.apache.org by GitBox <gi...@apache.org> on 2018/08/20 05:10:02 UTC

[GitHub] sanha opened a new pull request #111: [NEMO-139, 6] Logic in the scheduler for appending jobs, Support RDD caching

URL: https://github.com/apache/incubator-nemo/pull/111
 
 
   JIRA: [NEMO-139: Logic in the scheduler for appending jobs](https://issues.apache.org/jira/projects/NEMO/issues/NEMO-139)
   JIRA: [NEMO-6: Support RDD caching](https://issues.apache.org/jira/projects/NEMO/issues/NEMO-6)
   
   **Major changes:**
   - add logic in the scheduler for appending plans (NEMO-139)
     - implement `PlanAppender`, which appends a submitted `PhysicalPlan` to an original `PhysicalPlan`
     - refactor `PlanStateManager`, `BatchScheduler`, `BlockManagerMaster`, and `TaskDispatcher` to reflect that all plans from a single job are appended to a single `PhysicalPlan` through `PlanAppender`
   - support RDD caching (NEMO-6)
     - add `CacheIdProperty` property and `GhostProperty`
       - When a Spark user program calls `cache()` or `persist()` on an `RDD`, the RDD creates a ghost vertex and connects the vertex holding the `RDD` to the ghost vertex. This edge to the ghost vertex is annotated with a cache ID (`cacheIdProperty`). When a plan with this edge is executed in our runtime, the data to cache is stored in the edge in the required `StorageLevel` format. (Our runtime needs no extra feature to produce or sustain this data.)
       - When the `BatchScheduler` encounters a task that is annotated with the `GhostProperty`, the task is not scheduled but is simply regarded as completed.
     - implement `Optimizer`, which conducts optimization by using `OptimizationPass`es, separately from our `UserApplicationRunner` to separate the roles.
       - When an IR DAG that contains an edge with `cacheIdProperty` is submitted and a previously executed IR DAG contains an edge with the identical `cacheIdProperty`, the `Optimizer` crops the part of the IR DAG before the cache edge and adds a `CachedSourceVertex` before the edge.
     - make `PlanAppender` properly handle the caching
       - Make `PlanAppender` append the `PhysicalPlan` constructed from the cropped IR DAG with the caching edge to the original `PhysicalPlan`, and add a new edge from the vertex that has the actual edge to a ghost vertex, to the new `CachedSourceVertex`. At runtime, when the `CachedSourceVertex` requires the data, the cached data that was produced and stored in the edge to the ghost vertex is read through our `DuplicateEdgeGroupProperty` logic.
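
The cropping and ghost-skipping behavior described above can be illustrated with a toy model. This is only a sketch under simplifying assumptions: the DAG is reduced to a linear pipeline, and `CachingSketch`, `Vertex`, `crop`, and `schedule` are hypothetical names invented for illustration, not Nemo's actual classes or signatures.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model only; these are not Nemo's actual classes or signatures. */
public class CachingSketch {

  /** A vertex in a simplified, linear pipeline. */
  public static final class Vertex {
    public final String name;
    public final String cacheId; // hypothetical stand-in for the cache ID annotation; null if not cached
    public final boolean ghost;  // hypothetical stand-in for the ghost annotation

    public Vertex(String name, String cacheId, boolean ghost) {
      this.name = name;
      this.cacheId = cacheId;
      this.ghost = ghost;
    }
  }

  /**
   * Drops everything up to and including the last vertex whose output was
   * already cached by an earlier plan, and puts a cached-source vertex in its place.
   */
  public static List<String> crop(List<Vertex> pipeline, Set<String> cachedIds) {
    int cut = -1;
    for (int i = 0; i < pipeline.size(); i++) {
      String id = pipeline.get(i).cacheId;
      if (id != null && cachedIds.contains(id)) {
        cut = i;
      }
    }
    List<String> result = new ArrayList<>();
    if (cut >= 0) {
      result.add("CachedSource(" + pipeline.get(cut).cacheId + ")");
    }
    for (int i = cut + 1; i < pipeline.size(); i++) {
      result.add(pipeline.get(i).name);
    }
    return result;
  }

  /** Ghost vertices are never dispatched; they are recorded as complete right away. */
  public static Map<String, String> schedule(List<Vertex> vertices) {
    Map<String, String> states = new LinkedHashMap<>();
    for (Vertex v : vertices) {
      states.put(v.name, v.ghost ? "COMPLETE" : "READY");
    }
    return states;
  }

  public static void main(String[] args) {
    List<Vertex> job = Arrays.asList(
        new Vertex("read", null, false),
        new Vertex("shuffle", "cache-0", false),
        new Vertex("count", null, false));
    // Nothing cached yet: the plan runs in full.
    System.out.println(crop(job, Collections.<String>emptySet()));
    // "cache-0" was produced by an earlier plan: the prefix is replaced.
    System.out.println(crop(job, Collections.singleton("cache-0")));
  }
}
```

With an empty cache set the pipeline is left untouched; once `cache-0` is known, `read` and `shuffle` are replaced by a single cached-source entry. The real change operates on IR DAGs and `PhysicalPlan`s rather than lists, so this only conveys the decision being made.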
   
   **Minor changes to note:**
   - N/A.
   
   **Tests for the changes:**
   - add an integration test for the `SparkCachingWordCount` application
       - `SparkCachingWordCount` caches shuffled data and uses the cached data to calculate which keys have identical counts.
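
For readers unfamiliar with the test application, the computation it checks can be sketched in plain Java without Spark. This is one plausible reading of "which keys have identical counts", not the application's actual code; `IdenticalCountSketch` and its methods are hypothetical names, and the in-memory `counts` map merely stands in for the cached shuffle result.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java illustration; the real test application runs on Spark. */
public class IdenticalCountSketch {

  /** Counts occurrences of each word (stands in for the cached shuffle result). */
  public static Map<String, Long> wordCount(List<String> words) {
    Map<String, Long> counts = new TreeMap<>();
    for (String word : words) {
      counts.merge(word, 1L, Long::sum);
    }
    return counts;
  }

  /** Groups keys by count and keeps only the counts shared by two or more keys. */
  public static Map<Long, List<String>> keysWithIdenticalCount(Map<String, Long> counts) {
    Map<Long, List<String>> byCount = new TreeMap<>();
    for (Map.Entry<String, Long> e : new TreeMap<>(counts).entrySet()) {
      byCount.computeIfAbsent(e.getValue(), k -> new ArrayList<>()).add(e.getKey());
    }
    byCount.values().removeIf(keys -> keys.size() < 2);
    return byCount;
  }

  public static void main(String[] args) {
    // The counts are computed once and reused, as a cached RDD would be.
    Map<String, Long> counts = wordCount(Arrays.asList("a", "b", "a", "b", "c"));
    System.out.println(counts);                         // {a=2, b=2, c=1}
    System.out.println(keysWithIdenticalCount(counts)); // {2=[a, b]}
  }
}
```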
   
   **Other comments:**
   - I'm sorry for the delay. This issue is part of our first release and the target due date was August 16th, but it was delayed by the need to resolve conflicts.
   
   resolves [NEMO-139](https://issues.apache.org/jira/projects/NEMO/issues/NEMO-139)
   resolves [NEMO-6](https://issues.apache.org/jira/projects/NEMO/issues/NEMO-6)
   
