Posted to dev@nemo.apache.org by GitBox <gi...@apache.org> on 2018/03/07 13:22:26 UTC

[GitHub] sanha opened a new pull request #5: [NEMO-27] Element Wise Block Write

sanha opened a new pull request #5: [NEMO-27] Element Wise Block Write
URL: https://github.com/apache/incubator-nemo/pull/5
 
 
   JIRA: [NEMO-27: Element Wise Block Write](https://issues.apache.org/jira/browse/NEMO-27)
   
   **Major changes:**
   - Changed the interface of `Partitioner` to accept a single data element (instead of an `Iterable`) and return its key.
   
   - Made `OutputWriter` write data to `BlockStore` element-by-element.
       - When a writer starts to write data element-wise to a `Block`, the `Block` manages its `Partition`s as the write proceeds. If an element's key appears for the first time, the `Block` creates a `Partition` for that key. Otherwise, the element is appended to the existing `Partition`.
       - When the writer finishes writing and declares the `Block` committed, the `Block` commits all of its `Partition`s and prohibits any further writes.
       - The previous partition-level write method is still used for inter-`BlockStore` data movement. (For example, when we spill data from `SerializedMemoryStore` to `LocalFileStore`, we have to be able to fetch and write data in units of `Partition` to avoid extra (de)serialization.)
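
The element-wise write path described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `ElementPartitioner`, `SketchBlock`, and `writeAll` are hypothetical names, partitions are kept as plain in-memory lists, and serialization and the `BlockStore` itself are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ElementWiseBlockSketch {
  /** Hypothetical stand-in for the new partitioner shape: one element in, its key out. */
  interface ElementPartitioner<T> {
    int partition(T element);
  }

  /** Hypothetical, simplified block that creates partitions lazily as keys appear. */
  static final class SketchBlock<T> {
    private final Map<Integer, List<T>> partitions = new LinkedHashMap<>();
    private boolean committed = false;

    /** Appends the element to the partition for its key, creating the partition on first use. */
    void write(int key, T element) {
      if (committed) {
        throw new IllegalStateException("Block is committed; no further writes allowed");
      }
      partitions.computeIfAbsent(key, k -> new ArrayList<>()).add(element);
    }

    /** Marks the block as committed so that later writes are rejected. */
    void commit() {
      committed = true;
    }

    Map<Integer, List<T>> getPartitions() {
      return Collections.unmodifiableMap(partitions);
    }
  }

  /** Writes every element through the partitioner, then commits the block. */
  static <T> void writeAll(SketchBlock<T> block, ElementPartitioner<T> partitioner,
                           Iterable<T> elements) {
    for (T element : elements) {
      block.write(partitioner.partition(element), element);
    }
    block.commit();
  }

  public static void main(String[] args) {
    // Assumed toy partitioner: key is the string length modulo 3.
    ElementPartitioner<String> byLength = s -> s.length() % 3;
    SketchBlock<String> block = new SketchBlock<>();
    writeAll(block, byLength, Arrays.asList("a", "bb", "cccc", "dd"));
    System.out.println(block.getPartitions());
  }
}
```

In this sketch no partition is ever created for key `0`, because no element maps to it. That is the behavior that motivates the metric change listed under the minor changes.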
   
   **Minor changes to note:**
   - Adapted `DataSkewRuntimePass` and its metric collection process to the change.
       - The sizes of the `Partition`s in a `Block` are collected when the `Block` is committed, rather than each time a `Partition` is written.
       - The type of the partition size metric is changed from `Long` to `Pair<Integer, Long>`.
           - Before this PR, all partitions in a `HashRange` were created by the `Partitioner` even if a `Partition` was empty. Because of this, a `List<Long>` was enough to represent the integer key and size of each `Partition`.
           - However, after this PR, only `Partition`s with actual data are created, so the metric has to contain the key of each `Partition`.
   
   - Changed the output files created during ITCases to be deleted even if the test fails.
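
To illustrate the partition size metric change noted above, here is a small sketch contrasting the two representations. It assumes the metric is a list of key/size pairs; `Map.Entry<Integer, Long>` stands in for the project's `Pair<Integer, Long>`, and `denseSizes` and `sparseSizes` are hypothetical helper names, not methods from this PR.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class PartitionSizeMetricSketch {
  /**
   * Dense form: one size per key in [rangeStart, rangeEnd), so the list position
   * identifies the key. Missing partitions are reported with size 0.
   */
  static List<Long> denseSizes(Map<Integer, Long> sizesByKey, int rangeStart, int rangeEnd) {
    List<Long> sizes = new ArrayList<>();
    for (int key = rangeStart; key < rangeEnd; key++) {
      sizes.add(sizesByKey.getOrDefault(key, 0L));
    }
    return sizes;
  }

  /** Sparse form: only partitions that exist, each entry carrying its own key. */
  static List<Map.Entry<Integer, Long>> sparseSizes(Map<Integer, Long> sizesByKey) {
    List<Map.Entry<Integer, Long>> metric = new ArrayList<>();
    // Iterate in ascending key order so the output is deterministic.
    for (Map.Entry<Integer, Long> entry : new TreeMap<>(sizesByKey).entrySet()) {
      metric.add(new SimpleImmutableEntry<>(entry.getKey(), entry.getValue()));
    }
    return metric;
  }

  public static void main(String[] args) {
    // Assumed example: only partitions 1 and 3 received data.
    Map<Integer, Long> sizesByKey = new TreeMap<>();
    sizesByKey.put(1, 128L);
    sizesByKey.put(3, 64L);
    System.out.println(denseSizes(sizesByKey, 0, 4));
    System.out.println(sparseSizes(sizesByKey));
  }
}
```

Once empty partitions are no longer materialized, a bare list of sizes loses the link between a size and its key, which is why each entry in the sparse form carries the key explicitly.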
   
   **Tests for the changes:**
   
   - Changed interface of `Partitioner`
       - `DataTransferTest` and all ITCases cover the change.
   - Element-wise write to `BlockStore`
       - Refactored `BlockStoreTest` to cover this change.
   - Modified `DataSkewRuntimePass`
       - `DataSkewRuntimePassTest` and `MapReduceITCase#testDataSkew` cover this change.
   
   **Other comments:**
   - Currently, [Intra-Stage pipelining](https://issues.apache.org/jira/projects/NEMO/issues/NEMO-7) cannot fully exploit its advantage because all output data produced by a `TaskGroup` has to be collected in the `OutputWriter` until the `TaskGroup` completes. With element-wise writes to the `BlockStore`, this data can be serialized early, which can reduce memory pressure.
