Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2019/07/19 23:39:21 UTC

[GitHub] [incubator-druid] jihoonson opened a new pull request #8115: Add shuffleSegmentPusher which is a dataSegmentPusher used for writin…

jihoonson opened a new pull request #8115: Add shuffleSegmentPusher which is a dataSegmentPusher used for writin…
URL: https://github.com/apache/incubator-druid/pull/8115
 
 
   This PR is for https://github.com/apache/incubator-druid/issues/8061 and based on https://github.com/apache/incubator-druid/pull/8114.
   
   ### Description
   
   `ShuffleDataSegmentPusher` is a `DataSegmentPusher` that writes shuffle data to local storage.
   
   `ShuffleDataSegmentPusher` uses `IntermediaryDataManager` internally, which coordinates segment writes in a round-robin fashion per supervisor task across its sub tasks. The goal is to fully utilize the local disk bandwidth for shuffle.
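
As a rough illustration of the round-robin idea, the sketch below hands out storage locations in rotation, keeping one counter per supervisor task. It is not the `IntermediaryDataManager` code from this PR: the class name `RoundRobinLocationSelector`, the method `nextLocation`, and the use of plain strings for locations and task IDs are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch only; the names and types here are not taken from the PR.
class RoundRobinLocationSelector
{
  private final List<String> locations;
  // Index of the next location to hand out, tracked separately per supervisor task.
  private final Map<String, Integer> nextIndexPerSupervisor = new HashMap<>();

  RoundRobinLocationSelector(List<String> locations)
  {
    if (locations.isEmpty()) {
      throw new IllegalArgumentException("at least one storage location is required");
    }
    this.locations = new ArrayList<>(locations);
  }

  // Returns the location for the next segment written by any sub task
  // of the given supervisor task, then advances that supervisor's counter.
  synchronized String nextLocation(String supervisorTaskId)
  {
    int index = nextIndexPerSupervisor.getOrDefault(supervisorTaskId, 0);
    nextIndexPerSupervisor.put(supervisorTaskId, (index + 1) % locations.size());
    return locations.get(index);
  }
}
```

Keeping a separate counter per supervisor task means that writes from one supervisor task do not shift the rotation seen by another, so each job spreads its own segments evenly over the configured locations.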
   
   Both the middleManager and the indexer can use this pusher. With the middleManager, however, each task uses a separate `IntermediaryDataManager` instance, which could result in two issues:
   
   - The distribution of shuffle segments can be suboptimal across local storage locations.
   - `IntermediaryDataManager` needs to smoosh segment files into larger ones to avoid the "too many open files" problem. This could also be an issue when there are a lot of tasks, since with the middleManager `IntermediaryDataManager` can't smoosh files across tasks.
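
To make the "smoosh" point concrete, here is a minimal sketch of combining several small files into one larger file while recording where each one landed, so that a reader only needs a single open file. It uses only standard `java.nio.file` calls and does not reflect Druid's actual smoosh format or classes; `SimpleFileConcatenator` and `concatenate` are hypothetical names.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch only; Druid's own smoosh implementation is not shown here.
class SimpleFileConcatenator
{
  // Appends each input file to `target` in order and returns a map of
  // file name -> {start offset, length} within the combined file.
  static Map<String, long[]> concatenate(List<Path> inputs, Path target) throws IOException
  {
    Map<String, long[]> index = new LinkedHashMap<>();
    long offset = 0;
    try (OutputStream out = Files.newOutputStream(target)) {
      for (Path input : inputs) {
        long length = Files.copy(input, out);
        index.put(input.getFileName().toString(), new long[]{offset, length});
        offset += length;
      }
    }
    return index;
  }
}
```

A combiner like this can only merge files it can see, which is why separate per-task instances on the middleManager limit how much the open-file count can be reduced.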
   
   I think this would be ok for now and could be improved if required in the future. 
   
   <hr>
   
   This PR has:
   - [x] been self-reviewed.
   - [x] added unit tests or modified existing tests to cover new code paths.
