
Posted to issues@flink.apache.org by GitBox <gi...@apache.org> on 2019/05/13 10:40:27 UTC

[GitHub] [flink] zhuzhurk opened a new pull request #8430: [FLINK-12068] [runtime] Backtrack failover regions if intermediate results are unavailable

zhuzhurk opened a new pull request #8430: [FLINK-12068] [runtime] Backtrack failover regions if intermediate results are unavailable
URL: https://github.com/apache/flink/pull/8430
 
 
   ## What is the purpose of the change
   
   *In region failover, when a region fails because its input result partitions are unavailable, the failover strategy needs to backtrack the failover regions so that both the failed tasks and the unavailable result partitions are recovered. The detailed design is at https://docs.google.com/document/d/1YHOpMLdC-dtgjcM-EDn6v-oXgsEQKXSoMjqRcYVbJA8/edit.*
   
   
   ## Brief change log
   
     - *RestartPipelinedRegionStrategy (based on the next-generation interface) handles DataConsumptionException by proposing to restart the regions that produce needed but unavailable result partitions, as well as all of their consumer regions.*
     - *Add a ResultPartitionAvailabilityChecker interface and implement it for querying result partition availability*
     - *Calculate and cache region inputs and consumers in the region building phase. This speeds up failover handling significantly, at the cost of significantly slower region building and the memory for two edge-scale caches. See the verification section below.*
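
The backtracking described in the change log can be illustrated with a minimal sketch. It is not the code in this pull request: the `Region` class, the `connect` helper, the string partition IDs, and the `isAvailable` method on `ResultPartitionAvailabilityChecker` are all assumptions made for illustration. Starting from the failed region, it pulls in the producer region of every needed but unavailable input partition, and every consumer region of each region already selected.

```java
import java.util.ArrayDeque;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Queue;
import java.util.Set;
import java.util.TreeSet;

/** Illustrative sketch only; every name and signature here is hypothetical. */
final class RegionBacktrackSketch {

    /** Stand-in for the checker interface named in the change log (method name assumed). */
    interface ResultPartitionAvailabilityChecker {
        boolean isAvailable(String partitionId);
    }

    /** A failover region with its cached inputs and consumers. */
    static final class Region {
        final String name;
        // Cache 1: consumed partition id -> region that produces it.
        final Map<String, Region> inputs = new LinkedHashMap<>();
        // Cache 2: regions that consume this region's results.
        final Set<Region> consumers = new LinkedHashSet<>();

        Region(String name) {
            this.name = name;
        }
    }

    /** Records that {@code consumer} reads {@code partitionId} produced by {@code producer}. */
    static void connect(Region producer, String partitionId, Region consumer) {
        consumer.inputs.put(partitionId, producer);
        producer.consumers.add(consumer);
    }

    /** Returns the names of all regions to restart when {@code failed} fails. */
    static Set<String> regionsToRestart(Region failed, ResultPartitionAvailabilityChecker checker) {
        Set<Region> visited = new LinkedHashSet<>();
        Queue<Region> queue = new ArrayDeque<>();
        visited.add(failed);
        queue.add(failed);

        while (!queue.isEmpty()) {
            Region region = queue.poll();

            // Backtrack: an unavailable input must be reproduced by its producer region.
            for (Map.Entry<String, Region> input : region.inputs.entrySet()) {
                if (!checker.isAvailable(input.getKey()) && visited.add(input.getValue())) {
                    queue.add(input.getValue());
                }
            }

            // Forward: every consumer of a restarted region is restarted too.
            for (Region consumer : region.consumers) {
                if (visited.add(consumer)) {
                    queue.add(consumer);
                }
            }
        }

        Set<String> names = new TreeSet<>();
        for (Region region : visited) {
            names.add(region.name);
        }
        return names;
    }
}
```

For a chain A -> B -> C with an extra consumer D of A, a failure of C with only the B-to-C partition lost restarts B and C in this sketch. If the A-to-B partition is lost as well, A is pulled in, and D follows as a consumer of A. Precomputing `inputs` and `consumers` is what makes each step a plain lookup, which mirrors the trade-off the change log describes.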
   
   
   ## Verifying this change
   
   
   This change added tests and can be verified as follows:
     - *Added tests in the flip1 RestartPipelinedRegionStrategyTest to verify correctness.*
     - *Performance tests were conducted manually. For a job with 16 million edges, calculating the tasks to restart currently takes < 200 ms. Region building time in this case increased from 600 ms to ~6 s because of the helper caches.*
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): (no)
     - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (no)
     - The serializers: (no)
     - The runtime per-record code paths (performance sensitive): (no)
     - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes)
     - The S3 file system connector: (no)
   
   ## Documentation
   
     - Does this pull request introduce a new feature? (no)
     - If yes, how is the feature documented? (not documented)
   
