You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@beam.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2018/08/08 22:58:00 UTC

[jira] [Work logged] (BEAM-5110) Reconile Flink JVM singleton management with deployment

     [ https://issues.apache.org/jira/browse/BEAM-5110?focusedWorklogId=132737&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-132737 ]

ASF GitHub Bot logged work on BEAM-5110:
----------------------------------------

                Author: ASF GitHub Bot
            Created on: 08/Aug/18 22:57
            Start Date: 08/Aug/18 22:57
    Worklog Time Spent: 10m 
      Work Description: angoenka opened a new pull request #6189: [BEAM-5110] Explicitly count the references for BatchFlinkExecutableStageContext …
URL: https://github.com/apache/beam/pull/6189
 
 
   …and clean them after TTL
   
   We relied on weak reference and finalize to keep track of active BatchFlinkExecutableStageContext for each job. This is error prone and sub optimal.
   Introducing explicit reference counting and cleanup of BatchFlinkExecutableStageContext for each job.
   ------------------------
   
   Follow this checklist to help us incorporate your contribution quickly and easily:
   
    - [ ] Format the pull request title like `[BEAM-XXX] Fixes bug in ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
    - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).
   
   It will help us expedite review of your Pull Request if you tag someone (e.g. `@username`) to look at it.
   
   Post-Commit Tests Status (on master branch)
   ------------------------------------------------------------------------------------------------
   
   Lang | SDK | Apex | Dataflow | Flink | Gearpump | Samza | Spark
   --- | --- | --- | --- | --- | --- | --- | ---
   Go | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Go_GradleBuild/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Go_GradleBuild/lastCompletedBuild/) | --- | --- | --- | --- | --- | ---
   Java | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Java_GradleBuild/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_GradleBuild/lastCompletedBuild/) | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Apex_Gradle/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Apex_Gradle/lastCompletedBuild/) | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow_Gradle/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow_Gradle/lastCompletedBuild/) | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Flink_Gradle/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Flink_Gradle/lastCompletedBuild/) | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Gearpump_Gradle/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Gearpump_Gradle/lastCompletedBuild/) | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Samza_Gradle/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Samza_Gradle/lastCompletedBuild/) | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Spark_Gradle/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Spark_Gradle/lastCompletedBuild/)
   Python | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Python_Verify/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Python_Verify/lastCompletedBuild/) | --- | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Py_VR_Dataflow/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Py_VR_Dataflow/lastCompletedBuild/) </br> [![Build Status](https://builds.apache.org/job/beam_PostCommit_Py_ValCont/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Py_ValCont/lastCompletedBuild/) | --- | --- | --- | ---
   
   
   
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Issue Time Tracking
-------------------

            Worklog Id:     (was: 132737)
            Time Spent: 10m
    Remaining Estimate: 0h

> Reconile Flink JVM singleton management with deployment
> -------------------------------------------------------
>
>                 Key: BEAM-5110
>                 URL: https://issues.apache.org/jira/browse/BEAM-5110
>             Project: Beam
>          Issue Type: Bug
>          Components: runner-flink
>            Reporter: Ben Sidhom
>            Assignee: Ben Sidhom
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> [~angoenka] noticed through debugging that multiple instances of BatchFlinkExecutableStageContext.BatchFactory are loaded for a given job when executing in standalone cluster mode. This context factory is responsible for maintaining singleton state across a TaskManager (JVM) in order to share SDK Environments across workers in a given job. The multiple-loading breaks singleton semantics and results in an indeterminate number of Environments being created.
> It turns out that the [Flink classloading mechanism|https://ci.apache.org/projects/flink/flink-docs-release-1.5/monitoring/debugging_classloading.html] is determined by deployment mode. Note that "user code" as referenced by this link is actually the Flink job server jar. Actual end-user code lives inside of the SDK Environment and uploaded artifacts.
> In order to maintain singletons without resorting to IPC (for example, using file locks and/or additional gRPC servers), we need to force non-dynamic classloading. For example, this happens when jobs are submitted to YARN for one-off deployments via `flink run`. However, connecting to an existing (Flink standalone) deployment results in dynamic classloading.
> We should investigate this behavior and either document (and attempt to enforce) deployment modes that are consistent with our requirements, or (if possible) create a custom classloader that enforces singleton loading.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)