Posted to issues@beam.apache.org by "Beam JIRA Bot (Jira)" <ji...@apache.org> on 2022/04/19 22:25:00 UTC

[jira] [Updated] (BEAM-14332) Improve the workflow of cluster management for Flink on Dataproc

     [ https://issues.apache.org/jira/browse/BEAM-14332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Beam JIRA Bot updated BEAM-14332:
---------------------------------
    Status: Open  (was: Triage Needed)

> Improve the workflow of cluster management for Flink on Dataproc
> ----------------------------------------------------------------
>
>                 Key: BEAM-14332
>                 URL: https://issues.apache.org/jira/browse/BEAM-14332
>             Project: Beam
>          Issue Type: Improvement
>          Components: runner-py-interactive
>            Reporter: Ning
>            Assignee: Ning
>            Priority: P2
>
> Improve the workflow of cluster management.
> There is an option to configure a default [cluster name|https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/interactive/interactive_beam.py#L366]. The existing user flows are:
>  # Use the default cluster name to create a new cluster if none is in use;
>  # Reuse a created cluster that has the default cluster name;
>  # If the default cluster name is set to a new value, repeat steps 1 and 2.
>  A better solution is to:
>  # Create a new cluster implicitly if there is none or explicitly if the user wants one with specific provisioning;
>  # Always default to using the last created cluster.
>  The reasons are:
>  * A cluster name is meaningless to the user when the cluster is just a medium for running OSS runners (as applications) such as Flink or Spark. The cluster could also be running anywhere (on GCP), such as on Dataproc, k8s, or even Dataflow itself.
>  * Clusters should be uniquely identified, so each should always have a distinct name. Clusters are managed (created/reused/deleted) behind the scenes by the notebook runtime when the user doesn’t explicitly do so (the capability to explicitly manage clusters is still available). Reusing the same default cluster name is risky: one notebook runtime may delete a cluster while a different notebook runtime creates another cluster with the same name.
>  * Users should have the capability to explicitly provision a cluster.
> The current implementation provisions each cluster at the location specified by GoogleCloudOptions, using 3 worker nodes. There is no explicit API to configure the number or shape of workers.
> We could use WorkerOptions to let customers explicitly provision a cluster, and expose an explicit API (with UX in the notebook extension) for customers to change the size of a cluster connected to their notebook (until we have an autoscaling solution with Dataproc for Flink).
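
The proposed flow above (create implicitly when none exists, always default to the last created cluster, give every cluster a distinct name, allow explicit provisioning) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the Interactive Beam implementation: ClusterManager, Provisioning, the machine_type field, and the "interactive-beam-" name prefix are all hypothetical, and the registry is kept in memory instead of calling Dataproc.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass(frozen=True)
class Provisioning:
    """Hypothetical cluster shape. The default of 3 workers mirrors the
    current behavior described in the issue; machine_type is an assumption."""
    num_workers: int = 3
    machine_type: Optional[str] = None


@dataclass
class ClusterManager:
    """Hypothetical in-memory registry. A real implementation would create
    and delete clusters through the Dataproc API instead."""
    # Dicts preserve insertion order, so the last key is the last created cluster.
    clusters: Dict[str, Provisioning] = field(default_factory=dict)

    def create(self, provisioning: Optional[Provisioning] = None) -> str:
        """Explicitly create a cluster under a generated, distinct name."""
        name = f"interactive-beam-{uuid.uuid4().hex}"
        self.clusters[name] = provisioning or Provisioning()
        return name

    def current(self) -> str:
        """Return the last created cluster, creating one implicitly if none exists."""
        if not self.clusters:
            return self.create()
        return list(self.clusters)[-1]

    def delete(self, name: str) -> None:
        """Forget a cluster; the default falls back to the previous one."""
        del self.clusters[name]


manager = ClusterManager()
implicit = manager.current()                           # created implicitly
explicit = manager.create(Provisioning(num_workers=10))  # explicit provisioning
print(manager.current() == explicit)
```

Because names are generated rather than shared, two notebook runtimes in this sketch can never create or delete a cluster under the same name, which is the risk the issue attributes to a reused default name.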



--
This message was sent by Atlassian Jira
(v8.20.7#820007)