Posted to dev@flink.apache.org by Stephan Ewen <se...@apache.org> on 2018/12/09 20:49:20 UTC

Re: [DISCUSS] Flink backward compatibility

A few thoughts from my side:

(1) The client needs a big refactoring / cleanup. It should use a proper HTTP
client library to help with future authentication mechanisms.
Once that is done, we should identify a "client API" that we make stable,
just like the DataStream / DataSet API.

(2) We will most likely refactor the stack in the near future (see
discussion threads on batch / streaming unification).
I would suggest that we define a DAG API as the common substrate and as the
data structure in which jobs are submitted to the REST API (session modes)
and stored in HA services (job mode). Think of it as a JobGraph++.  It may
be a good idea to define that structure via ProtoBuf (or a similar tool) to
support forward/backwards compatibility.
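
To illustrate the forward/backwards compatibility point, here is a minimal
Python sketch of the "tolerant reader" behaviour that ProtoBuf-style schemas
provide: missing fields fall back to defaults, and unknown fields are kept
and written back out. The names Vertex, parse_vertex, dump_vertex and the
field names are hypothetical and are not part of Flink or ProtoBuf.

```python
from dataclasses import dataclass, field
from typing import Any, Dict

# Hypothetical DAG vertex; the field names are illustrative assumptions.
@dataclass
class Vertex:
    id: str
    operator: str
    # Default applied when an older writer did not know this field.
    parallelism: int = 1
    # Fields written by a newer version that this reader does not understand.
    unknown: Dict[str, Any] = field(default_factory=dict)

KNOWN_FIELDS = {"id", "operator", "parallelism"}

def parse_vertex(raw: Dict[str, Any]) -> Vertex:
    """Read a vertex, tolerating both missing and unexpected fields."""
    known = {k: v for k, v in raw.items() if k in KNOWN_FIELDS}
    extra = {k: v for k, v in raw.items() if k not in KNOWN_FIELDS}
    return Vertex(unknown=extra, **known)

def dump_vertex(v: Vertex) -> Dict[str, Any]:
    """Write a vertex back out, preserving fields this version did not know."""
    out = {"id": v.id, "operator": v.operator, "parallelism": v.parallelism}
    out.update(v.unknown)
    return out

# Backward compatibility: data from an older writer lacks "parallelism".
old = parse_vertex({"id": "a", "operator": "map"})

# Forward compatibility: data from a newer writer carries an extra field.
new = parse_vertex(
    {"id": "b", "operator": "sink", "parallelism": 4, "slot_group": "x"}
)
```

With a structure like this, a 1.6-style client and a 1.7-style cluster could
exchange the same graph without either side failing on fields it does not
know, which is the property a schema tool would give us by construction.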

Best,
Stephan


On Wed, Nov 28, 2018 at 10:45 AM Chesnay Schepler <ch...@apache.org>
wrote:

> so let's take a look...
>
> binary client compatibility: The key issue I see hasn't changed since
> the last time this was brought up: Clients rely on the JobGraph to
> submit the job which is an internal data structure. AFAIK there will
> also be changes made to said class soon(ish). So long as we don't
> introduce a decoupled structure and/or compatibility routines here this
> is not feasible.
> The client in general may be in the way here. The unfortunate reality is
> that the client code is one big mess that is due for a complete rewrite.
> I doubt anyone has an all-encompassing view over hidden assumptions that
> are baked into it, that we would have to retain if we go for backwards
> compatibility.
>
> CLI compatibility: Does this include all start scripts or just the flink
> executable? I think this makes sense, but so far we did a reasonable job
> at not changing command-line parameters. (But maybe only because
> changing this part of the CLI is a massive pain...)
>
> REST API: The versioning introduced in 1.7.0 is a significant step
> towards a stable API as it allows us to modify things without
> (inherently) breaking it.
> We're primarily missing tests here to verify the stability, but these
> are being worked on.
>
> Metrics: I would not categorize them as stable in general, the reason
> being that we are still refactoring and streamlining the usage. For
> some core system metrics (checkpoint info, IO) we can _probably_
> guarantee stability.
>
> On 27.11.2018 18:43, Thomas Weise wrote:
> > Some scenarios that come to mind:
> >
> > Flink client binary compatibility with remote cluster: This would include
> > RemoteStreamEnvironment, RESTClusterClient etc. - User should be able to
> > submit the job built with 1.6.x using the 1.6.x binaries to the remote
> > Flink 1.7.x or later cluster. The use case for this is Beam.
> >
> > REST API compatibility: User tooling built against 1.6.x REST API spec
> > continues to work with 1.7.x or later REST API
> >
> > CLI compatibility: The commands/options exposed in the CLI continue to be
> > available after an upgrade. Users can just point to the new CLI location.
> >
> > Metrics:  Metrics that exist in 1.6.x are available in 1.7.x
> >
> > There is probably a lot more (such as various backends that users can
> > configure and their options) and there are different levels of
> > cost/complexity trade-offs. I brought up the REST API in the past after
> > observing the tools breakage when going from 1.4.x to 1.5.x.
> >
> > The client binary compatibility issue will grow more severe as the
> > ecosystem expands. Beam is a representative example in that category. To
> > solve the issue downstream, different communities and users each would
> > need to come up with build system/release support for multiple parallel
> > Flink versions. It would be better to shield them from such complexity.
> >
> > Thanks,
> > Thomas
> >
> >
> > On Tue, Nov 27, 2018 at 6:27 AM Fabian Hueske <fh...@gmail.com> wrote:
> >
> >> Hi,
> >>
> >> I think this is a very good discussion to have.
> >> Flink is becoming part of more and more production deployments and more
> >> tools are built around it.
> >> The question is whether we want to (or can) make parts of the
> >> control/maintenance/monitoring API stable so that external
> >> systems/frameworks can rely on them.
> >>
> >> Which APIs are relevant?
> >> Which APIs could be declared as stable?
> >> Which parts are still evolving?
> >>
> >> Fabian
> >>
> >> On Tue, Nov 27, 2018 at 3:10 PM Chesnay Schepler <chesnay@apache.org>
> >> wrote:
> >>
> >>> I think this discussion needs specific examples as to what should be
> >>> possible, as it is otherwise too vague / open to interpretation.
> >>>
> >>> For example, "job submission" may refer to CLI invocations continuing to
> >>> work (i.e. CLI arguments), or being able to use a 1.6 client against a
> >>> 1.7 cluster, which are entirely different things.
> >>>
> >>> What does "management" include? Dependencies? Set of jars that are
> >>> released on maven? Set of jars bundled with flink-dist?
> >>>
> >>> On 26.11.2018 17:24, Thomas Weise wrote:
> >>>> Hi,
> >>>>
> >>>> I wanted to bring back the topic of backward compatibility with respect
> >>>> to all/most of the user facing aspects of Flink. Please note that this
> >>>> isn't limited to the programming API, but also includes job submission
> >>>> and management.
> >>>>
> >>>> As can be seen in [1], changes in these areas cause difficulties
> >>>> downstream. Projects have to choose between Flink versions and users are
> >>>> ultimately at a disadvantage, either by not being able to use the desired
> >>>> dependency or facing forced upgrades to their infrastructure.
> >>>>
> >>>> IMO the preferred solution would be that downstream projects can build
> >>>> against a minimum version of Flink and expect compatibility with future
> >>>> releases of the major version stream. For example, my project depends on
> >>>> 1.6.x and can expect to run without recompilation on 1.7.x and later.
> >>>>
> >>>> How far away is Flink from stabilizing the surface that affects typical
> >>>> users?
> >>>>
> >>>> Thanks,
> >>>> Thomas
> >>>>
> >>>> [1] https://issues.apache.org/jira/browse/BEAM-5419
> >>>>
> >>>
>
>