Posted to dev@beam.apache.org by Chamikara Jayalath <ch...@google.com> on 2018/09/05 00:34:48 UTC

Re: [DISCUSS] Versioning, Hadoop related dependencies and enterprise users

Based on this email thread and offline feedback from several folks, the
current concerns regarding dependency upgrade policy and tooling seem to be
the following.

(1) We have to be careful when upgrading dependencies. For example, we
should not create JIRAs for upgrading to dependency versions that have
known issues.

(2) Dependency owners list can get stale. Somebody who is interested in
upgrading a dependency today might not be interested in the same task in
six months. Responsibility of upgrading a dependency should lie with the
community instead of pre-identified owner(s).

On the other hand, we do not want Beam to fall significantly behind when it
comes to dependencies. We should upgrade dependencies whenever it makes
sense. This allows us to offer a more up-to-date system and to make things
easy for users that deploy Beam along with other systems.

I discussed these issues with Yifan, and we would like to suggest the
following changes to the current policy and tooling that might help
alleviate some of the concerns.

(1) Instead of a dependency "owners" list, we will maintain an "interested
parties" list. When we create a JIRA for a dependency, we will not assign it
to an owner; rather, we will CC all the folks that mentioned they would be
interested in receiving updates related to that dependency. The hope is that
some of the interested parties will also put forward the effort to upgrade
dependencies they are interested in, but the responsibility of upgrading
dependencies lies with the community as a whole.

 (2) We will be creating JIRAs for upgrading individual dependencies, not
for upgrading to specific versions of those dependencies. For example, if a
given dependency X is three minor versions or a year behind, we will create
a JIRA for upgrading it. But the specific version to upgrade to has to be
determined by the Beam community. The Beam community might choose to close a
JIRA if there are known issues with the available recent releases. The tool
may reopen such a closed JIRA in the future if new information becomes
available (for example, 3 new versions have been released since the JIRA was
closed).
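
The staleness rule above ("three minor versions or a year behind") could be
sketched roughly as follows; the function name, version parsing, and
thresholds here are illustrative assumptions, not the actual tooling:

```python
from datetime import datetime, timedelta

def needs_upgrade_jira(current, latest, current_release_date, now=None):
    """Return True if a dependency should get an upgrade JIRA.

    A dependency qualifies when the version Beam uses is at least three
    minor versions behind the latest release, a whole major version
    behind, or roughly a year old.
    """
    now = now or datetime.now()
    cur = tuple(int(x) for x in current.split('.')[:2])
    new = tuple(int(x) for x in latest.split('.')[:2])
    if new[0] > cur[0]:
        return True  # a whole major version behind
    if new[0] == cur[0] and new[1] - cur[1] >= 3:
        return True  # three or more minor versions behind
    return now - current_release_date >= timedelta(days=365)
```

Note that this only decides whether to file a JIRA; per the policy above, it
deliberately does not pick a target version, which is left to the community.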

Thoughts?

Thanks,
Cham
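
As a concrete illustration of the version-range approach the Python SDK uses
for most dependencies (mentioned later in this thread, with a link to
setup.py), ranges of this style let compatible releases flow in
automatically; the package names, bounds, and helper below are hypothetical,
not Beam's actual list:

```python
# Hypothetical version ranges in the style of a setup.py: each dependency
# may float up to, but not including, the next major version, so compatible
# patch and minor releases are picked up automatically at install time.
# These would be passed to setuptools as setup(install_requires=REQUIRED_PACKAGES).
REQUIRED_PACKAGES = [
    'httplib2>=0.8,<1.0',
    'mock>=1.0.1,<3.0',
    'protobuf>=3.5.0,<4.0',
]

def satisfies(version, spec):
    """Crude check that a dotted version lies within a '>=lo,<hi' range."""
    lo, hi = spec.split(',')
    parse = lambda v: tuple(int(x) for x in v.split('.'))
    return parse(lo.lstrip('>=')) <= parse(version) < parse(hi.lstrip('<'))
```

The upper bound is the key design choice: it keeps automatic updates within
the range that semantic versioning promises to be backwards compatible.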

On Tue, Aug 28, 2018 at 1:51 PM Chamikara Jayalath <ch...@google.com>
wrote:

>
>
> On Tue, Aug 28, 2018 at 12:05 PM Thomas Weise <th...@apache.org> wrote:
>
>> I think there is an invalid assumption being made in this discussion,
>> which is that most projects comply with semantic versioning. The reality in
>> the open source big data space is unfortunately quite different. Ismaël has
>> characterized the situation well, and HBase isn't an exception. Another
>> indicator of the scale of the problem is the extensive amount of shading
>> used in Beam and other projects. It wouldn't be necessary if semver
>> compliance were something we could rely on.
>>
>> Our recent Flink upgrade broke user(s). And we noticed a backward
>> incompatible Flink change that affected the portable Flink runner even
>> between patches.
>>
>> Many projects (including Beam) guarantee compatibility only for a subset
>> of the public API. Sometimes a REST API is not covered, sometimes protocols
>> that are not strictly internal change, and so on, all of which can break
>> users despite the public API remaining "compatible". As much as I would
>> love to rely on the version number to tell me whether an upgrade is safe
>> or not, that's not practically possible.
>>
>> Furthermore, we need to proceed with caution when forcing upgrades on users
>> that host the target systems. To stay with the Flink example, moving Beam
>> from 1.4 to 1.5 is actually a major change to some, because they now have
>> to upgrade their Flink clusters/deployments to be able to use the new
>> version of Beam.
>>
>> Upgrades need to be done with caution and may require extensive
>> verification beyond what our automation provides. I think the Spark change
>> from 1.x to 2.x and also the JDK 1.8 change were good examples; they gave
>> the community a window to provide feedback and influence the change.
>>
>
> Thanks for the clarification.
>
> The current policy indeed requests caution and explicit checks when
> upgrading all dependencies (including minor and patch versions), but the
> language might have to be updated to emphasize your concerns.
>
> Here's the current text.
>
> "Beam releases adhere to <https://beam.apache.org/get-started/downloads/> semantic
> versioning. Hence, community members should take care when updating
> dependencies. Minor version updates to dependencies should be backwards
> compatible in most cases. Some updates to dependencies though may result in
> backwards incompatible API or functionality changes to Beam. PR reviewers
> and committers should take care to detect any dependency updates that could
> potentially introduce backwards incompatible changes to Beam before merging
> and PRs that update dependencies should include a statement regarding this
> verification in the form of a PR comment. Dependency updates that result in
> backwards incompatible changes to non-experimental features of Beam should
> be held till next major version release of Beam. Any exceptions to this
> policy should only occur in extreme cases (for example, due to a security
> vulnerability of an existing dependency that is only fixed in a subsequent
> major version) and should be discussed in the Beam dev list. Note that
> backwards incompatible changes to experimental features may be introduced
> in a minor version release."
>
> Also, are there any other steps we can take to make sure that Beam
> dependencies are not too old while offering a stable system? Note that
> having a lot of legacy dependencies that do not get upgraded regularly can
> also result in user pain and Beam being unusable for certain users who run
> into dependency conflicts when using Beam along with other systems (which
> will increase the amount of shading/vendoring we have to do).
>
> Please note that the current tooling does not force upgrades or
> automatically upgrade dependencies. It simply creates JIRAs that can be
> closed with a reason if needed. For the Python SDK, though, we have version
> ranges in place for most dependencies [1], so those dependencies get
> updated automatically according to the corresponding ranges.
> https://github.com/apache/beam/blob/master/sdks/python/setup.py#L103
>
> Thanks,
> Cham
>
>
>>
>> Thanks,
>> Thomas
>>
>>
>>
>> On Tue, Aug 28, 2018 at 11:29 AM Raghu Angadi <ra...@google.com> wrote:
>>
>>> Thanks for the IO versioning summary.
>>> KafkaIO's policy of 'let the user decide exact version at runtime' has
>>> been quite useful so far. How feasible is that for other connectors?
>>>
>>> Also, KafkaIO does not limit itself to the minimum features available
>>> across all the supported versions. Some of the features (e.g. server-side
>>> timestamps) are disabled based on the runtime Kafka version. The unit tests
>>> currently run with a single recent version. Integration tests could
>>> certainly use multiple versions. With some more effort in writing tests, we
>>> could run the unit tests against multiple versions as well.
>>>
>>> Raghu.
>>>
>>> IO versioning
>>>> * Elasticsearch. We delayed the move to version 6 until we heard of
>>>> more active users needing it (more deployments). We support 2.x and
>>>> 5.x (but 2.x went recently EOL). Support for 6.x is in progress.
>>>> * SolrIO, stable version is 7.x, LTS is 6.x. We support only 5.x
>>>> because most big data distributions still use 5.x (however 5.x has
>>>> been EOL).
>>>> * KafkaIO uses version 1.x but Kafka recently moved to 2.x, however
>>>> most of the deployments of Kafka use earlier versions than 1.x. This
>>>> module uses a single version with the kafka client as a provided
>>>> dependency and so far it works (but we don’t have multi version
>>>> tests).
>>>>
>>>
>>>
>>> On Tue, Aug 28, 2018 at 8:38 AM Ismaël Mejía <ie...@gmail.com> wrote:
>>>
>>>> I think we should refine the strategy on dependencies discussed
>>>> recently. Sorry to come to this late (I did not follow the previous
>>>> discussion closely), but the current approach is clearly not in line
>>>> with the industry reality (at least not for IO connectors + Hadoop +
>>>> Spark/Flink use).
>>>>
>>>> A really proactive approach to dependency updates is a good practice
>>>> for the core dependencies we have, e.g. Guava, Bytebuddy, Avro,
>>>> Protobuf, etc., and of course for the cloud-based IOs, e.g. GCS,
>>>> Bigquery, AWS S3, etc. However, when we talk about self-hosted data
>>>> sources or processing systems this gets more complicated, and I think
>>>> we should be more flexible and handle these case by case (and remove
>>>> them from the auto-update email reminder).
>>>>
>>>> Some open source projects have at least three maintained versions:
>>>> - LTS – maps to what most of the people have installed (or the big
>>>> data distributions use) e.g. HBase 1.1.x, Hadoop 2.6.x
>>>> - Stable – current recommended version. HBase 1.4.x, Hadoop 2.8.x
>>>> - Next – latest release. HBase 2.1.x Hadoop 3.1.x
>>>>
>>>> Following the most recent versions can be good for staying close to the
>>>> current development of other projects and getting some of the fixes, but
>>>> these versions are commonly not deployed by most users, and adopting an
>>>> LTS-only or stable-only approach won't satisfy all cases either. To
>>>> understand why this is complex, let's look at some historical issues:
>>>>
>>>> IO versioning
>>>> * Elasticsearch. We delayed the move to version 6 until we heard of
>>>> more active users needing it (more deployments). We support 2.x and
>>>> 5.x (but 2.x went recently EOL). Support for 6.x is in progress.
>>>> * SolrIO, stable version is 7.x, LTS is 6.x. We support only 5.x
>>>> because most big data distributions still use 5.x (however 5.x has
>>>> been EOL).
>>>> * KafkaIO uses version 1.x but Kafka recently moved to 2.x, however
>>>> most of the deployments of Kafka use earlier versions than 1.x. This
>>>> module uses a single version with the kafka client as a provided
>>>> dependency and so far it works (but we don’t have multi version
>>>> tests).
>>>>
>>>> Runners versioning
>>>> * The move from Spark 1 to Spark 2 was decided after evaluating the
>>>> tradeoff between the cost of maintaining multiple version support and
>>>> the impact of a breaking change. This is a rare case but also one with
>>>> consequences. This dependency is provided, but we don't actively test
>>>> issues on version migration.
>>>> * Flink moved to version 1.5, introducing an incompatibility in
>>>> checkpointing (discussed recently, with no consensus yet on how to
>>>> handle it).
>>>>
>>>> As you can see, it seems really hard to have a solution that fits all
>>>> cases. Probably the only rule that I see from this list is that we
>>>> should upgrade versions for connectors that have been deprecated or
>>>> reached EOL (e.g. Solr 5.x, Elasticsearch 2.x).
>>>>
>>>> For the case of the provided dependencies, I wonder whether, as part of
>>>> the test suite, we should run tests against multiple versions (note that
>>>> this is currently blocked by BEAM-4087).
>>>>
>>>> Any other ideas or opinions on how we can handle this? What do other
>>>> people in the community think? (Notice that this can relate to the
>>>> ongoing LTS discussion.)
>>>>
>>>>
>>>> On Tue, Aug 28, 2018 at 10:44 AM Tim Robertson
>>>> <ti...@gmail.com> wrote:
>>>> >
>>>> > Hi folks,
>>>> >
>>>> > I'd like to revisit the discussion around our versioning policy
>>>> specifically for the Hadoop ecosystem and make sure we are aware of the
>>>> implications.
>>>> >
>>>> > As an example our policy today would have us on HBase 2.1 and I have
>>>> reminders to address this.
>>>> >
>>>> > However, currently the versions of HBase in the major hadoop distros
>>>> are:
>>>> >
>>>> >  - Cloudera 5 on HBase 1.2 (Cloudera 6 is 2.1 but is only in beta)
>>>> >  - Hortonworks HDP3 on HBase 2.0 (only recently released so we can
>>>> assume is not widely adopted)
>>>> >  - AWS EMR HBase on 1.4
>>>> >
>>>> > On versioning, I think we might need a more nuanced approach to
>>>> ensure that we target real communities of existing and potential users.
>>>> Enterprise users need to stick to the versions supported in the
>>>> distributions to maintain support contracts with the vendors.
>>>> >
>>>> > Should our versioning policy leave more room to consider things on a
>>>> case-by-case basis?
>>>> >
>>>> > For Hadoop might we benefit from a strategy on which community of
>>>> users Beam is targeting?
>>>> >
>>>> > (OT: I'm collecting some thoughts on what we might consider to target
>>>> enterprise hadoop users - kerberos on all relevant IO, performance, leaking
>>>> beyond encryption zones with temporary files etc)
>>>> >
>>>> > Thanks,
>>>> > Tim
>>>>
>>>

Re: [DISCUSS] Versioning, Hadoop related dependencies and enterprise users

Posted by Yifan Zou <yi...@google.com>.
Hello,

We modified the tool based on the new policies discussed in this thread
(the latest report
<https://builds.apache.org/job/beam_Dependency_Check/lastSuccessfulBuild/artifact/src/build/dependencyUpdates/beam-dependency-check-report.html>).
The major changes include:

- Title (summary) of the issues will be in the format "Beam Dependency
Update Request: <dep_name>".
- Owners of dependencies will be cc'ed by tagging the corresponding JIRA
IDs mentioned in the ownership files
<https://github.com/apache/beam/tree/master/ownership>.
- Users are able to close JIRAs if dependencies should not be upgraded,
specifying fix versions if applicable.
- The automated tool will reopen a JIRA for a given dependency when one of
the following conditions is met:
  1. The next SDK release matches a fix version mentioned in the JIRA.
  2. Six months and three or more minor version releases have passed since
the JIRA was closed.
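
The reopen conditions above could be sketched as follows; the data model
(dict keys, helper name) is an illustrative assumption about the tool, not
its actual implementation:

```python
from datetime import datetime, timedelta

def should_reopen(jira, next_sdk_release, latest_dep_versions, now=None):
    """Decide whether the dependency-check tool should reopen a closed JIRA.

    jira: dict with 'fix_version' (str or None), 'closed_at' (datetime),
          and 'versions_at_close' (versions known when the JIRA was closed).
    next_sdk_release: version string of the upcoming Beam SDK release.
    latest_dep_versions: dependency version strings released so far.
    """
    now = now or datetime.now()
    # Condition 1: the next SDK release matches the JIRA's fix version.
    if jira['fix_version'] and jira['fix_version'] == next_sdk_release:
        return True
    # Condition 2: six months AND >= 3 new versions since the close.
    six_months = now - jira['closed_at'] >= timedelta(days=182)
    new_versions = len(set(latest_dep_versions) - set(jira['versions_at_close']))
    return six_months and new_versions >= 3
```

Requiring both the elapsed time and the new releases in condition 2 avoids
nagging about slow-moving dependencies that simply have nothing new to offer.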

We will close the old JIRAs to avoid duplicates and confusion.
Please refer to the Dependency Guide
<https://beam.apache.org/contribute/dependencies/> for more information.

Thank you.

Regards.
Yifan

On Fri, Sep 7, 2018 at 3:46 PM Yifan Zou <yi...@google.com> wrote:

> Thanks all for comments and suggestions. We want to close this thread and
> start implementing the new policy based on the discussion:
>
> 1. Stop assigning JIRAs to the first person listed in the dependency owners
> files <https://github.com/apache/beam/tree/master/ownership>. Instead, cc
> people on the owner list.
> 2. We will be creating JIRAs for upgrading individual dependencies, not
> for upgrading to specific versions of those dependencies. For example, if a
> given dependency X is three minor versions or a year behind, we will create
> a JIRA for upgrading it. But the specific version to upgrade to has to be
> determined by the Beam community. The Beam community might choose to close
> a JIRA if there are known issues with the available recent releases. The
> tool will reopen such a closed JIRA to inform owners if Beam is hitting the
> 'fixed version' or 3 new versions of the dependency have been released
> since the JIRA was closed.
>
> Thank you.
>
> Regards.
> Yifan
>
> On Wed, Sep 5, 2018 at 2:14 PM Yifan Zou <yi...@google.com> wrote:
>
>> +1 on the jira "fix version".
>> The release frequency of dependencies varies, so reopening issues based
>> only on new information such as versions released since the JIRA closing
>> date might not always be appropriate. We could check the fix versions
>> first and, if specified, reopen the issue in that version's release cycle;
>> if not, follow Cham's proposal (2).
>>
>> On Wed, Sep 5, 2018 at 1:59 PM Chamikara Jayalath <ch...@google.com>
>> wrote:
>>
>>>
>>>
>>> On Wed, Sep 5, 2018 at 12:50 PM Tim Robertson <ti...@gmail.com>
>>> wrote:
>>>
>>>> Thank you Cham, and everyone for contributing
>>>>
>>>> Sorry for the slow reply to a thread I started, but I've been swamped on
>>>> non-Beam projects.
>>>>
>>>> KafkaIO's policy of 'let the user decide exact version at runtime' has
>>>>> been quite useful so far. How feasible is that for other connectors?
>>>>
>>>>
>>>> I presume shimming might be needed in a few places but it's certainly
>>>> something we might want to explore more. I'll look into KafkaIO.
>>>>
>>>> On Cham's proposal :
>>>>
>>>> (1) +0.5. We can always then opt to either assign or take ownership of
>>>> an issue, although I am also happy to stick with the owners model - it
>>>> prompted me to investigate and resulted in this thread.
>>>>
>>>> (2) I think this makes sense.
>>>> A bot informing us that we're falling behind versions is immensely
>>>> useful as long as we can link issues to others which might have a wider
>>>> discussion (remember many dependencies need to be treated together such as
>>>> "Support Hadoop 3.0.x" or "Support HBase 2.x"). Would it make sense to let
>>>> owners use the Jira "fix versions" to put in future release to inform the
>>>> bot when it should start alerting again?
>>>>
>>>
>>> I think this makes sense. Setting a "fix version" will be especially
>>> useful for dependency changes that result in API changes that have to be
>>> postponed until the next major version of Beam.
>>>
>>> On grouping, I believe we already group JIRAs into tasks and sub-tasks
>>> based on group ids of dependencies. I suppose it will not be too hard to
>>> close multiple sub-tasks with the same reasoning.
>>>
>>>
>>>>
>>>>
>>>> On Wed, Sep 5, 2018 at 3:18 AM Yifan Zou <yi...@google.com> wrote:
>>>>
>>>>> Thanks Cham for putting this together. Also, after modifying the
>>>>> dependency tool based on the policy above, we will close all existing
>>>>> JIRA issues to prevent creating duplicate bugs, and stop pushing
>>>>> assignees to upgrade dependencies via old bugs.
>>>>>
>>>>> Please let us know if you have any comments on the revised policy in
>>>>> Cham's email.
>>>>>
>>>>> Thanks all.
>>>>>
>>>>> Regards.
>>>>> Yifan Zou
>>>>>

Re: [DISCUSS] Versioning, Hadoop related dependencies and enterprise users

Posted by Yifan Zou <yi...@google.com>.
Thanks, all, for the comments and suggestions. We want to close this thread and
start implementing the new policy based on the discussion:

1. Stop assigning JIRAs to the first person listed in the dependency owners
files <https://github.com/apache/beam/tree/master/ownership>. Instead, cc
people on the owner list.
2. We will be creating JIRAs for upgrading individual dependencies, not for
upgrading to specific versions of those dependencies. For example, if a
given dependency X is three minor versions or a year behind, we will create
a JIRA for upgrading it. But the specific version to upgrade to has to be
determined by the Beam community. The Beam community might choose to close a
JIRA if there are known issues with the available recent releases. The tool
will reopen such a closed JIRA to inform owners if Beam reaches the JIRA's
'fix version' or if 3 new versions of the dependency have been released
since the JIRA was closed.
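A minimal sketch of the reopen rule in item 2, in Python. The `DependencyJira` shape, the helper names, and the assumption that a set 'fix version' takes precedence over the 3-new-versions rule are illustrative assumptions, not the actual tooling:

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class DependencyJira:
    dependency: str
    # Beam release the upgrade was deferred to, if the community set one.
    fix_version: Optional[str] = None


def _parse(version: str):
    # Numeric tuple compare of dotted versions, e.g. '2.10.0' > '2.9.0'.
    return tuple(int(p) for p in version.split('.'))


def should_reopen(jira: DependencyJira,
                  current_beam_version: str,
                  versions_released_since_close: Sequence[str]) -> bool:
    """Decide whether a closed dependency-upgrade JIRA should be reopened."""
    if jira.fix_version is not None:
        # A 'fix version' was set: reopen once Beam reaches that release cycle.
        return _parse(current_beam_version) >= _parse(jira.fix_version)
    # No 'fix version': reopen after 3 new dependency versions have shipped.
    return len(versions_released_since_close) >= 3
```

For example, under this sketch a JIRA deferred to Beam 3.0.0 would stay closed while Beam is on 2.8.0, regardless of how many dependency versions have shipped in the meantime.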

Thank you.

Regards.
Yifan

On Wed, Sep 5, 2018 at 2:14 PM Yifan Zou <yi...@google.com> wrote:

> +1 on the jira "fix version".
> Release frequencies vary across dependencies, so reopening issues based
> only on the number of versions released since the Jira closing date might
> not be appropriate. We could check the fix versions first, and if
> specified, reopen the issue in that version's release cycle; if not,
> follow Cham's proposal (2).
>
> On Wed, Sep 5, 2018 at 1:59 PM Chamikara Jayalath <ch...@google.com>
> wrote:
>
>>
>>
>> On Wed, Sep 5, 2018 at 12:50 PM Tim Robertson <ti...@gmail.com>
>> wrote:
>>
>>> Thank you Cham, and everyone for contributing
>>>
>>> Sorry for the slow reply to a thread I started, but I've been swamped on non
>>> Beam projects.
>>>
>>> KafkaIO's policy of 'let the user decide exact version at runtime' has
>>>> been quite useful so far. How feasible is that for other connectors?
>>>
>>>
>>> I presume shimming might be needed in a few places but it's certainly
>>> something we might want to explore more. I'll look into KafkaIO.
>>>
>>> On Cham's proposal :
>>>
>>> (1) +0.5. We can always then opt to either assign or take ownership of
>>> an issue, although I am also happy to stick with the owners model - it
>>> prompted me to investigate and resulted in this thread.
>>>
>>> (2) I think this makes sense.
>>> A bot informing us that we're falling behind versions is immensely
>>> useful as long as we can link issues to others which might have a wider
>>> discussion (remember many dependencies need to be treated together such as
>>> "Support Hadoop 3.0.x" or "Support HBase 2.x"). Would it make sense to let
>>> owners use the Jira "fix versions" to put in future release to inform the
>>> bot when it should start alerting again?
>>>
>>
>> I think this makes sense. Setting a "fix version" will be especially
>> useful for dependency changes that result in API changes that have to be
>> postponed until the next major version of Beam.
>>
>> On grouping, I believe we already group JIRAs into tasks and sub-tasks
>> based on group ids of dependencies. I suppose it will not be too hard to
>> close multiple sub-tasks with the same reasoning.
>>
>>
>>>
>>>
>>> On Wed, Sep 5, 2018 at 3:18 AM Yifan Zou <yi...@google.com> wrote:
>>>
>>>> Thanks Cham for putting this together. Also, after modifying the
>>>> dependency tool based on the policy above, we will close all existing JIRA
>>>> issues that prevent creating duplicate bugs and stop pushing assignees to
>>>> upgrade dependencies with old bugs.
>>>>
>>>> Please let us know if you have any comments on the revised policy in
>>>> Cham's email.
>>>>
>>>> Thanks all.
>>>>
>>>> Regards.
>>>> Yifan Zou
>>>>
>>>> On Tue, Sep 4, 2018 at 5:35 PM Chamikara Jayalath <ch...@google.com>
>>>> wrote:
>>>>
>>>>> Based on this email thread and offline feedback from several folks,
>>>>> the current concerns regarding dependency upgrade policy and tooling seem
>>>>> to be the following.
>>>>>
>>>>> (1) We have to be careful when upgrading dependencies. For example, we
>>>>> should not create JIRAs for upgrading to dependency versions that have
>>>>> known issues.
>>>>>
>>>>> (2) Dependency owners list can get stale. Somebody who is interested
>>>>> in upgrading a dependency today might not be interested in the same task in
>>>>> six months. Responsibility of upgrading a dependency should lie with the
>>>>> community instead of pre-identified owner(s).
>>>>>
>>>>> On the other hand we do not want Beam to significantly fall behind
>>>>> when it comes to dependencies. We should upgrade dependencies whenever it
>>>>> makes sense. This allows us to offer a more up-to-date system and to make
>>>>> things easy for users that deploy Beam along with other systems.
>>>>>
>>>>> I discussed these issues with Yifan and we would like to suggest
>>>>> following changes to current policy and tooling that might help alleviate
>>>>> some of the concerns.
>>>>>
>>>>> (1) Instead of a dependency "owners" list we will be maintaining an
>>>>> "interested parties" list. When we create a JIRA for a dependency we will
>>>>> not assign it to an owner but rather we will CC all the folks that
>>>>> mentioned that they will be interested in receiving updates related to that
>>>>> dependency. The hope is that some of the interested parties will also put
>>>>> forward the effort to upgrade dependencies they are interested in, but the
>>>>> responsibility of upgrading dependencies lies with the community as a whole.
>>>>>
>>>>>  (2) We will be creating JIRAs for upgrading individual dependencies,
>>>>> not for upgrading to specific versions of those dependencies. For example,
>>>>> if a given dependency X is three minor versions or a year behind, we will
>>>>> create a JIRA for upgrading it. But the specific version to upgrade to
>>>>> has to be determined by the Beam community. The Beam community might choose
>>>>> to close a JIRA if there are known issues with the available recent
>>>>> releases. The tool may reopen such a closed JIRA in the future if new
>>>>> information becomes available (for example, 3 new versions have been
>>>>> released since the JIRA was closed).
>>>>>
>>>>> Thoughts ?
>>>>>
>>>>> Thanks,
>>>>> Cham
>>>>>
>>>>> On Tue, Aug 28, 2018 at 1:51 PM Chamikara Jayalath <
>>>>> chamikara@google.com> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Aug 28, 2018 at 12:05 PM Thomas Weise <th...@apache.org> wrote:
>>>>>>
>>>>>>> I think there is an invalid assumption being made in this
>>>>>>> discussion, which is that most projects comply with semantic versioning.
>>>>>>> The reality in the open source big data space is unfortunately quite
>>>>>>> different. Ismaël has well characterized the situation and HBase isn't an
>>>>>>> exception. Another indicator of the scale of the problem is the
>>>>>>> extensive amount of shading used in Beam and other projects. It wouldn't
>>>>>>> be necessary if semver compliance were something we could rely on.
>>>>>>>
>>>>>>> Our recent Flink upgrade broke user(s). And we noticed a backward
>>>>>>> incompatible Flink change that affected the portable Flink runner even
>>>>>>> between patches.
>>>>>>>
>>>>>>> Many projects (including Beam) guarantee compatibility only for a
>>>>>>> subset of the public API. Sometimes a REST API is not covered, sometimes
>>>>>>> protocols that are not strictly internal change, and so on, all of which
>>>>>>> can break users, despite the public API remaining "compatible". As much
>>>>>>> as I would love to rely on the version number to tell me whether an
>>>>>>> upgrade is safe or not, that's not practically possible.
>>>>>>>
>>>>>>> Furthermore, we need to proceed with caution forcing upgrades on
>>>>>>> users that host the target systems. To stay with the Flink example, moving
>>>>>>> Beam from 1.4 to 1.5 is actually a major change to some, because they now
>>>>>>> have to upgrade their Flink clusters/deployments to be able to use the new
>>>>>>> version of Beam.
>>>>>>>
>>>>>>> Upgrades need to be done with caution and may require extensive
>>>>>>> verification beyond what our automation provides. I think the Spark change
>>>>>>> from 1.x to 2.x and also the JDK 1.8 change were good examples, they
>>>>>>> provided the community a window to provide feedback and influence the
>>>>>>> change.
>>>>>>>
>>>>>>
>>>>>> Thanks for the clarification.
>>>>>>
>>>>>> Current policy indeed requests caution and explicit checks when
>>>>>> upgrading all dependencies (including minor and patch versions) but
>>>>>> language might have to be updated to emphasize your concerns.
>>>>>>
>>>>>> Here's the current text.
>>>>>>
>>>>>> "Beam releases adhere to
>>>>>> <https://beam.apache.org/get-started/downloads/> semantic
>>>>>> versioning. Hence, community members should take care when updating
>>>>>> dependencies. Minor version updates to dependencies should be backwards
>>>>>> compatible in most cases. Some updates to dependencies though may result in
>>>>>> backwards incompatible API or functionality changes to Beam. PR reviewers
>>>>>> and committers should take care to detect any dependency updates that could
>>>>>> potentially introduce backwards incompatible changes to Beam before merging
>>>>>> and PRs that update dependencies should include a statement regarding this
>>>>>> verification in the form of a PR comment. Dependency updates that result in
>>>>>> backwards incompatible changes to non-experimental features of Beam should
>>>>>> be held till next major version release of Beam. Any exceptions to this
>>>>>> policy should only occur in extreme cases (for example, due to a security
>>>>>> vulnerability of an existing dependency that is only fixed in a subsequent
>>>>>> major version) and should be discussed in the Beam dev list. Note that
>>>>>> backwards incompatible changes to experimental features may be introduced
>>>>>> in a minor version release."
>>>>>>
>>>>>> Also, are there any other steps we can take to make sure that Beam
>>>>>> dependencies are not too old while offering a stable system ? Note that
>>>>>> having a lot of legacy dependencies that do not get upgraded regularly can
>>>>>> also result in user pain and Beam being unusable for certain users who run
>>>>>> into dependency conflicts when using Beam along with other systems (which
>>>>>> will increase the amount of shading/vendoring we have to do).
>>>>>>
>>>>>> Please note that current tooling does not force upgrades or
>>>>>> automatically upgrade dependencies. It simply creates JIRAs that can be
>>>>>> closed with a reason if needed. For Python SDK though we have version
>>>>>> ranges in place for most dependencies [1] so these dependencies get updated
>>>>>> automatically according to the corresponding ranges.
>>>>>> https://github.com/apache/beam/blob/master/sdks/python/setup.py#L103
>>>>>>
>>>>>> Thanks,
>>>>>> Cham
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Thomas
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Aug 28, 2018 at 11:29 AM Raghu Angadi <ra...@google.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Thanks for the IO versioning summary.
>>>>>>>> KafkaIO's policy of 'let the user decide exact version at runtime'
>>>>>>>> has been quite useful so far. How feasible is that for other connectors?
>>>>>>>>
>>>>>>>> Also, KafkaIO does not limit itself to minimum features available
>>>>>>>> across all the supported versions. Some of the features (e.g. server side
>>>>>>>> timestamps) are disabled based on the runtime Kafka version. The unit tests
>>>>>>>> currently run with a single recent version. Integration tests could certainly
>>>>>>>> use multiple versions. With some more effort in writing tests, we could
>>>>>>>> make multiple versions of the unit tests.
>>>>>>>>
>>>>>>>> Raghu.
>>>>>>>>
>>>>>>>> IO versioning
>>>>>>>>> * Elasticsearch. We delayed the move to version 6 until we heard of
>>>>>>>>> more active users needing it (more deployments). We support 2.x and
>>>>>>>>> 5.x (but 2.x went recently EOL). Support for 6.x is in progress.
>>>>>>>>> * SolrIO, stable version is 7.x, LTS is 6.x. We support only 5.x
>>>>>>>>> because most big data distributions still use 5.x (however 5.x has
>>>>>>>>> been EOL).
>>>>>>>>> * KafkaIO uses version 1.x but Kafka recently moved to 2.x, however
>>>>>>>>> most of the deployments of Kafka use earlier versions than 1.x.
>>>>>>>>> This
>>>>>>>>> module uses a single version with the kafka client as a provided
>>>>>>>>> dependency and so far it works (but we don’t have multi version
>>>>>>>>> tests).
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Aug 28, 2018 at 8:38 AM Ismaël Mejía <ie...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I think we should refine the strategy on dependencies discussed
>>>>>>>>> recently. Sorry to come late with this (I did not follow closely
>>>>>>>>> the
>>>>>>>>> previous discussion), but the current approach is clearly not in
>>>>>>>>> line
>>>>>>>>> with the industry reality (at least not for IO connectors + Hadoop
>>>>>>>>> +
>>>>>>>>> Spark/Flink use).
>>>>>>>>>
>>>>>>>>> A really proactive approach to dependency updates is a good
>>>>>>>>> practice
>>>>>>>>> for the core dependencies we have e.g. Guava, Bytebuddy, Avro,
>>>>>>>>> Protobuf, etc, and of course for the case of cloud based IOs e.g.
>>>>>>>>> GCS,
>>>>>>>>> Bigquery, AWS S3, etc. However when we talk about self hosted data
>>>>>>>>> sources or processing systems this gets more complicated and I
>>>>>>>>> think
>>>>>>>>> we should be more flexible and do this case by case (and remove
>>>>>>>>> these
>>>>>>>>> from the auto update email reminder).
>>>>>>>>>
>>>>>>>>> Some open source projects have at least three maintained versions:
>>>>>>>>> - LTS – maps to what most of the people have installed (or the big
>>>>>>>>> data distributions use) e.g. HBase 1.1.x, Hadoop 2.6.x
>>>>>>>>> - Stable – current recommended version. HBase 1.4.x, Hadoop 2.8.x
>>>>>>>>> - Next – latest release. HBase 2.1.x Hadoop 3.1.x
>>>>>>>>>
>>>>>>>>> Following the most recent versions can be good to be close to the
>>>>>>>>> current development of other projects and some of the fixes, but
>>>>>>>>> these
>>>>>>>>> versions are commonly not deployed for most users and adopting a
>>>>>>>>> LTS
>>>>>>>>> or stable only approach won't satisfy all cases either. To
>>>>>>>>> understand
>>>>>>>>> why this is complex let’s see some historical issues:
>>>>>>>>>
>>>>>>>>> IO versioning
>>>>>>>>> * Elasticsearch. We delayed the move to version 6 until we heard of
>>>>>>>>> more active users needing it (more deployments). We support 2.x and
>>>>>>>>> 5.x (but 2.x went recently EOL). Support for 6.x is in progress.
>>>>>>>>> * SolrIO, stable version is 7.x, LTS is 6.x. We support only 5.x
>>>>>>>>> because most big data distributions still use 5.x (however 5.x has
>>>>>>>>> been EOL).
>>>>>>>>> * KafkaIO uses version 1.x but Kafka recently moved to 2.x, however
>>>>>>>>> most of the deployments of Kafka use earlier versions than 1.x.
>>>>>>>>> This
>>>>>>>>> module uses a single version with the kafka client as a provided
>>>>>>>>> dependency and so far it works (but we don’t have multi version
>>>>>>>>> tests).
>>>>>>>>>
>>>>>>>>> Runners versioning
>>>>>>>>> * The move from Spark 1 to Spark 2 was decided after weighing the
>>>>>>>>> maintenance burden of supporting multiple versions against the cost
>>>>>>>>> of introducing breaking changes.
>>>>>>>>> This is a rare case but also with consequences. This dependency is
>>>>>>>>> provided but we don't actively test issues on version migration.
>>>>>>>>> * Flink moved to version 1.5, introducing an incompatibility in
>>>>>>>>> checkpointing (discussed recently, with no consensus yet on how to
>>>>>>>>> handle it).
>>>>>>>>>
>>>>>>>>> As you can see, it seems really hard to have a solution that fits
>>>>>>>>> all
>>>>>>>>> cases. Probably the only rule that I see from this list is that we
>>>>>>>>> should upgrade versions for connectors that have been deprecated or
>>>>>>>>> have reached EOL (e.g. Solr 5.x, Elasticsearch 2.x).
>>>>>>>>>
>>>>>>>>> For the case of the provided dependencies I wonder if as part of
>>>>>>>>> the
>>>>>>>>> tests we should provide tests with multiple versions (note that
>>>>>>>>> this
>>>>>>>>> is currently blocked by BEAM-4087).
>>>>>>>>>
>>>>>>>>> Any other ideas or opinions on how we can handle this? What do
>>>>>>>>> other people in the community think? (Notice that this can relate
>>>>>>>>> to the ongoing LTS discussion.)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Aug 28, 2018 at 10:44 AM Tim Robertson
>>>>>>>>> <ti...@gmail.com> wrote:
>>>>>>>>> >
>>>>>>>>> > Hi folks,
>>>>>>>>> >
>>>>>>>>> > I'd like to revisit the discussion around our versioning policy
>>>>>>>>> specifically for the Hadoop ecosystem and make sure we are aware of the
>>>>>>>>> implications.
>>>>>>>>> >
>>>>>>>>> > As an example our policy today would have us on HBase 2.1 and I
>>>>>>>>> have reminders to address this.
>>>>>>>>> >
>>>>>>>>> > However, currently the versions of HBase in the major hadoop
>>>>>>>>> distros are:
>>>>>>>>> >
>>>>>>>>> >  - Cloudera 5 on HBase 1.2 (Cloudera 6 is 2.1 but is only in
>>>>>>>>> beta)
>>>>>>>>> >  - Hortonworks HDP3 on HBase 2.0 (only recently released so we
>>>>>>>>> can assume is not widely adopted)
>>>>>>>>> >  - AWS EMR HBase on 1.4
>>>>>>>>> >
>>>>>>>>> > On the versioning I think we might need a more nuanced approach
>>>>>>>>> to ensure that we target real communities of existing and potential users.
>>>>>>>>> Enterprise users need to stick to the supported versions in the
>>>>>>>>> distributions to maintain support contracts from the vendors.
>>>>>>>>> >
>>>>>>>>> > Should our versioning policy have more room to consider on a
>>>>>>>>> case by case basis?
>>>>>>>>> >
>>>>>>>>> > For Hadoop might we benefit from a strategy on which community
>>>>>>>>> of users Beam is targeting?
>>>>>>>>> >
>>>>>>>>> > (OT: I'm collecting some thoughts on what we might consider to
>>>>>>>>> target enterprise hadoop users - kerberos on all relevant IO, performance,
>>>>>>>>> leaking beyond encryption zones with temporary files etc)
>>>>>>>>> >
>>>>>>>>> > Thanks,
>>>>>>>>> > Tim
>>>>>>>>>
>>>>>>>>

Re: [DISCUSS] Versioning, Hadoop related dependencies and enterprise users

Posted by Yifan Zou <yi...@google.com>.
+1 on the jira "fix version".
Release frequencies vary across dependencies, so reopening issues based
only on the number of versions released since the Jira closing date might
not be appropriate. We could check the fix versions first, and if
specified, reopen the issue in that version's release cycle; if not,
follow Cham's proposal (2).
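The "fall behind" check that triggers filing a JIRA in the first place (Cham's proposal (2), quoted below: three minor versions or a year behind) could be sketched roughly like this. The function names and the simplification that any newer major release counts as outdated are assumptions, not the real tooling:

```python
from datetime import date, timedelta


def _major_minor(version: str):
    parts = version.split('.')
    return int(parts[0]), int(parts[1])


def is_outdated(current: str, latest: str,
                current_release_date: date, today: date) -> bool:
    """True when an upgrade JIRA should be filed: the pinned version is
    three minor versions behind the latest release, or a year old."""
    cur_major, cur_minor = _major_minor(current)
    new_major, new_minor = _major_minor(latest)
    # Simplification: any newer major release counts as outdated.
    versions_behind = (new_major > cur_major or
                       (new_major == cur_major and new_minor - cur_minor >= 3))
    year_behind = today - current_release_date >= timedelta(days=365)
    return versions_behind or year_behind
```

For example, a dependency pinned at 2.6.0 while 2.9.1 is out would be flagged, while one pinned at 2.7.0 with 2.8.0 out and a release only three months old would not.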

On Wed, Sep 5, 2018 at 1:59 PM Chamikara Jayalath <ch...@google.com>
wrote:

>
>
> On Wed, Sep 5, 2018 at 12:50 PM Tim Robertson <ti...@gmail.com>
> wrote:
>
>> Thank you Cham, and everyone for contributing
>>
>> Sorry for the slow reply to a thread I started, but I've been swamped on non
>> Beam projects.
>>
>> KafkaIO's policy of 'let the user decide exact version at runtime' has
>>> been quite useful so far. How feasible is that for other connectors?
>>
>>
>> I presume shimming might be needed in a few places but it's certainly
>> something we might want to explore more. I'll look into KafkaIO.
>>
>> On Cham's proposal :
>>
>> (1) +0.5. We can always then opt to either assign or take ownership of an
>> issue, although I am also happy to stick with the owners model - it
>> prompted me to investigate and resulted in this thread.
>>
>> (2) I think this makes sense.
>> A bot informing us that we're falling behind versions is immensely useful
>> as long as we can link issues to others which might have a wider discussion
>> (remember many dependencies need to be treated together such as "Support
>> Hadoop 3.0.x" or "Support HBase 2.x"). Would it make sense to let owners
>> use the Jira "fix versions" to put in future release to inform the bot when
>> it should start alerting again?
>>
>
> I think this makes sense. Setting a "fix version" will be especially useful
> for dependency changes that result in API changes that have to be postponed
> until the next major version of Beam.
>
> On grouping, I believe we already group JIRAs into tasks and sub-tasks
> based on group ids of dependencies. I suppose it will not be too hard to
> close multiple sub-tasks with the same reasoning.
>
>
>>
>>
>> On Wed, Sep 5, 2018 at 3:18 AM Yifan Zou <yi...@google.com> wrote:
>>
>>> Thanks Cham for putting this together. Also, after modifying the
>>> dependency tool based on the policy above, we will close all existing JIRA
>>> issues that prevent creating duplicate bugs and stop pushing assignees to
>>> upgrade dependencies with old bugs.
>>>
>>> Please let us know if you have any comments on the revised policy in
>>> Cham's email.
>>>
>>> Thanks all.
>>>
>>> Regards.
>>> Yifan Zou
>>>
>>> On Tue, Sep 4, 2018 at 5:35 PM Chamikara Jayalath <ch...@google.com>
>>> wrote:
>>>
>>>> Based on this email thread and offline feedback from several folks,
>>>> the current concerns regarding dependency upgrade policy and tooling seem
>>>> to be the following.
>>>>
>>>> (1) We have to be careful when upgrading dependencies. For example, we
>>>> should not create JIRAs for upgrading to dependency versions that have
>>>> known issues.
>>>>
>>>> (2) Dependency owners list can get stale. Somebody who is interested in
>>>> upgrading a dependency today might not be interested in the same task in
>>>> six months. Responsibility of upgrading a dependency should lie with the
>>>> community instead of pre-identified owner(s).
>>>>
>>>> On the other hand we do not want Beam to significantly fall behind when
>>>> it comes to dependencies. We should upgrade dependencies whenever it makes
>>>> sense. This allows us to offer a more up-to-date system and to make things
>>>> easy for users that deploy Beam along with other systems.
>>>>
>>>> I discussed these issues with Yifan and we would like to suggest
>>>> following changes to current policy and tooling that might help alleviate
>>>> some of the concerns.
>>>>
>>>> (1) Instead of a dependency "owners" list we will be maintaining an
>>>> "interested parties" list. When we create a JIRA for a dependency we will
>>>> not assign it to an owner but rather we will CC all the folks that
>>>> mentioned that they will be interested in receiving updates related to that
>>>> dependency. The hope is that some of the interested parties will also put
>>>> forward the effort to upgrade dependencies they are interested in, but the
>>>> responsibility of upgrading dependencies lies with the community as a whole.
>>>>
>>>>  (2) We will be creating JIRAs for upgrading individual dependencies,
>>>> not for upgrading to specific versions of those dependencies. For example,
>>>> if a given dependency X is three minor versions or a year behind, we will
>>>> create a JIRA for upgrading it. But the specific version to upgrade to
>>>> has to be determined by the Beam community. The Beam community might choose
>>>> to close a JIRA if there are known issues with the available recent
>>>> releases. The tool may reopen such a closed JIRA in the future if new
>>>> information becomes available (for example, 3 new versions have been
>>>> released since the JIRA was closed).
>>>>
>>>> Thoughts ?
>>>>
>>>> Thanks,
>>>> Cham
>>>>
>>>> On Tue, Aug 28, 2018 at 1:51 PM Chamikara Jayalath <
>>>> chamikara@google.com> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Tue, Aug 28, 2018 at 12:05 PM Thomas Weise <th...@apache.org> wrote:
>>>>>
>>>>>> I think there is an invalid assumption being made in this discussion,
>>>>>> which is that most projects comply with semantic versioning. The reality in
>>>>>> the open source big data space is unfortunately quite different. Ismaël has
>>>>>> well characterized the situation and HBase isn't an exception. Another
>>>>>> indicator of the scale of the problem is the extensive amount of shading
>>>>>> used in Beam and other projects. It wouldn't be necessary if semver
>>>>>> compliance were something we could rely on.
>>>>>>
>>>>>> Our recent Flink upgrade broke user(s). And we noticed a backward
>>>>>> incompatible Flink change that affected the portable Flink runner even
>>>>>> between patches.
>>>>>>
>>>>>> Many projects (including Beam) guarantee compatibility only for a
>>>>>> subset of the public API. Sometimes a REST API is not covered, sometimes
>>>>>> protocols that are not strictly internal change, and so on, all of which
>>>>>> can break users, despite the public API remaining "compatible". As much
>>>>>> as I would love to rely on the version number to tell me whether an
>>>>>> upgrade is safe or not, that's not practically possible.
>>>>>>
>>>>>> Furthermore, we need to proceed with caution forcing upgrades on
>>>>>> users that host the target systems. To stay with the Flink example, moving
>>>>>> Beam from 1.4 to 1.5 is actually a major change to some, because they now
>>>>>> have to upgrade their Flink clusters/deployments to be able to use the new
>>>>>> version of Beam.
>>>>>>
>>>>>> Upgrades need to be done with caution and may require extensive
>>>>>> verification beyond what our automation provides. I think the Spark change
>>>>>> from 1.x to 2.x and also the JDK 1.8 change were good examples, they
>>>>>> provided the community a window to provide feedback and influence the
>>>>>> change.
>>>>>>
>>>>>
>>>>> Thanks for the clarification.
>>>>>
>>>>> Current policy indeed requests caution and explicit checks when
>>>>> upgrading all dependencies (including minor and patch versions) but
>>>>> language might have to be updated to emphasize your concerns.
>>>>>
>>>>> Here's the current text.
>>>>>
>>>>> "Beam releases adhere to
>>>>> <https://beam.apache.org/get-started/downloads/> semantic versioning.
>>>>> Hence, community members should take care when updating dependencies. Minor
>>>>> version updates to dependencies should be backwards compatible in most
>>>>> cases. Some updates to dependencies though may result in backwards
>>>>> incompatible API or functionality changes to Beam. PR reviewers and
>>>>> committers should take care to detect any dependency updates that could
>>>>> potentially introduce backwards incompatible changes to Beam before merging
>>>>> and PRs that update dependencies should include a statement regarding this
>>>>> verification in the form of a PR comment. Dependency updates that result in
>>>>> backwards incompatible changes to non-experimental features of Beam should
>>>>> be held till next major version release of Beam. Any exceptions to this
>>>>> policy should only occur in extreme cases (for example, due to a security
>>>>> vulnerability of an existing dependency that is only fixed in a subsequent
>>>>> major version) and should be discussed in the Beam dev list. Note that
>>>>> backwards incompatible changes to experimental features may be introduced
>>>>> in a minor version release."
>>>>>
>>>>> Also, are there any other steps we can take to make sure that Beam
>>>>> dependencies are not too old while offering a stable system ? Note that
>>>>> having a lot of legacy dependencies that do not get upgraded regularly can
>>>>> also result in user pain and Beam being unusable for certain users who run
>>>>> into dependency conflicts when using Beam along with other systems (which
>>>>> will increase the amount of shading/vendoring we have to do).
>>>>>
>>>>> Please note that current tooling does not force upgrades or
>>>>> automatically upgrade dependencies. It simply creates JIRAs that can be
>>>>> closed with a reason if needed. For Python SDK though we have version
>>>>> ranges in place for most dependencies [1] so these dependencies get updated
>>>>> automatically according to the corresponding ranges.
>>>>> https://github.com/apache/beam/blob/master/sdks/python/setup.py#L103
>>>>>
>>>>> Thanks,
>>>>> Cham
>>>>>
>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Thomas
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Aug 28, 2018 at 11:29 AM Raghu Angadi <ra...@google.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Thanks for the IO versioning summary.
>>>>>>> KafkaIO's policy of 'let the user decide exact version at runtime'
>>>>>>> has been quite useful so far. How feasible is that for other connectors?
>>>>>>>
>>>>>>> Also, KafkaIO does not limit itself to minimum features available
>>>>>>> across all the supported versions. Some of the features (e.g. server side
>>>>>>> timestamps) are disabled based on the runtime Kafka version. The unit tests
>>>>>>> currently run with a single recent version. Integration tests could certainly
>>>>>>> use multiple versions. With some more effort in writing tests, we could
>>>>>>> make multiple versions of the unit tests.
>>>>>>>
>>>>>>> Raghu.
>>>>>>>
>>>>>>> IO versioning
>>>>>>>> * Elasticsearch. We delayed the move to version 6 until we heard of
>>>>>>>> more active users needing it (more deployments). We support 2.x and
>>>>>>>> 5.x (but 2.x went recently EOL). Support for 6.x is in progress.
>>>>>>>> * SolrIO, stable version is 7.x, LTS is 6.x. We support only 5.x
>>>>>>>> because most big data distributions still use 5.x (however 5.x has
>>>>>>>> been EOL).
>>>>>>>> * KafkaIO uses version 1.x but Kafka recently moved to 2.x, however
>>>>>>>> most of the deployments of Kafka use earlier versions than 1.x. This
>>>>>>>> module uses a single version with the kafka client as a provided
>>>>>>>> dependency and so far it works (but we don’t have multi version
>>>>>>>> tests).
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Aug 28, 2018 at 8:38 AM Ismaël Mejía <ie...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I think we should refine the strategy on dependencies discussed
>>>>>>>> recently. Sorry to come late with this (I did not follow closely the
>>>>>>>> previous discussion), but the current approach is clearly not in
>>>>>>>> line
>>>>>>>> with the industry reality (at least not for IO connectors + Hadoop +
>>>>>>>> Spark/Flink use).
>>>>>>>>
>>>>>>>> A really proactive approach to dependency updates is a good practice
>>>>>>>> for the core dependencies we have e.g. Guava, Bytebuddy, Avro,
>>>>>>>> Protobuf, etc, and of course for the case of cloud based IOs e.g.
>>>>>>>> GCS,
>>>>>>>> Bigquery, AWS S3, etc. However when we talk about self hosted data
>>>>>>>> sources or processing systems this gets more complicated and I think
>>>>>>>> we should be more flexible and do this case by case (and remove
>>>>>>>> these
>>>>>>>> from the auto update email reminder).
>>>>>>>>
>>>>>>>> Some open source projects have at least three maintained versions:
>>>>>>>> - LTS – maps to what most of the people have installed (or the big
>>>>>>>> data distributions use) e.g. HBase 1.1.x, Hadoop 2.6.x
>>>>>>>> - Stable – current recommended version. HBase 1.4.x, Hadoop 2.8.x
>>>>>>>> - Next – latest release. HBase 2.1.x Hadoop 3.1.x
>>>>>>>>
>>>>>>>> Following the most recent versions can be good to be close to the
>>>>>>>> current development of other projects and some of the fixes, but
>>>>>>>> these
>>>>>>>> versions are commonly not deployed for most users and adopting a LTS
>>>>>>>> or stable only approach won't satisfy all cases either. To
>>>>>>>> understand
>>>>>>>> why this is complex let’s see some historical issues:
>>>>>>>>
>>>>>>>> IO versioning
>>>>>>>> * Elasticsearch. We delayed the move to version 6 until we heard of
>>>>>>>> more active users needing it (more deployments). We support 2.x and
>>>>>>>> 5.x (but 2.x recently went EOL). Support for 6.x is in progress.
>>>>>>>> * SolrIO: the stable version is 7.x, LTS is 6.x. We support only 5.x
>>>>>>>> because most big data distributions still use 5.x (however, 5.x has
>>>>>>>> reached EOL).
>>>>>>>> * KafkaIO uses version 1.x, but Kafka recently moved to 2.x; however,
>>>>>>>> most Kafka deployments use versions earlier than 1.x. This module
>>>>>>>> uses a single version, with the Kafka client as a provided
>>>>>>>> dependency, and so far it works (but we don’t have multi-version
>>>>>>>> tests).
>>>>>>>>
>>>>>>>> Runners versioning
>>>>>>>> * The move from Spark 1 to Spark 2 was decided after weighing the
>>>>>>>> cost of maintaining support for multiple versions against the
>>>>>>>> breaking changes of moving to a single one. This is a rare case, but
>>>>>>>> also one with consequences. This dependency is provided, but we don't
>>>>>>>> actively test issues on version migration.
>>>>>>>> * Flink moved to version 1.5, introducing an incompatibility in
>>>>>>>> checkpointing (discussed recently, with no consensus yet on how to
>>>>>>>> handle it).
>>>>>>>>
>>>>>>>> As you can see, it seems really hard to have a solution that fits
>>>>>>>> all cases. Probably the only rule that I see from this list is that
>>>>>>>> we should upgrade versions for connectors that have been deprecated
>>>>>>>> or have reached EOL (e.g. Solr 5.x, Elasticsearch 2.x).
>>>>>>>>
>>>>>>>> For the case of the provided dependencies I wonder if as part of the
>>>>>>>> tests we should provide tests with multiple versions (note that this
>>>>>>>> is currently blocked by BEAM-4087).
>>>>>>>>
>>>>>>>> Any other ideas or opinions on how we can handle this? What do
>>>>>>>> other people in the community think? (Notice that this relates to
>>>>>>>> the ongoing LTS discussion.)
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Aug 28, 2018 at 10:44 AM Tim Robertson
>>>>>>>> <ti...@gmail.com> wrote:
>>>>>>>> >
>>>>>>>> > Hi folks,
>>>>>>>> >
>>>>>>>> > I'd like to revisit the discussion around our versioning policy
>>>>>>>> specifically for the Hadoop ecosystem and make sure we are aware of the
>>>>>>>> implications.
>>>>>>>> >
>>>>>>>> > As an example our policy today would have us on HBase 2.1 and I
>>>>>>>> have reminders to address this.
>>>>>>>> >
>>>>>>>> > However, currently the versions of HBase in the major hadoop
>>>>>>>> distros are:
>>>>>>>> >
>>>>>>>> >  - Cloudera 5 on HBase 1.2 (Cloudera 6 is 2.1 but is only in beta)
>>>>>>>> >  - Hortonworks HDP3 on HBase 2.0 (only recently released, so we
>>>>>>>> can assume it is not widely adopted)
>>>>>>>> >  - AWS EMR HBase on 1.4
>>>>>>>> >
>>>>>>>> > On the versioning I think we might need a more nuanced approach
>>>>>>>> to ensure that we target real communities of existing and potential users.
>>>>>>>> Enterprise users need to stick to the supported versions in the
>>>>>>>> distributions to maintain support contracts from the vendors.
>>>>>>>> >
>>>>>>>> > Should our versioning policy leave more room to consider things
>>>>>>>> on a case-by-case basis?
>>>>>>>> >
>>>>>>>> > For Hadoop might we benefit from a strategy on which community of
>>>>>>>> users Beam is targeting?
>>>>>>>> >
>>>>>>>> > (OT: I'm collecting some thoughts on what we might consider to
>>>>>>>> target enterprise Hadoop users - Kerberos on all relevant IO, performance,
>>>>>>>> leaking beyond encryption zones with temporary files, etc.)
>>>>>>>> >
>>>>>>>> > Thanks,
>>>>>>>> > Tim
>>>>>>>>
>>>>>>>

Re: [DISCUSS] Versioning, Hadoop related dependencies and enterprise users

Posted by Chamikara Jayalath <ch...@google.com>.
On Wed, Sep 5, 2018 at 12:50 PM Tim Robertson <ti...@gmail.com>
wrote:

> Thank you Cham, and everyone for contributing
>
> Sorry for the slow reply to a thread I started, but I've been swamped on
> non-Beam projects.
>
> KafkaIO's policy of 'let the user decide exact version at runtime' has
>> been quite useful so far. How feasible is that for other connectors?
>
>
> I presume shimming might be needed in a few places but it's certainly
> something we might want to explore more. I'll look into KafkaIO.
>
> On Cham's proposal :
>
> (1) +0.5. We can always then opt to either assign or take ownership of an
> issue, although I am also happy to stick with the owners model - it
> prompted me to investigate and resulted in this thread.
>
> (2) I think this makes sense.
> A bot informing us that we're falling behind versions is immensely useful
> as long as we can link issues to others that might have a wider discussion
> (remember many dependencies need to be treated together, such as "Support
> Hadoop 3.0.x" or "Support HBase 2.x"). Would it make sense to let owners
> use the Jira "fix versions" field to put in a future release, informing the
> bot when it should start alerting again?
>

I think this makes sense. Setting a "fix version" will be especially useful
for dependency changes that result in API changes that have to be postponed
until the next major version of Beam.

On grouping, I believe we already group JIRAs into tasks and sub-tasks
based on the group IDs of dependencies. I suppose it will not be too hard to
close multiple sub-tasks with the same reasoning.
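To make the "fix version" idea concrete, here is a rough sketch of how the tool might suppress alerts until the deferred-to release ships (the function and its semantics are hypothetical, not the actual dependency tool):

```python
def should_alert(current_release, fix_version):
    """Return True when the bot should (re)open an upgrade JIRA.

    A JIRA whose "fix version" names a future Beam release stays quiet
    until that release is out. Hypothetical sketch of the proposal,
    not the real tooling.
    """
    def parse(version):
        # "2.7.0" -> (2, 7, 0) for simple tuple comparison.
        return tuple(int(part) for part in version.split("."))

    if fix_version is None:
        return True  # no deferral recorded, keep alerting
    return parse(current_release) >= parse(fix_version)

# A JIRA deferred to Beam 2.8.0 is silent while 2.7.0 is current ...
print(should_alert("2.7.0", "2.8.0"))  # False
# ... and starts alerting again once 2.8.0 ships.
print(should_alert("2.8.0", "2.8.0"))  # True
```

Whether the bot reads the Jira field directly or a side file is an open tooling question; the point is only that a recorded target release gives the bot an unambiguous "stay quiet until" signal.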


>
>
> On Wed, Sep 5, 2018 at 3:18 AM Yifan Zou <yi...@google.com> wrote:
>
>> Thanks Cham for putting this together. Also, after modifying the
>> dependency tool based on the policy above, we will close all existing JIRA
>> issues to prevent creating duplicate bugs, and stop pushing assignees to
>> upgrade dependencies via old bugs.
>>
>> Please let us know if you have any comments on the revised policy in
>> Cham's email.
>>
>> Thanks all.
>>
>> Regards.
>> Yifan Zou
>>
>> On Tue, Sep 4, 2018 at 5:35 PM Chamikara Jayalath <ch...@google.com>
>> wrote:
>>
>>> Based on this email thread and offline feedback from several folks,
>>> current concerns regarding dependency upgrade policy and tooling seem to
>>> be the following.
>>>
>>> (1) We have to be careful when upgrading dependencies. For example, we
>>> should not create JIRAs for upgrading to dependency versions that have
>>> known issues.
>>>
>>> (2) Dependency owners list can get stale. Somebody who is interested in
>>> upgrading a dependency today might not be interested in the same task in
>>> six months. Responsibility of upgrading a dependency should lie with the
>>> community instead of pre-identified owner(s).
>>>
>>> On the other hand, we do not want Beam to significantly fall behind when
>>> it comes to dependencies. We should upgrade dependencies whenever it makes
>>> sense. This allows us to offer a more up-to-date system and to make things
>>> easy for users that deploy Beam along with other systems.
>>>
>>> I discussed these issues with Yifan, and we would like to suggest the
>>> following changes to the current policy and tooling that might help
>>> alleviate some of the concerns.
>>>
>>> (1) Instead of a dependency "owners" list we will be maintaining an
>>> "interested parties" list. When we create a JIRA for a dependency we will
>>> not assign it to an owner; rather, we will CC all the folks that mentioned
>>> that they would be interested in receiving updates related to that
>>> dependency. The hope is that some of the interested parties will also put
>>> forward the effort to upgrade dependencies they are interested in, but the
>>> responsibility of upgrading dependencies lies with the community as a whole.
>>>
>>> (2) We will be creating JIRAs for upgrading individual dependencies, not
>>> for upgrading to specific versions of those dependencies. For example, if a
>>> given dependency X is three minor versions or a year behind, we will
>>> create a JIRA for upgrading it. But the specific version to upgrade to
>>> has to be determined by the Beam community. The Beam community might choose
>>> to close a JIRA if there are known issues with the available recent
>>> releases. The tool may reopen such a closed JIRA in the future if new
>>> information becomes available (for example, 3 new versions have been
>>> released since the JIRA was closed).
>>>
>>> Thoughts?
>>>
>>> Thanks,
>>> Cham
>>>
>>> On Tue, Aug 28, 2018 at 1:51 PM Chamikara Jayalath <ch...@google.com>
>>> wrote:
>>>
>>>>
>>>>
>>>> On Tue, Aug 28, 2018 at 12:05 PM Thomas Weise <th...@apache.org> wrote:
>>>>
>>>>> I think there is an invalid assumption being made in this discussion,
>>>>> which is that most projects comply with semantic versioning. The reality in
>>>>> the open source big data space is unfortunately quite different. Ismaël has
>>>>> characterized the situation well, and HBase isn't an exception. Another
>>>>> indicator of the scale of the problem is the extensive amount of shading
>>>>> used in Beam and other projects. It wouldn't be necessary if semver
>>>>> compliance were something we could rely on.
>>>>>
>>>>> Our recent Flink upgrade broke user(s). And we noticed a backward
>>>>> incompatible Flink change that affected the portable Flink runner even
>>>>> between patches.
>>>>>
>>>>> Many projects (including Beam) guarantee compatibility only for a
>>>>> subset of the public API. Sometimes a REST API is not covered, sometimes
>>>>> protocols that are not strictly internal change, and so on, all of which
>>>>> can break users despite the public API remaining "compatible". As much as
>>>>> I would love to rely on the version number to tell me whether an upgrade
>>>>> is safe or not, that's not practically possible.
>>>>>
>>>>> Furthermore, we need to proceed with caution when forcing upgrades on users
>>>>> that host the target systems. To stay with the Flink example, moving Beam
>>>>> from 1.4 to 1.5 is actually a major change to some, because they now have
>>>>> to upgrade their Flink clusters/deployments to be able to use the new
>>>>> version of Beam.
>>>>>
>>>>> Upgrades need to be done with caution and may require extensive
>>>>> verification beyond what our automation provides. I think the Spark change
>>>>> from 1.x to 2.x and also the JDK 1.8 change were good examples; they
>>>>> provided the community a window to provide feedback and influence the
>>>>> change.
>>>>>
>>>>
>>>> Thanks for the clarification.
>>>>
>>>> Current policy indeed requests caution and explicit checks when
>>>> upgrading all dependencies (including minor and patch versions) but
>>>> language might have to be updated to emphasize your concerns.
>>>>
>>>> Here's the current text.
>>>>
>>>> "Beam releases adhere to
>>>> <https://beam.apache.org/get-started/downloads/> semantic versioning.
>>>> Hence, community members should take care when updating dependencies. Minor
>>>> version updates to dependencies should be backwards compatible in most
>>>> cases. Some updates to dependencies though may result in backwards
>>>> incompatible API or functionality changes to Beam. PR reviewers and
>>>> committers should take care to detect any dependency updates that could
>>>> potentially introduce backwards incompatible changes to Beam before merging
>>>> and PRs that update dependencies should include a statement regarding this
>>>> verification in the form of a PR comment. Dependency updates that result in
>>>> backwards incompatible changes to non-experimental features of Beam should
>>>> be held till next major version release of Beam. Any exceptions to this
>>>> policy should only occur in extreme cases (for example, due to a security
>>>> vulnerability of an existing dependency that is only fixed in a subsequent
>>>> major version) and should be discussed in the Beam dev list. Note that
>>>> backwards incompatible changes to experimental features may be introduced
>>>> in a minor version release."
>>>>
>>>> Also, are there any other steps we can take to make sure that Beam
>>>> dependencies are not too old while offering a stable system? Note that
>>>> having a lot of legacy dependencies that do not get upgraded regularly can
>>>> also result in user pain and Beam being unusable for certain users who run
>>>> into dependency conflicts when using Beam along with other systems (which
>>>> will increase the amount of shading/vendoring we have to do).
>>>>
>>>> Please note that current tooling does not force upgrades or
>>>> automatically upgrade dependencies. It simply creates JIRAs that can be
>>>> closed with a reason if needed. For Python SDK though we have version
>>>> ranges in place for most dependencies [1] so these dependencies get updated
>>>> automatically according to the corresponding ranges.
>>>> [1] https://github.com/apache/beam/blob/master/sdks/python/setup.py#L103
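For reference, the version ranges in question take roughly this shape in setup.py (an illustrative excerpt with made-up pins; the linked file has the real list):

```python
# Illustrative shape of Python SDK dependency ranges (made-up pins,
# not the actual Beam requirements):
REQUIRED_PACKAGES = [
    'httplib2>=0.8,<0.12',
    'mock>=1.0.1,<3.0.0',
    'protobuf>=3.5.0,<4',
]

# Each spec carries both a lower and an upper bound, so pip can pick up
# new patch/minor releases automatically without jumping a major version.
print(all(',' in spec for spec in REQUIRED_PACKAGES))  # True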
>>>>
>>>> Thanks,
>>>> Cham
>>>>
>>>>
>>>>>
>>>>> Thanks,
>>>>> Thomas
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Aug 28, 2018 at 11:29 AM Raghu Angadi <ra...@google.com>
>>>>> wrote:
>>>>>
>>>>>> Thanks for the IO versioning summary.
>>>>>> KafkaIO's policy of 'let the user decide exact version at runtime'
>>>>>> has been quite useful so far. How feasible is that for other connectors?
>>>>>>
>>>>>> Also, KafkaIO does not limit itself to the minimum features available
>>>>>> across all the supported versions. Some of the features (e.g. server-side
>>>>>> timestamps) are disabled based on the runtime Kafka version. The unit
>>>>>> tests currently run with a single recent version. Integration tests could
>>>>>> certainly use multiple versions. With some more effort in writing tests,
>>>>>> we could run the unit tests against multiple versions as well.
>>>>>>
>>>>>> Raghu.
>>>>>>
>>>>>> IO versioning
>>>>>>> * Elasticsearch. We delayed the move to version 6 until we heard of
>>>>>>> more active users needing it (more deployments). We support 2.x and
>>>>>>> 5.x (but 2.x recently went EOL). Support for 6.x is in progress.
>>>>>>> * SolrIO: the stable version is 7.x, LTS is 6.x. We support only 5.x
>>>>>>> because most big data distributions still use 5.x (however, 5.x has
>>>>>>> reached EOL).
>>>>>>> * KafkaIO uses version 1.x, but Kafka recently moved to 2.x; however,
>>>>>>> most Kafka deployments use versions earlier than 1.x. This module
>>>>>>> uses a single version, with the Kafka client as a provided
>>>>>>> dependency, and so far it works (but we don’t have multi-version
>>>>>>> tests).
>>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Aug 28, 2018 at 8:38 AM Ismaël Mejía <ie...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I think we should refine the strategy on dependencies discussed
>>>>>>> recently. Sorry to come late with this (I did not follow closely the
>>>>>>> previous discussion), but the current approach is clearly not in line
>>>>>>> with the industry reality (at least not for IO connectors + Hadoop +
>>>>>>> Spark/Flink use).
>>>>>>>
>>>>>>> A really proactive approach to dependency updates is a good practice
>>>>>>> for the core dependencies we have, e.g. Guava, ByteBuddy, Avro,
>>>>>>> Protobuf, etc., and of course for the case of cloud-based IOs, e.g.
>>>>>>> GCS, BigQuery, AWS S3, etc. However, when we talk about self-hosted
>>>>>>> data sources or processing systems this gets more complicated, and I
>>>>>>> think we should be more flexible and do this case by case (and remove
>>>>>>> these from the auto-update email reminder).
>>>>>>>
>>>>>>> Some open source projects have at least three maintained versions:
>>>>>>> - LTS – maps to what most of the people have installed (or the big
>>>>>>> data distributions use) e.g. HBase 1.1.x, Hadoop 2.6.x
>>>>>>> - Stable – current recommended version. HBase 1.4.x, Hadoop 2.8.x
>>>>>>> - Next – latest release. HBase 2.1.x Hadoop 3.1.x
>>>>>>>
>>>>>>> Following the most recent versions can be good to stay close to the
>>>>>>> current development of other projects and some of the fixes, but these
>>>>>>> versions are commonly not deployed by most users, and adopting an LTS-
>>>>>>> or stable-only approach won't satisfy all cases either. To understand
>>>>>>> why this is complex, let’s look at some historical issues:
>>>>>>>
>>>>>>> IO versioning
>>>>>>> * Elasticsearch. We delayed the move to version 6 until we heard of
>>>>>>> more active users needing it (more deployments). We support 2.x and
>>>>>>> 5.x (but 2.x recently went EOL). Support for 6.x is in progress.
>>>>>>> * SolrIO: the stable version is 7.x, LTS is 6.x. We support only 5.x
>>>>>>> because most big data distributions still use 5.x (however, 5.x has
>>>>>>> reached EOL).
>>>>>>> * KafkaIO uses version 1.x, but Kafka recently moved to 2.x; however,
>>>>>>> most Kafka deployments use versions earlier than 1.x. This module
>>>>>>> uses a single version, with the Kafka client as a provided
>>>>>>> dependency, and so far it works (but we don’t have multi-version
>>>>>>> tests).
>>>>>>>
>>>>>>> Runners versioning
>>>>>>> * The move from Spark 1 to Spark 2 was decided after weighing the
>>>>>>> cost of maintaining support for multiple versions against the
>>>>>>> breaking changes of moving to a single one. This is a rare case, but
>>>>>>> also one with consequences. This dependency is provided, but we don't
>>>>>>> actively test issues on version migration.
>>>>>>> * Flink moved to version 1.5, introducing an incompatibility in
>>>>>>> checkpointing (discussed recently, with no consensus yet on how to
>>>>>>> handle it).
>>>>>>>
>>>>>>> As you can see, it seems really hard to have a solution that fits
>>>>>>> all cases. Probably the only rule that I see from this list is that
>>>>>>> we should upgrade versions for connectors that have been deprecated
>>>>>>> or have reached EOL (e.g. Solr 5.x, Elasticsearch 2.x).
>>>>>>>
>>>>>>> For the case of the provided dependencies I wonder if as part of the
>>>>>>> tests we should provide tests with multiple versions (note that this
>>>>>>> is currently blocked by BEAM-4087).
>>>>>>>
>>>>>>> Any other ideas or opinions on how we can handle this? What do
>>>>>>> other people in the community think? (Notice that this relates to
>>>>>>> the ongoing LTS discussion.)
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Aug 28, 2018 at 10:44 AM Tim Robertson
>>>>>>> <ti...@gmail.com> wrote:
>>>>>>> >
>>>>>>> > Hi folks,
>>>>>>> >
>>>>>>> > I'd like to revisit the discussion around our versioning policy
>>>>>>> specifically for the Hadoop ecosystem and make sure we are aware of the
>>>>>>> implications.
>>>>>>> >
>>>>>>> > As an example our policy today would have us on HBase 2.1 and I
>>>>>>> have reminders to address this.
>>>>>>> >
>>>>>>> > However, currently the versions of HBase in the major hadoop
>>>>>>> distros are:
>>>>>>> >
>>>>>>> >  - Cloudera 5 on HBase 1.2 (Cloudera 6 is 2.1 but is only in beta)
>>>>>>> >  - Hortonworks HDP3 on HBase 2.0 (only recently released, so we can
>>>>>>> assume it is not widely adopted)
>>>>>>> >  - AWS EMR HBase on 1.4
>>>>>>> >
>>>>>>> > On the versioning I think we might need a more nuanced approach to
>>>>>>> ensure that we target real communities of existing and potential users.
>>>>>>> Enterprise users need to stick to the supported versions in the
>>>>>>> distributions to maintain support contracts from the vendors.
>>>>>>> >
>>>>>>> > Should our versioning policy leave more room to consider things on
>>>>>>> a case-by-case basis?
>>>>>>> >
>>>>>>> > For Hadoop might we benefit from a strategy on which community of
>>>>>>> users Beam is targeting?
>>>>>>> >
>>>>>>> > (OT: I'm collecting some thoughts on what we might consider to
>>>>>>> target enterprise Hadoop users - Kerberos on all relevant IO, performance,
>>>>>>> leaking beyond encryption zones with temporary files, etc.)
>>>>>>> >
>>>>>>> > Thanks,
>>>>>>> > Tim
>>>>>>>
>>>>>>

Re: [DISCUSS] Versioning, Hadoop related dependencies and enterprise users

Posted by Tim Robertson <ti...@gmail.com>.
Thank you Cham, and everyone for contributing

Sorry for the slow reply to a thread I started, but I've been swamped on
non-Beam projects.

KafkaIO's policy of 'let the user decide exact version at runtime' has been
> quite useful so far. How feasible is that for other connectors?


I presume shimming might be needed in a few places but it's certainly
something we might want to explore more. I'll look into KafkaIO.
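As an illustration of the KafkaIO approach mentioned above, gating optional features on the client version found at runtime, a sketch might look like this (the function name and the 0.10 cutoff are assumptions for illustration, not KafkaIO's actual code):

```python
def supports_server_side_timestamps(runtime_client_version):
    """Enable a feature only when the provided (runtime) Kafka client is
    new enough; older clients silently fall back. Illustrative sketch:
    server-side record timestamps appeared around the 0.10 clients.
    """
    # Compare only (major, minor); patch level does not matter here.
    major, minor = (int(p) for p in runtime_client_version.split(".")[:2])
    return (major, minor) >= (0, 10)

print(supports_server_side_timestamps("1.0"))    # True
print(supports_server_side_timestamps("0.9.0"))  # False
```

The design point is that the connector compiles against one client version but probes what is actually on the classpath at runtime, so a single artifact can serve users on older deployments.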

On Cham's proposal :

(1) +0.5. We can always then opt to either assign or take ownership of an
issue, although I am also happy to stick with the owners model - it
prompted me to investigate and resulted in this thread.

(2) I think this makes sense.
A bot informing us that we're falling behind versions is immensely useful
as long as we can link issues to others that might have a wider discussion
(remember many dependencies need to be treated together, such as "Support
Hadoop 3.0.x" or "Support HBase 2.x"). Would it make sense to let owners
use the Jira "fix versions" field to put in a future release, informing the
bot when it should start alerting again?
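A sketch of the "three minor versions or a year behind" trigger from Cham's proposal might read as follows (names and dates are illustrative, not the bot's implementation):

```python
from datetime import date, timedelta

def needs_upgrade_jira(minor_versions_behind, in_use_release_date, today):
    """File (or reopen) an upgrade JIRA when a dependency is at least
    three minor versions behind, or its in-use release is roughly a
    year old. Sketch of the proposed policy, not the actual tool."""
    return (minor_versions_behind >= 3
            or today - in_use_release_date > timedelta(days=365))

today = date(2018, 9, 5)
print(needs_upgrade_jira(3, date(2018, 8, 1), today))  # True: 3 minors behind
print(needs_upgrade_jira(0, date(2017, 6, 1), today))  # True: over a year old
print(needs_upgrade_jira(1, date(2018, 8, 1), today))  # False
```

Note the trigger deliberately names no target version; per the proposal, which version to move to stays a community decision on the JIRA itself.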



On Wed, Sep 5, 2018 at 3:18 AM Yifan Zou <yi...@google.com> wrote:

> Thanks Cham for putting this together. Also, after modifying the
> dependency tool based on the policy above, we will close all existing JIRA
> issues to prevent creating duplicate bugs, and stop pushing assignees to
> upgrade dependencies via old bugs.
>
> Please let us know if you have any comments on the revised policy in
> Cham's email.
>
> Thanks all.
>
> Regards.
> Yifan Zou
>
> On Tue, Sep 4, 2018 at 5:35 PM Chamikara Jayalath <ch...@google.com>
> wrote:
>
>> Based on this email thread and offline feedback from several folks,
>> current concerns regarding dependency upgrade policy and tooling seem to
>> be the following.
>>
>> (1) We have to be careful when upgrading dependencies. For example, we
>> should not create JIRAs for upgrading to dependency versions that have
>> known issues.
>>
>> (2) Dependency owners list can get stale. Somebody who is interested in
>> upgrading a dependency today might not be interested in the same task in
>> six months. Responsibility of upgrading a dependency should lie with the
>> community instead of pre-identified owner(s).
>>
>> On the other hand, we do not want Beam to significantly fall behind when
>> it comes to dependencies. We should upgrade dependencies whenever it makes
>> sense. This allows us to offer a more up-to-date system and to make things
>> easy for users that deploy Beam along with other systems.
>>
>> I discussed these issues with Yifan, and we would like to suggest the
>> following changes to the current policy and tooling that might help
>> alleviate some of the concerns.
>>
>> (1) Instead of a dependency "owners" list we will be maintaining an
>> "interested parties" list. When we create a JIRA for a dependency we will
>> not assign it to an owner; rather, we will CC all the folks that mentioned
>> that they would be interested in receiving updates related to that
>> dependency. The hope is that some of the interested parties will also put
>> forward the effort to upgrade dependencies they are interested in, but the
>> responsibility of upgrading dependencies lies with the community as a whole.
>>
>> (2) We will be creating JIRAs for upgrading individual dependencies, not
>> for upgrading to specific versions of those dependencies. For example, if a
>> given dependency X is three minor versions or a year behind, we will create
>> a JIRA for upgrading it. But the specific version to upgrade to has to be
>> determined by the Beam community. The Beam community might choose to close a
>> JIRA if there are known issues with the available recent releases. The tool
>> may reopen such a closed JIRA in the future if new information becomes
>> available (for example, 3 new versions have been released since the JIRA was
>> closed).
>>
>> Thoughts?
>>
>> Thanks,
>> Cham
>>
>> On Tue, Aug 28, 2018 at 1:51 PM Chamikara Jayalath <ch...@google.com>
>> wrote:
>>
>>>
>>>
>>> On Tue, Aug 28, 2018 at 12:05 PM Thomas Weise <th...@apache.org> wrote:
>>>
>>>> I think there is an invalid assumption being made in this discussion,
>>>> which is that most projects comply with semantic versioning. The reality in
>>>> the open source big data space is unfortunately quite different. Ismaël has
>>>> characterized the situation well, and HBase isn't an exception. Another
>>>> indicator of the scale of the problem is the extensive amount of shading
>>>> used in Beam and other projects. It wouldn't be necessary if semver
>>>> compliance were something we could rely on.
>>>>
>>>> Our recent Flink upgrade broke user(s). And we noticed a backward
>>>> incompatible Flink change that affected the portable Flink runner even
>>>> between patches.
>>>>
>>>> Many projects (including Beam) guarantee compatibility only for a
>>>> subset of the public API. Sometimes a REST API is not covered, sometimes
>>>> protocols that are not strictly internal change, and so on, all of which
>>>> can break users despite the public API remaining "compatible". As much as
>>>> I would love to rely on the version number to tell me whether an upgrade
>>>> is safe or not, that's not practically possible.
>>>>
>>>> Furthermore, we need to proceed with caution when forcing upgrades on users
>>>> that host the target systems. To stay with the Flink example, moving Beam
>>>> from 1.4 to 1.5 is actually a major change to some, because they now have
>>>> to upgrade their Flink clusters/deployments to be able to use the new
>>>> version of Beam.
>>>>
>>>> Upgrades need to be done with caution and may require extensive
>>>> verification beyond what our automation provides. I think the Spark change
>>>> from 1.x to 2.x and also the JDK 1.8 change were good examples; they
>>>> provided the community a window to provide feedback and influence the
>>>> change.
>>>>
>>>
>>> Thanks for the clarification.
>>>
>>> Current policy indeed requests caution and explicit checks when
>>> upgrading all dependencies (including minor and patch versions) but
>>> language might have to be updated to emphasize your concerns.
>>>
>>> Here's the current text.
>>>
>>> "Beam releases adhere to
>>> <https://beam.apache.org/get-started/downloads/> semantic versioning.
>>> Hence, community members should take care when updating dependencies. Minor
>>> version updates to dependencies should be backwards compatible in most
>>> cases. Some updates to dependencies though may result in backwards
>>> incompatible API or functionality changes to Beam. PR reviewers and
>>> committers should take care to detect any dependency updates that could
>>> potentially introduce backwards incompatible changes to Beam before merging
>>> and PRs that update dependencies should include a statement regarding this
>>> verification in the form of a PR comment. Dependency updates that result in
>>> backwards incompatible changes to non-experimental features of Beam should
>>> be held till next major version release of Beam. Any exceptions to this
>>> policy should only occur in extreme cases (for example, due to a security
>>> vulnerability of an existing dependency that is only fixed in a subsequent
>>> major version) and should be discussed in the Beam dev list. Note that
>>> backwards incompatible changes to experimental features may be introduced
>>> in a minor version release."
>>>
>>> Also, are there any other steps we can take to make sure that Beam
>>> dependencies are not too old while offering a stable system? Note that
>>> having a lot of legacy dependencies that do not get upgraded regularly can
>>> also result in user pain and Beam being unusable for certain users who run
>>> into dependency conflicts when using Beam along with other systems (which
>>> will increase the amount of shading/vendoring we have to do).
>>>
>>> Please note that current tooling does not force upgrades or
>>> automatically upgrade dependencies. It simply creates JIRAs that can be
>>> closed with a reason if needed. For Python SDK though we have version
>>> ranges in place for most dependencies [1] so these dependencies get updated
>>> automatically according to the corresponding ranges.
>>> [1] https://github.com/apache/beam/blob/master/sdks/python/setup.py#L103
>>>
>>> Thanks,
>>> Cham
>>>
>>>
>>>>
>>>> Thanks,
>>>> Thomas
>>>>
>>>>
>>>>
>>>> On Tue, Aug 28, 2018 at 11:29 AM Raghu Angadi <ra...@google.com>
>>>> wrote:
>>>>
>>>>> Thanks for the IO versioning summary.
>>>>> KafkaIO's policy of 'let the user decide exact version at runtime' has
>>>>> been quite useful so far. How feasible is that for other connectors?
>>>>>
>>>>> Also, KafkaIO does not limit itself to the minimum features available
>>>>> across all the supported versions. Some of the features (e.g. server-side
>>>>> timestamps) are disabled based on the runtime Kafka version. The unit
>>>>> tests currently run with a single recent version. Integration tests could
>>>>> certainly use multiple versions. With some more effort in writing tests,
>>>>> we could run the unit tests against multiple versions as well.
>>>>>
>>>>> Raghu.
>>>>>
>>>>> IO versioning
>>>>>> * Elasticsearch. We delayed the move to version 6 until we heard of
>>>>>> more active users needing it (more deployments). We support 2.x and
>>>>>> 5.x (but 2.x recently went EOL). Support for 6.x is in progress.
>>>>>> * SolrIO: the stable version is 7.x, LTS is 6.x. We support only 5.x
>>>>>> because most big data distributions still use 5.x (however, 5.x has
>>>>>> reached EOL).
>>>>>> * KafkaIO uses version 1.x, but Kafka recently moved to 2.x; however,
>>>>>> most Kafka deployments use versions earlier than 1.x. This module
>>>>>> uses a single version, with the Kafka client as a provided
>>>>>> dependency, and so far it works (but we don’t have multi-version
>>>>>> tests).
>>>>>>
>>>>>
>>>>>
>>>>> On Tue, Aug 28, 2018 at 8:38 AM Ismaël Mejía <ie...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I think we should refine the strategy on dependencies discussed
>>>>>> recently. Sorry to come late with this (I did not follow closely the
>>>>>> previous discussion), but the current approach is clearly not in line
>>>>>> with the industry reality (at least not for IO connectors + Hadoop +
>>>>>> Spark/Flink use).
>>>>>>
>>>>>> A really proactive approach to dependency updates is a good practice
>>>>>> for the core dependencies we have, e.g. Guava, ByteBuddy, Avro,
>>>>>> Protobuf, etc., and of course for the case of cloud-based IOs, e.g.
>>>>>> GCS, BigQuery, AWS S3, etc. However, when we talk about self-hosted
>>>>>> data sources or processing systems this gets more complicated, and I
>>>>>> think we should be more flexible and do this case by case (and remove
>>>>>> these from the auto-update email reminder).
>>>>>>
>>>>>> Some open source projects have at least three maintained versions:
>>>>>> - LTS – maps to what most of the people have installed (or the big
>>>>>> data distributions use) e.g. HBase 1.1.x, Hadoop 2.6.x
>>>>>> - Stable – current recommended version. HBase 1.4.x, Hadoop 2.8.x
>>>>>> - Next – latest release. HBase 2.1.x Hadoop 3.1.x
>>>>>>
>>>>>> Following the most recent versions can be good to stay close to the
>>>>>> current development of other projects and some of the fixes, but these
>>>>>> versions are commonly not deployed by most users, and adopting an LTS-
>>>>>> or stable-only approach won't satisfy all cases either. To understand
>>>>>> why this is complex, let’s look at some historical issues:
>>>>>>
>>>>>> IO versioning
>>>>>> * Elasticsearch. We delayed the move to version 6 until we heard of
>>>>>> more active users needing it (more deployments). We support 2.x and
>>>>>> 5.x (but 2.x recently went EOL). Support for 6.x is in progress.
>>>>>> * SolrIO: the stable version is 7.x, LTS is 6.x. We support only 5.x
>>>>>> because most big data distributions still use 5.x (however, 5.x has
>>>>>> reached EOL).
>>>>>> * KafkaIO uses version 1.x, but Kafka recently moved to 2.x; however,
>>>>>> most Kafka deployments use versions earlier than 1.x. This module
>>>>>> uses a single version, with the Kafka client as a provided
>>>>>> dependency, and so far it works (but we don’t have multi-version
>>>>>> tests).
>>>>>>
>>>>>> Runners versioning
>>>>>> * The move from Spark 1 to Spark 2 was decided after evaluating the
>>>>>> tradeoff between maintaining support for multiple versions and
>>>>>> introducing a breaking change.
>>>>>> This is a rare case but also with consequences. This dependency is
>>>>>> provided but we don't actively test issues on version migration.
>>>>>> * Flink moved to version 1.5, introducing incompatibility in
>>>>>> checkpointing (discussed recently, with no consensus yet on how to
>>>>>> handle it).
>>>>>>
>>>>>> As you can see, it seems really hard to have a solution that fits all
>>>>>> cases. Probably the only rule that I see from this list is that we
>>>>>> should upgrade versions for connectors that have been deprecated or
>>>>>> reached EOL (e.g. Solr 5.x, Elasticsearch 2.x).
>>>>>>
>>>>>> For the provided dependencies, I wonder whether we should run tests
>>>>>> against multiple versions (note that this is currently blocked by
>>>>>> BEAM-4087).
>>>>>>
>>>>>> Any other ideas or opinions on how we can handle this? What do other
>>>>>> people in the community think? (Note that this may relate to the
>>>>>> ongoing LTS discussion.)
>>>>>>
>>>>>>
>>>>>> On Tue, Aug 28, 2018 at 10:44 AM Tim Robertson
>>>>>> <ti...@gmail.com> wrote:
>>>>>> >
>>>>>> > Hi folks,
>>>>>> >
>>>>>> > I'd like to revisit the discussion around our versioning policy
>>>>>> specifically for the Hadoop ecosystem and make sure we are aware of the
>>>>>> implications.
>>>>>> >
>>>>>> > As an example our policy today would have us on HBase 2.1 and I
>>>>>> have reminders to address this.
>>>>>> >
>>>>>> > However, currently the versions of HBase in the major hadoop
>>>>>> distros are:
>>>>>> >
>>>>>> >  - Cloudera 5 on HBase 1.2 (Cloudera 6 is 2.1 but is only in beta)
>>>>>> >  - Hortonworks HDP3 on HBase 2.0 (only recently released so we can
>>>>>> assume is not widely adopted)
>>>>>> >  - AWS EMR HBase on 1.4
>>>>>> >
>>>>>> > On the versioning I think we might need a more nuanced approach to
>>>>>> ensure that we target real communities of existing and potential users.
>>>>>> Enterprise users need to stick to the supported versions in the
>>>>>> distributions to maintain support contracts from the vendors.
>>>>>> >
>>>>>> > Should our versioning policy have more room to consider things on a
>>>>>> case-by-case basis?
>>>>>> >
>>>>>> > For Hadoop might we benefit from a strategy on which community of
>>>>>> users Beam is targeting?
>>>>>> >
>>>>>> > (OT: I'm collecting some thoughts on what we might consider to
>>>>>> target enterprise hadoop users - kerberos on all relevant IO, performance,
>>>>>> leaking beyond encryption zones with temporary files etc)
>>>>>> >
>>>>>> > Thanks,
>>>>>> > Tim
>>>>>>
>>>>>

Re: [DISCUSS] Versioning, Hadoop related dependencies and enterprise users

Posted by Yifan Zou <yi...@google.com>.
Thanks Cham for putting this together. Also, after modifying the dependency
tool based on the policy above, we will close all existing JIRA issues (to
prevent duplicate bugs from being created) and stop pushing assignees to
upgrade dependencies via the old bugs.

Please let us know if you have any comments on the revised policy in Cham's
email.

Thanks all.

Regards.
Yifan Zou

On Tue, Sep 4, 2018 at 5:35 PM Chamikara Jayalath <ch...@google.com>
wrote:

> Based on this email thread and offline feedback from several folks, the
> current concerns regarding dependency upgrade policy and tooling seem to
> be the following.
>
> (1) We have to be careful when upgrading dependencies. For example, we
> should not create JIRAs for upgrading to dependency versions that have
> known issues.
>
> (2) Dependency owners list can get stale. Somebody who is interested in
> upgrading a dependency today might not be interested in the same task in
> six months. Responsibility of upgrading a dependency should lie with the
> community instead of pre-identified owner(s).
>
> On the other hand we do not want Beam to significantly fall behind when it
> comes to dependencies. We should upgrade dependencies whenever it makes
> sense. This allows us to offer a more up-to-date system and to make things
> easier for users that deploy Beam along with other systems.
>
> I discussed these issues with Yifan and we would like to suggest the
> following changes to the current policy and tooling that might help
> alleviate some of the concerns.
>
> (1) Instead of a dependency "owners" list we will be maintaining an
> "interested parties" list. When we create a JIRA for a dependency we will
> not assign it to an owner but rather we will CC all the folks that
> mentioned that they will be interested in receiving updates related to that
> dependency. The hope is that some of the interested parties will also put
> in the effort to upgrade dependencies they care about, but the
> responsibility for upgrading dependencies lies with the community as a whole.
>
>  (2) We will be creating JIRAs for upgrading individual dependencies, not
> for upgrading to specific versions of those dependencies. For example, if a
> given dependency X is three minor versions or a year behind, we will create
> a JIRA for upgrading that. But the specific version to upgrade to has to be
> determined by the Beam community. Beam community might choose to close a
> JIRA if there are known issues with available recent releases. The tool may
> reopen such a closed JIRA in the future if new information becomes
> available (for example, if 3 new versions have been released since the
> JIRA was closed).
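The staleness check and reopen rule described in (2) could be sketched roughly as follows (an illustrative sketch only — the function names, tuple-based versions, and exact thresholds are assumptions, not the actual Beam dependency tool):

```python
from datetime import datetime, timedelta

def is_outdated(current, latest, current_release_date,
                minor_gap=3, max_age=timedelta(days=365)):
    """True if the version in use is >= `minor_gap` minor versions behind,
    or was released more than `max_age` ago (the "three minor versions
    or a year behind" rule). Versions are (major, minor) tuples."""
    cur_major, cur_minor = current
    new_major, new_minor = latest
    behind_by_versions = (new_major > cur_major
                          or (new_major == cur_major
                              and new_minor - cur_minor >= minor_gap))
    behind_by_age = (current != latest
                     and datetime.now() - current_release_date > max_age)
    return behind_by_versions or behind_by_age

def should_reopen(versions_released_since_close, threshold=3):
    """Reopen a previously closed upgrade JIRA once enough new releases
    have appeared since it was closed (e.g. 3 new versions)."""
    return versions_released_since_close >= threshold
```

The point of `should_reopen` is that a decision to skip an upgrade is not permanent: new releases re-trigger the community discussion.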
>
> Thoughts ?
>
> Thanks,
> Cham
>
> On Tue, Aug 28, 2018 at 1:51 PM Chamikara Jayalath <ch...@google.com>
> wrote:
>
>>
>>
>> On Tue, Aug 28, 2018 at 12:05 PM Thomas Weise <th...@apache.org> wrote:
>>
>>> I think there is an invalid assumption being made in this discussion,
>>> which is that most projects comply with semantic versioning. The reality in
>>> the open source big data space is unfortunately quite different. Ismaël has
>>> well characterized the situation and HBase isn't an exception. Another
>>> indicator of the scale of the problem is the extensive amount of shading
>>> used in Beam and other projects. It wouldn't be necessary if semver
>>> compliance were something we could rely on.
>>>
>>> Our recent Flink upgrade broke user(s). And we noticed a backward
>>> incompatible Flink change that affected the portable Flink runner even
>>> between patches.
>>>
>>> Many projects (including Beam) guarantee compatibility only for a subset
>>> of the public API. Sometimes a REST API is not covered, sometimes
>>> not-strictly-internal protocols change, and so on, all of which can break
>>> users despite the public API remaining "compatible". As much as I would
>>> love to rely on the version number to tell me whether an upgrade is safe
>>> or not, that's not practically possible.
>>>
>>> Furthermore, we need to proceed with caution forcing upgrades on users
>>> that host the target systems. To stay with the Flink example, moving Beam
>>> from 1.4 to 1.5 is actually a major change to some, because they now have
>>> to upgrade their Flink clusters/deployments to be able to use the new
>>> version of Beam.
>>>
>>> Upgrades need to be done with caution and may require extensive
>>> verification beyond what our automation provides. I think the Spark change
>>> from 1.x to 2.x and also the JDK 1.8 change were good examples: they
>>> gave the community a window to provide feedback and influence the
>>> change.
>>>
>>
>> Thanks for the clarification.
>>
>> Current policy indeed requests caution and explicit checks when upgrading
>> all dependencies (including minor and patch versions) but language might
>> have to be updated to emphasize your concerns.
>>
>> Here's the current text.
>>
>> "Beam releases adhere to semantic versioning
>> <https://beam.apache.org/get-started/downloads/>. Hence, community
>> members should take care when updating
>> dependencies. Minor version updates to dependencies should be backwards
>> compatible in most cases. Some updates to dependencies though may result in
>> backwards incompatible API or functionality changes to Beam. PR reviewers
>> and committers should take care to detect any dependency updates that could
>> potentially introduce backwards incompatible changes to Beam before merging
>> and PRs that update dependencies should include a statement regarding this
>> verification in the form of a PR comment. Dependency updates that result in
>> backwards incompatible changes to non-experimental features of Beam should
>> be held till next major version release of Beam. Any exceptions to this
>> policy should only occur in extreme cases (for example, due to a security
>> vulnerability of an existing dependency that is only fixed in a subsequent
>> major version) and should be discussed in the Beam dev list. Note that
>> backwards incompatible changes to experimental features may be introduced
>> in a minor version release."
>>
>> Also, are there any other steps we can take to make sure that Beam
>> dependencies are not too old while offering a stable system? Note that
>> having a lot of legacy dependencies that do not get upgraded regularly can
>> also result in user pain and Beam being unusable for certain users who run
>> into dependency conflicts when using Beam along with other systems (which
>> will increase the amount of shading/vendoring we have to do).
>>
>> Please note that current tooling does not force upgrades or automatically
>> upgrade dependencies. It simply creates JIRAs that can be closed with a
>> reason if needed. For Python SDK though we have version ranges in place for
>> most dependencies [1] so these dependencies get updated automatically
>> according to the corresponding ranges.
>> [1] https://github.com/apache/beam/blob/master/sdks/python/setup.py#L103
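For readers unfamiliar with such ranges, this is a simplified sketch of how version-range pins behave (the package names and ranges below are examples for illustration, not Beam's actual requirements; real resolution is done by pip against setup.py using PEP 440 semantics):

```python
# Each pin is a half-open range [low, high): pip may pick any release in
# the range automatically, without a change to setup.py.
RANGES = {
    'httplib2': ('0.8', '0.12'),   # i.e. >=0.8,<0.12
    'dill':     ('0.2.6', '0.3'),  # i.e. >=0.2.6,<0.3
}

def _key(version):
    # Naive numeric comparison key; real tools use full PEP 440 rules.
    return tuple(int(part) for part in version.split('.'))

def satisfies(package, version):
    """True if `version` falls inside the pinned range, i.e. it would be
    picked up automatically according to the corresponding range."""
    low, high = RANGES[package]
    return _key(low) <= _key(version) < _key(high)
```

This is how a dependency can move forward without any Beam release: any release inside the range is accepted at install time, while a jump past the upper bound still requires an explicit (reviewed) change.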
>>
>> Thanks,
>> Cham
>>
>>
>>>
>>> Thanks,
>>> Thomas
>>>
>>>
>>>
>>> On Tue, Aug 28, 2018 at 11:29 AM Raghu Angadi <ra...@google.com>
>>> wrote:
>>>
>>>> Thanks for the IO versioning summary.
>>>> KafkaIO's policy of 'let the user decide exact version at runtime' has
>>>> been quite useful so far. How feasible is that for other connectors?
>>>>
>>>> Also, KafkaIO does not limit itself to minimum features available
>>>> across all the supported versions. Some of the features (e.g. server side
>>>> timestamps) are disabled based on the runtime Kafka version. The unit
>>>> tests currently run with a single recent version. Integration tests could
>>>> certainly use multiple versions. With some more effort in writing tests,
>>>> we could run the unit tests against multiple versions as well.
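Such a multi-version test matrix could be sketched like this (illustrative only, and in Python for brevity — KafkaIO's actual tests are JUnit, and the Gradle task and `kafkaClientVersion` property below are hypothetical):

```python
import subprocess

# Hypothetical list of client versions to exercise.
KAFKA_CLIENT_VERSIONS = ['0.10.2.2', '0.11.0.3', '1.1.1', '2.0.0']

def run_suite(version):
    # In a real build this would resolve the kafka-clients jar for
    # `version` (e.g. via a Gradle property) and run the tests against it.
    return subprocess.run(
        ['./gradlew', ':sdks:java:io:kafka:test',
         '-PkafkaClientVersion=' + version]).returncode

def failing_versions(versions=KAFKA_CLIENT_VERSIONS, runner=run_suite):
    """Run the suite once per client version; return the versions that
    failed, so version-specific regressions surface immediately."""
    return [v for v in versions if runner(v) != 0]
```

Running the same suite per version is what would catch things like the server-side-timestamp feature gating mentioned above.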
>>>>
>>>> Raghu.
>>>>
>>>> IO versioning
>>>>> * Elasticsearch. We delayed the move to version 6 until we heard of
>>>>> more active users needing it (more deployments). We support 2.x and
>>>>> 5.x (but 2.x went recently EOL). Support for 6.x is in progress.
>>>>> * SolrIO, stable version is 7.x, LTS is 6.x. We support only 5.x
>>>>> because most big data distributions still use 5.x (though 5.x is now
>>>>> EOL).
>>>>> * KafkaIO uses version 1.x but Kafka recently moved to 2.x, however
>>>>> most Kafka deployments use versions earlier than 1.x. This
>>>>> module uses a single version with the kafka client as a provided
>>>>> dependency and so far it works (but we don’t have multi version
>>>>> tests).
>>>>>
>>>>
>>>>
>>>> On Tue, Aug 28, 2018 at 8:38 AM Ismaël Mejía <ie...@gmail.com> wrote:
>>>>
>>>>> I think we should refine the strategy on dependencies discussed
>>>>> recently. Sorry to come late with this (I did not follow closely the
>>>>> previous discussion), but the current approach is clearly not in line
>>>>> with the industry reality (at least not for IO connectors + Hadoop +
>>>>> Spark/Flink use).
>>>>>
>>>>> A really proactive approach to dependency updates is a good practice
>>>>> for the core dependencies we have e.g. Guava, Bytebuddy, Avro,
>>>>> Protobuf, etc, and of course for the case of cloud based IOs e.g. GCS,
>>>>> Bigquery, AWS S3, etc. However, when we talk about self-hosted data
>>>>> sources or processing systems, this gets more complicated, and I think
>>>>> we should be more flexible and handle this case by case (and remove these
>>>>> from the auto update email reminder).
>>>>>
>>>>> Some open source projects have at least three maintained versions:
>>>>> - LTS – maps to what most of the people have installed (or the big
>>>>> data distributions use) e.g. HBase 1.1.x, Hadoop 2.6.x
>>>>> - Stable – current recommended version. HBase 1.4.x, Hadoop 2.8.x
>>>>> - Next – latest release. HBase 2.1.x Hadoop 3.1.x
>>>>>
>>>>> Following the most recent versions can be good for staying close to the
>>>>> current development of other projects and picking up fixes, but these
>>>>> versions are commonly not deployed by most users, and adopting an LTS
>>>>> or stable-only approach won't satisfy all cases either. To understand
>>>>> why this is complex let’s see some historical issues:
>>>>>
>>>>> IO versioning
>>>>> * Elasticsearch. We delayed the move to version 6 until we heard of
>>>>> more active users needing it (more deployments). We support 2.x and
>>>>> 5.x (but 2.x went recently EOL). Support for 6.x is in progress.
>>>>> * SolrIO, stable version is 7.x, LTS is 6.x. We support only 5.x
>>>>> because most big data distributions still use 5.x (though 5.x is now
>>>>> EOL).
>>>>> * KafkaIO uses version 1.x but Kafka recently moved to 2.x, however
>>>>> most Kafka deployments use versions earlier than 1.x. This
>>>>> module uses a single version with the kafka client as a provided
>>>>> dependency and so far it works (but we don’t have multi version
>>>>> tests).
>>>>>
>>>>> Runners versioning
>>>>> * The move from Spark 1 to Spark 2 was decided after evaluating the
>>>>> tradeoff between maintaining support for multiple versions and
>>>>> introducing a breaking change.
>>>>> This is a rare case but also with consequences. This dependency is
>>>>> provided but we don't actively test issues on version migration.
>>>>> * Flink moved to version 1.5, introducing incompatibility in
>>>>> checkpointing (discussed recently, with no consensus yet on how to
>>>>> handle it).
>>>>>
>>>>> As you can see, it seems really hard to have a solution that fits all
>>>>> cases. Probably the only rule that I see from this list is that we
>>>>> should upgrade versions for connectors that have been deprecated or
>>>>> reached EOL (e.g. Solr 5.x, Elasticsearch 2.x).
>>>>>
>>>>> For the provided dependencies, I wonder whether we should run tests
>>>>> against multiple versions (note that this is currently blocked by
>>>>> BEAM-4087).
>>>>>
>>>>> Any other ideas or opinions on how we can handle this? What do other
>>>>> people in the community think? (Note that this may relate to the
>>>>> ongoing LTS discussion.)
>>>>>
>>>>>
>>>>> On Tue, Aug 28, 2018 at 10:44 AM Tim Robertson
>>>>> <ti...@gmail.com> wrote:
>>>>> >
>>>>> > Hi folks,
>>>>> >
>>>>> > I'd like to revisit the discussion around our versioning policy
>>>>> specifically for the Hadoop ecosystem and make sure we are aware of the
>>>>> implications.
>>>>> >
>>>>> > As an example our policy today would have us on HBase 2.1 and I have
>>>>> reminders to address this.
>>>>> >
>>>>> > However, currently the versions of HBase in the major hadoop distros
>>>>> are:
>>>>> >
>>>>> >  - Cloudera 5 on HBase 1.2 (Cloudera 6 is 2.1 but is only in beta)
>>>>> >  - Hortonworks HDP3 on HBase 2.0 (only recently released so we can
>>>>> assume is not widely adopted)
>>>>> >  - AWS EMR HBase on 1.4
>>>>> >
>>>>> > On the versioning I think we might need a more nuanced approach to
>>>>> ensure that we target real communities of existing and potential users.
>>>>> Enterprise users need to stick to the supported versions in the
>>>>> distributions to maintain support contracts from the vendors.
>>>>> >
>>>>> > Should our versioning policy have more room to consider things on a
>>>>> case-by-case basis?
>>>>> >
>>>>> > For Hadoop might we benefit from a strategy on which community of
>>>>> users Beam is targeting?
>>>>> >
>>>>> > (OT: I'm collecting some thoughts on what we might consider to
>>>>> target enterprise hadoop users - kerberos on all relevant IO, performance,
>>>>> leaking beyond encryption zones with temporary files etc)
>>>>> >
>>>>> > Thanks,
>>>>> > Tim
>>>>>
>>>>