Posted to mapreduce-dev@hadoop.apache.org by Wei-Chiu Chuang <we...@apache.org> on 2020/03/07 17:31:09 UTC

[DISCUSS] Accelerate Hadoop dependency updates

Hi Hadoop devs,

In the past, Hadoop has tended to be pretty far behind the latest versions of
its dependencies. Part of that is due to the fear of breaking changes brought
in by dependency updates.

However, things have changed dramatically over the past few years. With
more focus on security, more vulnerabilities are being discovered in our
dependencies, and users put more pressure on patching Hadoop (and its
ecosystem) to use the latest dependency versions.

As an example, Jackson-databind had 20 CVEs published in the last year
alone.
https://www.cvedetails.com/product/42991/Fasterxml-Jackson-databind.html?vendor_id=15866

Jetty: 4 CVEs in 2019:
https://www.cvedetails.com/product/34824/Eclipse-Jetty.html?vendor_id=10410

We can no longer let Hadoop stay behind. The further behind we stay, the
harder it is to update. A good example is the Jersey 1 -> 2 migration,
HADOOP-15984 <https://issues.apache.org/jira/browse/HADOOP-15984>, contributed
by Akira. Jersey 1 is no longer supported, but the Jersey 2 migration is hard.
If a critical vulnerability is found in Jersey 1, it will leave us in a bad
situation, since we can't simply update the Jersey version and be done.

Hadoop 3 adds new public artifacts that shade these dependencies. We should
advocate for downstream applications to use these public artifacts to avoid
breakage.
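
To make that concrete: a downstream application that codes only against the
public org.apache.hadoop.* API can compile against hadoop-client-api and run
with hadoop-client-runtime, and should never need Hadoop's internal copies of
Guava, Jetty, Jackson, etc. on its own classpath. A minimal sketch (the class
and path handling are made up for illustration):

    // Depends only on the public Hadoop API; the third-party libraries that
    // Hadoop itself needs are relocated inside hadoop-client-runtime, so a
    // Hadoop-side dependency bump does not leak onto this classpath.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListDir {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf)) {
          for (FileStatus status : fs.listStatus(new Path(args[0]))) {
            System.out.println(status.getPath());
          }
        }
      }
    }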

I'd like to hear your thoughts: are you okay with Hadoop keeping up with the
latest dependency updates, or would you rather stay behind to ensure
compatibility?

Coupled with that, I'd like to call for more frequent Hadoop releases for
the same purpose. IMHO that will require better infrastructure to assist the
release work and some rethinking of our current Hadoop code structure, like
separating each subproject into its own repository with its own release
cadence. This may be controversial, but I think it will be good for the
project in the long run.

Thanks,
Wei-Chiu

Re: [DISCUSS] Accelerate Hadoop dependency updates

Posted by Wei-Chiu Chuang <we...@apache.org>.
That is unfortunately true.

Now that I recognize the impact of the guava update in Hadoop 3.1/3.2, how can
we make this easier for downstream projects to consume? As I proposed, I think
a middle ground is to shade guava in hadoop-thirdparty and include the
hadoop-thirdparty jar in the next Hadoop 3.1/3.2 releases.
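
For code inside Hadoop, the change would essentially be an import swap: instead
of com.google.common.*, Hadoop would reference the relocated copy provided by
the shaded hadoop-thirdparty jar, leaving whatever Guava the application ships
untouched. A rough sketch (the relocated package prefix below is an assumption
for illustration; the real one is whatever the shade configuration picks):

    // Relocated Guava from the shaded hadoop-thirdparty jar (assumed prefix).
    import org.apache.hadoop.thirdparty.com.google.common.base.Preconditions;

    public class Example {
      public static String requireName(String name) {
        // Hadoop uses its private, relocated Guava here, so bumping that copy
        // cannot clash with an older com.google.common.* on the application
        // classpath.
        return Preconditions.checkNotNull(name, "name must not be null");
      }
    }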



On Thu, Mar 12, 2020 at 12:03 AM Igor Dvorzhak <id...@google.com.invalid>
wrote:

> How do you manage and version such dependency upgrades in subminor
> Hadoop/Spark/Hive versions in Cloudera then? I would imagine that some
> upgrades will be breaking for customers and can not be shipped in subminor
> CDH release? Or this is in preparation for the next major/minor release of
> CDH?
>
> On Wed, Mar 11, 2020 at 5:45 PM Wei-Chiu Chuang
> <we...@cloudera.com.invalid> wrote:
>
>> FWIW we are updating guava in Spark and Hive at Cloudera. Don't know which
>> Apache version they are going to land in, but we'll upstream them for sure.
>>
>> The guava change is debatable. It's not as critical as others. There are
>> critical vulnerabilities in other dependencies that we have no way but to
>> update to a new major/minor version because we are so far behind. And
>> given
>> the critical nature, I think it is worth the risk and backport to lower
>> maintenance releases is warranted. Moreover, our minor releases are at
>> best
>> 1 per year. That is too slow to respond to a critical vulnerability.
>>
>> On Wed, Mar 11, 2020 at 5:02 PM Igor Dvorzhak <id...@google.com.invalid>
>> wrote:
>>
>> > Generally I'm for updating dependencies, but I think that Hadoop should
>> > stick with semantic versioning and do not make major and
>> > minor dependency updates in subminor releases.
>> >
>> > For example, Hadoop 3.2.1 updated Guava to 27.0-jre, and because of
>> this
>> > Spark 3.0 stuck with Hadoop 3.2.0 - they use Hive 2.3.6 that doesn't
>> > support Guava 27.0-jre.
>> >
>> > It would be better to make dependency upgrades when releasing new
>> > major/minor versions, for example Guava 27.0-jre upgrade was more
>> > appropriate for Hadoop 3.3.0 release than 3.2.1.
>> >
>> > On Tue, Mar 10, 2020 at 3:03 PM Wei-Chiu Chuang
>> > <we...@cloudera.com.invalid> wrote:
>> >
>> >> I'm not hearing any feedback so far, but I want to suggest:
>> >>
>> >> use hadoop-thirdparty repository to host any dependencies that are
>> known
>> >> to
>> >> break compatibility.
>> >>
>> >> Candidate #1 guava
>> >> Candidate #2 Netty
>> >> Candidate #3 Jetty
>> >>
>> >> in fact, HBase shades these dependencies for the exact same reason.
>> >>
>> >> As an example of the cost of compatibility breakage: we spent the last
>> 6
>> >> months to backport the guava update change (guava 11 --> 27) throughout
>> >> Cloudera's stack, and after 6 months we are not done yet because we
>> have
>> >> to
>> >> update guava in Hadoop, Hive, Spark ..., and Hadoop, Hive and Spark's
>> >> guava
>> >> is in the classpath of every application.
>> >>
>> >> Thoughts?
>> >>
>> >> On Sat, Mar 7, 2020 at 9:31 AM Wei-Chiu Chuang <we...@apache.org>
>> >> wrote:
>> >>
>> >> > Hi Hadoop devs,
>> >> >
>> >> > In the past, Hadoop tends to be pretty far behind the latest versions
>> of
>> >> > dependencies. Part of that is due to the fear of the breaking changes
>> >> > brought in by the dependency updates.
>> >> >
>> >> > However, things have changed dramatically over the past few years.
>> With
>> >> > more focus on security vulnerabilities, more vulnerabilities are
>> >> discovered
>> >> > in our dependencies, and users put more pressure on patching Hadoop
>> (and
>> >> > its ecosystem) to use the latest dependency versions.
>> >> >
>> >> > As an example, Jackson-databind had 20 CVEs published in the last
>> year
>> >> > alone.
>> >> >
>> >>
>> https://www.cvedetails.com/product/42991/Fasterxml-Jackson-databind.html?vendor_id=15866
>> >> >
>> >> > Jetty: 4 CVEs in 2019:
>> >> >
>> >>
>> https://www.cvedetails.com/product/34824/Eclipse-Jetty.html?vendor_id=10410
>> >> >
>> >> > We can no longer keep Hadoop stay behind. The more we stay behind,
>> the
>> >> > harder it is to update. A good example is Jersey migration 1 -> 2
>> >> > HADOOP-15984 <https://issues.apache.org/jira/browse/HADOOP-15984>
>> >> contributed
>> >> > by Akira. Jersey 1 is no longer supported. But Jersey 2 migration is
>> >> hard.
>> >> > If any critical vulnerability is found in Jersey 1, it will leave us
>> in
>> >> a
>> >> > bad situation since we can't simply update Jersey version and be
>> done.
>> >> >
>> >> > Hadoop 3 adds new public artifacts that shade these dependencies. We
>> >> > should advocate downstream applications to use the public artifacts
>> to
>> >> > avoid breakage.
>> >> >
>> >> > I'd like to hear your thoughts: are you okay to see Hadoop keep up
>> with
>> >> > the latest dependency updates, or would rather stay behind to ensure
>> >> > compatibility?
>> >> >
>> >> > Coupled with that, I'd like to call for more frequent Hadoop releases
>> >> for
>> >> > the same purpose. IMHO that'll require better infrastructure to
>> assist
>> >> the
>> >> > release work and some rethinking our current Hadoop code structure,
>> like
>> >> > separate each subproject into its own repository and release cadence.
>> >> This
>> >> > can be controversial but I think it'll be good for the project in the
>> >> long
>> >> > run.
>> >> >
>> >> > Thanks,
>> >> > Wei-Chiu
>> >> >
>> >>
>> >
>>
>

Re: [DISCUSS] Accelerate Hadoop dependency updates

Posted by Igor Dvorzhak <id...@google.com.INVALID>.
How do you manage and version such dependency upgrades in subminor
Hadoop/Spark/Hive versions at Cloudera then? I would imagine that some
upgrades will be breaking for customers and cannot be shipped in a subminor
CDH release? Or is this in preparation for the next major/minor release of
CDH?

On Wed, Mar 11, 2020 at 5:45 PM Wei-Chiu Chuang
<we...@cloudera.com.invalid> wrote:

> FWIW we are updating guava in Spark and Hive at Cloudera. Don't know which
> Apache version they are going to land in, but we'll upstream them for sure.
>
> The guava change is debatable. It's not as critical as others. There are
> critical vulnerabilities in other dependencies that we have no way but to
> update to a new major/minor version because we are so far behind. And given
> the critical nature, I think it is worth the risk and backport to lower
> maintenance releases is warranted. Moreover, our minor releases are at best
> 1 per year. That is too slow to respond to a critical vulnerability.
>
> On Wed, Mar 11, 2020 at 5:02 PM Igor Dvorzhak <id...@google.com.invalid>
> wrote:
>
> > Generally I'm for updating dependencies, but I think that Hadoop should
> > stick with semantic versioning and do not make major and
> > minor dependency updates in subminor releases.
> >
> > For example, Hadoop 3.2.1 updated Guava to 27.0-jre, and because of this
> > Spark 3.0 stuck with Hadoop 3.2.0 - they use Hive 2.3.6 that doesn't
> > support Guava 27.0-jre.
> >
> > It would be better to make dependency upgrades when releasing new
> > major/minor versions, for example Guava 27.0-jre upgrade was more
> > appropriate for Hadoop 3.3.0 release than 3.2.1.
> >
> > On Tue, Mar 10, 2020 at 3:03 PM Wei-Chiu Chuang
> > <we...@cloudera.com.invalid> wrote:
> >
> >> I'm not hearing any feedback so far, but I want to suggest:
> >>
> >> use hadoop-thirdparty repository to host any dependencies that are known
> >> to
> >> break compatibility.
> >>
> >> Candidate #1 guava
> >> Candidate #2 Netty
> >> Candidate #3 Jetty
> >>
> >> in fact, HBase shades these dependencies for the exact same reason.
> >>
> >> As an example of the cost of compatibility breakage: we spent the last 6
> >> months to backport the guava update change (guava 11 --> 27) throughout
> >> Cloudera's stack, and after 6 months we are not done yet because we have
> >> to
> >> update guava in Hadoop, Hive, Spark ..., and Hadoop, Hive and Spark's
> >> guava
> >> is in the classpath of every application.
> >>
> >> Thoughts?
> >>
> >> On Sat, Mar 7, 2020 at 9:31 AM Wei-Chiu Chuang <we...@apache.org>
> >> wrote:
> >>
> >> > Hi Hadoop devs,
> >> >
> >> > In the past, Hadoop tends to be pretty far behind the latest versions
> of
> >> > dependencies. Part of that is due to the fear of the breaking changes
> >> > brought in by the dependency updates.
> >> >
> >> > However, things have changed dramatically over the past few years.
> With
> >> > more focus on security vulnerabilities, more vulnerabilities are
> >> discovered
> >> > in our dependencies, and users put more pressure on patching Hadoop
> (and
> >> > its ecosystem) to use the latest dependency versions.
> >> >
> >> > As an example, Jackson-databind had 20 CVEs published in the last year
> >> > alone.
> >> >
> >>
> https://www.cvedetails.com/product/42991/Fasterxml-Jackson-databind.html?vendor_id=15866
> >> >
> >> > Jetty: 4 CVEs in 2019:
> >> >
> >>
> https://www.cvedetails.com/product/34824/Eclipse-Jetty.html?vendor_id=10410
> >> >
> >> > We can no longer keep Hadoop stay behind. The more we stay behind, the
> >> > harder it is to update. A good example is Jersey migration 1 -> 2
> >> > HADOOP-15984 <https://issues.apache.org/jira/browse/HADOOP-15984>
> >> contributed
> >> > by Akira. Jersey 1 is no longer supported. But Jersey 2 migration is
> >> hard.
> >> > If any critical vulnerability is found in Jersey 1, it will leave us
> in
> >> a
> >> > bad situation since we can't simply update Jersey version and be done.
> >> >
> >> > Hadoop 3 adds new public artifacts that shade these dependencies. We
> >> > should advocate downstream applications to use the public artifacts to
> >> > avoid breakage.
> >> >
> >> > I'd like to hear your thoughts: are you okay to see Hadoop keep up
> with
> >> > the latest dependency updates, or would rather stay behind to ensure
> >> > compatibility?
> >> >
> >> > Coupled with that, I'd like to call for more frequent Hadoop releases
> >> for
> >> > the same purpose. IMHO that'll require better infrastructure to assist
> >> the
> >> > release work and some rethinking our current Hadoop code structure,
> like
> >> > separate each subproject into its own repository and release cadence.
> >> This
> >> > can be controversial but I think it'll be good for the project in the
> >> long
> >> > run.
> >> >
> >> > Thanks,
> >> > Wei-Chiu
> >> >
> >>
> >
>

Re: [DISCUSS] Accelerate Hadoop dependency updates

Posted by Wei-Chiu Chuang <we...@cloudera.com.INVALID>.
FWIW, we are updating guava in Spark and Hive at Cloudera. I don't know which
Apache versions they are going to land in, but we'll upstream them for sure.

The guava change is debatable. It's not as critical as others. There are
critical vulnerabilities in other dependencies where we have no choice but to
update to a new major/minor version, because we are so far behind. And given
the critical nature, I think the risk is worth it, and a backport to lower
maintenance releases is warranted. Moreover, our minor releases come at best
once per year. That is too slow to respond to a critical vulnerability.
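
To make the risk concrete: Guava removed APIs between 11 and 27, so downstream
code compiled against the old version can blow up at run time once the newer
jar wins on the classpath. A made-up sketch of that kind of incompatibility
(the class is purely illustrative):

    // Compiles fine against Guava 11, which still had Objects.toStringHelper.
    // That method was removed in later Guava releases (replaced by
    // MoreObjects.toStringHelper), so with Guava 27 on the classpath this call
    // fails with NoSuchMethodError unless the code is updated and recompiled.
    import com.google.common.base.Objects;

    public class Job {
      private final String name = "example";

      @Override
      public String toString() {
        return Objects.toStringHelper(this)
            .add("name", name)
            .toString();
      }
    }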

On Wed, Mar 11, 2020 at 5:02 PM Igor Dvorzhak <id...@google.com.invalid>
wrote:

> Generally I'm for updating dependencies, but I think that Hadoop should
> stick with semantic versioning and do not make major and
> minor dependency updates in subminor releases.
>
> For example, Hadoop 3.2.1 updated Guava to 27.0-jre, and because of this
> Spark 3.0 stuck with Hadoop 3.2.0 - they use Hive 2.3.6 that doesn't
> support Guava 27.0-jre.
>
> It would be better to make dependency upgrades when releasing new
> major/minor versions, for example Guava 27.0-jre upgrade was more
> appropriate for Hadoop 3.3.0 release than 3.2.1.
>
> On Tue, Mar 10, 2020 at 3:03 PM Wei-Chiu Chuang
> <we...@cloudera.com.invalid> wrote:
>
>> I'm not hearing any feedback so far, but I want to suggest:
>>
>> use hadoop-thirdparty repository to host any dependencies that are known
>> to
>> break compatibility.
>>
>> Candidate #1 guava
>> Candidate #2 Netty
>> Candidate #3 Jetty
>>
>> in fact, HBase shades these dependencies for the exact same reason.
>>
>> As an example of the cost of compatibility breakage: we spent the last 6
>> months to backport the guava update change (guava 11 --> 27) throughout
>> Cloudera's stack, and after 6 months we are not done yet because we have
>> to
>> update guava in Hadoop, Hive, Spark ..., and Hadoop, Hive and Spark's
>> guava
>> is in the classpath of every application.
>>
>> Thoughts?
>>
>> On Sat, Mar 7, 2020 at 9:31 AM Wei-Chiu Chuang <we...@apache.org>
>> wrote:
>>
>> > Hi Hadoop devs,
>> >
>> > In the past, Hadoop tends to be pretty far behind the latest versions of
>> > dependencies. Part of that is due to the fear of the breaking changes
>> > brought in by the dependency updates.
>> >
>> > However, things have changed dramatically over the past few years. With
>> > more focus on security vulnerabilities, more vulnerabilities are
>> discovered
>> > in our dependencies, and users put more pressure on patching Hadoop (and
>> > its ecosystem) to use the latest dependency versions.
>> >
>> > As an example, Jackson-databind had 20 CVEs published in the last year
>> > alone.
>> >
>> https://www.cvedetails.com/product/42991/Fasterxml-Jackson-databind.html?vendor_id=15866
>> >
>> > Jetty: 4 CVEs in 2019:
>> >
>> https://www.cvedetails.com/product/34824/Eclipse-Jetty.html?vendor_id=10410
>> >
>> > We can no longer keep Hadoop stay behind. The more we stay behind, the
>> > harder it is to update. A good example is Jersey migration 1 -> 2
>> > HADOOP-15984 <https://issues.apache.org/jira/browse/HADOOP-15984>
>> contributed
>> > by Akira. Jersey 1 is no longer supported. But Jersey 2 migration is
>> hard.
>> > If any critical vulnerability is found in Jersey 1, it will leave us in
>> a
>> > bad situation since we can't simply update Jersey version and be done.
>> >
>> > Hadoop 3 adds new public artifacts that shade these dependencies. We
>> > should advocate downstream applications to use the public artifacts to
>> > avoid breakage.
>> >
>> > I'd like to hear your thoughts: are you okay to see Hadoop keep up with
>> > the latest dependency updates, or would rather stay behind to ensure
>> > compatibility?
>> >
>> > Coupled with that, I'd like to call for more frequent Hadoop releases
>> for
>> > the same purpose. IMHO that'll require better infrastructure to assist
>> the
>> > release work and some rethinking our current Hadoop code structure, like
>> > separate each subproject into its own repository and release cadence.
>> This
>> > can be controversial but I think it'll be good for the project in the
>> long
>> > run.
>> >
>> > Thanks,
>> > Wei-Chiu
>> >
>>
>

Re: [DISCUSS] Accelerate Hadoop dependency updates

Posted by Igor Dvorzhak <id...@google.com.INVALID>.
Generally I'm for updating dependencies, but I think that Hadoop should
stick with semantic versioning and not make major or minor dependency
updates in subminor releases.

For example, Hadoop 3.2.1 updated Guava to 27.0-jre, and because of this
Spark 3.0 is stuck on Hadoop 3.2.0 - they use Hive 2.3.6, which doesn't
support Guava 27.0-jre.

It would be better to make dependency upgrades when releasing new
major/minor versions; for example, the Guava 27.0-jre upgrade was more
appropriate for the Hadoop 3.3.0 release than for 3.2.1.
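
To make the kind of breakage concrete (an illustrative sketch, not
necessarily the exact Hive code path): Guava removed
com.google.common.base.Objects.toStringHelper() in favor of
MoreObjects.toStringHelper() (in Guava 21, if I recall correctly), so a
library compiled against an older Guava blows up at runtime once
Guava 27.0-jre is the copy on the classpath.

    // Illustrative only: this compiles against Guava < 21 (e.g. 14.0.1).
    // With guava-27.0-jre on the runtime classpath the call below throws
    // java.lang.NoSuchMethodError, because Objects.toStringHelper() was
    // removed (replaced by MoreObjects.toStringHelper()).
    import com.google.common.base.Objects;

    public class LegacyGuavaCaller {
      public static String describe(String name, int value) {
        return Objects.toStringHelper("Example") // gone in newer Guava
            .add("name", name)
            .add("value", value)
            .toString();
      }

      public static void main(String[] args) {
        System.out.println(describe("demo", 42));
      }
    }

This is why a Guava major-version bump in a subminor Hadoop release
ripples into every project that shares a classpath with Hadoop.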

On Tue, Mar 10, 2020 at 3:03 PM Wei-Chiu Chuang
<we...@cloudera.com.invalid> wrote:

> I'm not hearing any feedback so far, but I want to suggest:
>
> use hadoop-thirdparty repository to host any dependencies that are known to
> break compatibility.
>
> Candidate #1 guava
> Candidate #2 Netty
> Candidate #3 Jetty
>
> in fact, HBase shades these dependencies for the exact same reason.
>
> As an example of the cost of compatibility breakage: we spent the last 6
> months to backport the guava update change (guava 11 --> 27) throughout
> Cloudera's stack, and after 6 months we are not done yet because we have to
> update guava in Hadoop, Hive, Spark ..., and Hadoop, Hive and Spark's guava
> is in the classpath of every application.
>
> Thoughts?
>
> On Sat, Mar 7, 2020 at 9:31 AM Wei-Chiu Chuang <we...@apache.org> wrote:
>
> > Hi Hadoop devs,
> >
> > I the past, Hadoop tends to be pretty far behind the latest versions of
> > dependencies. Part of that is due to the fear of the breaking changes
> > brought in by the dependency updates.
> >
> > However, things have changed dramatically over the past few years. With
> > more focus on security vulnerabilities, more vulnerabilities are
> discovered
> > in our dependencies, and users put more pressure on patching Hadoop (and
> > its ecosystem) to use the latest dependency versions.
> >
> > As an example, Jackson-databind had 20 CVEs published in the last year
> > alone.
> >
> https://www.cvedetails.com/product/42991/Fasterxml-Jackson-databind.html?vendor_id=15866
> >
> > Jetty: 4 CVEs in 2019:
> >
> https://www.cvedetails.com/product/34824/Eclipse-Jetty.html?vendor_id=10410
> >
> > We can no longer keep Hadoop stay behind. The more we stay behind, the
> > harder it is to update. A good example is Jersey migration 1 -> 2
> > HADOOP-15984 <https://issues.apache.org/jira/browse/HADOOP-15984>
> contributed
> > by Akira. Jersey 1 is no longer supported. But Jersey 2 migration is
> hard.
> > If any critical vulnerability is found in Jersey 1, it will leave us in a
> > bad situation since we can't simply update Jersey version and be done.
> >
> > Hadoop 3 adds new public artifacts that shade these dependencies. We
> > should advocate downstream applications to use the public artifacts to
> > avoid breakage.
> >
> > I'd like to hear your thoughts: are you okay to see Hadoop keep up with
> > the latest dependency updates, or would rather stay behind to ensure
> > compatibility?
> >
> > Coupled with that, I'd like to call for more frequent Hadoop releases for
> > the same purpose. IMHO that'll require better infrastructure to assist
> the
> > release work and some rethinking our current Hadoop code structure, like
> > separate each subproject into its own repository and release cadence.
> This
> > can be controversial but I think it'll be good for the project in the
> long
> > run.
> >
> > Thanks,
> > Wei-Chiu
> >
>

Re: [DISCUSS] Accelerate Hadoop dependency updates

Posted by Wei-Chiu Chuang <we...@cloudera.com.INVALID>.
I'm not hearing any feedback so far, but I want to suggest:

use the hadoop-thirdparty repository to host any dependencies that are
known to break compatibility.

Candidate #1 guava
Candidate #2 Netty
Candidate #3 Jetty

In fact, HBase shades these dependencies for the exact same reason.
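
For anyone less familiar with the approach: the idea is that Hadoop itself
stops importing these libraries directly and instead imports a copy
relocated under a Hadoop-owned package, published from hadoop-thirdparty.
A minimal sketch of what Hadoop-side code would look like (the
org.apache.hadoop.thirdparty prefix below matches what the existing
hadoop-shaded-guava artifact uses; treat the details as illustrative):

    // Sketch only. Assumes a hadoop-thirdparty artifact that relocates
    // Guava under the org.apache.hadoop.thirdparty prefix. Hadoop code
    // uses the relocated copy, so whatever Guava version a downstream
    // application puts on its classpath can no longer conflict with it.
    import org.apache.hadoop.thirdparty.com.google.common.base.Preconditions;
    import org.apache.hadoop.thirdparty.com.google.common.collect.ImmutableList;

    import java.util.Arrays;
    import java.util.List;

    public class RelocatedGuavaExample {
      public static ImmutableList<String> firstN(List<String> input, int n) {
        Preconditions.checkArgument(n >= 0, "n must be non-negative");
        return ImmutableList.copyOf(input.subList(0, Math.min(n, input.size())));
      }

      public static void main(String[] args) {
        System.out.println(firstN(Arrays.asList("a", "b", "c"), 2));
      }
    }

The application then remains free to put any unshaded Guava, Netty or
Jetty of its own on the classpath without stepping on Hadoop's copy.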

As an example of the cost of compatibility breakage: we have spent the last
6 months backporting the guava update (guava 11 --> 27) throughout
Cloudera's stack, and we are still not done, because we have to update guava
in Hadoop, Hive, Spark ..., and Hadoop's, Hive's and Spark's guava is in the
classpath of every application.

Thoughts?

On Sat, Mar 7, 2020 at 9:31 AM Wei-Chiu Chuang <we...@apache.org> wrote:

> Hi Hadoop devs,
>
> I the past, Hadoop tends to be pretty far behind the latest versions of
> dependencies. Part of that is due to the fear of the breaking changes
> brought in by the dependency updates.
>
> However, things have changed dramatically over the past few years. With
> more focus on security vulnerabilities, more vulnerabilities are discovered
> in our dependencies, and users put more pressure on patching Hadoop (and
> its ecosystem) to use the latest dependency versions.
>
> As an example, Jackson-databind had 20 CVEs published in the last year
> alone.
> https://www.cvedetails.com/product/42991/Fasterxml-Jackson-databind.html?vendor_id=15866
>
> Jetty: 4 CVEs in 2019:
> https://www.cvedetails.com/product/34824/Eclipse-Jetty.html?vendor_id=10410
>
> We can no longer keep Hadoop stay behind. The more we stay behind, the
> harder it is to update. A good example is Jersey migration 1 -> 2
> HADOOP-15984 <https://issues.apache.org/jira/browse/HADOOP-15984> contributed
> by Akira. Jersey 1 is no longer supported. But Jersey 2 migration is hard.
> If any critical vulnerability is found in Jersey 1, it will leave us in a
> bad situation since we can't simply update Jersey version and be done.
>
> Hadoop 3 adds new public artifacts that shade these dependencies. We
> should advocate downstream applications to use the public artifacts to
> avoid breakage.
>
> I'd like to hear your thoughts: are you okay to see Hadoop keep up with
> the latest dependency updates, or would rather stay behind to ensure
> compatibility?
>
> Coupled with that, I'd like to call for more frequent Hadoop releases for
> the same purpose. IMHO that'll require better infrastructure to assist the
> release work and some rethinking our current Hadoop code structure, like
> separate each subproject into its own repository and release cadence. This
> can be controversial but I think it'll be good for the project in the long
> run.
>
> Thanks,
> Wei-Chiu
>
