You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@spark.apache.org by Dongjoon Hyun <do...@gmail.com> on 2019/10/28 19:33:29 UTC

Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

Hi, All.

There was a discussion on publishing artifacts built with Hadoop 3 .
But, we are still publishing with Hadoop 2.7.3 and `3.0-preview` will be
the same because we didn't change anything yet.

Technically, we need to change two places for publishing.

1. Jenkins Snapshot Publishing

https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/

2. Release Snapshot/Release Publishing

https://github.com/apache/spark/blob/master/dev/create-release/release-build.sh

To minimize the change, we need to switch our default Hadoop profile.

Currently, the default is `hadoop-2.7 (2.7.4)` profile and `hadoop-3.2
(3.2.0)` is optional.
We had better use `hadoop-3.2` profile by default and `hadoop-2.7`
optionally.

Note that this means we use Hive 2.3.6 by default. Only `hadoop-2.7`
distribution will use `Hive 1.2.1` like Apache Spark 2.4.x.

Bests,
Dongjoon.

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

Posted by Cheng Lian <li...@gmail.com>.

Hey Nicholas,

Thanks for pointing this out. I just realized that I misread the
spark-hadoop-cloud POM. Previously, in Spark 2.4, two profiles,
"hadoop-2.7" and "hadoop-3.1", were referenced in the spark-hadoop-cloud
POM (here
<https://github.com/apache/spark/blob/v2.4.4/hadoop-cloud/pom.xml#L174> and
here <https://github.com/apache/spark/blob/v2.4.4/hadoop-cloud/pom.xml#L213>).
But in the current master (3.0.0-SNAPSHOT), only the "hadoop-3.2" profile
is mentioned. And I came to the wrong conclusion that spark-hadoop-cloud in
Spark 3.0.0 is only available with the "hadoop-3.2" profile. Apologies for
the misleading information.

Cheng



On Tue, Nov 19, 2019 at 8:57 PM Nicholas Chammas <ni...@gmail.com>
wrote:

> > I don't think the default Hadoop version matters except for the
> spark-hadoop-cloud module, which is only meaningful under the hadoop-3.2
> profile.
>
> What do you mean by "only meaningful under the hadoop-3.2 profile"?
>
> On Tue, Nov 19, 2019 at 5:40 PM Cheng Lian <li...@gmail.com> wrote:
>
>> Hey Steve,
>>
>> In terms of Maven artifact, I don't think the default Hadoop version
>> matters except for the spark-hadoop-cloud module, which is only meaningful
>> under the hadoop-3.2 profile. All  the other spark-* artifacts published to
>> Maven central are Hadoop-version-neutral.
>>
>> Another issue about switching the default Hadoop version to 3.2 is
>> PySpark distribution. Right now, we only publish PySpark artifacts prebuilt
>> with Hadoop 2.x to PyPI. I'm not sure whether bumping the Hadoop dependency
>> to 3.2 is feasible for PySpark users. Or maybe we should publish PySpark
>> prebuilt with both Hadoop 2.x and 3.x. I'm open to suggestions on this one.
>>
>> Again, as long as Hive 2.3 and Hadoop 3.2 upgrade can be decoupled via
>> the proposed hive-2.3 profile, I personally don't have a preference over
>> having Hadoop 2.7 or 3.2 as the default Hadoop version. But just for
>> minimizing the release management work, in case we decided to publish other
>> spark-* Maven artifacts from a Hadoop 2.7 build, we can still special case
>> spark-hadoop-cloud and publish it using a hadoop-3.2 build.
>>
>> On Mon, Nov 18, 2019 at 8:39 PM Dongjoon Hyun <do...@gmail.com>
>> wrote:
>>
>>> I also agree with Steve and Felix.
>>>
>>> Let's have another thread to discuss Hive issue
>>>
>>> because this thread was originally for `hadoop` version.
>>>
>>> And, now we can have `hive-2.3` profile for both `hadoop-2.7` and
>>> `hadoop-3.0` versions.
>>>
>>> We don't need to mix both.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>> On Mon, Nov 18, 2019 at 8:19 PM Felix Cheung <fe...@hotmail.com>
>>> wrote:
>>>
>>>> 1000% with Steve, the org.spark-project hive 1.2 will need a solution.
>>>> It is old and rather buggy; and It’s been *years*
>>>>
>>>> I think we should decouple hive change from everything else if people
>>>> are concerned?
>>>>
>>>> ------------------------------
>>>> *From:* Steve Loughran <st...@cloudera.com.INVALID>
>>>> *Sent:* Sunday, November 17, 2019 9:22:09 AM
>>>> *To:* Cheng Lian <li...@gmail.com>
>>>> *Cc:* Sean Owen <sr...@gmail.com>; Wenchen Fan <cl...@gmail.com>;
>>>> Dongjoon Hyun <do...@gmail.com>; dev <de...@spark.apache.org>;
>>>> Yuming Wang <wg...@gmail.com>
>>>> *Subject:* Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?
>>>>
>>>> Can I take this moment to remind everyone that the version of hive
>>>> which spark has historically bundled (the org.spark-project one) is an
>>>> orphan project put together to deal with Hive's shading issues and a source
>>>> of unhappiness in the Hive project. What ever get shipped should do its
>>>> best to avoid including that file.
>>>>
>>>> Postponing a switch to hadoop 3.x after spark 3.0 is probably the
>>>> safest move from a risk minimisation perspective. If something has broken
>>>> then it is you can start with the assumption that it is in the o.a.s
>>>> packages without having to debug o.a.hadoop and o.a.hive first. There is a
>>>> cost: if there are problems with the hadoop / hive dependencies those teams
>>>> will inevitably ignore filed bug reports for the same reason spark team
>>>> will probably because 1.6-related JIRAs as WONTFIX. WONTFIX responses for
>>>> the Hadoop 2.x line include any compatibility issues with Java 9+. Do bear
>>>> that in mind. It's not been tested, it has dependencies on artifacts we
>>>> know are incompatible, and as far as the Hadoop project is concerned:
>>>> people should move to branch 3 if they want to run on a modern version of
>>>> Java
>>>>
>>>> It would be really really good if the published spark maven artefacts
>>>> (a) included the spark-hadoop-cloud JAR and (b) were dependent upon hadoop
>>>> 3.x. That way people doing things with their own projects will get
>>>> up-to-date dependencies and don't get WONTFIX responses themselves.
>>>>
>>>> -Steve
>>>>
>>>> PS: Discussion on hadoop-dev @ making Hadoop 2.10 the official "last
>>>> ever" branch-2 release and then declare its predecessors EOL; 2.10 will be
>>>> the transition release.
>>>>
>>>> On Sun, Nov 17, 2019 at 1:50 AM Cheng Lian <li...@gmail.com>
>>>> wrote:
>>>>
>>>> Dongjoon, I didn't follow the original Hive 2.3 discussion closely. I
>>>> thought the original proposal was to replace Hive 1.2 with Hive 2.3, which
>>>> seemed risky, and therefore we only introduced Hive 2.3 under the
>>>> hadoop-3.2 profile without removing Hive 1.2. But maybe I'm totally wrong
>>>> here...
>>>>
>>>> Sean, Yuming's PR https://github.com/apache/spark/pull/26533 showed
>>>> that Hadoop 2 + Hive 2 + JDK 11 looks promising. My major motivation is not
>>>> about demand, but risk control: coupling Hive 2.3, Hadoop 3.2, and JDK 11
>>>> upgrade together looks too risky.
>>>>
>>>> On Sat, Nov 16, 2019 at 4:03 AM Sean Owen <sr...@gmail.com> wrote:
>>>>
>>>> I'd prefer simply not making Hadoop 3 the default until 3.1+, rather
>>>> than introduce yet another build combination. Does Hadoop 2 + Hive 2
>>>> work and is there demand for it?
>>>>
>>>> On Sat, Nov 16, 2019 at 3:52 AM Wenchen Fan <cl...@gmail.com>
>>>> wrote:
>>>> >
>>>> > Do we have a limitation on the number of pre-built distributions?
>>>> Seems this time we need
>>>> > 1. hadoop 2.7 + hive 1.2
>>>> > 2. hadoop 2.7 + hive 2.3
>>>> > 3. hadoop 3 + hive 2.3
>>>> >
>>>> > AFAIK we always built with JDK 8 (but make it JDK 11 compatible), so
>>>> don't need to add JDK version to the combination.
>>>> >
>>>> > On Sat, Nov 16, 2019 at 4:05 PM Dongjoon Hyun <
>>>> dongjoon.hyun@gmail.com> wrote:
>>>> >>
>>>> >> Thank you for suggestion.
>>>> >>
>>>> >> Having `hive-2.3` profile sounds good to me because it's orthogonal
>>>> to Hadoop 3.
>>>> >> IIRC, originally, it was proposed in that way, but we put it under
>>>> `hadoop-3.2` to avoid adding new profiles at that time.
>>>> >>
>>>> >> And, I'm wondering if you are considering additional pre-built
>>>> distribution and Jenkins jobs.
>>>> >>
>>>> >> Bests,
>>>> >> Dongjoon.
>>>> >>
>>>>
>>>>

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

Posted by Nicholas Chammas <ni...@gmail.com>.

> I don't think the default Hadoop version matters except for the
spark-hadoop-cloud module, which is only meaningful under the hadoop-3.2
profile.

What do you mean by "only meaningful under the hadoop-3.2 profile"?

On Tue, Nov 19, 2019 at 5:40 PM Cheng Lian <li...@gmail.com> wrote:

> Hey Steve,
>
> In terms of Maven artifact, I don't think the default Hadoop version
> matters except for the spark-hadoop-cloud module, which is only meaningful
> under the hadoop-3.2 profile. All  the other spark-* artifacts published to
> Maven central are Hadoop-version-neutral.
>
> Another issue about switching the default Hadoop version to 3.2 is PySpark
> distribution. Right now, we only publish PySpark artifacts prebuilt with
> Hadoop 2.x to PyPI. I'm not sure whether bumping the Hadoop dependency to
> 3.2 is feasible for PySpark users. Or maybe we should publish PySpark
> prebuilt with both Hadoop 2.x and 3.x. I'm open to suggestions on this one.
>
> Again, as long as Hive 2.3 and Hadoop 3.2 upgrade can be decoupled via the
> proposed hive-2.3 profile, I personally don't have a preference over having
> Hadoop 2.7 or 3.2 as the default Hadoop version. But just for minimizing
> the release management work, in case we decided to publish other spark-*
> Maven artifacts from a Hadoop 2.7 build, we can still special case
> spark-hadoop-cloud and publish it using a hadoop-3.2 build.
>
> On Mon, Nov 18, 2019 at 8:39 PM Dongjoon Hyun <do...@gmail.com>
> wrote:
>
>> I also agree with Steve and Felix.
>>
>> Let's have another thread to discuss Hive issue
>>
>> because this thread was originally for `hadoop` version.
>>
>> And, now we can have `hive-2.3` profile for both `hadoop-2.7` and
>> `hadoop-3.0` versions.
>>
>> We don't need to mix both.
>>
>> Bests,
>> Dongjoon.
>>
>>
>> On Mon, Nov 18, 2019 at 8:19 PM Felix Cheung <fe...@hotmail.com>
>> wrote:
>>
>>> 1000% with Steve, the org.spark-project hive 1.2 will need a solution.
>>> It is old and rather buggy; and It’s been *years*
>>>
>>> I think we should decouple hive change from everything else if people
>>> are concerned?
>>>
>>> ------------------------------
>>> *From:* Steve Loughran <st...@cloudera.com.INVALID>
>>> *Sent:* Sunday, November 17, 2019 9:22:09 AM
>>> *To:* Cheng Lian <li...@gmail.com>
>>> *Cc:* Sean Owen <sr...@gmail.com>; Wenchen Fan <cl...@gmail.com>;
>>> Dongjoon Hyun <do...@gmail.com>; dev <de...@spark.apache.org>;
>>> Yuming Wang <wg...@gmail.com>
>>> *Subject:* Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?
>>>
>>> Can I take this moment to remind everyone that the version of hive which
>>> spark has historically bundled (the org.spark-project one) is an orphan
>>> project put together to deal with Hive's shading issues and a source of
>>> unhappiness in the Hive project. What ever get shipped should do its best
>>> to avoid including that file.
>>>
>>> Postponing a switch to hadoop 3.x after spark 3.0 is probably the safest
>>> move from a risk minimisation perspective. If something has broken then it
>>> is you can start with the assumption that it is in the o.a.s packages
>>> without having to debug o.a.hadoop and o.a.hive first. There is a cost: if
>>> there are problems with the hadoop / hive dependencies those teams will
>>> inevitably ignore filed bug reports for the same reason spark team will
>>> probably because 1.6-related JIRAs as WONTFIX. WONTFIX responses for the
>>> Hadoop 2.x line include any compatibility issues with Java 9+. Do bear that
>>> in mind. It's not been tested, it has dependencies on artifacts we know are
>>> incompatible, and as far as the Hadoop project is concerned: people should
>>> move to branch 3 if they want to run on a modern version of Java
>>>
>>> It would be really really good if the published spark maven artefacts
>>> (a) included the spark-hadoop-cloud JAR and (b) were dependent upon hadoop
>>> 3.x. That way people doing things with their own projects will get
>>> up-to-date dependencies and don't get WONTFIX responses themselves.
>>>
>>> -Steve
>>>
>>> PS: Discussion on hadoop-dev @ making Hadoop 2.10 the official "last
>>> ever" branch-2 release and then declare its predecessors EOL; 2.10 will be
>>> the transition release.
>>>
>>> On Sun, Nov 17, 2019 at 1:50 AM Cheng Lian <li...@gmail.com>
>>> wrote:
>>>
>>> Dongjoon, I didn't follow the original Hive 2.3 discussion closely. I
>>> thought the original proposal was to replace Hive 1.2 with Hive 2.3, which
>>> seemed risky, and therefore we only introduced Hive 2.3 under the
>>> hadoop-3.2 profile without removing Hive 1.2. But maybe I'm totally wrong
>>> here...
>>>
>>> Sean, Yuming's PR https://github.com/apache/spark/pull/26533 showed
>>> that Hadoop 2 + Hive 2 + JDK 11 looks promising. My major motivation is not
>>> about demand, but risk control: coupling Hive 2.3, Hadoop 3.2, and JDK 11
>>> upgrade together looks too risky.
>>>
>>> On Sat, Nov 16, 2019 at 4:03 AM Sean Owen <sr...@gmail.com> wrote:
>>>
>>> I'd prefer simply not making Hadoop 3 the default until 3.1+, rather
>>> than introduce yet another build combination. Does Hadoop 2 + Hive 2
>>> work and is there demand for it?
>>>
>>> On Sat, Nov 16, 2019 at 3:52 AM Wenchen Fan <cl...@gmail.com> wrote:
>>> >
>>> > Do we have a limitation on the number of pre-built distributions?
>>> Seems this time we need
>>> > 1. hadoop 2.7 + hive 1.2
>>> > 2. hadoop 2.7 + hive 2.3
>>> > 3. hadoop 3 + hive 2.3
>>> >
>>> > AFAIK we always built with JDK 8 (but make it JDK 11 compatible), so
>>> don't need to add JDK version to the combination.
>>> >
>>> > On Sat, Nov 16, 2019 at 4:05 PM Dongjoon Hyun <do...@gmail.com>
>>> wrote:
>>> >>
>>> >> Thank you for suggestion.
>>> >>
>>> >> Having `hive-2.3` profile sounds good to me because it's orthogonal
>>> to Hadoop 3.
>>> >> IIRC, originally, it was proposed in that way, but we put it under
>>> `hadoop-3.2` to avoid adding new profiles at that time.
>>> >>
>>> >> And, I'm wondering if you are considering additional pre-built
>>> distribution and Jenkins jobs.
>>> >>
>>> >> Bests,
>>> >> Dongjoon.
>>> >>
>>>
>>>

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

Posted by Mridul Muralidharan <mr...@gmail.com>.

Just for completeness sake, spark is not version neutral to hadoop;
particularly in yarn mode, there is a minimum version requirement
(though fairly generous I believe).

I agree with Steve, it is a long standing pain that we are bundling a
positively ancient version of hive.
Having said that, we should decouple the hive artifact question from
the hadoop version question - though they might be related currently.

Regards,
Mridul

On Tue, Nov 19, 2019 at 2:40 PM Cheng Lian <li...@gmail.com> wrote:
>
> Hey Steve,
>
> In terms of Maven artifact, I don't think the default Hadoop version matters except for the spark-hadoop-cloud module, which is only meaningful under the hadoop-3.2 profile. All  the other spark-* artifacts published to Maven central are Hadoop-version-neutral.
>
> Another issue about switching the default Hadoop version to 3.2 is PySpark distribution. Right now, we only publish PySpark artifacts prebuilt with Hadoop 2.x to PyPI. I'm not sure whether bumping the Hadoop dependency to 3.2 is feasible for PySpark users. Or maybe we should publish PySpark prebuilt with both Hadoop 2.x and 3.x. I'm open to suggestions on this one.
>
> Again, as long as Hive 2.3 and Hadoop 3.2 upgrade can be decoupled via the proposed hive-2.3 profile, I personally don't have a preference over having Hadoop 2.7 or 3.2 as the default Hadoop version. But just for minimizing the release management work, in case we decided to publish other spark-* Maven artifacts from a Hadoop 2.7 build, we can still special case spark-hadoop-cloud and publish it using a hadoop-3.2 build.
>
> On Mon, Nov 18, 2019 at 8:39 PM Dongjoon Hyun <do...@gmail.com> wrote:
>>
>> I also agree with Steve and Felix.
>>
>> Let's have another thread to discuss Hive issue
>>
>> because this thread was originally for `hadoop` version.
>>
>> And, now we can have `hive-2.3` profile for both `hadoop-2.7` and `hadoop-3.0` versions.
>>
>> We don't need to mix both.
>>
>> Bests,
>> Dongjoon.
>>
>>
>> On Mon, Nov 18, 2019 at 8:19 PM Felix Cheung <fe...@hotmail.com> wrote:
>>>
>>> 1000% with Steve, the org.spark-project hive 1.2 will need a solution. It is old and rather buggy; and It’s been *years*
>>>
>>> I think we should decouple hive change from everything else if people are concerned?
>>>
>>> ________________________________
>>> From: Steve Loughran <st...@cloudera.com.INVALID>
>>> Sent: Sunday, November 17, 2019 9:22:09 AM
>>> To: Cheng Lian <li...@gmail.com>
>>> Cc: Sean Owen <sr...@gmail.com>; Wenchen Fan <cl...@gmail.com>; Dongjoon Hyun <do...@gmail.com>; dev <de...@spark.apache.org>; Yuming Wang <wg...@gmail.com>
>>> Subject: Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?
>>>
>>> Can I take this moment to remind everyone that the version of hive which spark has historically bundled (the org.spark-project one) is an orphan project put together to deal with Hive's shading issues and a source of unhappiness in the Hive project. What ever get shipped should do its best to avoid including that file.
>>>
>>> Postponing a switch to hadoop 3.x after spark 3.0 is probably the safest move from a risk minimisation perspective. If something has broken then it is you can start with the assumption that it is in the o.a.s packages without having to debug o.a.hadoop and o.a.hive first. There is a cost: if there are problems with the hadoop / hive dependencies those teams will inevitably ignore filed bug reports for the same reason spark team will probably because 1.6-related JIRAs as WONTFIX. WONTFIX responses for the Hadoop 2.x line include any compatibility issues with Java 9+. Do bear that in mind. It's not been tested, it has dependencies on artifacts we know are incompatible, and as far as the Hadoop project is concerned: people should move to branch 3 if they want to run on a modern version of Java
>>>
>>> It would be really really good if the published spark maven artefacts (a) included the spark-hadoop-cloud JAR and (b) were dependent upon hadoop 3.x. That way people doing things with their own projects will get up-to-date dependencies and don't get WONTFIX responses themselves.
>>>
>>> -Steve
>>>
>>> PS: Discussion on hadoop-dev @ making Hadoop 2.10 the official "last ever" branch-2 release and then declare its predecessors EOL; 2.10 will be the transition release.
>>>
>>> On Sun, Nov 17, 2019 at 1:50 AM Cheng Lian <li...@gmail.com> wrote:
>>>
>>> Dongjoon, I didn't follow the original Hive 2.3 discussion closely. I thought the original proposal was to replace Hive 1.2 with Hive 2.3, which seemed risky, and therefore we only introduced Hive 2.3 under the hadoop-3.2 profile without removing Hive 1.2. But maybe I'm totally wrong here...
>>>
>>> Sean, Yuming's PR https://github.com/apache/spark/pull/26533 showed that Hadoop 2 + Hive 2 + JDK 11 looks promising. My major motivation is not about demand, but risk control: coupling Hive 2.3, Hadoop 3.2, and JDK 11 upgrade together looks too risky.
>>>
>>> On Sat, Nov 16, 2019 at 4:03 AM Sean Owen <sr...@gmail.com> wrote:
>>>
>>> I'd prefer simply not making Hadoop 3 the default until 3.1+, rather
>>> than introduce yet another build combination. Does Hadoop 2 + Hive 2
>>> work and is there demand for it?
>>>
>>> On Sat, Nov 16, 2019 at 3:52 AM Wenchen Fan <cl...@gmail.com> wrote:
>>> >
>>> > Do we have a limitation on the number of pre-built distributions? Seems this time we need
>>> > 1. hadoop 2.7 + hive 1.2
>>> > 2. hadoop 2.7 + hive 2.3
>>> > 3. hadoop 3 + hive 2.3
>>> >
>>> > AFAIK we always built with JDK 8 (but make it JDK 11 compatible), so don't need to add JDK version to the combination.
>>> >
>>> > On Sat, Nov 16, 2019 at 4:05 PM Dongjoon Hyun <do...@gmail.com> wrote:
>>> >>
>>> >> Thank you for suggestion.
>>> >>
>>> >> Having `hive-2.3` profile sounds good to me because it's orthogonal to Hadoop 3.
>>> >> IIRC, originally, it was proposed in that way, but we put it under `hadoop-3.2` to avoid adding new profiles at that time.
>>> >>
>>> >> And, I'm wondering if you are considering additional pre-built distribution and Jenkins jobs.
>>> >>
>>> >> Bests,
>>> >> Dongjoon.
>>> >>

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

Posted by Steve Loughran <st...@cloudera.com.INVALID>.

On Tue, Nov 19, 2019 at 10:40 PM Cheng Lian <li...@gmail.com> wrote:

> Hey Steve,
>
> In terms of Maven artifact, I don't think the default Hadoop version
> matters except for the spark-hadoop-cloud module, which is only meaningful
> under the hadoop-3.2 profile. All  the other spark-* artifacts published to
> Maven central are Hadoop-version-neutral.
>

It's more that everyone using it has to do the game of excluding all the
old artifacts and requesting the new dependencies -including working out
what the spark poms excluded from their imports of later versions of
things.


>
> Another issue about switching the default Hadoop version to 3.2 is PySpark
> distribution. Right now, we only publish PySpark artifacts prebuilt with
> Hadoop 2.x to PyPI. I'm not sure whether bumping the Hadoop dependency to
> 3.2 is feasible for PySpark users. Or maybe we should publish PySpark
> prebuilt with both Hadoop 2.x and 3.x. I'm open to suggestions on this one.
>
> Again, as long as Hive 2.3 and Hadoop 3.2 upgrade can be decoupled via the
> proposed hive-2.3 profile, I personally don't have a preference over having
> Hadoop 2.7 or 3.2 as the default Hadoop version. But just for minimizing
> the release management work, in case we decided to publish other spark-*
> Maven artifacts from a Hadoop 2.7 build, we can still special case
> spark-hadoop-cloud and publish it using a hadoop-3.2 build.
>

that would really complicate life on maven. sticking a version on mvn
central with the 3.2 dependencies consistently would be better



>>>

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

Posted by Hyukjin Kwon <gu...@gmail.com>.

We don't have an official Spark with Hadoop 3 yet (except the preview) if I
am not mistaken.
I think it's more natural to one minor release term before switching this
...
How about we target Hadoop 3 as default in Spark 3.1?


2019년 11월 20일 (수) 오전 7:40, Cheng Lian <li...@gmail.com>님이 작성:

> Hey Steve,
>
> In terms of Maven artifact, I don't think the default Hadoop version
> matters except for the spark-hadoop-cloud module, which is only meaningful
> under the hadoop-3.2 profile. All  the other spark-* artifacts published to
> Maven central are Hadoop-version-neutral.
>
> Another issue about switching the default Hadoop version to 3.2 is PySpark
> distribution. Right now, we only publish PySpark artifacts prebuilt with
> Hadoop 2.x to PyPI. I'm not sure whether bumping the Hadoop dependency to
> 3.2 is feasible for PySpark users. Or maybe we should publish PySpark
> prebuilt with both Hadoop 2.x and 3.x. I'm open to suggestions on this one.
>
> Again, as long as Hive 2.3 and Hadoop 3.2 upgrade can be decoupled via the
> proposed hive-2.3 profile, I personally don't have a preference over having
> Hadoop 2.7 or 3.2 as the default Hadoop version. But just for minimizing
> the release management work, in case we decided to publish other spark-*
> Maven artifacts from a Hadoop 2.7 build, we can still special case
> spark-hadoop-cloud and publish it using a hadoop-3.2 build.
>
> On Mon, Nov 18, 2019 at 8:39 PM Dongjoon Hyun <do...@gmail.com>
> wrote:
>
>> I also agree with Steve and Felix.
>>
>> Let's have another thread to discuss Hive issue
>>
>> because this thread was originally for `hadoop` version.
>>
>> And, now we can have `hive-2.3` profile for both `hadoop-2.7` and
>> `hadoop-3.0` versions.
>>
>> We don't need to mix both.
>>
>> Bests,
>> Dongjoon.
>>
>>
>> On Mon, Nov 18, 2019 at 8:19 PM Felix Cheung <fe...@hotmail.com>
>> wrote:
>>
>>> 1000% with Steve, the org.spark-project hive 1.2 will need a solution.
>>> It is old and rather buggy; and It’s been *years*
>>>
>>> I think we should decouple hive change from everything else if people
>>> are concerned?
>>>
>>> ------------------------------
>>> *From:* Steve Loughran <st...@cloudera.com.INVALID>
>>> *Sent:* Sunday, November 17, 2019 9:22:09 AM
>>> *To:* Cheng Lian <li...@gmail.com>
>>> *Cc:* Sean Owen <sr...@gmail.com>; Wenchen Fan <cl...@gmail.com>;
>>> Dongjoon Hyun <do...@gmail.com>; dev <de...@spark.apache.org>;
>>> Yuming Wang <wg...@gmail.com>
>>> *Subject:* Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?
>>>
>>> Can I take this moment to remind everyone that the version of hive which
>>> spark has historically bundled (the org.spark-project one) is an orphan
>>> project put together to deal with Hive's shading issues and a source of
>>> unhappiness in the Hive project. What ever get shipped should do its best
>>> to avoid including that file.
>>>
>>> Postponing a switch to hadoop 3.x after spark 3.0 is probably the safest
>>> move from a risk minimisation perspective. If something has broken then it
>>> is you can start with the assumption that it is in the o.a.s packages
>>> without having to debug o.a.hadoop and o.a.hive first. There is a cost: if
>>> there are problems with the hadoop / hive dependencies those teams will
>>> inevitably ignore filed bug reports for the same reason spark team will
>>> probably because 1.6-related JIRAs as WONTFIX. WONTFIX responses for the
>>> Hadoop 2.x line include any compatibility issues with Java 9+. Do bear that
>>> in mind. It's not been tested, it has dependencies on artifacts we know are
>>> incompatible, and as far as the Hadoop project is concerned: people should
>>> move to branch 3 if they want to run on a modern version of Java
>>>
>>> It would be really really good if the published spark maven artefacts
>>> (a) included the spark-hadoop-cloud JAR and (b) were dependent upon hadoop
>>> 3.x. That way people doing things with their own projects will get
>>> up-to-date dependencies and don't get WONTFIX responses themselves.
>>>
>>> -Steve
>>>
>>> PS: Discussion on hadoop-dev @ making Hadoop 2.10 the official "last
>>> ever" branch-2 release and then declare its predecessors EOL; 2.10 will be
>>> the transition release.
>>>
>>> On Sun, Nov 17, 2019 at 1:50 AM Cheng Lian <li...@gmail.com>
>>> wrote:
>>>
>>> Dongjoon, I didn't follow the original Hive 2.3 discussion closely. I
>>> thought the original proposal was to replace Hive 1.2 with Hive 2.3, which
>>> seemed risky, and therefore we only introduced Hive 2.3 under the
>>> hadoop-3.2 profile without removing Hive 1.2. But maybe I'm totally wrong
>>> here...
>>>
>>> Sean, Yuming's PR https://github.com/apache/spark/pull/26533 showed
>>> that Hadoop 2 + Hive 2 + JDK 11 looks promising. My major motivation is not
>>> about demand, but risk control: coupling Hive 2.3, Hadoop 3.2, and JDK 11
>>> upgrade together looks too risky.
>>>
>>> On Sat, Nov 16, 2019 at 4:03 AM Sean Owen <sr...@gmail.com> wrote:
>>>
>>> I'd prefer simply not making Hadoop 3 the default until 3.1+, rather
>>> than introduce yet another build combination. Does Hadoop 2 + Hive 2
>>> work and is there demand for it?
>>>
>>> On Sat, Nov 16, 2019 at 3:52 AM Wenchen Fan <cl...@gmail.com> wrote:
>>> >
>>> > Do we have a limitation on the number of pre-built distributions?
>>> Seems this time we need
>>> > 1. hadoop 2.7 + hive 1.2
>>> > 2. hadoop 2.7 + hive 2.3
>>> > 3. hadoop 3 + hive 2.3
>>> >
>>> > AFAIK we always built with JDK 8 (but make it JDK 11 compatible), so
>>> don't need to add JDK version to the combination.
>>> >
>>> > On Sat, Nov 16, 2019 at 4:05 PM Dongjoon Hyun <do...@gmail.com>
>>> wrote:
>>> >>
>>> >> Thank you for suggestion.
>>> >>
>>> >> Having `hive-2.3` profile sounds good to me because it's orthogonal
>>> to Hadoop 3.
>>> >> IIRC, originally, it was proposed in that way, but we put it under
>>> `hadoop-3.2` to avoid adding new profiles at that time.
>>> >>
>>> >> And, I'm wondering if you are considering additional pre-built
>>> distribution and Jenkins jobs.
>>> >>
>>> >> Bests,
>>> >> Dongjoon.
>>> >>
>>>
>>>

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

Posted by Cheng Lian <li...@gmail.com>.

Hey Steve,

In terms of Maven artifact, I don't think the default Hadoop version
matters except for the spark-hadoop-cloud module, which is only meaningful
under the hadoop-3.2 profile. All  the other spark-* artifacts published to
Maven central are Hadoop-version-neutral.

Another issue about switching the default Hadoop version to 3.2 is PySpark
distribution. Right now, we only publish PySpark artifacts prebuilt with
Hadoop 2.x to PyPI. I'm not sure whether bumping the Hadoop dependency to
3.2 is feasible for PySpark users. Or maybe we should publish PySpark
prebuilt with both Hadoop 2.x and 3.x. I'm open to suggestions on this one.

Again, as long as Hive 2.3 and Hadoop 3.2 upgrade can be decoupled via the
proposed hive-2.3 profile, I personally don't have a preference over having
Hadoop 2.7 or 3.2 as the default Hadoop version. But just for minimizing
the release management work, in case we decided to publish other spark-*
Maven artifacts from a Hadoop 2.7 build, we can still special case
spark-hadoop-cloud and publish it using a hadoop-3.2 build.

On Mon, Nov 18, 2019 at 8:39 PM Dongjoon Hyun <do...@gmail.com>
wrote:

> I also agree with Steve and Felix.
>
> Let's have another thread to discuss Hive issue
>
> because this thread was originally for `hadoop` version.
>
> And, now we can have `hive-2.3` profile for both `hadoop-2.7` and
> `hadoop-3.0` versions.
>
> We don't need to mix both.
>
> Bests,
> Dongjoon.
>
>
> On Mon, Nov 18, 2019 at 8:19 PM Felix Cheung <fe...@hotmail.com>
> wrote:
>
>> 1000% with Steve, the org.spark-project hive 1.2 will need a solution. It
>> is old and rather buggy; and It’s been *years*
>>
>> I think we should decouple hive change from everything else if people are
>> concerned?
>>
>> ------------------------------
>> *From:* Steve Loughran <st...@cloudera.com.INVALID>
>> *Sent:* Sunday, November 17, 2019 9:22:09 AM
>> *To:* Cheng Lian <li...@gmail.com>
>> *Cc:* Sean Owen <sr...@gmail.com>; Wenchen Fan <cl...@gmail.com>;
>> Dongjoon Hyun <do...@gmail.com>; dev <de...@spark.apache.org>;
>> Yuming Wang <wg...@gmail.com>
>> *Subject:* Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?
>>
>> Can I take this moment to remind everyone that the version of hive which
>> spark has historically bundled (the org.spark-project one) is an orphan
>> project put together to deal with Hive's shading issues and a source of
>> unhappiness in the Hive project. What ever get shipped should do its best
>> to avoid including that file.
>>
>> Postponing a switch to hadoop 3.x after spark 3.0 is probably the safest
>> move from a risk minimisation perspective. If something has broken then it
>> is you can start with the assumption that it is in the o.a.s packages
>> without having to debug o.a.hadoop and o.a.hive first. There is a cost: if
>> there are problems with the hadoop / hive dependencies those teams will
>> inevitably ignore filed bug reports for the same reason spark team will
>> probably because 1.6-related JIRAs as WONTFIX. WONTFIX responses for the
>> Hadoop 2.x line include any compatibility issues with Java 9+. Do bear that
>> in mind. It's not been tested, it has dependencies on artifacts we know are
>> incompatible, and as far as the Hadoop project is concerned: people should
>> move to branch 3 if they want to run on a modern version of Java
>>
>> It would be really really good if the published spark maven artefacts (a)
>> included the spark-hadoop-cloud JAR and (b) were dependent upon hadoop 3.x.
>> That way people doing things with their own projects will get up-to-date
>> dependencies and don't get WONTFIX responses themselves.
>>
>> -Steve
>>
>> PS: Discussion on hadoop-dev @ making Hadoop 2.10 the official "last
>> ever" branch-2 release and then declare its predecessors EOL; 2.10 will be
>> the transition release.
>>
>> On Sun, Nov 17, 2019 at 1:50 AM Cheng Lian <li...@gmail.com> wrote:
>>
>> Dongjoon, I didn't follow the original Hive 2.3 discussion closely. I
>> thought the original proposal was to replace Hive 1.2 with Hive 2.3, which
>> seemed risky, and therefore we only introduced Hive 2.3 under the
>> hadoop-3.2 profile without removing Hive 1.2. But maybe I'm totally wrong
>> here...
>>
>> Sean, Yuming's PR https://github.com/apache/spark/pull/26533 showed that
>> Hadoop 2 + Hive 2 + JDK 11 looks promising. My major motivation is not
>> about demand, but risk control: coupling Hive 2.3, Hadoop 3.2, and JDK 11
>> upgrade together looks too risky.
>>
>> On Sat, Nov 16, 2019 at 4:03 AM Sean Owen <sr...@gmail.com> wrote:
>>
>> I'd prefer simply not making Hadoop 3 the default until 3.1+, rather
>> than introduce yet another build combination. Does Hadoop 2 + Hive 2
>> work and is there demand for it?
>>
>> On Sat, Nov 16, 2019 at 3:52 AM Wenchen Fan <cl...@gmail.com> wrote:
>> >
>> > Do we have a limitation on the number of pre-built distributions? Seems
>> this time we need
>> > 1. hadoop 2.7 + hive 1.2
>> > 2. hadoop 2.7 + hive 2.3
>> > 3. hadoop 3 + hive 2.3
>> >
>> > AFAIK we always built with JDK 8 (but make it JDK 11 compatible), so
>> don't need to add JDK version to the combination.
>> >
>> > On Sat, Nov 16, 2019 at 4:05 PM Dongjoon Hyun <do...@gmail.com>
>> wrote:
>> >>
>> >> Thank you for suggestion.
>> >>
>> >> Having `hive-2.3` profile sounds good to me because it's orthogonal to
>> Hadoop 3.
>> >> IIRC, originally, it was proposed in that way, but we put it under
>> `hadoop-3.2` to avoid adding new profiles at that time.
>> >>
>> >> And, I'm wondering if you are considering additional pre-built
>> distribution and Jenkins jobs.
>> >>
>> >> Bests,
>> >> Dongjoon.
>> >>
>>
>>

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

Posted by Dongjoon Hyun <do...@gmail.com>.

I also agree with Steve and Felix.

Let's have another thread to discuss Hive issue

because this thread was originally for `hadoop` version.

And, now we can have `hive-2.3` profile for both `hadoop-2.7` and
`hadoop-3.0` versions.

We don't need to mix both.

Bests,
Dongjoon.


On Mon, Nov 18, 2019 at 8:19 PM Felix Cheung <fe...@hotmail.com>
wrote:

> 1000% with Steve, the org.spark-project hive 1.2 will need a solution. It
> is old and rather buggy; and It’s been *years*
>
> I think we should decouple hive change from everything else if people are
> concerned?
>
> ------------------------------
> *From:* Steve Loughran <st...@cloudera.com.INVALID>
> *Sent:* Sunday, November 17, 2019 9:22:09 AM
> *To:* Cheng Lian <li...@gmail.com>
> *Cc:* Sean Owen <sr...@gmail.com>; Wenchen Fan <cl...@gmail.com>;
> Dongjoon Hyun <do...@gmail.com>; dev <de...@spark.apache.org>;
> Yuming Wang <wg...@gmail.com>
> *Subject:* Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?
>
> Can I take this moment to remind everyone that the version of hive which
> spark has historically bundled (the org.spark-project one) is an orphan
> project put together to deal with Hive's shading issues and a source of
> unhappiness in the Hive project. What ever get shipped should do its best
> to avoid including that file.
>
> Postponing a switch to hadoop 3.x after spark 3.0 is probably the safest
> move from a risk minimisation perspective. If something has broken then it
> is you can start with the assumption that it is in the o.a.s packages
> without having to debug o.a.hadoop and o.a.hive first. There is a cost: if
> there are problems with the hadoop / hive dependencies those teams will
> inevitably ignore filed bug reports for the same reason spark team will
> probably because 1.6-related JIRAs as WONTFIX. WONTFIX responses for the
> Hadoop 2.x line include any compatibility issues with Java 9+. Do bear that
> in mind. It's not been tested, it has dependencies on artifacts we know are
> incompatible, and as far as the Hadoop project is concerned: people should
> move to branch 3 if they want to run on a modern version of Java
>
> It would be really really good if the published spark maven artefacts (a)
> included the spark-hadoop-cloud JAR and (b) were dependent upon hadoop 3.x.
> That way people doing things with their own projects will get up-to-date
> dependencies and don't get WONTFIX responses themselves.
>
> -Steve
>
> PS: Discussion on hadoop-dev @ making Hadoop 2.10 the official "last ever"
> branch-2 release and then declare its predecessors EOL; 2.10 will be the
> transition release.
>
> On Sun, Nov 17, 2019 at 1:50 AM Cheng Lian <li...@gmail.com> wrote:
>
> Dongjoon, I didn't follow the original Hive 2.3 discussion closely. I
> thought the original proposal was to replace Hive 1.2 with Hive 2.3, which
> seemed risky, and therefore we only introduced Hive 2.3 under the
> hadoop-3.2 profile without removing Hive 1.2. But maybe I'm totally wrong
> here...
>
> Sean, Yuming's PR https://github.com/apache/spark/pull/26533 showed that
> Hadoop 2 + Hive 2 + JDK 11 looks promising. My major motivation is not
> about demand, but risk control: coupling Hive 2.3, Hadoop 3.2, and JDK 11
> upgrade together looks too risky.
>
> On Sat, Nov 16, 2019 at 4:03 AM Sean Owen <sr...@gmail.com> wrote:
>
> I'd prefer simply not making Hadoop 3 the default until 3.1+, rather
> than introduce yet another build combination. Does Hadoop 2 + Hive 2
> work and is there demand for it?
>
> On Sat, Nov 16, 2019 at 3:52 AM Wenchen Fan <cl...@gmail.com> wrote:
> >
> > Do we have a limitation on the number of pre-built distributions? Seems
> this time we need
> > 1. hadoop 2.7 + hive 1.2
> > 2. hadoop 2.7 + hive 2.3
> > 3. hadoop 3 + hive 2.3
> >
> > AFAIK we always built with JDK 8 (but make it JDK 11 compatible), so
> don't need to add JDK version to the combination.
> >
> > On Sat, Nov 16, 2019 at 4:05 PM Dongjoon Hyun <do...@gmail.com>
> wrote:
> >>
> >> Thank you for suggestion.
> >>
> >> Having `hive-2.3` profile sounds good to me because it's orthogonal to
> Hadoop 3.
> >> IIRC, originally, it was proposed in that way, but we put it under
> `hadoop-3.2` to avoid adding new profiles at that time.
> >>
> >> And, I'm wondering if you are considering additional pre-built
> distribution and Jenkins jobs.
> >>
> >> Bests,
> >> Dongjoon.
> >>
>
>

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

Posted by Felix Cheung <fe...@hotmail.com>.

1000% with Steve, the org.spark-project hive 1.2 will need a solution. It is old and rather buggy; and It’s been *years*

I think we should decouple hive change from everything else if people are concerned?

________________________________
From: Steve Loughran <st...@cloudera.com.INVALID>
Sent: Sunday, November 17, 2019 9:22:09 AM
To: Cheng Lian <li...@gmail.com>
Cc: Sean Owen <sr...@gmail.com>; Wenchen Fan <cl...@gmail.com>; Dongjoon Hyun <do...@gmail.com>; dev <de...@spark.apache.org>; Yuming Wang <wg...@gmail.com>
Subject: Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

Can I take this moment to remind everyone that the version of hive which spark has historically bundled (the org.spark-project one) is an orphan project put together to deal with Hive's shading issues and a source of unhappiness in the Hive project. What ever get shipped should do its best to avoid including that file.

Postponing a switch to hadoop 3.x after spark 3.0 is probably the safest move from a risk minimisation perspective. If something has broken then it is you can start with the assumption that it is in the o.a.s packages without having to debug o.a.hadoop and o.a.hive first. There is a cost: if there are problems with the hadoop / hive dependencies those teams will inevitably ignore filed bug reports for the same reason spark team will probably because 1.6-related JIRAs as WONTFIX. WONTFIX responses for the Hadoop 2.x line include any compatibility issues with Java 9+. Do bear that in mind. It's not been tested, it has dependencies on artifacts we know are incompatible, and as far as the Hadoop project is concerned: people should move to branch 3 if they want to run on a modern version of Java

It would be really really good if the published spark maven artefacts (a) included the spark-hadoop-cloud JAR and (b) were dependent upon hadoop 3.x. That way people doing things with their own projects will get up-to-date dependencies and don't get WONTFIX responses themselves.

-Steve

PS: Discussion on hadoop-dev @ making Hadoop 2.10 the official "last ever" branch-2 release and then declare its predecessors EOL; 2.10 will be the transition release.

On Sun, Nov 17, 2019 at 1:50 AM Cheng Lian <li...@gmail.com>> wrote:
Dongjoon, I didn't follow the original Hive 2.3 discussion closely. I thought the original proposal was to replace Hive 1.2 with Hive 2.3, which seemed risky, and therefore we only introduced Hive 2.3 under the hadoop-3.2 profile without removing Hive 1.2. But maybe I'm totally wrong here...

Sean, Yuming's PR https://github.com/apache/spark/pull/26533 showed that Hadoop 2 + Hive 2 + JDK 11 looks promising. My major motivation is not about demand, but risk control: coupling Hive 2.3, Hadoop 3.2, and JDK 11 upgrade together looks too risky.

On Sat, Nov 16, 2019 at 4:03 AM Sean Owen <sr...@gmail.com>> wrote:
I'd prefer simply not making Hadoop 3 the default until 3.1+, rather
than introduce yet another build combination. Does Hadoop 2 + Hive 2
work and is there demand for it?

On Sat, Nov 16, 2019 at 3:52 AM Wenchen Fan <cl...@gmail.com>> wrote:
>
> Do we have a limitation on the number of pre-built distributions? Seems this time we need
> 1. hadoop 2.7 + hive 1.2
> 2. hadoop 2.7 + hive 2.3
> 3. hadoop 3 + hive 2.3
>
> AFAIK we always built with JDK 8 (but make it JDK 11 compatible), so don't need to add JDK version to the combination.
>
> On Sat, Nov 16, 2019 at 4:05 PM Dongjoon Hyun <do...@gmail.com>> wrote:
>>
>> Thank you for suggestion.
>>
>> Having `hive-2.3` profile sounds good to me because it's orthogonal to Hadoop 3.
>> IIRC, originally, it was proposed in that way, but we put it under `hadoop-3.2` to avoid adding new profiles at that time.
>>
>> And, I'm wondering if you are considering additional pre-built distribution and Jenkins jobs.
>>
>> Bests,
>> Dongjoon.
>>

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

Posted by Steve Loughran <st...@cloudera.com.INVALID>.

Can I take this moment to remind everyone that the version of hive which
spark has historically bundled (the org.spark-project one) is an orphan
project put together to deal with Hive's shading issues and a source of
unhappiness in the Hive project. What ever get shipped should do its best
to avoid including that file.

Postponing a switch to hadoop 3.x after spark 3.0 is probably the safest
move from a risk minimisation perspective. If something has broken then it
is you can start with the assumption that it is in the o.a.s packages
without having to debug o.a.hadoop and o.a.hive first. There is a cost: if
there are problems with the hadoop / hive dependencies those teams will
inevitably ignore filed bug reports for the same reason spark team will
probably because 1.6-related JIRAs as WONTFIX. WONTFIX responses for the
Hadoop 2.x line include any compatibility issues with Java 9+. Do bear that
in mind. It's not been tested, it has dependencies on artifacts we know are
incompatible, and as far as the Hadoop project is concerned: people should
move to branch 3 if they want to run on a modern version of Java

It would be really really good if the published spark maven artefacts (a)
included the spark-hadoop-cloud JAR and (b) were dependent upon hadoop 3.x.
That way people doing things with their own projects will get up-to-date
dependencies and don't get WONTFIX responses themselves.

-Steve

PS: Discussion on hadoop-dev @ making Hadoop 2.10 the official "last ever"
branch-2 release and then declare its predecessors EOL; 2.10 will be the
transition release.

On Sun, Nov 17, 2019 at 1:50 AM Cheng Lian <li...@gmail.com> wrote:

> Dongjoon, I didn't follow the original Hive 2.3 discussion closely. I
> thought the original proposal was to replace Hive 1.2 with Hive 2.3, which
> seemed risky, and therefore we only introduced Hive 2.3 under the
> hadoop-3.2 profile without removing Hive 1.2. But maybe I'm totally wrong
> here...
>
> Sean, Yuming's PR https://github.com/apache/spark/pull/26533 showed that
> Hadoop 2 + Hive 2 + JDK 11 looks promising. My major motivation is not
> about demand, but risk control: coupling Hive 2.3, Hadoop 3.2, and JDK 11
> upgrade together looks too risky.
>
> On Sat, Nov 16, 2019 at 4:03 AM Sean Owen <sr...@gmail.com> wrote:
>
>> I'd prefer simply not making Hadoop 3 the default until 3.1+, rather
>> than introduce yet another build combination. Does Hadoop 2 + Hive 2
>> work and is there demand for it?
>>
>> On Sat, Nov 16, 2019 at 3:52 AM Wenchen Fan <cl...@gmail.com> wrote:
>> >
>> > Do we have a limitation on the number of pre-built distributions? Seems
>> this time we need
>> > 1. hadoop 2.7 + hive 1.2
>> > 2. hadoop 2.7 + hive 2.3
>> > 3. hadoop 3 + hive 2.3
>> >
>> > AFAIK we always built with JDK 8 (but make it JDK 11 compatible), so
>> don't need to add JDK version to the combination.
>> >
>> > On Sat, Nov 16, 2019 at 4:05 PM Dongjoon Hyun <do...@gmail.com>
>> wrote:
>> >>
>> >> Thank you for suggestion.
>> >>
>> >> Having `hive-2.3` profile sounds good to me because it's orthogonal to
>> Hadoop 3.
>> >> IIRC, originally, it was proposed in that way, but we put it under
>> `hadoop-3.2` to avoid adding new profiles at that time.
>> >>
>> >> And, I'm wondering if you are considering additional pre-built
>> distribution and Jenkins jobs.
>> >>
>> >> Bests,
>> >> Dongjoon.
>> >>
>>
>

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

Posted by Cheng Lian <li...@gmail.com>.

Dongjoon, I didn't follow the original Hive 2.3 discussion closely. I
thought the original proposal was to replace Hive 1.2 with Hive 2.3, which
seemed risky, and therefore we only introduced Hive 2.3 under the
hadoop-3.2 profile without removing Hive 1.2. But maybe I'm totally wrong
here...

Sean, Yuming's PR https://github.com/apache/spark/pull/26533 showed that
Hadoop 2 + Hive 2 + JDK 11 looks promising. My major motivation is not
about demand, but risk control: coupling Hive 2.3, Hadoop 3.2, and JDK 11
upgrade together looks too risky.

On Sat, Nov 16, 2019 at 4:03 AM Sean Owen <sr...@gmail.com> wrote:

> I'd prefer simply not making Hadoop 3 the default until 3.1+, rather
> than introduce yet another build combination. Does Hadoop 2 + Hive 2
> work and is there demand for it?
>
> On Sat, Nov 16, 2019 at 3:52 AM Wenchen Fan <cl...@gmail.com> wrote:
> >
> > Do we have a limitation on the number of pre-built distributions? Seems
> this time we need
> > 1. hadoop 2.7 + hive 1.2
> > 2. hadoop 2.7 + hive 2.3
> > 3. hadoop 3 + hive 2.3
> >
> > AFAIK we always built with JDK 8 (but make it JDK 11 compatible), so
> don't need to add JDK version to the combination.
> >
> > On Sat, Nov 16, 2019 at 4:05 PM Dongjoon Hyun <do...@gmail.com>
> wrote:
> >>
> >> Thank you for suggestion.
> >>
> >> Having `hive-2.3` profile sounds good to me because it's orthogonal to
> Hadoop 3.
> >> IIRC, originally, it was proposed in that way, but we put it under
> `hadoop-3.2` to avoid adding new profiles at that time.
> >>
> >> And, I'm wondering if you are considering additional pre-built
> distribution and Jenkins jobs.
> >>
> >> Bests,
> >> Dongjoon.
> >>
>

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

Posted by Sean Owen <sr...@gmail.com>.

I'd prefer simply not making Hadoop 3 the default until 3.1+, rather
than introduce yet another build combination. Does Hadoop 2 + Hive 2
work and is there demand for it?

On Sat, Nov 16, 2019 at 3:52 AM Wenchen Fan <cl...@gmail.com> wrote:
>
> Do we have a limitation on the number of pre-built distributions? Seems this time we need
> 1. hadoop 2.7 + hive 1.2
> 2. hadoop 2.7 + hive 2.3
> 3. hadoop 3 + hive 2.3
>
> AFAIK we always built with JDK 8 (but make it JDK 11 compatible), so don't need to add JDK version to the combination.
>
> On Sat, Nov 16, 2019 at 4:05 PM Dongjoon Hyun <do...@gmail.com> wrote:
>>
>> Thank you for suggestion.
>>
>> Having `hive-2.3` profile sounds good to me because it's orthogonal to Hadoop 3.
>> IIRC, originally, it was proposed in that way, but we put it under `hadoop-3.2` to avoid adding new profiles at that time.
>>
>> And, I'm wondering if you are considering additional pre-built distribution and Jenkins jobs.
>>
>> Bests,
>> Dongjoon.
>>

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

Posted by Wenchen Fan <cl...@gmail.com>.

Do we have a limitation on the number of pre-built distributions? Seems
this time we need
1. hadoop 2.7 + hive 1.2
2. hadoop 2.7 + hive 2.3
3. hadoop 3 + hive 2.3

AFAIK we always built with JDK 8 (but make it JDK 11 compatible), so don't
need to add JDK version to the combination.

On Sat, Nov 16, 2019 at 4:05 PM Dongjoon Hyun <do...@gmail.com>
wrote:

> Thank you for suggestion.
>
> Having `hive-2.3` profile sounds good to me because it's orthogonal to
> Hadoop 3.
> IIRC, originally, it was proposed in that way, but we put it under
> `hadoop-3.2` to avoid adding new profiles at that time.
>
> And, I'm wondering if you are considering additional pre-built
> distribution and Jenkins jobs.
>
> Bests,
> Dongjoon.
>
>
>
> On Fri, Nov 15, 2019 at 1:38 PM Cheng Lian <li...@gmail.com> wrote:
>
>> Cc Yuming, Steve, and Dongjoon
>>
>> On Fri, Nov 15, 2019 at 10:37 AM Cheng Lian <li...@gmail.com>
>> wrote:
>>
>>> Similar to Xiao, my major concern about making Hadoop 3.2 the default
>>> Hadoop version is quality control. The current hadoop-3.2 profile
>>> covers too many major component upgrades, i.e.:
>>>
>>>    - Hadoop 3.2
>>>    - Hive 2.3
>>>    - JDK 11
>>>
>>> We have already found and fixed some feature and performance regressions
>>> related to these upgrades. Empirically, I’m not surprised at all if more
>>> regressions are lurking somewhere. On the other hand, we do want help from
>>> the community to help us to evaluate and stabilize these new changes.
>>> Following that, I’d like to propose:
>>>
>>>    1.
>>>
>>>    Introduce a new profile hive-2.3 to enable (hopefully) less risky
>>>    Hadoop/Hive/JDK version combinations.
>>>
>>>    This new profile allows us to decouple Hive 2.3 from the hadoop-3.2
>>>    profile, so that users may try out some less risky Hadoop/Hive/JDK
>>>    combinations: if you only want Hive 2.3 and/or JDK 11, you don’t need to
>>>    face potential regressions introduced by the Hadoop 3.2 upgrade.
>>>
>>>    Yuming Wang has already sent out PR #26533
>>>    <https://github.com/apache/spark/pull/26533> to exercise the Hadoop
>>>    2.7 + Hive 2.3 + JDK 11 combination (this PR does not have the
>>>    hive-2.3 profile yet), and the result looks promising: the Kafka
>>>    streaming and Arrow related test failures should be irrelevant to the topic
>>>    discussed here.
>>>
>>>    After decoupling Hive 2.3 and Hadoop 3.2, I don’t think it makes a
>>>    lot of difference between having Hadoop 2.7 or Hadoop 3.2 as the default
>>>    Hadoop version. For users who are still using Hadoop 2.x in production,
>>>    they will have to use a hadoop-provided prebuilt package or build
>>>    Spark 3.0 against their own 2.x version anyway. It does make a difference
>>>    for cloud users who don’t use Hadoop at all, though. And this probably also
>>>    helps to stabilize the Hadoop 3.2 code path faster since our PR builder
>>>    will exercise it regularly.
>>>    2.
>>>
>>>    Defer Hadoop 2.x upgrade to Spark 3.1+
>>>
>>>    I personally do want to bump our Hadoop 2.x version to 2.9 or even
>>>    2.10. Steve has already stated the benefits very well. My worry here is
>>>    still quality control: Spark 3.0 has already had tons of changes and major
>>>    component version upgrades that are subject to all kinds of known and
>>>    hidden regressions. Having Hadoop 2.7 there provides us a safety net, since
>>>    it’s proven to be stable. To me, it’s much less risky to upgrade Hadoop 2.7
>>>    to 2.9/2.10 after we stabilize the Hadoop 3.2/Hive 2.3 combinations in the
>>>    next 1 or 2 Spark 3.x releases.
>>>
>>> Cheng
>>>
>>> On Mon, Nov 4, 2019 at 11:24 AM Koert Kuipers <ko...@tresata.com> wrote:
>>>
>>>> i get that cdh and hdp backport a lot and in that way left 2.7 behind.
>>>> but they kept the public apis stable at the 2.7 level, because thats kind
>>>> of the point. arent those the hadoop apis spark uses?
>>>>
>>>> On Mon, Nov 4, 2019 at 10:07 AM Steve Loughran
>>>> <st...@cloudera.com.invalid> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Mon, Nov 4, 2019 at 12:39 AM Nicholas Chammas <
>>>>> nicholas.chammas@gmail.com> wrote:
>>>>>
>>>>>> On Fri, Nov 1, 2019 at 8:41 AM Steve Loughran
>>>>>> <st...@cloudera.com.invalid> wrote:
>>>>>>
>>>>>>> It would be really good if the spark distributions shipped with
>>>>>>> later versions of the hadoop artifacts.
>>>>>>>
>>>>>>
>>>>>> I second this. If we need to keep a Hadoop 2.x profile around, why
>>>>>> not make it Hadoop 2.8 or something newer?
>>>>>>
>>>>>
>>>>> go for 2.9
>>>>>
>>>>>>
>>>>>> Koert Kuipers <ko...@tresata.com> wrote:
>>>>>>
>>>>>>> given that latest hdp 2.x is still hadoop 2.7 bumping hadoop 2
>>>>>>> profile to latest would probably be an issue for us.
>>>>>>
>>>>>>
>>>>>> When was the last time HDP 2.x bumped their minor version of Hadoop?
>>>>>> Do we want to wait for them to bump to Hadoop 2.8 before we do the same?
>>>>>>
>>>>>
>>>>> The internal builds of CDH and HDP are not those of ASF 2.7.x. A
>>>>> really large proportion of the later branch-2 patches are backported. 2,7
>>>>> was left behind a long time ago
>>>>>
>>>>>
>>>>>
>>>>>
>>>>

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

Posted by Dongjoon Hyun <do...@gmail.com>.

Thank you for suggestion.

Having `hive-2.3` profile sounds good to me because it's orthogonal to
Hadoop 3.
IIRC, originally, it was proposed in that way, but we put it under
`hadoop-3.2` to avoid adding new profiles at that time.

And, I'm wondering if you are considering additional pre-built distribution
and Jenkins jobs.

Bests,
Dongjoon.



On Fri, Nov 15, 2019 at 1:38 PM Cheng Lian <li...@gmail.com> wrote:

> Cc Yuming, Steve, and Dongjoon
>
> On Fri, Nov 15, 2019 at 10:37 AM Cheng Lian <li...@gmail.com> wrote:
>
>> Similar to Xiao, my major concern about making Hadoop 3.2 the default
>> Hadoop version is quality control. The current hadoop-3.2 profile covers
>> too many major component upgrades, i.e.:
>>
>>    - Hadoop 3.2
>>    - Hive 2.3
>>    - JDK 11
>>
>> We have already found and fixed some feature and performance regressions
>> related to these upgrades. Empirically, I’m not surprised at all if more
>> regressions are lurking somewhere. On the other hand, we do want help from
>> the community to help us to evaluate and stabilize these new changes.
>> Following that, I’d like to propose:
>>
>>    1.
>>
>>    Introduce a new profile hive-2.3 to enable (hopefully) less risky
>>    Hadoop/Hive/JDK version combinations.
>>
>>    This new profile allows us to decouple Hive 2.3 from the hadoop-3.2
>>    profile, so that users may try out some less risky Hadoop/Hive/JDK
>>    combinations: if you only want Hive 2.3 and/or JDK 11, you don’t need to
>>    face potential regressions introduced by the Hadoop 3.2 upgrade.
>>
>>    Yuming Wang has already sent out PR #26533
>>    <https://github.com/apache/spark/pull/26533> to exercise the Hadoop
>>    2.7 + Hive 2.3 + JDK 11 combination (this PR does not have the
>>    hive-2.3 profile yet), and the result looks promising: the Kafka
>>    streaming and Arrow related test failures should be irrelevant to the topic
>>    discussed here.
>>
>>    After decoupling Hive 2.3 and Hadoop 3.2, I don’t think it makes a
>>    lot of difference between having Hadoop 2.7 or Hadoop 3.2 as the default
>>    Hadoop version. For users who are still using Hadoop 2.x in production,
>>    they will have to use a hadoop-provided prebuilt package or build
>>    Spark 3.0 against their own 2.x version anyway. It does make a difference
>>    for cloud users who don’t use Hadoop at all, though. And this probably also
>>    helps to stabilize the Hadoop 3.2 code path faster since our PR builder
>>    will exercise it regularly.
>>    2.
>>
>>    Defer Hadoop 2.x upgrade to Spark 3.1+
>>
>>    I personally do want to bump our Hadoop 2.x version to 2.9 or even
>>    2.10. Steve has already stated the benefits very well. My worry here is
>>    still quality control: Spark 3.0 has already had tons of changes and major
>>    component version upgrades that are subject to all kinds of known and
>>    hidden regressions. Having Hadoop 2.7 there provides us a safety net, since
>>    it’s proven to be stable. To me, it’s much less risky to upgrade Hadoop 2.7
>>    to 2.9/2.10 after we stabilize the Hadoop 3.2/Hive 2.3 combinations in the
>>    next 1 or 2 Spark 3.x releases.
>>
>> Cheng
>>
>> On Mon, Nov 4, 2019 at 11:24 AM Koert Kuipers <ko...@tresata.com> wrote:
>>
>>> i get that cdh and hdp backport a lot and in that way left 2.7 behind.
>>> but they kept the public apis stable at the 2.7 level, because thats kind
>>> of the point. arent those the hadoop apis spark uses?
>>>
>>> On Mon, Nov 4, 2019 at 10:07 AM Steve Loughran
>>> <st...@cloudera.com.invalid> wrote:
>>>
>>>>
>>>>
>>>> On Mon, Nov 4, 2019 at 12:39 AM Nicholas Chammas <
>>>> nicholas.chammas@gmail.com> wrote:
>>>>
>>>>> On Fri, Nov 1, 2019 at 8:41 AM Steve Loughran
>>>>> <st...@cloudera.com.invalid> wrote:
>>>>>
>>>>>> It would be really good if the spark distributions shipped with later
>>>>>> versions of the hadoop artifacts.
>>>>>>
>>>>>
>>>>> I second this. If we need to keep a Hadoop 2.x profile around, why not
>>>>> make it Hadoop 2.8 or something newer?
>>>>>
>>>>
>>>> go for 2.9
>>>>
>>>>>
>>>>> Koert Kuipers <ko...@tresata.com> wrote:
>>>>>
>>>>>> given that latest hdp 2.x is still hadoop 2.7 bumping hadoop 2
>>>>>> profile to latest would probably be an issue for us.
>>>>>
>>>>>
>>>>> When was the last time HDP 2.x bumped their minor version of Hadoop?
>>>>> Do we want to wait for them to bump to Hadoop 2.8 before we do the same?
>>>>>
>>>>
>>>> The internal builds of CDH and HDP are not those of ASF 2.7.x. A really
>>>> large proportion of the later branch-2 patches are backported. 2,7 was left
>>>> behind a long time ago
>>>>
>>>>
>>>>
>>>>
>>>

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

Posted by Cheng Lian <li...@gmail.com>.

Cc Yuming, Steve, and Dongjoon

On Fri, Nov 15, 2019 at 10:37 AM Cheng Lian <li...@gmail.com> wrote:

> Similar to Xiao, my major concern about making Hadoop 3.2 the default
> Hadoop version is quality control. The current hadoop-3.2 profile covers
> too many major component upgrades, i.e.:
>
>    - Hadoop 3.2
>    - Hive 2.3
>    - JDK 11
>
> We have already found and fixed some feature and performance regressions
> related to these upgrades. Empirically, I’m not surprised at all if more
> regressions are lurking somewhere. On the other hand, we do want help from
> the community to help us to evaluate and stabilize these new changes.
> Following that, I’d like to propose:
>
>    1.
>
>    Introduce a new profile hive-2.3 to enable (hopefully) less risky
>    Hadoop/Hive/JDK version combinations.
>
>    This new profile allows us to decouple Hive 2.3 from the hadoop-3.2
>    profile, so that users may try out some less risky Hadoop/Hive/JDK
>    combinations: if you only want Hive 2.3 and/or JDK 11, you don’t need to
>    face potential regressions introduced by the Hadoop 3.2 upgrade.
>
>    Yuming Wang has already sent out PR #26533
>    <https://github.com/apache/spark/pull/26533> to exercise the Hadoop
>    2.7 + Hive 2.3 + JDK 11 combination (this PR does not have the hive-2.3
>    profile yet), and the result looks promising: the Kafka streaming and Arrow
>    related test failures should be irrelevant to the topic discussed here.
>
>    After decoupling Hive 2.3 and Hadoop 3.2, I don’t think it makes a lot
>    of difference between having Hadoop 2.7 or Hadoop 3.2 as the default Hadoop
>    version. For users who are still using Hadoop 2.x in production, they will
>    have to use a hadoop-provided prebuilt package or build Spark 3.0
>    against their own 2.x version anyway. It does make a difference for cloud
>    users who don’t use Hadoop at all, though. And this probably also helps to
>    stabilize the Hadoop 3.2 code path faster since our PR builder will
>    exercise it regularly.
>    2.
>
>    Defer Hadoop 2.x upgrade to Spark 3.1+
>
>    I personally do want to bump our Hadoop 2.x version to 2.9 or even
>    2.10. Steve has already stated the benefits very well. My worry here is
>    still quality control: Spark 3.0 has already had tons of changes and major
>    component version upgrades that are subject to all kinds of known and
>    hidden regressions. Having Hadoop 2.7 there provides us a safety net, since
>    it’s proven to be stable. To me, it’s much less risky to upgrade Hadoop 2.7
>    to 2.9/2.10 after we stabilize the Hadoop 3.2/Hive 2.3 combinations in the
>    next 1 or 2 Spark 3.x releases.
>
> Cheng
>
> On Mon, Nov 4, 2019 at 11:24 AM Koert Kuipers <ko...@tresata.com> wrote:
>
>> i get that cdh and hdp backport a lot and in that way left 2.7 behind.
>> but they kept the public apis stable at the 2.7 level, because thats kind
>> of the point. arent those the hadoop apis spark uses?
>>
>> On Mon, Nov 4, 2019 at 10:07 AM Steve Loughran
>> <st...@cloudera.com.invalid> wrote:
>>
>>>
>>>
>>> On Mon, Nov 4, 2019 at 12:39 AM Nicholas Chammas <
>>> nicholas.chammas@gmail.com> wrote:
>>>
>>>> On Fri, Nov 1, 2019 at 8:41 AM Steve Loughran
>>>> <st...@cloudera.com.invalid> wrote:
>>>>
>>>>> It would be really good if the spark distributions shipped with later
>>>>> versions of the hadoop artifacts.
>>>>>
>>>>
>>>> I second this. If we need to keep a Hadoop 2.x profile around, why not
>>>> make it Hadoop 2.8 or something newer?
>>>>
>>>
>>> go for 2.9
>>>
>>>>
>>>> Koert Kuipers <ko...@tresata.com> wrote:
>>>>
>>>>> given that latest hdp 2.x is still hadoop 2.7 bumping hadoop 2 profile
>>>>> to latest would probably be an issue for us.
>>>>
>>>>
>>>> When was the last time HDP 2.x bumped their minor version of Hadoop? Do
>>>> we want to wait for them to bump to Hadoop 2.8 before we do the same?
>>>>
>>>
>>> The internal builds of CDH and HDP are not those of ASF 2.7.x. A really
>>> large proportion of the later branch-2 patches are backported. 2,7 was left
>>> behind a long time ago
>>>
>>>
>>>
>>>
>>

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

Posted by Cheng Lian <li...@gmail.com>.

Similar to Xiao, my major concern about making Hadoop 3.2 the default
Hadoop version is quality control. The current hadoop-3.2 profile covers
too many major component upgrades, i.e.:

   - Hadoop 3.2
   - Hive 2.3
   - JDK 11

We have already found and fixed some feature and performance regressions
related to these upgrades. Empirically, I’m not surprised at all if more
regressions are lurking somewhere. On the other hand, we do want help from
the community to help us to evaluate and stabilize these new changes.
Following that, I’d like to propose:

   1.

   Introduce a new profile hive-2.3 to enable (hopefully) less risky
   Hadoop/Hive/JDK version combinations.

   This new profile allows us to decouple Hive 2.3 from the hadoop-3.2
   profile, so that users may try out some less risky Hadoop/Hive/JDK
   combinations: if you only want Hive 2.3 and/or JDK 11, you don’t need to
   face potential regressions introduced by the Hadoop 3.2 upgrade.

   Yuming Wang has already sent out PR #26533
   <https://github.com/apache/spark/pull/26533> to exercise the Hadoop 2.7
   + Hive 2.3 + JDK 11 combination (this PR does not have the hive-2.3
   profile yet), and the result looks promising: the Kafka streaming and Arrow
   related test failures should be irrelevant to the topic discussed here.

   After decoupling Hive 2.3 and Hadoop 3.2, I don’t think it makes a lot
   of difference between having Hadoop 2.7 or Hadoop 3.2 as the default Hadoop
   version. For users who are still using Hadoop 2.x in production, they will
   have to use a hadoop-provided prebuilt package or build Spark 3.0
   against their own 2.x version anyway. It does make a difference for cloud
   users who don’t use Hadoop at all, though. And this probably also helps to
   stabilize the Hadoop 3.2 code path faster since our PR builder will
   exercise it regularly.
   2.

   Defer Hadoop 2.x upgrade to Spark 3.1+

   I personally do want to bump our Hadoop 2.x version to 2.9 or even 2.10.
   Steve has already stated the benefits very well. My worry here is still
   quality control: Spark 3.0 has already had tons of changes and major
   component version upgrades that are subject to all kinds of known and
   hidden regressions. Having Hadoop 2.7 there provides us a safety net, since
   it’s proven to be stable. To me, it’s much less risky to upgrade Hadoop 2.7
   to 2.9/2.10 after we stabilize the Hadoop 3.2/Hive 2.3 combinations in the
   next 1 or 2 Spark 3.x releases.

Cheng

On Mon, Nov 4, 2019 at 11:24 AM Koert Kuipers <ko...@tresata.com> wrote:

> i get that cdh and hdp backport a lot and in that way left 2.7 behind. but
> they kept the public apis stable at the 2.7 level, because thats kind of
> the point. arent those the hadoop apis spark uses?
>
> On Mon, Nov 4, 2019 at 10:07 AM Steve Loughran <st...@cloudera.com.invalid>
> wrote:
>
>>
>>
>> On Mon, Nov 4, 2019 at 12:39 AM Nicholas Chammas <
>> nicholas.chammas@gmail.com> wrote:
>>
>>> On Fri, Nov 1, 2019 at 8:41 AM Steve Loughran
>>> <st...@cloudera.com.invalid> wrote:
>>>
>>>> It would be really good if the spark distributions shipped with later
>>>> versions of the hadoop artifacts.
>>>>
>>>
>>> I second this. If we need to keep a Hadoop 2.x profile around, why not
>>> make it Hadoop 2.8 or something newer?
>>>
>>
>> go for 2.9
>>
>>>
>>> Koert Kuipers <ko...@tresata.com> wrote:
>>>
>>>> given that latest hdp 2.x is still hadoop 2.7 bumping hadoop 2 profile
>>>> to latest would probably be an issue for us.
>>>
>>>
>>> When was the last time HDP 2.x bumped their minor version of Hadoop? Do
>>> we want to wait for them to bump to Hadoop 2.8 before we do the same?
>>>
>>
>> The internal builds of CDH and HDP are not those of ASF 2.7.x. A really
>> large proportion of the later branch-2 patches are backported. 2,7 was left
>> behind a long time ago
>>
>>
>>
>>
>

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

Posted by Koert Kuipers <ko...@tresata.com>.

i get that cdh and hdp backport a lot and in that way left 2.7 behind. but
they kept the public apis stable at the 2.7 level, because thats kind of
the point. arent those the hadoop apis spark uses?

On Mon, Nov 4, 2019 at 10:07 AM Steve Loughran <st...@cloudera.com.invalid>
wrote:

>
>
> On Mon, Nov 4, 2019 at 12:39 AM Nicholas Chammas <
> nicholas.chammas@gmail.com> wrote:
>
>> On Fri, Nov 1, 2019 at 8:41 AM Steve Loughran <st...@cloudera.com.invalid>
>> wrote:
>>
>>> It would be really good if the spark distributions shipped with later
>>> versions of the hadoop artifacts.
>>>
>>
>> I second this. If we need to keep a Hadoop 2.x profile around, why not
>> make it Hadoop 2.8 or something newer?
>>
>
> go for 2.9
>
>>
>> Koert Kuipers <ko...@tresata.com> wrote:
>>
>>> given that latest hdp 2.x is still hadoop 2.7 bumping hadoop 2 profile
>>> to latest would probably be an issue for us.
>>
>>
>> When was the last time HDP 2.x bumped their minor version of Hadoop? Do
>> we want to wait for them to bump to Hadoop 2.8 before we do the same?
>>
>
> The internal builds of CDH and HDP are not those of ASF 2.7.x. A really
> large proportion of the later branch-2 patches are backported. 2,7 was left
> behind a long time ago
>
>
>
>

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

Posted by Steve Loughran <st...@cloudera.com.INVALID>.

On Mon, Nov 4, 2019 at 12:39 AM Nicholas Chammas <ni...@gmail.com>
wrote:

> On Fri, Nov 1, 2019 at 8:41 AM Steve Loughran <st...@cloudera.com.invalid>
> wrote:
>
>> It would be really good if the spark distributions shipped with later
>> versions of the hadoop artifacts.
>>
>
> I second this. If we need to keep a Hadoop 2.x profile around, why not
> make it Hadoop 2.8 or something newer?
>

go for 2.9

>
> Koert Kuipers <ko...@tresata.com> wrote:
>
>> given that latest hdp 2.x is still hadoop 2.7 bumping hadoop 2 profile to
>> latest would probably be an issue for us.
>
>
> When was the last time HDP 2.x bumped their minor version of Hadoop? Do we
> want to wait for them to bump to Hadoop 2.8 before we do the same?
>

The internal builds of CDH and HDP are not those of ASF 2.7.x. A really
large proportion of the later branch-2 patches are backported. 2,7 was left
behind a long time ago

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

Posted by Nicholas Chammas <ni...@gmail.com>.

On Fri, Nov 1, 2019 at 8:41 AM Steve Loughran <st...@cloudera.com.invalid>
wrote:

> It would be really good if the spark distributions shipped with later
> versions of the hadoop artifacts.
>

I second this. If we need to keep a Hadoop 2.x profile around, why not make
it Hadoop 2.8 or something newer?

Koert Kuipers <ko...@tresata.com> wrote:

> given that latest hdp 2.x is still hadoop 2.7 bumping hadoop 2 profile to
> latest would probably be an issue for us.

When was the last time HDP 2.x bumped their minor version of Hadoop? Do we
want to wait for them to bump to Hadoop 2.8 before we do the same?

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

Posted by Steve Loughran <st...@cloudera.com.INVALID>.

On Sun, 12 Jul 2020 at 01:45, gpongracz <gp...@gmail.com> wrote:

> As someone who mainly operates in AWS it would be very welcome to have the
> option to use an updated version of hadoop using pyspark sourced from pypi.
>
> Acknowledging the issues of backwards compatability...
>
> The most vexing issue is the lack of ability to use s3a STS, ie
> org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider.
>
> This prevents the use of AWS temporary credentials, hampering local
> development against s3.
>
> I'd personally worry about other issue related to performance, security,
Joda Time and java 8, etc. Hadoop 2.7.x is EOL and doesn't get security
fixes any more.

If you do want that temporary credentials provider -you can stick a copy of
the class on your classpath and just list it on the
option fs.s3a.aws.credentials.provider


> Whilst this would be solved by bumping the hadoop version to anything >=
> 2.8.x, the 3.x option would also allow for the writing of data using KMS.
>
> Regards,
>
> George Pongracz
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>
>

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

Posted by gpongracz <gp...@gmail.com>.

As someone who mainly operates in AWS it would be very welcome to have the
option to use an updated version of hadoop using pyspark sourced from pypi.

Acknowledging the issues of backwards compatability...

The most vexing issue is the lack of ability to use s3a STS, ie
org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider. 

This prevents the use of AWS temporary credentials, hampering local
development against s3.

Whilst this would be solved by bumping the hadoop version to anything >=
2.8.x, the 3.x option would also allow for the writing of data using KMS.

Regards,

George Pongracz



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

Posted by Hyukjin Kwon <gu...@gmail.com>.

I dont have a strong opinion on changing default too but I also a little
bit more prefer to have the option to switch Hadoop version first just to
stay safer.

To be clear, we're more now discussing about the timing about when to set
Hadoop 3.0.0 by default, and which change has to be first, right?

I can work on the option to switch soon. I guess that will consolidate all
opinions here (?).


On Thu, 25 Jun 2020, 04:25 Andrew Melo, <an...@gmail.com> wrote:

> Hello,
>
> On Wed, Jun 24, 2020 at 2:13 PM Holden Karau <ho...@pigscanfly.ca> wrote:
> >
> > So I thought our theory for the pypi packages was it was for local
> developers, they really shouldn't care about the Hadoop version. If you're
> running on a production cluster you ideally pip install from the same
> release artifacts as your production cluster to match.
>
> That's certainly one use of pypi packages, but not the only one. In
> our case, we provide clusters for our users, with SPARK_CONF pre
> configured with (e.g.) the master connection URL. But the analyses
> they're doing are their own and unique, so they work in their own
> personal python virtual environments. There are no "release artifacts"
> to publish, per-se, since each user is working independently and can
> install whatever they'd like into their virtual environments.
>
> Cheers
> Andrew
>
> >
> > On Wed, Jun 24, 2020 at 12:11 PM Wenchen Fan <cl...@gmail.com>
> wrote:
> >>
> >> Shall we start a new thread to discuss the bundled Hadoop version in
> PySpark? I don't have a strong opinion on changing the default, as users
> can still download the Hadoop 2.7 version.
> >>
> >> On Thu, Jun 25, 2020 at 2:23 AM Dongjoon Hyun <do...@gmail.com>
> wrote:
> >>>
> >>> To Xiao.
> >>> Why Apache project releases should be blocked by PyPi / CRAN? It's
> completely optional, isn't it?
> >>>
> >>>     > let me repeat my opinion:  the top priority is to provide two
> options for PyPi distribution
> >>>
> >>> IIRC, Apache Spark 3.0.0 fails to upload to CRAN and this is not the
> first incident. Apache Spark already has a history of missing SparkR
> uploading. We don't say Spark 3.0.0 fails due to CRAN uploading or other
> non-Apache distribution channels. In short, non-Apache distribution
> channels cannot be a `blocker` for Apache project releases. We only do our
> best for the community.
> >>>
> >>> SPARK-32017 (Make Pyspark Hadoop 3.2+ Variant available in PyPI) is
> really irrelevant to this PR. If someone wants to do that and the PR is
> ready, why don't we do it in `Apache Spark 3.0.1 timeline`? Why do we wait
> for December? Is there a reason why we need to wait?
> >>>
> >>> To Sean
> >>> Thanks!
> >>>
> >>> To Nicholas.
> >>> Do you think `pip install pyspark` is version-agnostic? In the Python
> world, `pip install somepackage` fails frequently. In production, you
> should use `pip install somepackage==specificversion`. I don't think the
> production pipeline has non-versinoned Python package installation.
> >>>
> >>> The bottom line is that the PR doesn't change PyPi uploading, the
> AS-IS PR keeps Hadoop 2.7 on PySpark due to Xiao's comments. I don't think
> there is a blocker for that PR.
> >>>
> >>> Bests,
> >>> Dongjoon.
> >>>
> >>> On Wed, Jun 24, 2020 at 10:54 AM Nicholas Chammas <
> nicholas.chammas@gmail.com> wrote:
> >>>>
> >>>> To rephrase my earlier email, PyPI users would care about the bundled
> Hadoop version if they have a workflow that, in effect, looks something
> like this:
> >>>>
> >>>> ```
> >>>> pip install pyspark
> >>>> pyspark --packages org.apache.hadoop:hadoop-aws:2.7.7
> >>>> spark.read.parquet('s3a://...')
> >>>> ```
> >>>>
> >>>> I agree that Hadoop 3 would be a better default (again, the s3a
> support is just much better). But to Xiao's point, if you are expecting
> Spark to work with some package like hadoop-aws that assumes an older
> version of Hadoop bundled with Spark, then changing the default may break
> your workflow.
> >>>>
> >>>> In the case of hadoop-aws the fix is simple--just bump
> hadoop-aws:2.7.7 to hadoop-aws:3.2.1. But perhaps there are other
> PyPI-based workflows that would be more difficult to repair. 🤷‍♂️
> >>>>
> >>>> On Wed, Jun 24, 2020 at 1:44 PM Sean Owen <sr...@gmail.com> wrote:
> >>>>>
> >>>>> I'm also genuinely curious when PyPI users would care about the
> >>>>> bundled Hadoop jars - do we even need two versions? that itself is
> >>>>> extra complexity for end users.
> >>>>> I do think Hadoop 3 is the better choice for the user who doesn't
> >>>>> care, and better long term.
> >>>>> OK but let's at least move ahead with changing defaults.
> >>>>>
> >>>>> On Wed, Jun 24, 2020 at 12:38 PM Xiao Li <li...@databricks.com>
> wrote:
> >>>>> >
> >>>>> > Hi, Dongjoon,
> >>>>> >
> >>>>> > Please do not misinterpret my point. I already clearly said "I do
> not know how to track the popularity of Hadoop 2 vs Hadoop 3."
> >>>>> >
> >>>>> > Also, let me repeat my opinion:  the top priority is to provide
> two options for PyPi distribution and let the end users choose the ones
> they need. Hadoop 3.2 or Hadoop 2.7. In general, when we want to make any
> breaking change, let us follow our protocol documented in
> https://spark.apache.org/versioning-policy.html.
> >>>>> >
> >>>>> > If you just want to change the Jenkins setup, I am OK about it. If
> you want to change the default distribution, we need more discussions in
> the community for getting an agreement.
> >>>>> >
> >>>>> >  Thanks,
> >>>>> >
> >>>>> > Xiao
> >>>>> >
> >
> >
> >
> > --
> > Twitter: https://twitter.com/holdenkarau
> > Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9
> > YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>
>

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

Posted by Andrew Melo <an...@gmail.com>.

Hello,

On Wed, Jun 24, 2020 at 2:13 PM Holden Karau <ho...@pigscanfly.ca> wrote:
>
> So I thought our theory for the pypi packages was it was for local developers, they really shouldn't care about the Hadoop version. If you're running on a production cluster you ideally pip install from the same release artifacts as your production cluster to match.

That's certainly one use of pypi packages, but not the only one. In
our case, we provide clusters for our users, with SPARK_CONF pre
configured with (e.g.) the master connection URL. But the analyses
they're doing are their own and unique, so they work in their own
personal python virtual environments. There are no "release artifacts"
to publish, per-se, since each user is working independently and can
install whatever they'd like into their virtual environments.

Cheers
Andrew

>
> On Wed, Jun 24, 2020 at 12:11 PM Wenchen Fan <cl...@gmail.com> wrote:
>>
>> Shall we start a new thread to discuss the bundled Hadoop version in PySpark? I don't have a strong opinion on changing the default, as users can still download the Hadoop 2.7 version.
>>
>> On Thu, Jun 25, 2020 at 2:23 AM Dongjoon Hyun <do...@gmail.com> wrote:
>>>
>>> To Xiao.
>>> Why Apache project releases should be blocked by PyPi / CRAN? It's completely optional, isn't it?
>>>
>>>     > let me repeat my opinion:  the top priority is to provide two options for PyPi distribution
>>>
>>> IIRC, Apache Spark 3.0.0 fails to upload to CRAN and this is not the first incident. Apache Spark already has a history of missing SparkR uploading. We don't say Spark 3.0.0 fails due to CRAN uploading or other non-Apache distribution channels. In short, non-Apache distribution channels cannot be a `blocker` for Apache project releases. We only do our best for the community.
>>>
>>> SPARK-32017 (Make Pyspark Hadoop 3.2+ Variant available in PyPI) is really irrelevant to this PR. If someone wants to do that and the PR is ready, why don't we do it in `Apache Spark 3.0.1 timeline`? Why do we wait for December? Is there a reason why we need to wait?
>>>
>>> To Sean
>>> Thanks!
>>>
>>> To Nicholas.
>>> Do you think `pip install pyspark` is version-agnostic? In the Python world, `pip install somepackage` fails frequently. In production, you should use `pip install somepackage==specificversion`. I don't think the production pipeline has non-versinoned Python package installation.
>>>
>>> The bottom line is that the PR doesn't change PyPi uploading, the AS-IS PR keeps Hadoop 2.7 on PySpark due to Xiao's comments. I don't think there is a blocker for that PR.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>> On Wed, Jun 24, 2020 at 10:54 AM Nicholas Chammas <ni...@gmail.com> wrote:
>>>>
>>>> To rephrase my earlier email, PyPI users would care about the bundled Hadoop version if they have a workflow that, in effect, looks something like this:
>>>>
>>>> ```
>>>> pip install pyspark
>>>> pyspark --packages org.apache.hadoop:hadoop-aws:2.7.7
>>>> spark.read.parquet('s3a://...')
>>>> ```
>>>>
>>>> I agree that Hadoop 3 would be a better default (again, the s3a support is just much better). But to Xiao's point, if you are expecting Spark to work with some package like hadoop-aws that assumes an older version of Hadoop bundled with Spark, then changing the default may break your workflow.
>>>>
>>>> In the case of hadoop-aws the fix is simple--just bump hadoop-aws:2.7.7 to hadoop-aws:3.2.1. But perhaps there are other PyPI-based workflows that would be more difficult to repair. 🤷‍♂️
>>>>
>>>> On Wed, Jun 24, 2020 at 1:44 PM Sean Owen <sr...@gmail.com> wrote:
>>>>>
>>>>> I'm also genuinely curious when PyPI users would care about the
>>>>> bundled Hadoop jars - do we even need two versions? that itself is
>>>>> extra complexity for end users.
>>>>> I do think Hadoop 3 is the better choice for the user who doesn't
>>>>> care, and better long term.
>>>>> OK but let's at least move ahead with changing defaults.
>>>>>
>>>>> On Wed, Jun 24, 2020 at 12:38 PM Xiao Li <li...@databricks.com> wrote:
>>>>> >
>>>>> > Hi, Dongjoon,
>>>>> >
>>>>> > Please do not misinterpret my point. I already clearly said "I do not know how to track the popularity of Hadoop 2 vs Hadoop 3."
>>>>> >
>>>>> > Also, let me repeat my opinion:  the top priority is to provide two options for PyPi distribution and let the end users choose the ones they need. Hadoop 3.2 or Hadoop 2.7. In general, when we want to make any breaking change, let us follow our protocol documented in https://spark.apache.org/versioning-policy.html.
>>>>> >
>>>>> > If you just want to change the Jenkins setup, I am OK about it. If you want to change the default distribution, we need more discussions in the community for getting an agreement.
>>>>> >
>>>>> >  Thanks,
>>>>> >
>>>>> > Xiao
>>>>> >
>
>
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

Posted by Holden Karau <ho...@pigscanfly.ca>.

So I thought our theory for the pypi packages was it was for local
developers, they really shouldn't care about the Hadoop version. If you're
running on a production cluster you ideally pip install from the same
release artifacts as your production cluster to match.

On Wed, Jun 24, 2020 at 12:11 PM Wenchen Fan <cl...@gmail.com> wrote:

> Shall we start a new thread to discuss the bundled Hadoop version in
> PySpark? I don't have a strong opinion on changing the default, as users
> can still download the Hadoop 2.7 version.
>
> On Thu, Jun 25, 2020 at 2:23 AM Dongjoon Hyun <do...@gmail.com>
> wrote:
>
>> To Xiao.
>> Why Apache project releases should be blocked by PyPi / CRAN? It's
>> completely optional, isn't it?
>>
>>     > let me repeat my opinion:  the top priority is to provide two
>> options for PyPi distribution
>>
>> IIRC, Apache Spark 3.0.0 fails to upload to CRAN and this is not the
>> first incident. Apache Spark already has a history of missing SparkR
>> uploading. We don't say Spark 3.0.0 fails due to CRAN uploading or other
>> non-Apache distribution channels. In short, non-Apache distribution
>> channels cannot be a `blocker` for Apache project releases. We only do our
>> best for the community.
>>
>> SPARK-32017 (Make Pyspark Hadoop 3.2+ Variant available in PyPI) is
>> really irrelevant to this PR. If someone wants to do that and the PR is
>> ready, why don't we do it in `Apache Spark 3.0.1 timeline`? Why do we wait
>> for December? Is there a reason why we need to wait?
>>
>> To Sean
>> Thanks!
>>
>> To Nicholas.
>> Do you think `pip install pyspark` is version-agnostic? In the Python
>> world, `pip install somepackage` fails frequently. In production, you
>> should use `pip install somepackage==specificversion`. I don't think the
>> production pipeline has non-versinoned Python package installation.
>>
>> The bottom line is that the PR doesn't change PyPi uploading, the AS-IS
>> PR keeps Hadoop 2.7 on PySpark due to Xiao's comments. I don't think there
>> is a blocker for that PR.
>>
>> Bests,
>> Dongjoon.
>>
>> On Wed, Jun 24, 2020 at 10:54 AM Nicholas Chammas <
>> nicholas.chammas@gmail.com> wrote:
>>
>>> To rephrase my earlier email, PyPI users would care about the bundled
>>> Hadoop version if they have a workflow that, in effect, looks something
>>> like this:
>>>
>>> ```
>>> pip install pyspark
>>> pyspark --packages org.apache.hadoop:hadoop-aws:2.7.7
>>> spark.read.parquet('s3a://...')
>>> ```
>>>
>>> I agree that Hadoop 3 would be a better default (again, the s3a support
>>> is just much better). But to Xiao's point, if you are expecting Spark to
>>> work with some package like hadoop-aws that assumes an older version of
>>> Hadoop bundled with Spark, then changing the default may break your
>>> workflow.
>>>
>>> In the case of hadoop-aws the fix is simple--just bump hadoop-aws:2.7.7
>>> to hadoop-aws:3.2.1. But perhaps there are other PyPI-based workflows that
>>> would be more difficult to repair. 🤷‍♂️
>>>
>>> On Wed, Jun 24, 2020 at 1:44 PM Sean Owen <sr...@gmail.com> wrote:
>>>
>>>> I'm also genuinely curious when PyPI users would care about the
>>>> bundled Hadoop jars - do we even need two versions? that itself is
>>>> extra complexity for end users.
>>>> I do think Hadoop 3 is the better choice for the user who doesn't
>>>> care, and better long term.
>>>> OK but let's at least move ahead with changing defaults.
>>>>
>>>> On Wed, Jun 24, 2020 at 12:38 PM Xiao Li <li...@databricks.com> wrote:
>>>> >
>>>> > Hi, Dongjoon,
>>>> >
>>>> > Please do not misinterpret my point. I already clearly said "I do not
>>>> know how to track the popularity of Hadoop 2 vs Hadoop 3."
>>>> >
>>>> > Also, let me repeat my opinion:  the top priority is to provide two
>>>> options for PyPi distribution and let the end users choose the ones they
>>>> need. Hadoop 3.2 or Hadoop 2.7. In general, when we want to make any
>>>> breaking change, let us follow our protocol documented in
>>>> https://spark.apache.org/versioning-policy.html.
>>>> >
>>>> > If you just want to change the Jenkins setup, I am OK about it. If
>>>> you want to change the default distribution, we need more discussions in
>>>> the community for getting an agreement.
>>>> >
>>>> >  Thanks,
>>>> >
>>>> > Xiao
>>>> >
>>>>
>>>

-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
YouTube Live Streams: https://www.youtube.com/user/holdenkarau

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

Posted by Wenchen Fan <cl...@gmail.com>.

Shall we start a new thread to discuss the bundled Hadoop version in
PySpark? I don't have a strong opinion on changing the default, as users
can still download the Hadoop 2.7 version.

On Thu, Jun 25, 2020 at 2:23 AM Dongjoon Hyun <do...@gmail.com>
wrote:

> To Xiao.
> Why Apache project releases should be blocked by PyPi / CRAN? It's
> completely optional, isn't it?
>
>     > let me repeat my opinion:  the top priority is to provide two
> options for PyPi distribution
>
> IIRC, Apache Spark 3.0.0 fails to upload to CRAN and this is not the first
> incident. Apache Spark already has a history of missing SparkR uploading.
> We don't say Spark 3.0.0 fails due to CRAN uploading or other non-Apache
> distribution channels. In short, non-Apache distribution channels cannot be
> a `blocker` for Apache project releases. We only do our best for the
> community.
>
> SPARK-32017 (Make Pyspark Hadoop 3.2+ Variant available in PyPI) is really
> irrelevant to this PR. If someone wants to do that and the PR is ready, why
> don't we do it in `Apache Spark 3.0.1 timeline`? Why do we wait for
> December? Is there a reason why we need to wait?
>
> To Sean
> Thanks!
>
> To Nicholas.
> Do you think `pip install pyspark` is version-agnostic? In the Python
> world, `pip install somepackage` fails frequently. In production, you
> should use `pip install somepackage==specificversion`. I don't think the
> production pipeline has non-versinoned Python package installation.
>
> The bottom line is that the PR doesn't change PyPi uploading, the AS-IS PR
> keeps Hadoop 2.7 on PySpark due to Xiao's comments. I don't think there is
> a blocker for that PR.
>
> Bests,
> Dongjoon.
>
> On Wed, Jun 24, 2020 at 10:54 AM Nicholas Chammas <
> nicholas.chammas@gmail.com> wrote:
>
>> To rephrase my earlier email, PyPI users would care about the bundled
>> Hadoop version if they have a workflow that, in effect, looks something
>> like this:
>>
>> ```
>> pip install pyspark
>> pyspark --packages org.apache.hadoop:hadoop-aws:2.7.7
>> spark.read.parquet('s3a://...')
>> ```
>>
>> I agree that Hadoop 3 would be a better default (again, the s3a support
>> is just much better). But to Xiao's point, if you are expecting Spark to
>> work with some package like hadoop-aws that assumes an older version of
>> Hadoop bundled with Spark, then changing the default may break your
>> workflow.
>>
>> In the case of hadoop-aws the fix is simple--just bump hadoop-aws:2.7.7
>> to hadoop-aws:3.2.1. But perhaps there are other PyPI-based workflows that
>> would be more difficult to repair. 🤷‍♂️
>>
>> On Wed, Jun 24, 2020 at 1:44 PM Sean Owen <sr...@gmail.com> wrote:
>>
>>> I'm also genuinely curious when PyPI users would care about the
>>> bundled Hadoop jars - do we even need two versions? that itself is
>>> extra complexity for end users.
>>> I do think Hadoop 3 is the better choice for the user who doesn't
>>> care, and better long term.
>>> OK but let's at least move ahead with changing defaults.
>>>
>>> On Wed, Jun 24, 2020 at 12:38 PM Xiao Li <li...@databricks.com> wrote:
>>> >
>>> > Hi, Dongjoon,
>>> >
>>> > Please do not misinterpret my point. I already clearly said "I do not
>>> know how to track the popularity of Hadoop 2 vs Hadoop 3."
>>> >
>>> > Also, let me repeat my opinion:  the top priority is to provide two
>>> options for PyPi distribution and let the end users choose the ones they
>>> need. Hadoop 3.2 or Hadoop 2.7. In general, when we want to make any
>>> breaking change, let us follow our protocol documented in
>>> https://spark.apache.org/versioning-policy.html.
>>> >
>>> > If you just want to change the Jenkins setup, I am OK about it. If you
>>> want to change the default distribution, we need more discussions in the
>>> community for getting an agreement.
>>> >
>>> >  Thanks,
>>> >
>>> > Xiao
>>> >
>>>
>>

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

Posted by Dongjoon Hyun <do...@gmail.com>.

To Xiao.
Why Apache project releases should be blocked by PyPi / CRAN? It's
completely optional, isn't it?

    > let me repeat my opinion:  the top priority is to provide two options
for PyPi distribution

IIRC, Apache Spark 3.0.0 fails to upload to CRAN and this is not the first
incident. Apache Spark already has a history of missing SparkR uploading.
We don't say Spark 3.0.0 fails due to CRAN uploading or other non-Apache
distribution channels. In short, non-Apache distribution channels cannot be
a `blocker` for Apache project releases. We only do our best for the
community.

SPARK-32017 (Make Pyspark Hadoop 3.2+ Variant available in PyPI) is really
irrelevant to this PR. If someone wants to do that and the PR is ready, why
don't we do it in `Apache Spark 3.0.1 timeline`? Why do we wait for
December? Is there a reason why we need to wait?

To Sean
Thanks!

To Nicholas.
Do you think `pip install pyspark` is version-agnostic? In the Python
world, `pip install somepackage` fails frequently. In production, you
should use `pip install somepackage==specificversion`. I don't think the
production pipeline has non-versinoned Python package installation.

The bottom line is that the PR doesn't change PyPi uploading, the AS-IS PR
keeps Hadoop 2.7 on PySpark due to Xiao's comments. I don't think there is
a blocker for that PR.

Bests,
Dongjoon.

On Wed, Jun 24, 2020 at 10:54 AM Nicholas Chammas <
nicholas.chammas@gmail.com> wrote:

> To rephrase my earlier email, PyPI users would care about the bundled
> Hadoop version if they have a workflow that, in effect, looks something
> like this:
>
> ```
> pip install pyspark
> pyspark --packages org.apache.hadoop:hadoop-aws:2.7.7
> spark.read.parquet('s3a://...')
> ```
>
> I agree that Hadoop 3 would be a better default (again, the s3a support is
> just much better). But to Xiao's point, if you are expecting Spark to work
> with some package like hadoop-aws that assumes an older version of Hadoop
> bundled with Spark, then changing the default may break your workflow.
>
> In the case of hadoop-aws the fix is simple--just bump hadoop-aws:2.7.7 to
> hadoop-aws:3.2.1. But perhaps there are other PyPI-based workflows that
> would be more difficult to repair. 🤷‍♂️
>
> On Wed, Jun 24, 2020 at 1:44 PM Sean Owen <sr...@gmail.com> wrote:
>
>> I'm also genuinely curious when PyPI users would care about the
>> bundled Hadoop jars - do we even need two versions? that itself is
>> extra complexity for end users.
>> I do think Hadoop 3 is the better choice for the user who doesn't
>> care, and better long term.
>> OK but let's at least move ahead with changing defaults.
>>
>> On Wed, Jun 24, 2020 at 12:38 PM Xiao Li <li...@databricks.com> wrote:
>> >
>> > Hi, Dongjoon,
>> >
>> > Please do not misinterpret my point. I already clearly said "I do not
>> know how to track the popularity of Hadoop 2 vs Hadoop 3."
>> >
>> > Also, let me repeat my opinion:  the top priority is to provide two
>> options for PyPi distribution and let the end users choose the ones they
>> need. Hadoop 3.2 or Hadoop 2.7. In general, when we want to make any
>> breaking change, let us follow our protocol documented in
>> https://spark.apache.org/versioning-policy.html.
>> >
>> > If you just want to change the Jenkins setup, I am OK about it. If you
>> want to change the default distribution, we need more discussions in the
>> community for getting an agreement.
>> >
>> >  Thanks,
>> >
>> > Xiao
>> >
>>
>

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

Posted by Nicholas Chammas <ni...@gmail.com>.

To rephrase my earlier email, PyPI users would care about the bundled
Hadoop version if they have a workflow that, in effect, looks something
like this:

```
pip install pyspark
pyspark --packages org.apache.hadoop:hadoop-aws:2.7.7
spark.read.parquet('s3a://...')
```

I agree that Hadoop 3 would be a better default (again, the s3a support is
just much better). But to Xiao's point, if you are expecting Spark to work
with some package like hadoop-aws that assumes an older version of Hadoop
bundled with Spark, then changing the default may break your workflow.

In the case of hadoop-aws the fix is simple--just bump hadoop-aws:2.7.7 to
hadoop-aws:3.2.1. But perhaps there are other PyPI-based workflows that
would be more difficult to repair. 🤷‍♂️

On Wed, Jun 24, 2020 at 1:44 PM Sean Owen <sr...@gmail.com> wrote:

> I'm also genuinely curious when PyPI users would care about the
> bundled Hadoop jars - do we even need two versions? that itself is
> extra complexity for end users.
> I do think Hadoop 3 is the better choice for the user who doesn't
> care, and better long term.
> OK but let's at least move ahead with changing defaults.
>
> On Wed, Jun 24, 2020 at 12:38 PM Xiao Li <li...@databricks.com> wrote:
> >
> > Hi, Dongjoon,
> >
> > Please do not misinterpret my point. I already clearly said "I do not
> know how to track the popularity of Hadoop 2 vs Hadoop 3."
> >
> > Also, let me repeat my opinion:  the top priority is to provide two
> options for PyPi distribution and let the end users choose the ones they
> need. Hadoop 3.2 or Hadoop 2.7. In general, when we want to make any
> breaking change, let us follow our protocol documented in
> https://spark.apache.org/versioning-policy.html.
> >
> > If you just want to change the Jenkins setup, I am OK about it. If you
> want to change the default distribution, we need more discussions in the
> community for getting an agreement.
> >
> >  Thanks,
> >
> > Xiao
> >
>

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

Posted by Sean Owen <sr...@gmail.com>.

I'm also genuinely curious when PyPI users would care about the
bundled Hadoop jars - do we even need two versions? that itself is
extra complexity for end users.
I do think Hadoop 3 is the better choice for the user who doesn't
care, and better long term.
OK but let's at least move ahead with changing defaults.

On Wed, Jun 24, 2020 at 12:38 PM Xiao Li <li...@databricks.com> wrote:
>
> Hi, Dongjoon,
>
> Please do not misinterpret my point. I already clearly said "I do not know how to track the popularity of Hadoop 2 vs Hadoop 3."
>
> Also, let me repeat my opinion:  the top priority is to provide two options for PyPi distribution and let the end users choose the ones they need. Hadoop 3.2 or Hadoop 2.7. In general, when we want to make any breaking change, let us follow our protocol documented in https://spark.apache.org/versioning-policy.html.
>
> If you just want to change the Jenkins setup, I am OK about it. If you want to change the default distribution, we need more discussions in the community for getting an agreement.
>
>  Thanks,
>
> Xiao
>

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

Posted by Xiao Li <li...@databricks.com>.

Hi, Dongjoon,

Please do not misinterpret my point. I already clearly said "I do not know
how to track the popularity of Hadoop 2 vs Hadoop 3."

Also, let me repeat my opinion:  the top priority is to provide two options
for PyPi distribution and let the end users choose the ones they need.
Hadoop 3.2 or Hadoop 2.7. In general, when we want to make any breaking
change, let us follow our protocol documented in
https://spark.apache.org/versioning-policy.html.

If you just want to change the Jenkins setup, I am OK about it. If you want
to change the default distribution, we need more discussions in the
community for getting an agreement.

 Thanks,

Xiao


On Wed, Jun 24, 2020 at 10:07 AM Dongjoon Hyun <do...@gmail.com>
wrote:

> Thanks, Xiao, Sean, Nicholas.
>
> To Xiao,
>
> >  it sounds like Hadoop 3.x is not as popular as Hadoop 2.7.
>
> If you say so,
> - Apache Hadoop 2.6.0 is the most popular one with 156 dependencies.
> - Apache Spark 2.2.0 is the most popular one with 264 dependencies.
>
> As we know, it doesn't make sense. Are we recommending Apache Spark 2.2.0
> over Apache Spark 3.0.0?
>
> There is a reason why Apache Spark dropped Hadoop 2.6 profile. Hadoop
> 2.7.4 has many limitations in the cloud environment. Apache Hadoop 3.2 will
> unleash Apache Spark 3.1 in the cloud environment.  (Nicholas also pointed
> it).
>
> For Sean's comment, yes. We can focus on that later in a different thread.
>
> > The bigger question indeed is dropping Hadoop 2.x / Hive 1.x etc
> eventually, not now.
>
> Bests,
> Dongjoon.
>
>
> On Wed, Jun 24, 2020 at 7:26 AM Nicholas Chammas <
> nicholas.chammas@gmail.com> wrote:
>
>> The team I'm on currently uses pip-installed PySpark for local
>> development, and we regularly access S3 directly from our
>> laptops/workstations.
>>
>> One of the benefits of having Spark built against Hadoop 3.2 vs. 2.7 is
>> being able to use a recent version of hadoop-aws that has mature support
>> for s3a. With Hadoop 2.7 the support for s3a is buggy and incomplete, and
>> there are incompatibilities that prevent you from using Spark built against
>> Hadoop 2.7 with hadoop-aws version 2.8 or newer.
>>
>> On Wed, Jun 24, 2020 at 10:15 AM Sean Owen <sr...@gmail.com> wrote:
>>
>>> Will pyspark users care much about Hadoop version? they won't if running
>>> locally. They will if connecting to a Hadoop cluster. Then again in that
>>> context, they're probably using a distro anyway that harmonizes it.
>>> Hadoop 3's installed based can't be that large yet; it's been around far
>>> less time.
>>>
>>> The bigger question indeed is dropping Hadoop 2.x / Hive 1.x etc
>>> eventually, not now.
>>> But if the question now is build defaults, is it a big deal either way?
>>>
>>> On Tue, Jun 23, 2020 at 11:03 PM Xiao Li <li...@databricks.com> wrote:
>>>
>>>> I think we just need to provide two options and let end users choose
>>>> the ones they need. Hadoop 3.2 or Hadoop 2.7. Thus, SPARK-32017 (Make
>>>> Pyspark Hadoop 3.2+ Variant available in PyPI) is a high priority task for
>>>> Spark 3.1 release to me.
>>>>
>>>> I do not know how to track the popularity of Hadoop 2 vs Hadoop 3.
>>>> Based on this link
>>>> https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-hdfs , it
>>>> sounds like Hadoop 3.x is not as popular as Hadoop 2.7.
>>>>
>>>>
>>>>

-- 
<https://databricks.com/sparkaisummit/north-america>

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

Posted by Dongjoon Hyun <do...@gmail.com>.

Thanks, Xiao, Sean, Nicholas.

To Xiao,

>  it sounds like Hadoop 3.x is not as popular as Hadoop 2.7.

If you say so,
- Apache Hadoop 2.6.0 is the most popular one with 156 dependencies.
- Apache Spark 2.2.0 is the most popular one with 264 dependencies.

As we know, it doesn't make sense. Are we recommending Apache Spark 2.2.0
over Apache Spark 3.0.0?

There is a reason why Apache Spark dropped Hadoop 2.6 profile. Hadoop 2.7.4
has many limitations in the cloud environment. Apache Hadoop 3.2 will
unleash Apache Spark 3.1 in the cloud environment.  (Nicholas also pointed
it).

For Sean's comment, yes. We can focus on that later in a different thread.

> The bigger question indeed is dropping Hadoop 2.x / Hive 1.x etc
eventually, not now.

Bests,
Dongjoon.


On Wed, Jun 24, 2020 at 7:26 AM Nicholas Chammas <ni...@gmail.com>
wrote:

> The team I'm on currently uses pip-installed PySpark for local
> development, and we regularly access S3 directly from our
> laptops/workstations.
>
> One of the benefits of having Spark built against Hadoop 3.2 vs. 2.7 is
> being able to use a recent version of hadoop-aws that has mature support
> for s3a. With Hadoop 2.7 the support for s3a is buggy and incomplete, and
> there are incompatibilities that prevent you from using Spark built against
> Hadoop 2.7 with hadoop-aws version 2.8 or newer.
>
> On Wed, Jun 24, 2020 at 10:15 AM Sean Owen <sr...@gmail.com> wrote:
>
>> Will pyspark users care much about Hadoop version? they won't if running
>> locally. They will if connecting to a Hadoop cluster. Then again in that
>> context, they're probably using a distro anyway that harmonizes it.
>> Hadoop 3's installed based can't be that large yet; it's been around far
>> less time.
>>
>> The bigger question indeed is dropping Hadoop 2.x / Hive 1.x etc
>> eventually, not now.
>> But if the question now is build defaults, is it a big deal either way?
>>
>> On Tue, Jun 23, 2020 at 11:03 PM Xiao Li <li...@databricks.com> wrote:
>>
>>> I think we just need to provide two options and let end users choose the
>>> ones they need. Hadoop 3.2 or Hadoop 2.7. Thus, SPARK-32017 (Make Pyspark
>>> Hadoop 3.2+ Variant available in PyPI) is a high priority task for Spark
>>> 3.1 release to me.
>>>
>>> I do not know how to track the popularity of Hadoop 2 vs Hadoop 3. Based
>>> on this link
>>> https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-hdfs , it
>>> sounds like Hadoop 3.x is not as popular as Hadoop 2.7.
>>>
>>>
>>>

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

Posted by Nicholas Chammas <ni...@gmail.com>.

The team I'm on currently uses pip-installed PySpark for local development,
and we regularly access S3 directly from our laptops/workstations.

One of the benefits of having Spark built against Hadoop 3.2 vs. 2.7 is
being able to use a recent version of hadoop-aws that has mature support
for s3a. With Hadoop 2.7 the support for s3a is buggy and incomplete, and
there are incompatibilities that prevent you from using Spark built against
Hadoop 2.7 with hadoop-aws version 2.8 or newer.

On Wed, Jun 24, 2020 at 10:15 AM Sean Owen <sr...@gmail.com> wrote:

> Will pyspark users care much about Hadoop version? they won't if running
> locally. They will if connecting to a Hadoop cluster. Then again in that
> context, they're probably using a distro anyway that harmonizes it.
> Hadoop 3's installed based can't be that large yet; it's been around far
> less time.
>
> The bigger question indeed is dropping Hadoop 2.x / Hive 1.x etc
> eventually, not now.
> But if the question now is build defaults, is it a big deal either way?
>
> On Tue, Jun 23, 2020 at 11:03 PM Xiao Li <li...@databricks.com> wrote:
>
>> I think we just need to provide two options and let end users choose the
>> ones they need. Hadoop 3.2 or Hadoop 2.7. Thus, SPARK-32017 (Make Pyspark
>> Hadoop 3.2+ Variant available in PyPI) is a high priority task for Spark
>> 3.1 release to me.
>>
>> I do not know how to track the popularity of Hadoop 2 vs Hadoop 3. Based
>> on this link
>> https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-hdfs , it
>> sounds like Hadoop 3.x is not as popular as Hadoop 2.7.
>>
>>
>>

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

Posted by Sean Owen <sr...@gmail.com>.

Will pyspark users care much about Hadoop version? they won't if running
locally. They will if connecting to a Hadoop cluster. Then again in that
context, they're probably using a distro anyway that harmonizes it.
Hadoop 3's installed based can't be that large yet; it's been around far
less time.

The bigger question indeed is dropping Hadoop 2.x / Hive 1.x etc
eventually, not now.
But if the question now is build defaults, is it a big deal either way?

On Tue, Jun 23, 2020 at 11:03 PM Xiao Li <li...@databricks.com> wrote:

> I think we just need to provide two options and let end users choose the
> ones they need. Hadoop 3.2 or Hadoop 2.7. Thus, SPARK-32017 (Make Pyspark
> Hadoop 3.2+ Variant available in PyPI) is a high priority task for Spark
> 3.1 release to me.
>
> I do not know how to track the popularity of Hadoop 2 vs Hadoop 3. Based
> on this link
> https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-hdfs , it
> sounds like Hadoop 3.x is not as popular as Hadoop 2.7.
>
>
>

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

Posted by Xiao Li <li...@databricks.com>.

I think we just need to provide two options and let end users choose the
ones they need. Hadoop 3.2 or Hadoop 2.7. Thus, SPARK-32017 (Make Pyspark
Hadoop 3.2+ Variant available in PyPI) is a high priority task for Spark
3.1 release to me.

I do not know how to track the popularity of Hadoop 2 vs Hadoop 3. Based on
this link
https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-hdfs , it
sounds like Hadoop 3.x is not as popular as Hadoop 2.7.


On Tue, Jun 23, 2020 at 8:08 PM Dongjoon Hyun <do...@gmail.com>
wrote:

> I fully understand your concern, but we cannot live with Hadoop 2.7.4
> forever, Xiao. Like Hadoop 2.6, we should let it go.
>
> So, are you saying that CRAN/PyPy should have all combination of Apache
> Spark including Hive 1.2 distribution?
>
> What is your suggestion as a PMC on Hadoop 3.2 migration path? I'd love to
> remove the road blocks for that.
>
> As a side note, Homebrew is not Apache Spark official channel, but it's
> also popular distribution channel in the community. And, it's using Hadoop
> 3.2 distribution already. Hadoop 2.7 is too old for Year 2021 (Apache Spark
> 3.1), isn't it?
>
> Bests,
> Dongjoon.
>
>
>
> On Tue, Jun 23, 2020 at 7:55 PM Xiao Li <li...@databricks.com> wrote:
>
>> Then, it will be a little complex after this PR. It might make the
>> community more confused.
>>
>> In PYPI and CRAN, we are using Hadoop 2.7 as the default profile;
>> however, in the other distributions, we are using Hadoop 3.2 as the
>> default?
>>
>> How to explain this to the community? I would not change the default for
>> consistency.
>>
>> Xiao
>>
>>
>>
>> On Tue, Jun 23, 2020 at 7:18 PM Dongjoon Hyun <do...@gmail.com>
>> wrote:
>>
>>> Thanks. Uploading PySpark to PyPI is a simple manual step and
>>> our release script is able to build PySpark with Hadoop 2.7 still if we
>>> want.
>>> So, `No` for the following question. I updated my PR according to your
>>> comment.
>>>
>>> > If we change the default, will it impact them? If YES,...
>>>
>>> From the comment on the PR, the following become irrelevant to the
>>> current PR.
>>>
>>> > SPARK-32017 (Make Pyspark Hadoop 3.2+ Variant available in PyPI)
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>>
>>>
>>> On Tue, Jun 23, 2020 at 12:09 AM Xiao Li <li...@databricks.com> wrote:
>>>
>>>>
>>>> Our monthly pypi downloads of PySpark have reached 5.4 million. We
>>>> should avoid forcing the current PySpark users to upgrade their Hadoop
>>>> versions. If we change the default, will it impact them? If YES, I think we
>>>> should not do it until it is ready and they have a workaround. So far, our
>>>> pypi downloads are still relying on our default version.
>>>>
>>>> Please correct me if my concern is not valid.
>>>>
>>>> Xiao
>>>>
>>>>
>>>> On Tue, Jun 23, 2020 at 12:04 AM Dongjoon Hyun <do...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi, All.
>>>>>
>>>>> I bump up this thread again with the title "Use Hadoop-3.2 as a
>>>>> default Hadoop profile in 3.1.0?"
>>>>> There exists some recent discussion on the following PR. Please let us
>>>>> know your thoughts.
>>>>>
>>>>> https://github.com/apache/spark/pull/28897
>>>>>
>>>>>
>>>>> Bests,
>>>>> Dongjoon.
>>>>>
>>>>>
>>>>> On Fri, Nov 1, 2019 at 9:41 AM Xiao Li <li...@databricks.com> wrote:
>>>>>
>>>>>> Hi, Steve,
>>>>>>
>>>>>> Thanks for your comments! My major quality concern is not against
>>>>>> Hadoop 3.2. In this release, Hive execution module upgrade [from 1.2 to
>>>>>> 2.3], Hive thrift-server upgrade, and JDK11 supports are added to Hadoop
>>>>>> 3.2 profile only. Compared with Hadoop 2.x profile, the Hadoop 3.2 profile
>>>>>> is more risky due to these changes.
>>>>>>
>>>>>> To speed up the adoption of Spark 3.0, which has many other highly
>>>>>> desirable features, I am proposing to keep Hadoop 2.x profile as the
>>>>>> default.
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Xiao.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Nov 1, 2019 at 5:33 AM Steve Loughran <st...@cloudera.com>
>>>>>> wrote:
>>>>>>
>>>>>>> What is the current default value? as the 2.x releases are becoming
>>>>>>> EOL; 2.7 is dead, there might be a 2.8.x; for now 2.9 is the branch-2
>>>>>>> release getting attention. 2.10.0 shipped yesterday, but the ".0" means
>>>>>>> there will inevitably be surprises.
>>>>>>>
>>>>>>> One issue about using a older versions is that any problem reported
>>>>>>> -especially at stack traces you can blame me for- Will generally be met by
>>>>>>> a response of "does it go away when you upgrade?" The other issue is how
>>>>>>> much test coverage are things getting?
>>>>>>>
>>>>>>> w.r.t Hadoop 3.2 stability, nothing major has been reported. The
>>>>>>> ABFS client is there, and I the big guava update (HADOOP-16213) went in.
>>>>>>> People will either love or hate that.
>>>>>>>
>>>>>>> No major changes in s3a code between 3.2.0 and 3.2.1; I have a large
>>>>>>> backport planned though, including changes to better handle AWS caching of
>>>>>>> 404s generatd from HEAD requests before an object was actually created.
>>>>>>>
>>>>>>> It would be really good if the spark distributions shipped with
>>>>>>> later versions of the hadoop artifacts.
>>>>>>>
>>>>>>> On Mon, Oct 28, 2019 at 7:53 PM Xiao Li <li...@databricks.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> The stability and quality of Hadoop 3.2 profile are unknown. The
>>>>>>>> changes are massive, including Hive execution and a new version of Hive
>>>>>>>> thriftserver.
>>>>>>>>
>>>>>>>> To reduce the risk, I would like to keep the current default
>>>>>>>> version unchanged. When it becomes stable, we can change the default
>>>>>>>> profile to Hadoop-3.2.
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>>
>>>>>>>> Xiao
>>>>>>>>
>>>>>>>> On Mon, Oct 28, 2019 at 12:51 PM Sean Owen <sr...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I'm OK with that, but don't have a strong opinion nor info about
>>>>>>>>> the
>>>>>>>>> implications.
>>>>>>>>> That said my guess is we're close to the point where we don't need
>>>>>>>>> to
>>>>>>>>> support Hadoop 2.x anyway, so, yeah.
>>>>>>>>>
>>>>>>>>> On Mon, Oct 28, 2019 at 2:33 PM Dongjoon Hyun <
>>>>>>>>> dongjoon.hyun@gmail.com> wrote:
>>>>>>>>> >
>>>>>>>>> > Hi, All.
>>>>>>>>> >
>>>>>>>>> > There was a discussion on publishing artifacts built with Hadoop
>>>>>>>>> 3 .
>>>>>>>>> > But, we are still publishing with Hadoop 2.7.3 and `3.0-preview`
>>>>>>>>> will be the same because we didn't change anything yet.
>>>>>>>>> >
>>>>>>>>> > Technically, we need to change two places for publishing.
>>>>>>>>> >
>>>>>>>>> > 1. Jenkins Snapshot Publishing
>>>>>>>>> >
>>>>>>>>> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/
>>>>>>>>> >
>>>>>>>>> > 2. Release Snapshot/Release Publishing
>>>>>>>>> >
>>>>>>>>> https://github.com/apache/spark/blob/master/dev/create-release/release-build.sh
>>>>>>>>> >
>>>>>>>>> > To minimize the change, we need to switch our default Hadoop
>>>>>>>>> profile.
>>>>>>>>> >
>>>>>>>>> > Currently, the default is `hadoop-2.7 (2.7.4)` profile and
>>>>>>>>> `hadoop-3.2 (3.2.0)` is optional.
>>>>>>>>> > We had better use `hadoop-3.2` profile by default and
>>>>>>>>> `hadoop-2.7` optionally.
>>>>>>>>> >
>>>>>>>>> > Note that this means we use Hive 2.3.6 by default. Only
>>>>>>>>> `hadoop-2.7` distribution will use `Hive 1.2.1` like Apache Spark 2.4.x.
>>>>>>>>> >
>>>>>>>>> > Bests,
>>>>>>>>> > Dongjoon.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> [image: Databricks Summit - Watch the talks]
>>>>>>>> <https://databricks.com/sparkaisummit/north-america>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> [image: Databricks Summit - Watch the talks]
>>>>>> <https://databricks.com/sparkaisummit/north-america>
>>>>>>
>>>>>
>>>>
>>>> --
>>>> <https://databricks.com/sparkaisummit/north-america>
>>>>
>>>
>>
>> --
>> <https://databricks.com/sparkaisummit/north-america>
>>
>

-- 
<https://databricks.com/sparkaisummit/north-america>

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

Posted by Dongjoon Hyun <do...@gmail.com>.

I fully understand your concern, but we cannot live with Hadoop 2.7.4
forever, Xiao. Like Hadoop 2.6, we should let it go.

So, are you saying that CRAN/PyPy should have all combination of Apache
Spark including Hive 1.2 distribution?

What is your suggestion as a PMC on Hadoop 3.2 migration path? I'd love to
remove the road blocks for that.

As a side note, Homebrew is not Apache Spark official channel, but it's
also popular distribution channel in the community. And, it's using Hadoop
3.2 distribution already. Hadoop 2.7 is too old for Year 2021 (Apache Spark
3.1), isn't it?

Bests,
Dongjoon.



On Tue, Jun 23, 2020 at 7:55 PM Xiao Li <li...@databricks.com> wrote:

> Then, it will be a little complex after this PR. It might make the
> community more confused.
>
> In PYPI and CRAN, we are using Hadoop 2.7 as the default profile; however,
> in the other distributions, we are using Hadoop 3.2 as the default?
>
> How to explain this to the community? I would not change the default for
> consistency.
>
> Xiao
>
>
>
> On Tue, Jun 23, 2020 at 7:18 PM Dongjoon Hyun <do...@gmail.com>
> wrote:
>
>> Thanks. Uploading PySpark to PyPI is a simple manual step and our release
>> script is able to build PySpark with Hadoop 2.7 still if we want.
>> So, `No` for the following question. I updated my PR according to your
>> comment.
>>
>> > If we change the default, will it impact them? If YES,...
>>
>> From the comment on the PR, the following become irrelevant to the
>> current PR.
>>
>> > SPARK-32017 (Make Pyspark Hadoop 3.2+ Variant available in PyPI)
>>
>> Bests,
>> Dongjoon.
>>
>>
>>
>>
>> On Tue, Jun 23, 2020 at 12:09 AM Xiao Li <li...@databricks.com> wrote:
>>
>>>
>>> Our monthly pypi downloads of PySpark have reached 5.4 million. We
>>> should avoid forcing the current PySpark users to upgrade their Hadoop
>>> versions. If we change the default, will it impact them? If YES, I think we
>>> should not do it until it is ready and they have a workaround. So far, our
>>> pypi downloads are still relying on our default version.
>>>
>>> Please correct me if my concern is not valid.
>>>
>>> Xiao
>>>
>>>
>>> On Tue, Jun 23, 2020 at 12:04 AM Dongjoon Hyun <do...@gmail.com>
>>> wrote:
>>>
>>>> Hi, All.
>>>>
>>>> I bump up this thread again with the title "Use Hadoop-3.2 as a default
>>>> Hadoop profile in 3.1.0?"
>>>> There exists some recent discussion on the following PR. Please let us
>>>> know your thoughts.
>>>>
>>>> https://github.com/apache/spark/pull/28897
>>>>
>>>>
>>>> Bests,
>>>> Dongjoon.
>>>>
>>>>
>>>> On Fri, Nov 1, 2019 at 9:41 AM Xiao Li <li...@databricks.com> wrote:
>>>>
>>>>> Hi, Steve,
>>>>>
>>>>> Thanks for your comments! My major quality concern is not against
>>>>> Hadoop 3.2. In this release, Hive execution module upgrade [from 1.2 to
>>>>> 2.3], Hive thrift-server upgrade, and JDK11 supports are added to Hadoop
>>>>> 3.2 profile only. Compared with Hadoop 2.x profile, the Hadoop 3.2 profile
>>>>> is more risky due to these changes.
>>>>>
>>>>> To speed up the adoption of Spark 3.0, which has many other highly
>>>>> desirable features, I am proposing to keep Hadoop 2.x profile as the
>>>>> default.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Xiao.
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Nov 1, 2019 at 5:33 AM Steve Loughran <st...@cloudera.com>
>>>>> wrote:
>>>>>
>>>>>> What is the current default value? as the 2.x releases are becoming
>>>>>> EOL; 2.7 is dead, there might be a 2.8.x; for now 2.9 is the branch-2
>>>>>> release getting attention. 2.10.0 shipped yesterday, but the ".0" means
>>>>>> there will inevitably be surprises.
>>>>>>
>>>>>> One issue about using a older versions is that any problem reported
>>>>>> -especially at stack traces you can blame me for- Will generally be met by
>>>>>> a response of "does it go away when you upgrade?" The other issue is how
>>>>>> much test coverage are things getting?
>>>>>>
>>>>>> w.r.t Hadoop 3.2 stability, nothing major has been reported. The ABFS
>>>>>> client is there, and I the big guava update (HADOOP-16213) went in. People
>>>>>> will either love or hate that.
>>>>>>
>>>>>> No major changes in s3a code between 3.2.0 and 3.2.1; I have a large
>>>>>> backport planned though, including changes to better handle AWS caching of
>>>>>> 404s generatd from HEAD requests before an object was actually created.
>>>>>>
>>>>>> It would be really good if the spark distributions shipped with later
>>>>>> versions of the hadoop artifacts.
>>>>>>
>>>>>> On Mon, Oct 28, 2019 at 7:53 PM Xiao Li <li...@databricks.com>
>>>>>> wrote:
>>>>>>
>>>>>>> The stability and quality of Hadoop 3.2 profile are unknown. The
>>>>>>> changes are massive, including Hive execution and a new version of Hive
>>>>>>> thriftserver.
>>>>>>>
>>>>>>> To reduce the risk, I would like to keep the current default version
>>>>>>> unchanged. When it becomes stable, we can change the default profile to
>>>>>>> Hadoop-3.2.
>>>>>>>
>>>>>>> Cheers,
>>>>>>>
>>>>>>> Xiao
>>>>>>>
>>>>>>> On Mon, Oct 28, 2019 at 12:51 PM Sean Owen <sr...@gmail.com> wrote:
>>>>>>>
>>>>>>>> I'm OK with that, but don't have a strong opinion nor info about the
>>>>>>>> implications.
>>>>>>>> That said my guess is we're close to the point where we don't need
>>>>>>>> to
>>>>>>>> support Hadoop 2.x anyway, so, yeah.
>>>>>>>>
>>>>>>>> On Mon, Oct 28, 2019 at 2:33 PM Dongjoon Hyun <
>>>>>>>> dongjoon.hyun@gmail.com> wrote:
>>>>>>>> >
>>>>>>>> > Hi, All.
>>>>>>>> >
>>>>>>>> > There was a discussion on publishing artifacts built with Hadoop
>>>>>>>> 3 .
>>>>>>>> > But, we are still publishing with Hadoop 2.7.3 and `3.0-preview`
>>>>>>>> will be the same because we didn't change anything yet.
>>>>>>>> >
>>>>>>>> > Technically, we need to change two places for publishing.
>>>>>>>> >
>>>>>>>> > 1. Jenkins Snapshot Publishing
>>>>>>>> >
>>>>>>>> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/
>>>>>>>> >
>>>>>>>> > 2. Release Snapshot/Release Publishing
>>>>>>>> >
>>>>>>>> https://github.com/apache/spark/blob/master/dev/create-release/release-build.sh
>>>>>>>> >
>>>>>>>> > To minimize the change, we need to switch our default Hadoop
>>>>>>>> profile.
>>>>>>>> >
>>>>>>>> > Currently, the default is `hadoop-2.7 (2.7.4)` profile and
>>>>>>>> `hadoop-3.2 (3.2.0)` is optional.
>>>>>>>> > We had better use `hadoop-3.2` profile by default and
>>>>>>>> `hadoop-2.7` optionally.
>>>>>>>> >
>>>>>>>> > Note that this means we use Hive 2.3.6 by default. Only
>>>>>>>> `hadoop-2.7` distribution will use `Hive 1.2.1` like Apache Spark 2.4.x.
>>>>>>>> >
>>>>>>>> > Bests,
>>>>>>>> > Dongjoon.
>>>>>>>>
>>>>>>>>
>>>>>>>> ---------------------------------------------------------------------
>>>>>>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> [image: Databricks Summit - Watch the talks]
>>>>>>> <https://databricks.com/sparkaisummit/north-america>
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> [image: Databricks Summit - Watch the talks]
>>>>> <https://databricks.com/sparkaisummit/north-america>
>>>>>
>>>>
>>>
>>> --
>>> <https://databricks.com/sparkaisummit/north-america>
>>>
>>
>
> --
> <https://databricks.com/sparkaisummit/north-america>
>

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

Posted by Sean Owen <sr...@gmail.com>.

So, we also release Spark binary distros with Hadoop 2.7, 3.2, and no
Hadoop -- all of the options. Picking one profile or the other to release
with pypi etc isn't more or less consistent with those releases, as all
exist.

Is this change only about the source code default, with no effect on
planned releases for 3.1.x, etc? I get that this affects what you get if
you build from source, but, the concern wasn't about that audience, but
what pypi users get, which does not change, right?

Although you could also say, why bother -- who cares what the default is --
I do think we need to be moving away from multiple Hadoop and Hive
profiles, and for the audience who this would impact at all, developers,
probably OK to start lightly pushing by changing defaults?

I can't feel strongly about it at this point; we're not debating changing
any mass-consumption artifacts. So, I'd not object to it either.



On Tue, Jun 23, 2020 at 9:55 PM Xiao Li <li...@databricks.com> wrote:

> Then, it will be a little complex after this PR. It might make the
> community more confused.
>
> In PYPI and CRAN, we are using Hadoop 2.7 as the default profile; however,
> in the other distributions, we are using Hadoop 3.2 as the default?
>
> How to explain this to the community? I would not change the default for
> consistency.
>
> Xiao
>
>
>
> On Tue, Jun 23, 2020 at 7:18 PM Dongjoon Hyun <do...@gmail.com>
> wrote:
>
>> Thanks. Uploading PySpark to PyPI is a simple manual step and our release
>> script is able to build PySpark with Hadoop 2.7 still if we want.
>> So, `No` for the following question. I updated my PR according to your
>> comment.
>>
>> > If we change the default, will it impact them? If YES,...
>>
>> From the comment on the PR, the following become irrelevant to the
>> current PR.
>>
>> > SPARK-32017 (Make Pyspark Hadoop 3.2+ Variant available in PyPI)
>>
>> Bests,
>> Dongjoon.
>>
>>
>>
>>
>> On Tue, Jun 23, 2020 at 12:09 AM Xiao Li <li...@databricks.com> wrote:
>>
>>>
>>> Our monthly pypi downloads of PySpark have reached 5.4 million. We
>>> should avoid forcing the current PySpark users to upgrade their Hadoop
>>> versions. If we change the default, will it impact them? If YES, I think we
>>> should not do it until it is ready and they have a workaround. So far, our
>>> pypi downloads are still relying on our default version.
>>>
>>> Please correct me if my concern is not valid.
>>>
>>> Xiao
>>>
>>>
>>> On Tue, Jun 23, 2020 at 12:04 AM Dongjoon Hyun <do...@gmail.com>
>>> wrote:
>>>
>>>> Hi, All.
>>>>
>>>> I bump up this thread again with the title "Use Hadoop-3.2 as a default
>>>> Hadoop profile in 3.1.0?"
>>>> There exists some recent discussion on the following PR. Please let us
>>>> know your thoughts.
>>>>
>>>> https://github.com/apache/spark/pull/28897
>>>>
>>>>
>>>> Bests,
>>>> Dongjoon.
>>>>
>>>>
>>>> On Fri, Nov 1, 2019 at 9:41 AM Xiao Li <li...@databricks.com> wrote:
>>>>
>>>>> Hi, Steve,
>>>>>
>>>>> Thanks for your comments! My major quality concern is not against
>>>>> Hadoop 3.2. In this release, Hive execution module upgrade [from 1.2 to
>>>>> 2.3], Hive thrift-server upgrade, and JDK11 supports are added to Hadoop
>>>>> 3.2 profile only. Compared with Hadoop 2.x profile, the Hadoop 3.2 profile
>>>>> is more risky due to these changes.
>>>>>
>>>>> To speed up the adoption of Spark 3.0, which has many other highly
>>>>> desirable features, I am proposing to keep Hadoop 2.x profile as the
>>>>> default.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Xiao.
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Nov 1, 2019 at 5:33 AM Steve Loughran <st...@cloudera.com>
>>>>> wrote:
>>>>>
>>>>>> What is the current default value? as the 2.x releases are becoming
>>>>>> EOL; 2.7 is dead, there might be a 2.8.x; for now 2.9 is the branch-2
>>>>>> release getting attention. 2.10.0 shipped yesterday, but the ".0" means
>>>>>> there will inevitably be surprises.
>>>>>>
>>>>>> One issue about using a older versions is that any problem reported
>>>>>> -especially at stack traces you can blame me for- Will generally be met by
>>>>>> a response of "does it go away when you upgrade?" The other issue is how
>>>>>> much test coverage are things getting?
>>>>>>
>>>>>> w.r.t Hadoop 3.2 stability, nothing major has been reported. The ABFS
>>>>>> client is there, and I the big guava update (HADOOP-16213) went in. People
>>>>>> will either love or hate that.
>>>>>>
>>>>>> No major changes in s3a code between 3.2.0 and 3.2.1; I have a large
>>>>>> backport planned though, including changes to better handle AWS caching of
>>>>>> 404s generatd from HEAD requests before an object was actually created.
>>>>>>
>>>>>> It would be really good if the spark distributions shipped with later
>>>>>> versions of the hadoop artifacts.
>>>>>>
>>>>>> On Mon, Oct 28, 2019 at 7:53 PM Xiao Li <li...@databricks.com>
>>>>>> wrote:
>>>>>>
>>>>>>> The stability and quality of Hadoop 3.2 profile are unknown. The
>>>>>>> changes are massive, including Hive execution and a new version of Hive
>>>>>>> thriftserver.
>>>>>>>
>>>>>>> To reduce the risk, I would like to keep the current default version
>>>>>>> unchanged. When it becomes stable, we can change the default profile to
>>>>>>> Hadoop-3.2.
>>>>>>>
>>>>>>> Cheers,
>>>>>>>
>>>>>>> Xiao
>>>>>>>
>>>>>>> On Mon, Oct 28, 2019 at 12:51 PM Sean Owen <sr...@gmail.com> wrote:
>>>>>>>
>>>>>>>> I'm OK with that, but don't have a strong opinion nor info about the
>>>>>>>> implications.
>>>>>>>> That said my guess is we're close to the point where we don't need
>>>>>>>> to
>>>>>>>> support Hadoop 2.x anyway, so, yeah.
>>>>>>>>
>>>>>>>> On Mon, Oct 28, 2019 at 2:33 PM Dongjoon Hyun <
>>>>>>>> dongjoon.hyun@gmail.com> wrote:
>>>>>>>> >
>>>>>>>> > Hi, All.
>>>>>>>> >
>>>>>>>> > There was a discussion on publishing artifacts built with Hadoop
>>>>>>>> 3 .
>>>>>>>> > But, we are still publishing with Hadoop 2.7.3 and `3.0-preview`
>>>>>>>> will be the same because we didn't change anything yet.
>>>>>>>> >
>>>>>>>> > Technically, we need to change two places for publishing.
>>>>>>>> >
>>>>>>>> > 1. Jenkins Snapshot Publishing
>>>>>>>> >
>>>>>>>> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/
>>>>>>>> >
>>>>>>>> > 2. Release Snapshot/Release Publishing
>>>>>>>> >
>>>>>>>> https://github.com/apache/spark/blob/master/dev/create-release/release-build.sh
>>>>>>>> >
>>>>>>>> > To minimize the change, we need to switch our default Hadoop
>>>>>>>> profile.
>>>>>>>> >
>>>>>>>> > Currently, the default is `hadoop-2.7 (2.7.4)` profile and
>>>>>>>> `hadoop-3.2 (3.2.0)` is optional.
>>>>>>>> > We had better use `hadoop-3.2` profile by default and
>>>>>>>> `hadoop-2.7` optionally.
>>>>>>>> >
>>>>>>>> > Note that this means we use Hive 2.3.6 by default. Only
>>>>>>>> `hadoop-2.7` distribution will use `Hive 1.2.1` like Apache Spark 2.4.x.
>>>>>>>> >
>>>>>>>> > Bests,
>>>>>>>> > Dongjoon.
>>>>>>>>
>>>>>>>>
>>>>>>>> ---------------------------------------------------------------------
>>>>>>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> [image: Databricks Summit - Watch the talks]
>>>>>>> <https://databricks.com/sparkaisummit/north-america>
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> [image: Databricks Summit - Watch the talks]
>>>>> <https://databricks.com/sparkaisummit/north-america>
>>>>>
>>>>
>>>
>>> --
>>> <https://databricks.com/sparkaisummit/north-america>
>>>
>>
>
> --
> <https://databricks.com/sparkaisummit/north-america>
>

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

Posted by Xiao Li <li...@databricks.com>.

Then, it will be a little complex after this PR. It might make the
community more confused.

In PYPI and CRAN, we are using Hadoop 2.7 as the default profile; however,
in the other distributions, we are using Hadoop 3.2 as the default?

How to explain this to the community? I would not change the default for
consistency.

Xiao



On Tue, Jun 23, 2020 at 7:18 PM Dongjoon Hyun <do...@gmail.com>
wrote:

> Thanks. Uploading PySpark to PyPI is a simple manual step and our release
> script is able to build PySpark with Hadoop 2.7 still if we want.
> So, `No` for the following question. I updated my PR according to your
> comment.
>
> > If we change the default, will it impact them? If YES,...
>
> From the comment on the PR, the following become irrelevant to the current
> PR.
>
> > SPARK-32017 (Make Pyspark Hadoop 3.2+ Variant available in PyPI)
>
> Bests,
> Dongjoon.
>
>
>
>
> On Tue, Jun 23, 2020 at 12:09 AM Xiao Li <li...@databricks.com> wrote:
>
>>
>> Our monthly pypi downloads of PySpark have reached 5.4 million. We should
>> avoid forcing the current PySpark users to upgrade their Hadoop versions.
>> If we change the default, will it impact them? If YES, I think we should
>> not do it until it is ready and they have a workaround. So far, our pypi
>> downloads are still relying on our default version.
>>
>> Please correct me if my concern is not valid.
>>
>> Xiao
>>
>>
>> On Tue, Jun 23, 2020 at 12:04 AM Dongjoon Hyun <do...@gmail.com>
>> wrote:
>>
>>> Hi, All.
>>>
>>> I bump up this thread again with the title "Use Hadoop-3.2 as a default
>>> Hadoop profile in 3.1.0?"
>>> There exists some recent discussion on the following PR. Please let us
>>> know your thoughts.
>>>
>>> https://github.com/apache/spark/pull/28897
>>>
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>> On Fri, Nov 1, 2019 at 9:41 AM Xiao Li <li...@databricks.com> wrote:
>>>
>>>> Hi, Steve,
>>>>
>>>> Thanks for your comments! My major quality concern is not against
>>>> Hadoop 3.2. In this release, Hive execution module upgrade [from 1.2 to
>>>> 2.3], Hive thrift-server upgrade, and JDK11 supports are added to Hadoop
>>>> 3.2 profile only. Compared with Hadoop 2.x profile, the Hadoop 3.2 profile
>>>> is more risky due to these changes.
>>>>
>>>> To speed up the adoption of Spark 3.0, which has many other highly
>>>> desirable features, I am proposing to keep Hadoop 2.x profile as the
>>>> default.
>>>>
>>>> Cheers,
>>>>
>>>> Xiao.
>>>>
>>>>
>>>>
>>>> On Fri, Nov 1, 2019 at 5:33 AM Steve Loughran <st...@cloudera.com>
>>>> wrote:
>>>>
>>>>> What is the current default value? as the 2.x releases are becoming
>>>>> EOL; 2.7 is dead, there might be a 2.8.x; for now 2.9 is the branch-2
>>>>> release getting attention. 2.10.0 shipped yesterday, but the ".0" means
>>>>> there will inevitably be surprises.
>>>>>
>>>>> One issue about using a older versions is that any problem reported
>>>>> -especially at stack traces you can blame me for- Will generally be met by
>>>>> a response of "does it go away when you upgrade?" The other issue is how
>>>>> much test coverage are things getting?
>>>>>
>>>>> w.r.t Hadoop 3.2 stability, nothing major has been reported. The ABFS
>>>>> client is there, and I the big guava update (HADOOP-16213) went in. People
>>>>> will either love or hate that.
>>>>>
>>>>> No major changes in s3a code between 3.2.0 and 3.2.1; I have a large
>>>>> backport planned though, including changes to better handle AWS caching of
>>>>> 404s generatd from HEAD requests before an object was actually created.
>>>>>
>>>>> It would be really good if the spark distributions shipped with later
>>>>> versions of the hadoop artifacts.
>>>>>
>>>>> On Mon, Oct 28, 2019 at 7:53 PM Xiao Li <li...@databricks.com> wrote:
>>>>>
>>>>>> The stability and quality of Hadoop 3.2 profile are unknown. The
>>>>>> changes are massive, including Hive execution and a new version of Hive
>>>>>> thriftserver.
>>>>>>
>>>>>> To reduce the risk, I would like to keep the current default version
>>>>>> unchanged. When it becomes stable, we can change the default profile to
>>>>>> Hadoop-3.2.
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Xiao
>>>>>>
>>>>>> On Mon, Oct 28, 2019 at 12:51 PM Sean Owen <sr...@gmail.com> wrote:
>>>>>>
>>>>>>> I'm OK with that, but don't have a strong opinion nor info about the
>>>>>>> implications.
>>>>>>> That said my guess is we're close to the point where we don't need to
>>>>>>> support Hadoop 2.x anyway, so, yeah.
>>>>>>>
>>>>>>> On Mon, Oct 28, 2019 at 2:33 PM Dongjoon Hyun <
>>>>>>> dongjoon.hyun@gmail.com> wrote:
>>>>>>> >
>>>>>>> > Hi, All.
>>>>>>> >
>>>>>>> > There was a discussion on publishing artifacts built with Hadoop 3
>>>>>>> .
>>>>>>> > But, we are still publishing with Hadoop 2.7.3 and `3.0-preview`
>>>>>>> will be the same because we didn't change anything yet.
>>>>>>> >
>>>>>>> > Technically, we need to change two places for publishing.
>>>>>>> >
>>>>>>> > 1. Jenkins Snapshot Publishing
>>>>>>> >
>>>>>>> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/
>>>>>>> >
>>>>>>> > 2. Release Snapshot/Release Publishing
>>>>>>> >
>>>>>>> https://github.com/apache/spark/blob/master/dev/create-release/release-build.sh
>>>>>>> >
>>>>>>> > To minimize the change, we need to switch our default Hadoop
>>>>>>> profile.
>>>>>>> >
>>>>>>> > Currently, the default is `hadoop-2.7 (2.7.4)` profile and
>>>>>>> `hadoop-3.2 (3.2.0)` is optional.
>>>>>>> > We had better use `hadoop-3.2` profile by default and `hadoop-2.7`
>>>>>>> optionally.
>>>>>>> >
>>>>>>> > Note that this means we use Hive 2.3.6 by default. Only
>>>>>>> `hadoop-2.7` distribution will use `Hive 1.2.1` like Apache Spark 2.4.x.
>>>>>>> >
>>>>>>> > Bests,
>>>>>>> > Dongjoon.
>>>>>>>
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> [image: Databricks Summit - Watch the talks]
>>>>>> <https://databricks.com/sparkaisummit/north-america>
>>>>>>
>>>>>
>>>>
>>>> --
>>>> [image: Databricks Summit - Watch the talks]
>>>> <https://databricks.com/sparkaisummit/north-america>
>>>>
>>>
>>
>> --
>> <https://databricks.com/sparkaisummit/north-america>
>>
>

-- 
<https://databricks.com/sparkaisummit/north-america>

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

Posted by Dongjoon Hyun <do...@gmail.com>.

Thanks. Uploading PySpark to PyPI is a simple manual step and our release
script is able to build PySpark with Hadoop 2.7 still if we want.
So, `No` for the following question. I updated my PR according to your
comment.

> If we change the default, will it impact them? If YES,...

From the comment on the PR, the following become irrelevant to the current
PR.

> SPARK-32017 (Make Pyspark Hadoop 3.2+ Variant available in PyPI)

Bests,
Dongjoon.




On Tue, Jun 23, 2020 at 12:09 AM Xiao Li <li...@databricks.com> wrote:

>
> Our monthly pypi downloads of PySpark have reached 5.4 million. We should
> avoid forcing the current PySpark users to upgrade their Hadoop versions.
> If we change the default, will it impact them? If YES, I think we should
> not do it until it is ready and they have a workaround. So far, our pypi
> downloads are still relying on our default version.
>
> Please correct me if my concern is not valid.
>
> Xiao
>
>
> On Tue, Jun 23, 2020 at 12:04 AM Dongjoon Hyun <do...@gmail.com>
> wrote:
>
>> Hi, All.
>>
>> I bump up this thread again with the title "Use Hadoop-3.2 as a default
>> Hadoop profile in 3.1.0?"
>> There exists some recent discussion on the following PR. Please let us
>> know your thoughts.
>>
>> https://github.com/apache/spark/pull/28897
>>
>>
>> Bests,
>> Dongjoon.
>>
>>
>> On Fri, Nov 1, 2019 at 9:41 AM Xiao Li <li...@databricks.com> wrote:
>>
>>> Hi, Steve,
>>>
>>> Thanks for your comments! My major quality concern is not against Hadoop
>>> 3.2. In this release, Hive execution module upgrade [from 1.2 to 2.3], Hive
>>> thrift-server upgrade, and JDK11 supports are added to Hadoop 3.2 profile
>>> only. Compared with Hadoop 2.x profile, the Hadoop 3.2 profile is more
>>> risky due to these changes.
>>>
>>> To speed up the adoption of Spark 3.0, which has many other highly
>>> desirable features, I am proposing to keep Hadoop 2.x profile as the
>>> default.
>>>
>>> Cheers,
>>>
>>> Xiao.
>>>
>>>
>>>
>>> On Fri, Nov 1, 2019 at 5:33 AM Steve Loughran <st...@cloudera.com>
>>> wrote:
>>>
>>>> What is the current default value? as the 2.x releases are becoming
>>>> EOL; 2.7 is dead, there might be a 2.8.x; for now 2.9 is the branch-2
>>>> release getting attention. 2.10.0 shipped yesterday, but the ".0" means
>>>> there will inevitably be surprises.
>>>>
>>>> One issue about using a older versions is that any problem reported
>>>> -especially at stack traces you can blame me for- Will generally be met by
>>>> a response of "does it go away when you upgrade?" The other issue is how
>>>> much test coverage are things getting?
>>>>
>>>> w.r.t Hadoop 3.2 stability, nothing major has been reported. The ABFS
>>>> client is there, and I the big guava update (HADOOP-16213) went in. People
>>>> will either love or hate that.
>>>>
>>>> No major changes in s3a code between 3.2.0 and 3.2.1; I have a large
>>>> backport planned though, including changes to better handle AWS caching of
>>>> 404s generatd from HEAD requests before an object was actually created.
>>>>
>>>> It would be really good if the spark distributions shipped with later
>>>> versions of the hadoop artifacts.
>>>>
>>>> On Mon, Oct 28, 2019 at 7:53 PM Xiao Li <li...@databricks.com> wrote:
>>>>
>>>>> The stability and quality of Hadoop 3.2 profile are unknown. The
>>>>> changes are massive, including Hive execution and a new version of Hive
>>>>> thriftserver.
>>>>>
>>>>> To reduce the risk, I would like to keep the current default version
>>>>> unchanged. When it becomes stable, we can change the default profile to
>>>>> Hadoop-3.2.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Xiao
>>>>>
>>>>> On Mon, Oct 28, 2019 at 12:51 PM Sean Owen <sr...@gmail.com> wrote:
>>>>>
>>>>>> I'm OK with that, but don't have a strong opinion nor info about the
>>>>>> implications.
>>>>>> That said my guess is we're close to the point where we don't need to
>>>>>> support Hadoop 2.x anyway, so, yeah.
>>>>>>
>>>>>> On Mon, Oct 28, 2019 at 2:33 PM Dongjoon Hyun <
>>>>>> dongjoon.hyun@gmail.com> wrote:
>>>>>> >
>>>>>> > Hi, All.
>>>>>> >
>>>>>> > There was a discussion on publishing artifacts built with Hadoop 3 .
>>>>>> > But, we are still publishing with Hadoop 2.7.3 and `3.0-preview`
>>>>>> will be the same because we didn't change anything yet.
>>>>>> >
>>>>>> > Technically, we need to change two places for publishing.
>>>>>> >
>>>>>> > 1. Jenkins Snapshot Publishing
>>>>>> >
>>>>>> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/
>>>>>> >
>>>>>> > 2. Release Snapshot/Release Publishing
>>>>>> >
>>>>>> https://github.com/apache/spark/blob/master/dev/create-release/release-build.sh
>>>>>> >
>>>>>> > To minimize the change, we need to switch our default Hadoop
>>>>>> profile.
>>>>>> >
>>>>>> > Currently, the default is `hadoop-2.7 (2.7.4)` profile and
>>>>>> `hadoop-3.2 (3.2.0)` is optional.
>>>>>> > We had better use `hadoop-3.2` profile by default and `hadoop-2.7`
>>>>>> optionally.
>>>>>> >
>>>>>> > Note that this means we use Hive 2.3.6 by default. Only
>>>>>> `hadoop-2.7` distribution will use `Hive 1.2.1` like Apache Spark 2.4.x.
>>>>>> >
>>>>>> > Bests,
>>>>>> > Dongjoon.
>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> [image: Databricks Summit - Watch the talks]
>>>>> <https://databricks.com/sparkaisummit/north-america>
>>>>>
>>>>
>>>
>>> --
>>> [image: Databricks Summit - Watch the talks]
>>> <https://databricks.com/sparkaisummit/north-america>
>>>
>>
>
> --
> <https://databricks.com/sparkaisummit/north-america>
>

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

Posted by Xiao Li <li...@databricks.com>.

Our monthly pypi downloads of PySpark have reached 5.4 million. We should
avoid forcing the current PySpark users to upgrade their Hadoop versions.
If we change the default, will it impact them? If YES, I think we should
not do it until it is ready and they have a workaround. So far, our pypi
downloads are still relying on our default version.

Please correct me if my concern is not valid.

Xiao


On Tue, Jun 23, 2020 at 12:04 AM Dongjoon Hyun <do...@gmail.com>
wrote:

> Hi, All.
>
> I bump up this thread again with the title "Use Hadoop-3.2 as a default
> Hadoop profile in 3.1.0?"
> There exists some recent discussion on the following PR. Please let us
> know your thoughts.
>
> https://github.com/apache/spark/pull/28897
>
>
> Bests,
> Dongjoon.
>
>
> On Fri, Nov 1, 2019 at 9:41 AM Xiao Li <li...@databricks.com> wrote:
>
>> Hi, Steve,
>>
>> Thanks for your comments! My major quality concern is not against Hadoop
>> 3.2. In this release, Hive execution module upgrade [from 1.2 to 2.3], Hive
>> thrift-server upgrade, and JDK11 supports are added to Hadoop 3.2 profile
>> only. Compared with Hadoop 2.x profile, the Hadoop 3.2 profile is more
>> risky due to these changes.
>>
>> To speed up the adoption of Spark 3.0, which has many other highly
>> desirable features, I am proposing to keep Hadoop 2.x profile as the
>> default.
>>
>> Cheers,
>>
>> Xiao.
>>
>>
>>
>> On Fri, Nov 1, 2019 at 5:33 AM Steve Loughran <st...@cloudera.com>
>> wrote:
>>
>>> What is the current default value? as the 2.x releases are becoming EOL;
>>> 2.7 is dead, there might be a 2.8.x; for now 2.9 is the branch-2 release
>>> getting attention. 2.10.0 shipped yesterday, but the ".0" means there will
>>> inevitably be surprises.
>>>
>>> One issue about using a older versions is that any problem reported
>>> -especially at stack traces you can blame me for- Will generally be met by
>>> a response of "does it go away when you upgrade?" The other issue is how
>>> much test coverage are things getting?
>>>
>>> w.r.t Hadoop 3.2 stability, nothing major has been reported. The ABFS
>>> client is there, and I the big guava update (HADOOP-16213) went in. People
>>> will either love or hate that.
>>>
>>> No major changes in s3a code between 3.2.0 and 3.2.1; I have a large
>>> backport planned though, including changes to better handle AWS caching of
>>> 404s generatd from HEAD requests before an object was actually created.
>>>
>>> It would be really good if the spark distributions shipped with later
>>> versions of the hadoop artifacts.
>>>
>>> On Mon, Oct 28, 2019 at 7:53 PM Xiao Li <li...@databricks.com> wrote:
>>>
>>>> The stability and quality of Hadoop 3.2 profile are unknown. The
>>>> changes are massive, including Hive execution and a new version of Hive
>>>> thriftserver.
>>>>
>>>> To reduce the risk, I would like to keep the current default version
>>>> unchanged. When it becomes stable, we can change the default profile to
>>>> Hadoop-3.2.
>>>>
>>>> Cheers,
>>>>
>>>> Xiao
>>>>
>>>> On Mon, Oct 28, 2019 at 12:51 PM Sean Owen <sr...@gmail.com> wrote:
>>>>
>>>>> I'm OK with that, but don't have a strong opinion nor info about the
>>>>> implications.
>>>>> That said my guess is we're close to the point where we don't need to
>>>>> support Hadoop 2.x anyway, so, yeah.
>>>>>
>>>>> On Mon, Oct 28, 2019 at 2:33 PM Dongjoon Hyun <do...@gmail.com>
>>>>> wrote:
>>>>> >
>>>>> > Hi, All.
>>>>> >
>>>>> > There was a discussion on publishing artifacts built with Hadoop 3 .
>>>>> > But, we are still publishing with Hadoop 2.7.3 and `3.0-preview`
>>>>> will be the same because we didn't change anything yet.
>>>>> >
>>>>> > Technically, we need to change two places for publishing.
>>>>> >
>>>>> > 1. Jenkins Snapshot Publishing
>>>>> >
>>>>> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/
>>>>> >
>>>>> > 2. Release Snapshot/Release Publishing
>>>>> >
>>>>> https://github.com/apache/spark/blob/master/dev/create-release/release-build.sh
>>>>> >
>>>>> > To minimize the change, we need to switch our default Hadoop profile.
>>>>> >
>>>>> > Currently, the default is `hadoop-2.7 (2.7.4)` profile and
>>>>> `hadoop-3.2 (3.2.0)` is optional.
>>>>> > We had better use `hadoop-3.2` profile by default and `hadoop-2.7`
>>>>> optionally.
>>>>> >
>>>>> > Note that this means we use Hive 2.3.6 by default. Only `hadoop-2.7`
>>>>> distribution will use `Hive 1.2.1` like Apache Spark 2.4.x.
>>>>> >
>>>>> > Bests,
>>>>> > Dongjoon.
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>>>
>>>>>
>>>>
>>>> --
>>>> [image: Databricks Summit - Watch the talks]
>>>> <https://databricks.com/sparkaisummit/north-america>
>>>>
>>>
>>
>> --
>> [image: Databricks Summit - Watch the talks]
>> <https://databricks.com/sparkaisummit/north-america>
>>
>

-- 
<https://databricks.com/sparkaisummit/north-america>

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

Posted by Dongjoon Hyun <do...@gmail.com>.

Hi, All.

I bump up this thread again with the title "Use Hadoop-3.2 as a default
Hadoop profile in 3.1.0?"
There exists some recent discussion on the following PR. Please let us know
your thoughts.

https://github.com/apache/spark/pull/28897


Bests,
Dongjoon.


On Fri, Nov 1, 2019 at 9:41 AM Xiao Li <li...@databricks.com> wrote:

> Hi, Steve,
>
> Thanks for your comments! My major quality concern is not against Hadoop
> 3.2. In this release, Hive execution module upgrade [from 1.2 to 2.3], Hive
> thrift-server upgrade, and JDK11 supports are added to Hadoop 3.2 profile
> only. Compared with Hadoop 2.x profile, the Hadoop 3.2 profile is more
> risky due to these changes.
>
> To speed up the adoption of Spark 3.0, which has many other highly
> desirable features, I am proposing to keep Hadoop 2.x profile as the
> default.
>
> Cheers,
>
> Xiao.
>
>
>
> On Fri, Nov 1, 2019 at 5:33 AM Steve Loughran <st...@cloudera.com> wrote:
>
>> What is the current default value? as the 2.x releases are becoming EOL;
>> 2.7 is dead, there might be a 2.8.x; for now 2.9 is the branch-2 release
>> getting attention. 2.10.0 shipped yesterday, but the ".0" means there will
>> inevitably be surprises.
>>
>> One issue about using a older versions is that any problem reported
>> -especially at stack traces you can blame me for- Will generally be met by
>> a response of "does it go away when you upgrade?" The other issue is how
>> much test coverage are things getting?
>>
>> w.r.t Hadoop 3.2 stability, nothing major has been reported. The ABFS
>> client is there, and I the big guava update (HADOOP-16213) went in. People
>> will either love or hate that.
>>
>> No major changes in s3a code between 3.2.0 and 3.2.1; I have a large
>> backport planned though, including changes to better handle AWS caching of
>> 404s generatd from HEAD requests before an object was actually created.
>>
>> It would be really good if the spark distributions shipped with later
>> versions of the hadoop artifacts.
>>
>> On Mon, Oct 28, 2019 at 7:53 PM Xiao Li <li...@databricks.com> wrote:
>>
>>> The stability and quality of Hadoop 3.2 profile are unknown. The changes
>>> are massive, including Hive execution and a new version of Hive
>>> thriftserver.
>>>
>>> To reduce the risk, I would like to keep the current default version
>>> unchanged. When it becomes stable, we can change the default profile to
>>> Hadoop-3.2.
>>>
>>> Cheers,
>>>
>>> Xiao
>>>
>>> On Mon, Oct 28, 2019 at 12:51 PM Sean Owen <sr...@gmail.com> wrote:
>>>
>>>> I'm OK with that, but don't have a strong opinion nor info about the
>>>> implications.
>>>> That said my guess is we're close to the point where we don't need to
>>>> support Hadoop 2.x anyway, so, yeah.
>>>>
>>>> On Mon, Oct 28, 2019 at 2:33 PM Dongjoon Hyun <do...@gmail.com>
>>>> wrote:
>>>> >
>>>> > Hi, All.
>>>> >
>>>> > There was a discussion on publishing artifacts built with Hadoop 3 .
>>>> > But, we are still publishing with Hadoop 2.7.3 and `3.0-preview` will
>>>> be the same because we didn't change anything yet.
>>>> >
>>>> > Technically, we need to change two places for publishing.
>>>> >
>>>> > 1. Jenkins Snapshot Publishing
>>>> >
>>>> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/
>>>> >
>>>> > 2. Release Snapshot/Release Publishing
>>>> >
>>>> https://github.com/apache/spark/blob/master/dev/create-release/release-build.sh
>>>> >
>>>> > To minimize the change, we need to switch our default Hadoop profile.
>>>> >
>>>> > Currently, the default is `hadoop-2.7 (2.7.4)` profile and
>>>> `hadoop-3.2 (3.2.0)` is optional.
>>>> > We had better use `hadoop-3.2` profile by default and `hadoop-2.7`
>>>> optionally.
>>>> >
>>>> > Note that this means we use Hive 2.3.6 by default. Only `hadoop-2.7`
>>>> distribution will use `Hive 1.2.1` like Apache Spark 2.4.x.
>>>> >
>>>> > Bests,
>>>> > Dongjoon.
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>>
>>>>
>>>
>>> --
>>> [image: Databricks Summit - Watch the talks]
>>> <https://databricks.com/sparkaisummit/north-america>
>>>
>>
>
> --
> [image: Databricks Summit - Watch the talks]
> <https://databricks.com/sparkaisummit/north-america>
>

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

Posted by Xiao Li <li...@databricks.com>.

The changes for JDK 11 supports are not increasing the risk of Hadoop 3.2
profile.

Hive 1.2.1 execution JARs are much more stable than Hive 2.3.6 execution
JARs. The changes of thrift-servers are massive. We need more evidence to
prove the quality and stability before we switching the default to Hadoop
3.2 profile. Adoption of Spark 3.0 is more important in the current moment.
I think we can switch the default profile in Spark 3.1 or 3.2 releases,
instead of Spark 3.0.


On Fri, Nov 1, 2019 at 6:21 PM Dongjoon Hyun <do...@gmail.com>
wrote:

> Hi, Xiao.
>
> How JDK11-support can make `Hadoop-3.2 profile` risky? We build and
> publish with JDK8.
>
> > In this release, Hive execution module upgrade [from 1.2 to 2.3], Hive
> thrift-server upgrade, and JDK11 supports are added to Hadoop 3.2 profile
> only.
>
> Since we build and publish with JDK8 and the default runtime is still
> JDK8, I don't think `hadoop-3.2 profile` is risky in that context.
>
> For JDK11, Hive execution module 2.3.6 doesn't support JDK11 still in
> terms of remote HiveMetastore.
>
> So, among the above reasons, we can say that Hive execution module (with
> Hive 2.3.6) can be the root cause of potential unknown issues.
>
> In other words, `Hive 1.2.1` is the one you think stable, isn't it?
>
> Although Hive 2.3.6 might be not proven in Apache Spark officially, we
> resolved several SPARK issues by upgrading Hive from 1.2.1 to 2.3.6 also.
>
> Bests,
> Dongjoon.
>
>
>
> On Fri, Nov 1, 2019 at 5:37 PM Jiaxin Shan <se...@gmail.com> wrote:
>
>> +1 for Hadoop 3.2.  Seems lots of cloud integration efforts Steve made is
>> only available in 3.2. We see lots of users asking for better S3A support
>> in Spark.
>>
>> On Fri, Nov 1, 2019 at 9:46 AM Xiao Li <li...@databricks.com> wrote:
>>
>>> Hi, Steve,
>>>
>>> Thanks for your comments! My major quality concern is not against Hadoop
>>> 3.2. In this release, Hive execution module upgrade [from 1.2 to 2.3], Hive
>>> thrift-server upgrade, and JDK11 supports are added to Hadoop 3.2 profile
>>> only. Compared with Hadoop 2.x profile, the Hadoop 3.2 profile is more
>>> risky due to these changes.
>>>
>>> To speed up the adoption of Spark 3.0, which has many other highly
>>> desirable features, I am proposing to keep Hadoop 2.x profile as the
>>> default.
>>>
>>> Cheers,
>>>
>>> Xiao.
>>>
>>>
>>>
>>> On Fri, Nov 1, 2019 at 5:33 AM Steve Loughran <st...@cloudera.com>
>>> wrote:
>>>
>>>> What is the current default value? as the 2.x releases are becoming
>>>> EOL; 2.7 is dead, there might be a 2.8.x; for now 2.9 is the branch-2
>>>> release getting attention. 2.10.0 shipped yesterday, but the ".0" means
>>>> there will inevitably be surprises.
>>>>
>>>> One issue about using a older versions is that any problem reported
>>>> -especially at stack traces you can blame me for- Will generally be met by
>>>> a response of "does it go away when you upgrade?" The other issue is how
>>>> much test coverage are things getting?
>>>>
>>>> w.r.t Hadoop 3.2 stability, nothing major has been reported. The ABFS
>>>> client is there, and I the big guava update (HADOOP-16213) went in. People
>>>> will either love or hate that.
>>>>
>>>> No major changes in s3a code between 3.2.0 and 3.2.1; I have a large
>>>> backport planned though, including changes to better handle AWS caching of
>>>> 404s generatd from HEAD requests before an object was actually created.
>>>>
>>>> It would be really good if the spark distributions shipped with later
>>>> versions of the hadoop artifacts.
>>>>
>>>> On Mon, Oct 28, 2019 at 7:53 PM Xiao Li <li...@databricks.com> wrote:
>>>>
>>>>> The stability and quality of Hadoop 3.2 profile are unknown. The
>>>>> changes are massive, including Hive execution and a new version of Hive
>>>>> thriftserver.
>>>>>
>>>>> To reduce the risk, I would like to keep the current default version
>>>>> unchanged. When it becomes stable, we can change the default profile to
>>>>> Hadoop-3.2.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Xiao
>>>>>
>>>>> On Mon, Oct 28, 2019 at 12:51 PM Sean Owen <sr...@gmail.com> wrote:
>>>>>
>>>>>> I'm OK with that, but don't have a strong opinion nor info about the
>>>>>> implications.
>>>>>> That said my guess is we're close to the point where we don't need to
>>>>>> support Hadoop 2.x anyway, so, yeah.
>>>>>>
>>>>>> On Mon, Oct 28, 2019 at 2:33 PM Dongjoon Hyun <
>>>>>> dongjoon.hyun@gmail.com> wrote:
>>>>>> >
>>>>>> > Hi, All.
>>>>>> >
>>>>>> > There was a discussion on publishing artifacts built with Hadoop 3 .
>>>>>> > But, we are still publishing with Hadoop 2.7.3 and `3.0-preview`
>>>>>> will be the same because we didn't change anything yet.
>>>>>> >
>>>>>> > Technically, we need to change two places for publishing.
>>>>>> >
>>>>>> > 1. Jenkins Snapshot Publishing
>>>>>> >
>>>>>> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/
>>>>>> >
>>>>>> > 2. Release Snapshot/Release Publishing
>>>>>> >
>>>>>> https://github.com/apache/spark/blob/master/dev/create-release/release-build.sh
>>>>>> >
>>>>>> > To minimize the change, we need to switch our default Hadoop
>>>>>> profile.
>>>>>> >
>>>>>> > Currently, the default is `hadoop-2.7 (2.7.4)` profile and
>>>>>> `hadoop-3.2 (3.2.0)` is optional.
>>>>>> > We had better use `hadoop-3.2` profile by default and `hadoop-2.7`
>>>>>> optionally.
>>>>>> >
>>>>>> > Note that this means we use Hive 2.3.6 by default. Only
>>>>>> `hadoop-2.7` distribution will use `Hive 1.2.1` like Apache Spark 2.4.x.
>>>>>> >
>>>>>> > Bests,
>>>>>> > Dongjoon.
>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> [image: Databricks Summit - Watch the talks]
>>>>> <https://databricks.com/sparkaisummit/north-america>
>>>>>
>>>>
>>>
>>> --
>>> [image: Databricks Summit - Watch the talks]
>>> <https://databricks.com/sparkaisummit/north-america>
>>>
>>
>>
>> --
>> Best Regards!
>> Jiaxin Shan
>> Tel:  412-230-7670
>> Address: 470 2nd Ave S, Kirkland, WA
>>
>>

-- 
[image: Databricks Summit - Watch the talks]
<https://databricks.com/sparkaisummit/north-america>

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

Posted by Dongjoon Hyun <do...@gmail.com>.

Hi, Xiao.

How JDK11-support can make `Hadoop-3.2 profile` risky? We build and publish
with JDK8.

> In this release, Hive execution module upgrade [from 1.2 to 2.3], Hive
thrift-server upgrade, and JDK11 supports are added to Hadoop 3.2 profile
only.

Since we build and publish with JDK8 and the default runtime is still JDK8,
I don't think `hadoop-3.2 profile` is risky in that context.

For JDK11, Hive execution module 2.3.6 doesn't support JDK11 still in terms
of remote HiveMetastore.

So, among the above reasons, we can say that Hive execution module (with
Hive 2.3.6) can be the root cause of potential unknown issues.

In other words, `Hive 1.2.1` is the one you think stable, isn't it?

Although Hive 2.3.6 might be not proven in Apache Spark officially, we
resolved several SPARK issues by upgrading Hive from 1.2.1 to 2.3.6 also.

Bests,
Dongjoon.



On Fri, Nov 1, 2019 at 5:37 PM Jiaxin Shan <se...@gmail.com> wrote:

> +1 for Hadoop 3.2.  Seems lots of cloud integration efforts Steve made is
> only available in 3.2. We see lots of users asking for better S3A support
> in Spark.
>
> On Fri, Nov 1, 2019 at 9:46 AM Xiao Li <li...@databricks.com> wrote:
>
>> Hi, Steve,
>>
>> Thanks for your comments! My major quality concern is not against Hadoop
>> 3.2. In this release, Hive execution module upgrade [from 1.2 to 2.3], Hive
>> thrift-server upgrade, and JDK11 supports are added to Hadoop 3.2 profile
>> only. Compared with Hadoop 2.x profile, the Hadoop 3.2 profile is more
>> risky due to these changes.
>>
>> To speed up the adoption of Spark 3.0, which has many other highly
>> desirable features, I am proposing to keep Hadoop 2.x profile as the
>> default.
>>
>> Cheers,
>>
>> Xiao.
>>
>>
>>
>> On Fri, Nov 1, 2019 at 5:33 AM Steve Loughran <st...@cloudera.com>
>> wrote:
>>
>>> What is the current default value? as the 2.x releases are becoming EOL;
>>> 2.7 is dead, there might be a 2.8.x; for now 2.9 is the branch-2 release
>>> getting attention. 2.10.0 shipped yesterday, but the ".0" means there will
>>> inevitably be surprises.
>>>
>>> One issue about using a older versions is that any problem reported
>>> -especially at stack traces you can blame me for- Will generally be met by
>>> a response of "does it go away when you upgrade?" The other issue is how
>>> much test coverage are things getting?
>>>
>>> w.r.t Hadoop 3.2 stability, nothing major has been reported. The ABFS
>>> client is there, and I the big guava update (HADOOP-16213) went in. People
>>> will either love or hate that.
>>>
>>> No major changes in s3a code between 3.2.0 and 3.2.1; I have a large
>>> backport planned though, including changes to better handle AWS caching of
>>> 404s generatd from HEAD requests before an object was actually created.
>>>
>>> It would be really good if the spark distributions shipped with later
>>> versions of the hadoop artifacts.
>>>
>>> On Mon, Oct 28, 2019 at 7:53 PM Xiao Li <li...@databricks.com> wrote:
>>>
>>>> The stability and quality of Hadoop 3.2 profile are unknown. The
>>>> changes are massive, including Hive execution and a new version of Hive
>>>> thriftserver.
>>>>
>>>> To reduce the risk, I would like to keep the current default version
>>>> unchanged. When it becomes stable, we can change the default profile to
>>>> Hadoop-3.2.
>>>>
>>>> Cheers,
>>>>
>>>> Xiao
>>>>
>>>> On Mon, Oct 28, 2019 at 12:51 PM Sean Owen <sr...@gmail.com> wrote:
>>>>
>>>>> I'm OK with that, but don't have a strong opinion nor info about the
>>>>> implications.
>>>>> That said my guess is we're close to the point where we don't need to
>>>>> support Hadoop 2.x anyway, so, yeah.
>>>>>
>>>>> On Mon, Oct 28, 2019 at 2:33 PM Dongjoon Hyun <do...@gmail.com>
>>>>> wrote:
>>>>> >
>>>>> > Hi, All.
>>>>> >
>>>>> > There was a discussion on publishing artifacts built with Hadoop 3 .
>>>>> > But, we are still publishing with Hadoop 2.7.3 and `3.0-preview`
>>>>> will be the same because we didn't change anything yet.
>>>>> >
>>>>> > Technically, we need to change two places for publishing.
>>>>> >
>>>>> > 1. Jenkins Snapshot Publishing
>>>>> >
>>>>> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/
>>>>> >
>>>>> > 2. Release Snapshot/Release Publishing
>>>>> >
>>>>> https://github.com/apache/spark/blob/master/dev/create-release/release-build.sh
>>>>> >
>>>>> > To minimize the change, we need to switch our default Hadoop profile.
>>>>> >
>>>>> > Currently, the default is `hadoop-2.7 (2.7.4)` profile and
>>>>> `hadoop-3.2 (3.2.0)` is optional.
>>>>> > We had better use `hadoop-3.2` profile by default and `hadoop-2.7`
>>>>> optionally.
>>>>> >
>>>>> > Note that this means we use Hive 2.3.6 by default. Only `hadoop-2.7`
>>>>> distribution will use `Hive 1.2.1` like Apache Spark 2.4.x.
>>>>> >
>>>>> > Bests,
>>>>> > Dongjoon.
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>>>
>>>>>
>>>>
>>>> --
>>>> [image: Databricks Summit - Watch the talks]
>>>> <https://databricks.com/sparkaisummit/north-america>
>>>>
>>>
>>
>> --
>> [image: Databricks Summit - Watch the talks]
>> <https://databricks.com/sparkaisummit/north-america>
>>
>
>
> --
> Best Regards!
> Jiaxin Shan
> Tel:  412-230-7670
> Address: 470 2nd Ave S, Kirkland, WA
>
>

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

Posted by Jiaxin Shan <se...@gmail.com>.

+1 for Hadoop 3.2.  Seems lots of cloud integration efforts Steve made is
only available in 3.2. We see lots of users asking for better S3A support
in Spark.

On Fri, Nov 1, 2019 at 9:46 AM Xiao Li <li...@databricks.com> wrote:

> Hi, Steve,
>
> Thanks for your comments! My major quality concern is not against Hadoop
> 3.2. In this release, Hive execution module upgrade [from 1.2 to 2.3], Hive
> thrift-server upgrade, and JDK11 supports are added to Hadoop 3.2 profile
> only. Compared with Hadoop 2.x profile, the Hadoop 3.2 profile is more
> risky due to these changes.
>
> To speed up the adoption of Spark 3.0, which has many other highly
> desirable features, I am proposing to keep Hadoop 2.x profile as the
> default.
>
> Cheers,
>
> Xiao.
>
>
>
> On Fri, Nov 1, 2019 at 5:33 AM Steve Loughran <st...@cloudera.com> wrote:
>
>> What is the current default value? as the 2.x releases are becoming EOL;
>> 2.7 is dead, there might be a 2.8.x; for now 2.9 is the branch-2 release
>> getting attention. 2.10.0 shipped yesterday, but the ".0" means there will
>> inevitably be surprises.
>>
>> One issue about using a older versions is that any problem reported
>> -especially at stack traces you can blame me for- Will generally be met by
>> a response of "does it go away when you upgrade?" The other issue is how
>> much test coverage are things getting?
>>
>> w.r.t Hadoop 3.2 stability, nothing major has been reported. The ABFS
>> client is there, and I the big guava update (HADOOP-16213) went in. People
>> will either love or hate that.
>>
>> No major changes in s3a code between 3.2.0 and 3.2.1; I have a large
>> backport planned though, including changes to better handle AWS caching of
>> 404s generatd from HEAD requests before an object was actually created.
>>
>> It would be really good if the spark distributions shipped with later
>> versions of the hadoop artifacts.
>>
>> On Mon, Oct 28, 2019 at 7:53 PM Xiao Li <li...@databricks.com> wrote:
>>
>>> The stability and quality of Hadoop 3.2 profile are unknown. The changes
>>> are massive, including Hive execution and a new version of Hive
>>> thriftserver.
>>>
>>> To reduce the risk, I would like to keep the current default version
>>> unchanged. When it becomes stable, we can change the default profile to
>>> Hadoop-3.2.
>>>
>>> Cheers,
>>>
>>> Xiao
>>>
>>> On Mon, Oct 28, 2019 at 12:51 PM Sean Owen <sr...@gmail.com> wrote:
>>>
>>>> I'm OK with that, but don't have a strong opinion nor info about the
>>>> implications.
>>>> That said my guess is we're close to the point where we don't need to
>>>> support Hadoop 2.x anyway, so, yeah.
>>>>
>>>> On Mon, Oct 28, 2019 at 2:33 PM Dongjoon Hyun <do...@gmail.com>
>>>> wrote:
>>>> >
>>>> > Hi, All.
>>>> >
>>>> > There was a discussion on publishing artifacts built with Hadoop 3 .
>>>> > But, we are still publishing with Hadoop 2.7.3 and `3.0-preview` will
>>>> be the same because we didn't change anything yet.
>>>> >
>>>> > Technically, we need to change two places for publishing.
>>>> >
>>>> > 1. Jenkins Snapshot Publishing
>>>> >
>>>> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/
>>>> >
>>>> > 2. Release Snapshot/Release Publishing
>>>> >
>>>> https://github.com/apache/spark/blob/master/dev/create-release/release-build.sh
>>>> >
>>>> > To minimize the change, we need to switch our default Hadoop profile.
>>>> >
>>>> > Currently, the default is `hadoop-2.7 (2.7.4)` profile and
>>>> `hadoop-3.2 (3.2.0)` is optional.
>>>> > We had better use `hadoop-3.2` profile by default and `hadoop-2.7`
>>>> optionally.
>>>> >
>>>> > Note that this means we use Hive 2.3.6 by default. Only `hadoop-2.7`
>>>> distribution will use `Hive 1.2.1` like Apache Spark 2.4.x.
>>>> >
>>>> > Bests,
>>>> > Dongjoon.
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>>
>>>>
>>>
>>> --
>>> [image: Databricks Summit - Watch the talks]
>>> <https://databricks.com/sparkaisummit/north-america>
>>>
>>
>
> --
> [image: Databricks Summit - Watch the talks]
> <https://databricks.com/sparkaisummit/north-america>
>


-- 
Best Regards!
Jiaxin Shan
Tel:  412-230-7670
Address: 470 2nd Ave S, Kirkland, WA

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

Posted by Xiao Li <li...@databricks.com>.

Hi, Steve,

Thanks for your comments! My major quality concern is not against Hadoop
3.2. In this release, Hive execution module upgrade [from 1.2 to 2.3], Hive
thrift-server upgrade, and JDK11 supports are added to Hadoop 3.2 profile
only. Compared with Hadoop 2.x profile, the Hadoop 3.2 profile is more
risky due to these changes.

To speed up the adoption of Spark 3.0, which has many other highly
desirable features, I am proposing to keep Hadoop 2.x profile as the
default.

Cheers,

Xiao.



On Fri, Nov 1, 2019 at 5:33 AM Steve Loughran <st...@cloudera.com> wrote:

> What is the current default value? as the 2.x releases are becoming EOL;
> 2.7 is dead, there might be a 2.8.x; for now 2.9 is the branch-2 release
> getting attention. 2.10.0 shipped yesterday, but the ".0" means there will
> inevitably be surprises.
>
> One issue about using a older versions is that any problem reported
> -especially at stack traces you can blame me for- Will generally be met by
> a response of "does it go away when you upgrade?" The other issue is how
> much test coverage are things getting?
>
> w.r.t Hadoop 3.2 stability, nothing major has been reported. The ABFS
> client is there, and I the big guava update (HADOOP-16213) went in. People
> will either love or hate that.
>
> No major changes in s3a code between 3.2.0 and 3.2.1; I have a large
> backport planned though, including changes to better handle AWS caching of
> 404s generatd from HEAD requests before an object was actually created.
>
> It would be really good if the spark distributions shipped with later
> versions of the hadoop artifacts.
>
> On Mon, Oct 28, 2019 at 7:53 PM Xiao Li <li...@databricks.com> wrote:
>
>> The stability and quality of Hadoop 3.2 profile are unknown. The changes
>> are massive, including Hive execution and a new version of Hive
>> thriftserver.
>>
>> To reduce the risk, I would like to keep the current default version
>> unchanged. When it becomes stable, we can change the default profile to
>> Hadoop-3.2.
>>
>> Cheers,
>>
>> Xiao
>>
>> On Mon, Oct 28, 2019 at 12:51 PM Sean Owen <sr...@gmail.com> wrote:
>>
>>> I'm OK with that, but don't have a strong opinion nor info about the
>>> implications.
>>> That said my guess is we're close to the point where we don't need to
>>> support Hadoop 2.x anyway, so, yeah.
>>>
>>> On Mon, Oct 28, 2019 at 2:33 PM Dongjoon Hyun <do...@gmail.com>
>>> wrote:
>>> >
>>> > Hi, All.
>>> >
>>> > There was a discussion on publishing artifacts built with Hadoop 3 .
>>> > But, we are still publishing with Hadoop 2.7.3 and `3.0-preview` will
>>> be the same because we didn't change anything yet.
>>> >
>>> > Technically, we need to change two places for publishing.
>>> >
>>> > 1. Jenkins Snapshot Publishing
>>> >
>>> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/
>>> >
>>> > 2. Release Snapshot/Release Publishing
>>> >
>>> https://github.com/apache/spark/blob/master/dev/create-release/release-build.sh
>>> >
>>> > To minimize the change, we need to switch our default Hadoop profile.
>>> >
>>> > Currently, the default is `hadoop-2.7 (2.7.4)` profile and `hadoop-3.2
>>> (3.2.0)` is optional.
>>> > We had better use `hadoop-3.2` profile by default and `hadoop-2.7`
>>> optionally.
>>> >
>>> > Note that this means we use Hive 2.3.6 by default. Only `hadoop-2.7`
>>> distribution will use `Hive 1.2.1` like Apache Spark 2.4.x.
>>> >
>>> > Bests,
>>> > Dongjoon.
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>
>>>
>>
>> --
>> [image: Databricks Summit - Watch the talks]
>> <https://databricks.com/sparkaisummit/north-america>
>>
>

-- 
[image: Databricks Summit - Watch the talks]
<https://databricks.com/sparkaisummit/north-america>

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

Posted by Steve Loughran <st...@cloudera.com.INVALID>.

What is the current default value? as the 2.x releases are becoming EOL;
2.7 is dead, there might be a 2.8.x; for now 2.9 is the branch-2 release
getting attention. 2.10.0 shipped yesterday, but the ".0" means there will
inevitably be surprises.

One issue about using a older versions is that any problem reported
-especially at stack traces you can blame me for- Will generally be met by
a response of "does it go away when you upgrade?" The other issue is how
much test coverage are things getting?

w.r.t Hadoop 3.2 stability, nothing major has been reported. The ABFS
client is there, and I the big guava update (HADOOP-16213) went in. People
will either love or hate that.

No major changes in s3a code between 3.2.0 and 3.2.1; I have a large
backport planned though, including changes to better handle AWS caching of
404s generatd from HEAD requests before an object was actually created.

It would be really good if the spark distributions shipped with later
versions of the hadoop artifacts.

On Mon, Oct 28, 2019 at 7:53 PM Xiao Li <li...@databricks.com> wrote:

> The stability and quality of Hadoop 3.2 profile are unknown. The changes
> are massive, including Hive execution and a new version of Hive
> thriftserver.
>
> To reduce the risk, I would like to keep the current default version
> unchanged. When it becomes stable, we can change the default profile to
> Hadoop-3.2.
>
> Cheers,
>
> Xiao
>
> On Mon, Oct 28, 2019 at 12:51 PM Sean Owen <sr...@gmail.com> wrote:
>
>> I'm OK with that, but don't have a strong opinion nor info about the
>> implications.
>> That said my guess is we're close to the point where we don't need to
>> support Hadoop 2.x anyway, so, yeah.
>>
>> On Mon, Oct 28, 2019 at 2:33 PM Dongjoon Hyun <do...@gmail.com>
>> wrote:
>> >
>> > Hi, All.
>> >
>> > There was a discussion on publishing artifacts built with Hadoop 3 .
>> > But, we are still publishing with Hadoop 2.7.3 and `3.0-preview` will
>> be the same because we didn't change anything yet.
>> >
>> > Technically, we need to change two places for publishing.
>> >
>> > 1. Jenkins Snapshot Publishing
>> >
>> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/
>> >
>> > 2. Release Snapshot/Release Publishing
>> >
>> https://github.com/apache/spark/blob/master/dev/create-release/release-build.sh
>> >
>> > To minimize the change, we need to switch our default Hadoop profile.
>> >
>> > Currently, the default is `hadoop-2.7 (2.7.4)` profile and `hadoop-3.2
>> (3.2.0)` is optional.
>> > We had better use `hadoop-3.2` profile by default and `hadoop-2.7`
>> optionally.
>> >
>> > Note that this means we use Hive 2.3.6 by default. Only `hadoop-2.7`
>> distribution will use `Hive 1.2.1` like Apache Spark 2.4.x.
>> >
>> > Bests,
>> > Dongjoon.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>
>>
>
> --
> [image: Databricks Summit - Watch the talks]
> <https://databricks.com/sparkaisummit/north-america>
>

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

Posted by Dongjoon Hyun <do...@gmail.com>.

Thank you for the feedback, Sean and Xiao.

Bests,
Dongjoon.

On Mon, Oct 28, 2019 at 12:52 PM Xiao Li <li...@databricks.com> wrote:

> The stability and quality of Hadoop 3.2 profile are unknown. The changes
> are massive, including Hive execution and a new version of Hive
> thriftserver.
>
> To reduce the risk, I would like to keep the current default version
> unchanged. When it becomes stable, we can change the default profile to
> Hadoop-3.2.
>
> Cheers,
>
> Xiao
>
> On Mon, Oct 28, 2019 at 12:51 PM Sean Owen <sr...@gmail.com> wrote:
>
>> I'm OK with that, but don't have a strong opinion nor info about the
>> implications.
>> That said my guess is we're close to the point where we don't need to
>> support Hadoop 2.x anyway, so, yeah.
>>
>> On Mon, Oct 28, 2019 at 2:33 PM Dongjoon Hyun <do...@gmail.com>
>> wrote:
>> >
>> > Hi, All.
>> >
>> > There was a discussion on publishing artifacts built with Hadoop 3 .
>> > But, we are still publishing with Hadoop 2.7.3 and `3.0-preview` will
>> be the same because we didn't change anything yet.
>> >
>> > Technically, we need to change two places for publishing.
>> >
>> > 1. Jenkins Snapshot Publishing
>> >
>> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/
>> >
>> > 2. Release Snapshot/Release Publishing
>> >
>> https://github.com/apache/spark/blob/master/dev/create-release/release-build.sh
>> >
>> > To minimize the change, we need to switch our default Hadoop profile.
>> >
>> > Currently, the default is `hadoop-2.7 (2.7.4)` profile and `hadoop-3.2
>> (3.2.0)` is optional.
>> > We had better use `hadoop-3.2` profile by default and `hadoop-2.7`
>> optionally.
>> >
>> > Note that this means we use Hive 2.3.6 by default. Only `hadoop-2.7`
>> distribution will use `Hive 1.2.1` like Apache Spark 2.4.x.
>> >
>> > Bests,
>> > Dongjoon.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>
>>
>
> --
> [image: Databricks Summit - Watch the talks]
> <https://databricks.com/sparkaisummit/north-america>
>

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

Posted by Xiao Li <li...@databricks.com>.

The stability and quality of Hadoop 3.2 profile are unknown. The changes
are massive, including Hive execution and a new version of Hive
thriftserver.

To reduce the risk, I would like to keep the current default version
unchanged. When it becomes stable, we can change the default profile to
Hadoop-3.2.

Cheers,

Xiao

On Mon, Oct 28, 2019 at 12:51 PM Sean Owen <sr...@gmail.com> wrote:

> I'm OK with that, but don't have a strong opinion nor info about the
> implications.
> That said my guess is we're close to the point where we don't need to
> support Hadoop 2.x anyway, so, yeah.
>
> On Mon, Oct 28, 2019 at 2:33 PM Dongjoon Hyun <do...@gmail.com>
> wrote:
> >
> > Hi, All.
> >
> > There was a discussion on publishing artifacts built with Hadoop 3 .
> > But, we are still publishing with Hadoop 2.7.3 and `3.0-preview` will be
> the same because we didn't change anything yet.
> >
> > Technically, we need to change two places for publishing.
> >
> > 1. Jenkins Snapshot Publishing
> >
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/
> >
> > 2. Release Snapshot/Release Publishing
> >
> https://github.com/apache/spark/blob/master/dev/create-release/release-build.sh
> >
> > To minimize the change, we need to switch our default Hadoop profile.
> >
> > Currently, the default is `hadoop-2.7 (2.7.4)` profile and `hadoop-3.2
> (3.2.0)` is optional.
> > We had better use `hadoop-3.2` profile by default and `hadoop-2.7`
> optionally.
> >
> > Note that this means we use Hive 2.3.6 by default. Only `hadoop-2.7`
> distribution will use `Hive 1.2.1` like Apache Spark 2.4.x.
> >
> > Bests,
> > Dongjoon.
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>
>

-- 
[image: Databricks Summit - Watch the talks]
<https://databricks.com/sparkaisummit/north-america>

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

Posted by Koert Kuipers <ko...@tresata.com>.

yes i am not against hadoop 3 becoming the default. i was just questioning
the statement that we are close to dropping support for hadoop 2.

we build our own spark releases that we deploy on the clusters of our
clients. these clusters are hdp 2.x, cdh 5, emr, dataproc, etc.

i am aware that hadoop 2.6 profile was dropped and we are handling this
in-house.

given that latest hdp 2.x is still hadoop 2.7 bumping hadoop 2 profile to
latest would probably be an issue for us.

On Sat, Nov 2, 2019, 15:47 Dongjoon Hyun <do...@gmail.com> wrote:

> Hi, Koert.
>
> Could you be more specific to your Hadoop version requirement?
>
> Although we will have Hadoop 2.7 profile, Hadoop 2.6 and older support is
> officially already dropped in Apache Spark 3.0.0. We can not give you the
> answer for Hadoop 2.6 and older version clusters because we are not testing
> at all.
>
> Also, Steve already pointed out that Hadoop 2.7 is also EOL. According to
> his advice, we might need to upgrade our Hadoop 2.7 profile to the latest
> 2.x. I'm wondering you are against on that because of Hadoop 2.6 or older
> version support.
>
> BTW, I'm the one of the users of Hadoop 3.x clusters. It's used already
> and we are migrating more. Apache Spark 3.0 will arrive 2020 (not today).
> We need to consider that, too. Do you have any migration plan in 2020?
>
> In short, for the clusters using Hadoop 2.6 and older versions, Apache
> Spark 2.4 is supported as a LTS version. You can get the bug fixes. For
> Hadoop 2.7, Apache Spark 3.0 will have the profile and the binary release.
> Making Hadoop 3.2 profile as a default is irrelevant to that.
>
> Bests,
> Dongjoon.
>
>
> On Sat, Nov 2, 2019 at 09:35 Koert Kuipers <ko...@tresata.com> wrote:
>
>> i dont see how we can be close to the point where we dont need to support
>> hadoop 2.x. this does not agree with the reality from my perspective, which
>> is that all our clients are on hadoop 2.x. not a single one is on hadoop
>> 3.x currently. this includes deployments of cloudera distros, hortonworks
>> distros, and cloud distros like emr and dataproc.
>>
>> forcing us to be on older spark versions would be unfortunate for us, and
>> also bad for the community (as deployments like ours help find bugs in
>> spark).
>>
>> On Mon, Oct 28, 2019 at 3:51 PM Sean Owen <sr...@gmail.com> wrote:
>>
>>> I'm OK with that, but don't have a strong opinion nor info about the
>>> implications.
>>> That said my guess is we're close to the point where we don't need to
>>> support Hadoop 2.x anyway, so, yeah.
>>>
>>> On Mon, Oct 28, 2019 at 2:33 PM Dongjoon Hyun <do...@gmail.com>
>>> wrote:
>>> >
>>> > Hi, All.
>>> >
>>> > There was a discussion on publishing artifacts built with Hadoop 3 .
>>> > But, we are still publishing with Hadoop 2.7.3 and `3.0-preview` will
>>> be the same because we didn't change anything yet.
>>> >
>>> > Technically, we need to change two places for publishing.
>>> >
>>> > 1. Jenkins Snapshot Publishing
>>> >
>>> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/
>>> >
>>> > 2. Release Snapshot/Release Publishing
>>> >
>>> https://github.com/apache/spark/blob/master/dev/create-release/release-build.sh
>>> >
>>> > To minimize the change, we need to switch our default Hadoop profile.
>>> >
>>> > Currently, the default is `hadoop-2.7 (2.7.4)` profile and `hadoop-3.2
>>> (3.2.0)` is optional.
>>> > We had better use `hadoop-3.2` profile by default and `hadoop-2.7`
>>> optionally.
>>> >
>>> > Note that this means we use Hive 2.3.6 by default. Only `hadoop-2.7`
>>> distribution will use `Hive 1.2.1` like Apache Spark 2.4.x.
>>> >
>>> > Bests,
>>> > Dongjoon.
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>
>>>

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

Posted by Dongjoon Hyun <do...@gmail.com>.

Hi, Koert.

Could you be more specific to your Hadoop version requirement?

Although we will have Hadoop 2.7 profile, Hadoop 2.6 and older support is
officially already dropped in Apache Spark 3.0.0. We can not give you the
answer for Hadoop 2.6 and older version clusters because we are not testing
at all.

Also, Steve already pointed out that Hadoop 2.7 is also EOL. According to
his advice, we might need to upgrade our Hadoop 2.7 profile to the latest
2.x. I'm wondering you are against on that because of Hadoop 2.6 or older
version support.

BTW, I'm the one of the users of Hadoop 3.x clusters. It's used already and
we are migrating more. Apache Spark 3.0 will arrive 2020 (not today). We
need to consider that, too. Do you have any migration plan in 2020?

In short, for the clusters using Hadoop 2.6 and older versions, Apache
Spark 2.4 is supported as a LTS version. You can get the bug fixes. For
Hadoop 2.7, Apache Spark 3.0 will have the profile and the binary release.
Making Hadoop 3.2 profile as a default is irrelevant to that.

Bests,
Dongjoon.

On Sat, Nov 2, 2019 at 09:35 Koert Kuipers <ko...@tresata.com> wrote:

> i dont see how we can be close to the point where we dont need to support
> hadoop 2.x. this does not agree with the reality from my perspective, which
> is that all our clients are on hadoop 2.x. not a single one is on hadoop
> 3.x currently. this includes deployments of cloudera distros, hortonworks
> distros, and cloud distros like emr and dataproc.
>
> forcing us to be on older spark versions would be unfortunate for us, and
> also bad for the community (as deployments like ours help find bugs in
> spark).
>
> On Mon, Oct 28, 2019 at 3:51 PM Sean Owen <sr...@gmail.com> wrote:
>
>> I'm OK with that, but don't have a strong opinion nor info about the
>> implications.
>> That said my guess is we're close to the point where we don't need to
>> support Hadoop 2.x anyway, so, yeah.
>>
>> On Mon, Oct 28, 2019 at 2:33 PM Dongjoon Hyun <do...@gmail.com>
>> wrote:
>> >
>> > Hi, All.
>> >
>> > There was a discussion on publishing artifacts built with Hadoop 3 .
>> > But, we are still publishing with Hadoop 2.7.3 and `3.0-preview` will
>> be the same because we didn't change anything yet.
>> >
>> > Technically, we need to change two places for publishing.
>> >
>> > 1. Jenkins Snapshot Publishing
>> >
>> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/
>> >
>> > 2. Release Snapshot/Release Publishing
>> >
>> https://github.com/apache/spark/blob/master/dev/create-release/release-build.sh
>> >
>> > To minimize the change, we need to switch our default Hadoop profile.
>> >
>> > Currently, the default is `hadoop-2.7 (2.7.4)` profile and `hadoop-3.2
>> (3.2.0)` is optional.
>> > We had better use `hadoop-3.2` profile by default and `hadoop-2.7`
>> optionally.
>> >
>> > Note that this means we use Hive 2.3.6 by default. Only `hadoop-2.7`
>> distribution will use `Hive 1.2.1` like Apache Spark 2.4.x.
>> >
>> > Bests,
>> > Dongjoon.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>
>>

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

Posted by Steve Loughran <st...@cloudera.com.INVALID>.

I'd move spark's branch-2 line to 2.9.x as

(a) spark's version of httpclient hits a bug in the AWS SDK used in
hadoop-2.8 unless you revert that patch
https://issues.apache.org/jira/browse/SPARK-22919
(b) there's only one future version of 2.8x planned, which is expected once
myself or someone else sits down to do it. After that, all CVEs will be
dealt with by "upgrade".
(c) it's actually tested against java 8, whereas versions <= 2.8 are
nominally java 7 only.
(d) Microsoft contributed a lot for Azure integration

To be fair, the fact that the 2.7 release has lasted so long is actually
pretty impressive. Core APIs stable; Kerberos under control, HDFS client
and server happy (no erasure coding, other things tho'); the lack of
support/performance for object store integration shows how things have
changed since it's release in April 2015. But that was over five years ago.

On Sat, Nov 2, 2019 at 4:36 PM Koert Kuipers <ko...@tresata.com> wrote:

> i dont see how we can be close to the point where we dont need to support
> hadoop 2.x. this does not agree with the reality from my perspective, which
> is that all our clients are on hadoop 2.x. not a single one is on hadoop
> 3.x currently.
>

Maybe, but unlikely to be on a "vanilla" 2.7.x release except for some very
special cases where teams have taken on that task of maintaining their own
installation.

> this includes deployments of cloudera distros, hortonworks distros,
>

In front of me I have a git source tree whose repositories let me see the
version histories of all of these and ~HD/I too. This is a power (I can
make changes to all) and a responsibility (I could accidentally break the
nightly builds of all if I'm not careful (1)). The one thing it doesn't do
is have write access to asf gitbox, but that's only to stop me accidentally
pushing me up an internal HDP or CDH branch to the ASF/github repos (2).

CDH5.x: hadoop branch-2 with some S3A features backported from hadoop
branch-3 (i.e S3Guard). I'd call it 2.8+ though I don't know it in detail
there.

HDP2.6.x: again, 2.8+ with abfs and gcs support.

Either way: when Spark 3.x ships it'd be up to Cloudera to deal with that
release.

I have no idea what is going to happen there. If other people want to test
spark 3.0.0 on those platforms -go for it, but do consider that by the
commercial on-premises clusters have had a hadoop-3 option for 2+ years and
that every month the age of those 2.x-based clusters increases. In cloud,
things are transient so it doesn't matter *at all*.

> and cloud distros like emr and dataproc.
>
>
EMR is a closed-source fork of (hadoop, hbase, spark, ...) with their own
S3 connector which has never had its source seen other than in stack traces
on stack overflow. Their problem (3).

HD/I: Current with azure connectivity, doesn't worry about the rest.

dataproc: no idea. Their gcs connector has been pretty stable. They do both
branch-2 and branch-3.1 artifacts & do run the fs contract tests to help
catch regressions in our code and theirs.

For all those in-cloud deployments, if you say "min version is Hadoop 3.x
artifacts" then when they offer spark-3 they'll just do it with their build
of the hadoop-3 JARs. It's not like they have 1000+ node HDFS clusters to
upgrade.

> forcing us to be on older spark versions would be unfortunate for us, and
> also bad for the community (as deployments like ours help find bugs in
> spark).
>
>
Bear also in mind: because all the work with hadoop, hive, HBase etc goes
on in branch-3 code, the compatibility with those things ages too. If you
are worried about Hive, well, you need to be working with their latest
releases to get any issues you find fixed,

It's a really hard choice here: Stable dependencies versus newer ones.
Certainly hadoop stayed with an old version of guava because the upgrade
was so traumatic (it's changed now), and as for protobuf, that was so
traumatic that everyone left it stayed frozen until last month (3.3, not
3.2.x, and protoc is done in java/maven). At the same time CVEs force
Jackson updates on a fortnightly basis and the move to java 11 breaks so
much that it's a big upgrade festival on us all.

You're going to have to consider "how much suffering with Hadoop 2.7
support is justified?" and "what should be the version which is actually
shipped for people to play with". I think my stance is clear: time to move
on. You cut your test matrix in half, be confident all users reporting bugs
will be on hadoop 3.x and when you do file bugs with your peer ASF projects
they don't get closed as WONTFIX.

BTW: out of curiosity, what versions of things does Databricks build off.
ASF 2.7.x or something later?

-Steve

(1) Narrator: He has accidentally broken the nightly builds of most of
these. And IBM websphere once. Breaking google cloud is still an unrealised
ambition.
(2) Narrator: He has accidentally pushed up a release of an internal branch
to the ASF/github repos. Colleagues were unhappy.
(3) Pro: they don't have to worry about me directly breaking their S3
integration. Con: I could still indirectly do it elsewhere in the source
tree,  wouldn't notice, and probably wouldn't care much if they complained.

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

Posted by Koert Kuipers <ko...@tresata.com>.

i dont see how we can be close to the point where we dont need to support
hadoop 2.x. this does not agree with the reality from my perspective, which
is that all our clients are on hadoop 2.x. not a single one is on hadoop
3.x currently. this includes deployments of cloudera distros, hortonworks
distros, and cloud distros like emr and dataproc.

forcing us to be on older spark versions would be unfortunate for us, and
also bad for the community (as deployments like ours help find bugs in
spark).

On Mon, Oct 28, 2019 at 3:51 PM Sean Owen <sr...@gmail.com> wrote:

> I'm OK with that, but don't have a strong opinion nor info about the
> implications.
> That said my guess is we're close to the point where we don't need to
> support Hadoop 2.x anyway, so, yeah.
>
> On Mon, Oct 28, 2019 at 2:33 PM Dongjoon Hyun <do...@gmail.com>
> wrote:
> >
> > Hi, All.
> >
> > There was a discussion on publishing artifacts built with Hadoop 3 .
> > But, we are still publishing with Hadoop 2.7.3 and `3.0-preview` will be
> the same because we didn't change anything yet.
> >
> > Technically, we need to change two places for publishing.
> >
> > 1. Jenkins Snapshot Publishing
> >
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/
> >
> > 2. Release Snapshot/Release Publishing
> >
> https://github.com/apache/spark/blob/master/dev/create-release/release-build.sh
> >
> > To minimize the change, we need to switch our default Hadoop profile.
> >
> > Currently, the default is `hadoop-2.7 (2.7.4)` profile and `hadoop-3.2
> (3.2.0)` is optional.
> > We had better use `hadoop-3.2` profile by default and `hadoop-2.7`
> optionally.
> >
> > Note that this means we use Hive 2.3.6 by default. Only `hadoop-2.7`
> distribution will use `Hive 1.2.1` like Apache Spark 2.4.x.
> >
> > Bests,
> > Dongjoon.
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>
>

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

Posted by Sean Owen <sr...@gmail.com>.

I'm OK with that, but don't have a strong opinion nor info about the
implications.
That said my guess is we're close to the point where we don't need to
support Hadoop 2.x anyway, so, yeah.

On Mon, Oct 28, 2019 at 2:33 PM Dongjoon Hyun <do...@gmail.com> wrote:
>
> Hi, All.
>
> There was a discussion on publishing artifacts built with Hadoop 3 .
> But, we are still publishing with Hadoop 2.7.3 and `3.0-preview` will be the same because we didn't change anything yet.
>
> Technically, we need to change two places for publishing.
>
> 1. Jenkins Snapshot Publishing
>     https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/
>
> 2. Release Snapshot/Release Publishing
>     https://github.com/apache/spark/blob/master/dev/create-release/release-build.sh
>
> To minimize the change, we need to switch our default Hadoop profile.
>
> Currently, the default is `hadoop-2.7 (2.7.4)` profile and `hadoop-3.2 (3.2.0)` is optional.
> We had better use `hadoop-3.2` profile by default and `hadoop-2.7` optionally.
>
> Note that this means we use Hive 2.3.6 by default. Only `hadoop-2.7` distribution will use `Hive 1.2.1` like Apache Spark 2.4.x.
>
> Bests,
> Dongjoon.

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org