Posted to dev@spark.apache.org by Dongjoon Hyun <do...@gmail.com> on 2019/11/19 05:11:12 UTC

Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

Hi, All.

First of all, I want to put this as a policy issue instead of a technical
issue.
Also, this is orthogonal to the `hadoop` version discussion.

The Apache Spark community has kept (not maintained) the forked Apache Hive
1.2.1 because there were no other options before. As we can see in
SPARK-20202, it's not a desirable situation among Apache projects.

    https://issues.apache.org/jira/browse/SPARK-20202

Also, please note that I say we `kept`, not `maintained`, because we know
it's not good.
There have been several attempts to update that forked repository
for several reasons (Hadoop 3 support is one example),
but those attempts were also turned down.

From Apache Spark 3.0, it seems that we have a new feasible option, the
`hive-2.3` profile. What about moving further in this direction?

For example, can we remove the usage of the forked `hive` in Apache Spark 3.0
completely and officially? If someone still needs to use the forked `hive`,
we can have a `hive-1.2` profile. Of course, it should not be the default
profile in the community.
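
To make the idea concrete, here is a hypothetical build.sbt-style sketch of
what such a profile split means for dependency selection. The switch
mechanism and the `hive.profile` property below are illustrative only (our
real build uses Maven profiles); the artifact coordinates are the published
ones.

    // Illustrative sketch only: what a `hive-1.2` vs. `hive-2.3` choice
    // amounts to for dependency resolution. Spark's actual build uses
    // Maven profiles, not this property.
    val hiveProfile = sys.props.getOrElse("hive.profile", "hive-2.3")

    libraryDependencies ++= (hiveProfile match {
      case "hive-1.2" =>
        // the forked artifacts that would become non-default
        Seq("org.spark-project.hive" % "hive-exec" % "1.2.1.spark2")
      case _ =>
        Seq("org.apache.hive" % "hive-exec" % "2.3.6")
    })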

I want to say this is a goal we should achieve someday.
If we don't do anything, nothing happens. At least, we need to prepare for
this. Without any preparation, Spark 3.1+ will be in the same situation.

Shall we focus on what our actual problems with Hive 2.3.6 are?
If the only reason is that we haven't used it before, we can release another
`3.0.0-preview` for that.

Bests,
Dongjoon.

Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

Posted by Dongjoon Hyun <do...@gmail.com>.
Thank you all.

I'll try to create a JIRA and a PR for that.

Bests,
Dongjoon.

On Wed, Nov 20, 2019 at 4:08 PM Cheng Lian <li...@gmail.com> wrote:

> Sean, thanks for the corner cases you listed. They make a lot of sense.
> Now I do incline to have Hive 2.3 as the default version.
>
> Dongjoon, apologize if I didn't make it clear before. What made me
> concerned initially was only the following part:
>
> > can we remove the usage of forked `hive` in Apache Spark 3.0 completely
> officially?
>
> So having Hive 2.3 as the default Hive version and adding a `hive-1.2`
> profile to keep the Hive 1.2.1 fork looks like a feasible approach to me.
> Thanks for starting the discussion!
>
> On Wed, Nov 20, 2019 at 9:46 AM Dongjoon Hyun <do...@gmail.com>
> wrote:
>
>> Yes. Right. That's the situation we are hitting and the result I expected.
>> We need to change our default with Hive 2 in the POM.
>>
>> Dongjoon.
>>
>>
>> On Wed, Nov 20, 2019 at 5:20 AM Sean Owen <sr...@gmail.com> wrote:
>>
>>> Yes, good point. A user would get whatever the POM says without
>>> profiles enabled so it matters.
>>>
>>> Playing it out, an app _should_ compile with the Spark dependency
>>> marked 'provided'. In that case the app that is spark-submit-ted is
>>> agnostic to the Hive dependency as the only one that matters is what's
>>> on the cluster. Right? we don't leak through the Hive API in the Spark
>>> API. And yes it's then up to the cluster to provide whatever version
>>> it wants. Vendors will have made a specific version choice when
>>> building their distro one way or the other.
>>>
>>> If you run a Spark cluster yourself, you're using the binary distro,
>>> and we're already talking about also publishing a binary distro with
>>> this variation, so that's not the issue.
>>>
>>> The corner cases where it might matter are:
>>>
>>> - I unintentionally package Spark in the app and by default pull in
>>> Hive 2 when I will deploy against Hive 1. But that's user error, and
>>> causes other problems
>>> - I run tests locally in my project, which will pull in a default
>>> version of Hive defined by the POM
>>>
>>> Double-checking, is that right? if so it kind of implies it doesn't
>>> matter. Which is an argument either way about what's the default. I
>>> too would then prefer defaulting to Hive 2 in the POM. Am I missing
>>> something about the implication?
>>>
>>> (That fork will stay published forever anyway, that's not an issue per
>>> se.)
>>>
>>> On Wed, Nov 20, 2019 at 1:40 AM Dongjoon Hyun <do...@gmail.com>
>>> wrote:
>>> > Sean, our published POM is pointing and advertising the illegitimate
>>> Hive 1.2 fork as a compile dependency.
>>> > Yes. It can be overridden. So, why does Apache Spark need to publish
>>> like that?
>>> > If someone want to use that illegitimate Hive 1.2 fork, let them
>>> override it. We are unable to delete those illegitimate Hive 1.2 fork.
>>> > Those artifacts will be orphans.
>>> >
>>>
>>

Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

Posted by Cheng Lian <li...@gmail.com>.
Sean, thanks for the corner cases you listed. They make a lot of sense. Now
I am inclined to have Hive 2.3 as the default version.

Dongjoon, apologies if I didn't make it clear before. What concerned me
initially was only the following part:

> can we remove the usage of forked `hive` in Apache Spark 3.0 completely
officially?

So having Hive 2.3 as the default Hive version and adding a `hive-1.2`
profile to keep the Hive 1.2.1 fork looks like a feasible approach to me.
Thanks for starting the discussion!

On Wed, Nov 20, 2019 at 9:46 AM Dongjoon Hyun <do...@gmail.com>
wrote:

> Yes. Right. That's the situation we are hitting and the result I expected.
> We need to change our default with Hive 2 in the POM.
>
> Dongjoon.
>
>
> On Wed, Nov 20, 2019 at 5:20 AM Sean Owen <sr...@gmail.com> wrote:
>
>> Yes, good point. A user would get whatever the POM says without
>> profiles enabled so it matters.
>>
>> Playing it out, an app _should_ compile with the Spark dependency
>> marked 'provided'. In that case the app that is spark-submit-ted is
>> agnostic to the Hive dependency as the only one that matters is what's
>> on the cluster. Right? we don't leak through the Hive API in the Spark
>> API. And yes it's then up to the cluster to provide whatever version
>> it wants. Vendors will have made a specific version choice when
>> building their distro one way or the other.
>>
>> If you run a Spark cluster yourself, you're using the binary distro,
>> and we're already talking about also publishing a binary distro with
>> this variation, so that's not the issue.
>>
>> The corner cases where it might matter are:
>>
>> - I unintentionally package Spark in the app and by default pull in
>> Hive 2 when I will deploy against Hive 1. But that's user error, and
>> causes other problems
>> - I run tests locally in my project, which will pull in a default
>> version of Hive defined by the POM
>>
>> Double-checking, is that right? if so it kind of implies it doesn't
>> matter. Which is an argument either way about what's the default. I
>> too would then prefer defaulting to Hive 2 in the POM. Am I missing
>> something about the implication?
>>
>> (That fork will stay published forever anyway, that's not an issue per
>> se.)
>>
>> On Wed, Nov 20, 2019 at 1:40 AM Dongjoon Hyun <do...@gmail.com>
>> wrote:
>> > Sean, our published POM is pointing and advertising the illegitimate
>> Hive 1.2 fork as a compile dependency.
>> > Yes. It can be overridden. So, why does Apache Spark need to publish
>> like that?
>> > If someone want to use that illegitimate Hive 1.2 fork, let them
>> override it. We are unable to delete those illegitimate Hive 1.2 fork.
>> > Those artifacts will be orphans.
>> >
>>
>

Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

Posted by Dongjoon Hyun <do...@gmail.com>.
Yes. Right. That's the situation we are hitting and the result I expected.
We need to change our default to Hive 2 in the POM.

Dongjoon.


On Wed, Nov 20, 2019 at 5:20 AM Sean Owen <sr...@gmail.com> wrote:

> Yes, good point. A user would get whatever the POM says without
> profiles enabled so it matters.
>
> Playing it out, an app _should_ compile with the Spark dependency
> marked 'provided'. In that case the app that is spark-submit-ted is
> agnostic to the Hive dependency as the only one that matters is what's
> on the cluster. Right? we don't leak through the Hive API in the Spark
> API. And yes it's then up to the cluster to provide whatever version
> it wants. Vendors will have made a specific version choice when
> building their distro one way or the other.
>
> If you run a Spark cluster yourself, you're using the binary distro,
> and we're already talking about also publishing a binary distro with
> this variation, so that's not the issue.
>
> The corner cases where it might matter are:
>
> - I unintentionally package Spark in the app and by default pull in
> Hive 2 when I will deploy against Hive 1. But that's user error, and
> causes other problems
> - I run tests locally in my project, which will pull in a default
> version of Hive defined by the POM
>
> Double-checking, is that right? if so it kind of implies it doesn't
> matter. Which is an argument either way about what's the default. I
> too would then prefer defaulting to Hive 2 in the POM. Am I missing
> something about the implication?
>
> (That fork will stay published forever anyway, that's not an issue per se.)
>
> On Wed, Nov 20, 2019 at 1:40 AM Dongjoon Hyun <do...@gmail.com>
> wrote:
> > Sean, our published POM is pointing and advertising the illegitimate
> Hive 1.2 fork as a compile dependency.
> > Yes. It can be overridden. So, why does Apache Spark need to publish
> like that?
> > If someone want to use that illegitimate Hive 1.2 fork, let them
> override it. We are unable to delete those illegitimate Hive 1.2 fork.
> > Those artifacts will be orphans.
> >
>

Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

Posted by Sean Owen <sr...@gmail.com>.
Yes, good point. A user would get whatever the POM says without
profiles enabled so it matters.

Playing it out, an app _should_ compile with the Spark dependency
marked 'provided'. In that case the app that is spark-submit-ted is
agnostic to the Hive dependency as the only one that matters is what's
on the cluster. Right? We don't leak the Hive API through the Spark
API. And yes, it's then up to the cluster to provide whatever version
it wants. Vendors will have made a specific version choice when
building their distro one way or the other.
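
As a minimal build.sbt sketch of that 'provided' setup (the version below is
just an assumed placeholder): the app compiles against Spark but does not
bundle it, so the Hive version that matters at runtime is whatever the
cluster's Spark distribution was built with.

    // Spark is supplied by the cluster at spark-submit time, so the Hive
    // version bundled with the cluster's Spark distribution is what counts.
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-sql"  % "3.0.0-preview" % "provided",
      "org.apache.spark" %% "spark-hive" % "3.0.0-preview" % "provided"
    )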

If you run a Spark cluster yourself, you're using the binary distro,
and we're already talking about also publishing a binary distro with
this variation, so that's not the issue.

The corner cases where it might matter are:

- I unintentionally package Spark in the app and by default pull in
Hive 2 when I will deploy against Hive 1. But that's user error, and
causes other problems
- I run tests locally in my project, which will pull in a default
version of Hive defined by the POM

Double-checking, is that right? If so, it kind of implies it doesn't
matter, which is an argument either way about what the default is. I
too would then prefer defaulting to Hive 2 in the POM. Am I missing
something about the implication?

(That fork will stay published forever anyway, that's not an issue per se.)

On Wed, Nov 20, 2019 at 1:40 AM Dongjoon Hyun <do...@gmail.com> wrote:
> Sean, our published POM is pointing and advertising the illegitimate Hive 1.2 fork as a compile dependency.
> Yes. It can be overridden. So, why does Apache Spark need to publish like that?
> If someone want to use that illegitimate Hive 1.2 fork, let them override it. We are unable to delete those illegitimate Hive 1.2 fork.
> Those artifacts will be orphans.
>


Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

Posted by Dongjoon Hyun <do...@gmail.com>.
Cheng, could you elaborate on your criterion, `Hive 2.3 code paths are
proven to be stable`?
For me, it's difficult to imagine that we can reach any stable situation
when we don't use it at all ourselves.

    > The Hive 1.2 code paths can only be removed once the Hive 2.3 code
paths are proven to be stable.

Sean, our published POM is pointing to and advertising the illegitimate Hive
1.2 fork as a compile dependency.
Yes, it can be overridden. So, why does Apache Spark need to publish like
that?
If someone wants to use that illegitimate Hive 1.2 fork, let them override
it. We are unable to delete those illegitimate Hive 1.2 fork artifacts.
Those artifacts will be orphans.

    > The published POM will be agnostic to Hadoop / Hive; well,
    > it will link against a particular version but can be overridden.

    - https://mvnrepository.com/artifact/org.apache.spark/spark-hive_2.12/3.0.0-preview
      -> https://mvnrepository.com/artifact/org.spark-project.hive/hive-exec/1.2.1.spark2
      -> https://mvnrepository.com/artifact/org.spark-project.hive/hive-metastore/1.2.1.spark2
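
As an illustration of that "override", here is a minimal build.sbt sketch
that excludes the forked artifacts and declares Apache Hive instead. The
coordinates come from the links above; whether the swapped jars actually
work at runtime depends on which Hive shim the Spark binaries were built
against, so treat it purely as a sketch of the dependency mechanics.

    // Drop the transitive org.spark-project.hive artifacts advertised by
    // the published POM and declare Apache Hive explicitly instead.
    libraryDependencies += ("org.apache.spark" %% "spark-hive" % "3.0.0-preview")
      .excludeAll(ExclusionRule(organization = "org.spark-project.hive"))

    libraryDependencies += "org.apache.hive" % "hive-exec" % "2.3.6"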

Bests,
Dongjoon.


On Tue, Nov 19, 2019 at 5:26 PM Hyukjin Kwon <gu...@gmail.com> wrote:

> > Should Hadoop 2 + Hive 2 be considered to work on JDK 11?
> This seems being investigated by Yuming's PR (
> https://github.com/apache/spark/pull/26533) if I am not mistaken.
>
> Oh, yes, what I meant by (default) was the default profiles we will use in
> Spark.
>
>
> On Wed, Nov 20, 2019 at 10:14 AM Sean Owen <sr...@gmail.com> wrote:
>
>> Should Hadoop 2 + Hive 2 be considered to work on JDK 11? I wasn't
>> sure if 2.7 did, but honestly I've lost track.
>> Anyway, it doesn't matter much as the JDK doesn't cause another build
>> permutation. All are built targeting Java 8.
>>
>> I also don't know if we have to declare a binary release a default.
>> The published POM will be agnostic to Hadoop / Hive; well, it will
>> link against a particular version but can be overridden. That's what
>> you're getting at?
>>
>>
>> On Tue, Nov 19, 2019 at 7:11 PM Hyukjin Kwon <gu...@gmail.com> wrote:
>> >
>> > So, are we able to conclude our plans as below?
>> >
>> > 1. In Spark 3,  we release as below:
>> >   - Hadoop 3.2 + Hive 2.3 + JDK8 build that also works JDK 11
>> >   - Hadoop 2.7 + Hive 2.3 + JDK8 build that also works JDK 11
>> >   - Hadoop 2.7 + Hive 1.2.1 (fork) + JDK8 (default)
>> >
>> > 2. In Spark 3.1, we target:
>> >   - Hadoop 3.2 + Hive 2.3 + JDK8 build that also works JDK 11
>> >   - Hadoop 2.7 + Hive 2.3 + JDK8 build that also works JDK 11 (default)
>> >
>> > 3. Avoid to remove "Hadoop 2.7 + Hive 1.2.1 (fork) + JDK8 (default)"
>> combo right away after cutting branch-3 to see if Hive 2.3 is considered as
>> stable in general.
>> >     I roughly suspect it would be a couple of months after Spark 3.0
>> release (?).
>> >
>> > BTW, maybe we should officially note that "Hadoop 2.7 + Hive 1.2.1
>> (fork) + JDK8 (default)" combination is deprecated anyway in Spark 3.
>> >
>>
>

Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

Posted by Hyukjin Kwon <gu...@gmail.com>.
> Should Hadoop 2 + Hive 2 be considered to work on JDK 11?
This seems to be under investigation in Yuming's PR (
https://github.com/apache/spark/pull/26533), if I am not mistaken.

Oh, yes, what I meant by (default) was the default profiles we will use in
Spark.


On Wed, Nov 20, 2019 at 10:14 AM Sean Owen <sr...@gmail.com> wrote:

> Should Hadoop 2 + Hive 2 be considered to work on JDK 11? I wasn't
> sure if 2.7 did, but honestly I've lost track.
> Anyway, it doesn't matter much as the JDK doesn't cause another build
> permutation. All are built targeting Java 8.
>
> I also don't know if we have to declare a binary release a default.
> The published POM will be agnostic to Hadoop / Hive; well, it will
> link against a particular version but can be overridden. That's what
> you're getting at?
>
>
> On Tue, Nov 19, 2019 at 7:11 PM Hyukjin Kwon <gu...@gmail.com> wrote:
> >
> > So, are we able to conclude our plans as below?
> >
> > 1. In Spark 3,  we release as below:
> >   - Hadoop 3.2 + Hive 2.3 + JDK8 build that also works JDK 11
> >   - Hadoop 2.7 + Hive 2.3 + JDK8 build that also works JDK 11
> >   - Hadoop 2.7 + Hive 1.2.1 (fork) + JDK8 (default)
> >
> > 2. In Spark 3.1, we target:
> >   - Hadoop 3.2 + Hive 2.3 + JDK8 build that also works JDK 11
> >   - Hadoop 2.7 + Hive 2.3 + JDK8 build that also works JDK 11 (default)
> >
> > 3. Avoid to remove "Hadoop 2.7 + Hive 1.2.1 (fork) + JDK8 (default)"
> combo right away after cutting branch-3 to see if Hive 2.3 is considered as
> stable in general.
> >     I roughly suspect it would be a couple of months after Spark 3.0
> release (?).
> >
> > BTW, maybe we should officially note that "Hadoop 2.7 + Hive 1.2.1
> (fork) + JDK8 (default)" combination is deprecated anyway in Spark 3.
> >
>

Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

Posted by Sean Owen <sr...@gmail.com>.
Should Hadoop 2 + Hive 2 be considered to work on JDK 11? I wasn't
sure if 2.7 did, but honestly I've lost track.
Anyway, it doesn't matter much as the JDK doesn't cause another build
permutation. All are built targeting Java 8.

I also don't know if we have to declare a binary release a default.
The published POM will be agnostic to Hadoop / Hive; well, it will
link against a particular version but can be overridden. That's what
you're getting at?


On Tue, Nov 19, 2019 at 7:11 PM Hyukjin Kwon <gu...@gmail.com> wrote:
>
> So, are we able to conclude our plans as below?
>
> 1. In Spark 3,  we release as below:
>   - Hadoop 3.2 + Hive 2.3 + JDK8 build that also works JDK 11
>   - Hadoop 2.7 + Hive 2.3 + JDK8 build that also works JDK 11
>   - Hadoop 2.7 + Hive 1.2.1 (fork) + JDK8 (default)
>
> 2. In Spark 3.1, we target:
>   - Hadoop 3.2 + Hive 2.3 + JDK8 build that also works JDK 11
>   - Hadoop 2.7 + Hive 2.3 + JDK8 build that also works JDK 11 (default)
>
> 3. Avoid to remove "Hadoop 2.7 + Hive 1.2.1 (fork) + JDK8 (default)" combo right away after cutting branch-3 to see if Hive 2.3 is considered as stable in general.
>     I roughly suspect it would be a couple of months after Spark 3.0 release (?).
>
> BTW, maybe we should officially note that "Hadoop 2.7 + Hive 1.2.1 (fork) + JDK8 (default)" combination is deprecated anyway in Spark 3.
>


Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

Posted by Hyukjin Kwon <gu...@gmail.com>.
So, are we able to conclude our plans as below?

1. In Spark 3, we release as below:
  - Hadoop 3.2 + Hive 2.3 + JDK8 build that also works with JDK 11
  - Hadoop 2.7 + Hive 2.3 + JDK8 build that also works with JDK 11
  - Hadoop 2.7 + Hive 1.2.1 (fork) + JDK8 (default)

2. In Spark 3.1, we target:
  - Hadoop 3.2 + Hive 2.3 + JDK8 build that also works with JDK 11
  - Hadoop 2.7 + Hive 2.3 + JDK8 build that also works with JDK 11 (default)

3. Avoid removing the "Hadoop 2.7 + Hive 1.2.1 (fork) + JDK8 (default)" combo
right away after cutting branch-3, to see whether Hive 2.3 is considered
stable in general.
    I roughly suspect that would be a couple of months after the Spark 3.0
release (?).

BTW, maybe we should officially note that the "Hadoop 2.7 + Hive 1.2.1 (fork)
+ JDK8 (default)" combination is deprecated anyway in Spark 3.



On Wed, Nov 20, 2019 at 9:52 AM Cheng Lian <li...@gmail.com> wrote:

> Thanks for taking care of this, Dongjoon!
>
> We can target SPARK-20202 to 3.1.0, but I don't think we should do it
> immediately after cutting the branch-3.0. The Hive 1.2 code paths can only
> be removed once the Hive 2.3 code paths are proven to be stable. If it
> turned out to be buggy in Spark 3.1, we may want to further postpone
> SPARK-20202 to 3.2.0 by then.
>
> On Tue, Nov 19, 2019 at 2:53 PM Dongjoon Hyun <do...@gmail.com>
> wrote:
>
>> Yes. It does. I meant SPARK-20202.
>>
>> Thanks. I understand that it can be considered like Scala version issue.
>> So, that's the reason why I put this as a `policy` issue from the
>> beginning.
>>
>> > First of all, I want to put this as a policy issue instead of a
>> technical issue.
>>
>> In the policy perspective, we should remove this immediately if we have a
>> solution to fix this.
>> For now, I set `Target Versions` of SPARK-20202 to `3.1.0` according to
>> the current discussion status.
>>
>>     https://issues.apache.org/jira/browse/SPARK-20202
>>
>> And, if there is no other issues, I'll create a PR to remove it from
>> `master` branch when we cut `branch-3.0`.
>>
>> For additional `hadoop-2.7 with Hive 2.3` pre-built distribution, how do
>> you think about this, Sean?
>> The preparation is already started in another email thread and I believe
>> that is a keystone to prove `Hive 2.3` version stability
>> (which Cheng/Hyukjin/you asked).
>>
>> Bests,
>> Dongjoon.
>>
>>
>> On Tue, Nov 19, 2019 at 2:09 PM Cheng Lian <li...@gmail.com> wrote:
>>
>>> It's kinda like Scala version upgrade. Historically, we only remove the
>>> support of an older Scala version when the newer version is proven to be
>>> stable after one or more Spark minor versions.
>>>
>>> On Tue, Nov 19, 2019 at 2:07 PM Cheng Lian <li...@gmail.com>
>>> wrote:
>>>
>>>> Hmm, what exactly did you mean by "remove the usage of forked `hive` in
>>>> Apache Spark 3.0 completely officially"? I thought you wanted to remove the
>>>> forked Hive 1.2 dependencies completely, no? As long as we still keep the
>>>> Hive 1.2 in Spark 3.0, I'm fine with that. I personally don't have a
>>>> particular preference between using Hive 1.2 or 2.3 as the default Hive
>>>> version. After all, for end-users and providers who need a particular
>>>> version combination, they can always build Spark with proper profiles
>>>> themselves.
>>>>
>>>> And thanks for clarifying the Hive 2.3.5 issue. I didn't notice that
>>>> it's due to the folder name.
>>>>
>>>> On Tue, Nov 19, 2019 at 11:15 AM Dongjoon Hyun <do...@gmail.com>
>>>> wrote:
>>>>
>>>>> BTW, `hive.version.short` is a directory name. We are using 2.3.6 only.
>>>>>
>>>>> For directory name, we use '1.2.1' and '2.3.5' because we just delayed
>>>>> the renaming the directories until 3.0.0 deadline to minimize the diff.
>>>>>
>>>>> We can replace it immediately if we want right now.
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Nov 19, 2019 at 11:11 AM Dongjoon Hyun <
>>>>> dongjoon.hyun@gmail.com> wrote:
>>>>>
>>>>>> Hi, Cheng.
>>>>>>
>>>>>> This is irrelevant to JDK11 and Hadoop 3. I'm talking about JDK8
>>>>>> world.
>>>>>> If we consider them, it could be the followings.
>>>>>>
>>>>>> +----------+-----------------+--------------------+
>>>>>> |          | Hive 1.2.1 fork |  Apache Hive 2.3.6 |
>>>>>> +-------------------------------------------------+
>>>>>> |Legitimate|        X        |         O          |
>>>>>> |JDK11     |        X        |         O          |
>>>>>> |Hadoop3   |        X        |         O          |
>>>>>> |Hadoop2   |        O        |         O          |
>>>>>> |Functions |     Baseline    |       More         |
>>>>>> |Bug fixes |     Baseline    |       More         |
>>>>>> +-------------------------------------------------+
>>>>>>
>>>>>> To stabilize Spark's Hive 2.3 usage, we should use it by ourselves
>>>>>> (including Jenkins/GitHubAction/AppVeyor).
>>>>>>
>>>>>> For me, AS-IS 3.0 is not enough for that. According to your advices,
>>>>>> to give more visibility to the whole community,
>>>>>>
>>>>>> 1. We need to give additional `hadoop-2.7 with Hive 2.3` pre-built
>>>>>> distribution
>>>>>> 2. We need to switch our default Hive usage to 2.3 in `master` for
>>>>>> 3.1 after `branch-3.0` branch cut.
>>>>>>
>>>>>> I know that we have been reluctant to (1) and (2) due to its burden.
>>>>>> But, it's time to prepare. Without them, we are going to be
>>>>>> insufficient again and again.
>>>>>>
>>>>>> Bests,
>>>>>> Dongjoon.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Nov 19, 2019 at 9:26 AM Cheng Lian <li...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Dongjoon, I'm with Hyukjin. There should be at least one Spark 3.x
>>>>>>> minor release to stabilize Hive 2.3 code paths before retiring the Hive 1.2
>>>>>>> fork. Even today, the Hive 2.3.6 version bundled in Spark 3.0 is still
>>>>>>> buggy in terms of JDK 11 support. (BTW, I just found that our root POM is
>>>>>>> referring both Hive 2.3.6 and 2.3.5 at the moment, see here
>>>>>>> <https://github.com/apache/spark/blob/6fb8b8606544f26dc2d9719a2d009eb5aea65ba2/pom.xml#L135>
>>>>>>> and here
>>>>>>> <https://github.com/apache/spark/blob/6fb8b8606544f26dc2d9719a2d009eb5aea65ba2/pom.xml#L2927>
>>>>>>> .)
>>>>>>>
>>>>>>> Again, I'm happy to get rid of ancient legacy dependencies like
>>>>>>> Hadoop 2.7 and the Hive 1.2 fork, but I do believe that we need a safety
>>>>>>> net for Spark 3.0. For preview releases, I'm afraid that their visibility
>>>>>>> is not good enough for covering such major upgrades.
>>>>>>>
>>>>>>> On Tue, Nov 19, 2019 at 8:39 AM Dongjoon Hyun <
>>>>>>> dongjoon.hyun@gmail.com> wrote:
>>>>>>>
>>>>>>>> Thank you for feedback, Hyujkjin and Sean.
>>>>>>>>
>>>>>>>> I proposed `preview-2` for that purpose but I'm also +1 for do that
>>>>>>>> at 3.1
>>>>>>>> if we can make a decision to eliminate the illegitimate Hive fork
>>>>>>>> reference
>>>>>>>> immediately after `branch-3.0` cut.
>>>>>>>>
>>>>>>>> Sean, I'm referencing Cheng Lian's email for the status of
>>>>>>>> `hadoop-2.7`.
>>>>>>>>
>>>>>>>> -
>>>>>>>> https://lists.apache.org/thread.html/623dd9a6d4e951daeec985feffede12c7b419e03c2965018de7a72f1@%3Cdev.spark.apache.org%3E
>>>>>>>>
>>>>>>>> The way I see this is that it's not a user problem. Apache Spark
>>>>>>>> community didn't try to drop the illegitimate Hive fork yet.
>>>>>>>> We need to drop it by ourselves because we created it and it's our
>>>>>>>> bad.
>>>>>>>>
>>>>>>>> Bests,
>>>>>>>> Dongjoon.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Nov 19, 2019 at 5:06 AM Sean Owen <sr...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Just to clarify, as even I have lost the details over time:
>>>>>>>>> hadoop-2.7
>>>>>>>>> works with hive-2.3? it isn't tied to hadoop-3.2?
>>>>>>>>> Roughly how much risk is there in using the Hive 1.x fork over Hive
>>>>>>>>> 2.x, for end users using Hive via Spark?
>>>>>>>>> I don't have a strong opinion, other than sharing the view that we
>>>>>>>>> have to dump the Hive 1.x fork at the first opportunity.
>>>>>>>>> Question is simply how much risk that entails. Keeping in mind that
>>>>>>>>> Spark 3.0 is already something that people understand works
>>>>>>>>> differently. We can accept some behavior changes.
>>>>>>>>>
>>>>>>>>> On Mon, Nov 18, 2019 at 11:11 PM Dongjoon Hyun <
>>>>>>>>> dongjoon.hyun@gmail.com> wrote:
>>>>>>>>> >
>>>>>>>>> > Hi, All.
>>>>>>>>> >
>>>>>>>>> > First of all, I want to put this as a policy issue instead of a
>>>>>>>>> technical issue.
>>>>>>>>> > Also, this is orthogonal from `hadoop` version discussion.
>>>>>>>>> >
>>>>>>>>> > Apache Spark community kept (not maintained) the forked Apache
>>>>>>>>> Hive
>>>>>>>>> > 1.2.1 because there has been no other options before. As we see
>>>>>>>>> at
>>>>>>>>> > SPARK-20202, it's not a desirable situation among the Apache
>>>>>>>>> projects.
>>>>>>>>> >
>>>>>>>>> >     https://issues.apache.org/jira/browse/SPARK-20202
>>>>>>>>> >
>>>>>>>>> > Also, please note that we `kept`, not `maintained`, because we
>>>>>>>>> know it's not good.
>>>>>>>>> > There are several attempt to update that forked repository
>>>>>>>>> > for several reasons (Hadoop 3 support is one of the example),
>>>>>>>>> > but those attempts are also turned down.
>>>>>>>>> >
>>>>>>>>> > From Apache Spark 3.0, it seems that we have a new feasible
>>>>>>>>> option
>>>>>>>>> > `hive-2.3` profile. What about moving forward in this direction
>>>>>>>>> further?
>>>>>>>>> >
>>>>>>>>> > For example, can we remove the usage of forked `hive` in Apache
>>>>>>>>> Spark 3.0
>>>>>>>>> > completely officially? If someone still needs to use the forked
>>>>>>>>> `hive`, we can
>>>>>>>>> > have a profile `hive-1.2`. Of course, it should not be a default
>>>>>>>>> profile in the community.
>>>>>>>>> >
>>>>>>>>> > I want to say this is a goal we should achieve someday.
>>>>>>>>> > If we don't do anything, nothing happen. At least we need to
>>>>>>>>> prepare this.
>>>>>>>>> > Without any preparation, Spark 3.1+ will be the same.
>>>>>>>>> >
>>>>>>>>> > Shall we focus on what are our problems with Hive 2.3.6?
>>>>>>>>> > If the only reason is that we didn't use it before, we can
>>>>>>>>> release another
>>>>>>>>> > `3.0.0-preview` for that.
>>>>>>>>> >
>>>>>>>>> > Bests,
>>>>>>>>> > Dongjoon.
>>>>>>>>>
>>>>>>>>

Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

Posted by Cheng Lian <li...@gmail.com>.
Thanks for taking care of this, Dongjoon!

We can target SPARK-20202 to 3.1.0, but I don't think we should do it
immediately after cutting the branch-3.0. The Hive 1.2 code paths can only
be removed once the Hive 2.3 code paths are proven to be stable. If they
turn out to be buggy in Spark 3.1, we may want to further postpone
SPARK-20202 to 3.2.0 by then.

On Tue, Nov 19, 2019 at 2:53 PM Dongjoon Hyun <do...@gmail.com>
wrote:

> Yes. It does. I meant SPARK-20202.
>
> Thanks. I understand that it can be considered like Scala version issue.
> So, that's the reason why I put this as a `policy` issue from the
> beginning.
>
> > First of all, I want to put this as a policy issue instead of a
> technical issue.
>
> In the policy perspective, we should remove this immediately if we have a
> solution to fix this.
> For now, I set `Target Versions` of SPARK-20202 to `3.1.0` according to
> the current discussion status.
>
>     https://issues.apache.org/jira/browse/SPARK-20202
>
> And, if there is no other issues, I'll create a PR to remove it from
> `master` branch when we cut `branch-3.0`.
>
> For additional `hadoop-2.7 with Hive 2.3` pre-built distribution, how do
> you think about this, Sean?
> The preparation is already started in another email thread and I believe
> that is a keystone to prove `Hive 2.3` version stability
> (which Cheng/Hyukjin/you asked).
>
> Bests,
> Dongjoon.
>
>
> On Tue, Nov 19, 2019 at 2:09 PM Cheng Lian <li...@gmail.com> wrote:
>
>> It's kinda like Scala version upgrade. Historically, we only remove the
>> support of an older Scala version when the newer version is proven to be
>> stable after one or more Spark minor versions.
>>
>> On Tue, Nov 19, 2019 at 2:07 PM Cheng Lian <li...@gmail.com> wrote:
>>
>>> Hmm, what exactly did you mean by "remove the usage of forked `hive` in
>>> Apache Spark 3.0 completely officially"? I thought you wanted to remove the
>>> forked Hive 1.2 dependencies completely, no? As long as we still keep the
>>> Hive 1.2 in Spark 3.0, I'm fine with that. I personally don't have a
>>> particular preference between using Hive 1.2 or 2.3 as the default Hive
>>> version. After all, for end-users and providers who need a particular
>>> version combination, they can always build Spark with proper profiles
>>> themselves.
>>>
>>> And thanks for clarifying the Hive 2.3.5 issue. I didn't notice that
>>> it's due to the folder name.
>>>
>>> On Tue, Nov 19, 2019 at 11:15 AM Dongjoon Hyun <do...@gmail.com>
>>> wrote:
>>>
>>>> BTW, `hive.version.short` is a directory name. We are using 2.3.6 only.
>>>>
>>>> For directory name, we use '1.2.1' and '2.3.5' because we just delayed
>>>> the renaming the directories until 3.0.0 deadline to minimize the diff.
>>>>
>>>> We can replace it immediately if we want right now.
>>>>
>>>>
>>>>
>>>> On Tue, Nov 19, 2019 at 11:11 AM Dongjoon Hyun <do...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi, Cheng.
>>>>>
>>>>> This is irrelevant to JDK11 and Hadoop 3. I'm talking about JDK8 world.
>>>>> If we consider them, it could be the followings.
>>>>>
>>>>> +----------+-----------------+--------------------+
>>>>> |          | Hive 1.2.1 fork |  Apache Hive 2.3.6 |
>>>>> +-------------------------------------------------+
>>>>> |Legitimate|        X        |         O          |
>>>>> |JDK11     |        X        |         O          |
>>>>> |Hadoop3   |        X        |         O          |
>>>>> |Hadoop2   |        O        |         O          |
>>>>> |Functions |     Baseline    |       More         |
>>>>> |Bug fixes |     Baseline    |       More         |
>>>>> +-------------------------------------------------+
>>>>>
>>>>> To stabilize Spark's Hive 2.3 usage, we should use it by ourselves
>>>>> (including Jenkins/GitHubAction/AppVeyor).
>>>>>
>>>>> For me, AS-IS 3.0 is not enough for that. According to your advices,
>>>>> to give more visibility to the whole community,
>>>>>
>>>>> 1. We need to give additional `hadoop-2.7 with Hive 2.3` pre-built
>>>>> distribution
>>>>> 2. We need to switch our default Hive usage to 2.3 in `master` for 3.1
>>>>> after `branch-3.0` branch cut.
>>>>>
>>>>> I know that we have been reluctant to (1) and (2) due to its burden.
>>>>> But, it's time to prepare. Without them, we are going to be
>>>>> insufficient again and again.
>>>>>
>>>>> Bests,
>>>>> Dongjoon.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Nov 19, 2019 at 9:26 AM Cheng Lian <li...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Dongjoon, I'm with Hyukjin. There should be at least one Spark 3.x
>>>>>> minor release to stabilize Hive 2.3 code paths before retiring the Hive 1.2
>>>>>> fork. Even today, the Hive 2.3.6 version bundled in Spark 3.0 is still
>>>>>> buggy in terms of JDK 11 support. (BTW, I just found that our root POM is
>>>>>> referring both Hive 2.3.6 and 2.3.5 at the moment, see here
>>>>>> <https://github.com/apache/spark/blob/6fb8b8606544f26dc2d9719a2d009eb5aea65ba2/pom.xml#L135>
>>>>>> and here
>>>>>> <https://github.com/apache/spark/blob/6fb8b8606544f26dc2d9719a2d009eb5aea65ba2/pom.xml#L2927>
>>>>>> .)
>>>>>>
>>>>>> Again, I'm happy to get rid of ancient legacy dependencies like
>>>>>> Hadoop 2.7 and the Hive 1.2 fork, but I do believe that we need a safety
>>>>>> net for Spark 3.0. For preview releases, I'm afraid that their visibility
>>>>>> is not good enough for covering such major upgrades.
>>>>>>
>>>>>> On Tue, Nov 19, 2019 at 8:39 AM Dongjoon Hyun <
>>>>>> dongjoon.hyun@gmail.com> wrote:
>>>>>>
>>>>>>> Thank you for feedback, Hyujkjin and Sean.
>>>>>>>
>>>>>>> I proposed `preview-2` for that purpose but I'm also +1 for do that
>>>>>>> at 3.1
>>>>>>> if we can make a decision to eliminate the illegitimate Hive fork
>>>>>>> reference
>>>>>>> immediately after `branch-3.0` cut.
>>>>>>>
>>>>>>> Sean, I'm referencing Cheng Lian's email for the status of
>>>>>>> `hadoop-2.7`.
>>>>>>>
>>>>>>> -
>>>>>>> https://lists.apache.org/thread.html/623dd9a6d4e951daeec985feffede12c7b419e03c2965018de7a72f1@%3Cdev.spark.apache.org%3E
>>>>>>>
>>>>>>> The way I see this is that it's not a user problem. Apache Spark
>>>>>>> community didn't try to drop the illegitimate Hive fork yet.
>>>>>>> We need to drop it by ourselves because we created it and it's our
>>>>>>> bad.
>>>>>>>
>>>>>>> Bests,
>>>>>>> Dongjoon.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Nov 19, 2019 at 5:06 AM Sean Owen <sr...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Just to clarify, as even I have lost the details over time:
>>>>>>>> hadoop-2.7
>>>>>>>> works with hive-2.3? it isn't tied to hadoop-3.2?
>>>>>>>> Roughly how much risk is there in using the Hive 1.x fork over Hive
>>>>>>>> 2.x, for end users using Hive via Spark?
>>>>>>>> I don't have a strong opinion, other than sharing the view that we
>>>>>>>> have to dump the Hive 1.x fork at the first opportunity.
>>>>>>>> Question is simply how much risk that entails. Keeping in mind that
>>>>>>>> Spark 3.0 is already something that people understand works
>>>>>>>> differently. We can accept some behavior changes.
>>>>>>>>
>>>>>>>> On Mon, Nov 18, 2019 at 11:11 PM Dongjoon Hyun <
>>>>>>>> dongjoon.hyun@gmail.com> wrote:
>>>>>>>> >
>>>>>>>> > Hi, All.
>>>>>>>> >
>>>>>>>> > First of all, I want to put this as a policy issue instead of a
>>>>>>>> technical issue.
>>>>>>>> > Also, this is orthogonal from `hadoop` version discussion.
>>>>>>>> >
>>>>>>>> > Apache Spark community kept (not maintained) the forked Apache
>>>>>>>> Hive
>>>>>>>> > 1.2.1 because there has been no other options before. As we see at
>>>>>>>> > SPARK-20202, it's not a desirable situation among the Apache
>>>>>>>> projects.
>>>>>>>> >
>>>>>>>> >     https://issues.apache.org/jira/browse/SPARK-20202
>>>>>>>> >
>>>>>>>> > Also, please note that we `kept`, not `maintained`, because we
>>>>>>>> know it's not good.
>>>>>>>> > There are several attempt to update that forked repository
>>>>>>>> > for several reasons (Hadoop 3 support is one of the example),
>>>>>>>> > but those attempts are also turned down.
>>>>>>>> >
>>>>>>>> > From Apache Spark 3.0, it seems that we have a new feasible option
>>>>>>>> > `hive-2.3` profile. What about moving forward in this direction
>>>>>>>> further?
>>>>>>>> >
>>>>>>>> > For example, can we remove the usage of forked `hive` in Apache
>>>>>>>> Spark 3.0
>>>>>>>> > completely officially? If someone still needs to use the forked
>>>>>>>> `hive`, we can
>>>>>>>> > have a profile `hive-1.2`. Of course, it should not be a default
>>>>>>>> profile in the community.
>>>>>>>> >
>>>>>>>> > I want to say this is a goal we should achieve someday.
>>>>>>>> > If we don't do anything, nothing happen. At least we need to
>>>>>>>> prepare this.
>>>>>>>> > Without any preparation, Spark 3.1+ will be the same.
>>>>>>>> >
>>>>>>>> > Shall we focus on what are our problems with Hive 2.3.6?
>>>>>>>> > If the only reason is that we didn't use it before, we can
>>>>>>>> release another
>>>>>>>> > `3.0.0-preview` for that.
>>>>>>>> >
>>>>>>>> > Bests,
>>>>>>>> > Dongjoon.
>>>>>>>>
>>>>>>>

Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

Posted by Sean Owen <sr...@gmail.com>.
Same idea? Support this combo in 3.0 and then remove Hadoop 2 support
in 3.1 or something? Or at least make them non-default, without
necessarily publishing special builds?

On Tue, Nov 19, 2019 at 4:53 PM Dongjoon Hyun <do...@gmail.com> wrote:
> For additional `hadoop-2.7 with Hive 2.3` pre-built distribution, how do you think about this, Sean?
> The preparation is already started in another email thread and I believe that is a keystone to prove `Hive 2.3` version stability
> (which Cheng/Hyukjin/you asked).
>


Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

Posted by Dongjoon Hyun <do...@gmail.com>.
Yes. It does. I meant SPARK-20202.

Thanks. I understand that it can be considered like the Scala version issue.
So, that's why I put this as a `policy` issue from the beginning.

> First of all, I want to put this as a policy issue instead of a technical
issue.

From the policy perspective, we should remove this immediately if we have a
solution to fix this.
For now, I set `Target Versions` of SPARK-20202 to `3.1.0` according to the
current discussion status.

    https://issues.apache.org/jira/browse/SPARK-20202

And, if there are no other issues, I'll create a PR to remove it from the
`master` branch when we cut `branch-3.0`.

As for the additional `hadoop-2.7 with Hive 2.3` pre-built distribution, what
do you think about this, Sean?
The preparation has already started in another email thread, and I believe
that is a keystone to proving `Hive 2.3` version stability
(which Cheng/Hyukjin/you asked for).

Bests,
Dongjoon.


On Tue, Nov 19, 2019 at 2:09 PM Cheng Lian <li...@gmail.com> wrote:

> It's kinda like Scala version upgrade. Historically, we only remove the
> support of an older Scala version when the newer version is proven to be
> stable after one or more Spark minor versions.
>
> On Tue, Nov 19, 2019 at 2:07 PM Cheng Lian <li...@gmail.com> wrote:
>
>> Hmm, what exactly did you mean by "remove the usage of forked `hive` in
>> Apache Spark 3.0 completely officially"? I thought you wanted to remove the
>> forked Hive 1.2 dependencies completely, no? As long as we still keep the
>> Hive 1.2 in Spark 3.0, I'm fine with that. I personally don't have a
>> particular preference between using Hive 1.2 or 2.3 as the default Hive
>> version. After all, for end-users and providers who need a particular
>> version combination, they can always build Spark with proper profiles
>> themselves.
>>
>> And thanks for clarifying the Hive 2.3.5 issue. I didn't notice that it's
>> due to the folder name.
>>
>> On Tue, Nov 19, 2019 at 11:15 AM Dongjoon Hyun <do...@gmail.com>
>> wrote:
>>
>>> BTW, `hive.version.short` is a directory name. We are using 2.3.6 only.
>>>
>>> For directory name, we use '1.2.1' and '2.3.5' because we just delayed
>>> the renaming the directories until 3.0.0 deadline to minimize the diff.
>>>
>>> We can replace it immediately if we want right now.
>>>
>>>
>>>
>>> On Tue, Nov 19, 2019 at 11:11 AM Dongjoon Hyun <do...@gmail.com>
>>> wrote:
>>>
>>>> Hi, Cheng.
>>>>
>>>> This is irrelevant to JDK11 and Hadoop 3. I'm talking about JDK8 world.
>>>> If we consider them, it could be the followings.
>>>>
>>>> +----------+-----------------+--------------------+
>>>> |          | Hive 1.2.1 fork |  Apache Hive 2.3.6 |
>>>> +-------------------------------------------------+
>>>> |Legitimate|        X        |         O          |
>>>> |JDK11     |        X        |         O          |
>>>> |Hadoop3   |        X        |         O          |
>>>> |Hadoop2   |        O        |         O          |
>>>> |Functions |     Baseline    |       More         |
>>>> |Bug fixes |     Baseline    |       More         |
>>>> +-------------------------------------------------+
>>>>
>>>> To stabilize Spark's Hive 2.3 usage, we should use it by ourselves
>>>> (including Jenkins/GitHubAction/AppVeyor).
>>>>
>>>> For me, AS-IS 3.0 is not enough for that. According to your advices,
>>>> to give more visibility to the whole community,
>>>>
>>>> 1. We need to give additional `hadoop-2.7 with Hive 2.3` pre-built
>>>> distribution
>>>> 2. We need to switch our default Hive usage to 2.3 in `master` for 3.1
>>>> after `branch-3.0` branch cut.
>>>>
>>>> I know that we have been reluctant to (1) and (2) due to its burden.
>>>> But, it's time to prepare. Without them, we are going to be
>>>> insufficient again and again.
>>>>
>>>> Bests,
>>>> Dongjoon.
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, Nov 19, 2019 at 9:26 AM Cheng Lian <li...@gmail.com>
>>>> wrote:
>>>>
>>>>> Dongjoon, I'm with Hyukjin. There should be at least one Spark 3.x
>>>>> minor release to stabilize Hive 2.3 code paths before retiring the Hive 1.2
>>>>> fork. Even today, the Hive 2.3.6 version bundled in Spark 3.0 is still
>>>>> buggy in terms of JDK 11 support. (BTW, I just found that our root POM is
>>>>> referring both Hive 2.3.6 and 2.3.5 at the moment, see here
>>>>> <https://github.com/apache/spark/blob/6fb8b8606544f26dc2d9719a2d009eb5aea65ba2/pom.xml#L135>
>>>>> and here
>>>>> <https://github.com/apache/spark/blob/6fb8b8606544f26dc2d9719a2d009eb5aea65ba2/pom.xml#L2927>
>>>>> .)
>>>>>
>>>>> Again, I'm happy to get rid of ancient legacy dependencies like Hadoop
>>>>> 2.7 and the Hive 1.2 fork, but I do believe that we need a safety net for
>>>>> Spark 3.0. For preview releases, I'm afraid that their visibility is not
>>>>> good enough for covering such major upgrades.
>>>>>
>>>>> On Tue, Nov 19, 2019 at 8:39 AM Dongjoon Hyun <do...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Thank you for feedback, Hyujkjin and Sean.
>>>>>>
>>>>>> I proposed `preview-2` for that purpose but I'm also +1 for do that
>>>>>> at 3.1
>>>>>> if we can make a decision to eliminate the illegitimate Hive fork
>>>>>> reference
>>>>>> immediately after `branch-3.0` cut.
>>>>>>
>>>>>> Sean, I'm referencing Cheng Lian's email for the status of
>>>>>> `hadoop-2.7`.
>>>>>>
>>>>>> -
>>>>>> https://lists.apache.org/thread.html/623dd9a6d4e951daeec985feffede12c7b419e03c2965018de7a72f1@%3Cdev.spark.apache.org%3E
>>>>>>
>>>>>> The way I see this is that it's not a user problem. Apache Spark
>>>>>> community didn't try to drop the illegitimate Hive fork yet.
>>>>>> We need to drop it by ourselves because we created it and it's our
>>>>>> bad.
>>>>>>
>>>>>> Bests,
>>>>>> Dongjoon.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Nov 19, 2019 at 5:06 AM Sean Owen <sr...@gmail.com> wrote:
>>>>>>
>>>>>>> Just to clarify, as even I have lost the details over time:
>>>>>>> hadoop-2.7
>>>>>>> works with hive-2.3? it isn't tied to hadoop-3.2?
>>>>>>> Roughly how much risk is there in using the Hive 1.x fork over Hive
>>>>>>> 2.x, for end users using Hive via Spark?
>>>>>>> I don't have a strong opinion, other than sharing the view that we
>>>>>>> have to dump the Hive 1.x fork at the first opportunity.
>>>>>>> Question is simply how much risk that entails. Keeping in mind that
>>>>>>> Spark 3.0 is already something that people understand works
>>>>>>> differently. We can accept some behavior changes.
>>>>>>>
>>>>>>> On Mon, Nov 18, 2019 at 11:11 PM Dongjoon Hyun <
>>>>>>> dongjoon.hyun@gmail.com> wrote:
>>>>>>> >
>>>>>>> > Hi, All.
>>>>>>> >
>>>>>>> > First of all, I want to put this as a policy issue instead of a
>>>>>>> technical issue.
>>>>>>> > Also, this is orthogonal from `hadoop` version discussion.
>>>>>>> >
>>>>>>> > Apache Spark community kept (not maintained) the forked Apache Hive
>>>>>>> > 1.2.1 because there has been no other options before. As we see at
>>>>>>> > SPARK-20202, it's not a desirable situation among the Apache
>>>>>>> projects.
>>>>>>> >
>>>>>>> >     https://issues.apache.org/jira/browse/SPARK-20202
>>>>>>> >
>>>>>>> > Also, please note that we `kept`, not `maintained`, because we
>>>>>>> know it's not good.
>>>>>>> > There are several attempt to update that forked repository
>>>>>>> > for several reasons (Hadoop 3 support is one of the example),
>>>>>>> > but those attempts are also turned down.
>>>>>>> >
>>>>>>> > From Apache Spark 3.0, it seems that we have a new feasible option
>>>>>>> > `hive-2.3` profile. What about moving forward in this direction
>>>>>>> further?
>>>>>>> >
>>>>>>> > For example, can we remove the usage of forked `hive` in Apache
>>>>>>> Spark 3.0
>>>>>>> > completely officially? If someone still needs to use the forked
>>>>>>> `hive`, we can
>>>>>>> > have a profile `hive-1.2`. Of course, it should not be a default
>>>>>>> profile in the community.
>>>>>>> >
>>>>>>> > I want to say this is a goal we should achieve someday.
>>>>>>> > If we don't do anything, nothing happen. At least we need to
>>>>>>> prepare this.
>>>>>>> > Without any preparation, Spark 3.1+ will be the same.
>>>>>>> >
>>>>>>> > Shall we focus on what are our problems with Hive 2.3.6?
>>>>>>> > If the only reason is that we didn't use it before, we can release
>>>>>>> another
>>>>>>> > `3.0.0-preview` for that.
>>>>>>> >
>>>>>>> > Bests,
>>>>>>> > Dongjoon.
>>>>>>>
>>>>>>

Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

Posted by Cheng Lian <li...@gmail.com>.
It's kinda like a Scala version upgrade. Historically, we only remove
support for an older Scala version when the newer version has proven to be
stable after one or more Spark minor versions.

On Tue, Nov 19, 2019 at 2:07 PM Cheng Lian <li...@gmail.com> wrote:

> Hmm, what exactly did you mean by "remove the usage of forked `hive` in
> Apache Spark 3.0 completely officially"? I thought you wanted to remove the
> forked Hive 1.2 dependencies completely, no? As long as we still keep the
> Hive 1.2 in Spark 3.0, I'm fine with that. I personally don't have a
> particular preference between using Hive 1.2 or 2.3 as the default Hive
> version. After all, for end-users and providers who need a particular
> version combination, they can always build Spark with proper profiles
> themselves.
>
> And thanks for clarifying the Hive 2.3.5 issue. I didn't notice that it's
> due to the folder name.
>
> On Tue, Nov 19, 2019 at 11:15 AM Dongjoon Hyun <do...@gmail.com>
> wrote:
>
>> BTW, `hive.version.short` is a directory name. We are using 2.3.6 only.
>>
>> For directory name, we use '1.2.1' and '2.3.5' because we just delayed
>> the renaming the directories until 3.0.0 deadline to minimize the diff.
>>
>> We can replace it immediately if we want right now.
>>
>>
>>
>> On Tue, Nov 19, 2019 at 11:11 AM Dongjoon Hyun <do...@gmail.com>
>> wrote:
>>
>>> Hi, Cheng.
>>>
>>> This is irrelevant to JDK11 and Hadoop 3. I'm talking about JDK8 world.
>>> If we consider them, it could be the followings.
>>>
>>> +----------+-----------------+--------------------+
>>> |          | Hive 1.2.1 fork |  Apache Hive 2.3.6 |
>>> +-------------------------------------------------+
>>> |Legitimate|        X        |         O          |
>>> |JDK11     |        X        |         O          |
>>> |Hadoop3   |        X        |         O          |
>>> |Hadoop2   |        O        |         O          |
>>> |Functions |     Baseline    |       More         |
>>> |Bug fixes |     Baseline    |       More         |
>>> +-------------------------------------------------+
>>>
>>> To stabilize Spark's Hive 2.3 usage, we should use it by ourselves
>>> (including Jenkins/GitHubAction/AppVeyor).
>>>
>>> For me, AS-IS 3.0 is not enough for that. According to your advices,
>>> to give more visibility to the whole community,
>>>
>>> 1. We need to give additional `hadoop-2.7 with Hive 2.3` pre-built
>>> distribution
>>> 2. We need to switch our default Hive usage to 2.3 in `master` for 3.1
>>> after `branch-3.0` branch cut.
>>>
>>> I know that we have been reluctant to (1) and (2) due to its burden.
>>> But, it's time to prepare. Without them, we are going to be insufficient
>>> again and again.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>>
>>>
>>> On Tue, Nov 19, 2019 at 9:26 AM Cheng Lian <li...@gmail.com>
>>> wrote:
>>>
>>>> Dongjoon, I'm with Hyukjin. There should be at least one Spark 3.x
>>>> minor release to stabilize Hive 2.3 code paths before retiring the Hive 1.2
>>>> fork. Even today, the Hive 2.3.6 version bundled in Spark 3.0 is still
>>>> buggy in terms of JDK 11 support. (BTW, I just found that our root POM is
>>>> referring both Hive 2.3.6 and 2.3.5 at the moment, see here
>>>> <https://github.com/apache/spark/blob/6fb8b8606544f26dc2d9719a2d009eb5aea65ba2/pom.xml#L135>
>>>> and here
>>>> <https://github.com/apache/spark/blob/6fb8b8606544f26dc2d9719a2d009eb5aea65ba2/pom.xml#L2927>
>>>> .)
>>>>
>>>> Again, I'm happy to get rid of ancient legacy dependencies like Hadoop
>>>> 2.7 and the Hive 1.2 fork, but I do believe that we need a safety net for
>>>> Spark 3.0. For preview releases, I'm afraid that their visibility is not
>>>> good enough for covering such major upgrades.
>>>>
>>>> On Tue, Nov 19, 2019 at 8:39 AM Dongjoon Hyun <do...@gmail.com>
>>>> wrote:
>>>>
>>>>> Thank you for feedback, Hyujkjin and Sean.
>>>>>
>>>>> I proposed `preview-2` for that purpose but I'm also +1 for do that at
>>>>> 3.1
>>>>> if we can make a decision to eliminate the illegitimate Hive fork
>>>>> reference
>>>>> immediately after `branch-3.0` cut.
>>>>>
>>>>> Sean, I'm referencing Cheng Lian's email for the status of
>>>>> `hadoop-2.7`.
>>>>>
>>>>> -
>>>>> https://lists.apache.org/thread.html/623dd9a6d4e951daeec985feffede12c7b419e03c2965018de7a72f1@%3Cdev.spark.apache.org%3E
>>>>>
>>>>> The way I see this is that it's not a user problem. Apache Spark
>>>>> community didn't try to drop the illegitimate Hive fork yet.
>>>>> We need to drop it by ourselves because we created it and it's our bad.
>>>>>
>>>>> Bests,
>>>>> Dongjoon.
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Nov 19, 2019 at 5:06 AM Sean Owen <sr...@gmail.com> wrote:
>>>>>
>>>>>> Just to clarify, as even I have lost the details over time: hadoop-2.7
>>>>>> works with hive-2.3? it isn't tied to hadoop-3.2?
>>>>>> Roughly how much risk is there in using the Hive 1.x fork over Hive
>>>>>> 2.x, for end users using Hive via Spark?
>>>>>> I don't have a strong opinion, other than sharing the view that we
>>>>>> have to dump the Hive 1.x fork at the first opportunity.
>>>>>> Question is simply how much risk that entails. Keeping in mind that
>>>>>> Spark 3.0 is already something that people understand works
>>>>>> differently. We can accept some behavior changes.
>>>>>>
>>>>>> On Mon, Nov 18, 2019 at 11:11 PM Dongjoon Hyun <
>>>>>> dongjoon.hyun@gmail.com> wrote:
>>>>>> >
>>>>>> > Hi, All.
>>>>>> >
>>>>>> > First of all, I want to put this as a policy issue instead of a
>>>>>> technical issue.
>>>>>> > Also, this is orthogonal from `hadoop` version discussion.
>>>>>> >
>>>>>> > Apache Spark community kept (not maintained) the forked Apache Hive
>>>>>> > 1.2.1 because there has been no other options before. As we see at
>>>>>> > SPARK-20202, it's not a desirable situation among the Apache
>>>>>> projects.
>>>>>> >
>>>>>> >     https://issues.apache.org/jira/browse/SPARK-20202
>>>>>> >
>>>>>> > Also, please note that we `kept`, not `maintained`, because we know
>>>>>> it's not good.
>>>>>> > There are several attempt to update that forked repository
>>>>>> > for several reasons (Hadoop 3 support is one of the example),
>>>>>> > but those attempts are also turned down.
>>>>>> >
>>>>>> > From Apache Spark 3.0, it seems that we have a new feasible option
>>>>>> > `hive-2.3` profile. What about moving forward in this direction
>>>>>> further?
>>>>>> >
>>>>>> > For example, can we remove the usage of forked `hive` in Apache
>>>>>> Spark 3.0
>>>>>> > completely officially? If someone still needs to use the forked
>>>>>> `hive`, we can
>>>>>> > have a profile `hive-1.2`. Of course, it should not be a default
>>>>>> profile in the community.
>>>>>> >
>>>>>> > I want to say this is a goal we should achieve someday.
>>>>>> > If we don't do anything, nothing happen. At least we need to
>>>>>> prepare this.
>>>>>> > Without any preparation, Spark 3.1+ will be the same.
>>>>>> >
>>>>>> > Shall we focus on what are our problems with Hive 2.3.6?
>>>>>> > If the only reason is that we didn't use it before, we can release
>>>>>> another
>>>>>> > `3.0.0-preview` for that.
>>>>>> >
>>>>>> > Bests,
>>>>>> > Dongjoon.
>>>>>>
>>>>>

Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

Posted by Cheng Lian <li...@gmail.com>.
Hmm, what exactly did you mean by "remove the usage of forked `hive` in
Apache Spark 3.0 completely officially"? I thought you wanted to remove the
forked Hive 1.2 dependencies completely, no? As long as we still keep
Hive 1.2 in Spark 3.0, I'm fine with that. I personally don't have a
particular preference between using Hive 1.2 or 2.3 as the default Hive
version. After all, end-users and providers who need a particular
version combination can always build Spark with the proper profiles
themselves.

And thanks for clarifying the Hive 2.3.5 issue. I didn't notice that it's
due to the folder name.

On Tue, Nov 19, 2019 at 11:15 AM Dongjoon Hyun <do...@gmail.com>
wrote:

> BTW, `hive.version.short` is a directory name. We are using 2.3.6 only.
>
> For directory name, we use '1.2.1' and '2.3.5' because we just delayed the
> renaming the directories until 3.0.0 deadline to minimize the diff.
>
> We can replace it immediately if we want right now.
>
>
>
> On Tue, Nov 19, 2019 at 11:11 AM Dongjoon Hyun <do...@gmail.com>
> wrote:
>
>> Hi, Cheng.
>>
>> This is irrelevant to JDK11 and Hadoop 3. I'm talking about JDK8 world.
>> If we consider them, it could be the followings.
>>
>> +----------+-----------------+--------------------+
>> |          | Hive 1.2.1 fork |  Apache Hive 2.3.6 |
>> +-------------------------------------------------+
>> |Legitimate|        X        |         O          |
>> |JDK11     |        X        |         O          |
>> |Hadoop3   |        X        |         O          |
>> |Hadoop2   |        O        |         O          |
>> |Functions |     Baseline    |       More         |
>> |Bug fixes |     Baseline    |       More         |
>> +-------------------------------------------------+
>>
>> To stabilize Spark's Hive 2.3 usage, we should use it by ourselves
>> (including Jenkins/GitHubAction/AppVeyor).
>>
>> For me, AS-IS 3.0 is not enough for that. According to your advices,
>> to give more visibility to the whole community,
>>
>> 1. We need to give additional `hadoop-2.7 with Hive 2.3` pre-built
>> distribution
>> 2. We need to switch our default Hive usage to 2.3 in `master` for 3.1
>> after `branch-3.0` branch cut.
>>
>> I know that we have been reluctant to (1) and (2) due to its burden.
>> But, it's time to prepare. Without them, we are going to be insufficient
>> again and again.
>>
>> Bests,
>> Dongjoon.
>>
>>
>>
>>
>> On Tue, Nov 19, 2019 at 9:26 AM Cheng Lian <li...@gmail.com> wrote:
>>
>>> Dongjoon, I'm with Hyukjin. There should be at least one Spark 3.x minor
>>> release to stabilize Hive 2.3 code paths before retiring the Hive 1.2
>>> fork. Even today, the Hive 2.3.6 version bundled in Spark 3.0 is still
>>> buggy in terms of JDK 11 support. (BTW, I just found that our root POM is
>>> referring both Hive 2.3.6 and 2.3.5 at the moment, see here
>>> <https://github.com/apache/spark/blob/6fb8b8606544f26dc2d9719a2d009eb5aea65ba2/pom.xml#L135>
>>> and here
>>> <https://github.com/apache/spark/blob/6fb8b8606544f26dc2d9719a2d009eb5aea65ba2/pom.xml#L2927>
>>> .)
>>>
>>> Again, I'm happy to get rid of ancient legacy dependencies like Hadoop
>>> 2.7 and the Hive 1.2 fork, but I do believe that we need a safety net for
>>> Spark 3.0. For preview releases, I'm afraid that their visibility is not
>>> good enough for covering such major upgrades.
>>>
>>> On Tue, Nov 19, 2019 at 8:39 AM Dongjoon Hyun <do...@gmail.com>
>>> wrote:
>>>
>>>> Thank you for feedback, Hyujkjin and Sean.
>>>>
>>>> I proposed `preview-2` for that purpose but I'm also +1 for do that at
>>>> 3.1
>>>> if we can make a decision to eliminate the illegitimate Hive fork
>>>> reference
>>>> immediately after `branch-3.0` cut.
>>>>
>>>> Sean, I'm referencing Cheng Lian's email for the status of `hadoop-2.7`.
>>>>
>>>> -
>>>> https://lists.apache.org/thread.html/623dd9a6d4e951daeec985feffede12c7b419e03c2965018de7a72f1@%3Cdev.spark.apache.org%3E
>>>>
>>>> The way I see this is that it's not a user problem. Apache Spark
>>>> community didn't try to drop the illegitimate Hive fork yet.
>>>> We need to drop it by ourselves because we created it and it's our bad.
>>>>
>>>> Bests,
>>>> Dongjoon.
>>>>
>>>>
>>>>
>>>> On Tue, Nov 19, 2019 at 5:06 AM Sean Owen <sr...@gmail.com> wrote:
>>>>
>>>>> Just to clarify, as even I have lost the details over time: hadoop-2.7
>>>>> works with hive-2.3? it isn't tied to hadoop-3.2?
>>>>> Roughly how much risk is there in using the Hive 1.x fork over Hive
>>>>> 2.x, for end users using Hive via Spark?
>>>>> I don't have a strong opinion, other than sharing the view that we
>>>>> have to dump the Hive 1.x fork at the first opportunity.
>>>>> Question is simply how much risk that entails. Keeping in mind that
>>>>> Spark 3.0 is already something that people understand works
>>>>> differently. We can accept some behavior changes.
>>>>>
>>>>> On Mon, Nov 18, 2019 at 11:11 PM Dongjoon Hyun <
>>>>> dongjoon.hyun@gmail.com> wrote:
>>>>> >
>>>>> > Hi, All.
>>>>> >
>>>>> > First of all, I want to put this as a policy issue instead of a
>>>>> technical issue.
>>>>> > Also, this is orthogonal from `hadoop` version discussion.
>>>>> >
>>>>> > Apache Spark community kept (not maintained) the forked Apache Hive
>>>>> > 1.2.1 because there has been no other options before. As we see at
>>>>> > SPARK-20202, it's not a desirable situation among the Apache
>>>>> projects.
>>>>> >
>>>>> >     https://issues.apache.org/jira/browse/SPARK-20202
>>>>> >
>>>>> > Also, please note that we `kept`, not `maintained`, because we know
>>>>> it's not good.
>>>>> > There are several attempt to update that forked repository
>>>>> > for several reasons (Hadoop 3 support is one of the example),
>>>>> > but those attempts are also turned down.
>>>>> >
>>>>> > From Apache Spark 3.0, it seems that we have a new feasible option
>>>>> > `hive-2.3` profile. What about moving forward in this direction
>>>>> further?
>>>>> >
>>>>> > For example, can we remove the usage of forked `hive` in Apache
>>>>> Spark 3.0
>>>>> > completely officially? If someone still needs to use the forked
>>>>> `hive`, we can
>>>>> > have a profile `hive-1.2`. Of course, it should not be a default
>>>>> profile in the community.
>>>>> >
>>>>> > I want to say this is a goal we should achieve someday.
>>>>> > If we don't do anything, nothing happen. At least we need to prepare
>>>>> this.
>>>>> > Without any preparation, Spark 3.1+ will be the same.
>>>>> >
>>>>> > Shall we focus on what are our problems with Hive 2.3.6?
>>>>> > If the only reason is that we didn't use it before, we can release
>>>>> another
>>>>> > `3.0.0-preview` for that.
>>>>> >
>>>>> > Bests,
>>>>> > Dongjoon.
>>>>>
>>>>

Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

Posted by Dongjoon Hyun <do...@gmail.com>.
BTW, `hive.version.short` is only a directory name. We are using 2.3.6 only.

For the directory names, we use '1.2.1' and '2.3.5' because we simply
delayed renaming the directories until the 3.0.0 deadline to minimize the
diff.

We can rename them right now if we want to.



On Tue, Nov 19, 2019 at 11:11 AM Dongjoon Hyun <do...@gmail.com>
wrote:

> Hi, Cheng.
>
> This is irrelevant to JDK11 and Hadoop 3. I'm talking about JDK8 world.
> If we consider them, it could be the followings.
>
> +----------+-----------------+--------------------+
> |          | Hive 1.2.1 fork |  Apache Hive 2.3.6 |
> +-------------------------------------------------+
> |Legitimate|        X        |         O          |
> |JDK11     |        X        |         O          |
> |Hadoop3   |        X        |         O          |
> |Hadoop2   |        O        |         O          |
> |Functions |     Baseline    |       More         |
> |Bug fixes |     Baseline    |       More         |
> +-------------------------------------------------+
>
> To stabilize Spark's Hive 2.3 usage, we should use it by ourselves
> (including Jenkins/GitHubAction/AppVeyor).
>
> For me, AS-IS 3.0 is not enough for that. According to your advices,
> to give more visibility to the whole community,
>
> 1. We need to give additional `hadoop-2.7 with Hive 2.3` pre-built
> distribution
> 2. We need to switch our default Hive usage to 2.3 in `master` for 3.1
> after `branch-3.0` branch cut.
>
> I know that we have been reluctant to (1) and (2) due to its burden.
> But, it's time to prepare. Without them, we are going to be insufficient
> again and again.
>
> Bests,
> Dongjoon.
>
>
>
>
> On Tue, Nov 19, 2019 at 9:26 AM Cheng Lian <li...@gmail.com> wrote:
>
>> Dongjoon, I'm with Hyukjin. There should be at least one Spark 3.x minor
>> release to stabilize Hive 2.3 code paths before retiring the Hive 1.2
>> fork. Even today, the Hive 2.3.6 version bundled in Spark 3.0 is still
>> buggy in terms of JDK 11 support. (BTW, I just found that our root POM is
>> referring both Hive 2.3.6 and 2.3.5 at the moment, see here
>> <https://github.com/apache/spark/blob/6fb8b8606544f26dc2d9719a2d009eb5aea65ba2/pom.xml#L135>
>> and here
>> <https://github.com/apache/spark/blob/6fb8b8606544f26dc2d9719a2d009eb5aea65ba2/pom.xml#L2927>
>> .)
>>
>> Again, I'm happy to get rid of ancient legacy dependencies like Hadoop
>> 2.7 and the Hive 1.2 fork, but I do believe that we need a safety net for
>> Spark 3.0. For preview releases, I'm afraid that their visibility is not
>> good enough for covering such major upgrades.
>>
>> On Tue, Nov 19, 2019 at 8:39 AM Dongjoon Hyun <do...@gmail.com>
>> wrote:
>>
>>> Thank you for feedback, Hyujkjin and Sean.
>>>
>>> I proposed `preview-2` for that purpose but I'm also +1 for do that at
>>> 3.1
>>> if we can make a decision to eliminate the illegitimate Hive fork
>>> reference
>>> immediately after `branch-3.0` cut.
>>>
>>> Sean, I'm referencing Cheng Lian's email for the status of `hadoop-2.7`.
>>>
>>> -
>>> https://lists.apache.org/thread.html/623dd9a6d4e951daeec985feffede12c7b419e03c2965018de7a72f1@%3Cdev.spark.apache.org%3E
>>>
>>> The way I see this is that it's not a user problem. Apache Spark
>>> community didn't try to drop the illegitimate Hive fork yet.
>>> We need to drop it by ourselves because we created it and it's our bad.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>>
>>> On Tue, Nov 19, 2019 at 5:06 AM Sean Owen <sr...@gmail.com> wrote:
>>>
>>>> Just to clarify, as even I have lost the details over time: hadoop-2.7
>>>> works with hive-2.3? it isn't tied to hadoop-3.2?
>>>> Roughly how much risk is there in using the Hive 1.x fork over Hive
>>>> 2.x, for end users using Hive via Spark?
>>>> I don't have a strong opinion, other than sharing the view that we
>>>> have to dump the Hive 1.x fork at the first opportunity.
>>>> Question is simply how much risk that entails. Keeping in mind that
>>>> Spark 3.0 is already something that people understand works
>>>> differently. We can accept some behavior changes.
>>>>
>>>> On Mon, Nov 18, 2019 at 11:11 PM Dongjoon Hyun <do...@gmail.com>
>>>> wrote:
>>>> >
>>>> > Hi, All.
>>>> >
>>>> > First of all, I want to put this as a policy issue instead of a
>>>> technical issue.
>>>> > Also, this is orthogonal from `hadoop` version discussion.
>>>> >
>>>> > Apache Spark community kept (not maintained) the forked Apache Hive
>>>> > 1.2.1 because there has been no other options before. As we see at
>>>> > SPARK-20202, it's not a desirable situation among the Apache projects.
>>>> >
>>>> >     https://issues.apache.org/jira/browse/SPARK-20202
>>>> >
>>>> > Also, please note that we `kept`, not `maintained`, because we know
>>>> it's not good.
>>>> > There are several attempt to update that forked repository
>>>> > for several reasons (Hadoop 3 support is one of the example),
>>>> > but those attempts are also turned down.
>>>> >
>>>> > From Apache Spark 3.0, it seems that we have a new feasible option
>>>> > `hive-2.3` profile. What about moving forward in this direction
>>>> further?
>>>> >
>>>> > For example, can we remove the usage of forked `hive` in Apache Spark
>>>> 3.0
>>>> > completely officially? If someone still needs to use the forked
>>>> `hive`, we can
>>>> > have a profile `hive-1.2`. Of course, it should not be a default
>>>> profile in the community.
>>>> >
>>>> > I want to say this is a goal we should achieve someday.
>>>> > If we don't do anything, nothing happen. At least we need to prepare
>>>> this.
>>>> > Without any preparation, Spark 3.1+ will be the same.
>>>> >
>>>> > Shall we focus on what are our problems with Hive 2.3.6?
>>>> > If the only reason is that we didn't use it before, we can release
>>>> another
>>>> > `3.0.0-preview` for that.
>>>> >
>>>> > Bests,
>>>> > Dongjoon.
>>>>
>>>

Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

Posted by Dongjoon Hyun <do...@gmail.com>.
Hi, Cheng.

This is independent of JDK 11 and Hadoop 3; I'm talking about the JDK 8
world. But if we do take them into account, the comparison looks like the
following.

+----------+-----------------+--------------------+
|          | Hive 1.2.1 fork |  Apache Hive 2.3.6 |
+----------+-----------------+--------------------+
|Legitimate|        X        |         O          |
|JDK11     |        X        |         O          |
|Hadoop3   |        X        |         O          |
|Hadoop2   |        O        |         O          |
|Functions |     Baseline    |       More         |
|Bug fixes |     Baseline    |       More         |
+----------+-----------------+--------------------+

To stabilize Spark's Hive 2.3 usage, we should use it ourselves
(including Jenkins/GitHub Actions/AppVeyor).

For me, the AS-IS 3.0 plan is not enough for that. Following your advice,
to give more visibility to the whole community:

1. We need to provide an additional `hadoop-2.7 with Hive 2.3` pre-built
distribution.
2. We need to switch our default Hive usage to 2.3 in `master` for 3.1
after the `branch-3.0` cut.

I know that we have been reluctant to do (1) and (2) because of the burden.
But it's time to prepare. Without them, we will keep coming up short
again and again.
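
(To help exercise such a pre-built distribution, even something as small
as the sketch below could be part of the smoke tests. It assumes the
builtin Hive client jars, in particular hive-common, are on the driver
classpath, and that `HiveVersionInfo` exists in both the forked 1.2.1
jars and Apache Hive 2.3.6; treat both as assumptions.)

    // Rough sketch: report which Hive client version is actually bundled.
    // Assumes hive-common is on the classpath (as in a -Phive build) and
    // that HiveVersionInfo exists in both the 1.2.1 fork and Hive 2.3.6.
    object BundledHiveVersionCheck {
      def main(args: Array[String]): Unit = {
        val version = org.apache.hive.common.util.HiveVersionInfo.getVersion
        println(s"Bundled Hive client version: $version")
      }
    }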

Bests,
Dongjoon.




On Tue, Nov 19, 2019 at 9:26 AM Cheng Lian <li...@gmail.com> wrote:

> Dongjoon, I'm with Hyukjin. There should be at least one Spark 3.x minor
> release to stabilize Hive 2.3 code paths before retiring the Hive 1.2
> fork. Even today, the Hive 2.3.6 version bundled in Spark 3.0 is still
> buggy in terms of JDK 11 support. (BTW, I just found that our root POM is
> referring both Hive 2.3.6 and 2.3.5 at the moment, see here
> <https://github.com/apache/spark/blob/6fb8b8606544f26dc2d9719a2d009eb5aea65ba2/pom.xml#L135>
> and here
> <https://github.com/apache/spark/blob/6fb8b8606544f26dc2d9719a2d009eb5aea65ba2/pom.xml#L2927>
> .)
>
> Again, I'm happy to get rid of ancient legacy dependencies like Hadoop 2.7
> and the Hive 1.2 fork, but I do believe that we need a safety net for Spark
> 3.0. For preview releases, I'm afraid that their visibility is not good
> enough for covering such major upgrades.
>
> On Tue, Nov 19, 2019 at 8:39 AM Dongjoon Hyun <do...@gmail.com>
> wrote:
>
>> Thank you for feedback, Hyujkjin and Sean.
>>
>> I proposed `preview-2` for that purpose but I'm also +1 for do that at 3.1
>> if we can make a decision to eliminate the illegitimate Hive fork
>> reference
>> immediately after `branch-3.0` cut.
>>
>> Sean, I'm referencing Cheng Lian's email for the status of `hadoop-2.7`.
>>
>> -
>> https://lists.apache.org/thread.html/623dd9a6d4e951daeec985feffede12c7b419e03c2965018de7a72f1@%3Cdev.spark.apache.org%3E
>>
>> The way I see this is that it's not a user problem. Apache Spark
>> community didn't try to drop the illegitimate Hive fork yet.
>> We need to drop it by ourselves because we created it and it's our bad.
>>
>> Bests,
>> Dongjoon.
>>
>>
>>
>> On Tue, Nov 19, 2019 at 5:06 AM Sean Owen <sr...@gmail.com> wrote:
>>
>>> Just to clarify, as even I have lost the details over time: hadoop-2.7
>>> works with hive-2.3? it isn't tied to hadoop-3.2?
>>> Roughly how much risk is there in using the Hive 1.x fork over Hive
>>> 2.x, for end users using Hive via Spark?
>>> I don't have a strong opinion, other than sharing the view that we
>>> have to dump the Hive 1.x fork at the first opportunity.
>>> Question is simply how much risk that entails. Keeping in mind that
>>> Spark 3.0 is already something that people understand works
>>> differently. We can accept some behavior changes.
>>>
>>> On Mon, Nov 18, 2019 at 11:11 PM Dongjoon Hyun <do...@gmail.com>
>>> wrote:
>>> >
>>> > Hi, All.
>>> >
>>> > First of all, I want to put this as a policy issue instead of a
>>> technical issue.
>>> > Also, this is orthogonal from `hadoop` version discussion.
>>> >
>>> > Apache Spark community kept (not maintained) the forked Apache Hive
>>> > 1.2.1 because there has been no other options before. As we see at
>>> > SPARK-20202, it's not a desirable situation among the Apache projects.
>>> >
>>> >     https://issues.apache.org/jira/browse/SPARK-20202
>>> >
>>> > Also, please note that we `kept`, not `maintained`, because we know
>>> it's not good.
>>> > There are several attempt to update that forked repository
>>> > for several reasons (Hadoop 3 support is one of the example),
>>> > but those attempts are also turned down.
>>> >
>>> > From Apache Spark 3.0, it seems that we have a new feasible option
>>> > `hive-2.3` profile. What about moving forward in this direction
>>> further?
>>> >
>>> > For example, can we remove the usage of forked `hive` in Apache Spark
>>> 3.0
>>> > completely officially? If someone still needs to use the forked
>>> `hive`, we can
>>> > have a profile `hive-1.2`. Of course, it should not be a default
>>> profile in the community.
>>> >
>>> > I want to say this is a goal we should achieve someday.
>>> > If we don't do anything, nothing happen. At least we need to prepare
>>> this.
>>> > Without any preparation, Spark 3.1+ will be the same.
>>> >
>>> > Shall we focus on what are our problems with Hive 2.3.6?
>>> > If the only reason is that we didn't use it before, we can release
>>> another
>>> > `3.0.0-preview` for that.
>>> >
>>> > Bests,
>>> > Dongjoon.
>>>
>>

Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

Posted by Cheng Lian <li...@gmail.com>.
Dongjoon, I'm with Hyukjin. There should be at least one Spark 3.x minor
release to stabilize Hive 2.3 code paths before retiring the Hive 1.2
fork. Even today, the Hive 2.3.6 version bundled in Spark 3.0 is still
buggy in terms of JDK 11 support. (BTW, I just found that our root POM is
referring to both Hive 2.3.6 and 2.3.5 at the moment; see here
<https://github.com/apache/spark/blob/6fb8b8606544f26dc2d9719a2d009eb5aea65ba2/pom.xml#L135>
and here
<https://github.com/apache/spark/blob/6fb8b8606544f26dc2d9719a2d009eb5aea65ba2/pom.xml#L2927>
.)

Again, I'm happy to get rid of ancient legacy dependencies like Hadoop 2.7
and the Hive 1.2 fork, but I do believe that we need a safety net for Spark
3.0. As for preview releases, I'm afraid their visibility is not good
enough to cover such major upgrades.

On Tue, Nov 19, 2019 at 8:39 AM Dongjoon Hyun <do...@gmail.com>
wrote:

> Thank you for feedback, Hyujkjin and Sean.
>
> I proposed `preview-2` for that purpose but I'm also +1 for do that at 3.1
> if we can make a decision to eliminate the illegitimate Hive fork reference
> immediately after `branch-3.0` cut.
>
> Sean, I'm referencing Cheng Lian's email for the status of `hadoop-2.7`.
>
> -
> https://lists.apache.org/thread.html/623dd9a6d4e951daeec985feffede12c7b419e03c2965018de7a72f1@%3Cdev.spark.apache.org%3E
>
> The way I see this is that it's not a user problem. Apache Spark community
> didn't try to drop the illegitimate Hive fork yet.
> We need to drop it by ourselves because we created it and it's our bad.
>
> Bests,
> Dongjoon.
>
>
>
> On Tue, Nov 19, 2019 at 5:06 AM Sean Owen <sr...@gmail.com> wrote:
>
>> Just to clarify, as even I have lost the details over time: hadoop-2.7
>> works with hive-2.3? it isn't tied to hadoop-3.2?
>> Roughly how much risk is there in using the Hive 1.x fork over Hive
>> 2.x, for end users using Hive via Spark?
>> I don't have a strong opinion, other than sharing the view that we
>> have to dump the Hive 1.x fork at the first opportunity.
>> Question is simply how much risk that entails. Keeping in mind that
>> Spark 3.0 is already something that people understand works
>> differently. We can accept some behavior changes.
>>
>> On Mon, Nov 18, 2019 at 11:11 PM Dongjoon Hyun <do...@gmail.com>
>> wrote:
>> >
>> > Hi, All.
>> >
>> > First of all, I want to put this as a policy issue instead of a
>> technical issue.
>> > Also, this is orthogonal from `hadoop` version discussion.
>> >
>> > Apache Spark community kept (not maintained) the forked Apache Hive
>> > 1.2.1 because there has been no other options before. As we see at
>> > SPARK-20202, it's not a desirable situation among the Apache projects.
>> >
>> >     https://issues.apache.org/jira/browse/SPARK-20202
>> >
>> > Also, please note that we `kept`, not `maintained`, because we know
>> it's not good.
>> > There are several attempt to update that forked repository
>> > for several reasons (Hadoop 3 support is one of the example),
>> > but those attempts are also turned down.
>> >
>> > From Apache Spark 3.0, it seems that we have a new feasible option
>> > `hive-2.3` profile. What about moving forward in this direction further?
>> >
>> > For example, can we remove the usage of forked `hive` in Apache Spark
>> 3.0
>> > completely officially? If someone still needs to use the forked `hive`,
>> we can
>> > have a profile `hive-1.2`. Of course, it should not be a default
>> profile in the community.
>> >
>> > I want to say this is a goal we should achieve someday.
>> > If we don't do anything, nothing happen. At least we need to prepare
>> this.
>> > Without any preparation, Spark 3.1+ will be the same.
>> >
>> > Shall we focus on what are our problems with Hive 2.3.6?
>> > If the only reason is that we didn't use it before, we can release
>> another
>> > `3.0.0-preview` for that.
>> >
>> > Bests,
>> > Dongjoon.
>>
>

Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

Posted by Dongjoon Hyun <do...@gmail.com>.
Thank you for the feedback, Hyukjin and Sean.

I proposed `preview-2` for that purpose, but I'm also +1 for doing it in
3.1 if we can make a decision to eliminate the illegitimate Hive fork
reference immediately after the `branch-3.0` cut.

Sean, I'm referencing Cheng Lian's email for the status of `hadoop-2.7`.

-
https://lists.apache.org/thread.html/623dd9a6d4e951daeec985feffede12c7b419e03c2965018de7a72f1@%3Cdev.spark.apache.org%3E

The way I see this, it's not a user problem. The Apache Spark community
hasn't tried to drop the illegitimate Hive fork yet.
We need to drop it ourselves because we created it, and that's on us.

Bests,
Dongjoon.



On Tue, Nov 19, 2019 at 5:06 AM Sean Owen <sr...@gmail.com> wrote:

> Just to clarify, as even I have lost the details over time: hadoop-2.7
> works with hive-2.3? it isn't tied to hadoop-3.2?
> Roughly how much risk is there in using the Hive 1.x fork over Hive
> 2.x, for end users using Hive via Spark?
> I don't have a strong opinion, other than sharing the view that we
> have to dump the Hive 1.x fork at the first opportunity.
> Question is simply how much risk that entails. Keeping in mind that
> Spark 3.0 is already something that people understand works
> differently. We can accept some behavior changes.
>
> On Mon, Nov 18, 2019 at 11:11 PM Dongjoon Hyun <do...@gmail.com>
> wrote:
> >
> > Hi, All.
> >
> > First of all, I want to put this as a policy issue instead of a
> technical issue.
> > Also, this is orthogonal from `hadoop` version discussion.
> >
> > Apache Spark community kept (not maintained) the forked Apache Hive
> > 1.2.1 because there has been no other options before. As we see at
> > SPARK-20202, it's not a desirable situation among the Apache projects.
> >
> >     https://issues.apache.org/jira/browse/SPARK-20202
> >
> > Also, please note that we `kept`, not `maintained`, because we know it's
> not good.
> > There are several attempt to update that forked repository
> > for several reasons (Hadoop 3 support is one of the example),
> > but those attempts are also turned down.
> >
> > From Apache Spark 3.0, it seems that we have a new feasible option
> > `hive-2.3` profile. What about moving forward in this direction further?
> >
> > For example, can we remove the usage of forked `hive` in Apache Spark 3.0
> > completely officially? If someone still needs to use the forked `hive`,
> we can
> > have a profile `hive-1.2`. Of course, it should not be a default profile
> in the community.
> >
> > I want to say this is a goal we should achieve someday.
> > If we don't do anything, nothing happen. At least we need to prepare
> this.
> > Without any preparation, Spark 3.1+ will be the same.
> >
> > Shall we focus on what are our problems with Hive 2.3.6?
> > If the only reason is that we didn't use it before, we can release
> another
> > `3.0.0-preview` for that.
> >
> > Bests,
> > Dongjoon.
>

Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

Posted by Sean Owen <sr...@gmail.com>.
Just to clarify, as even I have lost the details over time: hadoop-2.7
works with hive-2.3? It isn't tied to hadoop-3.2?
Roughly how much risk is there in using the Hive 1.x fork over Hive
2.x for end users who use Hive via Spark?
I don't have a strong opinion, other than sharing the view that we
have to dump the Hive 1.x fork at the first opportunity.
The question is simply how much risk that entails, keeping in mind that
Spark 3.0 is already something that people understand works
differently. We can accept some behavior changes.

On Mon, Nov 18, 2019 at 11:11 PM Dongjoon Hyun <do...@gmail.com> wrote:
>
> Hi, All.
>
> First of all, I want to put this as a policy issue instead of a technical issue.
> Also, this is orthogonal from `hadoop` version discussion.
>
> Apache Spark community kept (not maintained) the forked Apache Hive
> 1.2.1 because there has been no other options before. As we see at
> SPARK-20202, it's not a desirable situation among the Apache projects.
>
>     https://issues.apache.org/jira/browse/SPARK-20202
>
> Also, please note that we `kept`, not `maintained`, because we know it's not good.
> There are several attempt to update that forked repository
> for several reasons (Hadoop 3 support is one of the example),
> but those attempts are also turned down.
>
> From Apache Spark 3.0, it seems that we have a new feasible option
> `hive-2.3` profile. What about moving forward in this direction further?
>
> For example, can we remove the usage of forked `hive` in Apache Spark 3.0
> completely officially? If someone still needs to use the forked `hive`, we can
> have a profile `hive-1.2`. Of course, it should not be a default profile in the community.
>
> I want to say this is a goal we should achieve someday.
> If we don't do anything, nothing happen. At least we need to prepare this.
> Without any preparation, Spark 3.1+ will be the same.
>
> Shall we focus on what are our problems with Hive 2.3.6?
> If the only reason is that we didn't use it before, we can release another
> `3.0.0-preview` for that.
>
> Bests,
> Dongjoon.



Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

Posted by Hyukjin Kwon <gu...@gmail.com>.
I struggled hard with this issue multiple times over the past year, and
thankfully we finally decided to use the official version of Hive 2.3.x
too (thank you, Yuming, Alan, and everyone involved).
Starting to use the official version of Hive is already huge progress.

I think we should allow at least one minor release for users to test out
Spark with Hive 2.3.x before switching it to the default. My impression
was that this decision was already made at:
http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Upgrade-built-in-Hive-to-2-3-4-td26153.html

How about we aim to make it the default in Spark 3.1, using this thread as
a reference? Doing it in 3.0 feels like too radical a change.
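
(For that testing window, even a tiny end-to-end check like the hedged
sketch below, run unchanged against a `hive-1.2` build and a `hive-2.3`
build, would give users a quick way to compare behavior. The database and
table names are made up for illustration.)

    import org.apache.spark.sql.SparkSession

    // Hedged sketch of a minimal Hive smoke test; all names are illustrative.
    val spark = SparkSession.builder()
      .appName("hive-smoke-test")
      .enableHiveSupport()
      .getOrCreate()

    spark.sql("CREATE DATABASE IF NOT EXISTS smoke_db")
    spark.sql("CREATE TABLE IF NOT EXISTS smoke_db.kv (k INT, v STRING) USING hive")
    spark.sql("INSERT INTO smoke_db.kv VALUES (1, 'a'), (2, 'b')")
    spark.sql("SELECT COUNT(*) FROM smoke_db.kv").show()

    spark.stop()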


On Tue, Nov 19, 2019 at 2:11 PM Dongjoon Hyun <do...@gmail.com> wrote:

> Hi, All.
>
> First of all, I want to put this as a policy issue instead of a technical
> issue.
> Also, this is orthogonal from `hadoop` version discussion.
>
> Apache Spark community kept (not maintained) the forked Apache Hive
> 1.2.1 because there has been no other options before. As we see at
> SPARK-20202, it's not a desirable situation among the Apache projects.
>
>     https://issues.apache.org/jira/browse/SPARK-20202
>
> Also, please note that we `kept`, not `maintained`, because we know it's
> not good.
> There are several attempt to update that forked repository
> for several reasons (Hadoop 3 support is one of the example),
> but those attempts are also turned down.
>
> From Apache Spark 3.0, it seems that we have a new feasible option
> `hive-2.3` profile. What about moving forward in this direction further?
>
> For example, can we remove the usage of forked `hive` in Apache Spark 3.0
> completely officially? If someone still needs to use the forked `hive`, we
> can
> have a profile `hive-1.2`. Of course, it should not be a default profile
> in the community.
>
> I want to say this is a goal we should achieve someday.
> If we don't do anything, nothing happen. At least we need to prepare this.
> Without any preparation, Spark 3.1+ will be the same.
>
> Shall we focus on what are our problems with Hive 2.3.6?
> If the only reason is that we didn't use it before, we can release
> another
> `3.0.0-preview` for that.
>
> Bests,
> Dongjoon.
>