
Posted to user@hive.apache.org by Zoltan Haindrich <ki...@rxd.hu> on 2022/02/10 16:42:40 UTC

Re: Time to Remove Hive-on-Spark

Hey,

I think there is no real interest in this feature; we don't have users/contributors backing it - last development was around October 2018; there have been ~2 bugfix commits 
since then...we should stop carrying dead weight...another 2 weeks have gone by since Stamatis reminded us that after 1.5 years(!) nothing has changed.

+1 on removing it

cheers,
Zoltan

you may inspect some of the recent changes with:
git log -c `find . -type f -path '**/spark/**'|grep -v xml|grep -v properties|grep -v q.out`
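
As a sketch of the same idea (not part of the original command), git's own pathspec matching can do the filtering, which avoids expanding a long file list on the command line. The function name spark_changes and the exclude patterns are assumptions for illustration; they only approximate the grep filters above, since grep -v xml drops any path containing "xml" while the pathspecs below drop files by extension.

```shell
# Sketch: list commits touching Spark-related sources using git pathspecs
# instead of a file list built with find/grep.
# The function name and the exclude patterns are illustrative assumptions.
spark_changes() {
  # Extra arguments (e.g. --since=...) are passed through to git log.
  git log --oneline "$@" -- \
    ':(glob)**/spark/**' \
    ':(exclude,glob)**/*.xml' \
    ':(exclude,glob)**/*.properties' \
    ':(exclude,glob)**/*.q.out'
}
```

For example, spark_changes --since=2018-10-01 would limit the listing to commits made after the last development period mentioned above.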


On 1/28/22 2:32 PM, Stamatis Zampetakis wrote:
> Hi team,
> 
> Almost one year has passed since the last exchange in this discussion and
> if I am not wrong there has been no effort to revive Hive-on-Spark. To be
> more precise, I don't think I have seen any Spark related JIRA for quite
> some time now and although I don't want to rush into conclusions, there
> does not seem to be any community member involved in maintaining or adding
> new features in this part of the code.
> 
> Keeping dead code in the repository does not do any good to the project and
> puts a non-negligible burden to future maintainers.
> 
> Clearly, we cannot make a new Hive release where a major feature is
> completely untested so either someone commits to re-enable/fix the
> respective tests soon or we move forward the work started by David and drop
> support for Hive-on-Spark.
> 
> I would like to ask the community if there is anyone who can take up this
> maintenance task and enable/fix Spark related tests in the next month or so?
> 
> Best,
> Stamatis
> 
> On Sat, Feb 27, 2021 at 4:17 AM Edward Capriolo <ed...@gmail.com>
> wrote:
> 
>> I do not know how it works for most of the world. But in Cloudera, where the
>> Tez options were never popular, Hive-on-Spark represents a solid way to get
>> things done for small datasets at lower latency.
>>
>> As for Spark adoption: a while ago I came up with some ways to
>> make Hive more Spark-like. One of them was that I found a way to make "compile"
>> a Hive keyword so folks could build UDFs on the fly. It was such an
>> uphill climb. Folks found a way to make it disabled by default for security.
>> Then later, when things moved from the CLI to Beeline, it was like the ONLY thing
>> that I found not ported. It was extremely frustrating.
>>
>>
>>
>>
>>
>>
>> On Mon, Jul 27, 2020 at 3:19 PM David <da...@gmail.com> wrote:
>>
>>> Hello  Xuefu,
>>>
>>> I am not part of the Cloudera Hive product team,  though I volunteer to
>>> work on small projects from time to time.  Perhaps someone from that team
>>> can chime in with some of their thoughts, but personally, I think that in
>>> the long run, there will be more of a merge between Hive-on-Spark and
>> other
>>> Spark-native offerings.  I'm not sure what the differentiation will be
>>> going forward.  With that said, are there any developers on this mailing
>>> list who are willing to take on the maintenance effort of keeping HoS
>>> moving forward?
>>>
>>> http://www.russellspitzer.com/2017/05/19/Spark-Sql-Thriftserver/
>>>
>>>
>> https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.6.4/bk_spark-component-guide/content/config-sts.html
>>>
>>>
>>> Thanks.
>>>
>>> On Thu, Jul 23, 2020 at 12:35 PM Xuefu Zhang <xu...@apache.org> wrote:
>>>
>>>> Previous reasoning seemed to suggest a lack of user adoption. Now we
>> are
>>>> concerned about ongoing maintenance effort. Both are valid
>>> considerations.
>>>> However, I think we should have ways to find out the answers.
>> Therefore,
>>> I
>>>> suggest the following be carried out:
>>>>
>>>> 1. Send out the proposal (removing Hive on Spark) to users including
>>>> user@hive.apache.org and get their feedback.
>>>> 2. Ask if any developers on this mailing list are willing to take on
>> the
>>>> maintenance effort.
>>>>
>>>> I'm concerned about user impact because I can still see issues being
>>>> reported on HoS from time to time. I'm more concerned about the future
>> of
>>>> Hive if we narrow Hive neutrality on execution engines, which will
>>> possibly
>>>> force more Hive users to migrate to other alternatives such as Spark
>> SQL,
>>>> which is already eroding Hive's user base.
>>>>
>>>> Being open and neutral used to be Hive's most admired strengths.
>>>>
>>>> Thanks,
>>>> Xuefu
>>>>
>>>>
>>>> On Wed, Jul 22, 2020 at 8:46 AM Alan Gates <al...@gmail.com>
>> wrote:
>>>>
>>>>> An important point here is I don't believe David is proposing to
>> remove
>>>>> Hive on Spark from the 2 or 3 lines, but only from trunk.  Continuing
>>> to
>>>>> support it in existing 2 and 3 lines makes sense, but since no one
>> has
>>>>> maintained it on trunk for some time and it does not work with many
>> of
>>>> the
>>>>> newer features it should be removed from trunk.
>>>>>
>>>>> Alan.
>>>>>
>>>>> On Tue, Jul 21, 2020 at 4:10 PM Chao Sun <su...@apache.org> wrote:
>>>>>
>>>>>> Thanks David. FWIW Uber is still running Hive on Spark (2.3.4) on a
>>>> very
>>>>>> large scale in production right now and I don't think we have any
>>> plan
>>>> to
>>>>>> change it soon.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Jul 21, 2020 at 11:28 AM David <da...@gmail.com> wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> Thanks for the feedback.
>>>>>>>
>>>>>>> Just a quick recap: I did propose this @dev and I received
>>> unanimous
>>>>> +1's
>>>>>>> from the community.  After a couple months, I created the PR.
>>>>>>>
>>>>>>> Certainly open to discussion, but there hasn't been any
>> discussion
>>>> thus
>>>>>> far
>>>>>>> because there have been no objections until this point.
>>>>>>>
>>>>>>> HoS has low adoption, heavy technical debt, and the manner in
>> which
>>>> its
>>>>>>> build process is set up is impeding some other work that is not
>> even
>>>>>> related
>>>>>>> to HoS.
>>>>>>>
>>>>>>> We can deprecate in Hive 3.x and remove in Hive 4.x.  The plan
>>> would
>>>> be
>>>>>> to
>>>>>>> use Tez moving forward.
>>>>>>>
>>>>>>> My point about the vendor's move to Tez is that HoS adoption is
>>> very
>>>>> low,
>>>>>>> it's only going lower, and while I don't know the specifics of
>> it,
>>>>> there
>>>>>>> must be some migration plan in place there (i.e., it must be
>>> possible
>>>>> to
>>>>>> do
>>>>>>> it already).
>>>>>>>
>>>>>>> Thanks,
>>>>>>> David
>>>>>>>
>>>>>>> On Tue, Jul 21, 2020 at 12:23 PM Xuefu Zhang <xu...@apache.org>
>>>> wrote:
>>>>>>>
>>>>>>>> Hi David,
>>>>>>>>
>>>>>>>> While a vendor may not support a component in an open source
>>>> project,
>>>>>>>> removing it or not is a decision by and for the community. I
>>>>> certainly
>>>>>>>> understand that the vendor you mentioned has contributed a
>> great
>>>> deal
>>>>>>>> (including my personal effort while working there), it's not up
>>> to
>>>>> the
>>>>>>>> vendor to make a call like what is proposed here.
>>>>>>>>
>>>>>>>> As a community, we should have gone through a thorough
>> discussion
>>>> and
>>>>>>>> reached a consensus before actually making such a big change,
>> in
>>> my
>>>>>>>> opinion.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Xuefu
>>>>>>>>
>>>>>>>> On Tue, Jul 21, 2020 at 8:49 AM David <da...@gmail.com>
>> wrote:
>>>>>>>>
>>>>>>>>> Hey,
>>>>>>>>>
>>>>>>>>> Thanks for the input.
>>>>>>>>>
>>>>>>>>> FYI. Cloudera (Cloudera + Hortonworks) have removed HoS from
>>>> their
>>>>>>> latest
>>>>>>>>> offering.
>>>>>>>>>
>>>>>>>>> "Tez is now the only supported execution engine, existing
>>> queries
>>>>>> that
>>>>>>>>> change execution mode to Spark or MapReduce within a session,
>>> for
>>>>>>>> example,
>>>>>>>>> fail."
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>> https://docs.cloudera.com/cdp/latest/upgrade-post/topics/ug_hive_configuration_changes.html
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> So I don't know who will be supporting this feature moving
>>>> forward,
>>>>>> but
>>>>>>>>> there has been a lot of work done to make this change as
>>> painless
>>>>> as
>>>>>>>>> possible.  Simply setting the engine to 'tez' and removing the
>>>>> HoS-related
>>>>>>>>> settings should address many use cases.
>>>>>>>>>
>>>>>>>>> Thanks.
>>>>>>>>>
>>>>>>>>> On Tue, Jul 21, 2020 at 11:36 AM Xuefu Z <us...@gmail.com>
>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Sorry for chiming in late. However, I don't think we should
>>>>> remove
>>>>>>> Hive
>>>>>>>>> on
>>>>>>>>>> Spark just because of a technical problem. This is rather a
>>> big
>>>>>>>> decision
>>>>>>>>>> that we need to be careful about. There are users that will
>>> be
>>>>> left
>>>>>>>> high
>>>>>>>>>> and dry by this move.
>>>>>>>>>>
>>>>>>>>>> If the community decides to desupport and eventually remove
>>>> it, I
>>>>>>> think
>>>>>>>>> we
>>>>>>>>>> need to have a due process. We also need a deprecation plan
>>> if
>>>>>> that's what
>>>>>>>> we
>>>>>>>>>> decide to do. Before that, I'm -1 on this proposal.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Xuefu
>>>>>>>>>>
>>>>>>>>>> On Tue, Jul 21, 2020 at 7:57 AM David <da...@gmail.com>
>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hello Team,
>>>>>>>>>>>
>>>>>>>>>>> https://github.com/apache/hive/pull/1285
>>>>>>>>>>>
>>>>>>>>>>> Thanks.
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Jun 3, 2020 at 11:49 PM Gopal V <
>> gopalv@apache.org
>>>>
>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> +1
>>>>>>>>>>>>
>>>>>>>>>>>> Cheers,
>>>>>>>>>>>> Gopal
>>>>>>>>>>>>
>>>>>>>>>>>> On 6/3/20 7:48 PM, Jesus Camacho Rodriguez wrote:
>>>>>>>>>>>>> +1
>>>>>>>>>>>>>
>>>>>>>>>>>>> -Jesús
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Jun 3, 2020 at 1:58 PM Alan Gates <
>>>>>>> alanfgates@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> +1.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Alan.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Jun 3, 2020 at 1:40 PM Prasanth Jayachandran
>>>>>>>>>>>>>> <pj...@cloudera.com.invalid> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> +1
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Jun 3, 2020, at 1:38 PM, Ashutosh Chauhan <
>>>>>>>>>> hashutosh@apache.org>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> +1
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, Jun 3, 2020 at 1:23 PM David Mollitor <
>>>>>>>>> dam6923@gmail.com>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hello Gang,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I have spent some time working on upgrading Avro
>>> (far
>>>>>> less
>>>>>>>> than
>>>>>>>>>>>>>> others):
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> https://issues.apache.org/jira/browse/HIVE-21737
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> This should be a relatively easy thing to do, but
>>> is
>>>>>>> blocked
>>>>>>>> by
>>>>>>>>>>>>>>>>> Hive-on-Spark.  HoS has a weird thing where it
>>>>> downloads
>>>>>>> some
>>>>>>>>>>>>>>>>> cloud-storage-hosted file of Spark-Hadoop as part
>>> of
>>>>> its
>>>>>>>> maven
>>>>>>>>>> run.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Since HoS is not going to receive updates from
>> the
>>>>> major
>>>>>>>>> vendors,
>>>>>>>>>>> is
>>>>>>>>>>>>>> it
>>>>>>>>>>>>>>>>> time to simply remove it?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Tests are currently disabled:
>>>>>>>>>>>>>>>>> https://issues.apache.org/jira/browse/HIVE-23137
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Xuefu Zhang
>>>>>>>>>>
>>>>>>>>>> "In Honey We Trust!"
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Time to Remove Hive-on-Spark

Posted by Peter Vary <pv...@cloudera.com.INVALID>.

+1 from my side too.

I have created a PR against the current branch.
It still needs some work, and as many reviews as possible, because it is quite
big, and I might have made some mistakes:
https://issues.apache.org/jira/browse/HIVE-26134
https://github.com/apache/hive/pull/3201

Thanks,
Peter

On Thu, 10 Feb 2022 at 17:43, Zoltan Haindrich <ki...@rxd.hu> wrote:

> Hey,
>
> I think there is no real interest in this feature; we don't have
> users/contributors backing it - last development was around 2018 October;
> there have been ~2 bugfix commits
> since then...we should stop carrying dead weight...another 2 weeks have gone by
> since Stamatis reminded us that after 1.5 years(!) nothing has
> changed.
>
> +1 on removing it
>
> cheers,
> Zoltan
>
> you may inspect some of the recent changes with:
> git log -c `find . -type f -path '**/spark/**'|grep -v xml|grep -v
> properties|grep -v q.out`
>
>
> On 1/28/22 2:32 PM, Stamatis Zampetakis wrote:
> > Hi team,
> >
> > Almost one year has passed since the last exchange in this discussion and
> > if I am not wrong there has been no effort to revive Hive-on-Spark. To be
> > more precise, I don't think I have seen any Spark related JIRA for quite
> > some time now and although I don't want to rush into conclusions, there
> > does not seem to be any community member involved in maintaining or
> adding
> > new features in this part of the code.
> >
> > Keeping dead code in the repository does not do any good to the project
> and
> > puts a non-negligible burden to future maintainers.
> >
> > Clearly, we cannot make a new Hive release where a major feature is
> > completely untested so either someone commits to re-enable/fix the
> > respective tests soon or we move forward the work started by David and
> drop
> > support for Hive-on-Spark.
> >
> > I would like to ask the community if there is anyone who can take up this
> > maintenance task and enable/fix Spark related tests in the next month or
> so?
> >
> > Best,
> > Stamatis
> >
> > On Sat, Feb 27, 2021 at 4:17 AM Edward Capriolo <ed...@gmail.com>
> > wrote:
> >
> >> I do not know how it works for most of the world. But in Cloudera, where the
> >> Tez options were never popular, Hive-on-Spark represents a solid way to get
> >> things done for small datasets at lower latency.
> >>
> >> As for Spark adoption: a while ago I came up with some ways to
> >> make Hive more Spark-like. One of them was that I found a way to make "compile"
> >> a Hive keyword so folks could build UDFs on the fly. It was such an
> >> uphill climb. Folks found a way to make it disabled by default for security.
> >> Then later, when things moved from the CLI to Beeline, it was like the ONLY thing
> >> that I found not ported. It was extremely frustrating.
> >>
> >>
> >>
> >>
> >>
> >>
> >> On Mon, Jul 27, 2020 at 3:19 PM David <da...@gmail.com> wrote:
> >>
> >>> Hello  Xuefu,
> >>>
> >>> I am not part of the Cloudera Hive product team,  though I volunteer to
> >>> work on small projects from time to time.  Perhaps someone from that
> team
> >>> can chime in with some of their thoughts, but personally, I think that
> in
> >>> the long run, there will be more of a merge between Hive-on-Spark and
> >> other
> >>> Spark-native offerings.  I'm not sure what the differentiation will be
> >>> going forward.  With that said, are there any developers on this
> mailing
> >>> list who are willing to take on the maintenance effort of keeping HoS
> >>> moving forward?
> >>>
> >>> http://www.russellspitzer.com/2017/05/19/Spark-Sql-Thriftserver/
> >>>
> >>>
> >>
> https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.6.4/bk_spark-component-guide/content/config-sts.html
> >>>
> >>>
> >>> Thanks.
> >>>
> >>> On Thu, Jul 23, 2020 at 12:35 PM Xuefu Zhang <xu...@apache.org> wrote:
> >>>
> >>>> Previous reasoning seemed to suggest a lack of user adoption. Now we
> >> are
> >>>> concerned about ongoing maintenance effort. Both are valid
> >>> considerations.
> >>>> However, I think we should have ways to find out the answers.
> >> Therefore,
> >>> I
> >>>> suggest the following be carried out:
> >>>>
> >>>> 1. Send out the proposal (removing Hive on Spark) to users including
> >>>> user@hive.apache.org and get their feedback.
> >>>> 2. Ask if any developers on this mailing list are willing to take on
> >> the
> >>>> maintenance effort.
> >>>>
> >>>> I'm concerned about user impact because I can still see issues being
> >>>> reported on HoS from time to time. I'm more concerned about the future
> >> of
> >>>> Hive if we narrow Hive neutrality on execution engines, which will
> >>> possibly
> >>>> force more Hive users to migrate to other alternatives such as Spark
> >> SQL,
> >>>> which is already eroding Hive's user base.
> >>>>
> >>>> Being open and neutral used to be Hive's most admired strengths.
> >>>>
> >>>> Thanks,
> >>>> Xuefu
> >>>>
> >>>>
> >>>> On Wed, Jul 22, 2020 at 8:46 AM Alan Gates <al...@gmail.com>
> >> wrote:
> >>>>
> >>>>> An important point here is I don't believe David is proposing to
> >> remove
> >>>>> Hive on Spark from the 2 or 3 lines, but only from trunk.  Continuing
> >>> to
> >>>>> support it in existing 2 and 3 lines makes sense, but since no one
> >> has
> >>>>> maintained it on trunk for some time and it does not work with many
> >> of
> >>>> the
> >>>>> newer features it should be removed from trunk.
> >>>>>
> >>>>> Alan.
> >>>>>
> >>>>> On Tue, Jul 21, 2020 at 4:10 PM Chao Sun <su...@apache.org> wrote:
> >>>>>
> >>>>>> Thanks David. FWIW Uber is still running Hive on Spark (2.3.4) on a
> >>>> very
> >>>>>> large scale in production right now and I don't think we have any
> >>> plan
> >>>> to
> >>>>>> change it soon.
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Tue, Jul 21, 2020 at 11:28 AM David <da...@gmail.com> wrote:
> >>>>>>
> >>>>>>> Hello,
> >>>>>>>
> >>>>>>> Thanks for the feedback.
> >>>>>>>
> >>>>>>> Just a quick recap: I did propose this @dev and I received
> >>> unanimous
> >>>>> +1's
> >>>>>>> from the community.  After a couple months, I created the PR.
> >>>>>>>
> >>>>>>> Certainly open to discussion, but there hasn't been any
> >> discussion
> >>>> thus
> >>>>>> far
> >>>>>>> because there have been no objections until this point.
> >>>>>>>
> >>>>>>> HoS has low adoption, heavy technical debt, and the manner in
> >> which
> >>>> its
> >>>>>>> build process is set up is impeding some other work that is not
> >> even
> >>>>>> related
> >>>>>>> to HoS.
> >>>>>>>
> >>>>>>> We can deprecate in Hive 3.x and remove in Hive 4.x.  The plan
> >>> would
> >>>> be
> >>>>>> to
> >>>>>>> use Tez moving forward.
> >>>>>>>
> >>>>>>> My point about the vendor's move to Tez is that HoS adoption is
> >>> very
> >>>>> low,
> >>>>>>> it's only going lower, and while I don't know the specifics of
> >> it,
> >>>>> there
> >>>>>>> must be some migration plan in place there (i.e., it must be
> >>> possible
> >>>>> to
> >>>>>> do
> >>>>>>> it already).
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> David
> >>>>>>>
> >>>>>>> On Tue, Jul 21, 2020 at 12:23 PM Xuefu Zhang <xu...@apache.org>
> >>>> wrote:
> >>>>>>>
> >>>>>>>> Hi David,
> >>>>>>>>
> >>>>>>>> While a vendor may not support a component in an open source
> >>>> project,
> >>>>>>>> removing it or not is a decision by and for the community. I
> >>>>> certainly
> >>>>>>>> understand that the vendor you mentioned has contributed a
> >> great
> >>>> deal
> >>>>>>>> (including my personal effort while working there), it's not up
> >>> to
> >>>>> the
> >>>>>>>> vendor to make a call like what is proposed here.
> >>>>>>>>
> >>>>>>>> As a community, we should have gone through a thorough
> >> discussion
> >>>> and
> >>>>>>>> reached a consensus before actually making such a big change,
> >> in
> >>> my
> >>>>>>>> opinion.
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> Xuefu
> >>>>>>>>
> >>>>>>>> On Tue, Jul 21, 2020 at 8:49 AM David <da...@gmail.com>
> >> wrote:
> >>>>>>>>
> >>>>>>>>> Hey,
> >>>>>>>>>
> >>>>>>>>> Thanks for the input.
> >>>>>>>>>
> >>>>>>>>> FYI. Cloudera (Cloudera + Hortonworks) have removed HoS from
> >>>> their
> >>>>>>> latest
> >>>>>>>>> offering.
> >>>>>>>>>
> >>>>>>>>> "Tez is now the only supported execution engine, existing
> >>> queries
> >>>>>> that
> >>>>>>>>> change execution mode to Spark or MapReduce within a session,
> >>> for
> >>>>>>>> example,
> >>>>>>>>> fail."
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> https://docs.cloudera.com/cdp/latest/upgrade-post/topics/ug_hive_configuration_changes.html
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> So I don't know who will be supporting this feature moving
> >>>> forward,
> >>>>>> but
> >>>>>>>>> there has been a lot of work done to make this change as
> >>> painless
> >>>>> as
> >>>>>>>>> possible.  Simply setting the engine to 'tez' and removing the
> >>>>> HoS-related
> >>>>>>>>> settings should address many use cases.
> >>>>>>>>>
> >>>>>>>>> Thanks.
> >>>>>>>>>
> >>>>>>>>> On Tue, Jul 21, 2020 at 11:36 AM Xuefu Z <us...@gmail.com>
> >>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> Sorry for chiming in late. However, I don't think we should
> >>>>> remove
> >>>>>>> Hive
> >>>>>>>>> on
> >>>>>>>>>> Spark just because of a technical problem. This is rather a
> >>> big
> >>>>>>>> decision
> >>>>>>>>>> that we need to be careful about. There are users that will
> >>> be
> >>>>> left
> >>>>>>>> high
> >>>>>>>>>> and dry by this move.
> >>>>>>>>>>
> >>>>>>>>>> If the community decides to desupport and eventually remove
> >>>> it, I
> >>>>>>> think
> >>>>>>>>> we
> >>>>>>>>>> need to have a due process. We also need a deprecation plan
> >>> if
> >>>>>> that's what
> >>>>>>>> we
> >>>>>>>>>> decide to do. Before that, I'm -1 on this proposal.
> >>>>>>>>>>
> >>>>>>>>>> Thanks,
> >>>>>>>>>> Xuefu
> >>>>>>>>>>
> >>>>>>>>>> On Tue, Jul 21, 2020 at 7:57 AM David <da...@gmail.com>
> >>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Hello Team,
> >>>>>>>>>>>
> >>>>>>>>>>> https://github.com/apache/hive/pull/1285
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks.
> >>>>>>>>>>>
> >>>>>>>>>>> On Wed, Jun 3, 2020 at 11:49 PM Gopal V <
> >> gopalv@apache.org
> >>>>
> >>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> +1
> >>>>>>>>>>>>
> >>>>>>>>>>>> Cheers,
> >>>>>>>>>>>> Gopal
> >>>>>>>>>>>>
> >>>>>>>>>>>> On 6/3/20 7:48 PM, Jesus Camacho Rodriguez wrote:
> >>>>>>>>>>>>> +1
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> -Jesús
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Wed, Jun 3, 2020 at 1:58 PM Alan Gates <
> >>>>>>> alanfgates@gmail.com>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> +1.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Alan.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Wed, Jun 3, 2020 at 1:40 PM Prasanth Jayachandran
> >>>>>>>>>>>>>> <pj...@cloudera.com.invalid> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> +1
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On Jun 3, 2020, at 1:38 PM, Ashutosh Chauhan <
> >>>>>>>>>> hashutosh@apache.org>
> >>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> +1
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On Wed, Jun 3, 2020 at 1:23 PM David Mollitor <
> >>>>>>>>> dam6923@gmail.com>
> >>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Hello Gang,
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> I have spent some time working on upgrading Avro
> >>> (far
> >>>>>> less
> >>>>>>>> than
> >>>>>>>>>>>>>> others):
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> https://issues.apache.org/jira/browse/HIVE-21737
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> This should be a relatively easy thing to do, but
> >>> is
> >>>>>>> blocked
> >>>>>>>> by
> >>>>>>>>>>>>>>>>> Hive-on-Spark.  HoS has a weird thing where it
> >>>>> downloads
> >>>>>>> some
> >>>>>>>>>>>>>>>>> cloud-storage-hosted file of Spark-Hadoop as part
> >>> of
> >>>>> its
> >>>>>>>> maven
> >>>>>>>>>> run.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Since HoS is not going to receive updates from
> >> the
> >>>>> major
> >>>>>>>>> vendors,
> >>>>>>>>>>> is
> >>>>>>>>>>>>>> it
> >>>>>>>>>>>>>>>>> time to simply remove it?
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Tests are currently disabled:
> >>>>>>>>>>>>>>>>> https://issues.apache.org/jira/browse/HIVE-23137
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Thanks.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> --
> >>>>>>>>>> Xuefu Zhang
> >>>>>>>>>>
> >>>>>>>>>> "In Honey We Trust!"
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> >
>

Re: Time to Remove Hive-on-Spark

Posted by Peter Vary <pv...@cloudera.com>.

+1 from my side too.

I have created PR against the current branch.
Still needs some work, and as many reviews as possible, because it is quite
big, and I might made some mistakes
https://issues.apache.org/jira/browse/HIVE-26134
https://github.com/apache/hive/pull/3201

Thanks,
Peter

On Thu, 10 Feb 2022 at 17:43, Zoltan Haindrich <ki...@rxd.hu> wrote:

> Hey,
>
> I think there is no real interest in this feature; we don't have
> users/contributors backing it - last development was around 2018 October;
> there were ~2 bugfix commits ever
> since that...we should stop carrying dead weight...another 2 weeks went by
> since Stamatis have reminded us that after 1.5 years(!) nothing have
> changed.
>
> +1 on removing it
>
> cheers,
> Zoltan
>
> you may inspect some of the recent changes with:
> git log -c `find . -type f -path '**/spark/**'|grep -v xml|grep -v
> properties|grep -v q.out`
>
>
> On 1/28/22 2:32 PM, Stamatis Zampetakis wrote:
> > Hi team,
> >
> > Almost one year has passed since the last exchange in this discussion and
> > if I am not wrong there has been no effort to revive Hive-on-Spark. To be
> > more precise, I don't think I have seen any Spark related JIRA for quite
> > some time now and although I don't want to rush into conclusions, there
> > does not seem to be any community member involved in maintaining or
> adding
> > new features in this part of the code.
> >
> > Keeping dead code in the repository does not do any good to the project
> and
> > puts a non-negligible burden to future maintainers.
> >
> > Clearly, we cannot make a new Hive release where a major feature is
> > completely untested so either someone commits to re-enable/fix the
> > respective tests soon or we move forward the work started by David and
> drop
> > support for Hive-on-Spark.
> >
> > I would like to ask the community if there is anyone who can take up this
> > maintenance task and enable/fix Spark related tests in the next month or
> so?
> >
> > Best,
> > Stamatis
> >
> > On Sat, Feb 27, 2021 at 4:17 AM Edward Capriolo <ed...@gmail.com>
> > wrote:
> >
> >> I do not know how it works for most of the world. But in cloudera where
> the
> >> TEZ options were never popular hive-on-spark represents a solid way to
> get
> >> things done for small datasets lower latency.
> >>
> >> As for the spark adoption. You know a while ago I came up with some
> ways to
> >> make hive more  spark like. One of them was a found a way to make
> "compile"
> >> a hive keyword so folks could build UDFs on the fly. It was such an
> >> uphil climb. Folks found a way to make it disabled by default for
> security.
> >> Then later when things moved from CLI to beeline it was like the ONLY
> thing
> >> that I found not ported. Like it was extremely frustrating.
> >>
> >>
> >>
> >>
> >>
> >>
> >> On Mon, Jul 27, 2020 at 3:19 PM David <da...@gmail.com> wrote:
> >>
> >>> Hello  Xuefu,
> >>>
> >>> I am not part of the Cloudera Hive product team,  though I volunteer to
> >>> work on small projects from time to time.  Perhaps someone from that
> team
> >>> can chime in with some of their thoughts, but personally, I think that
> in
> >>> the long run, there will be more of a merge between Hive-on-Spark and
> >> other
> >>> Spark-native offerings.  I'm not sure what the differentiation will be
> >>> going forward.  With that said, are there any developers on this
> mailing
> >>> list who are willing to take on the maintenance effort of keeping HoS
> >>> moving forward?
> >>>
> >>> http://www.russellspitzer.com/2017/05/19/Spark-Sql-Thriftserver/
> >>>
> >>>
> >>
> https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.6.4/bk_spark-component-guide/content/config-sts.html
> >>>
> >>>
> >>> Thanks.
> >>>
> >>> On Thu, Jul 23, 2020 at 12:35 PM Xuefu Zhang <xu...@apache.org> wrote:
> >>>
> >>>> Previous reasoning seemed to suggest a lack of user adoption. Now we
> >> are
> >>>> concerned about ongoing maintenance effort. Both are valid
> >>> considerations.
> >>>> However, I think we should have ways to find out the answers.
> >> Therefore,
> >>> I
> >>>> suggest the following be carried out:
> >>>>
> >>>> 1. Send out the proposal (removing Hive on Spark) to users including
> >>>> user@hive.apache.org and get their feedback.
> >>>> 2. Ask if any developers on this mailing list are willing to take on
> >> the
> >>>> maintenance effort.
> >>>>
> >>>> I'm concerned about user impact because I can still see issues being
> >>>> reported on HoS from time to time. I'm more concerned about the future
> >> of
> >>>> Hive if we narrow Hive neutrality on execution engines, which will
> >>> possibly
> >>>> force more Hive users to migrate to other alternatives such as Spark
> >> SQL,
> >>>> which is already eroding Hive's user base.
> >>>>
> >>>> Being open and neutral used to be Hive's most admired strengths.
> >>>>
> >>>> Thanks,
> >>>> Xuefu
> >>>>
> >>>>
> >>>> On Wed, Jul 22, 2020 at 8:46 AM Alan Gates <al...@gmail.com>
> >> wrote:
> >>>>
> >>>>> An important point here is I don't believe David is proposing to
> >> remove
> >>>>> Hive on Spark from the 2 or 3 lines, but only from trunk.  Continuing
> >>> to
> >>>>> support it in existing 2 and 3 lines makes sense, but since no one
> >> has
> >>>>> maintained it on trunk for some time and it does not work with many
> >> of
> >>>> the
> >>>>> newer features it should be removed from trunk.
> >>>>>
> >>>>> Alan.
> >>>>>
> >>>>> On Tue, Jul 21, 2020 at 4:10 PM Chao Sun <su...@apache.org> wrote:
> >>>>>
> >>>>>> Thanks David. FWIW Uber is still running Hive on Spark (2.3.4) on a
> >>>> very
> >>>>>> large scale in production right now and I don't think we have any
> >>> plan
> >>>> to
> >>>>>> change it soon.
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Tue, Jul 21, 2020 at 11:28 AM David <da...@gmail.com> wrote:
> >>>>>>
> >>>>>>> Hello,
> >>>>>>>
> >>>>>>> Thanks for the feedback.
> >>>>>>>
> >>>>>>> Just a quick recap: I did propose this @dev and I received
> >>> unanimous
> >>>>> +1's
> >>>>>>> from the community.  After a couple months, I created the PR.
> >>>>>>>
> >>>>>>> Certainly open to discussion, but there hasn't been any
> >> discussion
> >>>> thus
> >>>>>> far
> >>>>>>> because there have been no objections until this point.
> >>>>>>>
> >>>>>>> HoS has low adoption, heavy technical debt, and the manner in
> >> which
> >>>> its
> >>>>>>> build process is setup is impeding some other work that is not
> >> even
> >>>>>> related
> >>>>>>> to HoS.
> >>>>>>>
> >>>>>>> We can deprecate in Hive 3.x and remove in Hive 4.x.  The plan
> >>> would
> >>>> be
> >>>>>> to
> >>>>>>> use Tez moving forward.
> >>>>>>>
> >>>>>>> My point about the vendor's move to Tez is that HoS adoption is
> >>> very
> >>>>> low,
> >>>>>>> it's only going lower, and while I don't know the specifics of
> >> it,
> >>>>> there
> >>>>>>> must be some migration plan in place there (i.e., it must be
> >>> possible
> >>>>> to
> >>>>>> do
> >>>>>>> it already).
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> David
> >>>>>>>
> >>>>>>> On Tue, Jul 21, 2020 at 12:23 PM Xuefu Zhang <xu...@apache.org>
> >>>> wrote:
> >>>>>>>
> >>>>>>>> Hi David,
> >>>>>>>>
> >>>>>>>> While a vendor may not support a component in an open source
> >>>> project,
> >>>>>>>> removing it or not is a decision by and for the community. I
> >>>>> certainly
> >>>>>>>> understand that the vendor you mentioned has contributed a
> >> great
> >>>> deal
> >>>>>>>> (including my personal effort while working there), it's not up
> >>> to
> >>>>> the
> >>>>>>>> vendor to make a call like what is proposed here.
> >>>>>>>>
> >>>>>>>> As a community, we should have gone through a thorough
> >> discussion
> >>>> and
> >>>>>>>> reached a consensus before actually making such a big change,
> >> in
> >>> my
> >>>>>>>> opinion.
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> Xuefu
> >>>>>>>>
> >>>>>>>> On Tue, Jul 21, 2020 at 8:49 AM David <da...@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>>> Hey,
> >>>>>>>>>
> >>>>>>>>> Thanks for the input.
> >>>>>>>>>
> >>>>>>>>> FYI. Cloudera (Cloudera + Hortonworks) have removed HoS from their latest
> >>>>>>>>> offering.
> >>>>>>>>>
> >>>>>>>>> "Tez is now the only supported execution engine, existing queries that
> >>>>>>>>> change execution mode to Spark or MapReduce within a session, for example,
> >>>>>>>>> fail."
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> https://docs.cloudera.com/cdp/latest/upgrade-post/topics/ug_hive_configuration_changes.html
> >>>>>>>>>
> >>>>>>>>> So I don't know who will be supporting this feature moving forward, but
> >>>>>>>>> there has been a lot of work done to make this change as painless as
> >>>>>>>>> possible.  Simply setting the engine to 'tez' and removing the
> >>>>>>>>> HoS-related settings should address many use cases.
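
As a rough illustration of the migration described above (set the engine to 'tez' and drop the HoS-related settings), here is a minimal sketch. The helper migrate_to_tez is hypothetical and not part of Hive, and treating every key that starts with "spark." or "hive.spark." as HoS-related is an assumption made for this example; a real deployment may need a more careful list.

```python
# Hypothetical helper, not part of Hive. The prefixes below are an
# assumption about which settings count as "HoS-related".
HOS_PREFIXES = ("spark.", "hive.spark.")


def migrate_to_tez(settings):
    """Return a copy of settings with HoS keys dropped and the engine set to tez."""
    # Keep every setting whose key does not start with an HoS prefix.
    migrated = {
        key: value
        for key, value in settings.items()
        if not key.startswith(HOS_PREFIXES)
    }
    # Switch the execution engine, whatever it was before.
    migrated["hive.execution.engine"] = "tez"
    return migrated


if __name__ == "__main__":
    before = {
        "hive.execution.engine": "spark",
        "spark.master": "yarn",
        "spark.executor.memory": "4g",
        "hive.exec.parallel": "true",
    }
    print(migrate_to_tez(before))
```

The function returns a new dict rather than mutating its input, so the original settings stay available for comparison or rollback.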
> >>>>>>>>>
> >>>>>>>>> Thanks.
> >>>>>>>>>
> >>>>>>>>> On Tue, Jul 21, 2020 at 11:36 AM Xuefu Z <us...@gmail.com> wrote:
> >>>>>>>>>
> >>>>>>>>>> Sorry for chiming in late. However, I don't think we should remove Hive on
> >>>>>>>>>> Spark just because of a technical problem. This is rather a big decision
> >>>>>>>>>> that we need to be careful about. There are users that will be left high
> >>>>>>>>>> and dry by this move.
> >>>>>>>>>>
> >>>>>>>>>> If the community decides to desupport and eventually remove it, I think we
> >>>>>>>>>> need to have a due process. We also need a deprecation plan if that's what we
> >>>>>>>>>> decide to do. Before that, I'm -1 on this proposal.
> >>>>>>>>>>
> >>>>>>>>>> Thanks,
> >>>>>>>>>> Xuefu
> >>>>>>>>>>
> >>>>>>>>>> On Tue, Jul 21, 2020 at 7:57 AM David <da...@gmail.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Hello Team,
> >>>>>>>>>>>
> >>>>>>>>>>> https://github.com/apache/hive/pull/1285
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks.
> >>>>>>>>>>>
> >>>>>>>>>>> On Wed, Jun 3, 2020 at 11:49 PM Gopal V <gopalv@apache.org> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> +1
> >>>>>>>>>>>>
> >>>>>>>>>>>> Cheers,
> >>>>>>>>>>>> Gopal
> >>>>>>>>>>>>
> >>>>>>>>>>>> On 6/3/20 7:48 PM, Jesus Camacho Rodriguez wrote:
> >>>>>>>>>>>>> +1
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> -Jesús
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Wed, Jun 3, 2020 at 1:58 PM Alan Gates <alanfgates@gmail.com> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> +1.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Alan.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Wed, Jun 3, 2020 at 1:40 PM Prasanth Jayachandran
> >>>>>>>>>>>>>> <pj...@cloudera.com.invalid> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> +1
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On Jun 3, 2020, at 1:38 PM, Ashutosh Chauhan <hashutosh@apache.org> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> +1
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On Wed, Jun 3, 2020 at 1:23 PM David Mollitor <dam6923@gmail.com> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Hello Gang,
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> I have spent some time working on upgrading Avro (far less than others):
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> https://issues.apache.org/jira/browse/HIVE-21737
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> This should be a relatively easy thing to do, but is blocked by
> >>>>>>>>>>>>>>>>> Hive-on-Spark.  HoS has a weird thing where it downloads some
> >>>>>>>>>>>>>>>>> cloud-storage-hosted file of Spark-Hadoop as part of its maven run.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Since HoS is not going to receive updates from the major vendors, is it
> >>>>>>>>>>>>>>>>> time to simply remove it?
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Tests are currently disabled:
> >>>>>>>>>>>>>>>>> https://issues.apache.org/jira/browse/HIVE-23137
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Thanks.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> --
> >>>>>>>>>> Xuefu Zhang
> >>>>>>>>>>
> >>>>>>>>>> "In Honey We Trust!"
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> >
>