Posted to dev@spark.apache.org by Sean Owen <sr...@gmail.com> on 2018/10/26 14:10:57 UTC

Drop support for old Hive in Spark 3.0?

Here's another thread to start considering, and I know it's been raised
before.
What version(s) of Hive should Spark 3 support?

If at least we know it won't include Hive 0.x, could we go ahead and remove
those tests from master? It might significantly reduce the run time and
flakiness.

It seems that maintaining even the Hive 1.x fork is untenable going
forward, right? Does that also imply this support will almost certainly not
be maintained in 3.0?

Per below, it seems like it might even be hard to support both Hive 3 and
Hadoop 2 at the same time?

And while we're at it, what are the pros and cons of supporting only Hadoop
3 in Spark 3? Is the difference in the client / HDFS API even that big? Or
what about focusing only on Hadoop 2.9.x and 3.x support?

Lots of questions; for now I'm just interested in informal reactions, not a
binding decision.

On Thu, Oct 25, 2018 at 11:49 PM Dagang Wei <no...@github.com>
wrote:

> Do we really want to switch to Hive 2.3? From this page
> https://hive.apache.org/downloads.html, Hive 2.3 works with Hadoop 2.x
> (Hive 3.x works with Hadoop 3.x).
>
> —
> You are receiving this because you were mentioned.
> Reply to this email directly, view it on GitHub
> <https://github.com/apache/spark/pull/21588#issuecomment-433285287>, or mute
> the thread
> <https://github.com/notifications/unsubscribe-auth/AAyM-sRygel3il6Ne4FafD5BQ7NDSJ7Mks5uopRlgaJpZM4Usweh>
> .
>

Re: Drop support for old Hive in Spark 3.0?

Posted by Michael Shtelma <ms...@gmail.com>.
Which alternatives to the Thrift server do we really have? If the Thrift
server goes away, there is no other built-in way to connect to Spark SQL over
JDBC, and that is the primary way BI tools connect to Spark SQL. Am I missing
something?
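
For concreteness, this is roughly how a BI tool (or any JDBC client) reaches
Spark SQL through the Thrift server today. This is only a sketch: the host,
port, user, and table name are placeholders, and it assumes the Hive JDBC
driver is on the classpath.

    import java.sql.DriverManager

    // Sketch: talk to a running Spark Thrift Server over the HiveServer2
    // JDBC protocol. "localhost:10000", "spark_user" and "some_table" are
    // placeholder assumptions.
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val conn = DriverManager.getConnection(
      "jdbc:hive2://localhost:10000/default", "spark_user", "")
    val stmt = conn.createStatement()
    val rs   = stmt.executeQuery("SELECT count(*) FROM some_table")
    while (rs.next()) println(rs.getLong(1))
    rs.close(); stmt.close(); conn.close()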

The question is whether Spark wants to be the tool used for online,
interactive queries. The alternative is to process and enrich data with
Spark, push it somewhere else, and use other tools for the actual queries.
Do you think it is better to use Spark that way?


On Fri, Oct 26, 2018 at 7:48 PM Sean Owen <sr...@gmail.com> wrote:

> Maybe that's what I really mean (you can tell I don't follow the Hive part
> closely)
> In my travels, indeed the thrift server has been viewed as an older
> solution to a problem probably better met by others.
> From my perspective it's worth dropping, but, that's just anecdotal.
> Any other arguments for or against the thrift server?
>
> On Fri, Oct 26, 2018 at 12:30 PM Marco Gaido <ma...@gmail.com>
> wrote:
>
>> Hi all,
>>
>> one big problem about getting rid of the Hive fork is the thriftserver,
>> which relies on the HiveServer from the Hive fork.
>> We might migrate to an apache/hive dependency, but not sure this would
>> help that much.
>> I think a broader topic would be the actual opportunity of having a
>> thriftserver directly into Spark. It has many well-known limitations (not
>> fault tolerant, no security/impersonation, etc.etc.) and there are other
>> project which target to provide a thrift/JDBC interface to Spark. Just to
>> be clear I am not proposing to remove the thriftserver in 3.0, but maybe it
>> is something we could evaluate in the long term.
>>
>> Thanks,
>> Marco
>>
>>
>> Il giorno ven 26 ott 2018 alle ore 19:07 Sean Owen <sr...@gmail.com> ha
>> scritto:
>>
>>> OK let's keep this about Hive.
>>>
>>> Right, good point, this is really about supporting metastore versions,
>>> and there is a good argument for retaining backwards-compatibility with
>>> older metastores. I don't know how far, but I guess, as far as is practical?
>>>
>>> Isn't there still a lot of Hive 0.x test code? is that something that's
>>> safe to drop for 3.0?
>>>
>>> And, basically, what must we do to get rid of the Hive fork? that seems
>>> like a must-do.
>>>
>>>
>>>
>>> On Fri, Oct 26, 2018 at 11:51 AM Dongjoon Hyun <do...@gmail.com>
>>> wrote:
>>>
>>>> Hi, Sean and All.
>>>>
>>>> For the first question, we support only Hive Metastore from 1.x ~ 2.x.
>>>> And, we can support Hive Metastore 3.0 simultaneously. Spark is designed
>>>> like that.
>>>>
>>>> I don't think we need to drop old Hive Metastore Support. Is it
>>>> for avoiding Hive Metastore sharing between Spark2 and Spark3 clusters?
>>>>
>>>> I think we should allow that use cases, especially for new Spark 3
>>>> clusters. How do you think so?
>>>>
>>>>
>>>> For the second question, Apache Spark 2.x doesn't support Hive
>>>> officially. It's only a best-effort approach in a boundary of Spark.
>>>>
>>>>
>>>> http://spark.apache.org/docs/latest/sql-programming-guide.html#unsupported-hive-functionality
>>>>
>>>> http://spark.apache.org/docs/latest/sql-programming-guide.html#incompatible-hive-udf
>>>>
>>>>
>>>> Not only the documented one, decimal literal(HIVE-17186) makes a query
>>>> result difference even in the well-known benchmark like TPC-H.
>>>>
>>>> Bests,
>>>> Dongjoon.
>>>>
>>>> PS. For Hadoop, let's have another thread if needed. I expect another
>>>> long story. :)
>>>>
>>>>
>>>> On Fri, Oct 26, 2018 at 7:11 AM Sean Owen <sr...@gmail.com> wrote:
>>>>
>>>>> Here's another thread to start considering, and I know it's been
>>>>> raised before.
>>>>> What version(s) of Hive should Spark 3 support?
>>>>>
>>>>> If at least we know it won't include Hive 0.x, could we go ahead and
>>>>> remove those tests from master? It might significantly reduce the run time
>>>>> and flakiness.
>>>>>
>>>>> It seems that maintaining even the Hive 1.x fork is untenable going
>>>>> forward, right? does that also imply this support is almost certainly not
>>>>> maintained in 3.0?
>>>>>
>>>>> Per below, it seems like it might even be hard to both support Hive 3
>>>>> and Hadoop 2 at the same time?
>>>>>
>>>>> And while we're at it, what's the + and - for simply only supporting
>>>>> Hadoop 3 in Spark 3? Is the difference in client / HDFS API even that big?
>>>>> Or what about focusing only on Hadoop 2.9.x support + 3.x support?
>>>>>
>>>>> Lots of questions, just interested now in informal reactions, not a
>>>>> binding decision.
>>>>>
>>>>> On Thu, Oct 25, 2018 at 11:49 PM Dagang Wei <no...@github.com>
>>>>> wrote:
>>>>>
>>>>>> Do we really want to switch to Hive 2.3? From this page
>>>>>> https://hive.apache.org/downloads.html, Hive 2.3 works with Hadoop
>>>>>> 2.x (Hive 3.x works with Hadoop 3.x).
>>>>>>
>>>>>> —
>>>>>> You are receiving this because you were mentioned.
>>>>>> Reply to this email directly, view it on GitHub
>>>>>> <https://github.com/apache/spark/pull/21588#issuecomment-433285287>,
>>>>>> or mute the thread
>>>>>> <https://github.com/notifications/unsubscribe-auth/AAyM-sRygel3il6Ne4FafD5BQ7NDSJ7Mks5uopRlgaJpZM4Usweh>
>>>>>> .
>>>>>>
>>>>>

Re: Drop support for old Hive in Spark 3.0?

Posted by Reynold Xin <rx...@databricks.com>.
People do use it, and the maintenance cost is pretty low, so I don't think
we should just drop it. We can be explicit that there is not a lot of
development going on and that we are unlikely to add many new features to it,
and users are also welcome to use other JDBC/ODBC endpoint implementations
built by the ecosystem, so the Spark project itself is not under pressure to
keep adding features.


On Fri, Oct 26, 2018 at 10:48 AM Sean Owen <sr...@gmail.com> wrote:

> Maybe that's what I really mean (you can tell I don't follow the Hive part
> closely)
> In my travels, indeed the thrift server has been viewed as an older
> solution to a problem probably better met by others.
> From my perspective it's worth dropping, but, that's just anecdotal.
> Any other arguments for or against the thrift server?
>
> On Fri, Oct 26, 2018 at 12:30 PM Marco Gaido <ma...@gmail.com>
> wrote:
>
>> Hi all,
>>
>> one big problem about getting rid of the Hive fork is the thriftserver,
>> which relies on the HiveServer from the Hive fork.
>> We might migrate to an apache/hive dependency, but not sure this would
>> help that much.
>> I think a broader topic would be the actual opportunity of having a
>> thriftserver directly into Spark. It has many well-known limitations (not
>> fault tolerant, no security/impersonation, etc.etc.) and there are other
>> project which target to provide a thrift/JDBC interface to Spark. Just to
>> be clear I am not proposing to remove the thriftserver in 3.0, but maybe it
>> is something we could evaluate in the long term.
>>
>> Thanks,
>> Marco
>>
>>
>> Il giorno ven 26 ott 2018 alle ore 19:07 Sean Owen <sr...@gmail.com> ha
>> scritto:
>>
>>> OK let's keep this about Hive.
>>>
>>> Right, good point, this is really about supporting metastore versions,
>>> and there is a good argument for retaining backwards-compatibility with
>>> older metastores. I don't know how far, but I guess, as far as is practical?
>>>
>>> Isn't there still a lot of Hive 0.x test code? is that something that's
>>> safe to drop for 3.0?
>>>
>>> And, basically, what must we do to get rid of the Hive fork? that seems
>>> like a must-do.
>>>
>>>
>>>
>>> On Fri, Oct 26, 2018 at 11:51 AM Dongjoon Hyun <do...@gmail.com>
>>> wrote:
>>>
>>>> Hi, Sean and All.
>>>>
>>>> For the first question, we support only Hive Metastore from 1.x ~ 2.x.
>>>> And, we can support Hive Metastore 3.0 simultaneously. Spark is designed
>>>> like that.
>>>>
>>>> I don't think we need to drop old Hive Metastore Support. Is it
>>>> for avoiding Hive Metastore sharing between Spark2 and Spark3 clusters?
>>>>
>>>> I think we should allow that use cases, especially for new Spark 3
>>>> clusters. How do you think so?
>>>>
>>>>
>>>> For the second question, Apache Spark 2.x doesn't support Hive
>>>> officially. It's only a best-effort approach in a boundary of Spark.
>>>>
>>>>
>>>> http://spark.apache.org/docs/latest/sql-programming-guide.html#unsupported-hive-functionality
>>>>
>>>> http://spark.apache.org/docs/latest/sql-programming-guide.html#incompatible-hive-udf
>>>>
>>>>
>>>> Not only the documented one, decimal literal(HIVE-17186) makes a query
>>>> result difference even in the well-known benchmark like TPC-H.
>>>>
>>>> Bests,
>>>> Dongjoon.
>>>>
>>>> PS. For Hadoop, let's have another thread if needed. I expect another
>>>> long story. :)
>>>>
>>>>
>>>> On Fri, Oct 26, 2018 at 7:11 AM Sean Owen <sr...@gmail.com> wrote:
>>>>
>>>>> Here's another thread to start considering, and I know it's been
>>>>> raised before.
>>>>> What version(s) of Hive should Spark 3 support?
>>>>>
>>>>> If at least we know it won't include Hive 0.x, could we go ahead and
>>>>> remove those tests from master? It might significantly reduce the run time
>>>>> and flakiness.
>>>>>
>>>>> It seems that maintaining even the Hive 1.x fork is untenable going
>>>>> forward, right? does that also imply this support is almost certainly not
>>>>> maintained in 3.0?
>>>>>
>>>>> Per below, it seems like it might even be hard to both support Hive 3
>>>>> and Hadoop 2 at the same time?
>>>>>
>>>>> And while we're at it, what's the + and - for simply only supporting
>>>>> Hadoop 3 in Spark 3? Is the difference in client / HDFS API even that big?
>>>>> Or what about focusing only on Hadoop 2.9.x support + 3.x support?
>>>>>
>>>>> Lots of questions, just interested now in informal reactions, not a
>>>>> binding decision.
>>>>>
>>>>> On Thu, Oct 25, 2018 at 11:49 PM Dagang Wei <no...@github.com>
>>>>> wrote:
>>>>>
>>>>>> Do we really want to switch to Hive 2.3? From this page
>>>>>> https://hive.apache.org/downloads.html, Hive 2.3 works with Hadoop
>>>>>> 2.x (Hive 3.x works with Hadoop 3.x).
>>>>>>
>>>>>> —
>>>>>> You are receiving this because you were mentioned.
>>>>>> Reply to this email directly, view it on GitHub
>>>>>> <https://github.com/apache/spark/pull/21588#issuecomment-433285287>,
>>>>>> or mute the thread
>>>>>> <https://github.com/notifications/unsubscribe-auth/AAyM-sRygel3il6Ne4FafD5BQ7NDSJ7Mks5uopRlgaJpZM4Usweh>
>>>>>> .
>>>>>>
>>>>>

Re: Drop support for old Hive in Spark 3.0?

Posted by Sean Owen <sr...@gmail.com>.
Maybe that's what I really mean (you can tell I don't follow the Hive part
closely).
In my travels, the thrift server has indeed been viewed as an older
solution to a problem probably better met by others.
From my perspective it's worth dropping, but that's just anecdotal.
Any other arguments for or against the thrift server?

On Fri, Oct 26, 2018 at 12:30 PM Marco Gaido <ma...@gmail.com> wrote:

> Hi all,
>
> one big problem about getting rid of the Hive fork is the thriftserver,
> which relies on the HiveServer from the Hive fork.
> We might migrate to an apache/hive dependency, but not sure this would
> help that much.
> I think a broader topic would be the actual opportunity of having a
> thriftserver directly into Spark. It has many well-known limitations (not
> fault tolerant, no security/impersonation, etc.etc.) and there are other
> project which target to provide a thrift/JDBC interface to Spark. Just to
> be clear I am not proposing to remove the thriftserver in 3.0, but maybe it
> is something we could evaluate in the long term.
>
> Thanks,
> Marco
>
>
> Il giorno ven 26 ott 2018 alle ore 19:07 Sean Owen <sr...@gmail.com> ha
> scritto:
>
>> OK let's keep this about Hive.
>>
>> Right, good point, this is really about supporting metastore versions,
>> and there is a good argument for retaining backwards-compatibility with
>> older metastores. I don't know how far, but I guess, as far as is practical?
>>
>> Isn't there still a lot of Hive 0.x test code? is that something that's
>> safe to drop for 3.0?
>>
>> And, basically, what must we do to get rid of the Hive fork? that seems
>> like a must-do.
>>
>>
>>
>> On Fri, Oct 26, 2018 at 11:51 AM Dongjoon Hyun <do...@gmail.com>
>> wrote:
>>
>>> Hi, Sean and All.
>>>
>>> For the first question, we support only Hive Metastore from 1.x ~ 2.x.
>>> And, we can support Hive Metastore 3.0 simultaneously. Spark is designed
>>> like that.
>>>
>>> I don't think we need to drop old Hive Metastore Support. Is it
>>> for avoiding Hive Metastore sharing between Spark2 and Spark3 clusters?
>>>
>>> I think we should allow that use cases, especially for new Spark 3
>>> clusters. How do you think so?
>>>
>>>
>>> For the second question, Apache Spark 2.x doesn't support Hive
>>> officially. It's only a best-effort approach in a boundary of Spark.
>>>
>>>
>>> http://spark.apache.org/docs/latest/sql-programming-guide.html#unsupported-hive-functionality
>>>
>>> http://spark.apache.org/docs/latest/sql-programming-guide.html#incompatible-hive-udf
>>>
>>>
>>> Not only the documented one, decimal literal(HIVE-17186) makes a query
>>> result difference even in the well-known benchmark like TPC-H.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>> PS. For Hadoop, let's have another thread if needed. I expect another
>>> long story. :)
>>>
>>>
>>> On Fri, Oct 26, 2018 at 7:11 AM Sean Owen <sr...@gmail.com> wrote:
>>>
>>>> Here's another thread to start considering, and I know it's been raised
>>>> before.
>>>> What version(s) of Hive should Spark 3 support?
>>>>
>>>> If at least we know it won't include Hive 0.x, could we go ahead and
>>>> remove those tests from master? It might significantly reduce the run time
>>>> and flakiness.
>>>>
>>>> It seems that maintaining even the Hive 1.x fork is untenable going
>>>> forward, right? does that also imply this support is almost certainly not
>>>> maintained in 3.0?
>>>>
>>>> Per below, it seems like it might even be hard to both support Hive 3
>>>> and Hadoop 2 at the same time?
>>>>
>>>> And while we're at it, what's the + and - for simply only supporting
>>>> Hadoop 3 in Spark 3? Is the difference in client / HDFS API even that big?
>>>> Or what about focusing only on Hadoop 2.9.x support + 3.x support?
>>>>
>>>> Lots of questions, just interested now in informal reactions, not a
>>>> binding decision.
>>>>
>>>> On Thu, Oct 25, 2018 at 11:49 PM Dagang Wei <no...@github.com>
>>>> wrote:
>>>>
>>>>> Do we really want to switch to Hive 2.3? From this page
>>>>> https://hive.apache.org/downloads.html, Hive 2.3 works with Hadoop
>>>>> 2.x (Hive 3.x works with Hadoop 3.x).
>>>>>
>>>>> —
>>>>> You are receiving this because you were mentioned.
>>>>> Reply to this email directly, view it on GitHub
>>>>> <https://github.com/apache/spark/pull/21588#issuecomment-433285287>,
>>>>> or mute the thread
>>>>> <https://github.com/notifications/unsubscribe-auth/AAyM-sRygel3il6Ne4FafD5BQ7NDSJ7Mks5uopRlgaJpZM4Usweh>
>>>>> .
>>>>>
>>>>

Re: Drop support for old Hive in Spark 3.0?

Posted by Marco Gaido <ma...@gmail.com>.
Hi all,

One big problem with getting rid of the Hive fork is the thriftserver,
which relies on the HiveServer from the Hive fork.
We might migrate to an apache/hive dependency, but I'm not sure this would
help that much.
I think a broader topic is whether it still makes sense to have a
thriftserver directly in Spark. It has many well-known limitations (not
fault tolerant, no security/impersonation, etc.) and there are other
projects which aim to provide a thrift/JDBC interface to Spark. Just to be
clear, I am not proposing to remove the thriftserver in 3.0, but maybe it is
something we could evaluate in the long term.

Thanks,
Marco


Il giorno ven 26 ott 2018 alle ore 19:07 Sean Owen <sr...@gmail.com> ha
scritto:

> OK let's keep this about Hive.
>
> Right, good point, this is really about supporting metastore versions, and
> there is a good argument for retaining backwards-compatibility with older
> metastores. I don't know how far, but I guess, as far as is practical?
>
> Isn't there still a lot of Hive 0.x test code? is that something that's
> safe to drop for 3.0?
>
> And, basically, what must we do to get rid of the Hive fork? that seems
> like a must-do.
>
>
>
> On Fri, Oct 26, 2018 at 11:51 AM Dongjoon Hyun <do...@gmail.com>
> wrote:
>
>> Hi, Sean and All.
>>
>> For the first question, we support only Hive Metastore from 1.x ~ 2.x.
>> And, we can support Hive Metastore 3.0 simultaneously. Spark is designed
>> like that.
>>
>> I don't think we need to drop old Hive Metastore Support. Is it
>> for avoiding Hive Metastore sharing between Spark2 and Spark3 clusters?
>>
>> I think we should allow that use cases, especially for new Spark 3
>> clusters. How do you think so?
>>
>>
>> For the second question, Apache Spark 2.x doesn't support Hive
>> officially. It's only a best-effort approach in a boundary of Spark.
>>
>>
>> http://spark.apache.org/docs/latest/sql-programming-guide.html#unsupported-hive-functionality
>>
>> http://spark.apache.org/docs/latest/sql-programming-guide.html#incompatible-hive-udf
>>
>>
>> Not only the documented one, decimal literal(HIVE-17186) makes a query
>> result difference even in the well-known benchmark like TPC-H.
>>
>> Bests,
>> Dongjoon.
>>
>> PS. For Hadoop, let's have another thread if needed. I expect another
>> long story. :)
>>
>>
>> On Fri, Oct 26, 2018 at 7:11 AM Sean Owen <sr...@gmail.com> wrote:
>>
>>> Here's another thread to start considering, and I know it's been raised
>>> before.
>>> What version(s) of Hive should Spark 3 support?
>>>
>>> If at least we know it won't include Hive 0.x, could we go ahead and
>>> remove those tests from master? It might significantly reduce the run time
>>> and flakiness.
>>>
>>> It seems that maintaining even the Hive 1.x fork is untenable going
>>> forward, right? does that also imply this support is almost certainly not
>>> maintained in 3.0?
>>>
>>> Per below, it seems like it might even be hard to both support Hive 3
>>> and Hadoop 2 at the same time?
>>>
>>> And while we're at it, what's the + and - for simply only supporting
>>> Hadoop 3 in Spark 3? Is the difference in client / HDFS API even that big?
>>> Or what about focusing only on Hadoop 2.9.x support + 3.x support?
>>>
>>> Lots of questions, just interested now in informal reactions, not a
>>> binding decision.
>>>
>>> On Thu, Oct 25, 2018 at 11:49 PM Dagang Wei <no...@github.com>
>>> wrote:
>>>
>>>> Do we really want to switch to Hive 2.3? From this page
>>>> https://hive.apache.org/downloads.html, Hive 2.3 works with Hadoop 2.x
>>>> (Hive 3.x works with Hadoop 3.x).
>>>>
>>>> —
>>>> You are receiving this because you were mentioned.
>>>> Reply to this email directly, view it on GitHub
>>>> <https://github.com/apache/spark/pull/21588#issuecomment-433285287>,
>>>> or mute the thread
>>>> <https://github.com/notifications/unsubscribe-auth/AAyM-sRygel3il6Ne4FafD5BQ7NDSJ7Mks5uopRlgaJpZM4Usweh>
>>>> .
>>>>
>>>

Re: Drop support for old Hive in Spark 3.0?

Posted by Sean Owen <sr...@gmail.com>.
OK let's keep this about Hive.

Right, good point: this is really about supporting metastore versions, and
there is a good argument for retaining backwards-compatibility with older
metastores. I don't know how far back, but I guess as far as is practical?

Isn't there still a lot of Hive 0.x test code? Is that something that's
safe to drop for 3.0?

And, basically, what must we do to get rid of the Hive fork? That seems
like a must-do.



On Fri, Oct 26, 2018 at 11:51 AM Dongjoon Hyun <do...@gmail.com>
wrote:

> Hi, Sean and All.
>
> For the first question, we support only Hive Metastore from 1.x ~ 2.x.
> And, we can support Hive Metastore 3.0 simultaneously. Spark is designed
> like that.
>
> I don't think we need to drop old Hive Metastore Support. Is it
> for avoiding Hive Metastore sharing between Spark2 and Spark3 clusters?
>
> I think we should allow that use cases, especially for new Spark 3
> clusters. How do you think so?
>
>
> For the second question, Apache Spark 2.x doesn't support Hive officially.
> It's only a best-effort approach in a boundary of Spark.
>
>
> http://spark.apache.org/docs/latest/sql-programming-guide.html#unsupported-hive-functionality
>
> http://spark.apache.org/docs/latest/sql-programming-guide.html#incompatible-hive-udf
>
>
> Not only the documented one, decimal literal(HIVE-17186) makes a query
> result difference even in the well-known benchmark like TPC-H.
>
> Bests,
> Dongjoon.
>
> PS. For Hadoop, let's have another thread if needed. I expect another long
> story. :)
>
>
> On Fri, Oct 26, 2018 at 7:11 AM Sean Owen <sr...@gmail.com> wrote:
>
>> Here's another thread to start considering, and I know it's been raised
>> before.
>> What version(s) of Hive should Spark 3 support?
>>
>> If at least we know it won't include Hive 0.x, could we go ahead and
>> remove those tests from master? It might significantly reduce the run time
>> and flakiness.
>>
>> It seems that maintaining even the Hive 1.x fork is untenable going
>> forward, right? does that also imply this support is almost certainly not
>> maintained in 3.0?
>>
>> Per below, it seems like it might even be hard to both support Hive 3 and
>> Hadoop 2 at the same time?
>>
>> And while we're at it, what's the + and - for simply only supporting
>> Hadoop 3 in Spark 3? Is the difference in client / HDFS API even that big?
>> Or what about focusing only on Hadoop 2.9.x support + 3.x support?
>>
>> Lots of questions, just interested now in informal reactions, not a
>> binding decision.
>>
>> On Thu, Oct 25, 2018 at 11:49 PM Dagang Wei <no...@github.com>
>> wrote:
>>
>>> Do we really want to switch to Hive 2.3? From this page
>>> https://hive.apache.org/downloads.html, Hive 2.3 works with Hadoop 2.x
>>> (Hive 3.x works with Hadoop 3.x).
>>>
>>> —
>>> You are receiving this because you were mentioned.
>>> Reply to this email directly, view it on GitHub
>>> <https://github.com/apache/spark/pull/21588#issuecomment-433285287>, or mute
>>> the thread
>>> <https://github.com/notifications/unsubscribe-auth/AAyM-sRygel3il6Ne4FafD5BQ7NDSJ7Mks5uopRlgaJpZM4Usweh>
>>> .
>>>
>>

Re: Drop support for old Hive in Spark 3.0?

Posted by Dongjoon Hyun <do...@gmail.com>.
Hi, Sean and All.

For the first question, we currently support only Hive Metastore 1.x through
2.x, and we can support Hive Metastore 3.0 alongside those; Spark is designed
for that.
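
As a sketch of how a deployment already picks the metastore client version
today (the version string and jar source below are just example values, not
a recommendation):

    import org.apache.spark.sql.SparkSession

    // Sketch: choose which Hive metastore client version this application
    // talks to. spark.sql.hive.metastore.version / .jars are the existing
    // configs for this; "2.3.3" and "maven" are placeholder example values.
    val spark = SparkSession.builder()
      .appName("metastore-compat-sketch")
      .config("spark.sql.hive.metastore.version", "2.3.3")
      .config("spark.sql.hive.metastore.jars", "maven") // or a classpath with matching jars
      .enableHiveSupport()
      .getOrCreate()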

I don't think we need to drop old Hive Metastore support. Is the motivation
to avoid sharing a Hive Metastore between Spark 2 and Spark 3 clusters?

I think we should still allow those use cases, especially for new Spark 3
clusters. What do you think?


For the second question, Apache Spark 2.x doesn't officially support Hive;
it's only a best-effort approach within the boundaries of Spark.

http://spark.apache.org/docs/latest/sql-programming-guide.html#unsupported-hive-functionality
http://spark.apache.org/docs/latest/sql-programming-guide.html#incompatible-hive-udf


Beyond the documented differences, the decimal literal change (HIVE-17186)
causes query result differences even in a well-known benchmark like TPC-H.
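
To illustrate the class of difference (this is only a sketch, not the exact
HIVE-17186 semantics): whether a bare literal ends up typed as DECIMAL or
DOUBLE changes the rounding of downstream arithmetic, so the same query text
can return slightly different numbers.

    // Assumes a SparkSession named `spark` (e.g. from spark-shell, or the
    // sketch above). Forces the two possible literal types side by side to
    // show the rounding gap.
    spark.sql(
      """SELECT CAST(1.0 AS DECIMAL(10, 2)) / 3 AS as_decimal,
        |       CAST(1.0 AS DOUBLE)         / 3 AS as_double""".stripMargin
    ).show(truncate = false)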

Bests,
Dongjoon.

PS. For Hadoop, let's have another thread if needed. I expect another long
story. :)


On Fri, Oct 26, 2018 at 7:11 AM Sean Owen <sr...@gmail.com> wrote:

> Here's another thread to start considering, and I know it's been raised
> before.
> What version(s) of Hive should Spark 3 support?
>
> If at least we know it won't include Hive 0.x, could we go ahead and
> remove those tests from master? It might significantly reduce the run time
> and flakiness.
>
> It seems that maintaining even the Hive 1.x fork is untenable going
> forward, right? does that also imply this support is almost certainly not
> maintained in 3.0?
>
> Per below, it seems like it might even be hard to both support Hive 3 and
> Hadoop 2 at the same time?
>
> And while we're at it, what's the + and - for simply only supporting
> Hadoop 3 in Spark 3? Is the difference in client / HDFS API even that big?
> Or what about focusing only on Hadoop 2.9.x support + 3.x support?
>
> Lots of questions, just interested now in informal reactions, not a
> binding decision.
>
> On Thu, Oct 25, 2018 at 11:49 PM Dagang Wei <no...@github.com>
> wrote:
>
>> Do we really want to switch to Hive 2.3? From this page
>> https://hive.apache.org/downloads.html, Hive 2.3 works with Hadoop 2.x
>> (Hive 3.x works with Hadoop 3.x).
>>
>> —
>> You are receiving this because you were mentioned.
>> Reply to this email directly, view it on GitHub
>> <https://github.com/apache/spark/pull/21588#issuecomment-433285287>, or mute
>> the thread
>> <https://github.com/notifications/unsubscribe-auth/AAyM-sRygel3il6Ne4FafD5BQ7NDSJ7Mks5uopRlgaJpZM4Usweh>
>> .
>>
>