Posted to user@hive.apache.org by Artur Sukhenko <ar...@gmail.com> on 2019/04/15 11:35:00 UTC

Hive on Tez vs Impala

Hi,
We are using CDH 5 with Impala 2.7.0-cdh5.9.1 and Hive 1.1 (MapReduce).
I can't find any information on how Hive on Tez performs compared to Impala.
Does anyone know, or has anyone compared them?

Thanks

Artur Sukhenko

Re: Hive on Tez vs Impala

Posted by Gopal Vijayaraghavan <go...@apache.org>.
> I wish the Hive team would keep things more backward-compatible as well. Hive is such an enormous system with a widespread impact, so any backward-incompatible change could cause an uproar in the community.

The incompatibilities were unavoidable in a set of situations. A lot of those were already in Hive 2, but hidden away or deliberately disabled, to make Hive 3 into what it is.

Here's a quick run-down of how the incompatibilities at the table level allow an end user to run more SQL:

https://www.slideshare.net/dbist/hive-3-a-new-horizon/10

The incompatibilities form the foundation for something like "How do I have Kafka streams offloaded to S3 cold data stores, but still query down to the last second without the small file problem?".

Cheers,
Gopal



Re: Hive on Tez vs Impala

Posted by Thai Bui <bl...@gmail.com>.
I'm using Hive 3.1 on Tez/LLAP and I must say the experience was not good,
but it was worth it. We built Hive from HDP's hive-release, added the Tez UI
back, and combined that with Hue 4.3 (also built from Cloudera Hue). Now that
the two companies have merged, I think things are going to get better (I'm
not an enterprise user of either CDH or HDP; we build our own distro based
on their open-source versions). Hue is now trying to integrate with Atlas
and Ranger, which is a really good step.

We like Tez because it has been stable enough for batch processing jobs.
The LLAP and vectorized side of things is a different story, and that's
where the new Hive is going to be. However, in our opinion it historically
hasn't been as stable as pure Tez containers. LLAP + vectorized execution
can bring query times down to sub-second if you have the hardware for it (an
instance with at least 128G of memory and a good 10Gbit network, i3.4xlarge
on AWS for example). It's actually faster than Presto (in our case AWS
Athena as well) in a few cases; however, I would say they are very
comparable.

I like the fact that we can use a single SQL dialect (for both batch and
interactive queries) using a combination of Hive 3.x on Tez and Hive 3.1 on
LLAP. There's no context switching between different dialects, wasting our
time on LATERAL VIEW explode(...) vs. CROSS JOIN unnest(...).
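
For readers who have not hit this dialect split, here is a rough sketch of what the two forms look like. The `events` table and its `id` and `tags` columns are hypothetical, made up purely for illustration. The small Python function is only a model of what both queries compute (one output row per array element), not anything taken from Hive or Presto.

```python
# Hypothetical table: events(id INT, tags ARRAY<STRING>). Names are illustrative only.

# HiveQL: flatten the array column with LATERAL VIEW explode(...)
HIVE_QUERY = """
SELECT e.id, t.tag
FROM events e
LATERAL VIEW explode(e.tags) t AS tag
"""

# Presto / Athena: flatten the same column with CROSS JOIN UNNEST(...)
PRESTO_QUERY = """
SELECT e.id, t.tag
FROM events e
CROSS JOIN UNNEST(e.tags) AS t (tag)
"""


def explode_rows(rows):
    """Pure-Python model of both queries: emit one (id, tag) pair per array element.

    Rows whose array is empty produce no output, which matches a plain
    (non-OUTER) LATERAL VIEW and a CROSS JOIN UNNEST.
    """
    return [(row_id, tag) for row_id, tags in rows for tag in tags]


if __name__ == "__main__":
    sample = [(1, ["a", "b"]), (2, []), (3, ["c"])]
    for pair in explode_rows(sample):
        print(pair)
```

The two statements express the same flattening, so keeping batch and interactive work on one engine family means only one of these spellings has to be remembered.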

One thing I must say, though: Hive 3 has a few backwards-incompatible
changes, so be careful. For example, the switch of managed tables to
transactional tables by default has broken many of our assumptions. I wish
the Hive team would keep things more backward-compatible as well. Hive is
such an enormous system with a widespread impact, so any
backward-incompatible change could cause an uproar in the community.

On Tue, Apr 16, 2019 at 8:08 AM Edward Capriolo <ed...@gmail.com>
wrote:

> I have changed jobs 3 times since Tez was introduced. It is a true waste
> of compute resources and time that it was never patched in. So I either
> have to waste my time patching it in, waste my time running a side
> deployment, or not install it and waste money having queries run longer
> on the MR/Spark engine.
>
> Imagine how many compute hours have been lost worldwide.


-- 
Thai

Re: Hive on Tez vs Impala

Posted by Edward Capriolo <ed...@gmail.com>.
I have changed jobs 3 times since Tez was introduced. It is a true waste of
compute resources and time that it was never patched in. So I either have
to waste my time patching it in, waste my time running a side deployment,
or not install it and waste money having queries run longer on the MR/Spark
engine.

Imagine how many compute hours have been lost worldwide.
On Tuesday, April 16, 2019, Manoj Murumkar <ma...@gmail.com> wrote:

> If we install our own build of Hive, we'll be out of support from CDH.
>
> Tez is not supported anyway and we're not touching any CDH bits, so it's
> not a big issue to have our own build of Tez engine.


-- 
Sorry this was sent from mobile. Will do less grammar and spell check than
usual.

Re: Hive on Tez vs Impala

Posted by Manoj Murumkar <ma...@gmail.com>.
If we install our own build of Hive, we'll be out of support from CDH. 

Tez is not supported anyway and we're not touching any CDH bits, so it's not a big issue to have our own build of Tez engine.

> On Apr 15, 2019, at 9:20 PM, Gopal Vijayaraghavan <go...@apache.org> wrote:
> 
> 
> Hi,
> 
>>> However, we have built Tez on CDH and it runs just fine.
> 
> Down that path you'll also need to deploy a slightly newer version of Hive as well, because Hive 1.1 is a bit ancient & has known bugs with the tez planner code.
> 
> You effectively end up building the hortonworks/hive-release builds, by undoing the non-htrace tracing impl & applying the htrace one back etc.
> 
>> Lol. I was hoping that the merger would unblock the "saltyness". 
> 
> Historically, I've unofficially supported folks using Tez on CDH in prod (assuming they buy me enough coffee), though I might have to discontinue that.
> 
> https://github.com/t3rmin4t0r/tez-autobuild/blob/llap/vendor-repos.xml#L11
> 
> Cheers,
> Gopal
> 
> 

Re: Hive on Tez vs Impala

Posted by Gopal Vijayaraghavan <go...@apache.org>.
Hi,

>> However, we have built Tez on CDH and it runs just fine.

Down that path you'll also need to deploy a slightly newer version of Hive as well, because Hive 1.1 is a bit ancient & has known bugs with the tez planner code.

You effectively end up building the hortonworks/hive-release builds, by undoing the non-htrace tracing impl & applying the htrace one back etc.

> Lol. I was hoping that the merger would unblock the "saltyness". 

Historically, I've unofficially supported folks using Tez on CDH in prod (assuming they buy me enough coffee), though I might have to discontinue that.

https://github.com/t3rmin4t0r/tez-autobuild/blob/llap/vendor-repos.xml#L11

Cheers,
Gopal



Re: Hive on Tez vs Impala

Posted by Edward Capriolo <ed...@gmail.com>.
Lol. I was hoping that the merger would unblock the "saltyness". I wonder
what the official position is now, because back in the day there was a
puff piece produced to the effect that Hive was not the way forward and
Impala was the bee's knees.

On Monday, April 15, 2019, Manoj Murumkar <ma...@gmail.com> wrote:

> No, not yet. However, we have built Tez on CDH and it runs just fine.
> The following blog post summarizes part of the work (it's a bit old; we
> currently run Tez 0.9.1 on CDH 5.16.1).
>
> https://blog.upala.com/2017/03/04/setting-up-tez-on-cdh-cluster/
>
> The blog says to use ATS from open-source Hadoop, which will not work if
> you've kerberized the cluster. You'll have to build a version of ATS
> against CDH libraries that provides the classes needed to run the engine.
> We have done this work as well and it runs pretty smoothly.


Re: Hive on Tez vs Impala

Posted by Manoj Murumkar <ma...@gmail.com>.
No, not yet. However, we have built Tez on CDH and it runs just fine.
The following blog post summarizes part of the work (it's a bit old; we
currently run Tez 0.9.1 on CDH 5.16.1).

https://blog.upala.com/2017/03/04/setting-up-tez-on-cdh-cluster/

The blog says to use ATS from open-source Hadoop, which will not work if
you've kerberized the cluster. You'll have to build a version of ATS
against CDH libraries that provides the classes needed to run the engine.
We have done this work as well and it runs pretty smoothly.



On Mon, Apr 15, 2019 at 8:33 AM Edward Capriolo <ed...@gmail.com>
wrote:

> Out of band question. Given:
> https://hortonworks.com/blog/welcome-brand-new-cloudera/
>
> Does CDH finally ship with a Tez you don't have to manually patch in?

Re: Hive on Tez vs Impala

Posted by Edward Capriolo <ed...@gmail.com>.
Out of band question. Given:
https://hortonworks.com/blog/welcome-brand-new-cloudera/

Does CDH finally ship with a Tez you don't have to manually patch in?
On Monday, April 15, 2019, Sungwoo Park <gl...@gmail.com> wrote:

> I tested the performance of Impala 2.12.0+cdh5.15.2+0 in Cloudera CDH
> 5.15.2 a while ago. I compared it with Hive 3.1.1 on MR3 (where MR3 is a
> new execution engine for Hadoop and Kubernetes). You can find the result at:
>
> https://mr3.postech.ac.kr/blog/2019/03/22/performance-evaluation-0.6/
>
> On average, Hive on MR3 is about 30% faster than Hive on Tez on sequential
> queries. For concurrent queries, the throughput of Hive on MR3 is about
> three times higher than Hive on Tez (when tested with 16 concurrent
> queries). You can find the result at:
>
> https://mr3.postech.ac.kr/blog/2018/10/30/performance-evaluation-0.4/
>
> --- Sungwoo Park


Re: Hive on Tez vs Impala

Posted by Artur Sukhenko <ar...@gmail.com>.
Thanks Sungwoo, very nice articles.

On Mon, Apr 15, 2019 at 5:38 PM Sungwoo Park <gl...@gmail.com> wrote:

> I tested the performance of Impala 2.12.0+cdh5.15.2+0 in Cloudera CDH
> 5.15.2 a while ago. I compared it with Hive 3.1.1 on MR3 (where MR3 is a
> new execution engine for Hadoop and Kubernetes). You can find the result at:
>
> https://mr3.postech.ac.kr/blog/2019/03/22/performance-evaluation-0.6/
>
> On average, Hive on MR3 is about 30% faster than Hive on Tez on sequential
> queries. For concurrent queries, the throughput of Hive on MR3 is about
> three times higher than Hive on Tez (when tested with 16 concurrent
> queries). You can find the result at:
>
> https://mr3.postech.ac.kr/blog/2018/10/30/performance-evaluation-0.4/
>
> --- Sungwoo Park

Re: Hive on Tez vs Impala

Posted by Sungwoo Park <gl...@gmail.com>.
I tested the performance of Impala 2.12.0+cdh5.15.2+0 in Cloudera CDH
5.15.2 a while ago. I compared it with Hive 3.1.1 on MR3 (where MR3 is a
new execution engine for Hadoop and Kubernetes). You can find the result at:

https://mr3.postech.ac.kr/blog/2019/03/22/performance-evaluation-0.6/

On average, Hive on MR3 is about 30% faster than Hive on Tez on sequential
queries. For concurrent queries, the throughput of Hive on MR3 is about
three times higher than Hive on Tez (when tested with 16 concurrent
queries). You can find the result at:

https://mr3.postech.ac.kr/blog/2018/10/30/performance-evaluation-0.4/

--- Sungwoo Park

On Mon, Apr 15, 2019 at 8:44 PM Artur Sukhenko <ar...@gmail.com>
wrote:

> Hi,
> We are using CDH 5 with Impala 2.7.0-cdh5.9.1 and Hive 1.1 (MapReduce).
> I can't find any information on how Hive on Tez performs compared to Impala.
> Does anyone know, or has anyone compared them?
>
> Thanks
>
> Artur Sukhenko
>