Posted to user@hive.apache.org by Wangwenli <wa...@huawei.com> on 2016/07/14 07:05:58 UTC

Re: Re: Re: Using Spark on Hive with Hive also using Spark as its execution engine

I am using the latest 1.x, but I also checked the latest code on the master branch (2.x); there is no update.

Regards
Wenli

From: Mich Talebzadeh [mailto:mich.talebzadeh@gmail.com]
Sent: 14 July 2016 15:02
To: user
Subject: Re: Re: Re: Using Spark on Hive with Hive also using Spark as its execution engine

Which version of Hive and Spark, please?


Dr Mich Talebzadeh



LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.



On 14 July 2016 at 07:35, Wangwenli <wa...@huawei.com> wrote:
It is specific to HoS

From: Mich Talebzadeh [mailto:mich.talebzadeh@gmail.com]
Sent: 14 July 2016 11:55
To: user
Subject: Re: Re: Using Spark on Hive with Hive also using Spark as its execution engine

Hi Wenli,

You mentioned:

Coming to HoS, I think the main problem now is that many optimizations should be done, but there seems to be no progress. For example, conditional tasks and union SQL cannot be converted to a mapjoin (HIVE-9044), etc.; many optimization features are pending, with no one working on them.

Is this issue specific to Hive on Spark, or does it apply equally to Hive on MapReduce as well? In other words, is HIVE-9044 a general issue with the Hive optimizer?

Thanks





Dr Mich Talebzadeh






On 14 July 2016 at 01:56, Wangwenli <wa...@huawei.com> wrote:
It seems LLAP is like Tachyon, whose purpose is also to cache data between applications.

Coming to HoS, I think the main problem now is that many optimizations should be done, but there seems to be no progress. For example, conditional tasks and union SQL cannot be converted to a mapjoin (HIVE-9044), etc.; many optimization features are pending, with no one working on them.
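
To make the union/mapjoin point concrete, here is a minimal sketch of the query shape described above (the table names big_a, big_b, and dim are hypothetical; dim is assumed small enough for a mapjoin, i.e. a broadcast hash join):

    -- hypothetical tables: big_a and big_b are large, dim is small
    SELECT u.key, d.name
    FROM (SELECT key FROM big_a
          UNION ALL
          SELECT key FROM big_b) u
    JOIN dim d ON u.key = d.key;
    -- per the issue described above, HoS falls back to a shuffle join
    -- here even when dim is below the mapjoin size threshold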

In contrast, Spark SQL is improving very fast.

Regards
wenli
From: Mich Talebzadeh [mailto:mich.talebzadeh@gmail.com]
Sent: 13 July 2016 7:21
To: user
Subject: Re: Using Spark on Hive with Hive also using Spark as its execution engine

I just read further notes on LLAP.

As Gopal explained, there is more to LLAP than just in-memory caching, and I quote Gopal:

"...  LLAP is designed to be hammered by multiple user sessions running  different queries, designed to automate the cache eviction & selection  process. There's no user visible explicit .cache() to remember - it's  automatic and concurrent. ..."

Sounds like what Oracle classic or SAP ASE do in terms of buffer management strategy. As I understand it, Spark does not have this concept of a hot area (an MRU/LRU chain). It loads data into its memory when needed and then gets rid of it. If ten users read the same table, the blocks from that table will be loaded ten times, which is not efficient.
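
To illustrate the contrast, a hedged Spark SQL sketch (the table name sales is hypothetical): caching is explicit and scoped to the application that issues it, so a second application reading the same table loads the data again from storage.

    -- explicit, per-application caching in Spark SQL
    CACHE TABLE sales;            -- pins sales in this application's memory
    SELECT count(*) FROM sales;   -- served from this application's cache
    UNCACHE TABLE sales;          -- releases it; other applications never
                                  -- shared this cache in the first place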

LLAP is more intelligent in this respect. Somehow it maintains a Most Recently Used (MRU) / Least Recently Used (LRU) chain, and it maintains this buffer management strategy throughout the cluster. It must be using some clever algorithm to do so.

Cheers





Dr Mich Talebzadeh






On 12 July 2016 at 15:59, Mich Talebzadeh <mi...@gmail.com> wrote:
Thanks Alan. Point taken.

In mitigation, there are members of the Spark forum who have shown interest in using Hive directly, and I quote one:

"Did you have any benchmark for using Spark as backend engine for Hive vs using Spark thrift server (and run spark code for hive queries)? We are using later but it will be very useful to remove thriftserver, if we can. "

Cheers,

Mich


Dr Mich Talebzadeh






On 12 July 2016 at 15:39, Alan Gates <al...@gmail.com> wrote:

> On Jul 11, 2016, at 16:22, Mich Talebzadeh <mi...@gmail.com> wrote:
>
> <snip>
>       • If I add LLAP, will that be more efficient in terms of memory usage compared to Hive or not? Will it keep the data in memory for reuse or not?
>
Yes, this is exactly what LLAP does.  It keeps a cache of hot data (hot columns of hot partitions) and shares that across queries.  Unlike many MPP caches it will cache the same data on multiple nodes if it has more workers that want to access the data than can be run on a single node.
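For readers who want to try this, a hedged sketch of pointing a session at LLAP; it assumes Hive 2.x with LLAP daemons already running, and the query and table names are illustrative:

    -- assumes running LLAP daemons (Hive 2.x)
    SET hive.execution.mode=llap;       -- execute fragments inside LLAP daemons
    SET hive.llap.execution.mode=all;   -- let all eligible work run in LLAP
    SELECT customer, SUM(amount)        -- hypothetical query; hot columns of
    FROM sales GROUP BY customer;       -- sales get cached and shared across queries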

As a side note, it is considered bad form in Apache to send a message to two lists.  It causes a lot of background noise for people on the Spark list who probably aren’t interested in Hive performance.

Alan.





Re: Re: Re: Re: Using Spark on Hive with Hive also using Spark as its execution engine

Posted by Wangwenli <wa...@huawei.com>.
I think you know that what I said relates to the execution plan, right?
My Spark version is 1.5.x.

Regards
Wenli
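
One way to check the execution-plan point Wenli raises is to inspect the plan directly; a hedged sketch with the same hypothetical table names as above:

    -- if the optimization is missing, the join below appears in the
    -- EXPLAIN output as a common (shuffle) join rather than a map join
    EXPLAIN
    SELECT u.key, d.name
    FROM (SELECT key FROM big_a
          UNION ALL
          SELECT key FROM big_b) u
    JOIN dim d ON u.key = d.key;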

From: Mich Talebzadeh [mailto:mich.talebzadeh@gmail.com]
Sent: 14 July 2016 15:17
To: user
Subject: Re: Re: Re: Re: Using Spark on Hive with Hive also using Spark as its execution engine

Fine. Which version of Spark are you using for the Hive execution/query engine, please?


Dr Mich Talebzadeh



Re: Re: Re: Re: Using Spark on Hive with Hive also using Spark as its execution engine

Posted by Mich Talebzadeh <mi...@gmail.com>.
Fine. Which version of Spark are you using for the Hive execution/query engine, please?

Dr Mich Talebzadeh


