Posted to user@spark.apache.org by Ashok Kumar <as...@yahoo.com.INVALID> on 2016/07/11 14:22:58 UTC

Re: Using Spark on Hive with Hive also using Spark as its execution engine

Hi Mich,
Regarding your recent presentation in London on this topic, "Running Spark on Hive or Hive on Spark":
Have you made any more interesting findings that you would like to bring up?
If Hive offers both Spark and Tez in addition to MR, what is stopping one from using Spark? I still don't get why TEZ + LLAP would be a better choice, from what you mentioned.
Thanking you
 

    On Tuesday, 31 May 2016, 20:22, Mich Talebzadeh <mi...@gmail.com> wrote:
 

A couple of points if I may, and kindly bear with my remarks.
Whilst it will be very interesting to try TEZ with LLAP, here is what I read about LLAP:
"Sub-second queries require fast query execution and low setup cost. The challenge for Hive is to achieve this without giving up on the scale and flexibility that users depend on. This requires a new approach using a hybrid engine that leverages Tez and something new called  LLAP (Live Long and Process, #llap online).
LLAP is an optional daemon process running on multiple nodes, that provides the following:   
   - Caching and data reuse across queries with compressed columnar data in-memory (off-heap)
   - Multi-threaded execution including reads with predicate pushdown and hash joins
   - High throughput IO using Async IO Elevator with dedicated thread and core per disk
   - Granular column level security across applications"
OK, so we have added an in-memory capability to TEZ by way of LLAP. In other words, what Spark does already, and BTW Spark does not require a daemon running on any host. Don't get me wrong. It is interesting, but this sounds to me (without testing it myself) like adding caching capability to TEZ to bring it on par with Spark.
Remember:
Spark -> DAG + in-memory caching
TEZ = MR on DAG
TEZ + LLAP => DAG + in-memory caching
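
To make the Spark side of that concrete, here is a minimal Spark SQL sketch (the sales table and the date predicate are hypothetical, purely for illustration):

-- The cache lives inside the session's executors; no separate daemon involved.
CACHE TABLE hot_hour AS
SELECT * FROM sales WHERE event_date = '2016-05-31';
-- Repeated queries in this session are then served from executor memory:
SELECT count(clicks) FROM hot_hour WHERE zipcode = 695506;
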
OK, it is another way of getting the same result. However, my concerns:
   
   - Spark has a wide user base. I judge this from Spark user group traffic
   - The TEZ user group has virtually no traffic, I am afraid
   - LLAP I don't know about
It sounds like Hortonworks promotes TEZ, and Cloudera does not want to know anything about Hive; they promote Impala instead, but that sounds like a sinking ship these days.
Having said that, I will try TEZ + LLAP :) No pun intended.
Regards
Dr Mich Talebzadeh
LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com
On 31 May 2016 at 08:19, Jörn Franke <jo...@gmail.com> wrote:

Thanks, very interesting explanation. Looking forward to testing it.

> On 31 May 2016, at 07:51, Gopal Vijayaraghavan <go...@apache.org> wrote:
>
>
>> That being said, all systems are evolving. Hive supports tez+llap which
>> is basically the in-memory support.
>
> There is a big difference between LLAP & SparkSQL, which has to do
> with access pattern needs.
>
> The first one is related to the lifetime of the cache - the Spark RDD
> cache is per-user-session, which allows further operations in that
> session to be optimized.
>
> LLAP is designed to be hammered by multiple user sessions running
> different queries, and to automate the cache eviction & selection
> process. There's no user visible explicit .cache() to remember - it's
> automatic and concurrent.
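>
> A rough Spark SQL sketch of the explicit, session-scoped cache described
> above (the table name is hypothetical; CACHE TABLE is the SQL analogue of
> the .cache() call):
>
> CACHE TABLE hot_hour AS SELECT * FROM sales WHERE event_date = '2016-05-31';
> -- queries against hot_hour now read from executor memory
> UNCACHE TABLE hot_hour;  -- explicit; the cache dies with the session anyway
>
> LLAP needs no such statements - selection and eviction happen behind the
> query.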
>
> My team works with both engines, trying to improve them for ORC, but the
> goals of both are different.
>
> I will probably have to write a proper academic paper & get it
> edited/reviewed instead of sending my ramblings to the user lists like this.
> Still, this needs an example to talk about.
>
> To give a qualified example, let's leave the world of single-use clusters
> and take the use-case detailed here
>
> http://hortonworks.com/blog/impala-vs-hive-performance-benchmark/
>
>
> There are two distinct problems there - one is that a single day sees up to
> 100k independent user sessions running queries and that most queries cover
> the last hour (& possibly join/compare against a similar hour aggregate
> from the past).
>
> The problem with having 100k independent user-sessions from different
> connections was that the SparkSQL layer drops the RDD lineage & cache
> whenever a user ends a session.
>
> The scale problem in general for Impala was that even though the data size
> was in multiple terabytes, the actual hot data was approx <20Gb, which
> resides on <10 machines with locality.
>
> The same problem applies when you apply RDD caching with something
> un-replicated like Tachyon/Alluxio, since the same RDD will be so
> exceedingly popular that the machines which hold those blocks run extra hot.
>
> A cache model per-user session is entirely wasteful and a common cache +
> MPP model effectively overloads 2-3% of the cluster, while leaving the other
> machines idle.
>
> LLAP was designed specifically to prevent that hotspotting, while
> maintaining the common cache model - within a few minutes after an hour
> ticks over, the whole cluster develops temporal popularity for the hot
> data and nearly every rack has at least one cached copy of the same data
> for availability/performance.
>
> Since data streams tend to be extremely wide tables (Omniture comes to
> mind), the cache actually does not hold all columns in a table; and since
> Zipf distributions are extremely common in these real data sets, the cache
> does not hold all rows either.
>
> select count(clicks) from table where zipcode = 695506;
>
> with ORC data bucketed + *sorted* by zipcode, the cache will hold only the
> 2 columns (clicks & zipcode) of the qualifying row-groups, & all bloomfilter
> indexes for all files will be loaded into memory; anything that misses on
> the bloom will not even feature in the cache.
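>
> A sketch of that kind of layout in HiveQL (table name, column set and
> bucket count are hypothetical, for illustration):
>
> CREATE TABLE clicks_by_zip (clicks BIGINT, impressions BIGINT, zipcode INT)
> CLUSTERED BY (zipcode) SORTED BY (zipcode) INTO 32 BUCKETS
> STORED AS ORC
> TBLPROPERTIES ('orc.bloom.filter.columns' = 'zipcode');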
>
> A subsequent query for
>
> select count(clicks) from table where zipcode = 695586;
>
> will run against the collected indexes, before deciding which files need
> to be loaded into cache.
>
>
> Then again,
>
> select count(clicks)/count(impressions) from table where zipcode = 695586;
>
> will load only impressions out of the table into cache, to add it to the
> columnar cache without producing another complete copy (RDDs are not
> mutable, but LLAP cache is additive).
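>
> For contrast, on the Spark side (table and view names hypothetical), caching
> a wider projection materialises a second, independent copy rather than
> extending the first:
>
> CACHE TABLE c1 AS SELECT zipcode, clicks FROM t;
> CACHE TABLE c2 AS SELECT zipcode, clicks, impressions FROM t;
> -- c2 does not reuse c1's cached zipcode/clicks columns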
>
> The column split cache & index-cache separation allows for this to be
> cheaper than a full rematerialization - both are evicted as they fill up,
> with different priorities.
>
> Following the same vein, LLAP can do a bit of clairvoyant pre-processing,
> with a bit of input from UX patterns observed from Tableau/Microstrategy
> users, to give the impression of being much faster than the engine
> really can be.
>
> An illusion of performance is likely to be indistinguishable from the real
> thing - I'm actually looking for subjects for that experiment :)
>
> Cheers,
> Gopal
>
>




  

Re: Using Spark on Hive with Hive also using Spark as its execution engine

Posted by Jörn Franke <jo...@gmail.com>.
I think LLAP should in the future become a general component, so LLAP + Spark can make sense. I see Tez and Spark not as competitors; they have different purposes. Hive+Tez+LLAP is not the same as Hive+Spark; I think it goes beyond that for interactive queries.
Tez - you should use a distribution (e.g. Hortonworks). Generally I would use a distribution for anything related to performance, testing etc., because doing your own installation is more complex and more difficult to maintain. Performance and also features will be worse if you do not use a distribution. Which one is up to your choice.


Re: Using Spark on Hive with Hive also using Spark as its execution engine

Posted by Mich Talebzadeh <mi...@gmail.com>.
The presentation will go deeper into the topic. Otherwise, some thoughts of
mine. Feel free to comment, criticise :)


   1. I am a member of the Spark, Hive and Tez user groups, plus one or two
   others
   2. Spark is by far the biggest in terms of community interaction
   3. Tez: typically one thread a month
   4. Personally, I started building Tez for Hive from the Tez source and gave
   up as it was not working. This was my own build as opposed to a distro
   5. If Hive says you should use Spark or Tez, then using Spark is a
   perfectly valid choice
   6. If Tez & LLAP offer you what Spark already provides (DAG + in-memory
   caching) under the bonnet, why bother?
   7. Yes, I have seen some test results (Hive on Spark vs Hive on Tez) etc.,
   but they are a bit dated (not being unkind) and cannot be taken as-is
   today. One of their concerns, if I recall, was excessive CPU and memory
   usage of Spark, but by the same token LLAP will add additional need for
   resources
   8. Essentially I am more comfortable using less of a technology stack than
   more. With Hive and Spark (in this context) we have two stacks to look
   after; with Hive, Tez and LLAP, we have three, which adds to skill cost as
   well
   9. Yep. It is still good to keep it simple


My thoughts on this are that if you have a viable open source product like
Spark, which is becoming sort of in vogue in the Big Data space and moving
very fast, why look for another one? Hive does what it says on the tin and
is a good, reliable data warehouse.

HTH

Dr Mich Talebzadeh



LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


