Posted to user@tez.apache.org by Lars Selsaas <la...@thinkbiganalytics.com> on 2014/06/20 17:49:57 UTC

Tez performance on Hive

Hi,

So when you set Tez as the execution engine for Hive, a query takes about
half the time to finish the second time you run it, going from, say, 24
seconds to 12 seconds. But if I keep re-running it, it gets down to about 2
seconds on that same query. The time goes back up to 12 seconds if I wait
too long before the next rerun, or if I make large enough adjustments to
the query.


So I'm working on a blog post about Tez and need to find out why this is
happening. The first speedup seems to be mainly due to hot containers that
keep the information about where to find your data, while the second drop,
down to about 2 seconds, seems to involve some in-memory storage of the
data. Does Tez store the results in memory and keep them ready for the next
run, or is something else going on?



-- 

Lars Selsaas

Data Engineer

Think Big Analytics <http://thinkbiganalytics.com>

lars.selsaas@thinkbiganalytics.com

650-537-5321

Re: Tez performance on Hive

Posted by Hitesh Shah <hi...@apache.org>.
The main config to control how long containers are kept around is "tez.am.container.session.delay-allocation-millis". Setting this to a higher value tells the AM to retain containers for a longer period. Increasing it, though, will have a negative effect on other users of the cluster, as idle resources will be retained by the Tez application.
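
As a sketch (the 5-minute value is just an illustration, and this assumes
your setup passes tez.* settings from the Hive session through to the AM --
otherwise put the same property in tez-site.xml), you could hold idle
containers longer like this:

  -- in the Hive CLI, before the Tez session starts
  -- 300000 ms = 5 minutes; pick a value that fits your cluster
  set tez.am.container.session.delay-allocation-millis=300000;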

— Hitesh




Re: Tez performance on Hive

Posted by Lars Selsaas <la...@thinkbiganalytics.com>.
I'm also wondering which settings I can play around with to affect this.
Say I want to make my jobs keep things around longer.

Thanks,
Lars




-- 

Lars Selsaas

Data Engineer

Think Big Analytics <http://thinkbiganalytics.com>

lars.selsaas@thinkbiganalytics.com

650-537-5321

Re: Tez performance on Hive

Posted by Lars Selsaas <la...@thinkbiganalytics.com>.
Thanks!

Hopefully I'm getting the correct logs here:

It seems the same ApplicationMaster keeps on taking the requests.

They both get the same application ID: application_1403285786962_0002
<http://127.0.0.1:8088/cluster/app/application_1403285786962_0002>

dag_1403285786962_0004_1.dot : Total file length is 2179 bytes.
<http://localhost:8042/node/containerlogs/container_1403285786962_0004_01_000001/root/dag_1403285786962_0004_1.dot/?start=-4096>

dag_1403285786962_0004_2.dot : Total file length is 2179 bytes.
<http://localhost:8042/node/containerlogs/container_1403285786962_0004_01_000001/root/dag_1403285786962_0004_2.dot/?start=-4096>

dag_1403285786962_0004_3.dot : Total file length is 2179 bytes.
<http://localhost:8042/node/containerlogs/container_1403285786962_0004_01_000001/root/dag_1403285786962_0004_3.dot/?start=-4096>

dag_1403285786962_0004_4.dot : Total file length is 2179 bytes.
<http://localhost:8042/node/containerlogs/container_1403285786962_0004_01_000001/root/dag_1403285786962_0004_4.dot/?start=-4096>

stderr : Total file length is 0 bytes.
<http://localhost:8042/node/containerlogs/container_1403285786962_0004_01_000001/root/stderr/?start=-4096>

stderr_dag_1403285786962_0004_1 : Total file length is 0 bytes.
<http://localhost:8042/node/containerlogs/container_1403285786962_0004_01_000001/root/stderr_dag_1403285786962_0004_1/?start=-4096>

stderr_dag_1403285786962_0004_1_post : Total file length is 0 bytes.
<http://localhost:8042/node/containerlogs/container_1403285786962_0004_01_000001/root/stderr_dag_1403285786962_0004_1_post/?start=-4096>

stderr_dag_1403285786962_0004_2 : Total file length is 0 bytes.
<http://localhost:8042/node/containerlogs/container_1403285786962_0004_01_000001/root/stderr_dag_1403285786962_0004_2/?start=-4096>

stderr_dag_1403285786962_0004_2_post : Total file length is 0 bytes.
<http://localhost:8042/node/containerlogs/container_1403285786962_0004_01_000001/root/stderr_dag_1403285786962_0004_2_post/?start=-4096>

stderr_dag_1403285786962_0004_3 : Total file length is 0 bytes.
<http://localhost:8042/node/containerlogs/container_1403285786962_0004_01_000001/root/stderr_dag_1403285786962_0004_3/?start=-4096>

stderr_dag_1403285786962_0004_3_post : Total file length is 0 bytes.
<http://localhost:8042/node/containerlogs/container_1403285786962_0004_01_000001/root/stderr_dag_1403285786962_0004_3_post/?start=-4096>

stderr_dag_1403285786962_0004_4 : Total file length is 0 bytes.
<http://localhost:8042/node/containerlogs/container_1403285786962_0004_01_000001/root/stderr_dag_1403285786962_0004_4/?start=-4096>

stderr_dag_1403285786962_0004_4_post : Total file length is 0 bytes.
<http://localhost:8042/node/containerlogs/container_1403285786962_0004_01_000001/root/stderr_dag_1403285786962_0004_4_post/?start=-4096>

stdout : Total file length is 0 bytes.
<http://localhost:8042/node/containerlogs/container_1403285786962_0004_01_000001/root/stdout/?start=-4096>

stdout_dag_1403285786962_0004_1 : Total file length is 0 bytes.
<http://localhost:8042/node/containerlogs/container_1403285786962_0004_01_000001/root/stdout_dag_1403285786962_0004_1/?start=-4096>

stdout_dag_1403285786962_0004_1_post : Total file length is 0 bytes.
<http://localhost:8042/node/containerlogs/container_1403285786962_0004_01_000001/root/stdout_dag_1403285786962_0004_1_post/?start=-4096>

stdout_dag_1403285786962_0004_2 : Total file length is 0 bytes.
<http://localhost:8042/node/containerlogs/container_1403285786962_0004_01_000001/root/stdout_dag_1403285786962_0004_2/?start=-4096>

stdout_dag_1403285786962_0004_2_post : Total file length is 0 bytes.
<http://localhost:8042/node/containerlogs/container_1403285786962_0004_01_000001/root/stdout_dag_1403285786962_0004_2_post/?start=-4096>

stdout_dag_1403285786962_0004_3 : Total file length is 0 bytes.
<http://localhost:8042/node/containerlogs/container_1403285786962_0004_01_000001/root/stdout_dag_1403285786962_0004_3/?start=-4096>

stdout_dag_1403285786962_0004_3_post : Total file length is 0 bytes.
<http://localhost:8042/node/containerlogs/container_1403285786962_0004_01_000001/root/stdout_dag_1403285786962_0004_3_post/?start=-4096>

stdout_dag_1403285786962_0004_4 : Total file length is 0 bytes.
<http://localhost:8042/node/containerlogs/container_1403285786962_0004_01_000001/root/stdout_dag_1403285786962_0004_4/?start=-4096>

stdout_dag_1403285786962_0004_4_post : Total file length is 0 bytes.
<http://localhost:8042/node/containerlogs/container_1403285786962_0004_01_000001/root/stdout_dag_1403285786962_0004_4_post/?start=-4096>

syslog : Total file length is 7577 bytes.
<http://localhost:8042/node/containerlogs/container_1403285786962_0004_01_000001/root/syslog/?start=-4096>

syslog_dag_1403285786962_0004_1 : Total file length is 57034 bytes.
<http://localhost:8042/node/containerlogs/container_1403285786962_0004_01_000001/root/syslog_dag_1403285786962_0004_1/?start=-4096>

syslog_dag_1403285786962_0004_1_post : Total file length is 4775 bytes.
<http://localhost:8042/node/containerlogs/container_1403285786962_0004_01_000001/root/syslog_dag_1403285786962_0004_1_post/?start=-4096>

syslog_dag_1403285786962_0004_2 : Total file length is 56104 bytes.
<http://localhost:8042/node/containerlogs/container_1403285786962_0004_01_000001/root/syslog_dag_1403285786962_0004_2/?start=-4096>

syslog_dag_1403285786962_0004_2_post : Total file length is 707 bytes.
<http://localhost:8042/node/containerlogs/container_1403285786962_0004_01_000001/root/syslog_dag_1403285786962_0004_2_post/?start=-4096>

syslog_dag_1403285786962_0004_3 : Total file length is 53187 bytes.
<http://localhost:8042/node/containerlogs/container_1403285786962_0004_01_000001/root/syslog_dag_1403285786962_0004_3/?start=-4096>

syslog_dag_1403285786962_0004_3_post : Total file length is 5003 bytes.
<http://localhost:8042/node/containerlogs/container_1403285786962_0004_01_000001/root/syslog_dag_1403285786962_0004_3_post/?start=-4096>

syslog_dag_1403285786962_0004_4 : Total file length is 56111 bytes.
<http://localhost:8042/node/containerlogs/container_1403285786962_0004_01_000001/root/syslog_dag_1403285786962_0004_4/?start=-4096>

syslog_dag_1403285786962_0004_4_post : Total file length is 4204 bytes.
<http://localhost:8042/node/containerlogs/container_1403285786962_0004_01_000001/root/syslog_dag_1403285786962_0004_4_post/?start=-4096>

fast run

Map 1        1    734 Bytes    438 Bytes     639 ms
Map 2        1    245 KB       478 Bytes    1.34 secs
Reducer 3    1    446 Bytes    557 Bytes    3.63 secs


slow run

Map 1        1    734 Bytes    438 Bytes    12.62 secs
Map 2        1    245 KB       478 Bytes    14.37 secs
Reducer 3    1    446 Bytes    557 Bytes    15.67 secs





-- 

Lars Selsaas

Data Engineer

Think Big Analytics <http://thinkbiganalytics.com>

lars.selsaas@thinkbiganalytics.com

650-537-5321

Re: Tez performance on Hive

Posted by Hitesh Shah <hi...@apache.org>.
Hello Lars, 

Just to be very clear - there is no caching of results/data across queries except for some minimal meta-data caching for ORC. If you can send across the logs generated by “yarn logs -applicationId <appId>”, we can try and help you get a better understanding of where the speed difference is stemming from. 
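
For reference, with the application ID mentioned earlier in this thread,
that would be something like:

  # log aggregation must be enabled for a finished app's logs to be available
  yarn logs -applicationId application_1403285786962_0002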

— Hitesh


RE: Tez performance on Hive

Posted by Bikas Saha <bi...@hortonworks.com>.
Hi,



Thanks for your interest in trying out Hive on Tez. There are multiple
reasons for the observations you describe.

1)      Containers are warmed up the longer they get used. So if you
repeatedly run queries, then the JVM has all classes loaded and ready and
may have JIT-ed the frequently run code paths. As it learns more about your
execution pattern, the JIT can do a better job. This will help you across
different queries.

2)      As you frequently access the same data, you increase the chances of
finding that data in the OS buffer cache, so you get the benefits of
in-memory data. This will help repeated runs of queries on the same data (a
quick way to test this effect is sketched after this list).

3)      Hive is smart about explicitly caching de-serialized data (Java
objects) within a query in order to reduce re-computation of work that has
already been done. This will help within a query.

4)      If you are using the ORC file format, then Hive will try to cache
ORC file metadata such as locations/sizes, and this helps different queries
that access the same data.

5)      If your Tez query session has been idle for some time, then the
system starts proactively releasing resources back to the cluster so that
they may be used by other applications (good for multi-tenancy). So if you
fire a query after some delay, a slowdown will be observed in case we need
to reclaim some of the released resources. This delay is configurable (see
the tez.am.container.session.delay-allocation-millis setting mentioned
elsewhere in this thread).



Hope this helps and you have a positive experience experimenting with Hive
on Tez.

Please let us know how we can help!

Bikas


