Posted to user@tez.apache.org by Nikolaos Tsipas <ni...@gmail.com> on 2018/06/28 17:50:43 UTC
Hive not always triggering a TEZ/Yarn job
Hi,
I'm using Tez with Hive to query data on S3 and I notice the following two
cases.
*Case A*
When the query covers a smaller amount of data, a TEZ job (yarn
application) is not created:
select dt from my_db_schema.my_table where dt in
('2018-03-10','2018-03-09') and header ='xxx';
The output in the above case is:
OK
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
2018-03-10
2018-03-10
2018-03-09
2018-03-09
Time taken: 7.043 seconds, Fetched: 4 row(s)
*Case B*
When the query is scanning more data
select dt from my_db_schema.my_table where header ='xxx';
then the output is as follows and I can see a TEZ job logged in the TEZ ui
and in yarn.
----------------------------------------------------------------------------------------------
VERTICES          MODE       STATUS     TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
----------------------------------------------------------------------------------------------
Map 1 ..........  container  SUCCEEDED     22         22        0        0       0       0
----------------------------------------------------------------------------------------------
VERTICES: 01/01 [==========================>>] 100% ELAPSED TIME: 38.12 s
----------------------------------------------------------------------------------------------
OK
2018-03-05
2018-03-05
2018-03-06
2018-03-06
2018-03-07
2018-03-07
2018-03-08
2018-03-08
2018-03-09
2018-03-09
2018-03-10
2018-03-10
2018-03-25
2018-03-25
2018-03-26
2018-03-26
2018-03-28
2018-03-28
2018-05-09
2018-05-09
2018-05-10
2018-05-10
Time taken: 47.197 seconds, Fetched: 22 row(s)
The problem in case A is that Hive sometimes decides not to trigger a TEZ
job, and the query then takes a long time to complete. In this case the
worker nodes are not utilised at all; only the master node executes the
query.
Is there a way to force Hive to always trigger a TEZ job?
Re: Hive not always triggering a TEZ/Yarn job
Posted by Prasanth Jayachandran <pj...@hortonworks.com>.
Hi Nikolaos
+user@hive list
Hive does not run a Tez job here because of the fetch task optimization, which, for a specific set of queries, fetches the data directly and runs it through the operator pipeline.
If you want to disable it fully, try: set hive.fetch.task.conversion=none
If you want it to apply only to much smaller data sizes, lower the value of hive.fetch.task.conversion.threshold.
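As a rough sketch of the two options above, applied per session from the Hive CLI or Beeline: the query is Case A from the original post, and the threshold value shown is only an illustrative assumption (the setting is expressed in bytes), not a recommended number.

```sql
-- Print the current values before changing anything.
set hive.fetch.task.conversion;
set hive.fetch.task.conversion.threshold;

-- Option 1: disable fetch task conversion for this session, so the
-- query is expected to be compiled into a Tez job.
set hive.fetch.task.conversion=none;

-- Option 2 (alternative): keep the optimization, but only for small inputs.
-- 1048576 bytes (1 MB) is an illustrative value; tune it for your data.
-- set hive.fetch.task.conversion.threshold=1048576;

-- Case A query from the original post.
select dt from my_db_schema.my_table
where dt in ('2018-03-10','2018-03-09') and header ='xxx';
```

Running the same query with EXPLAIN before and after the change should show whether the plan still consists of a fetch stage only or now includes a Tez stage.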
Thanks
Prasanth