Posted to user@tez.apache.org by Nikolaos Tsipas <ni...@gmail.com> on 2018/06/28 17:50:43 UTC

Hive not always triggering a TEZ/Yarn job

Hi,

I'm using Tez with Hive to query data on S3 and I notice the following two
cases.

*Case A*

When the query covers a smaller amount of data, a Tez job (YARN
application) is not created:

select dt from my_db_schema.my_table where dt in
('2018-03-10','2018-03-09') and header ='xxx';

The output in the above case is:

OK
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
2018-03-10
2018-03-10
2018-03-09
2018-03-09
Time taken: 7.043 seconds, Fetched: 4 row(s)


*Case B*

When the query scans more data:

select dt from my_db_schema.my_table where header ='xxx';

the output is as follows, and I can see a Tez job logged in the Tez UI
and in YARN.

----------------------------------------------------------------------------------------------
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
----------------------------------------------------------------------------------------------
Map 1 .......... container     SUCCEEDED     22         22        0        0       0       0
----------------------------------------------------------------------------------------------
VERTICES: 01/01  [==========================>>] 100%  ELAPSED TIME: 38.12 s
----------------------------------------------------------------------------------------------
OK
2018-03-05
2018-03-05
2018-03-06
2018-03-06
2018-03-07
2018-03-07
2018-03-08
2018-03-08
2018-03-09
2018-03-09
2018-03-10
2018-03-10
2018-03-25
2018-03-25
2018-03-26
2018-03-26
2018-03-28
2018-03-28
2018-05-09
2018-05-09
2018-05-10
2018-05-10
Time taken: 47.197 seconds, Fetched: 22 row(s)

The problem in case A is that Hive sometimes decides not to trigger a Tez
job, and the query takes a long time to complete. In this case the worker
nodes are not utilised at all; only the master node executes the query.

Is there a way to force Hive to always trigger a TEZ job?

Re: Hive not always triggering a TEZ/Yarn job

Posted by Prasanth Jayachandran <pj...@hortonworks.com>.

Hi Nikolaos

+user@hive list

Hive does not run a Tez job here because of the fetch task optimization, which, for a specific set of queries, fetches the data directly and runs it through the operator pipeline.

If you want to disable it completely, try "set hive.fetch.task.conversion=none".

If you want a Tez job to be triggered even for much smaller data sizes, lower the value of hive.fetch.task.conversion.threshold.
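
As a rough sketch, the two settings could be applied per session like this before running the Case A query. The threshold is an input size in bytes, and the 1048576 value below is only an illustrative assumption, not a recommended setting; pick one of the two options rather than both.

```sql
-- Print the current value of the setting.
set hive.fetch.task.conversion;

-- Option 1: disable fetch task conversion for this session,
-- so that even small queries are run as a Tez job.
set hive.fetch.task.conversion=none;

-- Option 2 (alternative to option 1): keep the optimization but lower the
-- input-size threshold in bytes. 1048576 (1 MB) is an illustrative value only.
-- set hive.fetch.task.conversion.threshold=1048576;

-- Case A query from the original post.
select dt from my_db_schema.my_table
where dt in ('2018-03-10','2018-03-09') and header = 'xxx';
```

Settings applied with "set" last only for the current session; to make them permanent they would have to go into hive-site.xml.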

Thanks
Prasanth

