You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by "r7raul1984@163.com" <r7...@163.com> on 2015/05/06 02:50:23 UTC
hive sql on tez run forever
I change the sql where condition to (where t.update_time >= '2015-05-04') , the sql can return result for a while. Because t.update_time >= '2015-05-04' can filter many row when table scan. But why change where condition to (where t.update_time >= '2015-05-04' or length(t8.end_user_id)>0) ,the sql run forever as follows:
Status: Running (Executing on YARN cluster with App id application_1419300485749_1419769)
--------------------------------------------------------------------------------
VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
--------------------------------------------------------------------------------
Map 1 .......... SUCCEEDED 1 1 0 0 0 0
Map 10 ......... SUCCEEDED 3 3 0 0 0 0
Map 11 ......... SUCCEEDED 151 151 0 0 0 0
Map 12 ......... SUCCEEDED 1 1 0 0 0 0
Map 13 ......... SUCCEEDED 76 76 0 0 0 0
Map 5 .......... SUCCEEDED 11 11 0 0 0 0
Map 7 .......... SUCCEEDED 156 156 0 0 0 0
Map 9 .......... SUCCEEDED 10 10 0 0 0 0
Reducer 2 ...... SUCCEEDED 1 1 0 0 0 0
Reducer 3 ..... RUNNING 642 641 1 0 0 0
Reducer 4 RUNNING 1009 0 89 920 0 0
Reducer 6 ...... SUCCEEDED 3 3 0 0 0 0
Reducer 8 ...... SUCCEEDED 203 203 0 0 0 0
--------------------------------------------------------------------------------
VERTICES: 11/13 [==============>>------------] 55% ELAPSED TIME: 307.54 s
What is the root cause ?
r7raul1984@163.com
RE: hive sql on tez run forever
Posted by Mich Talebzadeh <mi...@peridale.co.uk>.
The other option is to try UNION ALL or UNION depending on the nature of
the result set
SELECT rs.col1, rs,col2 ,
FROM
(
SELECT t.col1, t.col2, ..
FROM t WHERE t.update_time > '2015-05-04'
UNION ALL
SELECT t8.col1, t8.col2,..
FROM t8 WHERE length(t8.end_user_id) > 0
) rs
This may work.
HTH
Mich Talebzadeh
http://talebzadehmich.wordpress.com
Author of the books "A Practitioners Guide to Upgrading to Sybase ASE 15",
ISBN 978-0-9563693-0-7.
co-author "Sybase Transact SQL Guidelines Best Practices", ISBN
978-0-9759693-0-4
Publications due shortly:
Creating in-memory Data Grid for Trading Systems with Oracle TimesTen and
Coherence Cache
Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume
one out shortly
NOTE: The information in this email is proprietary and confidential. This
message is for the designated recipient only, if you are not the intended
recipient, you should destroy it immediately. Any information in this
message shall not be understood as given or endorsed by Peridale Ltd, its
subsidiaries or their employees, unless expressly so stated. It is the
responsibility of the recipient to ensure that this email is virus free,
therefore neither Peridale Ltd, its subsidiaries nor their employees accept
any responsibility.
-----Original Message-----
From: Gopal Vijayaraghavan [mailto:gopal@hortonworks.com] On Behalf Of Gopal
Vijayaraghavan
Sent: 11 May 2015 18:14
To: user
Cc: r7raul1984@163.com
Subject: Re: hive sql on tez run forever
Hi,
> I change the sql where condition to (where t.update_time >=
>'2015-05-04') , the sql can return result for a while. Because
>t.update_time
> >= '2015-05-04' can filter many row when table scan. But why change
>where condition to
> (where t.update_time >= '2015-05-04' or length(t8.end_user_id)>0) ,the
>sql run forever as follows:
The OR clause is probably causing the problems.
We¹re probably not pushing down the OR clauses down to the original table
scans.
This is most likely a hive PPD miss where you do something like
select a.*,b.* from a,b where a.x = b.x and (a.y = 1 or b.z = 1);
where it doesn¹t get planned as
select a1.*, b1.* from (select a.* from a where a.y=1) a1, (select b.* from
b where b.z = 1) b1 where a1.x = b1.x;
instead gets planned as a full-scan JOIN, then a filter.
Can you spend some time and try to rewrite down your case to something like
the above queries?
If that works, then file a JIRA.
Cheers,
Gopal
Re: hive sql on tez run forever
Posted by Gopal Vijayaraghavan <go...@apache.org>.
Hi,
You’re correct - that is not a valid rewrite.
Both tables have to be shuffled across due to the OR clause with no
reductions.
Cheers,
Gopal
On 5/11/15, 10:43 AM, "Eugene Koifman" <ek...@hortonworks.com> wrote:
>This isn’t a valid rewrite.
>if a(x,y) has 1 row (1,2) and b(x,z) has 1 row (1,1) then the 1st query
>will produce 1 row
>but the 2nd query with subselects will not.
>
>On 5/11/15, 10:13 AM, "Gopal Vijayaraghavan" <go...@apache.org> wrote:
>
>>Hi,
>>
>>> I change the sql where condition to (where t.update_time >=
>>>'2015-05-04') , the sql can return result for a while. Because
>>>t.update_time
>>> >= '2015-05-04' can filter many row when table scan. But why change
>>>where condition to
>>> (where t.update_time >= '2015-05-04' or length(t8.end_user_id)>0) ,the
>>>sql run forever as follows:
>>
>>
>>The OR clause is probably causing the problems.
>>
>>We¹re probably not pushing down the OR clauses down to the original table
>>scans.
>>
>>This is most likely a hive PPD miss where you do something like
>>
>>select a.*,b.* from a,b where a.x = b.x and (a.y = 1 or b.z = 1);
>>
>>where it doesn¹t get planned as
>>
>>select a1.*, b1.* from (select a.* from a where a.y=1) a1, (select b.*
>>from b where b.z = 1) b1 where a1.x = b1.x;
>>
>>instead gets planned as a full-scan JOIN, then a filter.
>>
>>Can you spend some time and try to rewrite down your case to something
>>like the above queries?
>>
>>If that works, then file a JIRA.
>>
>>Cheers,
>>Gopal
>>
>>
>
Re: Re: hive sql on tez run forever
Posted by "r7raul1984@163.com" <r7...@163.com>.
I see only 1 reduce run forerver. Skew join?
r7raul1984@163.com
From: Eugene Koifman
Date: 2015-05-12 01:43
To: user
CC: r7raul1984@163.com
Subject: Re: hive sql on tez run forever
This isn’t a valid rewrite.
if a(x,y) has 1 row (1,2) and b(x,z) has 1 row (1,1) then the 1st query
will produce 1 row
but the 2nd query with subselects will not.
On 5/11/15, 10:13 AM, "Gopal Vijayaraghavan" <go...@apache.org> wrote:
>Hi,
>
>> I change the sql where condition to (where t.update_time >=
>>'2015-05-04') , the sql can return result for a while. Because
>>t.update_time
>> >= '2015-05-04' can filter many row when table scan. But why change
>>where condition to
>> (where t.update_time >= '2015-05-04' or length(t8.end_user_id)>0) ,the
>>sql run forever as follows:
>
>
>The OR clause is probably causing the problems.
>
>We¹re probably not pushing down the OR clauses down to the original table
>scans.
>
>This is most likely a hive PPD miss where you do something like
>
>select a.*,b.* from a,b where a.x = b.x and (a.y = 1 or b.z = 1);
>
>where it doesn¹t get planned as
>
>select a1.*, b1.* from (select a.* from a where a.y=1) a1, (select b.*
>from b where b.z = 1) b1 where a1.x = b1.x;
>
>instead gets planned as a full-scan JOIN, then a filter.
>
>Can you spend some time and try to rewrite down your case to something
>like the above queries?
>
>If that works, then file a JIRA.
>
>Cheers,
>Gopal
>
>
Re: hive sql on tez run forever
Posted by Eugene Koifman <ek...@hortonworks.com>.
This isn’t a valid rewrite.
if a(x,y) has 1 row (1,2) and b(x,z) has 1 row (1,1) then the 1st query
will produce 1 row
but the 2nd query with subselects will not.
On 5/11/15, 10:13 AM, "Gopal Vijayaraghavan" <go...@apache.org> wrote:
>Hi,
>
>> I change the sql where condition to (where t.update_time >=
>>'2015-05-04') , the sql can return result for a while. Because
>>t.update_time
>> >= '2015-05-04' can filter many row when table scan. But why change
>>where condition to
>> (where t.update_time >= '2015-05-04' or length(t8.end_user_id)>0) ,the
>>sql run forever as follows:
>
>
>The OR clause is probably causing the problems.
>
>We¹re probably not pushing down the OR clauses down to the original table
>scans.
>
>This is most likely a hive PPD miss where you do something like
>
>select a.*,b.* from a,b where a.x = b.x and (a.y = 1 or b.z = 1);
>
>where it doesn¹t get planned as
>
>select a1.*, b1.* from (select a.* from a where a.y=1) a1, (select b.*
>from b where b.z = 1) b1 where a1.x = b1.x;
>
>instead gets planned as a full-scan JOIN, then a filter.
>
>Can you spend some time and try to rewrite down your case to something
>like the above queries?
>
>If that works, then file a JIRA.
>
>Cheers,
>Gopal
>
>
Re: hive sql on tez run forever
Posted by Gopal Vijayaraghavan <go...@apache.org>.
Hi,
> I change the sql where condition to (where t.update_time >=
>'2015-05-04') , the sql can return result for a while. Because
>t.update_time
> >= '2015-05-04' can filter many row when table scan. But why change
>where condition to
> (where t.update_time >= '2015-05-04' or length(t8.end_user_id)>0) ,the
>sql run forever as follows:
The OR clause is probably causing the problems.
We¹re probably not pushing down the OR clauses down to the original table
scans.
This is most likely a hive PPD miss where you do something like
select a.*,b.* from a,b where a.x = b.x and (a.y = 1 or b.z = 1);
where it doesn¹t get planned as
select a1.*, b1.* from (select a.* from a where a.y=1) a1, (select b.*
from b where b.z = 1) b1 where a1.x = b1.x;
instead gets planned as a full-scan JOIN, then a filter.
Can you spend some time and try to rewrite down your case to something
like the above queries?
If that works, then file a JIRA.
Cheers,
Gopal