Posted to user@hive.apache.org by Himanish Kushary <hi...@gmail.com> on 2012/08/24 16:36:09 UTC

Long running Join Query - Reduce task fails due to failing to report status

Hi,

We have a complex query involving several left outer joins that results in
8 M/R jobs in Hive. During execution of one of the stages (after three M/R
jobs have run), the job fails because a few Reduce tasks are killed for
inactivity.

Most of the reduce tasks finish fine (within 3 minutes), but the last one
gets stuck for a long time (> 1 hour) and, after several attempts, is
finally killed with "failed to report status for 600 seconds. Killing!"

What may be causing this issue? Would hive.script.auto.progress help in
this case? Since we cannot get much information from the log files, how
should we approach resolving this? Would tweaking any specific M/R
parameters help?
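A sketch of per-query settings commonly tried for this symptom; the
parameter names are the stock Hadoop 1.x/Hive ones of this era, and the
values are illustrative, not tuned:

```sql
-- Raise the task inactivity timeout past the default 600000 ms (600 s).
SET mapred.task.timeout=1800000;

-- Spread reduce-side work across more reducers (input bytes per reducer).
SET hive.exec.reducers.bytes.per.reducer=268435456;

-- Let Hive handle heavily skewed join keys in a follow-up job.
SET hive.optimize.skewjoin=true;
SET hive.skewjoin.key=100000;
```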

The task attempt log shows several lines like this before exiting:

2012-08-23 19:17:23,848 INFO ExecReducer: ExecReducer: processing 219000000 rows: used memory = 408582240
2012-08-23 19:17:30,189 INFO ExecReducer: ExecReducer: processing 220000000 rows: used memory = 346110400
2012-08-23 19:17:37,510 INFO ExecReducer: ExecReducer: processing 221000000 rows: used memory = 583913576
2012-08-23 19:17:44,829 INFO ExecReducer: ExecReducer: processing 222000000 rows: used memory = 513071504
2012-08-23 19:17:47,923 INFO org.apache.hadoop.mapred.FileInputFormat: Total input paths to process : 1

Here are the reduce task counters:

Map-Reduce Framework
  Combine input records                    0
  Combine output records                   0
  Reduce input groups                      222,480,335
  Reduce shuffle bytes                     7,726,141,897
  Reduce input records                     222,480,335
  Reduce output records                    0
  Spilled Records                          355,827,191
  CPU time spent (ms)                      2,152,160
  Physical memory (bytes) snapshot         1,182,490,624
  Virtual memory (bytes) snapshot          1,694,531,584
  Total committed heap usage (bytes)       990,052,352

The tasktracker log gives a thread dump at that time but no exception.

2012-08-23 20:05:49,319 INFO org.apache.hadoop.mapred.TaskTracker: Process Thread Dump: lost task
69 active threads

---------------------------
Thanks & Regards
Himanish

Re: Long running Join Query - Reduce task fails due to failing to report status

Posted by bharath vissapragada <bh...@students.iiit.ac.in>.
My two cents:

Check whether the input to that reducer is skewed compared to the other
reducers. This sometimes happens with joins, where a few reducers receive a
large amount of input data and keep running forever.
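One way to check for such skew is to count rows per join key; a sketch,
with my_table and join_key as placeholders for the real table and join
column:

```sql
-- Top 20 join keys by row count; a handful of huge keys indicates skew.
SELECT join_key, COUNT(*) AS cnt
FROM my_table
GROUP BY join_key
ORDER BY cnt DESC
LIMIT 20;
```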





-- 
Bharath Vissapragada,
4th Year undergraduate,
IIIT Hyderabad.
w: http://researchweb.iiit.ac.in/~bharath.v

Re: Long running Join Query - Reduce task fails due to failing to report status

Posted by Bertrand Dechoux <de...@gmail.com>.
It is not clear from your post: does your job always fail during the same
step? Or only sometimes? Or only once?
Since it's a Hive query, I would modify it to find the root cause.

First, create temporary "files" holding the results of the first three M/R
jobs.
Then run the fourth M/R job on them, and try filtering the data to see
whether the problem is related to the volume or the format.
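The staging approach above can be sketched as follows; every table and
column name here is a placeholder for the real query's:

```sql
-- Materialize the result of the first three M/R stages.
CREATE TABLE tmp_stage3 AS
SELECT a.id, a.v1, b.v2
FROM a
LEFT OUTER JOIN b ON a.id = b.id;

-- Re-run the failing stage against the staged data, filtering to test
-- whether sheer volume or a particular slice of the data triggers the hang.
SELECT s.id, c.v3
FROM tmp_stage3 s
LEFT OUTER JOIN c ON s.id = c.id
WHERE pmod(hash(s.id), 10) = 0;  -- example filter: roughly a 10% sample of keys
```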

Regards

Bertrand


-- 
Bertrand Dechoux

Re: Long running Join Query - Reduce task fails due to failing to report status

Posted by Igor Tatarinov <ig...@decide.com>.
Why don't you try splitting the big query into smaller ones?



Re: Long running Join Query - Reduce task fails due to failing to report status

Posted by Tim Havens <ti...@gmail.com>.
Just curious whether you've tried using Hive's EXPLAIN to see what it
thinks of your query.
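For example (the SELECT below is a stand-in for the real query):

```sql
-- EXPLAIN prints the stage plan; EXPLAIN EXTENDED adds operator detail.
EXPLAIN EXTENDED
SELECT a.k, b.v
FROM a
LEFT OUTER JOIN b ON a.k = b.k;
```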





-- 
"The whole world is you. Yet you keep thinking there is something else." -
Xuefeng Yicun 822-902 A.D.

Tim R. Havens
Google Phone: 573.454.1232
ICQ: 495992798
ICBM:  37°51'34.79"N   90°35'24.35"W
ham radio callsign: NW0W