Posted to user@hive.apache.org by Daning Wang <da...@netseer.com> on 2013/03/06 21:08:21 UTC

Hadoop cluster hangs on big hive job

We have a 5-node cluster (Hadoop 1.0.4). It hung a couple of times while
running big Hive jobs (hive-0.8.1). Basically all the nodes are dead, and from
the TaskTracker log it looks like it went into some kind of endless loop.

All the log entries look like this when the problem happens.

Any idea how to debug the issue?

Thanks in advance.


2013-03-05 15:13:19,526 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201302270947_0010_r_000012_0 0.131468% reduce > copy (19706 of
49964 at 0.00 MB/s) >
2013-03-05 15:13:19,552 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201302270947_0010_r_000028_0 0.131468% reduce > copy (19706 of
49964 at 0.00 MB/s) >
2013-03-05 15:13:20,858 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201302270947_0010_r_000036_0 0.131468% reduce > copy (19706 of
49964 at 0.00 MB/s) >
2013-03-05 15:13:21,141 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201302270947_0010_r_000016_0 0.131468% reduce > copy (19706 of
49964 at 0.00 MB/s) >
2013-03-05 15:13:21,486 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201302270947_0010_r_000019_0 0.131468% reduce > copy (19706 of
49964 at 0.00 MB/s) >
2013-03-05 15:13:21,692 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201302270947_0010_r_000039_0 0.131468% reduce > copy (19706 of
49964 at 0.00 MB/s) >
2013-03-05 15:13:22,448 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201302270947_0010_r_000032_0 0.131468% reduce > copy (19706 of
49964 at 0.00 MB/s) >
2013-03-05 15:13:22,643 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201302270947_0010_r_000000_0 0.131468% reduce > copy (19706 of
49964 at 0.00 MB/s) >
2013-03-05 15:13:22,840 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201302270947_0010_r_000024_0 0.131468% reduce > copy (19706 of
49964 at 0.00 MB/s) >
2013-03-05 15:13:24,628 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201302270947_0010_r_000008_0 0.131468% reduce > copy (19706 of
49964 at 0.00 MB/s) >
2013-03-05 15:13:24,723 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201302270947_0010_r_000039_0 0.131468% reduce > copy (19706 of
49964 at 0.00 MB/s) >
2013-03-05 15:13:25,336 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201302270947_0010_r_000004_0 0.131468% reduce > copy (19706 of
49964 at 0.00 MB/s) >
2013-03-05 15:13:25,539 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201302270947_0010_r_000043_0 0.131468% reduce > copy (19706 of
49964 at 0.00 MB/s) >
2013-03-05 15:13:25,545 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201302270947_0010_r_000012_0 0.131468% reduce > copy (19706 of
49964 at 0.00 MB/s) >
2013-03-05 15:13:25,569 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201302270947_0010_r_000028_0 0.131468% reduce > copy (19706 of
49964 at 0.00 MB/s) >
2013-03-05 15:13:25,855 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201302270947_0010_r_000024_0 0.131468% reduce > copy (19706 of
49964 at 0.00 MB/s) >
2013-03-05 15:13:26,876 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201302270947_0010_r_000036_0 0.131468% reduce > copy (19706 of
49964 at 0.00 MB/s) >
2013-03-05 15:13:27,159 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201302270947_0010_r_000016_0 0.131468% reduce > copy (19706 of
49964 at 0.00 MB/s) >
2013-03-05 15:13:27,505 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201302270947_0010_r_000019_0 0.131468% reduce > copy (19706 of
49964 at 0.00 MB/s) >
2013-03-05 15:13:28,464 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201302270947_0010_r_000032_0 0.131468% reduce > copy (19706 of
49964 at 0.00 MB/s) >
2013-03-05 15:13:28,553 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201302270947_0010_r_000043_0 0.131468% reduce > copy (19706 of
49964 at 0.00 MB/s) >
2013-03-05 15:13:28,561 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201302270947_0010_r_000012_0 0.131468% reduce > copy (19706 of
49964 at 0.00 MB/s) >
2013-03-05 15:13:28,659 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201302270947_0010_r_000000_0 0.131468% reduce > copy (19706 of
49964 at 0.00 MB/s) >
2013-03-05 15:13:30,519 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201302270947_0010_r_000019_0 0.131468% reduce > copy (19706 of
49964 at 0.00 MB/s) >
2013-03-05 15:13:30,644 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201302270947_0010_r_000008_0 0.131468% reduce > copy (19706 of
49964 at 0.00 MB/s) >
2013-03-05 15:13:30,741 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201302270947_0010_r_000039_0 0.131468% reduce > copy (19706 of
49964 at 0.00 MB/s) >
2013-03-05 15:13:31,369 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201302270947_0010_r_000004_0 0.131468% reduce > copy (19706 of
49964 at 0.00 MB/s) >
2013-03-05 15:13:31,675 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201302270947_0010_r_000000_0 0.131468% reduce > copy (19706 of
49964 at 0.00 MB/s) >
2013-03-05 15:13:31,875 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201302270947_0010_r_000024_0 0.131468% reduce > copy (19706 of
49964 at 0.00 MB/s) >
2013-03-05 15:13:32,372 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201302270947_0010_r_000028_0 0.131468% reduce > copy (19706 of
49964 at 0.00 MB/s) >
2013-03-05 15:13:32,893 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201302270947_0010_r_000036_0 0.131468% reduce > copy (19706 of
49964 at 0.00 MB/s) >
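
One way to start debugging a log like the one above is to check which reduce attempts keep reporting the same copy counter at 0.00 MB/s. The following is only a rough sketch: it assumes each log entry sits on a single line in the TaskTracker log file (the entries above were wrapped by the mail client), and the function name and the min_reports parameter are made up for illustration.

```python
import re
from collections import defaultdict

# Matches TaskTracker progress lines in the format shown in the post.
PROGRESS_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} INFO "
    r"org\.apache\.hadoop\.mapred\.TaskTracker:\s+"
    r"(?P<attempt>attempt_\d+_\d+_r_\d+_\d+)\s+"
    r"(?P<pct>[\d.]+)% reduce > copy \((?P<done>\d+) of (?P<total>\d+) "
    r"at (?P<rate>[\d.]+) MB/s\)"
)


def find_stalled_attempts(lines, min_reports=2):
    """Return reduce attempts that reported at least `min_reports` times,
    always with the same copy counter and a transfer rate of 0 MB/s."""
    reports_by_attempt = defaultdict(list)
    for line in lines:
        match = PROGRESS_RE.match(line)
        if match:
            reports_by_attempt[match["attempt"]].append(
                (int(match["done"]), float(match["rate"]))
            )

    stalled = []
    for attempt, reports in reports_by_attempt.items():
        copy_counts = {done for done, _ in reports}
        all_idle = all(rate == 0.0 for _, rate in reports)
        if len(reports) >= min_reports and len(copy_counts) == 1 and all_idle:
            stalled.append(attempt)
    return sorted(stalled)
```

Feeding it the lines of a TaskTracker log (for example `find_stalled_attempts(open(path))`, where `path` is wherever the log lives on that node) gives a list of attempts stuck in the copy phase, which narrows down which reducers and map outputs to look at next.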

RE: Hadoop cluster hangs on big hive job

Posted by Chalcy Raja <Ch...@careerbuilder.com>.
In my case, it was not a bug. The temp data was filling up the data space, so the cluster appeared to hang, but the last reducer job was still running and trying to move data. Once there is absolutely no space left for data, the cluster goes into safe mode and hangs. In my case it did not get to the point of hanging completely. I terminated the query and broke it down so that the final table is partitioned, and that worked fine.
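
A quick way to test the disk-space explanation above is to check free space on each node's local and data directories while the job runs. This is a minimal sketch, not a tool from the thread: the function name and the 10% default threshold are arbitrary, and the directories to pass in are whatever mapred.local.dir and dfs.data.dir point to on your nodes.

```python
import shutil


def low_space_dirs(paths, min_free_fraction=0.10):
    """Return (path, free_fraction) pairs for directories whose file system
    has less than `min_free_fraction` of its capacity free."""
    flagged = []
    for path in paths:
        usage = shutil.disk_usage(path)
        if usage.total == 0:
            # Some pseudo file systems report zero capacity; skip them.
            continue
        free_fraction = usage.free / usage.total
        if free_fraction < min_free_fraction:
            flagged.append((path, free_fraction))
    return flagged


# Example call with hypothetical paths (replace with your own settings):
# low_space_dirs(["/data/mapred/local", "/data/dfs/data"])
```

If any directory shows up in the result while the reducers sit at 0.00 MB/s, running out of space is a likely cause. Whether the NameNode has already entered safe mode can be checked with `hadoop dfsadmin -safemode get`.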

If you provide your Hive query and give more information about your cluster size and the size of the data you are running the query against, I can analyze your issue and maybe provide a solution.

Thanks,
Chalcy
________________________________
From: Daning Wang [daning@netseer.com]
Sent: Wednesday, March 06, 2013 4:17 PM
To: user@hive.apache.org
Subject: Re: Hadoop cluster hangs on big hive job

Thanks Chalcy! But the Hadoop cluster should not hang in any case. Is that a bug?

On Wed, Mar 6, 2013 at 12:33 PM, Chalcy Raja <Ch...@careerbuilder.com> wrote:
You could try breaking up the Hive query to return smaller datasets. I have noticed this behavior when the Hive query has 'IN' in the WHERE clause.

Thanks,
Chalcy

Re: Hadoop cluster hangs on big hive job

Posted by Daning Wang <da...@netseer.com>.
Thanks Chalcy! But the Hadoop cluster should not hang in any case. Is that a
bug?

On Wed, Mar 6, 2013 at 12:33 PM, Chalcy Raja
<Ch...@careerbuilder.com> wrote:

> You could try breaking up the Hive query to return smaller datasets. I
> have noticed this behavior when the Hive query has 'IN' in the WHERE clause.
>
> Thanks,
> Chalcy

RE: Hadoop cluster hangs on big hive job

Posted by Chalcy Raja <Ch...@careerbuilder.com>.
You could try breaking up the Hive query to return smaller datasets. I have noticed this behavior when the Hive query has 'IN' in the WHERE clause.
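
As a rough illustration of breaking up such a query, a large IN list can be split into chunks, with one smaller query generated per chunk. This is only a sketch under assumptions: the table name `events` and column `user_id` in the usage note are hypothetical, the helper names are invented, and only integer values are handled so that no quoting is needed.

```python
def chunked(values, size):
    """Yield consecutive slices of `values` with at most `size` items each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(values), size):
        yield values[start:start + size]


def build_queries(table, column, values, chunk_size):
    """Build one SELECT statement per chunk of integer IN-list values."""
    queries = []
    for chunk in chunked(values, chunk_size):
        # int() rejects non-integer input instead of emitting unquoted text.
        in_list = ", ".join(str(int(v)) for v in chunk)
        queries.append(f"SELECT * FROM {table} WHERE {column} IN ({in_list})")
    return queries
```

For example, `build_queries("events", "user_id", ids, 1000)` yields a list of statements that can be run one at a time, so each job shuffles a smaller dataset. Whether this avoids the hang depends on the actual query and data, which the thread does not show.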

Thanks,
Chalcy