Posted to mapreduce-user@hadoop.apache.org by Mayuresh <ma...@gmail.com> on 2011/06/02 11:36:58 UTC

Debugging killed task attempts

Hi,

I am trying to scan around 4,600,000 rows of HBase data, using Hive to
query them back. I start the job with around 25 maps, and there are 11 nodes
in my cluster, each running 2 maps at a time.
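
(For context, the thread doesn't show the actual Hive DDL or query. A minimal
sketch of the kind of setup being described, with made-up table and column
names, would be a Hive table mapped onto HBase via the HBase storage handler
and then scanned with an ordinary query:)

# Hypothetical illustration only: table name, column family and query are not from this thread
hive -e "
  CREATE EXTERNAL TABLE hbase_rows (key STRING, val STRING)
  STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
  WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,cf:val')
  TBLPROPERTIES ('hbase.table.name' = 'my_hbase_table');

  SELECT COUNT(1) FROM hbase_rows;
"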

I saw that it took around 7 minutes to scan all this data with 7 nodes.
However, after I added 4 more nodes, it is taking even more time. In the map
task which is taking the longest, I see the following:

attempt_201106011013_0010_m_000009_0    Task attempt:
/default-rack/domU-12-31-39-0F-75-13.compute-1.internal
Cleanup Attempt: /default-rack/domU-12-31-39-0F-75-13.compute-1.internal
KILLED    100.00%
    2-Jun-2011 08:53:16    2-Jun-2011 09:01:48 (8mins, 32sec)

and

attempt_201106011013_0010_m_000009_1
/default-rack/ip-10-196-198-48.ec2.internal    SUCCEEDED    100.00%
    2-Jun-2011 08:57:28    2-Jun-2011 09:01:44 (4mins, 15sec)

The first attempt ran for 8 mins 32 secs before getting killed. I checked the
datanode logs, and all I see there is some data coming in and some going
out. Can someone point me to how I can debug exactly what was going on, and
how I can avoid such long non-useful task attempts from being run?
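
(For reference, the thread doesn't say how the attempt was investigated. A
rough way to dig into a single attempt with the Hadoop 0.20-era CLI and log
layout, plus one common knob, would be something like the following; the job
and attempt IDs are the ones from this thread, everything else is an
assumption:)

# Job-level status and counters
hadoop job -status job_201106011013_0010

# The per-attempt stdout/stderr/syslog live on the tasktracker that ran it, under
#   ${HADOOP_LOG_DIR}/userlogs/attempt_201106011013_0010_m_000009_0/
# and are also linked from the JobTracker web UI (port 50030 by default).

# If the killed attempt turns out to be a speculative duplicate of the one that
# succeeded, speculative map execution can be switched off for the Hive query
# (hbase_rows is the hypothetical table sketched earlier):
hive -e "SET mapred.map.tasks.speculative.execution=false; SELECT COUNT(1) FROM hbase_rows;"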

Thanks,
-Mayuresh

Re: Debugging killed task attempts

Posted by Mayuresh <ma...@gmail.com>.
OK, some more progress. One thing I noticed is that this machine may not be
holding any data yet:

Name: 10.206.73.198:50010
Decommission Status : Normal
Configured Capacity: 359313149952 (334.64 GB)
DFS Used: 2478612 (2.36 MB)
Non DFS Used: 18468232684 (17.2 GB)
DFS Remaining: 340842438656(317.43 GB)
DFS Used%: 0%
DFS Remaining%: 94.86%
Last contact: Thu Jun 02 10:55:22 UTC 2011


Is there a way to force Hadoop/HBase to balance some data onto this node? I
tried running the Hadoop balancer, but it had no effect.
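
(For reference, a sketch of how the HDFS balancer is normally run and checked;
the per-node report above looks like output from the first command. Note that
the balancer only moves HDFS blocks, while which regionserver actually serves
a scan is decided by HBase's own region assignment, so an empty datanode can
also simply mean the new node hasn't been assigned regions yet. The threshold
value below is an example, not from this thread:)

# See what each datanode currently holds
hadoop dfsadmin -report

# Run the balancer with a tighter threshold (percentage-point deviation from the
# average cluster utilization). The default threshold is 10; with only a few MB
# of DFS used cluster-wide, every node is already within threshold, which is why
# the balancer can appear to do nothing.
hadoop balancer -threshold 5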




-- 
-Mayuresh

Re: Debugging killed task attempts

Posted by Mayuresh <ma...@gmail.com>.
Additional information:

I see the following logs in the syslog of the task attempt:

2011-06-02 09:09:41,145 INFO org.apache.hadoop.hive.ql.exec.MapOperator: 7 forwarding 10000 rows
2011-06-02 09:09:41,145 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: 0 forwarding 10000 rows
2011-06-02 09:09:41,145 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 1 forwarding 10000 rows
2011-06-02 09:09:41,145 INFO ExecMapper: ExecMapper: processing 10000 rows: used memory = 110421792
2011-06-02 09:12:51,340 INFO org.apache.hadoop.hive.ql.exec.MapOperator: 7 forwarding 100000 rows
2011-06-02 09:12:51,340 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: 0 forwarding 100000 rows
2011-06-02 09:12:51,340 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 1 forwarding 100000 rows
2011-06-02 09:12:51,341 INFO ExecMapper: ExecMapper: processing 100000 rows: used memory = 115097528
2011-06-02 09:18:15,591 INFO org.mortbay.log: Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog
2011-06-02 09:18:15,592 WARN org.mortbay.log: Ignore, probably already closed
java.io.IOException: The client is stopped
	at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:822)
	at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:716)
	at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:333)
	at $Proxy2.close(Unknown Source)
	at org.apache.hadoop.hbase.client.ScannerCallable.close(ScannerCallable.java:105)
	at org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:72)
	at org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:38)
	at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getRegionServerWithRetries(HConnectionManager.java:1040)
	at org.apache.hadoop.hbase.client.HTable$ClientScanner.nextScanner(HTable.java:1967)
	at org.apache.hadoop.hbase.client.HTable$ClientScanner.next(HTable.java:2092)
	at org.apache.hadoop.hbase.mapred.TableInputFormatBase$TableRecordReader.next(TableInputFormatBase.java:222)
	at org.apache.hadoop.hbase.mapred.TableInputFormatBase$TableRecordReader.next(TableInputFormatBase.java:91)
	at org.apache.hadoop.hive.ql.io.HiveRecordReader.next(HiveRecordReader.java:67)
	at org.apache.hadoop.hive.ql.io.HiveRecordReader.next(HiveRecordReader.java:33)
	at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192)
	at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176)
	at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
	at org.apache.hadoop.mapred.Child.main(Child.java:170)






-- 
-Mayuresh