Posted to mapreduce-user@hadoop.apache.org by yaotian <ya...@gmail.com> on 2013/01/11 04:23:38 UTC

I am running MapReduce on 30 GB of data on 1 master / 2 slaves, but it failed.

I have 1 Hadoop master, where the namenode runs, and 2 slaves, where the
datanodes run.

If I choose a small dataset, around 200 MB, the job completes.

But if I run the 30 GB dataset, the map phase finishes, but the reduce
phase reports errors. Any suggestions?


This is the information.

Black-listed TaskTrackers: 1

Kind     % Complete   Num Tasks   Pending   Running   Complete   Killed   Failed/Killed Task Attempts
map      100.00%      450         0         0         450        0        0 / 1
reduce   100.00%      1500        0         0         2          1498     12 / 3

Task                              Complete   Start Time             Finish Time                                  Counters
task_201301090834_0041_r_000001   0.00%      10-Jan-2013 04:18:54   10-Jan-2013 06:46:38 (2hrs, 27mins, 44sec)   0
  Task attempt_201301090834_0041_r_000001_0 failed to report status for 600 seconds. Killing!
  Task attempt_201301090834_0041_r_000001_1 failed to report status for 602 seconds. Killing!
  Task attempt_201301090834_0041_r_000001_2 failed to report status for 602 seconds. Killing!
  Task attempt_201301090834_0041_r_000001_3 failed to report status for 602 seconds. Killing!
task_201301090834_0041_r_000002   0.00%      10-Jan-2013 04:18:54   10-Jan-2013 06:46:38 (2hrs, 27mins, 43sec)   0
  Task attempt_201301090834_0041_r_000002_0 failed to report status for 601 seconds. Killing!
  Task attempt_201301090834_0041_r_000002_1 failed to report status for 600 seconds. Killing!
task_201301090834_0041_r_000003   0.00%      10-Jan-2013 04:18:57   10-Jan-2013 06:46:38 (2hrs, 27mins, 41sec)   0
  Task attempt_201301090834_0041_r_000003_0 failed to report status for 602 seconds. Killing!
  Task attempt_201301090834_0041_r_000003_1 failed to report status for 602 seconds. Killing!
  Task attempt_201301090834_0041_r_000003_2 failed to report status for 602 seconds. Killing!
task_201301090834_0041_r_000005   0.00%      10-Jan-2013 06:11:07   10-Jan-2013 06:46:38 (35mins, 31sec)         0
  Task attempt_201301090834_0041_r_000005_0 failed to report status for 600 seconds. Killing!

Re: I am running MapReduce on 30 GB of data on 1 master / 2 slaves, but it failed.

Posted by Serge Blazhiyevskyy <Se...@nice.com>.
Are you running this on a VM, by any chance?

On Jan 10, 2013, at 9:11 PM, Mahesh Balija <ba...@gmail.com> wrote:

Hi,

          2 reducers completed successfully and 1498 were killed. I assume you have a data issue (either the data is huge, or there is some problem with the data you are trying to process).
          One possibility is that you have many values associated with a single key, which can cause this kind of issue depending on the operation you do in your reducer.
          Can you put some logging in your reducer and try to trace what is happening?

Best,
Mahesh Balija,
Calsoft Labs.
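
For illustration, the kind of tracing Mahesh suggests could look roughly like
the sketch below, written against the old org.apache.hadoop.mapred API that
this cluster appears to use. The class name, counter names, and the Text/Text
types are invented here, since the job's real reducer is not shown in the
thread:

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    // Illustrative reducer: counts how many values arrive for each key and
    // surfaces heavily skewed keys through counters and the task's stderr log.
    public class TracingReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {

      @Override
      public void reduce(Text key, Iterator<Text> values,
                         OutputCollector<Text, Text> output, Reporter reporter)
          throws IOException {
        long valueCount = 0;
        while (values.hasNext()) {
          // Placeholder pass-through; the job's real per-value work goes here.
          output.collect(key, values.next());
          valueCount++;
        }
        // Counters show up in the job web UI; stderr ends up in the task logs.
        reporter.incrCounter("Trace", "KEYS", 1);
        reporter.incrCounter("Trace", "VALUES", valueCount);
        if (valueCount > 100000) {
          reporter.incrCounter("Trace", "KEYS_OVER_100K_VALUES", 1);
          System.err.println("Skewed key " + key + " had " + valueCount + " values");
        }
      }
    }

A quick look at the Trace counters after a run makes it obvious whether a
handful of keys carry most of the 30 GB.
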

On Fri, Jan 11, 2013 at 8:53 AM, yaotian <ya...@gmail.com> wrote:
I have 1 Hadoop master, where the namenode runs, and 2 slaves, where the datanodes run.

If I choose a small dataset, around 200 MB, the job completes.

But if I run the 30 GB dataset, the map phase finishes, but the reduce phase reports errors. Any suggestions?


This is the information.

Black-listed TaskTrackers: 1

Kind     % Complete   Num Tasks   Pending   Running   Complete   Killed   Failed/Killed Task Attempts
map      100.00%      450         0         0         450        0        0 / 1
reduce   100.00%      1500        0         0         2          1498     12 / 3

Task                              Complete   Start Time             Finish Time                                  Counters
task_201301090834_0041_r_000001   0.00%      10-Jan-2013 04:18:54   10-Jan-2013 06:46:38 (2hrs, 27mins, 44sec)   0
  Task attempt_201301090834_0041_r_000001_0 failed to report status for 600 seconds. Killing!
  Task attempt_201301090834_0041_r_000001_1 failed to report status for 602 seconds. Killing!
  Task attempt_201301090834_0041_r_000001_2 failed to report status for 602 seconds. Killing!
  Task attempt_201301090834_0041_r_000001_3 failed to report status for 602 seconds. Killing!
task_201301090834_0041_r_000002   0.00%      10-Jan-2013 04:18:54   10-Jan-2013 06:46:38 (2hrs, 27mins, 43sec)   0
  Task attempt_201301090834_0041_r_000002_0 failed to report status for 601 seconds. Killing!
  Task attempt_201301090834_0041_r_000002_1 failed to report status for 600 seconds. Killing!
task_201301090834_0041_r_000003   0.00%      10-Jan-2013 04:18:57   10-Jan-2013 06:46:38 (2hrs, 27mins, 41sec)   0
  Task attempt_201301090834_0041_r_000003_0 failed to report status for 602 seconds. Killing!
  Task attempt_201301090834_0041_r_000003_1 failed to report status for 602 seconds. Killing!
  Task attempt_201301090834_0041_r_000003_2 failed to report status for 602 seconds. Killing!
task_201301090834_0041_r_000005   0.00%      10-Jan-2013 06:11:07   10-Jan-2013 06:46:38 (35mins, 31sec)         0
  Task attempt_201301090834_0041_r_000005_0 failed to report status for 600 seconds. Killing!



Re: Re: I am running MapReduce on 30 GB of data on 1 master / 2 slaves, but it failed.

Posted by yaotian <ya...@gmail.com>.
Thanks. I will read that document. I am new to this, so I am attaching the
log here for help; I can't understand this information.

These logs are from the tasktracker log file:

2013-01-15 08:00:44,731 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201301150318_0001_r_000000_0 0.66737837% reduce > reduce
2013-01-15 08:00:47,757 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201301150318_0001_r_000000_0 0.66737837% reduce > reduce
2013-01-15 08:10:49,083 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201301150318_0001_r_000000_0: Task
attempt_201301150318_0001_r_000000_0 failed to report status for 601
seconds. Killing!
2013-01-15 08:10:49,092 INFO org.apache.hadoop.mapred.TaskTracker: Process
Thread Dump: lost task
24 active threads
Thread 219 (process reaper):
  State: RUNNABLE
  Blocked count: 0
  Waited count: 0
  Stack:
    java.lang.UNIXProcess.waitForProcessExit(Native Method)
    java.lang.UNIXProcess.access$900(UNIXProcess.java:20)
    java.lang.UNIXProcess$1$1.run(UNIXProcess.java:132)
Thread 218 (JVM Runner jvm_201301150318_0001_r_-641939786 spawned.):
  State: WAITING
  Blocked count: 1
  Waited count: 2
  Waiting on java.lang.UNIXProcess@1195929
  Stack:
    java.lang.Object.wait(Native Method)
    java.lang.Object.wait(Object.java:485)
    java.lang.UNIXProcess.waitFor(UNIXProcess.java:165)
    org.apache.hadoop.util.Shell.runCommand(Shell.java:244)
    org.apache.hadoop.util.Shell.run(Shell.java:182)

org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:375)

org.apache.hadoop.mapred.DefaultTaskController.launchTask(DefaultTaskController.java:131)

org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.runChild(JvmManager.java:497)

org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.run(JvmManager.java:471)
Thread 214 (Thread-135):
  State: WAITING
  Blocked count: 2
  Waited count: 3
  Waiting on java.lang.Object@1f35647
  Stack:
    java.lang.Object.wait(Native Method)
    java.lang.Object.wait(Object.java:485)

org.apache.hadoop.mapred.TaskRunner.launchJvmAndWait(TaskRunner.java:296)
    org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:251)
Thread 38 (IPC Client (47) connection to master/10.120.253.32:9001 from
hadoop):
  State: TIMED_WAITING
  Blocked count: 11210
  Waited count: 11199
  Stack:
    java.lang.Object.wait(Native Method)
    org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:702)
    org.apache.hadoop.ipc.Client$Connection.run(Client.java:744)
Thread 10 (taskCleanup):
  State: WAITING
  Blocked count: 4
  Waited count: 7
  Waiting on
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@104109e
  Stack:
    sun.misc.Unsafe.park(Native Method)
    java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)

java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)

java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)
    org.apache.hadoop.mapred.TaskTracker$1.run(TaskTracker.java:422)
    java.lang.Thread.run(Thread.java:662)
Thread 13 (Thread-5):
  State: WAITING
  Blocked count: 475
  Waited count: 646
  Waiting on
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@61bb9b
  Stack:
    sun.misc.Unsafe.park(Native Method)
    java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)

java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)

java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)

org.apache.hadoop.mapreduce.server.tasktracker.userlogs.UserLogManager.monitor(UserLogManager.java:131)

org.apache.hadoop.mapreduce.server.tasktracker.userlogs.UserLogManager$1.run(UserLogManager.java:66)
Thread 14 (Thread-6):
  State: TIMED_WAITING
  Blocked count: 0
  Waited count: 5
  Stack:
    java.lang.Thread.sleep(Native Method)
    org.apache.hadoop.mapred.UserLogCleaner.run(UserLogCleaner.java:93)
Thread 37 (Timer-0):
  State: TIMED_WAITING
  Blocked count: 0
  Waited count: 618
  Stack:
    java.lang.Object.wait(Native Method)
    java.util.TimerThread.mainLoop(Timer.java:509)
    java.util.TimerThread.run(Timer.java:462)
Thread 36 (23978087@qtp-7433399-0 - Acceptor0
SelectChannelConnector@0.0.0.0:50060):
  State: RUNNABLE
  Blocked count: 29509
  Waited count: 1
  Stack:
    sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
    sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:210)
    sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
    sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
    sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)

org.mortbay.io.nio.SelectorManager$SelectSet.doSelect(SelectorManager.java:498)
    org.mortbay.io.nio.SelectorManager.doSelect(SelectorManager.java:192)

org.mortbay.jetty.nio.SelectChannelConnector.accept(SelectChannelConnector.java:124)

org.mortbay.jetty.AbstractConnector$Acceptor.run(AbstractConnector.java:708)

org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
Thread 35 (TaskLauncher for REDUCE tasks):
  State: WAITING
  Blocked count: 1
  Waited count: 2
  Waiting on java.util.LinkedList@ca1038
  Stack:
    java.lang.Object.wait(Native Method)
    java.lang.Object.wait(Object.java:485)

org.apache.hadoop.mapred.TaskTracker$TaskLauncher.run(TaskTracker.java:2265)
Thread 34 (TaskLauncher for MAP tasks):
  State: WAITING
  Blocked count: 170
  Waited count: 171
  Waiting on java.util.LinkedList@1ddec9e
  Stack:
    java.lang.Object.wait(Native Method)
    java.lang.Object.wait(Object.java:485)

org.apache.hadoop.mapred.TaskTracker$TaskLauncher.run(TaskTracker.java:2265)
Thread 27 (Map-events fetcher for all reduce tasks on
tracker_ip-10-87-145-204.ec2.internal:localhost/127.0.0.1:60497):
  State: TIMED_WAITING
  Blocked count: 16184
  Waited count: 17896
  Stack:
    java.lang.Object.wait(Native Method)

org.apache.hadoop.mapred.TaskTracker$MapEventsFetcherThread.run(TaskTracker.java:975)
Thread 25 (Thread-14):
  State: TIMED_WAITING
  Blocked count: 0
  Waited count: 293
  Stack:
    java.lang.Thread.sleep(Native Method)

org.apache.hadoop.filecache.TrackerDistributedCacheManager$CleanupThread.run(TrackerDistributedCacheManager.java:926)
Thread 24 (IPC Server handler 1 on 60497):
  State: WAITING
  Blocked count: 1
  Waited count: 10501
  Waiting on
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@fe99b6
  Stack:
    sun.misc.Unsafe.park(Native Method)
    java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)

java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)

java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)
    org.apache.hadoop.ipc.Server$Handler.run(Server.java:1364)
Thread 23 (IPC Server handler 0 on 60497):
  State: WAITING
  Blocked count: 1
  Waited count: 10502
  Waiting on
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@fe99b6
  Stack:
    sun.misc.Unsafe.park(Native Method)
    java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)

java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)

java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)
    org.apache.hadoop.ipc.Server$Handler.run(Server.java:1364)
Thread 20 (IPC Server listener on 60497):
  State: RUNNABLE
  Blocked count: 171
  Waited count: 0
  Stack:
    sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
    sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:210)
    sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
    sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
    sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
    sun.nio.ch.SelectorImpl.select(SelectorImpl.java:84)
    org.apache.hadoop.ipc.Server$Listener.run(Server.java:439)
Thread 22 (IPC Server Responder):
  State: RUNNABLE
  Blocked count: 0
  Waited count: 0
  Stack:
    sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
    sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:210)
    sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
    sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
    sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
    org.apache.hadoop.ipc.Server$Responder.run(Server.java:605)
Thread 21 (pool-1-thread-1):
  State: RUNNABLE
  Blocked count: 171
  Waited count: 171
  Stack:
    sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
    sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:210)
    sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
    sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
    sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
    sun.nio.ch.SelectorImpl.select(SelectorImpl.java:84)
    org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:333)

java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    java.lang.Thread.run(Thread.java:662)
Thread 15 (Directory/File cleanup thread):
  State: WAITING
  Blocked count: 0
  Waited count: 343
  Waiting on
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@94af2f
  Stack:
    sun.misc.Unsafe.park(Native Method)
    java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)

java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)

java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)

org.apache.hadoop.mapred.CleanupQueue$PathCleanupThread.run(CleanupQueue.java:130)
Thread 9 (Timer for 'TaskTracker' metrics system):
  State: TIMED_WAITING
  Blocked count: 0
  Waited count: 1796
  Stack:
    java.lang.Object.wait(Native Method)
    java.util.TimerThread.mainLoop(Timer.java:509)
    java.util.TimerThread.run(Timer.java:462)
Thread 4 (Signal Dispatcher):
  State: RUNNABLE
  Blocked count: 0
  Waited count: 0
  Stack:
Thread 3 (Finalizer):
  State: WAITING
  Blocked count: 186
  Waited count: 187
  Waiting on java.lang.ref.ReferenceQueue$Lock@19a0203
  Stack:
    java.lang.Object.wait(Native Method)
    java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118)
    java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:134)
    java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)
Thread 2 (Reference Handler):
  State: WAITING
  Blocked count: 186
  Waited count: 187
  Waiting on java.lang.ref.Reference$Lock@1fa39bb
  Stack:
    java.lang.Object.wait(Native Method)
    java.lang.Object.wait(Object.java:485)
    java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
Thread 1 (main):
  State: RUNNABLE
  Blocked count: 5850
  Waited count: 11663
  Stack:
    sun.management.ThreadImpl.getThreadInfo1(Native Method)
    sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:156)
    sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:121)

org.apache.hadoop.util.ReflectionUtils.printThreadInfo(ReflectionUtils.java:149)

org.apache.hadoop.util.ReflectionUtils.logThreadInfo(ReflectionUtils.java:203)

org.apache.hadoop.mapred.TaskTracker.markUnresponsiveTasks(TaskTracker.java:1970)
    org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:1652)
    org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:2434)
    org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:3675)

2013-01-15 08:10:49,093 INFO org.apache.hadoop.mapred.TaskTracker: About to
purge task: attempt_201301150318_0001_r_000000_0
2013-01-15 08:10:49,114 INFO org.apache.hadoop.util.ProcessTree: Killing
process group21719 with signal TERM. Exit code 0
2013-01-15 08:10:49,115 INFO org.apache.hadoop.mapred.TaskTracker:
addFreeSlot : current free slots : 1
2013-01-15 08:10:52,095 INFO org.apache.hadoop.mapred.TaskTracker:
LaunchTaskAction (registerTask): attempt_201301150318_0001_r_000000_0
task's state:FAILED_UNCLEAN
2013-01-15 08:10:52,096 INFO org.apache.hadoop.mapred.TaskTracker: Trying
to launch : attempt_201301150318_0001_r_000000_0 which needs 1 slots
2013-01-15 08:10:52,096 INFO org.apache.hadoop.mapred.TaskTracker: In
TaskLauncher, current free slots : 1 and trying to launch
attempt_201301150318_0001_r_000000_0 which needs 1 slots
2013-01-15 08:10:52,112 INFO org.apache.hadoop.mapred.JvmManager: In
JvmRunner constructed JVM ID: jvm_201301150318_0001_r_-56086075
2013-01-15 08:10:52,113 INFO org.apache.hadoop.mapred.JvmManager: JVM
Runner jvm_201301150318_0001_r_-56086075 spawned.
2013-01-15 08:10:52,122 INFO org.apache.hadoop.mapred.TaskController:
Writing commands to
/data/hadoop/mapred/local/ttprivate/taskTracker/hadoop/jobcache/job_201301150318_0001/attempt_201301150318_0001_r_000000_0.cleanup/taskjvm.sh
2013-01-15 08:10:52,780 WARN
org.apache.hadoop.mapred.DefaultTaskController: Exit code from task is : 143
2013-01-15 08:10:52,780 INFO
org.apache.hadoop.mapred.DefaultTaskController: Output from
DefaultTaskController's launchTask follows:
2013-01-15 08:10:52,780 INFO org.apache.hadoop.mapred.TaskController:
2013-01-15 08:10:52,780 INFO org.apache.hadoop.mapred.JvmManager: JVM :
jvm_201301150318_0001_r_-641939786 exited with exit code 143. Number of
tasks it ran: 0
2013-01-15 08:10:52,794 INFO org.apache.hadoop.io.nativeio.NativeIO: Got
UserName hadoop for UID 1006 from the native implementation
2013-01-15 08:10:53,177 INFO org.apache.hadoop.mapred.TaskTracker: JVM with
ID: jvm_201301150318_0001_r_-56086075 given task:
attempt_201301150318_0001_r_000000_0
2013-01-15 08:10:53,652 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201301150318_0001_r_000000_0 0.0%
2013-01-15 08:10:56,620 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201301150318_0001_r_000000_0 0.0% cleanup
2013-01-15 08:10:56,621 INFO org.apache.hadoop.mapred.TaskTracker: Task
attempt_201301150318_0001_r_000000_0 is done.
2013-01-15 08:10:56,621 INFO org.apache.hadoop.mapred.TaskTracker: reported
output size for attempt_201301150318_0001_r_000000_0  was -1
2013-01-15 08:10:56,621 INFO org.apache.hadoop.mapred.TaskTracker:
addFreeSlot : current free slots : 1
2013-01-15 08:10:56,679 INFO org.apache.hadoop.mapred.JvmManager: JVM :
jvm_201301150318_0001_r_-56086075 exited with exit code 0. Number of tasks
it ran: 1
2013-01-15 08:11:07,483 INFO
org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 10.87.145.204:50060,
dest: 10.123.73.45:42479, bytes: 6394091, op: MAPRED_SHUFFLE, cliID:
attempt_201301150318_0001_m_000000_0, duration: 111147314
2013-01-15 08:11:07,831 INFO
org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 10.87.145.204:50060,
dest: 10.123.73.45:42479, bytes: 6163890, op: MAPRED_SHUFFLE, cliID:
attempt_201301150318_0001_m_000001_0, duration: 126898055


2013/1/15 Charlie A. <ha...@163.com>

> Hi, yaotian
> I think you should check the logs on that particular tasktracker; they will tell you why.
> And here are some tips on improving MapReduce performance:
> http://blog.cloudera.com/blog/2009/12/7-tips-for-improving-mapreduce-performance/
>
> Charlie
>
>
>
> At 2013-01-15 16:34:27,yaotian <ya...@gmail.com> wrote:
>
> I set mapred.reduce.tasks from -1 to "AutoReduce".
> Hadoop then created 450 map tasks but only 1 reduce task. It seems that
> this reduce runs on only 1 slave (I have two slaves).
>
> But when it reached 66%, the error was reported again: "Task
> attempt_201301150318_0001_r_000000_0 failed to report status for 601
> seconds. Killing!"
>
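
For what it's worth, the reduce-task count is normally set explicitly in the
job driver rather than left at -1. A hedged sketch follows; the class name and
the count of 4 are placeholders, and everything else about the job is omitted:

    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    // Hypothetical driver skeleton: with a single reduce task the whole 30 GB
    // shuffle lands on one slave; asking for several spreads the reduce work out.
    public class MultiReduceDriver {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(MultiReduceDriver.class);
        conf.setNumReduceTasks(4);   // e.g. 2 slaves x 2 reduce slots each
        // ... mapper/reducer classes, input/output formats and paths go here ...
        JobClient.runJob(conf);
      }
    }

When the driver goes through ToolRunner, the same thing can be passed per run
as -Dmapred.reduce.tasks=4.
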
>
>
> 2013/1/14 yaotian <ya...@gmail.com>
>
>> How do I judge which counter would work?
>>
>>
>> 2013/1/11 <be...@gmail.com>
>>
>> **
>>> Hi
>>>
>>> To add on to Harsh's comments.
>>>
>>> You do not need to change the task timeout.
>>>
>>> In your map/reduce code, you can increment a counter or report status at
>>> intervals so that there is communication from the task, and hence the task
>>> won't time out.
>>>
>>> Every map and reduce task runs in its own JVM, limited by the JVM heap size.
>>> If you try to hold too much data in memory, it can exceed that limit and
>>> cause OOM errors.
>>>
>>> Regards
>>> Bejoy KS
>>>
>>> Sent from remote device, Please excuse typos
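
As a concrete illustration of Bejoy's point about reporting from inside the
reduce loop, a reducer in the old mapred API might keep the TaskTracker
informed like this, while streaming its output rather than buffering it. The
class name, counter names, reporting interval, and the Text/LongWritable types
are invented for the example:

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    // Illustrative reducer: heartbeats to the TaskTracker every REPORT_EVERY
    // values so a long-running key never hits the 600-second status timeout,
    // and keeps only a running sum in memory rather than the whole value list.
    public class KeepAliveReducer extends MapReduceBase
        implements Reducer<Text, LongWritable, Text, LongWritable> {

      private static final int REPORT_EVERY = 10000;

      @Override
      public void reduce(Text key, Iterator<LongWritable> values,
                         OutputCollector<Text, LongWritable> output, Reporter reporter)
          throws IOException {
        long sum = 0;
        long seen = 0;
        while (values.hasNext()) {
          sum += values.next().get();
          if (++seen % REPORT_EVERY == 0) {
            reporter.progress();                 // resets the mapred.task.timeout clock
            reporter.incrCounter("KeepAlive", "VALUES_SEEN", REPORT_EVERY);
            reporter.setStatus(key + ": " + seen + " values so far");
          }
        }
        output.collect(key, new LongWritable(sum));
      }
    }
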
>>> ------------------------------
>>> From: yaotian <ya...@gmail.com>
>>> Date: Fri, 11 Jan 2013 14:35:07 +0800
>>> To: <us...@hadoop.apache.org>
>>> Reply-To: user@hadoop.apache.org
>>> Subject: Re: I am running MapReduce on 30 GB of data on 1 master / 2 slaves,
>>> but it failed.
>>>
>>> See inline.
>>>
>>>
>>> 2013/1/11 Harsh J <ha...@cloudera.com>
>>>
>>>> If the per-record processing time is very high, you will need to
>>>> periodically report status. Without a status report from the task to the
>>>> tracker, it will be killed as a dead task after the default timeout of 10
>>>> minutes (600s).
>>>>
>>> =====================> Do you mean I should increase the reporting
>>> timeout, "mapred.task.timeout"?
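
For reference, if one did decide to raise that timeout (rather than reporting
progress, which is what Harsh and Bejoy recommend), it is just a job
configuration property. A hypothetical driver sketch, with everything except
the timeout line left out:

    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    // Hypothetical driver skeleton: only the mapred.task.timeout line matters.
    // The property is in milliseconds; the 600000 ms (10 minute) default is
    // what produces the "failed to report status for 600 seconds" kills.
    public class TimeoutDriver {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(TimeoutDriver.class);
        conf.setLong("mapred.task.timeout", 1800000L);   // e.g. 30 minutes
        // ... mapper/reducer classes, input/output formats and paths go here ...
        JobClient.runJob(conf);
      }
    }
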
>>>
>>>
>>>> Also, beware of holding too much memory in a reduce JVM - you're still
>>>> limited there. Best to let the framework do the sort or secondary sort.
>>>>
>>> =======================> You mean I should use the default value? This is
>>> my current value:
>>> mapred.job.reduce.memory.mb = -1
>>>
>>>>
>>>>
>>>> On Fri, Jan 11, 2013 at 10:58 AM, yaotian <ya...@gmail.com> wrote:
>>>>
>>>>> Yes, you are right. The data is GPS traces keyed by the corresponding
>>>>> uid. The reducer does this: sort per user to get results of the form
>>>>> uid, gps1, gps2, gps3, ...
>>>>> Yes, the GPS data is big, because this is 30 GB of data.
>>>>>
>>>>> How do I solve this?
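
One way to act on Harsh's "let the framework do the sort" suggestion for this
uid/GPS layout is a secondary sort, so each uid's points arrive at the reducer
already ordered and can be written out one by one, with nothing held in
memory. A rough sketch against the old mapred API; the class names and the
timestamp field are invented for illustration, since the real record format is
not shown in the thread:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    // Composite key: partition and group by uid, sort by (uid, timestamp).
    public class UidTimeKey implements WritableComparable<UidTimeKey> {
      private Text uid = new Text();
      private long timestamp;

      public void set(String u, long ts) { uid.set(u); timestamp = ts; }

      public void write(DataOutput out) throws IOException {
        uid.write(out);
        out.writeLong(timestamp);
      }

      public void readFields(DataInput in) throws IOException {
        uid.readFields(in);
        timestamp = in.readLong();
      }

      // Full sort order used by the shuffle: uid first, then time.
      public int compareTo(UidTimeKey o) {
        int c = uid.compareTo(o.uid);
        if (c != 0) return c;
        return timestamp < o.timestamp ? -1 : (timestamp == o.timestamp ? 0 : 1);
      }

      // Route every record of a uid to the same reducer, whatever its timestamp.
      public static class UidPartitioner implements Partitioner<UidTimeKey, Text> {
        public void configure(JobConf job) {}
        public int getPartition(UidTimeKey key, Text value, int numPartitions) {
          return (key.uid.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
      }

      // Group reducer input by uid only, ignoring the timestamp half of the key.
      public static class UidGroupingComparator extends WritableComparator {
        protected UidGroupingComparator() { super(UidTimeKey.class, true); }
        public int compare(WritableComparable a, WritableComparable b) {
          return ((UidTimeKey) a).uid.compareTo(((UidTimeKey) b).uid);
        }
      }
    }

The driver would register these with conf.setPartitionerClass(
UidTimeKey.UidPartitioner.class) and conf.setOutputValueGroupingComparator(
UidTimeKey.UidGroupingComparator.class); the default key sort already uses
compareTo, and the reducer then just streams output.collect(...) per value, as
in the earlier sketches.
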
>>>>>
>>>>>
>>>>>
>>>>> 2013/1/11 Mahesh Balija <ba...@gmail.com>
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>>           2 reducers completed successfully and 1498 were killed. I
>>>>>> assume you have a data issue (either the data is huge, or there is some
>>>>>> problem with the data you are trying to process).
>>>>>>           One possibility is that you have many values associated with
>>>>>> a single key, which can cause this kind of issue depending on the
>>>>>> operation you do in your reducer.
>>>>>>           Can you put some logging in your reducer and try to trace
>>>>>> what is happening?
>>>>>>
>>>>>> Best,
>>>>>> Mahesh Balija,
>>>>>> Calsoft Labs.
>>>>>>
>>>>>>
>>>>>> On Fri, Jan 11, 2013 at 8:53 AM, yaotian <ya...@gmail.com> wrote:
>>>>>>
>>>>>>> I have 1 Hadoop master, where the namenode runs, and 2 slaves, where
>>>>>>> the datanodes run.
>>>>>>>
>>>>>>> If I choose a small dataset, around 200 MB, the job completes.
>>>>>>>
>>>>>>> But if I run the 30 GB dataset, the map phase finishes, but the reduce
>>>>>>> phase reports errors. Any suggestions?
>>>>>>>
>>>>>>>
>>>>>>> This is the information.
>>>>>>>
>>>>>>> Black-listed TaskTrackers: 1
>>>>>>>
>>>>>>> Kind     % Complete   Num Tasks   Pending   Running   Complete   Killed   Failed/Killed Task Attempts
>>>>>>> map      100.00%      450         0         0         450        0        0 / 1
>>>>>>> reduce   100.00%      1500        0         0         2          1498     12 / 3
>>>>>>>
>>>>>>> Task                              Complete   Start Time             Finish Time                                  Counters
>>>>>>> task_201301090834_0041_r_000001   0.00%      10-Jan-2013 04:18:54   10-Jan-2013 06:46:38 (2hrs, 27mins, 44sec)   0
>>>>>>>   Task attempt_201301090834_0041_r_000001_0 failed to report status for 600 seconds. Killing!
>>>>>>>   Task attempt_201301090834_0041_r_000001_1 failed to report status for 602 seconds. Killing!
>>>>>>>   Task attempt_201301090834_0041_r_000001_2 failed to report status for 602 seconds. Killing!
>>>>>>>   Task attempt_201301090834_0041_r_000001_3 failed to report status for 602 seconds. Killing!
>>>>>>> task_201301090834_0041_r_000002   0.00%      10-Jan-2013 04:18:54   10-Jan-2013 06:46:38 (2hrs, 27mins, 43sec)   0
>>>>>>>   Task attempt_201301090834_0041_r_000002_0 failed to report status for 601 seconds. Killing!
>>>>>>>   Task attempt_201301090834_0041_r_000002_1 failed to report status for 600 seconds. Killing!
>>>>>>> task_201301090834_0041_r_000003   0.00%      10-Jan-2013 04:18:57   10-Jan-2013 06:46:38 (2hrs, 27mins, 41sec)   0
>>>>>>>   Task attempt_201301090834_0041_r_000003_0 failed to report status for 602 seconds. Killing!
>>>>>>>   Task attempt_201301090834_0041_r_000003_1 failed to report status for 602 seconds. Killing!
>>>>>>>   Task attempt_201301090834_0041_r_000003_2 failed to report status for 602 seconds. Killing!
>>>>>>> task_201301090834_0041_r_000005   0.00%      10-Jan-2013 06:11:07   10-Jan-2013 06:46:38 (35mins, 31sec)         0
>>>>>>>   Task attempt_201301090834_0041_r_000005_0 failed to report status for 600 seconds. Killing!
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Harsh J
>>>>
>>>
>>>
>>
>
>
>

Re: Re: I am running MapReduce on a 30G data on 1master/2 slave, but failed.

Posted by yaotian <ya...@gmail.com>.
Thanks. I will read that document. I am new on this. So i attach the log
here for help. I can't understand this information.

These log from tasktrack log file:

2013-01-15 08:00:44,731 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201301150318_0001_r_000000_0 0.66737837% reduce > reduce
2013-01-15 08:00:47,757 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201301150318_0001_r_000000_0 0.66737837% reduce > reduce
2013-01-15 08:10:49,083 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201301150318_0001_r_000000_0: Task
attempt_201301150318_0001_r_000000_0 failed to report status for 601
seconds. Killing!
2013-01-15 08:10:49,092 INFO org.apache.hadoop.mapred.TaskTracker: Process
Thread Dump: lost task
24 active threads
Thread 219 (process reaper):
  State: RUNNABLE
  Blocked count: 0
  Waited count: 0
  Stack:
    java.lang.UNIXProcess.waitForProcessExit(Native Method)
    java.lang.UNIXProcess.access$900(UNIXProcess.java:20)
    java.lang.UNIXProcess$1$1.run(UNIXProcess.java:132)
Thread 218 (JVM Runner jvm_201301150318_0001_r_-641939786 spawned.):
  State: WAITING
  Blocked count: 1
  Waited count: 2
  Waiting on java.lang.UNIXProcess@1195929
  Stack:
    java.lang.Object.wait(Native Method)
    java.lang.Object.wait(Object.java:485)
    java.lang.UNIXProcess.waitFor(UNIXProcess.java:165)
    org.apache.hadoop.util.Shell.runCommand(Shell.java:244)
    org.apache.hadoop.util.Shell.run(Shell.java:182)

org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:375)

org.apache.hadoop.mapred.DefaultTaskController.launchTask(DefaultTaskController.java:131)

org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.runChild(JvmManager.java:497)

org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.run(JvmManager.java:471)
Thread 214 (Thread-135):
  State: WAITING
  Blocked count: 2
  Waited count: 3
  Waiting on java.lang.Object@1f35647
  Stack:
    java.lang.Object.wait(Native Method)
    java.lang.Object.wait(Object.java:485)

org.apache.hadoop.mapred.TaskRunner.launchJvmAndWait(TaskRunner.java:296)
    org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:251)
Thread 38 (IPC Client (47) connection to master/10.120.253.32:9001 from
hadoop):
  State: TIMED_WAITING
  Blocked count: 11210
  Waited count: 11199
  Stack:
    java.lang.Object.wait(Native Method)
    org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:702)
    org.apache.hadoop.ipc.Client$Connection.run(Client.java:744)
Thread 10 (taskCleanup):
  State: WAITING
  Blocked count: 4
  Waited count: 7
  Waiting on
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@104109e
  Stack:
    sun.misc.Unsafe.park(Native Method)
    java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)

java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)

java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)
    org.apache.hadoop.mapred.TaskTracker$1.run(TaskTracker.java:422)
    java.lang.Thread.run(Thread.java:662)
Thread 13 (Thread-5):
  State: WAITING
  Blocked count: 475
  Waited count: 646
  Waiting on
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@61bb9b
  Stack:
    sun.misc.Unsafe.park(Native Method)
    java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)

java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)

java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)

org.apache.hadoop.mapreduce.server.tasktracker.userlogs.UserLogManager.monitor(UserLogManager.java:131)

org.apache.hadoop.mapreduce.server.tasktracker.userlogs.UserLogManager$1.run(UserLogManager.java:66)
Thread 14 (Thread-6):
  State: TIMED_WAITING
  Blocked count: 0
  Waited count: 5
  Stack:
    java.lang.Thread.sleep(Native Method)
    org.apache.hadoop.mapred.UserLogCleaner.run(UserLogCleaner.java:93)
Thread 37 (Timer-0):
  State: TIMED_WAITING
  Blocked count: 0
  Waited count: 618
  Stack:
    java.lang.Object.wait(Native Method)
    java.util.TimerThread.mainLoop(Timer.java:509)
    java.util.TimerThread.run(Timer.java:462)
Thread 36 (23978087@qtp-7433399-0 - Acceptor0
SelectChannelConnector@0.0.0.0:50060):
  State: RUNNABLE
  Blocked count: 29509
  Waited count: 1
  Stack:
    sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
    sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:210)
    sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
    sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
    sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)

org.mortbay.io.nio.SelectorManager$SelectSet.doSelect(SelectorManager.java:498)
    org.mortbay.io.nio.SelectorManager.doSelect(SelectorManager.java:192)

org.mortbay.jetty.nio.SelectChannelConnector.accept(SelectChannelConnector.java:124)

org.mortbay.jetty.AbstractConnector$Acceptor.run(AbstractConnector.java:708)

org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
Thread 35 (TaskLauncher for REDUCE tasks):
  State: WAITING
  Blocked count: 1
  Waited count: 2
  Waiting on java.util.LinkedList@ca1038
  Stack:
    java.lang.Object.wait(Native Method)
    java.lang.Object.wait(Object.java:485)

org.apache.hadoop.mapred.TaskTracker$TaskLauncher.run(TaskTracker.java:2265)
Thread 34 (TaskLauncher for MAP tasks):
  State: WAITING
  Blocked count: 170
  Waited count: 171
  Waiting on java.util.LinkedList@1ddec9e
  Stack:
    java.lang.Object.wait(Native Method)
    java.lang.Object.wait(Object.java:485)

org.apache.hadoop.mapred.TaskTracker$TaskLauncher.run(TaskTracker.java:2265)
Thread 27 (Map-events fetcher for all reduce tasks on
tracker_ip-10-87-145-204.ec2.internal:localhost/127.0.0.1:60497):
  State: TIMED_WAITING
  Blocked count: 16184
  Waited count: 17896
  Stack:
    java.lang.Object.wait(Native Method)

org.apache.hadoop.mapred.TaskTracker$MapEventsFetcherThread.run(TaskTracker.java:975)
Thread 25 (Thread-14):
  State: TIMED_WAITING
  Blocked count: 0
  Waited count: 293
  Stack:
    java.lang.Thread.sleep(Native Method)

org.apache.hadoop.filecache.TrackerDistributedCacheManager$CleanupThread.run(TrackerDistributedCacheManager.java:926)
Thread 24 (IPC Server handler 1 on 60497):
  State: WAITING
  Blocked count: 1
  Waited count: 10501
  Waiting on
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@fe99b6
  Stack:
    sun.misc.Unsafe.park(Native Method)
    java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)

java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)

java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)
    org.apache.hadoop.ipc.Server$Handler.run(Server.java:1364)
Thread 23 (IPC Server handler 0 on 60497):
  State: WAITING
  Blocked count: 1
  Waited count: 10502
  Waiting on
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@fe99b6
  Stack:
    sun.misc.Unsafe.park(Native Method)
    java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)

java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)

java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)
    org.apache.hadoop.ipc.Server$Handler.run(Server.java:1364)
Thread 20 (IPC Server listener on 60497):
  State: RUNNABLE
  Blocked count: 171
  Waited count: 0
  Stack:
    sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
    sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:210)
    sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
    sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
    sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
    sun.nio.ch.SelectorImpl.select(SelectorImpl.java:84)
    org.apache.hadoop.ipc.Server$Listener.run(Server.java:439)
Thread 22 (IPC Server Responder):
  State: RUNNABLE
  Blocked count: 0
  Waited count: 0
  Stack:
    sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
    sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:210)
    sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
    sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
    sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
    org.apache.hadoop.ipc.Server$Responder.run(Server.java:605)
Thread 21 (pool-1-thread-1):
  State: RUNNABLE
  Blocked count: 171
  Waited count: 171
  Stack:
    sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
    sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:210)
    sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
    sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
    sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
    sun.nio.ch.SelectorImpl.select(SelectorImpl.java:84)
    org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:333)

java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    java.lang.Thread.run(Thread.java:662)
Thread 15 (Directory/File cleanup thread):
  State: WAITING
  Blocked count: 0
  Waited count: 343
  Waiting on
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@94af2f
  Stack:
    sun.misc.Unsafe.park(Native Method)
    java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)

java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)

java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)

org.apache.hadoop.mapred.CleanupQueue$PathCleanupThread.run(CleanupQueue.java:130)
Thread 9 (Timer for 'TaskTracker' metrics system):
  State: TIMED_WAITING
  Blocked count: 0
  Waited count: 1796
  Stack:
    java.lang.Object.wait(Native Method)
    java.util.TimerThread.mainLoop(Timer.java:509)
    java.util.TimerThread.run(Timer.java:462)
Thread 4 (Signal Dispatcher):
  State: RUNNABLE
  Blocked count: 0
  Waited count: 0
  Stack:
Thread 3 (Finalizer):
  State: WAITING
  Blocked count: 186
  Waited count: 187
  Waiting on java.lang.ref.ReferenceQueue$Lock@19a0203
  Stack:
    java.lang.Object.wait(Native Method)
    java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118)
    java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:134)
    java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)
Thread 2 (Reference Handler):
  State: WAITING
  Blocked count: 186
  Waited count: 187
  Waiting on java.lang.ref.Reference$Lock@1fa39bb
  Stack:
    java.lang.Object.wait(Native Method)
    java.lang.Object.wait(Object.java:485)
    java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
Thread 1 (main):
  State: RUNNABLE
  Blocked count: 5850
  Waited count: 11663
  Stack:
    sun.management.ThreadImpl.getThreadInfo1(Native Method)
    sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:156)
    sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:121)

org.apache.hadoop.util.ReflectionUtils.printThreadInfo(ReflectionUtils.java:149)

org.apache.hadoop.util.ReflectionUtils.logThreadInfo(ReflectionUtils.java:203)

org.apache.hadoop.mapred.TaskTracker.markUnresponsiveTasks(TaskTracker.java:1970)
    org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:1652)
    org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:2434)
    org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:3675)

2013-01-15 08:10:49,093 INFO org.apache.hadoop.mapred.TaskTracker: About to
purge task: attempt_201301150318_0001_r_000000_0
2013-01-15 08:10:49,114 INFO org.apache.hadoop.util.ProcessTree: Killing
process group21719 with signal TERM. Exit code 0
2013-01-15 08:10:49,115 INFO org.apache.hadoop.mapred.TaskTracker:
addFreeSlot : current free slots : 1
2013-01-15 08:10:52,095 INFO org.apache.hadoop.mapred.TaskTracker:
LaunchTaskAction (registerTask): attempt_201301150318_0001_r_000000_0
task's state:FAILED_UNCLEAN
2013-01-15 08:10:52,096 INFO org.apache.hadoop.mapred.TaskTracker: Trying
to launch : attempt_201301150318_0001_r_000000_0 which needs 1 slots
2013-01-15 08:10:52,096 INFO org.apache.hadoop.mapred.TaskTracker: In
TaskLauncher, current free slots : 1 and trying to launch
attempt_201301150318_0001_r_000000_0 which needs 1 slots
2013-01-15 08:10:52,112 INFO org.apache.hadoop.mapred.JvmManager: In
JvmRunner constructed JVM ID: jvm_201301150318_0001_r_-56086075
2013-01-15 08:10:52,113 INFO org.apache.hadoop.mapred.JvmManager: JVM
Runner jvm_201301150318_0001_r_-56086075 spawned.
2013-01-15 08:10:52,122 INFO org.apache.hadoop.mapred.TaskController:
Writing commands to
/data/hadoop/mapred/local/ttprivate/taskTracker/hadoop/jobcache/job_201301150318_0001/attempt_201301150318_0001_r_000000_0.cleanup/taskjvm.sh
2013-01-15 08:10:52,780 WARN
org.apache.hadoop.mapred.DefaultTaskController: Exit code from task is : 143
2013-01-15 08:10:52,780 INFO
org.apache.hadoop.mapred.DefaultTaskController: Output from
DefaultTaskController's launchTask follows:
2013-01-15 08:10:52,780 INFO org.apache.hadoop.mapred.TaskController:
2013-01-15 08:10:52,780 INFO org.apache.hadoop.mapred.JvmManager: JVM :
jvm_201301150318_0001_r_-641939786 exited with exit code 143. Number of
tasks it ran: 0
2013-01-15 08:10:52,794 INFO org.apache.hadoop.io.nativeio.NativeIO: Got
UserName hadoop for UID 1006 from the native implementation
2013-01-15 08:10:53,177 INFO org.apache.hadoop.mapred.TaskTracker: JVM with
ID: jvm_201301150318_0001_r_-56086075 given task:
attempt_201301150318_0001_r_000000_0
2013-01-15 08:10:53,652 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201301150318_0001_r_000000_0 0.0%
2013-01-15 08:10:56,620 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201301150318_0001_r_000000_0 0.0% cleanup
2013-01-15 08:10:56,621 INFO org.apache.hadoop.mapred.TaskTracker: Task
attempt_201301150318_0001_r_000000_0 is done.
2013-01-15 08:10:56,621 INFO org.apache.hadoop.mapred.TaskTracker: reported
output size for attempt_201301150318_0001_r_000000_0  was -1
2013-01-15 08:10:56,621 INFO org.apache.hadoop.mapred.TaskTracker:
addFreeSlot : current free slots : 1
2013-01-15 08:10:56,679 INFO org.apache.hadoop.mapred.JvmManager: JVM :
jvm_201301150318_0001_r_-56086075 exited with exit code 0. Number of tasks
it ran: 1
2013-01-15 08:11:07,483 INFO
org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 10.87.145.204:50060,
dest: 10.123.73.45:42479, bytes: 6394091, op: MAPRED_SHUFFLE, cliID:
attempt_201301150318_0001_m_000000_0, duration: 111147314
2013-01-15 08:11:07,831 INFO
org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 10.87.145.204:50060,
dest: 10.123.73.45:42479, bytes: 6163890, op: MAPRED_SHUFFLE, cliID:
attempt_201301150318_0001_m_000001_0, duration: 126898055


2013/1/15 Charlie A. <ha...@163.com>

> Hi, yaotian
> I think you should check logs on the very tasktracker, it'll tell you why.
> And here's some tips on deploying a MR job.
> http://blog.cloudera.com/blog/2009/12/7-tips-for-improving-mapreduce-performance/
>
> Charlie
>
>
>
> At 2013-01-15 16:34:27,yaotian <ya...@gmail.com> wrote:
>
> I set mapred.reduce.tasks from -1 to "AutoReduce"
> And the hadoop created 450 tasks for Map. But 1 task for Reduce. It seems
> that this reduce only run on 1 slave (I have two slaves).
>
> But when it was running on 66%, the error report again "Task
> attempt_201301150318_0001_r_000000_0 failed to report status for 601
> seconds. Killing!"
>
>
>
> 2013/1/14 yaotian <ya...@gmail.com>
>
>> How to judge which counter would work?
>>
>>
>> 2013/1/11 <be...@gmail.com>
>>
>> **
>>> Hi
>>>
>>> To add on to Harsh's comments.
>>>
>>> You need not have to change the task time out.
>>>
>>> In your map/reduce code, you can increment a counter or report status
>>> intermediate on intervals so that there is communication from the task and
>>> hence won't have a task time out.
>>>
>>> Every map and reduce task run on its own jvm limited by a jvm size. If
>>> you try to holds too much data in memory then it can go beyond the jvm size
>>> and cause OOM errors.
>>>
>>> Regards
>>> Bejoy KS
>>>
>>> Sent from remote device, Please excuse typos
>>> ------------------------------
>>> *From: * yaotian <ya...@gmail.com>
>>> *Date: *Fri, 11 Jan 2013 14:35:07 +0800
>>> *To: *<us...@hadoop.apache.org>
>>> *ReplyTo: * user@hadoop.apache.org
>>> *Subject: *Re: I am running MapReduce on a 30G data on 1master/2 slave,
>>> but failed.
>>>
>>> See inline.
>>>
>>>
>>> 2013/1/11 Harsh J <ha...@cloudera.com>
>>>
>>>> If the per-record processing time is very high, you will need to
>>>> periodically report a status. Without a status change report from the task
>>>> to the tracker, it will be killed away as a dead task after a default
>>>> timeout of 10 minutes (600s).
>>>>
>>> =====================> Do you mean to increase the report time: "*
>>> mapred.task.timeout"*?
>>>
>>>
>>>> Also, beware of holding too much memory in a reduce JVM - you're still
>>>> limited there. Best to let the framework do the sort or secondary sort.
>>>>
>>> =======================>  You mean use the default value ? This is my
>>> value.
>>> *mapred.job.reduce.memory.mb*-1
>>>
>>>>
>>>>
>>>> On Fri, Jan 11, 2013 at 10:58 AM, yaotian <ya...@gmail.com> wrote:
>>>>
>>>>> Yes, you are right. The data is GPS trace related to corresponding
>>>>> uid. The reduce is doing this: Sort user to get this kind of result: uid,
>>>>> gps1, gps2, gps3........
>>>>> Yes, the gps data is big because this is 30G data.
>>>>>
>>>>> How to solve this?
>>>>>
>>>>>
>>>>>
>>>>> 2013/1/11 Mahesh Balija <ba...@gmail.com>
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>>           2 reducers are successfully completed and 1498 have been
>>>>>> killed. I assume that you have the data issues. (Either the data is huge or
>>>>>> some issues with the data you are trying to process)
>>>>>>           One possibility could be you have many values associated to
>>>>>> a single key, which can cause these kind of issues based on the operation
>>>>>> you do in your reducer.
>>>>>>           Can you put some logs in your reducer and try to trace out
>>>>>> what is happening.
>>>>>>
>>>>>> Best,
>>>>>> Mahesh Balija,
>>>>>> Calsoft Labs.
>>>>>>
>>>>>>
>>>>>> On Fri, Jan 11, 2013 at 8:53 AM, yaotian <ya...@gmail.com> wrote:
>>>>>>
>>>>>>> I have 1 hadoop master which name node locates and 2 slave which
>>>>>>> datanode locate.
>>>>>>>
>>>>>>> If i choose a small data like 200M, it can be done.
>>>>>>>
>>>>>>> But if i run 30G data, Map is done. But the reduce report error. Any
>>>>>>> sugggestion?
>>>>>>>
>>>>>>>
>>>>>>> This is the information.
>>>>>>>
>>>>>>> *Black-listed TaskTrackers:* 1<http://23.20.27.135:9003/jobblacklistedtrackers.jsp?jobid=job_201301090834_0041>
>>>>>>> ------------------------------
>>>>>>> Kind % CompleteNum Tasks PendingRunningComplete KilledFailed/Killed
>>>>>>> Task Attempts<http://23.20.27.135:9003/jobfailures.jsp?jobid=job_201301090834_0041>
>>>>>>> map<http://23.20.27.135:9003/jobtasks.jsp?jobid=job_201301090834_0041&type=map&pagenum=1>
>>>>>>> 100.00%4500 0450<http://23.20.27.135:9003/jobtasks.jsp?jobid=job_201301090834_0041&type=map&pagenum=1&state=completed>
>>>>>>> 00 / 1<http://23.20.27.135:9003/jobfailures.jsp?jobid=job_201301090834_0041&kind=map&cause=killed>
>>>>>>> reduce<http://23.20.27.135:9003/jobtasks.jsp?jobid=job_201301090834_0041&type=reduce&pagenum=1>
>>>>>>> 100.00%1500 0 02<http://23.20.27.135:9003/jobtasks.jsp?jobid=job_201301090834_0041&type=reduce&pagenum=1&state=completed>
>>>>>>> 1498<http://23.20.27.135:9003/jobtasks.jsp?jobid=job_201301090834_0041&type=reduce&pagenum=1&state=killed>
>>>>>>> 12<http://23.20.27.135:9003/jobfailures.jsp?jobid=job_201301090834_0041&kind=reduce&cause=failed>
>>>>>>>  / 3<http://23.20.27.135:9003/jobfailures.jsp?jobid=job_201301090834_0041&kind=reduce&cause=killed>
>>>>>>>
>>>>>>>
>>>>>>> TaskCompleteStatusStart TimeFinish TimeErrorsCounters
>>>>>>> task_201301090834_0041_r_000001<http://23.20.27.135:9003/taskdetails.jsp?tipid=task_201301090834_0041_r_000001>
>>>>>>> 0.00%
>>>>>>> 10-Jan-2013 04:18:54
>>>>>>> 10-Jan-2013 06:46:38 (2hrs, 27mins, 44sec)
>>>>>>>
>>>>>>> Task attempt_201301090834_0041_r_000001_0 failed to report status for 600 seconds. Killing!
>>>>>>> Task attempt_201301090834_0041_r_000001_1 failed to report status for 602 seconds. Killing!
>>>>>>> Task attempt_201301090834_0041_r_000001_2 failed to report status for 602 seconds. Killing!
>>>>>>> Task attempt_201301090834_0041_r_000001_3 failed to report status for 602 seconds. Killing!
>>>>>>>
>>>>>>>
>>>>>>> 0<http://23.20.27.135:9003/taskstats.jsp?tipid=task_201301090834_0041_r_000001>
>>>>>>> task_201301090834_0041_r_000002<http://23.20.27.135:9003/taskdetails.jsp?tipid=task_201301090834_0041_r_000002>
>>>>>>> 0.00%
>>>>>>> 10-Jan-2013 04:18:54
>>>>>>> 10-Jan-2013 06:46:38 (2hrs, 27mins, 43sec)
>>>>>>>
>>>>>>> Task attempt_201301090834_0041_r_000002_0 failed to report status for 601 seconds. Killing!
>>>>>>> Task attempt_201301090834_0041_r_000002_1 failed to report status for 600 seconds. Killing!
>>>>>>>
>>>>>>>
>>>>>>> 0<http://23.20.27.135:9003/taskstats.jsp?tipid=task_201301090834_0041_r_000002>
>>>>>>> task_201301090834_0041_r_000003<http://23.20.27.135:9003/taskdetails.jsp?tipid=task_201301090834_0041_r_000003>
>>>>>>> 0.00%
>>>>>>> 10-Jan-2013 04:18:57
>>>>>>> 10-Jan-2013 06:46:38 (2hrs, 27mins, 41sec)
>>>>>>>
>>>>>>> Task attempt_201301090834_0041_r_000003_0 failed to report status for 602 seconds. Killing!
>>>>>>> Task attempt_201301090834_0041_r_000003_1 failed to report status for 602 seconds. Killing!
>>>>>>> Task attempt_201301090834_0041_r_000003_2 failed to report status for 602 seconds. Killing!
>>>>>>>
>>>>>>>
>>>>>>> 0<http://23.20.27.135:9003/taskstats.jsp?tipid=task_201301090834_0041_r_000003>
>>>>>>> task_201301090834_0041_r_000005<http://23.20.27.135:9003/taskdetails.jsp?tipid=task_201301090834_0041_r_000005>
>>>>>>> 0.00%
>>>>>>> 10-Jan-2013 06:11:07
>>>>>>> 10-Jan-2013 06:46:38 (35mins, 31sec)
>>>>>>>
>>>>>>> Task attempt_201301090834_0041_r_000005_0 failed to report status for 600 seconds. Killing!
>>>>>>>
>>>>>>>
>>>>>>> 0<http://23.20.27.135:9003/taskstats.jsp?tipid=task_201301090834_0041_r_000005>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Harsh J
>>>>
>>>
>>>
>>
>
>
>

Re: Re: I am running MapReduce on a 30G data on 1master/2 slave, but failed.

Posted by yaotian <ya...@gmail.com>.
Thanks, I will read that document. I am new to this, so I am attaching the log
here for help; I cannot make sense of it myself.

These logs are from the TaskTracker log file:

2013-01-15 08:00:44,731 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201301150318_0001_r_000000_0 0.66737837% reduce > reduce
2013-01-15 08:00:47,757 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201301150318_0001_r_000000_0 0.66737837% reduce > reduce
2013-01-15 08:10:49,083 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201301150318_0001_r_000000_0: Task
attempt_201301150318_0001_r_000000_0 failed to report status for 601
seconds. Killing!
2013-01-15 08:10:49,092 INFO org.apache.hadoop.mapred.TaskTracker: Process
Thread Dump: lost task
24 active threads
Thread 219 (process reaper):
  State: RUNNABLE
  Blocked count: 0
  Waited count: 0
  Stack:
    java.lang.UNIXProcess.waitForProcessExit(Native Method)
    java.lang.UNIXProcess.access$900(UNIXProcess.java:20)
    java.lang.UNIXProcess$1$1.run(UNIXProcess.java:132)
Thread 218 (JVM Runner jvm_201301150318_0001_r_-641939786 spawned.):
  State: WAITING
  Blocked count: 1
  Waited count: 2
  Waiting on java.lang.UNIXProcess@1195929
  Stack:
    java.lang.Object.wait(Native Method)
    java.lang.Object.wait(Object.java:485)
    java.lang.UNIXProcess.waitFor(UNIXProcess.java:165)
    org.apache.hadoop.util.Shell.runCommand(Shell.java:244)
    org.apache.hadoop.util.Shell.run(Shell.java:182)

org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:375)

org.apache.hadoop.mapred.DefaultTaskController.launchTask(DefaultTaskController.java:131)

org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.runChild(JvmManager.java:497)

org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.run(JvmManager.java:471)
Thread 214 (Thread-135):
  State: WAITING
  Blocked count: 2
  Waited count: 3
  Waiting on java.lang.Object@1f35647
  Stack:
    java.lang.Object.wait(Native Method)
    java.lang.Object.wait(Object.java:485)

org.apache.hadoop.mapred.TaskRunner.launchJvmAndWait(TaskRunner.java:296)
    org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:251)
Thread 38 (IPC Client (47) connection to master/10.120.253.32:9001 from
hadoop):
  State: TIMED_WAITING
  Blocked count: 11210
  Waited count: 11199
  Stack:
    java.lang.Object.wait(Native Method)
    org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:702)
    org.apache.hadoop.ipc.Client$Connection.run(Client.java:744)
Thread 10 (taskCleanup):
  State: WAITING
  Blocked count: 4
  Waited count: 7
  Waiting on
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@104109e
  Stack:
    sun.misc.Unsafe.park(Native Method)
    java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)

java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)

java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)
    org.apache.hadoop.mapred.TaskTracker$1.run(TaskTracker.java:422)
    java.lang.Thread.run(Thread.java:662)
Thread 13 (Thread-5):
  State: WAITING
  Blocked count: 475
  Waited count: 646
  Waiting on
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@61bb9b
  Stack:
    sun.misc.Unsafe.park(Native Method)
    java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)

java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)

java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)

org.apache.hadoop.mapreduce.server.tasktracker.userlogs.UserLogManager.monitor(UserLogManager.java:131)

org.apache.hadoop.mapreduce.server.tasktracker.userlogs.UserLogManager$1.run(UserLogManager.java:66)
Thread 14 (Thread-6):
  State: TIMED_WAITING
  Blocked count: 0
  Waited count: 5
  Stack:
    java.lang.Thread.sleep(Native Method)
    org.apache.hadoop.mapred.UserLogCleaner.run(UserLogCleaner.java:93)
Thread 37 (Timer-0):
  State: TIMED_WAITING
  Blocked count: 0
  Waited count: 618
  Stack:
    java.lang.Object.wait(Native Method)
    java.util.TimerThread.mainLoop(Timer.java:509)
    java.util.TimerThread.run(Timer.java:462)
Thread 36 (23978087@qtp-7433399-0 - Acceptor0
SelectChannelConnector@0.0.0.0:50060):
  State: RUNNABLE
  Blocked count: 29509
  Waited count: 1
  Stack:
    sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
    sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:210)
    sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
    sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
    sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)

org.mortbay.io.nio.SelectorManager$SelectSet.doSelect(SelectorManager.java:498)
    org.mortbay.io.nio.SelectorManager.doSelect(SelectorManager.java:192)

org.mortbay.jetty.nio.SelectChannelConnector.accept(SelectChannelConnector.java:124)

org.mortbay.jetty.AbstractConnector$Acceptor.run(AbstractConnector.java:708)

org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
Thread 35 (TaskLauncher for REDUCE tasks):
  State: WAITING
  Blocked count: 1
  Waited count: 2
  Waiting on java.util.LinkedList@ca1038
  Stack:
    java.lang.Object.wait(Native Method)
    java.lang.Object.wait(Object.java:485)

org.apache.hadoop.mapred.TaskTracker$TaskLauncher.run(TaskTracker.java:2265)
Thread 34 (TaskLauncher for MAP tasks):
  State: WAITING
  Blocked count: 170
  Waited count: 171
  Waiting on java.util.LinkedList@1ddec9e
  Stack:
    java.lang.Object.wait(Native Method)
    java.lang.Object.wait(Object.java:485)

org.apache.hadoop.mapred.TaskTracker$TaskLauncher.run(TaskTracker.java:2265)
Thread 27 (Map-events fetcher for all reduce tasks on
tracker_ip-10-87-145-204.ec2.internal:localhost/127.0.0.1:60497):
  State: TIMED_WAITING
  Blocked count: 16184
  Waited count: 17896
  Stack:
    java.lang.Object.wait(Native Method)

org.apache.hadoop.mapred.TaskTracker$MapEventsFetcherThread.run(TaskTracker.java:975)
Thread 25 (Thread-14):
  State: TIMED_WAITING
  Blocked count: 0
  Waited count: 293
  Stack:
    java.lang.Thread.sleep(Native Method)

org.apache.hadoop.filecache.TrackerDistributedCacheManager$CleanupThread.run(TrackerDistributedCacheManager.java:926)
Thread 24 (IPC Server handler 1 on 60497):
  State: WAITING
  Blocked count: 1
  Waited count: 10501
  Waiting on
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@fe99b6
  Stack:
    sun.misc.Unsafe.park(Native Method)
    java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)

java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)

java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)
    org.apache.hadoop.ipc.Server$Handler.run(Server.java:1364)
Thread 23 (IPC Server handler 0 on 60497):
  State: WAITING
  Blocked count: 1
  Waited count: 10502
  Waiting on
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@fe99b6
  Stack:
    sun.misc.Unsafe.park(Native Method)
    java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)

java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)

java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)
    org.apache.hadoop.ipc.Server$Handler.run(Server.java:1364)
Thread 20 (IPC Server listener on 60497):
  State: RUNNABLE
  Blocked count: 171
  Waited count: 0
  Stack:
    sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
    sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:210)
    sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
    sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
    sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
    sun.nio.ch.SelectorImpl.select(SelectorImpl.java:84)
    org.apache.hadoop.ipc.Server$Listener.run(Server.java:439)
Thread 22 (IPC Server Responder):
  State: RUNNABLE
  Blocked count: 0
  Waited count: 0
  Stack:
    sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
    sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:210)
    sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
    sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
    sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
    org.apache.hadoop.ipc.Server$Responder.run(Server.java:605)
Thread 21 (pool-1-thread-1):
  State: RUNNABLE
  Blocked count: 171
  Waited count: 171
  Stack:
    sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
    sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:210)
    sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
    sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
    sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
    sun.nio.ch.SelectorImpl.select(SelectorImpl.java:84)
    org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:333)

java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    java.lang.Thread.run(Thread.java:662)
Thread 15 (Directory/File cleanup thread):
  State: WAITING
  Blocked count: 0
  Waited count: 343
  Waiting on
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@94af2f
  Stack:
    sun.misc.Unsafe.park(Native Method)
    java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)

java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)

java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)

org.apache.hadoop.mapred.CleanupQueue$PathCleanupThread.run(CleanupQueue.java:130)
Thread 9 (Timer for 'TaskTracker' metrics system):
  State: TIMED_WAITING
  Blocked count: 0
  Waited count: 1796
  Stack:
    java.lang.Object.wait(Native Method)
    java.util.TimerThread.mainLoop(Timer.java:509)
    java.util.TimerThread.run(Timer.java:462)
Thread 4 (Signal Dispatcher):
  State: RUNNABLE
  Blocked count: 0
  Waited count: 0
  Stack:
Thread 3 (Finalizer):
  State: WAITING
  Blocked count: 186
  Waited count: 187
  Waiting on java.lang.ref.ReferenceQueue$Lock@19a0203
  Stack:
    java.lang.Object.wait(Native Method)
    java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118)
    java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:134)
    java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)
Thread 2 (Reference Handler):
  State: WAITING
  Blocked count: 186
  Waited count: 187
  Waiting on java.lang.ref.Reference$Lock@1fa39bb
  Stack:
    java.lang.Object.wait(Native Method)
    java.lang.Object.wait(Object.java:485)
    java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
Thread 1 (main):
  State: RUNNABLE
  Blocked count: 5850
  Waited count: 11663
  Stack:
    sun.management.ThreadImpl.getThreadInfo1(Native Method)
    sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:156)
    sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:121)

org.apache.hadoop.util.ReflectionUtils.printThreadInfo(ReflectionUtils.java:149)

org.apache.hadoop.util.ReflectionUtils.logThreadInfo(ReflectionUtils.java:203)

org.apache.hadoop.mapred.TaskTracker.markUnresponsiveTasks(TaskTracker.java:1970)
    org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:1652)
    org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:2434)
    org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:3675)

2013-01-15 08:10:49,093 INFO org.apache.hadoop.mapred.TaskTracker: About to
purge task: attempt_201301150318_0001_r_000000_0
2013-01-15 08:10:49,114 INFO org.apache.hadoop.util.ProcessTree: Killing
process group21719 with signal TERM. Exit code 0
2013-01-15 08:10:49,115 INFO org.apache.hadoop.mapred.TaskTracker:
addFreeSlot : current free slots : 1
2013-01-15 08:10:52,095 INFO org.apache.hadoop.mapred.TaskTracker:
LaunchTaskAction (registerTask): attempt_201301150318_0001_r_000000_0
task's state:FAILED_UNCLEAN
2013-01-15 08:10:52,096 INFO org.apache.hadoop.mapred.TaskTracker: Trying
to launch : attempt_201301150318_0001_r_000000_0 which needs 1 slots
2013-01-15 08:10:52,096 INFO org.apache.hadoop.mapred.TaskTracker: In
TaskLauncher, current free slots : 1 and trying to launch
attempt_201301150318_0001_r_000000_0 which needs 1 slots
2013-01-15 08:10:52,112 INFO org.apache.hadoop.mapred.JvmManager: In
JvmRunner constructed JVM ID: jvm_201301150318_0001_r_-56086075
2013-01-15 08:10:52,113 INFO org.apache.hadoop.mapred.JvmManager: JVM
Runner jvm_201301150318_0001_r_-56086075 spawned.
2013-01-15 08:10:52,122 INFO org.apache.hadoop.mapred.TaskController:
Writing commands to
/data/hadoop/mapred/local/ttprivate/taskTracker/hadoop/jobcache/job_201301150318_0001/attempt_201301150318_0001_r_000000_0.cleanup/taskjvm.sh
2013-01-15 08:10:52,780 WARN
org.apache.hadoop.mapred.DefaultTaskController: Exit code from task is : 143
2013-01-15 08:10:52,780 INFO
org.apache.hadoop.mapred.DefaultTaskController: Output from
DefaultTaskController's launchTask follows:
2013-01-15 08:10:52,780 INFO org.apache.hadoop.mapred.TaskController:
2013-01-15 08:10:52,780 INFO org.apache.hadoop.mapred.JvmManager: JVM :
jvm_201301150318_0001_r_-641939786 exited with exit code 143. Number of
tasks it ran: 0
2013-01-15 08:10:52,794 INFO org.apache.hadoop.io.nativeio.NativeIO: Got
UserName hadoop for UID 1006 from the native implementation
2013-01-15 08:10:53,177 INFO org.apache.hadoop.mapred.TaskTracker: JVM with
ID: jvm_201301150318_0001_r_-56086075 given task:
attempt_201301150318_0001_r_000000_0
2013-01-15 08:10:53,652 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201301150318_0001_r_000000_0 0.0%
2013-01-15 08:10:56,620 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201301150318_0001_r_000000_0 0.0% cleanup
2013-01-15 08:10:56,621 INFO org.apache.hadoop.mapred.TaskTracker: Task
attempt_201301150318_0001_r_000000_0 is done.
2013-01-15 08:10:56,621 INFO org.apache.hadoop.mapred.TaskTracker: reported
output size for attempt_201301150318_0001_r_000000_0  was -1
2013-01-15 08:10:56,621 INFO org.apache.hadoop.mapred.TaskTracker:
addFreeSlot : current free slots : 1
2013-01-15 08:10:56,679 INFO org.apache.hadoop.mapred.JvmManager: JVM :
jvm_201301150318_0001_r_-56086075 exited with exit code 0. Number of tasks
it ran: 1
2013-01-15 08:11:07,483 INFO
org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 10.87.145.204:50060,
dest: 10.123.73.45:42479, bytes: 6394091, op: MAPRED_SHUFFLE, cliID:
attempt_201301150318_0001_m_000000_0, duration: 111147314
2013-01-15 08:11:07,831 INFO
org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 10.87.145.204:50060,
dest: 10.123.73.45:42479, bytes: 6163890, op: MAPRED_SHUFFLE, cliID:
attempt_201301150318_0001_m_000001_0, duration: 126898055
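
For reference, the periodic status reporting suggested earlier in the thread would look
roughly like this on the reduce side. This is only a sketch against the new
org.apache.hadoop.mapreduce API; the class name, counter names and chunk size are made up:

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch only: concatenates GPS points per uid, flushing in chunks so the
// reducer neither buffers a whole group in memory nor stays silent long
// enough to hit the 600-second task timeout.
public class GpsByUidReducer extends Reducer<Text, Text, Text, Text> {
  private static final int CHUNK = 10000;   // illustrative value

  @Override
  protected void reduce(Text uid, Iterable<Text> points, Context context)
      throws IOException, InterruptedException {
    StringBuilder buf = new StringBuilder();
    long seen = 0;
    for (Text p : points) {
      buf.append(p.toString()).append(',');
      if (++seen % CHUNK == 0) {
        context.write(uid, new Text(buf.toString()));   // flush a partial result
        buf.setLength(0);
        // Any of these tells the TaskTracker the attempt is still alive:
        context.progress();
        context.setStatus(uid + ": " + seen + " points");
        context.getCounter("GpsJob", "GPS_POINTS").increment(CHUNK);
      }
    }
    if (buf.length() > 0) {
      context.write(uid, new Text(buf.toString()));
    }
  }
}

Writing the chunks out as they are produced also keeps a single very large uid from
blowing up the reducer's heap, which is the OOM risk mentioned elsewhere in the thread.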


2013/1/15 Charlie A. <ha...@163.com>

> Hi, yaotian
> I think you should check logs on the very tasktracker, it'll tell you why.
> And here's some tips on deploying a MR job.
> http://blog.cloudera.com/blog/2009/12/7-tips-for-improving-mapreduce-performance/
>
> Charlie
>
>
>
> At 2013-01-15 16:34:27,yaotian <ya...@gmail.com> wrote:
>
> I set mapred.reduce.tasks from -1 to "AutoReduce"
> And the hadoop created 450 tasks for Map. But 1 task for Reduce. It seems
> that this reduce only run on 1 slave (I have two slaves).
>
> But when it was running on 66%, the error report again "Task
> attempt_201301150318_0001_r_000000_0 failed to report status for 601
> seconds. Killing!"
>
>
>
> 2013/1/14 yaotian <ya...@gmail.com>
>
>> How to judge which counter would work?
>>
>>
>> 2013/1/11 <be...@gmail.com>
>>
>> **
>>> Hi
>>>
>>> To add on to Harsh's comments.
>>>
>>> You need not have to change the task time out.
>>>
>>> In your map/reduce code, you can increment a counter or report status
>>> intermediate on intervals so that there is communication from the task and
>>> hence won't have a task time out.
>>>
>>> Every map and reduce task run on its own jvm limited by a jvm size. If
>>> you try to holds too much data in memory then it can go beyond the jvm size
>>> and cause OOM errors.
>>>
>>> Regards
>>> Bejoy KS
>>>
>>> Sent from remote device, Please excuse typos
>>> ------------------------------
>>> *From: * yaotian <ya...@gmail.com>
>>> *Date: *Fri, 11 Jan 2013 14:35:07 +0800
>>> *To: *<us...@hadoop.apache.org>
>>> *ReplyTo: * user@hadoop.apache.org
>>> *Subject: *Re: I am running MapReduce on a 30G data on 1master/2 slave,
>>> but failed.
>>>
>>> See inline.
>>>
>>>
>>> 2013/1/11 Harsh J <ha...@cloudera.com>
>>>
>>>> If the per-record processing time is very high, you will need to
>>>> periodically report a status. Without a status change report from the task
>>>> to the tracker, it will be killed away as a dead task after a default
>>>> timeout of 10 minutes (600s).
>>>>
>>> =====================> Do you mean to increase the report time: "*
>>> mapred.task.timeout"*?
>>>
>>>
>>>> Also, beware of holding too much memory in a reduce JVM - you're still
>>>> limited there. Best to let the framework do the sort or secondary sort.
>>>>
>>> =======================>  You mean use the default value ? This is my
>>> value.
>>> *mapred.job.reduce.memory.mb*-1
>>>
>>>>
>>>>
>>>> On Fri, Jan 11, 2013 at 10:58 AM, yaotian <ya...@gmail.com> wrote:
>>>>
>>>>> Yes, you are right. The data is GPS trace related to corresponding
>>>>> uid. The reduce is doing this: Sort user to get this kind of result: uid,
>>>>> gps1, gps2, gps3........
>>>>> Yes, the gps data is big because this is 30G data.
>>>>>
>>>>> How to solve this?
>>>>>
>>>>>
>>>>>
>>>>> 2013/1/11 Mahesh Balija <ba...@gmail.com>
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>>           2 reducers are successfully completed and 1498 have been
>>>>>> killed. I assume that you have the data issues. (Either the data is huge or
>>>>>> some issues with the data you are trying to process)
>>>>>>           One possibility could be you have many values associated to
>>>>>> a single key, which can cause these kind of issues based on the operation
>>>>>> you do in your reducer.
>>>>>>           Can you put some logs in your reducer and try to trace out
>>>>>> what is happening.
>>>>>>
>>>>>> Best,
>>>>>> Mahesh Balija,
>>>>>> Calsoft Labs.
>>>>>>
>>>>>>
>>>>>> On Fri, Jan 11, 2013 at 8:53 AM, yaotian <ya...@gmail.com> wrote:
>>>>>>
>>>>>>> I have 1 hadoop master which name node locates and 2 slave which
>>>>>>> datanode locate.
>>>>>>>
>>>>>>> If i choose a small data like 200M, it can be done.
>>>>>>>
>>>>>>> But if i run 30G data, Map is done. But the reduce report error. Any
>>>>>>> sugggestion?
>>>>>>>
>>>>>>>
>>>>>>> This is the information.
>>>>>>>
>>>>>>> *Black-listed TaskTrackers:* 1<http://23.20.27.135:9003/jobblacklistedtrackers.jsp?jobid=job_201301090834_0041>
>>>>>>> ------------------------------
>>>>>>> Kind   | % Complete | Num Tasks | Pending | Running | Complete | Killed | Failed/Killed Task Attempts
>>>>>>> map    | 100.00%    | 450       | 0       | 0       | 450      | 0      | 0 / 1
>>>>>>> reduce | 100.00%    | 1500      | 0       | 0       | 2        | 1498   | 12 / 3
>>>>>>>
>>>>>>>
>>>>>>> TaskCompleteStatusStart TimeFinish TimeErrorsCounters
>>>>>>> task_201301090834_0041_r_000001<http://23.20.27.135:9003/taskdetails.jsp?tipid=task_201301090834_0041_r_000001>
>>>>>>> 0.00%
>>>>>>> 10-Jan-2013 04:18:54
>>>>>>> 10-Jan-2013 06:46:38 (2hrs, 27mins, 44sec)
>>>>>>>
>>>>>>> Task attempt_201301090834_0041_r_000001_0 failed to report status for 600 seconds. Killing!
>>>>>>> Task attempt_201301090834_0041_r_000001_1 failed to report status for 602 seconds. Killing!
>>>>>>> Task attempt_201301090834_0041_r_000001_2 failed to report status for 602 seconds. Killing!
>>>>>>> Task attempt_201301090834_0041_r_000001_3 failed to report status for 602 seconds. Killing!
>>>>>>>
>>>>>>>
>>>>>>> 0<http://23.20.27.135:9003/taskstats.jsp?tipid=task_201301090834_0041_r_000001>
>>>>>>> task_201301090834_0041_r_000002<http://23.20.27.135:9003/taskdetails.jsp?tipid=task_201301090834_0041_r_000002>
>>>>>>> 0.00%
>>>>>>> 10-Jan-2013 04:18:54
>>>>>>> 10-Jan-2013 06:46:38 (2hrs, 27mins, 43sec)
>>>>>>>
>>>>>>> Task attempt_201301090834_0041_r_000002_0 failed to report status for 601 seconds. Killing!
>>>>>>> Task attempt_201301090834_0041_r_000002_1 failed to report status for 600 seconds. Killing!
>>>>>>>
>>>>>>>
>>>>>>> 0<http://23.20.27.135:9003/taskstats.jsp?tipid=task_201301090834_0041_r_000002>
>>>>>>> task_201301090834_0041_r_000003<http://23.20.27.135:9003/taskdetails.jsp?tipid=task_201301090834_0041_r_000003>
>>>>>>> 0.00%
>>>>>>> 10-Jan-2013 04:18:57
>>>>>>> 10-Jan-2013 06:46:38 (2hrs, 27mins, 41sec)
>>>>>>>
>>>>>>> Task attempt_201301090834_0041_r_000003_0 failed to report status for 602 seconds. Killing!
>>>>>>> Task attempt_201301090834_0041_r_000003_1 failed to report status for 602 seconds. Killing!
>>>>>>> Task attempt_201301090834_0041_r_000003_2 failed to report status for 602 seconds. Killing!
>>>>>>>
>>>>>>>
>>>>>>> 0<http://23.20.27.135:9003/taskstats.jsp?tipid=task_201301090834_0041_r_000003>
>>>>>>> task_201301090834_0041_r_000005<http://23.20.27.135:9003/taskdetails.jsp?tipid=task_201301090834_0041_r_000005>
>>>>>>> 0.00%
>>>>>>> 10-Jan-2013 06:11:07
>>>>>>> 10-Jan-2013 06:46:38 (35mins, 31sec)
>>>>>>>
>>>>>>> Task attempt_201301090834_0041_r_000005_0 failed to report status for 600 seconds. Killing!
>>>>>>>
>>>>>>>
>>>>>>> 0<http://23.20.27.135:9003/taskstats.jsp?tipid=task_201301090834_0041_r_000005>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Harsh J
>>>>
>>>
>>>
>>
>
>
>

Re:Re: I am running MapReduce on a 30G data on 1master/2 slave, but failed.

Posted by "Charlie A." <ha...@163.com>.
Hi, yaotian
I think you should check the logs on that particular TaskTracker; they will tell you why.
And here are some tips on tuning an MR job: http://blog.cloudera.com/blog/2009/12/7-tips-for-improving-mapreduce-performance/
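
For what it's worth, the knobs discussed in this thread (reducer count, task timeout,
per-task heap) can all be set from the job driver. The sketch below is only illustrative:
the class names, the mapper and reducer it assumes, and every value are examples, and
raising the timeout should be a last resort compared to actually reporting progress from
the reducer.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Rough MRv1-era driver sketch; names, paths and values are examples only.
public class GpsJobDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setLong("mapred.task.timeout", 1800000L);     // 30 min instead of the 600 s default
    conf.set("mapred.child.java.opts", "-Xmx1024m");   // per-task JVM heap

    Job job = new Job(conf, "gps-by-uid");
    job.setJarByClass(GpsJobDriver.class);
    job.setMapperClass(GpsPointMapper.class);          // hypothetical mapper emitting (uid, gps point)
    job.setReducerClass(GpsByUidReducer.class);        // e.g. a reducer that reports progress
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    job.setNumReduceTasks(4);                          // spread the reduce work over both slaves

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}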


Charlie



At 2013-01-15 16:34:27,yaotian <ya...@gmail.com> wrote:

I set mapred.reduce.tasks from -1 to "AutoReduce"

And the hadoop created 450 tasks for Map. But 1 task for Reduce. It seems that this reduce only run on 1 slave (I have two slaves).


But when it was running on 66%, the error report again "Task attempt_201301150318_0001_r_000000_0 failed to report status for 601 seconds. Killing!"





2013/1/14 yaotian <ya...@gmail.com>

How to judge which counter would work? 



2013/1/11 <be...@gmail.com>


Hi

To add on to Harsh's comments.

You need not have to change the task time out.

In your map/reduce code, you can increment a counter or report status intermediate on intervals so that there is communication from the task and hence won't have a task time out.

Every map and reduce task run on its own jvm limited by a jvm size. If you try to holds too much data in memory then it can go beyond the jvm size and cause OOM errors.


Regards
Bejoy KS

Sent from remote device, Please excuse typos
From: yaotian <ya...@gmail.com>
Date: Fri, 11 Jan 2013 14:35:07 +0800
To: <us...@hadoop.apache.org>
ReplyTo: user@hadoop.apache.org
Subject: Re: I am running MapReduce on a 30G data on 1master/2 slave, but failed.


See inline.




2013/1/11 Harsh J <ha...@cloudera.com>

If the per-record processing time is very high, you will need to periodically report a status. Without a status change report from the task to the tracker, it will be killed away as a dead task after a default timeout of 10 minutes (600s).
=====================> Do you mean to increase the report time: "mapred.task.timeout"? 




Also, beware of holding too much memory in a reduce JVM - you're still limited there. Best to let the framework do the sort or secondary sort.
=======================>  You mean use the default value ? This is my value.
| mapred.job.reduce.memory.mb | -1 |



On Fri, Jan 11, 2013 at 10:58 AM, yaotian <ya...@gmail.com> wrote:

Yes, you are right. The data is GPS trace related to corresponding uid. The reduce is doing this: Sort user to get this kind of result: uid, gps1, gps2, gps3........
Yes, the gps data is big because this is 30G data.


How to solve this?





2013/1/11 Mahesh Balija <ba...@gmail.com>
Hi,

          2 reducers are successfully completed and 1498 have been killed. I assume that you have the data issues. (Either the data is huge or some issues with the data you are trying to process)
          One possibility could be you have many values associated to a single key, which can cause these kind of issues based on the operation you do in your reducer.
          Can you put some logs in your reducer and try to trace out what is happening.

Best,
Mahesh Balija,
Calsoft Labs.



On Fri, Jan 11, 2013 at 8:53 AM, yaotian <ya...@gmail.com> wrote:

I have 1 hadoop master which name node locates and 2 slave which datanode locate.


If i choose a small data like 200M, it can be done.


But if i run 30G data, Map is done. But the reduce report error. Any sugggestion?




This is the information.


Black-listed TaskTrackers: 1

| Kind   | % Complete | Num Tasks | Pending | Running | Complete | Killed | Failed/Killed Task Attempts |
| map    | 100.00%    | 450       | 0       | 0       | 450      | 0      | 0 / 1  |
| reduce | 100.00%    | 1500      | 0       | 0       | 2        | 1498   | 12 / 3 |


| Task                            | Complete | Start Time           | Finish Time                                | Counters |
| task_201301090834_0041_r_000001 | 0.00%    | 10-Jan-2013 04:18:54 | 10-Jan-2013 06:46:38 (2hrs, 27mins, 44sec) | 0 |
Task attempt_201301090834_0041_r_000001_0 failed to report status for 600 seconds. Killing!
Task attempt_201301090834_0041_r_000001_1 failed to report status for 602 seconds. Killing!
Task attempt_201301090834_0041_r_000001_2 failed to report status for 602 seconds. Killing!
Task attempt_201301090834_0041_r_000001_3 failed to report status for 602 seconds. Killing!

| task_201301090834_0041_r_000002 | 0.00%    | 10-Jan-2013 04:18:54 | 10-Jan-2013 06:46:38 (2hrs, 27mins, 43sec) | 0 |
Task attempt_201301090834_0041_r_000002_0 failed to report status for 601 seconds. Killing!
Task attempt_201301090834_0041_r_000002_1 failed to report status for 600 seconds. Killing!

| task_201301090834_0041_r_000003 | 0.00%    | 10-Jan-2013 04:18:57 | 10-Jan-2013 06:46:38 (2hrs, 27mins, 41sec) | 0 |
Task attempt_201301090834_0041_r_000003_0 failed to report status for 602 seconds. Killing!
Task attempt_201301090834_0041_r_000003_1 failed to report status for 602 seconds. Killing!
Task attempt_201301090834_0041_r_000003_2 failed to report status for 602 seconds. Killing!

| task_201301090834_0041_r_000005 | 0.00%    | 10-Jan-2013 06:11:07 | 10-Jan-2013 06:46:38 (35mins, 31sec)       | 0 |
Task attempt_201301090834_0041_r_000005_0 failed to report status for 600 seconds. Killing!









--
Harsh J






Re:Re: I am running MapReduce on a 30G data on 1master/2 slave, but failed.

Posted by "Charlie A." <ha...@163.com>.
Hi, yaotian
I think you should check logs on the very tasktracker, it'll tell you why.
And here's some tips on deploying a MR job. http://blog.cloudera.com/blog/2009/12/7-tips-for-improving-mapreduce-performance/


Charlie



At 2013-01-15 16:34:27,yaotian <ya...@gmail.com> wrote:

I set mapred.reduce.tasks from -1 to "AutoReduce"

And the hadoop created 450 tasks for Map. But 1 task for Reduce. It seems that this reduce only run on 1 slave (I have two slaves).


But when it was running on 66%, the error report again "Task attempt_201301150318_0001_r_000000_0 failed to report status for 601 seconds. Killing!"





2013/1/14 yaotian <ya...@gmail.com>

How to judge which counter would work? 



2013/1/11 <be...@gmail.com>


Hi

To add on to Harsh's comments.

You need not have to change the task time out.

In your map/reduce code, you can increment a counter or report status intermediate on intervals so that there is communication from the task and hence won't have a task time out.

Every map and reduce task run on its own jvm limited by a jvm size. If you try to holds too much data in memory then it can go beyond the jvm size and cause OOM errors.


Regards
Bejoy KS

Sent from remote device, Please excuse typos
From: yaotian <ya...@gmail.com>
Date: Fri, 11 Jan 2013 14:35:07 +0800
To: <us...@hadoop.apache.org>
ReplyTo: user@hadoop.apache.org
Subject: Re: I am running MapReduce on a 30G data on 1master/2 slave, but failed.


See inline.




2013/1/11 Harsh J <ha...@cloudera.com>

If the per-record processing time is very high, you will need to periodically report a status. Without a status change report from the task to the tracker, it will be killed away as a dead task after a default timeout of 10 minutes (600s).
=====================> Do you mean to increase the report time: "mapred.task.timeout"? 
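(Raising the timeout is possible, though reporting progress as suggested above is the cleaner fix. If you did want to raise it anyway, a sketch of where the setting would go; the value is illustrative, the property takes milliseconds, and 0 disables the timeout:)

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class JobWithLongerTimeout {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Milliseconds; the default is 600000 (10 minutes), 0 disables the timeout.
        conf.setLong("mapred.task.timeout", 30 * 60 * 1000L);
        Job job = new Job(conf, "gps-sort");
        // ... the usual setMapperClass/setReducerClass/input/output setup ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

(If the driver goes through ToolRunner, the same setting can be passed as -Dmapred.task.timeout=1800000 on the command line.)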




Also, beware of holding too much memory in a reduce JVM - you're still limited there. Best to let the framework do the sort or secondary sort.
=======================> You mean use the default value? This is my current setting:
| mapred.job.reduce.memory.mb | -1 |



On Fri, Jan 11, 2013 at 10:58 AM, yaotian <ya...@gmail.com> wrote:

Yes, you are right. The data is GPS traces keyed by the corresponding uid. The reduce does this: sort per user to produce results of the form uid, gps1, gps2, gps3, ...
Yes, the GPS data is big, because the whole input is 30 GB.


How to solve this?
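(One direction, sketched very roughly and only as an illustration: stream each uid's points out in chunks instead of concatenating everything in memory, and report progress while doing it. The Text types, class name and the 10,000-point chunk size below are assumptions, not the original code.)

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class GpsConcatReducer extends Reducer<Text, Text, Text, Text> {
      @Override
      protected void reduce(Text uid, Iterable<Text> gpsPoints, Context context)
          throws IOException, InterruptedException {
        StringBuilder buf = new StringBuilder();
        long n = 0;
        for (Text p : gpsPoints) {
          buf.append(p.toString()).append(',');
          if (++n % 10000 == 0) {
            // Flush a chunk rather than growing one giant string for a heavy uid,
            // and let the TaskTracker know the task is still alive.
            context.write(uid, new Text(buf.toString()));
            buf.setLength(0);
            context.progress();
          }
        }
        if (buf.length() > 0) {
          context.write(uid, new Text(buf.toString()));
        }
      }
    }

(Note this changes the output from one line per uid to possibly several lines for a heavy uid, so it is only one option; letting the framework order the points with a secondary sort, as suggested above, is another.)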





2013/1/11 Mahesh Balija <ba...@gmail.com>
Hi,

          2 reducers completed successfully and 1498 were killed. I assume you have data issues (either the data is huge, or there is some problem with the data you are trying to process).
          One possibility is that you have many values associated with a single key, which can cause this kind of issue depending on the operation you perform in your reducer.
          Can you put some logging in your reducer and try to trace what is happening?

Best,
Mahesh Balija,
Calsoft Labs.



On Fri, Jan 11, 2013 at 8:53 AM, yaotian <ya...@gmail.com> wrote:

I have 1 Hadoop master, where the NameNode runs, and 2 slaves, where the DataNodes run.


If I choose a small data set, like 200 MB, the job completes.


But if I run 30 GB of data, the map phase finishes, but the reduce phase reports errors. Any suggestion?




This is the information.


Black-listed TaskTrackers: 1

| Kind   | % Complete | Num Tasks | Pending | Running | Complete | Killed | Failed/Killed Task Attempts |
| map    | 100.00%    | 450       | 0       | 0       | 450      | 0      | 0 / 1                       |
| reduce | 100.00%    | 1500      | 0       | 0       | 2        | 1498   | 12 / 3                      |

Task task_201301090834_0041_r_000001 (0.00% complete, Counters: 0)
  Start: 10-Jan-2013 04:18:54   Finish: 10-Jan-2013 06:46:38 (2hrs, 27mins, 44sec)
  Errors:
    Task attempt_201301090834_0041_r_000001_0 failed to report status for 600 seconds. Killing!
    Task attempt_201301090834_0041_r_000001_1 failed to report status for 602 seconds. Killing!
    Task attempt_201301090834_0041_r_000001_2 failed to report status for 602 seconds. Killing!
    Task attempt_201301090834_0041_r_000001_3 failed to report status for 602 seconds. Killing!

Task task_201301090834_0041_r_000002 (0.00% complete, Counters: 0)
  Start: 10-Jan-2013 04:18:54   Finish: 10-Jan-2013 06:46:38 (2hrs, 27mins, 43sec)
  Errors:
    Task attempt_201301090834_0041_r_000002_0 failed to report status for 601 seconds. Killing!
    Task attempt_201301090834_0041_r_000002_1 failed to report status for 600 seconds. Killing!

Task task_201301090834_0041_r_000003 (0.00% complete, Counters: 0)
  Start: 10-Jan-2013 04:18:57   Finish: 10-Jan-2013 06:46:38 (2hrs, 27mins, 41sec)
  Errors:
    Task attempt_201301090834_0041_r_000003_0 failed to report status for 602 seconds. Killing!
    Task attempt_201301090834_0041_r_000003_1 failed to report status for 602 seconds. Killing!
    Task attempt_201301090834_0041_r_000003_2 failed to report status for 602 seconds. Killing!

Task task_201301090834_0041_r_000005 (0.00% complete, Counters: 0)
  Start: 10-Jan-2013 06:11:07   Finish: 10-Jan-2013 06:46:38 (35mins, 31sec)
  Errors:
    Task attempt_201301090834_0041_r_000005_0 failed to report status for 600 seconds. Killing!









--
Harsh J






Re: I am running MapReduce on a 30G data on 1master/2 slave, but failed.

Posted by yaotian <ya...@gmail.com>.
I set mapred.reduce.tasks from -1 to "AutoReduce".
Hadoop then created 450 map tasks but only 1 reduce task. It seems that this
one reduce runs on only 1 slave (I have two slaves).

But when it reached 66%, the error was reported again: "Task
attempt_201301150318_0001_r_000000_0 failed to report status for 601
seconds. Killing!"
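
(For what it's worth, the reduce count can also be fixed explicitly in the driver instead of relying on the -1/auto behaviour. A sketch; the class name is made up and 4 is an arbitrary illustrative number, roughly reduce slots per node times nodes:)

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class FixedReduceCountDriver {
      public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "gps-sort");
        // With more than one reducer, the reduce work can spread across both slaves.
        job.setNumReduceTasks(4);
        // ... mapper/reducer/input/output setup as before ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

(With ToolRunner, the same can be passed as -Dmapred.reduce.tasks=4 on the command line.)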



2013/1/14 yaotian <ya...@gmail.com>

> How to judge which counter would work?
>
>
> 2013/1/11 <be...@gmail.com>
>
> **
>> Hi
>>
>> To add on to Harsh's comments.
>>
>> You need not have to change the task time out.
>>
>> In your map/reduce code, you can increment a counter or report status
>> intermediate on intervals so that there is communication from the task and
>> hence won't have a task time out.
>>
>> Every map and reduce task run on its own jvm limited by a jvm size. If
>> you try to holds too much data in memory then it can go beyond the jvm size
>> and cause OOM errors.
>>
>> Regards
>> Bejoy KS
>>
>> Sent from remote device, Please excuse typos
>> ------------------------------
>> *From: * yaotian <ya...@gmail.com>
>> *Date: *Fri, 11 Jan 2013 14:35:07 +0800
>> *To: *<us...@hadoop.apache.org>
>> *ReplyTo: * user@hadoop.apache.org
>> *Subject: *Re: I am running MapReduce on a 30G data on 1master/2 slave,
>> but failed.
>>
>> See inline.
>>
>>
>> 2013/1/11 Harsh J <ha...@cloudera.com>
>>
>>> If the per-record processing time is very high, you will need to
>>> periodically report a status. Without a status change report from the task
>>> to the tracker, it will be killed away as a dead task after a default
>>> timeout of 10 minutes (600s).
>>>
>> =====================> Do you mean to increase the report time: "*
>> mapred.task.timeout"*?
>>
>>
>>> Also, beware of holding too much memory in a reduce JVM - you're still
>>> limited there. Best to let the framework do the sort or secondary sort.
>>>
>> =======================>  You mean use the default value ? This is my
>> value.
>> mapred.job.reduce.memory.mb = -1
>>
>>>
>>>
>>> On Fri, Jan 11, 2013 at 10:58 AM, yaotian <ya...@gmail.com> wrote:
>>>
>>>> Yes, you are right. The data is GPS trace related to corresponding uid.
>>>> The reduce is doing this: Sort user to get this kind of result: uid, gps1,
>>>> gps2, gps3........
>>>> Yes, the gps data is big because this is 30G data.
>>>>
>>>> How to solve this?
>>>>
>>>>
>>>>
>>>> 2013/1/11 Mahesh Balija <ba...@gmail.com>
>>>>
>>>>> Hi,
>>>>>
>>>>>           2 reducers are successfully completed and 1498 have been
>>>>> killed. I assume that you have the data issues. (Either the data is huge or
>>>>> some issues with the data you are trying to process)
>>>>>           One possibility could be you have many values associated to
>>>>> a single key, which can cause these kind of issues based on the operation
>>>>> you do in your reducer.
>>>>>           Can you put some logs in your reducer and try to trace out
>>>>> what is happening.
>>>>>
>>>>> Best,
>>>>> Mahesh Balija,
>>>>> Calsoft Labs.
>>>>>
>>>>>
>>>>> On Fri, Jan 11, 2013 at 8:53 AM, yaotian <ya...@gmail.com> wrote:
>>>>>
>>>>>> I have 1 hadoop master which name node locates and 2 slave which
>>>>>> datanode locate.
>>>>>>
>>>>>> If i choose a small data like 200M, it can be done.
>>>>>>
>>>>>> But if i run 30G data, Map is done. But the reduce report error. Any
>>>>>> sugggestion?
>>>>>>
>>>>>>
>>>>>> This is the information.
>>>>>>
>>>>>> *Black-listed TaskTrackers:* 1
>>>>>>
>>>>>> | Kind   | % Complete | Num Tasks | Pending | Running | Complete | Killed | Failed/Killed Task Attempts |
>>>>>> | map    | 100.00%    | 450       | 0       | 0       | 450      | 0      | 0 / 1                       |
>>>>>> | reduce | 100.00%    | 1500      | 0       | 0       | 2        | 1498   | 12 / 3                      |
>>>>>>
>>>>>> task_201301090834_0041_r_000001  0.00%  10-Jan-2013 04:18:54 to 06:46:38 (2hrs, 27mins, 44sec), Counters: 0
>>>>>>   attempts _0 to _3 each failed to report status for 600-602 seconds. Killing!
>>>>>> task_201301090834_0041_r_000002  0.00%  10-Jan-2013 04:18:54 to 06:46:38 (2hrs, 27mins, 43sec), Counters: 0
>>>>>>   attempts _0 and _1 failed to report status for 601/600 seconds. Killing!
>>>>>> task_201301090834_0041_r_000003  0.00%  10-Jan-2013 04:18:57 to 06:46:38 (2hrs, 27mins, 41sec), Counters: 0
>>>>>>   attempts _0 to _2 each failed to report status for 602 seconds. Killing!
>>>>>> task_201301090834_0041_r_000005  0.00%  10-Jan-2013 06:11:07 to 06:46:38 (35mins, 31sec), Counters: 0
>>>>>>   attempt _0 failed to report status for 600 seconds. Killing!
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> Harsh J
>>>
>>
>>
>


Re: I am running MapReduce on a 30G data on 1master/2 slave, but failed.

Posted by yaotian <ya...@gmail.com>.
How do I judge which counter would work?
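
(As far as I understand, any user-defined counter works for this purpose, because every increment is reported back to the TaskTracker along with the task status, which is exactly the communication Bejoy describes below. A minimal sketch; the enum name and Text types are only illustrative:)

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class CounterReducer extends Reducer<Text, Text, Text, Text> {
      public enum Progress { VALUES_SEEN }

      @Override
      protected void reduce(Text key, Iterable<Text> values, Context context)
          throws IOException, InterruptedException {
        for (Text v : values) {
          // ... per-value work ...
          context.getCounter(Progress.VALUES_SEEN).increment(1);
        }
      }
    }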


2013/1/11 <be...@gmail.com>

> **
> Hi
>
> To add on to Harsh's comments.
>
> You need not have to change the task time out.
>
> In your map/reduce code, you can increment a counter or report status
> intermediate on intervals so that there is communication from the task and
> hence won't have a task time out.
>
> Every map and reduce task run on its own jvm limited by a jvm size. If you
> try to holds too much data in memory then it can go beyond the jvm size and
> cause OOM errors.
>
> Regards
> Bejoy KS
>
> Sent from remote device, Please excuse typos
> ------------------------------
> *From: * yaotian <ya...@gmail.com>
> *Date: *Fri, 11 Jan 2013 14:35:07 +0800
> *To: *<us...@hadoop.apache.org>
> *ReplyTo: * user@hadoop.apache.org
> *Subject: *Re: I am running MapReduce on a 30G data on 1master/2 slave,
> but failed.
>
> See inline.
>
>
> 2013/1/11 Harsh J <ha...@cloudera.com>
>
>> If the per-record processing time is very high, you will need to
>> periodically report a status. Without a status change report from the task
>> to the tracker, it will be killed away as a dead task after a default
>> timeout of 10 minutes (600s).
>>
> =====================> Do you mean to increase the report time: "*
> mapred.task.timeout"*?
>
>
>> Also, beware of holding too much memory in a reduce JVM - you're still
>> limited there. Best to let the framework do the sort or secondary sort.
>>
> =======================>  You mean use the default value ? This is my
> value.
> mapred.job.reduce.memory.mb = -1
>
>>
>>
>> On Fri, Jan 11, 2013 at 10:58 AM, yaotian <ya...@gmail.com> wrote:
>>
>>> Yes, you are right. The data is GPS trace related to corresponding uid.
>>> The reduce is doing this: Sort user to get this kind of result: uid, gps1,
>>> gps2, gps3........
>>> Yes, the gps data is big because this is 30G data.
>>>
>>> How to solve this?
>>>
>>>
>>>
>>> 2013/1/11 Mahesh Balija <ba...@gmail.com>
>>>
>>>> Hi,
>>>>
>>>>           2 reducers are successfully completed and 1498 have been
>>>> killed. I assume that you have the data issues. (Either the data is huge or
>>>> some issues with the data you are trying to process)
>>>>           One possibility could be you have many values associated to a
>>>> single key, which can cause these kind of issues based on the operation you
>>>> do in your reducer.
>>>>           Can you put some logs in your reducer and try to trace out
>>>> what is happening.
>>>>
>>>> Best,
>>>> Mahesh Balija,
>>>> Calsoft Labs.
>>>>
>>>>
>>>> On Fri, Jan 11, 2013 at 8:53 AM, yaotian <ya...@gmail.com> wrote:
>>>>
>>>>> I have 1 hadoop master which name node locates and 2 slave which
>>>>> datanode locate.
>>>>>
>>>>> If i choose a small data like 200M, it can be done.
>>>>>
>>>>> But if i run 30G data, Map is done. But the reduce report error. Any
>>>>> sugggestion?
>>>>>
>>>>>
>>>>> This is the information.
>>>>>
>>>>> *Black-listed TaskTrackers:* 1
>>>>>
>>>>> | Kind   | % Complete | Num Tasks | Pending | Running | Complete | Killed | Failed/Killed Task Attempts |
>>>>> | map    | 100.00%    | 450       | 0       | 0       | 450      | 0      | 0 / 1                       |
>>>>> | reduce | 100.00%    | 1500      | 0       | 0       | 2        | 1498   | 12 / 3                      |
>>>>>
>>>>> task_201301090834_0041_r_000001  0.00%  10-Jan-2013 04:18:54 to 06:46:38 (2hrs, 27mins, 44sec), Counters: 0
>>>>>   attempts _0 to _3 each failed to report status for 600-602 seconds. Killing!
>>>>> task_201301090834_0041_r_000002  0.00%  10-Jan-2013 04:18:54 to 06:46:38 (2hrs, 27mins, 43sec), Counters: 0
>>>>>   attempts _0 and _1 failed to report status for 601/600 seconds. Killing!
>>>>> task_201301090834_0041_r_000003  0.00%  10-Jan-2013 04:18:57 to 06:46:38 (2hrs, 27mins, 41sec), Counters: 0
>>>>>   attempts _0 to _2 each failed to report status for 602 seconds. Killing!
>>>>> task_201301090834_0041_r_000005  0.00%  10-Jan-2013 06:11:07 to 06:46:38 (35mins, 31sec), Counters: 0
>>>>>   attempt _0 failed to report status for 600 seconds. Killing!
>>>>>
>>>>
>>>>
>>>
>>
>>
>> --
>> Harsh J
>>
>
>


Re: I am running MapReduce on a 30G data on 1master/2 slave, but failed.

Posted by yaotian <ya...@gmail.com>.
How do I judge which counter would work?
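
Presumably any counter incremented from inside the reduce loop would do, since counter updates (like calls to context.progress()) go back to the TaskTracker with the task's periodic status reports. A rough, untested sketch; the counter group/name, the key/value types and the batch size of 10,000 are only placeholders:

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch only: a counter increment (or context.progress()) inside the loop
// keeps the task reporting liveness while it works through a huge value list.
public class GpsByUidReducer extends Reducer<Text, Text, Text, Text> {

  @Override
  protected void reduce(Text uid, Iterable<Text> gpsPoints, Context context)
      throws IOException, InterruptedException {
    long seen = 0;
    for (Text gps : gpsPoints) {
      context.write(uid, gps);                 // emit as we go, nothing buffered
      if (++seen % 10000 == 0) {
        context.getCounter("ReduceProgress", "gpsPointsProcessed").increment(10000);
        context.progress();                    // explicit liveness signal as well
      }
    }
  }
}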


2013/1/11 <be...@gmail.com>

> **
> Hi
>
> To add on to Harsh's comments.
>
> You need not have to change the task time out.
>
> In your map/reduce code, you can increment a counter or report status
> intermediate on intervals so that there is communication from the task and
> hence won't have a task time out.
>
> Every map and reduce task run on its own jvm limited by a jvm size. If you
> try to holds too much data in memory then it can go beyond the jvm size and
> cause OOM errors.
>
> Regards
> Bejoy KS
>
> Sent from remote device, Please excuse typos
> ------------------------------
> *From: * yaotian <ya...@gmail.com>
> *Date: *Fri, 11 Jan 2013 14:35:07 +0800
> *To: *<us...@hadoop.apache.org>
> *ReplyTo: * user@hadoop.apache.org
> *Subject: *Re: I am running MapReduce on a 30G data on 1master/2 slave,
> but failed.
>
> See inline.
>
>
> 2013/1/11 Harsh J <ha...@cloudera.com>
>
>> If the per-record processing time is very high, you will need to
>> periodically report a status. Without a status change report from the task
>> to the tracker, it will be killed away as a dead task after a default
>> timeout of 10 minutes (600s).
>>
> =====================> Do you mean to increase the report time: "*
> mapred.task.timeout"*?
>
>
>> Also, beware of holding too much memory in a reduce JVM - you're still
>> limited there. Best to let the framework do the sort or secondary sort.
>>
> =======================>  You mean use the default value ? This is my
> value.
> *mapred.job.reduce.memory.mb*-1
>
>>
>>
>> On Fri, Jan 11, 2013 at 10:58 AM, yaotian <ya...@gmail.com> wrote:
>>
>>> Yes, you are right. The data is GPS trace related to corresponding uid.
>>> The reduce is doing this: Sort user to get this kind of result: uid, gps1,
>>> gps2, gps3........
>>> Yes, the gps data is big because this is 30G data.
>>>
>>> How to solve this?
>>>
>>>
>>>
>>> 2013/1/11 Mahesh Balija <ba...@gmail.com>
>>>
>>>> Hi,
>>>>
>>>>           2 reducers are successfully completed and 1498 have been
>>>> killed. I assume that you have the data issues. (Either the data is huge or
>>>> some issues with the data you are trying to process)
>>>>           One possibility could be you have many values associated to a
>>>> single key, which can cause these kind of issues based on the operation you
>>>> do in your reducer.
>>>>           Can you put some logs in your reducer and try to trace out
>>>> what is happening.
>>>>
>>>> Best,
>>>> Mahesh Balija,
>>>> Calsoft Labs.
>>>>
>>>>
>>>> On Fri, Jan 11, 2013 at 8:53 AM, yaotian <ya...@gmail.com> wrote:
>>>>
>>>>> I have 1 hadoop master which name node locates and 2 slave which
>>>>> datanode locate.
>>>>>
>>>>> If i choose a small data like 200M, it can be done.
>>>>>
>>>>> But if i run 30G data, Map is done. But the reduce report error. Any
>>>>> sugggestion?
>>>>>
>>>>>
>>>>> This is the information.
>>>>>
>>>>> *Black-listed TaskTrackers:* 1<http://23.20.27.135:9003/jobblacklistedtrackers.jsp?jobid=job_201301090834_0041>
>>>>> ------------------------------
>>>>> Kind % CompleteNum Tasks PendingRunningComplete KilledFailed/Killed
>>>>> Task Attempts<http://23.20.27.135:9003/jobfailures.jsp?jobid=job_201301090834_0041>
>>>>> map<http://23.20.27.135:9003/jobtasks.jsp?jobid=job_201301090834_0041&type=map&pagenum=1>
>>>>> 100.00%4500 0450<http://23.20.27.135:9003/jobtasks.jsp?jobid=job_201301090834_0041&type=map&pagenum=1&state=completed>
>>>>> 00 / 1<http://23.20.27.135:9003/jobfailures.jsp?jobid=job_201301090834_0041&kind=map&cause=killed>
>>>>> reduce<http://23.20.27.135:9003/jobtasks.jsp?jobid=job_201301090834_0041&type=reduce&pagenum=1>
>>>>> 100.00%1500 0 02<http://23.20.27.135:9003/jobtasks.jsp?jobid=job_201301090834_0041&type=reduce&pagenum=1&state=completed>
>>>>> 1498<http://23.20.27.135:9003/jobtasks.jsp?jobid=job_201301090834_0041&type=reduce&pagenum=1&state=killed>
>>>>> 12<http://23.20.27.135:9003/jobfailures.jsp?jobid=job_201301090834_0041&kind=reduce&cause=failed>
>>>>>  / 3<http://23.20.27.135:9003/jobfailures.jsp?jobid=job_201301090834_0041&kind=reduce&cause=killed>
>>>>>
>>>>>
>>>>> TaskCompleteStatusStart TimeFinish TimeErrorsCounters
>>>>> task_201301090834_0041_r_000001<http://23.20.27.135:9003/taskdetails.jsp?tipid=task_201301090834_0041_r_000001>
>>>>> 0.00%
>>>>> 10-Jan-2013 04:18:54
>>>>> 10-Jan-2013 06:46:38 (2hrs, 27mins, 44sec)
>>>>>
>>>>> Task attempt_201301090834_0041_r_000001_0 failed to report status for 600 seconds. Killing!
>>>>> Task attempt_201301090834_0041_r_000001_1 failed to report status for 602 seconds. Killing!
>>>>> Task attempt_201301090834_0041_r_000001_2 failed to report status for 602 seconds. Killing!
>>>>> Task attempt_201301090834_0041_r_000001_3 failed to report status for 602 seconds. Killing!
>>>>>
>>>>>
>>>>> 0<http://23.20.27.135:9003/taskstats.jsp?tipid=task_201301090834_0041_r_000001>
>>>>> task_201301090834_0041_r_000002<http://23.20.27.135:9003/taskdetails.jsp?tipid=task_201301090834_0041_r_000002>
>>>>> 0.00%
>>>>> 10-Jan-2013 04:18:54
>>>>> 10-Jan-2013 06:46:38 (2hrs, 27mins, 43sec)
>>>>>
>>>>> Task attempt_201301090834_0041_r_000002_0 failed to report status for 601 seconds. Killing!
>>>>> Task attempt_201301090834_0041_r_000002_1 failed to report status for 600 seconds. Killing!
>>>>>
>>>>>
>>>>> 0<http://23.20.27.135:9003/taskstats.jsp?tipid=task_201301090834_0041_r_000002>
>>>>> task_201301090834_0041_r_000003<http://23.20.27.135:9003/taskdetails.jsp?tipid=task_201301090834_0041_r_000003>
>>>>> 0.00%
>>>>> 10-Jan-2013 04:18:57
>>>>> 10-Jan-2013 06:46:38 (2hrs, 27mins, 41sec)
>>>>>
>>>>> Task attempt_201301090834_0041_r_000003_0 failed to report status for 602 seconds. Killing!
>>>>> Task attempt_201301090834_0041_r_000003_1 failed to report status for 602 seconds. Killing!
>>>>> Task attempt_201301090834_0041_r_000003_2 failed to report status for 602 seconds. Killing!
>>>>>
>>>>>
>>>>> 0<http://23.20.27.135:9003/taskstats.jsp?tipid=task_201301090834_0041_r_000003>
>>>>> task_201301090834_0041_r_000005<http://23.20.27.135:9003/taskdetails.jsp?tipid=task_201301090834_0041_r_000005>
>>>>> 0.00%
>>>>> 10-Jan-2013 06:11:07
>>>>> 10-Jan-2013 06:46:38 (35mins, 31sec)
>>>>>
>>>>> Task attempt_201301090834_0041_r_000005_0 failed to report status for 600 seconds. Killing!
>>>>>
>>>>>
>>>>> 0<http://23.20.27.135:9003/taskstats.jsp?tipid=task_201301090834_0041_r_000005>
>>>>>
>>>>
>>>>
>>>
>>
>>
>> --
>> Harsh J
>>
>
>

Re: I am running MapReduce on a 30G data on 1master/2 slave, but failed.

Posted by be...@gmail.com.
Hi

To add on to Harsh's comments.

You do not need to change the task timeout.

In your map/reduce code, you can increment a counter or report an intermediate status at regular intervals, so that there is communication from the task and hence no task timeout.

Every map and reduce task runs in its own JVM with a limited heap size. If you try to hold too much data in memory, it can go beyond that limit and cause OOM errors.
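
For reference, in Hadoop 1.x that per-task limit comes from the child JVM options set in the job configuration; a minimal driver sketch below shows where it is configured (the 512 MB figure and all names here are purely illustrative, not a recommendation from this thread):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Sketch of a driver: mapred.child.java.opts bounds the heap of every
// map/reduce child JVM, which is the limit OOM errors run into.
public class GpsJobDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("mapred.child.java.opts", "-Xmx512m");   // example value only

    Job job = new Job(conf, "gps-by-uid");
    job.setJarByClass(GpsJobDriver.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    // mapper/reducer classes omitted from this sketch
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}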


Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-----Original Message-----
From: yaotian <ya...@gmail.com>
Date: Fri, 11 Jan 2013 14:35:07 
To: <us...@hadoop.apache.org>
Reply-To: user@hadoop.apache.org
Subject: Re: I am running MapReduce on a 30G data on 1master/2 slave, but failed.

See inline.


2013/1/11 Harsh J <ha...@cloudera.com>

> If the per-record processing time is very high, you will need to
> periodically report a status. Without a status change report from the task
> to the tracker, it will be killed away as a dead task after a default
> timeout of 10 minutes (600s).
>
=====================> Do you mean to increase the report time: "*
mapred.task.timeout"*?


> Also, beware of holding too much memory in a reduce JVM - you're still
> limited there. Best to let the framework do the sort or secondary sort.
>
=======================>  You mean use the default value ? This is my value.
*mapred.job.reduce.memory.mb*-1

>
>
> On Fri, Jan 11, 2013 at 10:58 AM, yaotian <ya...@gmail.com> wrote:
>
>> Yes, you are right. The data is GPS trace related to corresponding uid.
>> The reduce is doing this: Sort user to get this kind of result: uid, gps1,
>> gps2, gps3........
>> Yes, the gps data is big because this is 30G data.
>>
>> How to solve this?
>>
>>
>>
>> 2013/1/11 Mahesh Balija <ba...@gmail.com>
>>
>>> Hi,
>>>
>>>           2 reducers are successfully completed and 1498 have been
>>> killed. I assume that you have the data issues. (Either the data is huge or
>>> some issues with the data you are trying to process)
>>>           One possibility could be you have many values associated to a
>>> single key, which can cause these kind of issues based on the operation you
>>> do in your reducer.
>>>           Can you put some logs in your reducer and try to trace out
>>> what is happening.
>>>
>>> Best,
>>> Mahesh Balija,
>>> Calsoft Labs.
>>>
>>>
>>> On Fri, Jan 11, 2013 at 8:53 AM, yaotian <ya...@gmail.com> wrote:
>>>
>>>> I have 1 hadoop master which name node locates and 2 slave which
>>>> datanode locate.
>>>>
>>>> If i choose a small data like 200M, it can be done.
>>>>
>>>> But if i run 30G data, Map is done. But the reduce report error. Any
>>>> sugggestion?
>>>>
>>>>
>>>> This is the information.
>>>>
>>>> *Black-listed TaskTrackers:* 1<http://23.20.27.135:9003/jobblacklistedtrackers.jsp?jobid=job_201301090834_0041>
>>>> ------------------------------
>>>> Kind % CompleteNum Tasks PendingRunningComplete KilledFailed/Killed
>>>> Task Attempts<http://23.20.27.135:9003/jobfailures.jsp?jobid=job_201301090834_0041>
>>>> map<http://23.20.27.135:9003/jobtasks.jsp?jobid=job_201301090834_0041&type=map&pagenum=1>
>>>> 100.00%4500 0450<http://23.20.27.135:9003/jobtasks.jsp?jobid=job_201301090834_0041&type=map&pagenum=1&state=completed>
>>>> 00 / 1<http://23.20.27.135:9003/jobfailures.jsp?jobid=job_201301090834_0041&kind=map&cause=killed>
>>>> reduce<http://23.20.27.135:9003/jobtasks.jsp?jobid=job_201301090834_0041&type=reduce&pagenum=1>
>>>> 100.00%1500 0 02<http://23.20.27.135:9003/jobtasks.jsp?jobid=job_201301090834_0041&type=reduce&pagenum=1&state=completed>
>>>> 1498<http://23.20.27.135:9003/jobtasks.jsp?jobid=job_201301090834_0041&type=reduce&pagenum=1&state=killed>
>>>> 12<http://23.20.27.135:9003/jobfailures.jsp?jobid=job_201301090834_0041&kind=reduce&cause=failed>
>>>>  / 3<http://23.20.27.135:9003/jobfailures.jsp?jobid=job_201301090834_0041&kind=reduce&cause=killed>
>>>>
>>>>
>>>> TaskCompleteStatusStart TimeFinish TimeErrorsCounters
>>>> task_201301090834_0041_r_000001<http://23.20.27.135:9003/taskdetails.jsp?tipid=task_201301090834_0041_r_000001>
>>>> 0.00%
>>>> 10-Jan-2013 04:18:54
>>>> 10-Jan-2013 06:46:38 (2hrs, 27mins, 44sec)
>>>>
>>>> Task attempt_201301090834_0041_r_000001_0 failed to report status for 600 seconds. Killing!
>>>> Task attempt_201301090834_0041_r_000001_1 failed to report status for 602 seconds. Killing!
>>>> Task attempt_201301090834_0041_r_000001_2 failed to report status for 602 seconds. Killing!
>>>> Task attempt_201301090834_0041_r_000001_3 failed to report status for 602 seconds. Killing!
>>>>
>>>>
>>>> 0<http://23.20.27.135:9003/taskstats.jsp?tipid=task_201301090834_0041_r_000001>
>>>> task_201301090834_0041_r_000002<http://23.20.27.135:9003/taskdetails.jsp?tipid=task_201301090834_0041_r_000002>
>>>> 0.00%
>>>> 10-Jan-2013 04:18:54
>>>> 10-Jan-2013 06:46:38 (2hrs, 27mins, 43sec)
>>>>
>>>> Task attempt_201301090834_0041_r_000002_0 failed to report status for 601 seconds. Killing!
>>>> Task attempt_201301090834_0041_r_000002_1 failed to report status for 600 seconds. Killing!
>>>>
>>>>
>>>> 0<http://23.20.27.135:9003/taskstats.jsp?tipid=task_201301090834_0041_r_000002>
>>>> task_201301090834_0041_r_000003<http://23.20.27.135:9003/taskdetails.jsp?tipid=task_201301090834_0041_r_000003>
>>>> 0.00%
>>>> 10-Jan-2013 04:18:57
>>>> 10-Jan-2013 06:46:38 (2hrs, 27mins, 41sec)
>>>>
>>>> Task attempt_201301090834_0041_r_000003_0 failed to report status for 602 seconds. Killing!
>>>> Task attempt_201301090834_0041_r_000003_1 failed to report status for 602 seconds. Killing!
>>>> Task attempt_201301090834_0041_r_000003_2 failed to report status for 602 seconds. Killing!
>>>>
>>>>
>>>> 0<http://23.20.27.135:9003/taskstats.jsp?tipid=task_201301090834_0041_r_000003>
>>>> task_201301090834_0041_r_000005<http://23.20.27.135:9003/taskdetails.jsp?tipid=task_201301090834_0041_r_000005>
>>>> 0.00%
>>>> 10-Jan-2013 06:11:07
>>>> 10-Jan-2013 06:46:38 (35mins, 31sec)
>>>>
>>>> Task attempt_201301090834_0041_r_000005_0 failed to report status for 600 seconds. Killing!
>>>>
>>>>
>>>> 0<http://23.20.27.135:9003/taskstats.jsp?tipid=task_201301090834_0041_r_000005>
>>>>
>>>
>>>
>>
>
>
> --
> Harsh J
>


Re: I am running MapReduce on a 30G data on 1master/2 slave, but failed.

Posted by yaotian <ya...@gmail.com>.
See inline.


2013/1/11 Harsh J <ha...@cloudera.com>

> If the per-record processing time is very high, you will need to
> periodically report a status. Without a status change report from the task
> to the tracker, it will be killed away as a dead task after a default
> timeout of 10 minutes (600s).
>
=====================> Do you mean I should increase the report timeout, "mapred.task.timeout"?
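
For reference, that property is a plain per-job setting in milliseconds; an illustrative way to raise it to 30 minutes would look like the sketch below, though the advice in this thread is to report progress rather than raise the timeout (names and values are examples only):

import org.apache.hadoop.conf.Configuration;

// Illustrative only: what raising mapred.task.timeout would look like
// in a job driver; 1,800,000 ms = 30 minutes.
public class TimeoutSettingExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.setLong("mapred.task.timeout", 30L * 60L * 1000L);
    System.out.println("mapred.task.timeout = " + conf.get("mapred.task.timeout"));
    // Equivalent from the command line for a Tool-based driver:
    //   hadoop jar gps-job.jar GpsJobDriver -D mapred.task.timeout=1800000 <in> <out>
  }
}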


> Also, beware of holding too much memory in a reduce JVM - you're still
> limited there. Best to let the framework do the sort or secondary sort.
>
=======================> Do you mean I should use the default value? My current setting is:
mapred.job.reduce.memory.mb = -1

>
>
> On Fri, Jan 11, 2013 at 10:58 AM, yaotian <ya...@gmail.com> wrote:
>
>> Yes, you are right. The data is GPS trace related to corresponding uid.
>> The reduce is doing this: Sort user to get this kind of result: uid, gps1,
>> gps2, gps3........
>> Yes, the gps data is big because this is 30G data.
>>
>> How to solve this?
>>
>>
>>
>> 2013/1/11 Mahesh Balija <ba...@gmail.com>
>>
>>> Hi,
>>>
>>>           2 reducers are successfully completed and 1498 have been
>>> killed. I assume that you have the data issues. (Either the data is huge or
>>> some issues with the data you are trying to process)
>>>           One possibility could be you have many values associated to a
>>> single key, which can cause these kind of issues based on the operation you
>>> do in your reducer.
>>>           Can you put some logs in your reducer and try to trace out
>>> what is happening.
>>>
>>> Best,
>>> Mahesh Balija,
>>> Calsoft Labs.
>>>
>>>
>>> On Fri, Jan 11, 2013 at 8:53 AM, yaotian <ya...@gmail.com> wrote:
>>>
>>>> I have 1 hadoop master which name node locates and 2 slave which
>>>> datanode locate.
>>>>
>>>> If i choose a small data like 200M, it can be done.
>>>>
>>>> But if i run 30G data, Map is done. But the reduce report error. Any
>>>> sugggestion?
>>>>
>>>>
>>>> This is the information.
>>>>
>>>> *Black-listed TaskTrackers:* 1<http://23.20.27.135:9003/jobblacklistedtrackers.jsp?jobid=job_201301090834_0041>
>>>> ------------------------------
>>>> Kind % CompleteNum Tasks PendingRunningComplete KilledFailed/Killed
>>>> Task Attempts<http://23.20.27.135:9003/jobfailures.jsp?jobid=job_201301090834_0041>
>>>> map<http://23.20.27.135:9003/jobtasks.jsp?jobid=job_201301090834_0041&type=map&pagenum=1>
>>>> 100.00%4500 0450<http://23.20.27.135:9003/jobtasks.jsp?jobid=job_201301090834_0041&type=map&pagenum=1&state=completed>
>>>> 00 / 1<http://23.20.27.135:9003/jobfailures.jsp?jobid=job_201301090834_0041&kind=map&cause=killed>
>>>> reduce<http://23.20.27.135:9003/jobtasks.jsp?jobid=job_201301090834_0041&type=reduce&pagenum=1>
>>>> 100.00%1500 0 02<http://23.20.27.135:9003/jobtasks.jsp?jobid=job_201301090834_0041&type=reduce&pagenum=1&state=completed>
>>>> 1498<http://23.20.27.135:9003/jobtasks.jsp?jobid=job_201301090834_0041&type=reduce&pagenum=1&state=killed>
>>>> 12<http://23.20.27.135:9003/jobfailures.jsp?jobid=job_201301090834_0041&kind=reduce&cause=failed>
>>>>  / 3<http://23.20.27.135:9003/jobfailures.jsp?jobid=job_201301090834_0041&kind=reduce&cause=killed>
>>>>
>>>>
>>>> TaskCompleteStatusStart TimeFinish TimeErrorsCounters
>>>> task_201301090834_0041_r_000001<http://23.20.27.135:9003/taskdetails.jsp?tipid=task_201301090834_0041_r_000001>
>>>> 0.00%
>>>> 10-Jan-2013 04:18:54
>>>> 10-Jan-2013 06:46:38 (2hrs, 27mins, 44sec)
>>>>
>>>> Task attempt_201301090834_0041_r_000001_0 failed to report status for 600 seconds. Killing!
>>>> Task attempt_201301090834_0041_r_000001_1 failed to report status for 602 seconds. Killing!
>>>> Task attempt_201301090834_0041_r_000001_2 failed to report status for 602 seconds. Killing!
>>>> Task attempt_201301090834_0041_r_000001_3 failed to report status for 602 seconds. Killing!
>>>>
>>>>
>>>> 0<http://23.20.27.135:9003/taskstats.jsp?tipid=task_201301090834_0041_r_000001>
>>>> task_201301090834_0041_r_000002<http://23.20.27.135:9003/taskdetails.jsp?tipid=task_201301090834_0041_r_000002>
>>>> 0.00%
>>>> 10-Jan-2013 04:18:54
>>>> 10-Jan-2013 06:46:38 (2hrs, 27mins, 43sec)
>>>>
>>>> Task attempt_201301090834_0041_r_000002_0 failed to report status for 601 seconds. Killing!
>>>> Task attempt_201301090834_0041_r_000002_1 failed to report status for 600 seconds. Killing!
>>>>
>>>>
>>>> 0<http://23.20.27.135:9003/taskstats.jsp?tipid=task_201301090834_0041_r_000002>
>>>> task_201301090834_0041_r_000003<http://23.20.27.135:9003/taskdetails.jsp?tipid=task_201301090834_0041_r_000003>
>>>> 0.00%
>>>> 10-Jan-2013 04:18:57
>>>> 10-Jan-2013 06:46:38 (2hrs, 27mins, 41sec)
>>>>
>>>> Task attempt_201301090834_0041_r_000003_0 failed to report status for 602 seconds. Killing!
>>>> Task attempt_201301090834_0041_r_000003_1 failed to report status for 602 seconds. Killing!
>>>> Task attempt_201301090834_0041_r_000003_2 failed to report status for 602 seconds. Killing!
>>>>
>>>>
>>>> 0<http://23.20.27.135:9003/taskstats.jsp?tipid=task_201301090834_0041_r_000003>
>>>> task_201301090834_0041_r_000005<http://23.20.27.135:9003/taskdetails.jsp?tipid=task_201301090834_0041_r_000005>
>>>> 0.00%
>>>> 10-Jan-2013 06:11:07
>>>> 10-Jan-2013 06:46:38 (35mins, 31sec)
>>>>
>>>> Task attempt_201301090834_0041_r_000005_0 failed to report status for 600 seconds. Killing!
>>>>
>>>>
>>>> 0<http://23.20.27.135:9003/taskstats.jsp?tipid=task_201301090834_0041_r_000005>
>>>>
>>>
>>>
>>
>
>
> --
> Harsh J
>

Re: I am running MapReduce on a 30G data on 1master/2 slave, but failed.

Posted by Harsh J <ha...@cloudera.com>.
If the per-record processing time is very high, you will need to
periodically report a status. Without a status change reported from the task
to the tracker, it will be killed as a dead task after the default
timeout of 10 minutes (600s).

Also, beware of holding too much data in memory in a reduce JVM - you're still
limited there. It's best to let the framework do the sort or secondary sort.
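
To make the secondary-sort option concrete, a sketch along the lines below could be used: a composite key carries (uid, timestamp), a partitioner hashes only the uid, and a grouping comparator groups only on the uid, so each reduce() call sees one user's GPS points already ordered by time with nothing buffered in reducer memory. None of these class names come from the original job; they are only an illustration:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical composite key: natural key = uid, secondary key = timestamp.
public class UidTimestampKey implements WritableComparable<UidTimestampKey> {
  private final Text uid = new Text();
  private final LongWritable timestamp = new LongWritable();

  public void set(String u, long ts) { uid.set(u); timestamp.set(ts); }
  public Text getUid() { return uid; }

  @Override public void write(DataOutput out) throws IOException {
    uid.write(out);
    timestamp.write(out);
  }

  @Override public void readFields(DataInput in) throws IOException {
    uid.readFields(in);
    timestamp.readFields(in);
  }

  // Full sort order used by the framework's shuffle: uid first, then timestamp.
  @Override public int compareTo(UidTimestampKey o) {
    int cmp = uid.compareTo(o.uid);
    return cmp != 0 ? cmp : timestamp.compareTo(o.timestamp);
  }

  @Override public int hashCode() { return uid.hashCode(); }
  @Override public boolean equals(Object o) {
    return o instanceof UidTimestampKey && compareTo((UidTimestampKey) o) == 0;
  }

  // Send every record of a uid to the same reducer, regardless of timestamp.
  public static class UidPartitioner extends Partitioner<UidTimestampKey, Text> {
    @Override public int getPartition(UidTimestampKey key, Text value, int numPartitions) {
      return (key.getUid().hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
  }

  // Group reduce() input on uid only; timestamps stay sorted within the group.
  public static class UidGroupingComparator extends WritableComparator {
    public UidGroupingComparator() { super(UidTimestampKey.class, true); }
    @Override public int compare(WritableComparable a, WritableComparable b) {
      return ((UidTimestampKey) a).getUid().compareTo(((UidTimestampKey) b).getUid());
    }
  }
}

// Driver wiring (sketch):
//   job.setMapOutputKeyClass(UidTimestampKey.class);
//   job.setMapOutputValueClass(Text.class);
//   job.setPartitionerClass(UidTimestampKey.UidPartitioner.class);
//   job.setGroupingComparatorClass(UidTimestampKey.UidGroupingComparator.class);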


On Fri, Jan 11, 2013 at 10:58 AM, yaotian <ya...@gmail.com> wrote:

> Yes, you are right. The data is GPS trace related to corresponding uid.
> The reduce is doing this: Sort user to get this kind of result: uid, gps1,
> gps2, gps3........
> Yes, the gps data is big because this is 30G data.
>
> How to solve this?
>
>
>
> 2013/1/11 Mahesh Balija <ba...@gmail.com>
>
>> Hi,
>>
>>           2 reducers are successfully completed and 1498 have been
>> killed. I assume that you have the data issues. (Either the data is huge or
>> some issues with the data you are trying to process)
>>           One possibility could be you have many values associated to a
>> single key, which can cause these kind of issues based on the operation you
>> do in your reducer.
>>           Can you put some logs in your reducer and try to trace out what
>> is happening.
>>
>> Best,
>> Mahesh Balija,
>> Calsoft Labs.
>>
>>
>> On Fri, Jan 11, 2013 at 8:53 AM, yaotian <ya...@gmail.com> wrote:
>>
>>> I have 1 hadoop master which name node locates and 2 slave which
>>> datanode locate.
>>>
>>> If i choose a small data like 200M, it can be done.
>>>
>>> But if i run 30G data, Map is done. But the reduce report error. Any
>>> sugggestion?
>>>
>>>
>>> This is the information.
>>>
>>> *Black-listed TaskTrackers:* 1<http://23.20.27.135:9003/jobblacklistedtrackers.jsp?jobid=job_201301090834_0041>
>>> ------------------------------
>>> Kind % CompleteNum Tasks PendingRunningComplete KilledFailed/Killed
>>> Task Attempts<http://23.20.27.135:9003/jobfailures.jsp?jobid=job_201301090834_0041>
>>> map<http://23.20.27.135:9003/jobtasks.jsp?jobid=job_201301090834_0041&type=map&pagenum=1>
>>> 100.00%4500 0450<http://23.20.27.135:9003/jobtasks.jsp?jobid=job_201301090834_0041&type=map&pagenum=1&state=completed>
>>> 00 / 1<http://23.20.27.135:9003/jobfailures.jsp?jobid=job_201301090834_0041&kind=map&cause=killed>
>>> reduce<http://23.20.27.135:9003/jobtasks.jsp?jobid=job_201301090834_0041&type=reduce&pagenum=1>
>>> 100.00%1500 0 02<http://23.20.27.135:9003/jobtasks.jsp?jobid=job_201301090834_0041&type=reduce&pagenum=1&state=completed>
>>> 1498<http://23.20.27.135:9003/jobtasks.jsp?jobid=job_201301090834_0041&type=reduce&pagenum=1&state=killed>
>>> 12<http://23.20.27.135:9003/jobfailures.jsp?jobid=job_201301090834_0041&kind=reduce&cause=failed>
>>>  / 3<http://23.20.27.135:9003/jobfailures.jsp?jobid=job_201301090834_0041&kind=reduce&cause=killed>
>>>
>>>
>>> TaskCompleteStatusStart TimeFinish TimeErrorsCounters
>>> task_201301090834_0041_r_000001<http://23.20.27.135:9003/taskdetails.jsp?tipid=task_201301090834_0041_r_000001>
>>> 0.00%
>>> 10-Jan-2013 04:18:54
>>> 10-Jan-2013 06:46:38 (2hrs, 27mins, 44sec)
>>>
>>> Task attempt_201301090834_0041_r_000001_0 failed to report status for 600 seconds. Killing!
>>> Task attempt_201301090834_0041_r_000001_1 failed to report status for 602 seconds. Killing!
>>> Task attempt_201301090834_0041_r_000001_2 failed to report status for 602 seconds. Killing!
>>> Task attempt_201301090834_0041_r_000001_3 failed to report status for 602 seconds. Killing!
>>>
>>>
>>> 0<http://23.20.27.135:9003/taskstats.jsp?tipid=task_201301090834_0041_r_000001>
>>> task_201301090834_0041_r_000002<http://23.20.27.135:9003/taskdetails.jsp?tipid=task_201301090834_0041_r_000002>
>>> 0.00%
>>> 10-Jan-2013 04:18:54
>>> 10-Jan-2013 06:46:38 (2hrs, 27mins, 43sec)
>>>
>>> Task attempt_201301090834_0041_r_000002_0 failed to report status for 601 seconds. Killing!
>>> Task attempt_201301090834_0041_r_000002_1 failed to report status for 600 seconds. Killing!
>>>
>>>
>>> 0<http://23.20.27.135:9003/taskstats.jsp?tipid=task_201301090834_0041_r_000002>
>>> task_201301090834_0041_r_000003<http://23.20.27.135:9003/taskdetails.jsp?tipid=task_201301090834_0041_r_000003>
>>> 0.00%
>>> 10-Jan-2013 04:18:57
>>> 10-Jan-2013 06:46:38 (2hrs, 27mins, 41sec)
>>>
>>> Task attempt_201301090834_0041_r_000003_0 failed to report status for 602 seconds. Killing!
>>> Task attempt_201301090834_0041_r_000003_1 failed to report status for 602 seconds. Killing!
>>> Task attempt_201301090834_0041_r_000003_2 failed to report status for 602 seconds. Killing!
>>>
>>>
>>> 0<http://23.20.27.135:9003/taskstats.jsp?tipid=task_201301090834_0041_r_000003>
>>> task_201301090834_0041_r_000005<http://23.20.27.135:9003/taskdetails.jsp?tipid=task_201301090834_0041_r_000005>
>>> 0.00%
>>> 10-Jan-2013 06:11:07
>>> 10-Jan-2013 06:46:38 (35mins, 31sec)
>>>
>>>
>>> Task attempt_201301090834_0041_r_000005_0 failed to report status for 600 seconds. Killing!
>>>
>>>
>>> 0<http://23.20.27.135:9003/taskstats.jsp?tipid=task_201301090834_0041_r_000005>
>>>
>>
>>
>


-- 
Harsh J

Re: I am running MapReduce on a 30G data on 1master/2 slave, but failed.

Posted by yaotian <ya...@gmail.com>.
Yes, you are right. The data is GPS traces keyed by the corresponding uid.
The reducer sorts each user's points to get results of the form: uid, gps1,
gps2, gps3, ...
Yes, the GPS data per uid can be big, because the whole input is 30G.

How can I solve this?
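
One way out - the "let the framework do the secondary sort" route - is to
move the per-user ordering into the shuffle with a composite key, so the
reducer can stream the points instead of collecting gps1..gpsN in memory.
A rough sketch; every class and field name here is made up for
illustration, assuming the mapper emits (uid, timestamp) as the key and the
GPS point as the value:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Composite map-output key: the shuffle sorts on (uid, timestamp), so
    // points reach the reducer already in time order per user.
    public class UidTimeKey implements WritableComparable<UidTimeKey> {
      public Text uid = new Text();
      public LongWritable ts = new LongWritable();

      public void write(DataOutput out) throws IOException { uid.write(out); ts.write(out); }
      public void readFields(DataInput in) throws IOException { uid.readFields(in); ts.readFields(in); }
      public int compareTo(UidTimeKey o) {
        int c = uid.compareTo(o.uid);
        return c != 0 ? c : ts.compareTo(o.ts);
      }

      // Send every point of a given uid to the same reducer, ignoring the timestamp.
      public static class UidPartitioner extends Partitioner<UidTimeKey, Text> {
        public int getPartition(UidTimeKey key, Text value, int numPartitions) {
          return (key.uid.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
      }

      // Group reduce() calls by uid only, so one call sees the whole sorted trace.
      public static class UidGrouping extends WritableComparator {
        public UidGrouping() { super(UidTimeKey.class, true); }
        public int compare(WritableComparable a, WritableComparable b) {
          return ((UidTimeKey) a).uid.compareTo(((UidTimeKey) b).uid);
        }
      }

      // Driver wiring for this layout.
      public static void configure(Job job) {
        job.setMapOutputKeyClass(UidTimeKey.class);
        job.setMapOutputValueClass(Text.class);
        job.setPartitionerClass(UidPartitioner.class);
        job.setGroupingComparatorClass(UidGrouping.class);
      }
    }

With that in place the reducer just writes (or windows over) the values as
they arrive and reports progress as it goes, instead of sorting or
buffering a whole per-uid trace out of the 30G input.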



2013/1/11 Mahesh Balija <ba...@gmail.com>

> Hi,
>
>           2 reducers are successfully completed and 1498 have been killed.
> I assume that you have the data issues. (Either the data is huge or some
> issues with the data you are trying to process)
>           One possibility could be you have many values associated to a
> single key, which can cause these kind of issues based on the operation you
> do in your reducer.
>           Can you put some logs in your reducer and try to trace out what
> is happening.
>
> Best,
> Mahesh Balija,
> Calsoft Labs.
>
>
> On Fri, Jan 11, 2013 at 8:53 AM, yaotian <ya...@gmail.com> wrote:
>
>> I have 1 hadoop master which name node locates and 2 slave which datanode
>> locate.
>>
>> If i choose a small data like 200M, it can be done.
>>
>> But if i run 30G data, Map is done. But the reduce report error. Any
>> sugggestion?
>>
>>
>> This is the information.
>>
>> *Black-listed TaskTrackers:* 1<http://23.20.27.135:9003/jobblacklistedtrackers.jsp?jobid=job_201301090834_0041>
>> ------------------------------
>> Kind % CompleteNum Tasks PendingRunningComplete KilledFailed/Killed
>> Task Attempts<http://23.20.27.135:9003/jobfailures.jsp?jobid=job_201301090834_0041>
>> map<http://23.20.27.135:9003/jobtasks.jsp?jobid=job_201301090834_0041&type=map&pagenum=1>
>> 100.00%4500 0450<http://23.20.27.135:9003/jobtasks.jsp?jobid=job_201301090834_0041&type=map&pagenum=1&state=completed>
>> 00 / 1<http://23.20.27.135:9003/jobfailures.jsp?jobid=job_201301090834_0041&kind=map&cause=killed>
>> reduce<http://23.20.27.135:9003/jobtasks.jsp?jobid=job_201301090834_0041&type=reduce&pagenum=1>
>> 100.00%1500 0 02<http://23.20.27.135:9003/jobtasks.jsp?jobid=job_201301090834_0041&type=reduce&pagenum=1&state=completed>
>> 1498<http://23.20.27.135:9003/jobtasks.jsp?jobid=job_201301090834_0041&type=reduce&pagenum=1&state=killed>
>> 12<http://23.20.27.135:9003/jobfailures.jsp?jobid=job_201301090834_0041&kind=reduce&cause=failed>
>>  / 3<http://23.20.27.135:9003/jobfailures.jsp?jobid=job_201301090834_0041&kind=reduce&cause=killed>
>>
>>
>> TaskCompleteStatusStart TimeFinish TimeErrorsCounters
>> task_201301090834_0041_r_000001<http://23.20.27.135:9003/taskdetails.jsp?tipid=task_201301090834_0041_r_000001>
>> 0.00%
>> 10-Jan-2013 04:18:54
>> 10-Jan-2013 06:46:38 (2hrs, 27mins, 44sec)
>>
>> Task attempt_201301090834_0041_r_000001_0 failed to report status for 600 seconds. Killing!
>> Task attempt_201301090834_0041_r_000001_1 failed to report status for 602 seconds. Killing!
>> Task attempt_201301090834_0041_r_000001_2 failed to report status for 602 seconds. Killing!
>> Task attempt_201301090834_0041_r_000001_3 failed to report status for 602 seconds. Killing!
>>
>>
>> 0<http://23.20.27.135:9003/taskstats.jsp?tipid=task_201301090834_0041_r_000001>
>> task_201301090834_0041_r_000002<http://23.20.27.135:9003/taskdetails.jsp?tipid=task_201301090834_0041_r_000002>
>> 0.00%
>> 10-Jan-2013 04:18:54
>> 10-Jan-2013 06:46:38 (2hrs, 27mins, 43sec)
>>
>> Task attempt_201301090834_0041_r_000002_0 failed to report status for 601 seconds. Killing!
>> Task attempt_201301090834_0041_r_000002_1 failed to report status for 600 seconds. Killing!
>>
>>
>> 0<http://23.20.27.135:9003/taskstats.jsp?tipid=task_201301090834_0041_r_000002>
>> task_201301090834_0041_r_000003<http://23.20.27.135:9003/taskdetails.jsp?tipid=task_201301090834_0041_r_000003>
>> 0.00%
>> 10-Jan-2013 04:18:57
>> 10-Jan-2013 06:46:38 (2hrs, 27mins, 41sec)
>>
>> Task attempt_201301090834_0041_r_000003_0 failed to report status for 602 seconds. Killing!
>> Task attempt_201301090834_0041_r_000003_1 failed to report status for 602 seconds. Killing!
>> Task attempt_201301090834_0041_r_000003_2 failed to report status for 602 seconds. Killing!
>>
>>
>> 0<http://23.20.27.135:9003/taskstats.jsp?tipid=task_201301090834_0041_r_000003>
>> task_201301090834_0041_r_000005<http://23.20.27.135:9003/taskdetails.jsp?tipid=task_201301090834_0041_r_000005>
>> 0.00%
>> 10-Jan-2013 06:11:07
>> 10-Jan-2013 06:46:38 (35mins, 31sec)
>>
>> Task attempt_201301090834_0041_r_000005_0 failed to report status for 600 seconds. Killing!
>>
>>
>> 0<http://23.20.27.135:9003/taskstats.jsp?tipid=task_201301090834_0041_r_000005>
>>
>
>

Re: I am running MapReduce on a 30G data on 1master/2 slave, but failed.

Posted by Serge Blazhiyevskyy <Se...@nice.com>.
Are you running this on the VM by any chance?






On Jan 10, 2013, at 9:11 PM, Mahesh Balija <ba...@gmail.com>> wrote:

Hi,

          2 reducers are successfully completed and 1498 have been killed. I assume that you have the data issues. (Either the data is huge or some issues with the data you are trying to process)
          One possibility could be you have many values associated to a single key, which can cause these kind of issues based on the operation you do in your reducer.
          Can you put some logs in your reducer and try to trace out what is happening.

Best,
Mahesh Balija,
Calsoft Labs.

On Fri, Jan 11, 2013 at 8:53 AM, yaotian <ya...@gmail.com>> wrote:
I have 1 hadoop master which name node locates and 2 slave which datanode locate.

If i choose a small data like 200M, it can be done.

But if i run 30G data, Map is done. But the reduce report error. Any sugggestion?


This is the information.

Black-listed TaskTrackers: 1<http://23.20.27.135:9003/jobblacklistedtrackers.jsp?jobid=job_201301090834_0041>
________________________________
Kind    % Complete      Num Tasks       Pending Running Complete        Killed  Failed/Killed
Task Attempts<http://23.20.27.135:9003/jobfailures.jsp?jobid=job_201301090834_0041>
map<http://23.20.27.135:9003/jobtasks.jsp?jobid=job_201301090834_0041&type=map&pagenum=1>       100.00%

        450     0       0       450<http://23.20.27.135:9003/jobtasks.jsp?jobid=job_201301090834_0041&type=map&pagenum=1&state=completed>       0       0 / 1<http://23.20.27.135:9003/jobfailures.jsp?jobid=job_201301090834_0041&kind=map&cause=killed>
reduce<http://23.20.27.135:9003/jobtasks.jsp?jobid=job_201301090834_0041&type=reduce&pagenum=1> 100.00%

        1500    0       0       2<http://23.20.27.135:9003/jobtasks.jsp?jobid=job_201301090834_0041&type=reduce&pagenum=1&state=completed>      1498<http://23.20.27.135:9003/jobtasks.jsp?jobid=job_201301090834_0041&type=reduce&pagenum=1&state=killed>      12<http://23.20.27.135:9003/jobfailures.jsp?jobid=job_201301090834_0041&kind=reduce&cause=failed> / 3<http://23.20.27.135:9003/jobfailures.jsp?jobid=job_201301090834_0041&kind=reduce&cause=killed>



Task    Complete        Status  Start Time      Finish Time     Errors  Counters
task_201301090834_0041_r_000001<http://23.20.27.135:9003/taskdetails.jsp?tipid=task_201301090834_0041_r_000001> 0.00%


        10-Jan-2013 04:18:54
        10-Jan-2013 06:46:38 (2hrs, 27mins, 44sec)


Task attempt_201301090834_0041_r_000001_0 failed to report status for 600 seconds. Killing!
Task attempt_201301090834_0041_r_000001_1 failed to report status for 602 seconds. Killing!
Task attempt_201301090834_0041_r_000001_2 failed to report status for 602 seconds. Killing!
Task attempt_201301090834_0041_r_000001_3 failed to report status for 602 seconds. Killing!


        0<http://23.20.27.135:9003/taskstats.jsp?tipid=task_201301090834_0041_r_000001>
task_201301090834_0041_r_000002<http://23.20.27.135:9003/taskdetails.jsp?tipid=task_201301090834_0041_r_000002> 0.00%


        10-Jan-2013 04:18:54
        10-Jan-2013 06:46:38 (2hrs, 27mins, 43sec)


Task attempt_201301090834_0041_r_000002_0 failed to report status for 601 seconds. Killing!
Task attempt_201301090834_0041_r_000002_1 failed to report status for 600 seconds. Killing!


        0<http://23.20.27.135:9003/taskstats.jsp?tipid=task_201301090834_0041_r_000002>
task_201301090834_0041_r_000003<http://23.20.27.135:9003/taskdetails.jsp?tipid=task_201301090834_0041_r_000003> 0.00%


        10-Jan-2013 04:18:57
        10-Jan-2013 06:46:38 (2hrs, 27mins, 41sec)


Task attempt_201301090834_0041_r_000003_0 failed to report status for 602 seconds. Killing!
Task attempt_201301090834_0041_r_000003_1 failed to report status for 602 seconds. Killing!
Task attempt_201301090834_0041_r_000003_2 failed to report status for 602 seconds. Killing!


        0<http://23.20.27.135:9003/taskstats.jsp?tipid=task_201301090834_0041_r_000003>
task_201301090834_0041_r_000005<http://23.20.27.135:9003/taskdetails.jsp?tipid=task_201301090834_0041_r_000005> 0.00%


        10-Jan-2013 06:11:07
        10-Jan-2013 06:46:38 (35mins, 31sec)


Task attempt_201301090834_0041_r_000005_0 failed to report status for 600 seconds. Killing!


        0<http://23.20.27.135:9003/taskstats.jsp?tipid=task_201301090834_0041_r_000005>



Re: I am running MapReduce on a 30G data on 1master/2 slave, but failed.

Posted by Mahesh Balija <ba...@gmail.com>.
Hi,

          2 reducers completed successfully and 1498 were killed. I assume
you have data issues (either the data is huge, or there is some issue with
the data you are trying to process).
          One possibility is that you have many values associated with a
single key, which can cause this kind of issue depending on the operation
you do in your reducer.
          Can you put some logging in your reducer and try to trace what is
happening?
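
For example (just a sketch; the logger and counter names are placeholders),
counting the values per key in the reducer will quickly show whether a few
uids carry most of the 30G:

    import java.io.IOException;
    import org.apache.commons.logging.Log;
    import org.apache.commons.logging.LogFactory;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Diagnostic-only reducer: counts values per uid (real output elided)
    // and reports progress so the counting itself is not timed out.
    public class TracingReducer extends Reducer<Text, Text, Text, Text> {
      private static final Log LOG = LogFactory.getLog(TracingReducer.class);

      @Override
      protected void reduce(Text uid, Iterable<Text> values, Context context)
          throws IOException, InterruptedException {
        long n = 0;
        for (Text v : values) {
          if (++n % 100000 == 0) {
            context.progress();
          }
        }
        LOG.info("uid=" + uid + " values=" + n);   // ends up in the task's userlogs
        if (n > 1000000L) {
          context.getCounter("skew", "keys-over-1M-values").increment(1);
        }
      }
    }

Skewed keys like that would explain reducers that sit at 0.00% for hours
and are then killed for not reporting status.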

Best,
Mahesh Balija,
Calsoft Labs.

On Fri, Jan 11, 2013 at 8:53 AM, yaotian <ya...@gmail.com> wrote:

> I have 1 hadoop master which name node locates and 2 slave which datanode
> locate.
>
> If i choose a small data like 200M, it can be done.
>
> But if i run 30G data, Map is done. But the reduce report error. Any
> sugggestion?
>
>
> This is the information.
>
> *Black-listed TaskTrackers:* 1<http://23.20.27.135:9003/jobblacklistedtrackers.jsp?jobid=job_201301090834_0041>
> ------------------------------
> Kind % CompleteNum Tasks PendingRunningComplete KilledFailed/Killed
> Task Attempts<http://23.20.27.135:9003/jobfailures.jsp?jobid=job_201301090834_0041>
> map<http://23.20.27.135:9003/jobtasks.jsp?jobid=job_201301090834_0041&type=map&pagenum=1>
> 100.00%4500 0450<http://23.20.27.135:9003/jobtasks.jsp?jobid=job_201301090834_0041&type=map&pagenum=1&state=completed>
> 00 / 1<http://23.20.27.135:9003/jobfailures.jsp?jobid=job_201301090834_0041&kind=map&cause=killed>
> reduce<http://23.20.27.135:9003/jobtasks.jsp?jobid=job_201301090834_0041&type=reduce&pagenum=1>
> 100.00%15000 02<http://23.20.27.135:9003/jobtasks.jsp?jobid=job_201301090834_0041&type=reduce&pagenum=1&state=completed>
> 1498<http://23.20.27.135:9003/jobtasks.jsp?jobid=job_201301090834_0041&type=reduce&pagenum=1&state=killed>
> 12<http://23.20.27.135:9003/jobfailures.jsp?jobid=job_201301090834_0041&kind=reduce&cause=failed>
>  / 3<http://23.20.27.135:9003/jobfailures.jsp?jobid=job_201301090834_0041&kind=reduce&cause=killed>
>
>
> TaskCompleteStatusStart TimeFinish TimeErrorsCounters
> task_201301090834_0041_r_000001<http://23.20.27.135:9003/taskdetails.jsp?tipid=task_201301090834_0041_r_000001>
> 0.00%
> 10-Jan-2013 04:18:54
> 10-Jan-2013 06:46:38 (2hrs, 27mins, 44sec)
>
> Task attempt_201301090834_0041_r_000001_0 failed to report status for 600 seconds. Killing!
> Task attempt_201301090834_0041_r_000001_1 failed to report status for 602 seconds. Killing!
> Task attempt_201301090834_0041_r_000001_2 failed to report status for 602 seconds. Killing!
> Task attempt_201301090834_0041_r_000001_3 failed to report status for 602 seconds. Killing!
>
>
> 0<http://23.20.27.135:9003/taskstats.jsp?tipid=task_201301090834_0041_r_000001>
> task_201301090834_0041_r_000002<http://23.20.27.135:9003/taskdetails.jsp?tipid=task_201301090834_0041_r_000002>
> 0.00%
> 10-Jan-2013 04:18:54
> 10-Jan-2013 06:46:38 (2hrs, 27mins, 43sec)
>
> Task attempt_201301090834_0041_r_000002_0 failed to report status for 601 seconds. Killing!
> Task attempt_201301090834_0041_r_000002_1 failed to report status for 600 seconds. Killing!
>
>
> 0<http://23.20.27.135:9003/taskstats.jsp?tipid=task_201301090834_0041_r_000002>
> task_201301090834_0041_r_000003<http://23.20.27.135:9003/taskdetails.jsp?tipid=task_201301090834_0041_r_000003>
> 0.00%
> 10-Jan-2013 04:18:57
> 10-Jan-2013 06:46:38 (2hrs, 27mins, 41sec)
>
> Task attempt_201301090834_0041_r_000003_0 failed to report status for 602 seconds. Killing!
> Task attempt_201301090834_0041_r_000003_1 failed to report status for 602 seconds. Killing!
> Task attempt_201301090834_0041_r_000003_2 failed to report status for 602 seconds. Killing!
>
>
> 0<http://23.20.27.135:9003/taskstats.jsp?tipid=task_201301090834_0041_r_000003>
> task_201301090834_0041_r_000005<http://23.20.27.135:9003/taskdetails.jsp?tipid=task_201301090834_0041_r_000005>
> 0.00%
> 10-Jan-2013 06:11:07
> 10-Jan-2013 06:46:38 (35mins, 31sec)
>
> Task attempt_201301090834_0041_r_000005_0 failed to report status for 600 seconds. Killing!
>
>
> 0<http://23.20.27.135:9003/taskstats.jsp?tipid=task_201301090834_0041_r_000005>
>
