Posted to common-dev@hadoop.apache.org by "Bryan Pendleton (JIRA)" <ji...@apache.org> on 2006/03/08 01:51:39 UTC

[jira] Commented: (HADOOP-16) RPC call times out while indexing map task is computing splits

    [ http://issues.apache.org/jira/browse/HADOOP-16?page=comments#action_12369339 ] 

Bryan Pendleton commented on HADOOP-16:
---------------------------------------

This is great, and finally fixes my issues with a large job that would never start.

However, with this patch (and the current code), the job doesn't start until the background thread finishes computing all of the cache hints. This takes far too long: it took 15 minutes on a recent run, and during that time no other work was getting done. How about moving the cachedHints-filling loop to the end of initTasks(), and setting the job to RUNNING and "tasksInited=true" before that loop runs?

With this change applied locally, work commences immediately, while the cache hints continue to get filled in for future task allocations.
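
A minimal sketch of the reordering, to make the suggestion concrete. This is not the actual patch or the real Hadoop class: JobSketch, lookUpHint(), isSchedulable(), hintFor() and hintsReady() are hypothetical names, and only initTasks(), cachedHints, tasksInited and the RUNNING state come from the discussion above. It assumes the scheduler can tolerate a missing hint (null) for a split whose hint has not been computed yet.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the per-job state; not the real Hadoop class.
class JobSketch {
    enum State { PREP, RUNNING }

    private volatile State state = State.PREP;
    private volatile boolean tasksInited = false;
    // Filled in gradually; a missing entry means "no locality hint yet".
    private final Map<Integer, String> cachedHints = new ConcurrentHashMap<>();
    private final int numSplits;

    JobSketch(int numSplits) {
        this.numSplits = numSplits;
    }

    // Placeholder for the slow per-split lookup (an assumption for illustration).
    protected String lookUpHint(int split) {
        return "host-" + (split % 10);
    }

    // Intended to run on the background initialization thread.
    void initTasks() {
        // ... task creation would happen here ...

        // Mark the job runnable first, so tasks can be handed out right away.
        state = State.RUNNING;
        tasksInited = true;

        // Only then fill the cache hints; later allocations see more of them.
        for (int split = 0; split < numSplits; split++) {
            cachedHints.put(split, lookUpHint(split));
        }
    }

    boolean isSchedulable() {
        return tasksInited && state == State.RUNNING;
    }

    // Returns null while the hint for this split has not been computed.
    String hintFor(int split) {
        return cachedHints.get(split);
    }

    int hintsReady() {
        return cachedHints.size();
    }
}
```

The trade-off in this sketch is that the earliest task allocations may get no locality hint at all, in exchange for the job not sitting idle while the hints are computed.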

> RPC call times out while indexing map task is computing splits
> --------------------------------------------------------------
>
>          Key: HADOOP-16
>          URL: http://issues.apache.org/jira/browse/HADOOP-16
>      Project: Hadoop
>         Type: Bug
>   Components: mapred
>     Versions: 0.1
>  Environment: MapReduce multi-computer crawl environment: 11 machines (1 master with JobTracker/NameNode, 10 slaves with TaskTrackers/DataNodes)
>     Reporter: Chris Schneider
>     Assignee: Mike Cafarella
>      Fix For: 0.1
>  Attachments: patch.16, patch_h16.v0
>
> We've been using Nutch 0.8 (MapReduce) to perform some internet crawling. Things seemed to be going well until...
> 060129 222409 Lost tracker 'tracker_56288'
> 060129 222409 Task 'task_m_10gs5f' has been lost.
> 060129 222409 Task 'task_m_10qhzr' has been lost.
>    ........
>    ........
> 060129 222409 Task 'task_r_zggbwu' has been lost.
> 060129 222409 Task 'task_r_zh8dao' has been lost.
> 060129 222455 Server handler 8 on 8010 caught: java.net.SocketException: Socket closed
> java.net.SocketException: Socket closed
>         at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:99)
>         at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
>         at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
>         at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
>         at java.io.DataOutputStream.flush(DataOutputStream.java:106)
>         at org.apache.nutch.ipc.Server$Handler.run(Server.java:216)
> 060129 222455 Adding task 'task_m_cia5po' to set for tracker 'tracker_56288'
> 060129 223711 Adding task 'task_m_ffv59i' to set for tracker 'tracker_25647'
> I'm hoping that someone could explain why task_m_cia5po got added to tracker_56288 after this tracker was lost.
> The Crawl.main process died with the following output:
> 060129 221129 Indexer: adding segment: /user/crawler/crawl-20060129091444/segments/20060129200246
> Exception in thread "main" java.io.IOException: timed out waiting for response
>     at org.apache.nutch.ipc.Client.call(Client.java:296)
>     at org.apache.nutch.ipc.RPC$Invoker.invoke(RPC.java:127)
>     at $Proxy1.submitJob(Unknown Source)
>     at org.apache.nutch.mapred.JobClient.submitJob(JobClient.java:259)
>     at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:288)
>     at org.apache.nutch.indexer.Indexer.index(Indexer.java:263)
>     at org.apache.nutch.crawl.Crawl.main(Crawl.java:127)
> However, it definitely seems as if the JobTracker is still waiting for the job to finish (no failed jobs).
> Doug Cutting's response:
> The bug here is that the RPC call times out while the map task is computing splits.  The fix is that the job tracker should not compute splits until after it has returned from the submitJob RPC.  Please submit a bug in Jira to help remind us to fix this.
