Posted to user@nutch.apache.org by charlie w <sp...@gmail.com> on 2007/07/17 01:52:00 UTC

can't crawl with hadoop under cygwin

I've been using the Nutch and Hadoop tutorials on the respective wikis to
try to get Nutch to use Hadoop for crawling, and have worked through many
problems, but now have run up against something I can't work out.

Nutch version is 0.9, and Hadoop is 0.12.2.

To try to keep things simple, I have Hadoop up and running with a single
node.  I can put files onto the DFS, and can view them in the status
interface, so everything looks OK there.

My urls file is on the DFS, and contains a single URL.

When I try a simple crawl of a single site, however, the crawler runs for a
while but fails to inject the URL.  The tasktracker log shows this
multiple times (I guess there is retry logic for the map/reduce tasks):
--------------------------------------------------------------------------------
2007-07-16 17:24:17,591 INFO  mapred.TaskRunner - task_0001_r_000000_0 Need 2 map output(s)
2007-07-16 17:24:17,591 INFO  mapred.TaskRunner - task_0001_r_000000_0 Need 2 map output location(s)
2007-07-16 17:24:17,591 INFO  mapred.TaskRunner - task_0001_r_000000_0 Got 0 new map outputs from jobtracker and 0 map outputs from previous failures
2007-07-16 17:24:17,591 INFO  mapred.TaskRunner - task_0001_r_000000_0 Got 2 known map output location(s); scheduling...
2007-07-16 17:24:17,591 INFO  mapred.TaskRunner - task_0001_r_000000_0 Scheduled 1 of 2 known outputs (0 slow hosts and 1 dup hosts)
2007-07-16 17:24:17,591 INFO  mapred.TaskRunner - task_0001_r_000000_0 Copying task_0001_m_000001_3 output from ssv.
2007-07-16 17:24:17,591 WARN  mapred.TaskTracker - getMapOutput(task_0001_m_000001_3,0) failed :
java.io.FileNotFoundException: ../../hadoop/mapreduce/local/task_0001_m_000001_3/file.out.index
        at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:328)
        at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:245)
        at org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:1637)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
        at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:427)
        at org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHandler.java:475)
        at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:567)
        at org.mortbay.http.HttpContext.handle(HttpContext.java:1565)
        at org.mortbay.jetty.servlet.WebApplicationContext.handle(WebApplicationContext.java:635)
        at org.mortbay.http.HttpContext.handle(HttpContext.java:1517)
        at org.mortbay.http.HttpServer.service(HttpServer.java:954)
        at org.mortbay.http.HttpConnection.service(HttpConnection.java:814)
        at org.mortbay.http.HttpConnection.handleNext(HttpConnection.java:981)
        at org.mortbay.http.HttpConnection.handle(HttpConnection.java:831)
        at org.mortbay.http.SocketListener.handleConnection(SocketListener.java:244)
        at org.mortbay.util.ThreadedServer.handle(ThreadedServer.java:357)
        at org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:534)

2007-07-16 17:24:17,591 INFO  mapred.TaskTracker - Reporting output lost:task_0001_m_000001_3

--------------------------------------------------------------------------------

(I hope there's enough there to be of help).
I have mapred.map.tasks and mapred.reduce.tasks both set to 1, but have had
the same results with them set to 2.
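For reference, here is roughly how I have those two properties set in
conf/hadoop-site.xml (just a fragment of my config, property names as
documented for Hadoop 0.12):

```xml
<!-- fragment of conf/hadoop-site.xml: force a single map and reduce task -->
<property>
  <name>mapred.map.tasks</name>
  <value>1</value>
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>1</value>
</property>
```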

Can anybody give me some insight into what might be going wrong?  Any
suggestions for further debugging?

Thanks