Posted to dev@nutch.apache.org by Matt Zytaruk <ma...@wavefire.com> on 2006/01/09 19:00:46 UTC

Crawl and parse exceptions

I've been having a lot of trouble lately with the newest Nutch source.
Both my crawls and parses are failing. (For our fetches we crawl and
parse at the same time with just the default Nutch config, just to get
the outlinks and update the crawldb; later on, after the fetch, we do
another parse with custom parse filters.) The exceptions are below.

This exception happens sometimes when crawling (on the linkdb part of 
the crawl):

Exception in thread "main" java.io.IOException: Not a file: /user/nutch/segments/20060107130328/parse_data/part-00000/data
        at org.apache.nutch.ipc.Client.call(Client.java:294)
        at org.apache.nutch.ipc.RPC$Invoker.invoke(RPC.java:127)
        at $Proxy1.submitJob(Unknown Source)
        at org.apache.nutch.mapred.JobClient.submitJob(JobClient.java:259)
        at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:288)
        at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:131)

We also got this for a while (it seems the mapred/system dir is never 
being created for some reason):
java.io.IOException: Cannot open filename /nutch-data/nutch/tmp/nutch/mapred/system/submit_euiwjv/job.xml
       at org.apache.nutch.ipc.Client.call(Client.java:294)
       at org.apache.nutch.ipc.RPC$Invoker.invoke(RPC.java:127)
       at $Proxy1.open(Unknown Source)
       at org.apache.nutch.ndfs.NDFSClient$NDFSInputStream.openInfo(NDFSClient.java:256)
       at org.apache.nutch.ndfs.NDFSClient$NDFSInputStream.<init>(NDFSClient.java:242)
       at org.apache.nutch.ndfs.NDFSClient.open(NDFSClient.java:79)
       at org.apache.nutch.fs.NDFSFileSystem.openRaw(NDFSFileSystem.java:66)
       at org.apache.nutch.fs.NFSDataInputStream$Checker.<init>(NFSDataInputStream.java:45)
       at org.apache.nutch.fs.NFSDataInputStream.<init>(NFSDataInputStream.java:221)
       at org.apache.nutch.fs.NutchFileSystem.open(NutchFileSystem.java:160)
       at org.apache.nutch.fs.NutchFileSystem.open(NutchFileSystem.java:149)
       at org.apache.nutch.fs.NDFSFileSystem.copyToLocalFile(NDFSFileSystem.java:221)
       at org.apache.nutch.mapred.TaskTracker$TaskInProgress.localizeTask(TaskTracker.java:346)
       at org.apache.nutch.mapred.TaskTracker$TaskInProgress.<init>(TaskTracker.java:332)
       at org.apache.nutch.mapred.TaskTracker.offerService(TaskTracker.java:232)
       at org.apache.nutch.mapred.TaskTracker.run(TaskTracker.java:286)
       at org.apache.nutch.mapred.TaskTracker.main(TaskTracker.java:651)

Then, on parsing, we got this within 10 seconds of the parse starting:

060109 093759 task_m_ltgpnj  Error running child
060109 093759 task_m_ltgpnj java.lang.RuntimeException: java.io.EOFException
060109 093759 task_m_ltgpnj     at org.apache.nutch.io.CompressedWritable.ensureInflated(CompressedWritable.java:57)
060109 093759 task_m_ltgpnj     at org.apache.nutch.protocol.Content.getContent(Content.java:124)
060109 093759 task_m_ltgpnj     at org.apache.nutch.crawl.MD5Signature.calculate(MD5Signature.java:33)
060109 093759 task_m_ltgpnj     at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:62)
060109 093759 task_m_ltgpnj     at org.apache.nutch.mapred.MapRunner.run(MapRunner.java:52)
060109 093759 task_m_ltgpnj     at org.apache.nutch.mapred.MapTask.run(MapTask.java:116)
060109 093759 task_m_ltgpnj     at org.apache.nutch.mapred.TaskTracker$Child.main(TaskTracker.java:603)
060109 093759 task_m_ltgpnj Caused by: java.io.EOFException
060109 093759 task_m_ltgpnj     at java.io.DataInputStream.readFully(DataInputStream.java:268)
060109 093759 task_m_ltgpnj     at org.apache.nutch.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:55)
060109 093759 task_m_ltgpnj     at org.apache.nutch.io.DataOutputBuffer.write(DataOutputBuffer.java:89)
060109 093759 task_m_ltgpnj     at org.apache.nutch.io.UTF8.readChars(UTF8.java:212)
060109 093759 task_m_ltgpnj     at org.apache.nutch.io.UTF8.readString(UTF8.java:204)
060109 093759 task_m_ltgpnj     at org.apache.nutch.protocol.ContentProperties.readFields(ContentProperties.java:169)
060109 093759 task_m_ltgpnj     at org.apache.nutch.protocol.Content.readFieldsCompressed(Content.java:81)
060109 093759 task_m_ltgpnj     at org.apache.nutch.io.CompressedWritable.ensureInflated(CompressedWritable.java:54)
060109 093759 task_m_ltgpnj     ... 6 more
060109 093802 task_m_txrnu3 done; removing files.
060109 093802 Server connection on port 50050 from 127.0.0.2: exiting
060109 093805 task_m_ltgpnj done; removing files.
060109 093805 Lost connection to JobTracker [crawler-d-03.internal.wavefire.ca/127.0.0.2:8050]. ex=java.lang.NullPointerException  Retrying...
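
For context on the "Caused by" part of the trace above: 
java.io.DataInputStream.readFully throws EOFException when the stream 
ends before the requested number of bytes has been read. That is 
consistent with, though not proof of, a record that is shorter than its 
reader expects. The following is a minimal standalone sketch of that 
behaviour only; it is not Nutch code, and the class and method names are 
hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;

public class TruncatedReadSketch {

    /**
     * Tries to read exactly `expected` bytes from `data`.
     * Returns true if the stream ran out before `expected` bytes were read.
     */
    static boolean endsEarly(byte[] data, int expected) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        byte[] buf = new byte[expected];
        try {
            in.readFully(buf);   // blocks until buf is full, or throws EOFException
            return false;
        } catch (EOFException e) {
            return true;         // fewer bytes were available than the reader asked for
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] record = new byte[16];
        System.out.println(endsEarly(record, 16)); // reader and data agree on the length
        System.out.println(endsEarly(record, 32)); // reader expects more than is stored
    }
}
```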

On a different segment we got this instead:
Exception in thread "main" java.io.IOException: No input directories specified in: NutchConf: nutch-default.xml , mapred-default.xml , /nutch-data/nutch/tmp/nutch/mapred/local/jobTracker/job_tn7u97.xml , nutch-site.xml
        at org.apache.nutch.ipc.Client.call(Client.java:294)
        at org.apache.nutch.ipc.RPC$Invoker.invoke(RPC.java:127)
        at $Proxy0.submitJob(Unknown Source)
        at org.apache.nutch.mapred.JobClient.submitJob(JobClient.java:259)
        at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:288)
        at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:95)
        at org.apache.nutch.parse.ParseSegment.main(ParseSegment.java:113)

(I think you usually get this error when you don't put the right 
filenames in the arguments, but that is definitely not the case here.)


These are all tasks on segments which worked fine before we updated the 
source code (previously we had been working with the source from about 
the beginning of December). It's also not a permissions issue, as it all 
worked fine previously. The only things that have changed are the updated 
code and the number of map/reduce tasks in the config. (Side note: what 
is the best number of tasks to use for each? We have a set of 2 machines 
that work together to crawl, and a set of 3 machines that work together 
to parse/index.)

Any help would be much appreciated, as otherwise I am doomed. Thanks 
ahead of time.

-Matt Zytaruk



Re: Crawl and parse exceptions

Posted by Matt Zytaruk <ma...@wavefire.com>.
Unfortunately, the logs have since been overwritten by Nutch, so I can't 
check them, but I am pretty sure those are actually the messages from 
the task tracker log on the remote machine. If I am remembering 
correctly, all that was shown on the master was a short exception saying 
the child failed or something like that. I wish I could be more help, 
but when the jobtracker/tasktrackers were stopped and started, they 
overwrote the log.

-Matt Zytaruk

Doug Cutting wrote:

> Matt Zytaruk wrote:
>
>> Exception in thread "main" java.io.IOException: Not a file: /user/nutch/segments/20060107130328/parse_data/part-00000/data
>>        at org.apache.nutch.ipc.Client.call(Client.java:294)
>
>
> This is an error returned from an RPC call.  There should be more 
> details about this in a slave log, e.g., a better stack trace, some 
> context, etc.  What do you see there?
>
>> We also got this for awhile (seems like the mapred/system dir is 
>> never being created for some reason):
>> java.io.IOException: Cannot open filename /nutch-data/nutch/tmp/nutch/mapred/system/submit_euiwjv/job.xml
>>       at org.apache.nutch.ipc.Client.call(Client.java:294)
>
>
> Again, it would be interesting to see what happened on the other end 
> of this RPC call.  Please look in the remote log.
>
> Doug
>
>


Re: Crawl and parse exceptions

Posted by Doug Cutting <cu...@nutch.org>.
Matt Zytaruk wrote:
> Exception in thread "main" java.io.IOException: Not a file: /user/nutch/segments/20060107130328/parse_data/part-00000/data
>        at org.apache.nutch.ipc.Client.call(Client.java:294)

This is an error returned from an RPC call.  There should be more 
details about this in a slave log, e.g., a better stack trace, some 
context, etc.  What do you see there?

> We also got this for awhile (seems like the mapred/system dir is never 
> being created for some reason):
> java.io.IOException: Cannot open filename /nutch-data/nutch/tmp/nutch/mapred/system/submit_euiwjv/job.xml
>       at org.apache.nutch.ipc.Client.call(Client.java:294)

Again, it would be interesting to see what happened on the other end of 
this RPC call.  Please look in the remote log.

Doug
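
To illustrate why the client-side trace in such a case starts at the RPC 
client and carries no server-side frames, here is a minimal sketch. It 
assumes the error crosses the wire as a text string and is rethrown by 
the client as a new exception. It is not Nutch's ipc code; all class and 
method names are hypothetical, and the "server" is just a local method 
call standing in for the remote end.

```java
import java.io.IOException;

public class RemoteErrorSketch {

    /** Hypothetical server side: does the work and reports any failure as plain text. */
    static String serverHandle(String path) {
        try {
            // Stands in for whatever really failed on the remote machine.
            throw new IOException("Not a file: " + path);
        } catch (IOException e) {
            // Only the text goes back to the caller; the server-side stack
            // trace would have to be looked up in the server's own log.
            return e.toString();
        }
    }

    /** Hypothetical client side: turns the returned error text into a fresh exception. */
    static void clientCall(String path) throws IOException {
        String error = serverHandle(path);
        if (error != null) {
            throw new IOException(error); // stack trace starts here, on the client
        }
    }

    /** Returns the method name of the top frame of the client-side exception. */
    static String topFrameOfClientError(String path) {
        try {
            clientCall(path);
            return null;
        } catch (IOException e) {
            return e.getStackTrace()[0].getMethodName();
        }
    }

    public static void main(String[] args) {
        // The path is a made-up placeholder.
        System.out.println(topFrameOfClientError("/some/hypothetical/path"));
    }
}
```

In this sketch the top frame names the client method that rethrew the 
error, not the place where the failure originally occurred.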

Re: Crawl and parse exceptions

Posted by Matt Zytaruk <ma...@wavefire.com>.
Just a follow-up: I figured out one of the exceptions from my original 
message (Exception in thread "main" java.io.IOException: No input 
directories specified in: NutchConf..), so no worries there. But the 
others are still issues.
