Posted to user@nutch.apache.org by Edward Quick <ed...@hotmail.com> on 2008/09/05 10:46:07 UTC

Job failed!

Hi,

I ran a crawl last night 

bin/nutch crawl urls -dir crawl -depth 10

which collected 10612 pages, and then bailed out with the following error:

Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1062)
        at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:552)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:122)

I checked there was enough space on the box, and there don't appear to be any errors in hadoop.log or the crawl output, so I'm stuck on what caused this.

Also, is there a way to pick up the crawl from where it stopped rather than having to rerun it all over again?

Thanks for any help.

Ed.




FW: Job failed!

Posted by Edward Quick <ed...@hotmail.com>.
For info only. I fixed this problem by removing the mapreduce directory in tmp before running another fetch.
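
A minimal sketch of that cleanup, assuming Hadoop's temporary directory is at its default location of /tmp/hadoop-<username> with the MapReduce scratch data under mapred/local. The function name clear_mapred_local is made up for illustration and the path is an assumption; check hadoop.tmp.dir and mapred.local.dir in your own configuration before deleting anything.

```shell
#!/bin/sh
# Hypothetical helper: remove leftover MapReduce scratch data under a
# Hadoop temp directory. The layout <hadoop_tmp>/mapred/local is an
# assumption based on Hadoop's default settings.
clear_mapred_local() {
    hadoop_tmp=$1
    mapred_local="$hadoop_tmp/mapred/local"
    if [ -d "$mapred_local" ]; then
        # Show how much space the stale job data occupies before removing it.
        du -sh "$mapred_local"
        rm -rf "$mapred_local"
        echo "removed $mapred_local"
    else
        echo "nothing to remove at $mapred_local"
    fi
}
```

Run it only while no job is active, for example: clear_mapred_local "/tmp/hadoop-$USER"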

From: edwardquick@hotmail.com
To: nutch-user@lucene.apache.org; nutch-dev@lucene.apache.org
Subject: FW: Job failed!
Date: Sat, 6 Sep 2008 07:10:11 +0000

Hi,

I reran the fetch and got this error again after 5 hours. Any ideas what causes this?


2008-09-06 04:10:23,062 WARN  mapred.LocalJobRunner - job_local_1
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/jobcache/job_local_1/job_local_1_map_0000/output/spill0.out in any of the configured local directories
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathToRead(LocalDirAllocator.java:359)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathToRead(LocalDirAllocator.java:138)
        at org.apache.hadoop.mapred.MapOutputFile.getSpillFile(MapOutputFile.java:94)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:972)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:698)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:220)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:157)
2008-09-06 04:10:23,860 FATAL fetcher.Fetcher - Fetcher: java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1062)
        at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:552)
        at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:587)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:559)





From: edwardquick@hotmail.com
To: nutch-user@lucene.apache.org
Subject: FW: Job failed!
Date: Fri, 5 Sep 2008 21:58:13 +0000

Sorry for all these posts. I found the problem. Had a dodgy segment, probably the one which was left after the last fetch bombed out.
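
One way to spot such a segment is to look for segments that lack the subdirectories a fully fetched and parsed segment normally has. This is only a sketch: the function name incomplete_segments is made up for illustration, and the list of six subdirectories assumes the fetcher stored content and parsed during the fetch. A segment missing parse_data would explain the invertlinks error quoted further down.

```shell
#!/bin/sh
# Hypothetical helper: report segments that are missing any of the
# subdirectories a fetched and parsed Nutch segment normally contains.
incomplete_segments() {
    segments_dir=$1
    for seg in "$segments_dir"/*/; do
        # Skip the unexpanded pattern when the directory holds no segments.
        [ -d "$seg" ] || continue
        for part in content crawl_fetch crawl_generate crawl_parse parse_data parse_text; do
            if [ ! -d "$seg$part" ]; then
                echo "${seg%/}: missing $part"
            fi
        done
    done
}
```

For the layout in this thread it would be called as: incomplete_segments crawl/segments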

From: edwardquick@hotmail.com
To: nutch-user@lucene.apache.org
Subject: FW: Job failed!
Date: Fri, 5 Sep 2008 21:49:38 +0000

Struth! Here's another problem as well. I'm trying to merge the segments I've created so far:

$ nutch mergesegs crawl/mergesegs_dir -dir crawl/segments
Merging 5 segments to crawl/mergesegs_dir/20080905223155
SegmentMerger:   adding file:/tmp/crawl/segments/20080905141605
SegmentMerger:   adding file:/tmp/crawl/segments/20080905141522
SegmentMerger:   adding file:/tmp/crawl/segments/20080905142231
SegmentMerger:   adding file:/tmp/crawl/segments/20080905153116
SegmentMerger:   adding file:/tmp/crawl/segments/20080905141348
SegmentMerger: using segment data from: crawl_generate

$ find crawl/mergesegs_dir
crawl/mergesegs_dir
crawl/mergesegs_dir/20080905223155
crawl/mergesegs_dir/20080905223155/crawl_generate
crawl/mergesegs_dir/20080905223155/crawl_generate/.part-00000.crc
crawl/mergesegs_dir/20080905223155/crawl_generate/part-00000

But when I run invertlinks, I get an error about a missing path:

$ mv crawl/segments crawl/BACKUPsegments
$ mv crawl/mergesegs_dir crawl/segments
$ nutch invertlinks crawl/linkdb -dir crawl/segments
LinkDb: starting
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/tmp/crawl/segments/20080905223155
LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path doesnt exist : file:/tmp/crawl/segments/20080905223155/parse_data
        at org.apache.hadoop.mapred.FileInputFormat.validateInput(FileInputFormat.java:215)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:705)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:973)
        at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:170)
        at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:285)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:248)



From: edwardquick@hotmail.com
To: nutch-user@lucene.apache.org
Subject: FW: Job failed!
Date: Fri, 5 Sep 2008 21:09:13 +0000

Sort of figured out how to kickstart the crawl again.

Basically did:

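# Newest existing segment (segment names are timestamps, so the last one listed is the newest)
s1=$(ls -d crawl/segments/* | tail -1)
bin/nutch updatedb crawl/crawldb "$s1"
bin/nutch generate crawl/crawldb crawl/segments
# Segment just created by generate
s2=$(ls -d crawl/segments/* | tail -1)
bin/nutch fetch "$s2"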

But unfortunately this is fetching the same urls as the previous fetch. :(
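
For comparison, the usual order for one round is generate, fetch, then updatedb on the segment that was just fetched, so that the crawldb records what was fetched before the next generate picks URLs. A rough sketch of one such round follows. The function name crawl_round and the NUTCH and CRAWL variables are made up for illustration; the subcommands are the same ones used above.

```shell
#!/bin/sh
# NUTCH and CRAWL are overridable; the defaults match the commands in this thread.
NUTCH=${NUTCH:-bin/nutch}
CRAWL=${CRAWL:-crawl}

# Hypothetical helper: run one generate/fetch/updatedb round and stop at
# the first step that fails, so a broken fetch is never fed to updatedb.
crawl_round() {
    "$NUTCH" generate "$CRAWL/crawldb" "$CRAWL/segments" || return 1
    # Segment names are timestamps, so the last one listed is the newest.
    segment=$(ls -d "$CRAWL"/segments/* | tail -1)
    "$NUTCH" fetch "$segment" || return 1
    "$NUTCH" updatedb "$CRAWL/crawldb" "$segment" || return 1
}
```

Calling crawl_round repeatedly (for example in a loop that stops when it returns non-zero) continues an existing crawl one depth level at a time.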

From: edwardquick@hotmail.com
To: nutch-user@lucene.apache.org
Subject: RE: Job failed!
Date: Fri, 5 Sep 2008 09:45:00 +0000

Initially I just did a tail -10, so I thought there were no errors, but there are actually a few. The pdf errors are my fault: I updated the pdf plugin with the latest PDFBox and FontBox jars from cvs on sf.net and missed out parse-pdf.jar on the rebuild. I'm not sure that's the reason why the job failed though. The log is 5MB so I can't really attach it all here, but hopefully the last 200 lines give an indication.
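
With a log this size, a per-logger count of warnings and errors can make the unusual entries stand out from the repeated parser noise. This is a sketch only: the function name summarize_log_problems is made up for illustration, and it assumes the log line layout shown in this thread (date, time, level, logger).

```shell
#!/bin/sh
# Hypothetical helper: count WARN/ERROR/FATAL lines per logger, most
# frequent first. Assumes lines of the form "date time LEVEL logger - message";
# continuation lines such as stack traces are ignored.
summarize_log_problems() {
    awk '$3 ~ /^(WARN|ERROR|FATAL)$/ { print $3, $4 }' "$1" | sort | uniq -c | sort -rn
}
```

For the log quoted here it would be called as: summarize_log_problems hadoop.log.2008-09-05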

By the way, is there a way to kickstart this crawl off again without crawling from the start again?


tail -200 hadoop.log.2008-09-05
2008-09-05 03:41:22,360 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:22,360 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:22,360 WARN  parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:22,360 WARN  parse.ParserFactory - ... 4 more
2008-09-05 03:41:22,360 WARN  parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:22,360 WARN  parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/Premium+Service+Training+insert/$FILE/Premium+training.pdf of type application/pdf
2008-09-05 03:41:22,362 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/CTP+-+Travel+Plan+Objectives?OpenDocument
2008-09-05 03:41:23,616 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptsal.nsf/Content/ctcBA+Home%5CBusTools%5CC%5Ccomp+tckts%5Ccr+comp+tickets?OpenDocument
2008-09-05 03:41:24,745 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/Notes+7+-+93+Rooms?OpenDocument
2008-09-05 03:41:26,033 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/SCCM+January+/$FILE/SCCMonthly+-+NovDec%2C+07+%2808+Jan%2C+08%29.pdf
2008-09-05 03:41:27,215 WARN  parse.ParserFactory - org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:166)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:133)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:67)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:355)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - Caused by: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at java.security.AccessController.doPrivileged(Native Method)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - ... 4 more
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:27,216 WARN  parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/SCCM+January+/$FILE/SCCMonthly+-+NovDec%2C+07+%2808+Jan%2C+08%29.pdf of type application/pdf
2008-09-05 03:41:27,216 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/bani.nsf/Content/XXXXLS%5FQ1Results%5F030807%5CXXXXLS%5FQ1Resultsvideo%5F030807?opendocument
2008-09-05 03:41:28,451 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptflt.nsf/AttachmentsByTitle/Flight+Ops+News+Aug+2008/$FILE/FLIGHT+OPS_AUGUST_08+intranet+live.pdf
2008-09-05 03:41:29,760 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/Virus+2+questions?OpenDocument
2008-09-05 03:41:30,789 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/Gender+Reass+the+process?OpenDocument
2008-09-05 03:41:32,066 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/LGW+Crew+Responsibilities/$FILE/Crew+Responsibilities.doc
2008-09-05 03:41:33,390 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptflt.nsf/Content/Flight+Ops+Home%5CBusiness+Tools%5CFlight+Technical+Services%5CAircraft+Weights+%26+Evaluation%5CFleet+Weights+-+Aircraft+Weighing+Schedules?OpenDocument
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:166)
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:133)
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:67)
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:355)
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - Caused by: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at java.security.AccessController.doPrivileged(Native Method)
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:34,563 WARN  parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:34,563 WARN  parse.ParserFactory - ... 4 more
2008-09-05 03:41:34,563 WARN  parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:34,563 WARN  parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptrix.nsf/AttachmentsByTitle/T5+Retail+-+T5+Ground+Level/$FILE/T5_Ground_Level.pdf of type application/pdf
2008-09-05 03:41:34,564 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/travel/stpg2.nsf/072561aa006322660725618c006b09a0/fc11f85e25deb736802574a30033c99e?OpenDocument
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:166)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:133)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:67)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:355)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - Caused by: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at java.security.AccessController.doPrivileged(Native Method)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - ... 4 more
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:35,926 WARN  parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptrix.nsf/AttachmentsByTitle/Diversity+dignity+at+work+booklet/$FILE/Dignity+at+work+booklet.pdf of type application/pdf
2008-09-05 03:41:35,928 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/communications/wtps1.nsf/$lookup/1D94AD9A45B463638025730100263FDF
2008-09-05 03:41:36,988 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/Barplus+Hints+and+Tips/$FILE/Barplus+Hints+and+Tips.pdf
2008-09-05 03:41:38,217 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/apteng.nsf/Content/Engineering+Home%5CDepartment+Information%5CEngineering+IT+Support+%26+Delivery+Homepage%5CEngineering+Solution+Group+%28ESG%29+Homepage%5CKey+user+Guides?OpenDocument
2008-09-05 03:41:41,143 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/Cultural+Awareness+Photo+Prize+Draw?OpenDocument
2008-09-05 03:41:42,278 WARN  parse.ParserFactory - org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:166)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:133)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:67)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:355)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - Caused by: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at java.security.AccessController.doPrivileged(Native Method)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - ... 4 more
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:42,279 WARN  parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptflt.nsf/AttachmentsByTitle/Flight+Ops+News+Aug+2008/$FILE/FLIGHT+OPS_AUGUST_08+intranet+live.pdf of type application/pdf
2008-09-05 03:41:42,313 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptsal.nsf/Content/ctcBA+Home%5CBusTools%5CB%5Cbah%5CPromos+pckge%5CFlrda+08+EBO+WTP+upgde?OpenDocument
2008-09-05 03:41:42,342 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/PMA+EG904+timescales?OpenDocument
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:166)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:133)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:67)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:355)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - Caused by: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at java.security.AccessController.doPrivileged(Native Method)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - ... 4 more
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:52,279 WARN  parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/Barplus+Hints+and+Tips/$FILE/Barplus+Hints+and+Tips.pdf of type application/pdf
2008-09-05 03:41:55,927 WARN  mapred.LocalJobRunner - job_local_21
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for taskTracker/jobcache/job_local_21/job_local_21_map_0000/output/file.out
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:313)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
        at org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:61)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:982)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:698)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:220)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:157)
2008-09-05 09:32:46,906 INFO  searcher.NutchBean - opening indexes in crawl/indexes
2008-09-05 09:32:47,002 INFO  plugin.PluginRepository - Plugins: looking in: /ok/appl/nutch-2008-09-04_04-01-27/plugins
2008-09-05 09:32:47,305 INFO  plugin.PluginRepository - Plugin Auto-activation mode: [true]
2008-09-05 09:32:47,305 INFO  plugin.PluginRepository - Registered Plugins:
2008-09-05 09:32:47,305 INFO  plugin.PluginRepository -         CyberNeko HTML Parser (lib-nekohtml)
2008-09-05 09:32:47,305 INFO  plugin.PluginRepository -         MSPowerPoint Parse Plug-in (parse-mspowerpoint)
2008-09-05 09:32:47,305 INFO  plugin.PluginRepository -         Site Query Filter (query-site)
2008-09-05 09:32:47,305 INFO  plugin.PluginRepository -         Http / Https Protocol Plug-in (protocol-httpclient)
2008-09-05 09:32:47,305 INFO  plugin.PluginRepository -         MSWord Parse Plug-in (parse-msword)
2008-09-05 09:32:47,305 INFO  plugin.PluginRepository -         Basic URL Normalizer (urlnormalizer-basic)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Pass-through URL Normalizer (urlnormalizer-pass)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Html Parse Plug-in (parse-html)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Regex URL Filter Framework (lib-regex-filter)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Basic Indexing Filter (index-basic)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Pdf Parse Plug-in (parse-pdf)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Basic Summarizer Plug-in (summary-basic)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         MSExcel Parse Plug-in (parse-msexcel)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Text Parse Plug-in (parse-text)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Jakarta POI - Java API To Access Microsoft Format Files (lib-jakarta-poi)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Regex URL Filter (urlfilter-regex)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Basic Query Filter (query-basic)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         HTTP Framework (lib-http)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         URL Query Filter (query-url)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Regex URL Normalizer (urlnormalizer-regex)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Parse MS Documents Framework (lib-parsems)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Zip Parse Plug-in (parse-zip)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         the nutch core extension points (nutch-extensionpoints)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         OPIC Scoring Plug-in (scoring-opic)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository - Registered Extension-Points:
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Nutch Summarizer (org.apache.nutch.searcher.Summarizer)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Nutch Protocol (org.apache.nutch.protocol.Protocol)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Nutch URL Filter (org.apache.nutch.net.URLFilter)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Nutch Online Search Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Nutch Content Parser (org.apache.nutch.parse.Parser)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Ontology Model Loader (org.apache.nutch.ontology.Ontology)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
2008-09-05 09:32:47,342 INFO  searcher.NutchBean - opening segments in crawl/segments
2008-09-05 09:32:47,368 INFO  searcher.SummarizerFactory - Using the first summarizer extension found: Basic Summarizer
2008-09-05 09:32:47,371 INFO  searcher.NutchBean - opening linkdb in crawl/linkdb
2008-09-05 09:32:52,746 INFO  searcher.NutchBean - opening indexes in crawl/indexes
2008-09-05 09:32:52,791 INFO  plugin.PluginRepository - Plugins: looking in: /ok/appl/nutch-2008-09-04_04-01-27/plugins
2008-09-05 09:32:52,999 INFO  plugin.PluginRepository - Plugin Auto-activation mode: [true]
2008-09-05 09:32:52,999 INFO  plugin.PluginRepository - Registered Plugins:
2008-09-05 09:32:52,999 INFO  plugin.PluginRepository -         CyberNeko HTML Parser (lib-nekohtml)



> Subject: Re: Job failed!
> From: zhengsj03@163.com
> To: nutch-user@lucene.apache.org
> Date: Fri, 5 Sep 2008 17:28:47 +0800
> 
> Could you show the whole hadoop.log?
> On Fri, 2008-09-05 at 08:46 +0000, Edward Quick wrote:
> > Hi,
> > 
> > I ran a crawl last night 
> > 
> > bin/nutch crawl urls -dir crawl -depth 10
> > 
> > which collected 10612 pages, and then bailed out with the following error:
> > 
> > Exception in thread "main" java.io.IOException: Job failed!
> >         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1062)
> >         at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:552)
> >         at org.apache.nutch.crawl.Crawl.main(Crawl.java:122)
> > 
> > I checked there was enough space on the box, and there don't appear to be any errors in hadoop.log or the crawl output, so I'm stuck on what caused this.
> > 
> > Also, is there a way to pick up the crawl from where it stopped rather than having to rerun it all over again?
> > 
> > Thanks for any help.
> > 
> > Ed.
> > 
> > 
> > 
> 
> 


2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - Caused by: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at java.security.AccessController.doPrivileged(Native Method)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - ... 4 more
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:27,216 WARN  parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/SCCM+January+/$FILE/SCCMonthly+-+NovDec%2C+07+%2808+Jan%2C+08%29.pdf of type application/pdf
2008-09-05 03:41:27,216 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/bani.nsf/Content/XXXXLS%5FQ1Results%5F030807%5CXXXXLS%5FQ1Resultsvideo%5F030807?opendocument
2008-09-05 03:41:28,451 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptflt.nsf/AttachmentsByTitle/Flight+Ops+News+Aug+2008/$FILE/FLIGHT+OPS_AUGUST_08+intranet+live.pdf
2008-09-05 03:41:29,760 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/Virus+2+questions?OpenDocument
2008-09-05 03:41:30,789 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/Gender+Reass+the+process?OpenDocument
2008-09-05 03:41:32,066 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/LGW+Crew+Responsibilities/$FILE/Crew+Responsibilities.doc
2008-09-05 03:41:33,390 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptflt.nsf/Content/Flight+Ops+Home%5CBusiness+Tools%5CFlight+Technical+Services%5CAircraft+Weights+%26+Evaluation%5CFleet+Weights+-+Aircraft+Weighing+Schedules?OpenDocument
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:166)
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:133)
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:67)
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:355)
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - Caused by: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at java.security.AccessController.doPrivileged(Native Method)
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:34,563 WARN  parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:34,563 WARN  parse.ParserFactory - ... 4 more
2008-09-05 03:41:34,563 WARN  parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:34,563 WARN  parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptrix.nsf/AttachmentsByTitle/T5+Retail+-+T5+Ground+Level/$FILE/T5_Ground_Level.pdf of type application/pdf
2008-09-05 03:41:34,564 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/travel/stpg2.nsf/072561aa006322660725618c006b09a0/fc11f85e25deb736802574a30033c99e?OpenDocument
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:166)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:133)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:67)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:355)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - Caused by: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at java.security.AccessController.doPrivileged(Native Method)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - ... 4 more
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:35,926 WARN  parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptrix.nsf/AttachmentsByTitle/Diversity+dignity+at+work+booklet/$FILE/Dignity+at+work+booklet.pdf of type application/pdf
2008-09-05 03:41:35,928 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/communications/wtps1.nsf/$lookup/1D94AD9A45B463638025730100263FDF
2008-09-05 03:41:36,988 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/Barplus+Hints+and+Tips/$FILE/Barplus+Hints+and+Tips.pdf
2008-09-05 03:41:38,217 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/apteng.nsf/Content/Engineering+Home%5CDepartment+Information%5CEngineering+IT+Support+%26+Delivery+Homepage%5CEngineering+Solution+Group+%28ESG%29+Homepage%5CKey+user+Guides?OpenDocument
2008-09-05 03:41:41,143 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/Cultural+Awareness+Photo+Prize+Draw?OpenDocument
2008-09-05 03:41:42,278 WARN  parse.ParserFactory - org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:166)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:133)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:67)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:355)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - Caused by: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at java.security.AccessController.doPrivileged(Native Method)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - ... 4 more
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:42,279 WARN  parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptflt.nsf/AttachmentsByTitle/Flight+Ops+News+Aug+2008/$FILE/FLIGHT+OPS_AUGUST_08+intranet+live.pdf of type application/pdf
2008-09-05 03:41:42,313 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptsal.nsf/Content/ctcBA+Home%5CBusTools%5CB%5Cbah%5CPromos+pckge%5CFlrda+08+EBO+WTP+upgde?OpenDocument
2008-09-05 03:41:42,342 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/PMA+EG904+timescales?OpenDocument
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:166)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:133)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:67)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:355)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - Caused by: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at java.security.AccessController.doPrivileged(Native Method)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - ... 4 more
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:52,279 WARN  parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/Barplus+Hints+and+Tips/$FILE/Barplus+Hints+and+Tips.pdf of type application/pdf
2008-09-05 03:41:55,927 WARN  mapred.LocalJobRunner - job_local_21
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for taskTracker/jobcache/job_local_21/job_local_21_map_0000/output/file.out
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:313)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
        at org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:61)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:982)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:698)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:220)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:157)
2008-09-05 09:32:46,906 INFO  searcher.NutchBean - opening indexes in crawl/indexes
2008-09-05 09:32:47,002 INFO  plugin.PluginRepository - Plugins: looking in: /ok/appl/nutch-2008-09-04_04-01-27/plugins
2008-09-05 09:32:47,305 INFO  plugin.PluginRepository - Plugin Auto-activation mode: [true]
2008-09-05 09:32:47,305 INFO  plugin.PluginRepository - Registered Plugins:
2008-09-05 09:32:47,305 INFO  plugin.PluginRepository -         CyberNeko HTML Parser (lib-nekohtml)
2008-09-05 09:32:47,305 INFO  plugin.PluginRepository -         MSPowerPoint Parse Plug-in (parse-mspowerpoint)
2008-09-05 09:32:47,305 INFO  plugin.PluginRepository -         Site Query Filter (query-site)
2008-09-05 09:32:47,305 INFO  plugin.PluginRepository -         Http / Https Protocol Plug-in (protocol-httpclient)
2008-09-05 09:32:47,305 INFO  plugin.PluginRepository -         MSWord Parse Plug-in (parse-msword)
2008-09-05 09:32:47,305 INFO  plugin.PluginRepository -         Basic URL Normalizer (urlnormalizer-basic)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Pass-through URL Normalizer (urlnormalizer-pass)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Html Parse Plug-in (parse-html)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Regex URL Filter Framework (lib-regex-filter)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Basic Indexing Filter (index-basic)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Pdf Parse Plug-in (parse-pdf)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Basic Summarizer Plug-in (summary-basic)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         MSExcel Parse Plug-in (parse-msexcel)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Text Parse Plug-in (parse-text)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Jakarta POI - Java API To Access Microsoft Format Files (lib-jakarta-poi)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Regex URL Filter (urlfilter-regex)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Basic Query Filter (query-basic)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         HTTP Framework (lib-http)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         URL Query Filter (query-url)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Regex URL Normalizer (urlnormalizer-regex)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Parse MS Documents Framework (lib-parsems)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Zip Parse Plug-in (parse-zip)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         the nutch core extension points (nutch-extensionpoints)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         OPIC Scoring Plug-in (scoring-opic)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository - Registered Extension-Points:
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Nutch Summarizer (org.apache.nutch.searcher.Summarizer)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Nutch Protocol (org.apache.nutch.protocol.Protocol)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Nutch URL Filter (org.apache.nutch.net.URLFilter)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Nutch Online Search Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Nutch Content Parser (org.apache.nutch.parse.Parser)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Ontology Model Loader (org.apache.nutch.ontology.Ontology)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
2008-09-05 09:32:47,342 INFO  searcher.NutchBean - opening segments in crawl/segments
2008-09-05 09:32:47,368 INFO  searcher.SummarizerFactory - Using the first summarizer extension found: Basic Summarizer
2008-09-05 09:32:47,371 INFO  searcher.NutchBean - opening linkdb in crawl/linkdb
2008-09-05 09:32:52,746 INFO  searcher.NutchBean - opening indexes in crawl/indexes
2008-09-05 09:32:52,791 INFO  plugin.PluginRepository - Plugins: looking in: /ok/appl/nutch-2008-09-04_04-01-27/plugins
2008-09-05 09:32:52,999 INFO  plugin.PluginRepository - Plugin Auto-activation mode: [true]
2008-09-05 09:32:52,999 INFO  plugin.PluginRepository - Registered Plugins:
2008-09-05 09:32:52,999 INFO  plugin.PluginRepository -         CyberNeko HTML Parser (lib-nekohtml)
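
Because the repeated parse-pdf warnings dominate the tail of the log, the one unrelated failure (the DiskErrorException from mapred.LocalJobRunner) is easy to miss. A small sketch for filtering a log down to exception and error lines that do not come from parse.ParserFactory; the function name other_errors is hypothetical, and the patterns are assumptions based only on the lines pasted here.

```shell
# Print log lines that mention an exception or are logged at ERROR/FATAL,
# excluding the parse.ParserFactory warnings. Line numbers refer to the
# original file. Hypothetical helper.
# Usage: other_errors hadoop.log.2008-09-05
other_errors() {
    grep -n -E 'Exception|ERROR|FATAL' "$1" | grep -v 'parse\.ParserFactory'
}
```

Run against the full log rather than the last 200 lines, this should surface the first non-PDF failure along with its line number.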



> Subject: Re: Job failed!
> From: zhengsj03@163.com
> To: nutch-user@lucene.apache.org
> Date: Fri, 5 Sep 2008 17:28:47 +0800
> 
> Could you show the whole hadoop.log?
> On Fri, 2008-09-05 at 08:46 +0000, Edward Quick wrote:
> > Hi,
> > 
> > I ran a crawl last night 
> > 
> > bin/nutch crawl urls -dir crawl -depth 10
> > 
> > which collected 10612 pages, and then bailed out with the following error:
> > 
> > Exception in thread "main" java.io.IOException: Job failed!
> >         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1062)
> >         at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:552)
> >         at org.apache.nutch.crawl.Crawl.main(Crawl.java:122)
> > 
> > I checked there was enough space on the box, and there don't appear to be any errors in hadoop.log or the crawl output, so I'm stuck on what caused this.
> > 
> > Also, is there a way to pick up the crawl from where it stopped rather than having to rerun it all over again?
> > 
> > Thanks for any help.
> > 
> > Ed.
> > 
> > 
> > 
> 
> 


RE: Job failed!

Posted by Edward Quick <ed...@hotmail.com>.
Initially I just did a tail -10, so I thought there were no errors, but there are actually a few. The PDF errors are my fault: I updated the PDF plugin with the latest PDFBox and FontBox jars from CVS on sf.net and left parse-pdf.jar out of the rebuild. I'm not sure that's why the job failed, though. The log is 5MB so I can't attach it all here, but hopefully the last 200 lines give an indication.

By the way, is there a way to restart this crawl without crawling from the start again?


tail -200 hadoop.log.2008-09-05
2008-09-05 03:41:22,360 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:22,360 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:22,360 WARN  parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:22,360 WARN  parse.ParserFactory - ... 4 more
2008-09-05 03:41:22,360 WARN  parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:22,360 WARN  parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/Premium+Service+Training+insert/$FILE/Premium+training.pdf of type application/pdf
2008-09-05 03:41:22,362 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/CTP+-+Travel+Plan+Objectives?OpenDocument
2008-09-05 03:41:23,616 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptsal.nsf/Content/ctcBA+Home%5CBusTools%5CC%5Ccomp+tckts%5Ccr+comp+tickets?OpenDocument
2008-09-05 03:41:24,745 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/Notes+7+-+93+Rooms?OpenDocument
2008-09-05 03:41:26,033 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/SCCM+January+/$FILE/SCCMonthly+-+NovDec%2C+07+%2808+Jan%2C+08%29.pdf
2008-09-05 03:41:27,215 WARN  parse.ParserFactory - org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:166)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:133)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:67)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:355)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - Caused by: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at java.security.AccessController.doPrivileged(Native Method)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - ... 4 more
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:27,216 WARN  parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/SCCM+January+/$FILE/SCCMonthly+-+NovDec%2C+07+%2808+Jan%2C+08%29.pdf of type application/pdf
2008-09-05 03:41:27,216 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/bani.nsf/Content/XXXXLS%5FQ1Results%5F030807%5CXXXXLS%5FQ1Resultsvideo%5F030807?opendocument
2008-09-05 03:41:28,451 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptflt.nsf/AttachmentsByTitle/Flight+Ops+News+Aug+2008/$FILE/FLIGHT+OPS_AUGUST_08+intranet+live.pdf
2008-09-05 03:41:29,760 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/Virus+2+questions?OpenDocument
2008-09-05 03:41:30,789 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/Gender+Reass+the+process?OpenDocument
2008-09-05 03:41:32,066 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/LGW+Crew+Responsibilities/$FILE/Crew+Responsibilities.doc
2008-09-05 03:41:33,390 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptflt.nsf/Content/Flight+Ops+Home%5CBusiness+Tools%5CFlight+Technical+Services%5CAircraft+Weights+%26+Evaluation%5CFleet+Weights+-+Aircraft+Weighing+Schedules?OpenDocument
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:166)
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:133)
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:67)
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:355)
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - Caused by: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at java.security.AccessController.doPrivileged(Native Method)
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:34,563 WARN  parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:34,563 WARN  parse.ParserFactory - ... 4 more
2008-09-05 03:41:34,563 WARN  parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:34,563 WARN  parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptrix.nsf/AttachmentsByTitle/T5+Retail+-+T5+Ground+Level/$FILE/T5_Ground_Level.pdf of type application/pdf
2008-09-05 03:41:34,564 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/travel/stpg2.nsf/072561aa006322660725618c006b09a0/fc11f85e25deb736802574a30033c99e?OpenDocument
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:166)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:133)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:67)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:355)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - Caused by: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at java.security.AccessController.doPrivileged(Native Method)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - ... 4 more
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:35,926 WARN  parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptrix.nsf/AttachmentsByTitle/Diversity+dignity+at+work+booklet/$FILE/Dignity+at+work+booklet.pdf of type application/pdf
2008-09-05 03:41:35,928 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/communications/wtps1.nsf/$lookup/1D94AD9A45B463638025730100263FDF
2008-09-05 03:41:36,988 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/Barplus+Hints+and+Tips/$FILE/Barplus+Hints+and+Tips.pdf
2008-09-05 03:41:38,217 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/apteng.nsf/Content/Engineering+Home%5CDepartment+Information%5CEngineering+IT+Support+%26+Delivery+Homepage%5CEngineering+Solution+Group+%28ESG%29+Homepage%5CKey+user+Guides?OpenDocument
2008-09-05 03:41:41,143 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/Cultural+Awareness+Photo+Prize+Draw?OpenDocument
2008-09-05 03:41:42,278 WARN  parse.ParserFactory - org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:166)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:133)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:67)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:355)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - Caused by: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at java.security.AccessController.doPrivileged(Native Method)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - ... 4 more
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:42,279 WARN  parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptflt.nsf/AttachmentsByTitle/Flight+Ops+News+Aug+2008/$FILE/FLIGHT+OPS_AUGUST_08+intranet+live.pdf of type application/pdf
2008-09-05 03:41:42,313 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptsal.nsf/Content/ctcBA+Home%5CBusTools%5CB%5Cbah%5CPromos+pckge%5CFlrda+08+EBO+WTP+upgde?OpenDocument
2008-09-05 03:41:42,342 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/PMA+EG904+timescales?OpenDocument
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:166)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:133)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:67)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:355)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - Caused by: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at java.security.AccessController.doPrivileged(Native Method)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - ... 4 more
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:52,279 WARN  parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/Barplus+Hints+and+Tips/$FILE/Barplus+Hints+and+Tips.pdf of type application/pdf
2008-09-05 03:41:55,927 WARN  mapred.LocalJobRunner - job_local_21
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for taskTracker/jobcache/job_local_21/job_local_21_map_0000/output/file.out
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:313)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
        at org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:61)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:982)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:698)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:220)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:157)
2008-09-05 09:32:46,906 INFO  searcher.NutchBean - opening indexes in crawl/indexes
2008-09-05 09:32:47,002 INFO  plugin.PluginRepository - Plugins: looking in: /ok/appl/nutch-2008-09-04_04-01-27/plugins
2008-09-05 09:32:47,305 INFO  plugin.PluginRepository - Plugin Auto-activation mode: [true]
2008-09-05 09:32:47,305 INFO  plugin.PluginRepository - Registered Plugins:
2008-09-05 09:32:47,305 INFO  plugin.PluginRepository -         CyberNeko HTML Parser (lib-nekohtml)
2008-09-05 09:32:47,305 INFO  plugin.PluginRepository -         MSPowerPoint Parse Plug-in (parse-mspowerpoint)
2008-09-05 09:32:47,305 INFO  plugin.PluginRepository -         Site Query Filter (query-site)
2008-09-05 09:32:47,305 INFO  plugin.PluginRepository -         Http / Https Protocol Plug-in (protocol-httpclient)
2008-09-05 09:32:47,305 INFO  plugin.PluginRepository -         MSWord Parse Plug-in (parse-msword)
2008-09-05 09:32:47,305 INFO  plugin.PluginRepository -         Basic URL Normalizer (urlnormalizer-basic)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Pass-through URL Normalizer (urlnormalizer-pass)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Html Parse Plug-in (parse-html)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Regex URL Filter Framework (lib-regex-filter)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Basic Indexing Filter (index-basic)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Pdf Parse Plug-in (parse-pdf)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Basic Summarizer Plug-in (summary-basic)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         MSExcel Parse Plug-in (parse-msexcel)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Text Parse Plug-in (parse-text)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Jakarta POI - Java API To Access Microsoft Format Files (lib-jakarta-poi)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Regex URL Filter (urlfilter-regex)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Basic Query Filter (query-basic)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         HTTP Framework (lib-http)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         URL Query Filter (query-url)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Regex URL Normalizer (urlnormalizer-regex)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Parse MS Documents Framework (lib-parsems)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Zip Parse Plug-in (parse-zip)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         the nutch core extension points (nutch-extensionpoints)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         OPIC Scoring Plug-in (scoring-opic)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository - Registered Extension-Points:
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Nutch Summarizer (org.apache.nutch.searcher.Summarizer)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Nutch Protocol (org.apache.nutch.protocol.Protocol)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Nutch URL Filter (org.apache.nutch.net.URLFilter)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Nutch Online Search Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Nutch Content Parser (org.apache.nutch.parse.Parser)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Ontology Model Loader (org.apache.nutch.ontology.Ontology)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
2008-09-05 09:32:47,342 INFO  searcher.NutchBean - opening segments in crawl/segments
2008-09-05 09:32:47,368 INFO  searcher.SummarizerFactory - Using the first summarizer extension found: Basic Summarizer
2008-09-05 09:32:47,371 INFO  searcher.NutchBean - opening linkdb in crawl/linkdb
2008-09-05 09:32:52,746 INFO  searcher.NutchBean - opening indexes in crawl/indexes
2008-09-05 09:32:52,791 INFO  plugin.PluginRepository - Plugins: looking in: /ok/appl/nutch-2008-09-04_04-01-27/plugins
2008-09-05 09:32:52,999 INFO  plugin.PluginRepository - Plugin Auto-activation mode: [true]
2008-09-05 09:32:52,999 INFO  plugin.PluginRepository - Registered Plugins:
2008-09-05 09:32:52,999 INFO  plugin.PluginRepository -         CyberNeko HTML Parser (lib-nekohtml)



> Subject: Re: Job failed!
> From: zhengsj03@163.com
> To: nutch-user@lucene.apache.org
> Date: Fri, 5 Sep 2008 17:28:47 +0800
> 
> Could you show the whole hadoop.log?
> On Fri, 2008-09-05 at 08:46 +0000, Edward Quick wrote:
> > Hi,
> > 
> > I ran a crawl last night 
> > 
> > bin/nutch crawl urls -dir crawl -depth 10
> > 
> > which collected 10612 pages, and then bailed out with the following error:
> > 
> > Exception in thread "main" java.io.IOException: Job failed!
> >         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1062)
> >         at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:552)
> >         at org.apache.nutch.crawl.Crawl.main(Crawl.java:122)
> > 
> > I checked there was enough space on the box, and there don't appear to be any errors in hadoop.log or the crawl output, so I'm stuck on what caused this.
> > 
> > Also, is there a way to pick up the crawl from where it stopped rather than having to rerun it all over again?
> > 
> > Thanks for any help.
> > 
> > Ed.
> > 
> 
> 


FW: Job failed!

Posted by Edward Quick <ed...@hotmail.com>.
Struth! Here's another problem as well. I'm trying to merge the segments I've created so far:

$ nutch mergesegs crawl/mergesegs_dir -dir crawl/segments
Merging 5 segments to crawl/mergesegs_dir/20080905223155
SegmentMerger:   adding file:/tmp/crawl/segments/20080905141605
SegmentMerger:   adding file:/tmp/crawl/segments/20080905141522
SegmentMerger:   adding file:/tmp/crawl/segments/20080905142231
SegmentMerger:   adding file:/tmp/crawl/segments/20080905153116
SegmentMerger:   adding file:/tmp/crawl/segments/20080905141348
SegmentMerger: using segment data from: crawl_generate

$ find crawl/mergesegs_dir
crawl/mergesegs_dir
crawl/mergesegs_dir/20080905223155
crawl/mergesegs_dir/20080905223155/crawl_generate
crawl/mergesegs_dir/20080905223155/crawl_generate/.part-00000.crc
crawl/mergesegs_dir/20080905223155/crawl_generate/part-00000

But when I run invertlinks, I get an error about a missing path:

$ mv crawl/segments crawl/BACKUPsegments
$ mv crawl/mergesegs_dir crawl/segments
$ nutch invertlinks crawl/linkdb -dir crawl/segments
LinkDb: starting
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/tmp/crawl/segments/20080905223155
LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path doesnt exist : file:/tmp/crawl/segments/20080905223155/parse_data
        at org.apache.hadoop.mapred.FileInputFormat.validateInput(FileInputFormat.java:215)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:705)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:973)
        at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:170)
        at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:285)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:248)



From: edwardquick@hotmail.com
To: nutch-user@lucene.apache.org
Subject: FW: Job failed!
Date: Fri, 5 Sep 2008 21:09:13 +0000

Sort of figured out how to kickstart the crawl again.

Basically did:

# newest segment from the interrupted run
s1=$(ls -d crawl/segments/* | tail -1)
bin/nutch updatedb crawl/crawldb "$s1"
bin/nutch generate crawl/crawldb crawl/segments
# segment just created by generate
s2=$(ls -d crawl/segments/* | tail -1)
bin/nutch fetch "$s2"

But unfortunately this is fetching the same urls as the previous fetch. :(
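
The steps above amount to one round of the generate/fetch/updatedb cycle. Below is a rough sketch of such a round wrapped in functions. `latest_segment`, `resume_round` and the `NUTCH` variable are hypothetical names, not part of Nutch. It runs updatedb only after the fetch of the new segment has succeeded, so the crawldb records what was fetched before the next generate. Whether that ordering explains the repeated URLs here is an assumption, not something confirmed in this thread.

```shell
#!/bin/sh
# Print the newest segment under directory $1. Segment names are
# timestamps, so a lexical sort puts the newest last.
latest_segment() {
    ls -d "$1"/* 2>/dev/null | sort | tail -1
}

# Run one generate/fetch/updatedb round against crawl directory $1.
# NUTCH is a hypothetical override for the path to the nutch script.
resume_round() {
    crawl=$1
    nutch=${NUTCH:-bin/nutch}
    "$nutch" generate "$crawl/crawldb" "$crawl/segments" || return 1
    seg=$(latest_segment "$crawl/segments")
    [ -n "$seg" ] || return 1
    "$nutch" fetch "$seg" || return 1
    "$nutch" updatedb "$crawl/crawldb" "$seg"
}
```

Calling `resume_round crawl` once per remaining depth level would approximate what the one-shot crawl command does, stopping at the first step that fails.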

From: edwardquick@hotmail.com
To: nutch-user@lucene.apache.org
Subject: RE: Job failed!
Date: Fri, 5 Sep 2008 09:45:00 +0000

Initially I just did a tail -10, so I thought there were no errors, but there are actually a few. The PDF errors are my fault: I updated the pdf plugin with the latest PDFBox and FontBox jars from CVS on sf.net and left parse-pdf.jar out of the rebuild. I'm not sure that's why the job failed, though. The log is 5MB so I can't really attach it all here, but hopefully the last 200 lines give an indication.
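
The ClassNotFoundException for org.apache.nutch.parse.pdf.PdfParser in the log is consistent with the plugin's own jar being absent. A minimal sketch of a check to run before crawling, assuming each parser plugin lives in plugins/<id>/ with a plugin.xml and a jar named <id>.jar (an assumption about the layout; the function name `check_plugin_jar` is hypothetical):

```shell
#!/bin/sh
# Return 0 if plugin $2 under plugins directory $1 has both its
# plugin.xml descriptor and its own jar; otherwise report what is
# missing on stderr and return 1.
check_plugin_jar() {
    dir=$1/$2
    status=0
    for f in plugin.xml "$2.jar"; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $dir/$f" >&2
            status=1
        fi
    done
    return $status
}
```

For the install in this log, that would be something like `check_plugin_jar /ok/appl/nutch-2008-09-04_04-01-27/plugins parse-pdf`.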

By the way, is there a way to kickstart this crawl off again without crawling from the start again?


tail -200 hadoop.log.2008-09-05
2008-09-05 03:41:22,360 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:22,360 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:22,360 WARN  parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:22,360 WARN  parse.ParserFactory - ... 4 more
2008-09-05 03:41:22,360 WARN  parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:22,360 WARN  parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/Premium+Service+Training+insert/$FILE/Premium+training.pdf of type application/pdf
2008-09-05 03:41:22,362 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/CTP+-+Travel+Plan+Objectives?OpenDocument
2008-09-05 03:41:23,616 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptsal.nsf/Content/ctcBA+Home%5CBusTools%5CC%5Ccomp+tckts%5Ccr+comp+tickets?OpenDocument
2008-09-05 03:41:24,745 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/Notes+7+-+93+Rooms?OpenDocument
2008-09-05 03:41:26,033 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/SCCM+January+/$FILE/SCCMonthly+-+NovDec%2C+07+%2808+Jan%2C+08%29.pdf
2008-09-05 03:41:27,215 WARN  parse.ParserFactory - org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:166)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:133)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:67)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:355)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - Caused by: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at java.security.AccessController.doPrivileged(Native Method)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - ... 4 more
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:27,216 WARN  parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/SCCM+January+/$FILE/SCCMonthly+-+NovDec%2C+07+%2808+Jan%2C+08%29.pdf of type application/pdf
2008-09-05 03:41:27,216 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/bani.nsf/Content/XXXXLS%5FQ1Results%5F030807%5CXXXXLS%5FQ1Resultsvideo%5F030807?opendocument
2008-09-05 03:41:28,451 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptflt.nsf/AttachmentsByTitle/Flight+Ops+News+Aug+2008/$FILE/FLIGHT+OPS_AUGUST_08+intranet+live.pdf
2008-09-05 03:41:29,760 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/Virus+2+questions?OpenDocument
2008-09-05 03:41:30,789 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/Gender+Reass+the+process?OpenDocument
2008-09-05 03:41:32,066 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/LGW+Crew+Responsibilities/$FILE/Crew+Responsibilities.doc
2008-09-05 03:41:33,390 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptflt.nsf/Content/Flight+Ops+Home%5CBusiness+Tools%5CFlight+Technical+Services%5CAircraft+Weights+%26+Evaluation%5CFleet+Weights+-+Aircraft+Weighing+Schedules?OpenDocument
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:166)
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:133)
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:67)
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:355)
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - Caused by: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at java.security.AccessController.doPrivileged(Native Method)
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:34,563 WARN  parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:34,563 WARN  parse.ParserFactory - ... 4 more
2008-09-05 03:41:34,563 WARN  parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:34,563 WARN  parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptrix.nsf/AttachmentsByTitle/T5+Retail+-+T5+Ground+Level/$FILE/T5_Ground_Level.pdf of type application/pdf
2008-09-05 03:41:34,564 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/travel/stpg2.nsf/072561aa006322660725618c006b09a0/fc11f85e25deb736802574a30033c99e?OpenDocument
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:166)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:133)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:67)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:355)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - Caused by: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at java.security.AccessController.doPrivileged(Native Method)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - ... 4 more
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:35,926 WARN  parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptrix.nsf/AttachmentsByTitle/Diversity+dignity+at+work+booklet/$FILE/Dignity+at+work+booklet.pdf of type application/pdf
2008-09-05 03:41:35,928 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/communications/wtps1.nsf/$lookup/1D94AD9A45B463638025730100263FDF
2008-09-05 03:41:36,988 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/Barplus+Hints+and+Tips/$FILE/Barplus+Hints+and+Tips.pdf
2008-09-05 03:41:38,217 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/apteng.nsf/Content/Engineering+Home%5CDepartment+Information%5CEngineering+IT+Support+%26+Delivery+Homepage%5CEngineering+Solution+Group+%28ESG%29+Homepage%5CKey+user+Guides?OpenDocument
2008-09-05 03:41:41,143 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/Cultural+Awareness+Photo+Prize+Draw?OpenDocument
2008-09-05 03:41:42,278 WARN  parse.ParserFactory - org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:166)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:133)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:67)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:355)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - Caused by: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at java.security.AccessController.doPrivileged(Native Method)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - ... 4 more
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:42,279 WARN  parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptflt.nsf/AttachmentsByTitle/Flight+Ops+News+Aug+2008/$FILE/FLIGHT+OPS_AUGUST_08+intranet+live.pdf of type application/pdf
2008-09-05 03:41:42,313 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptsal.nsf/Content/ctcBA+Home%5CBusTools%5CB%5Cbah%5CPromos+pckge%5CFlrda+08+EBO+WTP+upgde?OpenDocument
2008-09-05 03:41:42,342 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/PMA+EG904+timescales?OpenDocument
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:166)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:133)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:67)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:355)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - Caused by: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at java.security.AccessController.doPrivileged(Native Method)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - ... 4 more
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:52,279 WARN  parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/Barplus+Hints+and+Tips/$FILE/Barplus+Hints+and+Tips.pdf of type application/pdf
2008-09-05 03:41:55,927 WARN  mapred.LocalJobRunner - job_local_21
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for taskTracker/jobcache/job_local_21/job_local_21_map_0000/output/file.out
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:313)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
        at org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:61)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:982)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:698)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:220)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:157)
2008-09-05 09:32:46,906 INFO  searcher.NutchBean - opening indexes in crawl/indexes
2008-09-05 09:32:47,002 INFO  plugin.PluginRepository - Plugins: looking in: /ok/appl/nutch-2008-09-04_04-01-27/plugins
2008-09-05 09:32:47,305 INFO  plugin.PluginRepository - Plugin Auto-activation mode: [true]
2008-09-05 09:32:47,305 INFO  plugin.PluginRepository - Registered Plugins:
2008-09-05 09:32:47,305 INFO  plugin.PluginRepository -         CyberNeko HTML Parser (lib-nekohtml)
2008-09-05 09:32:47,305 INFO  plugin.PluginRepository -         MSPowerPoint Parse Plug-in (parse-mspowerpoint)
2008-09-05 09:32:47,305 INFO  plugin.PluginRepository -         Site Query Filter (query-site)
2008-09-05 09:32:47,305 INFO  plugin.PluginRepository -         Http / Https Protocol Plug-in (protocol-httpclient)
2008-09-05 09:32:47,305 INFO  plugin.PluginRepository -         MSWord Parse Plug-in (parse-msword)
2008-09-05 09:32:47,305 INFO  plugin.PluginRepository -         Basic URL Normalizer (urlnormalizer-basic)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Pass-through URL Normalizer (urlnormalizer-pass)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Html Parse Plug-in (parse-html)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Regex URL Filter Framework (lib-regex-filter)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Basic Indexing Filter (index-basic)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Pdf Parse Plug-in (parse-pdf)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Basic Summarizer Plug-in (summary-basic)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         MSExcel Parse Plug-in (parse-msexcel)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Text Parse Plug-in (parse-text)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Jakarta POI - Java API To Access Microsoft Format Files (lib-jakarta-poi)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Regex URL Filter (urlfilter-regex)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Basic Query Filter (query-basic)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         HTTP Framework (lib-http)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         URL Query Filter (query-url)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Regex URL Normalizer (urlnormalizer-regex)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Parse MS Documents Framework (lib-parsems)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Zip Parse Plug-in (parse-zip)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         the nutch core extension points (nutch-extensionpoints)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         OPIC Scoring Plug-in (scoring-opic)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository - Registered Extension-Points:
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Nutch Summarizer (org.apache.nutch.searcher.Summarizer)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Nutch Protocol (org.apache.nutch.protocol.Protocol)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Nutch URL Filter (org.apache.nutch.net.URLFilter)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Nutch Online Search Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Nutch Content Parser (org.apache.nutch.parse.Parser)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Ontology Model Loader (org.apache.nutch.ontology.Ontology)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
2008-09-05 09:32:47,342 INFO  searcher.NutchBean - opening segments in crawl/segments
2008-09-05 09:32:47,368 INFO  searcher.SummarizerFactory - Using the first summarizer extension found: Basic Summarizer
2008-09-05 09:32:47,371 INFO  searcher.NutchBean - opening linkdb in crawl/linkdb
2008-09-05 09:32:52,746 INFO  searcher.NutchBean - opening indexes in crawl/indexes
2008-09-05 09:32:52,791 INFO  plugin.PluginRepository - Plugins: looking in: /ok/appl/nutch-2008-09-04_04-01-27/plugins
2008-09-05 09:32:52,999 INFO  plugin.PluginRepository - Plugin Auto-activation mode: [true]
2008-09-05 09:32:52,999 INFO  plugin.PluginRepository - Registered Plugins:
2008-09-05 09:32:52,999 INFO  plugin.PluginRepository -         CyberNeko HTML Parser (lib-nekohtml)



> Subject: Re: Job failed!
> From: zhengsj03@163.com
> To: nutch-user@lucene.apache.org
> Date: Fri, 5 Sep 2008 17:28:47 +0800
> 
> Could you show the whole hadoop.log?
> On Fri, 2008-09-05 at 08:46 +0000, Edward Quick wrote:
> > Hi,
> > 
> > I ran a crawl last night 
> > 
> > bin/nutch crawl urls -dir crawl -depth 10
> > 
> > which collected 10612 pages, and then bailed out with the following error:
> > 
> > Exception in thread "main" java.io.IOException: Job failed!
> >         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1062)
> >         at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:552)
> >         at org.apache.nutch.crawl.Crawl.main(Crawl.java:122)
> > 
> > I checked there was enough space on the box, and there don't appear to be any errors in hadoop.log or the crawl output, so I'm stuck on what caused this.
> > 
> > Also, is there a way to pick up the crawl from where it stopped rather than having to rerun it all over again?
> > 
> > Thanks for any help.
> > 
> > Ed.
> > 
> > 
> > 
> > _________________________________________________________________
> > Discover Bird's Eye View now with Multimap from Live Search
> > http://clk.atdmt.com/UKM/go/111354026/direct/01/
> 
> 

Get Hotmail on your mobile from Vodafone  Try it Now



FW: Job failed!

Posted by Edward Quick <ed...@hotmail.com>.
Sort of figured out how to kickstart the crawl again.

Basically did:

# newest existing segment
s1=$(ls -d crawl/segments/* | tail -1)
bin/nutch updatedb crawl/crawldb "$s1"
bin/nutch generate crawl/crawldb crawl/segments
# segment just created by generate
s2=$(ls -d crawl/segments/* | tail -1)
bin/nutch fetch "$s2"

But unfortunately this is fetching the same urls as the previous fetch. :(
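A minimal sketch of the resume steps above, wrapped so the segment name is captured in a real shell variable each round. The names NUTCH, CRAWL_DIR, latest_segment and crawl_round are illustrative and not part of Nutch. It assumes, as the log in this thread suggests, that parsing happens inside the fetch step, so no separate parse command is run.

```shell
#!/bin/sh
# Sketch only: NUTCH, CRAWL_DIR and the function names are assumptions
# made for illustration, not Nutch conventions.
NUTCH="${NUTCH:-bin/nutch}"
CRAWL_DIR="${CRAWL_DIR:-crawl}"

# Print the newest segment directory. Segment names are timestamps,
# so lexical order is also chronological order.
latest_segment() {
    ls -d "$1"/segments/* 2>/dev/null | sort | tail -1
}

# Run one generate/fetch/updatedb round on top of the existing crawldb.
crawl_round() {
    "$NUTCH" generate "$CRAWL_DIR/crawldb" "$CRAWL_DIR/segments" || return 1
    seg=$(latest_segment "$CRAWL_DIR")
    [ -n "$seg" ] || return 1
    "$NUTCH" fetch "$seg" || return 1
    "$NUTCH" updatedb "$CRAWL_DIR/crawldb" "$seg"
}

# Possible usage: fold the last good segment into the crawldb once,
# then run further rounds.
#   "$NUTCH" updatedb "$CRAWL_DIR/crawldb" "$(latest_segment "$CRAWL_DIR")"
#   crawl_round
```

If updatedb never receives a segment path, the crawldb may not record what was already fetched, which would be one possible reason for generate selecting the same URLs again.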

From: edwardquick@hotmail.com
To: nutch-user@lucene.apache.org
Subject: RE: Job failed!
Date: Fri, 5 Sep 2008 09:45:00 +0000

Initially I just did a tail -10 and thought there were no errors, but there are actually a few. The PDF errors are my fault: I updated the pdf plugin with the latest PDFBox and FontBox jars from CVS on sf.net and left parse-pdf.jar out of the rebuild. I'm not sure that is why the job failed, though. The log is 5MB, so I can't attach it all here, but hopefully the last 200 lines give an indication.
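Since a short tail can miss the relevant entries, a filter over the whole log may be more reliable. This is a rough sketch: log_problems is a made-up helper name, and the pattern assumes the timestamp and level layout seen in the log lines quoted in this thread.

```shell
#!/bin/sh
# Sketch only: log_problems is a hypothetical helper, not a Nutch or Hadoop tool.
# Print WARN/ERROR/FATAL lines (with line numbers) from a log file, leaving
# out the repetitive parse-plugin warnings so other failures stand out.
log_problems() {
    grep -nE '^[0-9-]+ [0-9:,]+ (WARN|ERROR|FATAL) ' "$1" | grep -v ' parse\.Parse'
}

# Possible usage:
#   log_problems hadoop.log.2008-09-05 | tail -20
```

Stack trace lines carry no timestamp, so they are not printed; the line numbers show where to look in the full log.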

By the way, is there a way to kickstart this crawl off again without crawling from the start again?


tail -200 hadoop.log.2008-09-05
2008-09-05 03:41:22,360 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:22,360 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:22,360 WARN  parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:22,360 WARN  parse.ParserFactory - ... 4 more
2008-09-05 03:41:22,360 WARN  parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:22,360 WARN  parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/Premium+Service+Training+insert/$FILE/Premium+training.pdf of type application/pdf
2008-09-05 03:41:22,362 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/CTP+-+Travel+Plan+Objectives?OpenDocument
2008-09-05 03:41:23,616 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptsal.nsf/Content/ctcBA+Home%5CBusTools%5CC%5Ccomp+tckts%5Ccr+comp+tickets?OpenDocument
2008-09-05 03:41:24,745 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/Notes+7+-+93+Rooms?OpenDocument
2008-09-05 03:41:26,033 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/SCCM+January+/$FILE/SCCMonthly+-+NovDec%2C+07+%2808+Jan%2C+08%29.pdf
2008-09-05 03:41:27,215 WARN  parse.ParserFactory - org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:166)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:133)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:67)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:355)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - Caused by: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at java.security.AccessController.doPrivileged(Native Method)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - ... 4 more
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:27,216 WARN  parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/SCCM+January+/$FILE/SCCMonthly+-+NovDec%2C+07+%2808+Jan%2C+08%29.pdf of type application/pdf
2008-09-05 03:41:27,216 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/bani.nsf/Content/XXXXLS%5FQ1Results%5F030807%5CXXXXLS%5FQ1Resultsvideo%5F030807?opendocument
2008-09-05 03:41:28,451 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptflt.nsf/AttachmentsByTitle/Flight+Ops+News+Aug+2008/$FILE/FLIGHT+OPS_AUGUST_08+intranet+live.pdf
2008-09-05 03:41:29,760 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/Virus+2+questions?OpenDocument
2008-09-05 03:41:30,789 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/Gender+Reass+the+process?OpenDocument
2008-09-05 03:41:32,066 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/LGW+Crew+Responsibilities/$FILE/Crew+Responsibilities.doc
2008-09-05 03:41:33,390 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptflt.nsf/Content/Flight+Ops+Home%5CBusiness+Tools%5CFlight+Technical+Services%5CAircraft+Weights+%26+Evaluation%5CFleet+Weights+-+Aircraft+Weighing+Schedules?OpenDocument
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:166)
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:133)
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:67)
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:355)
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - Caused by: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at java.security.AccessController.doPrivileged(Native Method)
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:34,563 WARN  parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:34,563 WARN  parse.ParserFactory - ... 4 more
2008-09-05 03:41:34,563 WARN  parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:34,563 WARN  parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptrix.nsf/AttachmentsByTitle/T5+Retail+-+T5+Ground+Level/$FILE/T5_Ground_Level.pdf of type application/pdf
2008-09-05 03:41:34,564 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/travel/stpg2.nsf/072561aa006322660725618c006b09a0/fc11f85e25deb736802574a30033c99e?OpenDocument
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:166)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:133)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:67)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:355)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - Caused by: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at java.security.AccessController.doPrivileged(Native Method)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - ... 4 more
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:35,926 WARN  parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptrix.nsf/AttachmentsByTitle/Diversity+dignity+at+work+booklet/$FILE/Dignity+at+work+booklet.pdf of type application/pdf
2008-09-05 03:41:35,928 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/communications/wtps1.nsf/$lookup/1D94AD9A45B463638025730100263FDF
2008-09-05 03:41:36,988 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/Barplus+Hints+and+Tips/$FILE/Barplus+Hints+and+Tips.pdf
2008-09-05 03:41:38,217 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/apteng.nsf/Content/Engineering+Home%5CDepartment+Information%5CEngineering+IT+Support+%26+Delivery+Homepage%5CEngineering+Solution+Group+%28ESG%29+Homepage%5CKey+user+Guides?OpenDocument
2008-09-05 03:41:41,143 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/Cultural+Awareness+Photo+Prize+Draw?OpenDocument
2008-09-05 03:41:42,278 WARN  parse.ParserFactory - org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:166)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:133)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:67)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:355)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - Caused by: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at java.security.AccessController.doPrivileged(Native Method)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - ... 4 more
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:42,279 WARN  parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptflt.nsf/AttachmentsByTitle/Flight+Ops+News+Aug+2008/$FILE/FLIGHT+OPS_AUGUST_08+intranet+live.pdf of type application/pdf
2008-09-05 03:41:42,313 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptsal.nsf/Content/ctcBA+Home%5CBusTools%5CB%5Cbah%5CPromos+pckge%5CFlrda+08+EBO+WTP+upgde?OpenDocument
2008-09-05 03:41:42,342 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/PMA+EG904+timescales?OpenDocument
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:166)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:133)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:67)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:355)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - Caused by: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at java.security.AccessController.doPrivileged(Native Method)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - ... 4 more
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:52,279 WARN  parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/Barplus+Hints+and+Tips/$FILE/Barplus+Hints+and+Tips.pdf of type application/pdf
2008-09-05 03:41:55,927 WARN  mapred.LocalJobRunner - job_local_21
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for taskTracker/jobcache/job_local_21/job_local_21_map_0000/output/file.out
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:313)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
        at org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:61)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:982)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:698)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:220)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:157)
2008-09-05 09:32:46,906 INFO  searcher.NutchBean - opening indexes in crawl/indexes
2008-09-05 09:32:47,002 INFO  plugin.PluginRepository - Plugins: looking in: /ok/appl/nutch-2008-09-04_04-01-27/plugins
2008-09-05 09:32:47,305 INFO  plugin.PluginRepository - Plugin Auto-activation mode: [true]
2008-09-05 09:32:47,305 INFO  plugin.PluginRepository - Registered Plugins:
2008-09-05 09:32:47,305 INFO  plugin.PluginRepository -         CyberNeko HTML Parser (lib-nekohtml)
2008-09-05 09:32:47,305 INFO  plugin.PluginRepository -         MSPowerPoint Parse Plug-in (parse-mspowerpoint)
2008-09-05 09:32:47,305 INFO  plugin.PluginRepository -         Site Query Filter (query-site)
2008-09-05 09:32:47,305 INFO  plugin.PluginRepository -         Http / Https Protocol Plug-in (protocol-httpclient)
2008-09-05 09:32:47,305 INFO  plugin.PluginRepository -         MSWord Parse Plug-in (parse-msword)
2008-09-05 09:32:47,305 INFO  plugin.PluginRepository -         Basic URL Normalizer (urlnormalizer-basic)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Pass-through URL Normalizer (urlnormalizer-pass)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Html Parse Plug-in (parse-html)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Regex URL Filter Framework (lib-regex-filter)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Basic Indexing Filter (index-basic)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Pdf Parse Plug-in (parse-pdf)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Basic Summarizer Plug-in (summary-basic)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         MSExcel Parse Plug-in (parse-msexcel)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Text Parse Plug-in (parse-text)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Jakarta POI - Java API To Access Microsoft Format Files (lib-jakarta-poi)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Regex URL Filter (urlfilter-regex)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Basic Query Filter (query-basic)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         HTTP Framework (lib-http)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         URL Query Filter (query-url)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Regex URL Normalizer (urlnormalizer-regex)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Parse MS Documents Framework (lib-parsems)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Zip Parse Plug-in (parse-zip)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         the nutch core extension points (nutch-extensionpoints)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         OPIC Scoring Plug-in (scoring-opic)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository - Registered Extension-Points:
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Nutch Summarizer (org.apache.nutch.searcher.Summarizer)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Nutch Protocol (org.apache.nutch.protocol.Protocol)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Nutch URL Filter (org.apache.nutch.net.URLFilter)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Nutch Online Search Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Nutch Content Parser (org.apache.nutch.parse.Parser)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Ontology Model Loader (org.apache.nutch.ontology.Ontology)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
2008-09-05 09:32:47,342 INFO  searcher.NutchBean - opening segments in crawl/segments
2008-09-05 09:32:47,368 INFO  searcher.SummarizerFactory - Using the first summarizer extension found: Basic Summarizer
2008-09-05 09:32:47,371 INFO  searcher.NutchBean - opening linkdb in crawl/linkdb
2008-09-05 09:32:52,746 INFO  searcher.NutchBean - opening indexes in crawl/indexes
2008-09-05 09:32:52,791 INFO  plugin.PluginRepository - Plugins: looking in: /ok/appl/nutch-2008-09-04_04-01-27/plugins
2008-09-05 09:32:52,999 INFO  plugin.PluginRepository - Plugin Auto-activation mode: [true]
2008-09-05 09:32:52,999 INFO  plugin.PluginRepository - Registered Plugins:
2008-09-05 09:32:52,999 INFO  plugin.PluginRepository -         CyberNeko HTML Parser (lib-nekohtml)



> Subject: Re: Job failed!
> From: zhengsj03@163.com
> To: nutch-user@lucene.apache.org
> Date: Fri, 5 Sep 2008 17:28:47 +0800
> 
> Could you show the whole hadoop.log?
> On Fri, 2008-09-05 at 08:46 +0000, Edward Quick wrote:
> > Hi,
> > 
> > I ran a crawl last night 
> > 
> > bin/nutch crawl urls -dir crawl -depth 10
> > 
> > which collected 10612 pages, and then bailed out with the following error:
> > 
> > Exception in thread "main" java.io.IOException: Job failed!
> >         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1062)
> >         at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:552)
> >         at org.apache.nutch.crawl.Crawl.main(Crawl.java:122)
> > 
> > I checked there was enough space on the box, and there don't appear to be any errors in hadoop.log or the crawl output, so I'm stuck on what caused this.
> > 
> > Also, is there a way to pick up the crawl from where it stopped rather than having to rerun it all over again?
> > 
> > Thanks for any help.
> > 
> > Ed.
> > 
> > 
> > 
> > _________________________________________________________________
> > Discover Bird's Eye View now with Multimap from Live Search
> > http://clk.atdmt.com/UKM/go/111354026/direct/01/
> 
> 



FW: Job failed!

Posted by Edward Quick <ed...@hotmail.com>.
Hi,

I reran the fetch and got this error again after 5 hours. Any ideas what causes this?


2008-09-06 04:10:23,062 WARN  mapred.LocalJobRunner - job_local_1
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/jobcache/job_local_1/job_local_1_map_0000/output/spill0.out in any of the configured local directories
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathToRead(LocalDirAllocator.java:359)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathToRead(LocalDirAllocator.java:138)
        at org.apache.hadoop.mapred.MapOutputFile.getSpillFile(MapOutputFile.java:94)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:972)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:698)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:220)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:157)
2008-09-06 04:10:23,860 FATAL fetcher.Fetcher - Fetcher: java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1062)
        at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:552)
        at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:587)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:559)
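Both DiskErrorException traces in this thread concern Hadoop's local working directories, so a quick check of that location before rerunning may help narrow things down. The sketch below is only an illustration: check_local_dir is a made-up helper, and the path in the usage comment assumes the Hadoop defaults of that era (hadoop.tmp.dir under /tmp/hadoop-<user>, with mapred/local beneath it), which may not match this installation.

```shell
#!/bin/sh
# Sketch only: check_local_dir is a hypothetical helper.
# Report whether a directory exists, is writable, and how much space is free.
check_local_dir() {
    dir=$1
    if [ ! -d "$dir" ]; then
        echo "missing: $dir"
        return 1
    fi
    if [ ! -w "$dir" ]; then
        echo "not writable: $dir"
        return 1
    fi
    # df -Pk prints POSIX-format output in 1K blocks; field 4 of the
    # second line is the available space.
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    echo "ok: $dir (${avail_kb} KB free)"
}

# Possible usage (path is an assumed default, adjust to your configuration):
#   check_local_dir "/tmp/hadoop-$USER/mapred/local"
```

Free space at the time of checking says nothing about the state hours earlier, and files under /tmp can be removed by periodic cleanup jobs on some systems, so a missing spill file does not necessarily mean the disk was full.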





From: edwardquick@hotmail.com
To: nutch-user@lucene.apache.org
Subject: FW: Job failed!
Date: Fri, 5 Sep 2008 21:58:13 +0000

Sorry for all these posts. I found the problem: there was a dodgy segment, probably the one left behind when the last fetch bombed out.

From: edwardquick@hotmail.com
To: nutch-user@lucene.apache.org
Subject: FW: Job failed!
Date: Fri, 5 Sep 2008 21:49:38 +0000

Struth! Here's another problem as well. I'm trying to merge the segments I've created so far:

$ nutch mergesegs crawl/mergesegs_dir -dir crawl/segments
Merging 5 segments to crawl/mergesegs_dir/20080905223155
SegmentMerger:   adding file:/tmp/crawl/segments/20080905141605
SegmentMerger:   adding file:/tmp/crawl/segments/20080905141522
SegmentMerger:   adding file:/tmp/crawl/segments/20080905142231
SegmentMerger:   adding file:/tmp/crawl/segments/20080905153116
SegmentMerger:   adding file:/tmp/crawl/segments/20080905141348
SegmentMerger: using segment data from: crawl_generate

$ find crawl/mergesegs_dir
crawl/mergesegs_dir
crawl/mergesegs_dir/20080905223155
crawl/mergesegs_dir/20080905223155/crawl_generate
crawl/mergesegs_dir/20080905223155/crawl_generate/.part-00000.crc
crawl/mergesegs_dir/20080905223155/crawl_generate/part-00000

But when I run invertlinks, I get an error about a missing path:

$ mv crawl/segments crawl/BACKUPsegments
$ mv crawl/mergesegs_dir crawl/segments
$ nutch invertlinks crawl/linkdb -dir crawl/segments
LinkDb: starting
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/tmp/crawl/segments/20080905223155
LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path doesnt exist : file:/tmp/crawl/segments/20080905223155/parse_data
        at org.apache.hadoop.mapred.FileInputFormat.validateInput(FileInputFormat.java:215)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:705)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:973)
        at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:170)
        at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:285)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:248)
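
The merged segment above contains only crawl_generate, while invertlinks looks for parse_data. A minimal sketch for spotting incomplete segments before merging, assuming a fully fetched and parsed segment holds the six subdirectories listed in the loop (that list and the function name incomplete_segments are assumptions for illustration, not taken from the thread):

```shell
#!/bin/sh
# Hypothetical helper: print every segment under $1 that lacks one of the
# subdirectories a fetched-and-parsed segment is ASSUMED to contain.
incomplete_segments() {
    segdir=$1
    for seg in "$segdir"/*; do
        # Skip anything that is not a directory (including an unmatched glob).
        [ -d "$seg" ] || continue
        for part in crawl_generate crawl_fetch content crawl_parse parse_data parse_text; do
            if [ ! -d "$seg/$part" ]; then
                echo "$seg: missing $part"
            fi
        done
    done
}

# Example call (prints nothing if every segment looks complete):
#   incomplete_segments crawl/segments
```

A segment reported here, such as one left behind by an interrupted fetch, could be moved aside before running mergesegs and invertlinks.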



From: edwardquick@hotmail.com
To: nutch-user@lucene.apache.org
Subject: FW: Job failed!
Date: Fri, 5 Sep 2008 21:09:13 +0000








Sort of figured out how to kickstart the crawl again.

Basically did:

# most recent existing segment
s1=$(ls -d crawl/segments/* | tail -1)
bin/nutch updatedb crawl/crawldb "$s1"
bin/nutch generate crawl/crawldb crawl/segments
# segment just created by generate
s2=$(ls -d crawl/segments/* | tail -1)
bin/nutch fetch "$s2"

But unfortunately this is fetching the same urls as the previous fetch. :(
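
For comparison, here is a minimal sketch of a single round in the usual generate, fetch, updatedb order, where updatedb runs on the segment that was just fetched. The function names crawl_round and latest_segment and the NUTCH variable are hypothetical; only the generate, fetch, and updatedb subcommands come from the thread. It is not presented as a confirmed fix for the refetching described above.

```shell
#!/bin/sh
# Path to the nutch launcher; override NUTCH to point elsewhere (assumption).
NUTCH=${NUTCH:-bin/nutch}

# Print the lexically last entry under $1. Segment names are timestamps,
# so this is the newest segment.
latest_segment() {
    ls -d "$1"/* 2>/dev/null | tail -1
}

# Run one round: generate a fetch list, fetch it, then fold the results
# back into the crawldb. Stops at the first step that fails.
crawl_round() {
    crawldb=$1
    segments=$2
    "$NUTCH" generate "$crawldb" "$segments" || return 1
    seg=$(latest_segment "$segments")
    "$NUTCH" fetch "$seg" || return 1
    "$NUTCH" updatedb "$crawldb" "$seg"
}

# Example call:
#   crawl_round crawl/crawldb crawl/segments
```

Because each step returns early on failure, updatedb is never run against a segment whose fetch did not finish.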

From: edwardquick@hotmail.com
To: nutch-user@lucene.apache.org
Subject: RE: Job failed!
Date: Fri, 5 Sep 2008 09:45:00 +0000








Initially I just did a tail -10 and thought there were no errors, but there are actually a few. The PDF errors are my fault: I updated the pdf plugin with the latest PDFBox and FontBox jars from CVS on sf.net and left parse-pdf.jar out of the rebuild. I'm not sure that's why the job failed, though. The log is 5MB so I can't really attach it all here, but hopefully the last 200 lines give an indication.
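
Since a short tail can miss problems earlier in a large log, a rough sketch like the following counts WARN, ERROR, and FATAL entries per logger across the whole file. It assumes the line layout seen in the excerpt below (date, time, level, logger as the first four fields); the function name log_summary is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: count WARN/ERROR/FATAL lines per logger in the log
# file given as $1, most frequent first. Stack-trace continuation lines are
# ignored because their third field is not a log level.
log_summary() {
    awk '$3 ~ /^(WARN|ERROR|FATAL)$/ { n[$3 " " $4]++ }
         END { for (k in n) print n[k], k }' "$1" | sort -rn
}

# Example call:
#   log_summary hadoop.log.2008-09-05
```

Each output line has the form "count LEVEL logger", so a rare FATAL entry stays visible next to thousands of repeated parser warnings.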

By the way, is there a way to kickstart this crawl off again without crawling from the start again?


tail -200 hadoop.log.2008-09-05
2008-09-05 03:41:22,360 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:22,360 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:22,360 WARN  parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:22,360 WARN  parse.ParserFactory - ... 4 more
2008-09-05 03:41:22,360 WARN  parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:22,360 WARN  parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/Premium+Service+Training+insert/$FILE/Premium+training.pdf of type application/pdf
2008-09-05 03:41:22,362 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/CTP+-+Travel+Plan+Objectives?OpenDocument
2008-09-05 03:41:23,616 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptsal.nsf/Content/ctcBA+Home%5CBusTools%5CC%5Ccomp+tckts%5Ccr+comp+tickets?OpenDocument
2008-09-05 03:41:24,745 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/Notes+7+-+93+Rooms?OpenDocument
2008-09-05 03:41:26,033 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/SCCM+January+/$FILE/SCCMonthly+-+NovDec%2C+07+%2808+Jan%2C+08%29.pdf
2008-09-05 03:41:27,215 WARN  parse.ParserFactory - org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:166)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:133)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:67)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:355)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - Caused by: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at java.security.AccessController.doPrivileged(Native Method)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - ... 4 more
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:27,216 WARN  parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/SCCM+January+/$FILE/SCCMonthly+-+NovDec%2C+07+%2808+Jan%2C+08%29.pdf of type application/pdf
2008-09-05 03:41:27,216 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/bani.nsf/Content/XXXXLS%5FQ1Results%5F030807%5CXXXXLS%5FQ1Resultsvideo%5F030807?opendocument
2008-09-05 03:41:28,451 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptflt.nsf/AttachmentsByTitle/Flight+Ops+News+Aug+2008/$FILE/FLIGHT+OPS_AUGUST_08+intranet+live.pdf
2008-09-05 03:41:29,760 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/Virus+2+questions?OpenDocument
2008-09-05 03:41:30,789 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/Gender+Reass+the+process?OpenDocument
2008-09-05 03:41:32,066 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/LGW+Crew+Responsibilities/$FILE/Crew+Responsibilities.doc
2008-09-05 03:41:33,390 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptflt.nsf/Content/Flight+Ops+Home%5CBusiness+Tools%5CFlight+Technical+Services%5CAircraft+Weights+%26+Evaluation%5CFleet+Weights+-+Aircraft+Weighing+Schedules?OpenDocument
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:166)
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:133)
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:67)
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:355)
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - Caused by: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at java.security.AccessController.doPrivileged(Native Method)
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:34,563 WARN  parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:34,563 WARN  parse.ParserFactory - ... 4 more
2008-09-05 03:41:34,563 WARN  parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:34,563 WARN  parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptrix.nsf/AttachmentsByTitle/T5+Retail+-+T5+Ground+Level/$FILE/T5_Ground_Level.pdf of type application/pdf
2008-09-05 03:41:34,564 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/travel/stpg2.nsf/072561aa006322660725618c006b09a0/fc11f85e25deb736802574a30033c99e?OpenDocument
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:166)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:133)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:67)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:355)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - Caused by: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at java.security.AccessController.doPrivileged(Native Method)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - ... 4 more
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:35,926 WARN  parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptrix.nsf/AttachmentsByTitle/Diversity+dignity+at+work+booklet/$FILE/Dignity+at+work+booklet.pdf of type application/pdf
2008-09-05 03:41:35,928 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/communications/wtps1.nsf/$lookup/1D94AD9A45B463638025730100263FDF
2008-09-05 03:41:36,988 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/Barplus+Hints+and+Tips/$FILE/Barplus+Hints+and+Tips.pdf
2008-09-05 03:41:38,217 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/apteng.nsf/Content/Engineering+Home%5CDepartment+Information%5CEngineering+IT+Support+%26+Delivery+Homepage%5CEngineering+Solution+Group+%28ESG%29+Homepage%5CKey+user+Guides?OpenDocument
2008-09-05 03:41:41,143 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/Cultural+Awareness+Photo+Prize+Draw?OpenDocument
2008-09-05 03:41:42,278 WARN  parse.ParserFactory - org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:166)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:133)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:67)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:355)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - Caused by: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at java.security.AccessController.doPrivileged(Native Method)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - ... 4 more
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:42,279 WARN  parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptflt.nsf/AttachmentsByTitle/Flight+Ops+News+Aug+2008/$FILE/FLIGHT+OPS_AUGUST_08+intranet+live.pdf of type application/pdf
2008-09-05 03:41:42,313 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptsal.nsf/Content/ctcBA+Home%5CBusTools%5CB%5Cbah%5CPromos+pckge%5CFlrda+08+EBO+WTP+upgde?OpenDocument
2008-09-05 03:41:42,342 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/PMA+EG904+timescales?OpenDocument
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:166)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:133)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:67)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:355)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - Caused by: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at java.security.AccessController.doPrivileged(Native Method)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - ... 4 more
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:52,279 WARN  parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/Barplus+Hints+and+Tips/$FILE/Barplus+Hints+and+Tips.pdf of type application/pdf
2008-09-05 03:41:55,927 WARN  mapred.LocalJobRunner - job_local_21
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for taskTracker/jobcache/job_local_21/job_local_21_map_0000/output/file.out
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:313)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
        at org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:61)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:982)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:698)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:220)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:157)
2008-09-05 09:32:46,906 INFO  searcher.NutchBean - opening indexes in crawl/indexes
2008-09-05 09:32:47,002 INFO  plugin.PluginRepository - Plugins: looking in: /ok/appl/nutch-2008-09-04_04-01-27/plugins
2008-09-05 09:32:47,305 INFO  plugin.PluginRepository - Plugin Auto-activation mode: [true]
2008-09-05 09:32:47,305 INFO  plugin.PluginRepository - Registered Plugins:
2008-09-05 09:32:47,305 INFO  plugin.PluginRepository -         CyberNeko HTML Parser (lib-nekohtml)
2008-09-05 09:32:47,305 INFO  plugin.PluginRepository -         MSPowerPoint Parse Plug-in (parse-mspowerpoint)
2008-09-05 09:32:47,305 INFO  plugin.PluginRepository -         Site Query Filter (query-site)
2008-09-05 09:32:47,305 INFO  plugin.PluginRepository -         Http / Https Protocol Plug-in (protocol-httpclient)
2008-09-05 09:32:47,305 INFO  plugin.PluginRepository -         MSWord Parse Plug-in (parse-msword)
2008-09-05 09:32:47,305 INFO  plugin.PluginRepository -         Basic URL Normalizer (urlnormalizer-basic)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Pass-through URL Normalizer (urlnormalizer-pass)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Html Parse Plug-in (parse-html)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Regex URL Filter Framework (lib-regex-filter)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Basic Indexing Filter (index-basic)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Pdf Parse Plug-in (parse-pdf)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Basic Summarizer Plug-in (summary-basic)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         MSExcel Parse Plug-in (parse-msexcel)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Text Parse Plug-in (parse-text)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Jakarta POI - Java API To Access Microsoft Format Files (lib-jakarta-poi)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Regex URL Filter (urlfilter-regex)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Basic Query Filter (query-basic)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         HTTP Framework (lib-http)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         URL Query Filter (query-url)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Regex URL Normalizer (urlnormalizer-regex)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Parse MS Documents Framework (lib-parsems)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Zip Parse Plug-in (parse-zip)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         the nutch core extension points (nutch-extensionpoints)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         OPIC Scoring Plug-in (scoring-opic)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository - Registered Extension-Points:
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Nutch Summarizer (org.apache.nutch.searcher.Summarizer)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Nutch Protocol (org.apache.nutch.protocol.Protocol)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Nutch URL Filter (org.apache.nutch.net.URLFilter)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Nutch Online Search Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Nutch Content Parser (org.apache.nutch.parse.Parser)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Ontology Model Loader (org.apache.nutch.ontology.Ontology)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
2008-09-05 09:32:47,342 INFO  searcher.NutchBean - opening segments in crawl/segments
2008-09-05 09:32:47,368 INFO  searcher.SummarizerFactory - Using the first summarizer extension found: Basic Summarizer
2008-09-05 09:32:47,371 INFO  searcher.NutchBean - opening linkdb in crawl/linkdb
2008-09-05 09:32:52,746 INFO  searcher.NutchBean - opening indexes in crawl/indexes
2008-09-05 09:32:52,791 INFO  plugin.PluginRepository - Plugins: looking in: /ok/appl/nutch-2008-09-04_04-01-27/plugins
2008-09-05 09:32:52,999 INFO  plugin.PluginRepository - Plugin Auto-activation mode: [true]
2008-09-05 09:32:52,999 INFO  plugin.PluginRepository - Registered Plugins:
2008-09-05 09:32:52,999 INFO  plugin.PluginRepository -         CyberNeko HTML Parser (lib-nekohtml)



> Subject: Re: Job failed!
> From: zhengsj03@163.com
> To: nutch-user@lucene.apache.org
> Date: Fri, 5 Sep 2008 17:28:47 +0800
> 
> Could you show the whole hadoop.log?
> On Fri, 2008-09-05 at 08:46 +0000, Edward Quick wrote:
> > Hi,
> > 
> > I ran a crawl last night 
> > 
> > bin/nutch crawl urls -dir crawl -depth 10
> > 
> > which collected 10612 pages, and then bailed out with the following error:
> > 
> > Exception in thread "main" java.io.IOException: Job failed!
> >         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1062)
> >         at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:552)
> >         at org.apache.nutch.crawl.Crawl.main(Crawl.java:122)
> > 
> > I checked there was enough space on the box, and there don't appear to be any errors in hadoop.log or the crawl output, so I'm stuck on what caused this.
> > 
> > Also, is there a way to pick up the crawl from where it stopped rather than having to rerun it all over again?
> > 
> > Thanks for any help.
> > 
> > Ed.
> 
> 


FW: Job failed!

Posted by Edward Quick <ed...@hotmail.com>.
Hi,

I reran the fetch and got this error again after 5 hours. Any ideas what causes this?


2008-09-06 04:10:23,062 WARN  mapred.LocalJobRunner - job_local_1
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/jobcache/job_local_1/job_local_1_map_0000/output/spill0.out in any of the configured local directories
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathToRead(LocalDirAllocator.java:359)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathToRead(LocalDirAllocator.java:138)
        at org.apache.hadoop.mapred.MapOutputFile.getSpillFile(MapOutputFile.java:94)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:972)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:698)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:220)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:157)
2008-09-06 04:10:23,860 FATAL fetcher.Fetcher - Fetcher: java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1062)
        at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:552)
        at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:587)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:559)





From: edwardquick@hotmail.com
To: nutch-user@lucene.apache.org
Subject: FW: Job failed!
Date: Fri, 5 Sep 2008 21:58:13 +0000








Sorry for all these posts. I found the problem. Had a dodgy segment, probably the one which was left after the last fetch bombed out.

From: edwardquick@hotmail.com
To: nutch-user@lucene.apache.org
Subject: FW: Job failed!
Date: Fri, 5 Sep 2008 21:49:38 +0000








Struth! Here's another problem as well. I'm trying to merge the segments I've created so far:

$ nutch mergesegs crawl/mergesegs_dir -dir crawl/segments
Merging 5 segments to crawl/mergesegs_dir/20080905223155
SegmentMerger:   adding file:/tmp/crawl/segments/20080905141605
SegmentMerger:   adding file:/tmp/crawl/segments/20080905141522
SegmentMerger:   adding file:/tmp/crawl/segments/20080905142231
SegmentMerger:   adding file:/tmp/crawl/segments/20080905153116
SegmentMerger:   adding file:/tmp/crawl/segments/20080905141348
SegmentMerger: using segment data from: crawl_generate

$ find crawl/mergesegs_dir
crawl/mergesegs_dir
crawl/mergesegs_dir/20080905223155
crawl/mergesegs_dir/20080905223155/crawl_generate
crawl/mergesegs_dir/20080905223155/crawl_generate/.part-00000.crc
crawl/mergesegs_dir/20080905223155/crawl_generate/part-00000

But when I run invertlinks, I get an error about a missing path:

$ mv crawl/segments crawl/BACKUPsegments
$ mv crawl/mergesegs_dir crawl/segments
$ nutch invertlinks crawl/linkdb -dir crawl/segments
LinkDb: starting
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/tmp/crawl/segments/20080905223155
LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path doesnt exist : file:/tmp/crawl/segments/20080905223155/parse_data
        at org.apache.hadoop.mapred.FileInputFormat.validateInput(FileInputFormat.java:215)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:705)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:973)
        at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:170)
        at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:285)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:248)



From: edwardquick@hotmail.com
To: nutch-user@lucene.apache.org
Subject: FW: Job failed!
Date: Fri, 5 Sep 2008 21:09:13 +0000








Sort of figured out how to kickstart the crawl again.

Basically did:

# most recent existing segment
s1=$(ls -d crawl/segments/* | tail -1)
bin/nutch updatedb crawl/crawldb "$s1"
bin/nutch generate crawl/crawldb crawl/segments
# segment just created by generate
s2=$(ls -d crawl/segments/* | tail -1)
bin/nutch fetch "$s2"

But unfortunately this is fetching the same urls as the previous fetch. :(

From: edwardquick@hotmail.com
To: nutch-user@lucene.apache.org
Subject: RE: Job failed!
Date: Fri, 5 Sep 2008 09:45:00 +0000








Initially I just did a tail -10 and thought there were no errors, but there are actually a few. The PDF errors are my fault: I updated the pdf plugin with the latest PDFBox and FontBox jars from CVS on sf.net and left parse-pdf.jar out of the rebuild. I'm not sure that's why the job failed, though. The log is 5MB so I can't really attach it all here, but hopefully the last 200 lines give an indication.

By the way, is there a way to kickstart this crawl off again without crawling from the start again?


tail -200 hadoop.log.2008-09-05
2008-09-05 03:41:22,360 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:22,360 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:22,360 WARN  parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:22,360 WARN  parse.ParserFactory - ... 4 more
2008-09-05 03:41:22,360 WARN  parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:22,360 WARN  parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/Premium+Service+Training+insert/$FILE/Premium+training.pdf of type application/pdf
2008-09-05 03:41:22,362 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/CTP+-+Travel+Plan+Objectives?OpenDocument
2008-09-05 03:41:23,616 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptsal.nsf/Content/ctcBA+Home%5CBusTools%5CC%5Ccomp+tckts%5Ccr+comp+tickets?OpenDocument
2008-09-05 03:41:24,745 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/Notes+7+-+93+Rooms?OpenDocument
2008-09-05 03:41:26,033 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/SCCM+January+/$FILE/SCCMonthly+-+NovDec%2C+07+%2808+Jan%2C+08%29.pdf
2008-09-05 03:41:27,215 WARN  parse.ParserFactory - org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:166)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:133)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:67)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:355)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - Caused by: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at java.security.AccessController.doPrivileged(Native Method)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - ... 4 more
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:27,216 WARN  parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/SCCM+January+/$FILE/SCCMonthly+-+NovDec%2C+07+%2808+Jan%2C+08%29.pdf of type application/pdf
2008-09-05 03:41:27,216 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/bani.nsf/Content/XXXXLS%5FQ1Results%5F030807%5CXXXXLS%5FQ1Resultsvideo%5F030807?opendocument
2008-09-05 03:41:28,451 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptflt.nsf/AttachmentsByTitle/Flight+Ops+News+Aug+2008/$FILE/FLIGHT+OPS_AUGUST_08+intranet+live.pdf
2008-09-05 03:41:29,760 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/Virus+2+questions?OpenDocument
2008-09-05 03:41:30,789 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/Gender+Reass+the+process?OpenDocument
2008-09-05 03:41:32,066 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/LGW+Crew+Responsibilities/$FILE/Crew+Responsibilities.doc
2008-09-05 03:41:33,390 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptflt.nsf/Content/Flight+Ops+Home%5CBusiness+Tools%5CFlight+Technical+Services%5CAircraft+Weights+%26+Evaluation%5CFleet+Weights+-+Aircraft+Weighing+Schedules?OpenDocument
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:166)
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:133)
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:67)
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:355)
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - Caused by: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at java.security.AccessController.doPrivileged(Native Method)
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:34,563 WARN  parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:34,563 WARN  parse.ParserFactory - ... 4 more
2008-09-05 03:41:34,563 WARN  parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:34,563 WARN  parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptrix.nsf/AttachmentsByTitle/T5+Retail+-+T5+Ground+Level/$FILE/T5_Ground_Level.pdf of type application/pdf
2008-09-05 03:41:34,564 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/travel/stpg2.nsf/072561aa006322660725618c006b09a0/fc11f85e25deb736802574a30033c99e?OpenDocument
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:166)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:133)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:67)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:355)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - Caused by: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at java.security.AccessController.doPrivileged(Native Method)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - ... 4 more
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:35,926 WARN  parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptrix.nsf/AttachmentsByTitle/Diversity+dignity+at+work+booklet/$FILE/Dignity+at+work+booklet.pdf of type application/pdf
2008-09-05 03:41:35,928 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/communications/wtps1.nsf/$lookup/1D94AD9A45B463638025730100263FDF
2008-09-05 03:41:36,988 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/Barplus+Hints+and+Tips/$FILE/Barplus+Hints+and+Tips.pdf
2008-09-05 03:41:38,217 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/apteng.nsf/Content/Engineering+Home%5CDepartment+Information%5CEngineering+IT+Support+%26+Delivery+Homepage%5CEngineering+Solution+Group+%28ESG%29+Homepage%5CKey+user+Guides?OpenDocument
2008-09-05 03:41:41,143 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/Cultural+Awareness+Photo+Prize+Draw?OpenDocument
2008-09-05 03:41:42,278 WARN  parse.ParserFactory - org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:166)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:133)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:67)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:355)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - Caused by: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at java.security.AccessController.doPrivileged(Native Method)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - ... 4 more
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:42,279 WARN  parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptflt.nsf/AttachmentsByTitle/Flight+Ops+News+Aug+2008/$FILE/FLIGHT+OPS_AUGUST_08+intranet+live.pdf of type application/pdf
2008-09-05 03:41:42,313 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptsal.nsf/Content/ctcBA+Home%5CBusTools%5CB%5Cbah%5CPromos+pckge%5CFlrda+08+EBO+WTP+upgde?OpenDocument
2008-09-05 03:41:42,342 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/PMA+EG904+timescales?OpenDocument
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:166)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:133)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:67)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:355)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - Caused by: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at java.security.AccessController.doPrivileged(Native Method)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - ... 4 more
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:52,279 WARN  parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/Barplus+Hints+and+Tips/$FILE/Barplus+Hints+and+Tips.pdf of type application/pdf
2008-09-05 03:41:55,927 WARN  mapred.LocalJobRunner - job_local_21
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for taskTracker/jobcache/job_local_21/job_local_21_map_0000/output/file.out
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:313)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
        at org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:61)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:982)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:698)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:220)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:157)
2008-09-05 09:32:46,906 INFO  searcher.NutchBean - opening indexes in crawl/indexes
2008-09-05 09:32:47,002 INFO  plugin.PluginRepository - Plugins: looking in: /ok/appl/nutch-2008-09-04_04-01-27/plugins
2008-09-05 09:32:47,305 INFO  plugin.PluginRepository - Plugin Auto-activation mode: [true]
2008-09-05 09:32:47,305 INFO  plugin.PluginRepository - Registered Plugins:
2008-09-05 09:32:47,305 INFO  plugin.PluginRepository -         CyberNeko HTML Parser (lib-nekohtml)
2008-09-05 09:32:47,305 INFO  plugin.PluginRepository -         MSPowerPoint Parse Plug-in (parse-mspowerpoint)
2008-09-05 09:32:47,305 INFO  plugin.PluginRepository -         Site Query Filter (query-site)
2008-09-05 09:32:47,305 INFO  plugin.PluginRepository -         Http / Https Protocol Plug-in (protocol-httpclient)
2008-09-05 09:32:47,305 INFO  plugin.PluginRepository -         MSWord Parse Plug-in (parse-msword)
2008-09-05 09:32:47,305 INFO  plugin.PluginRepository -         Basic URL Normalizer (urlnormalizer-basic)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Pass-through URL Normalizer (urlnormalizer-pass)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Html Parse Plug-in (parse-html)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Regex URL Filter Framework (lib-regex-filter)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Basic Indexing Filter (index-basic)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Pdf Parse Plug-in (parse-pdf)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Basic Summarizer Plug-in (summary-basic)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         MSExcel Parse Plug-in (parse-msexcel)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Text Parse Plug-in (parse-text)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Jakarta POI - Java API To Access Microsoft Format Files (lib-jakarta-poi)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Regex URL Filter (urlfilter-regex)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Basic Query Filter (query-basic)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         HTTP Framework (lib-http)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         URL Query Filter (query-url)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Regex URL Normalizer (urlnormalizer-regex)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Parse MS Documents Framework (lib-parsems)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Zip Parse Plug-in (parse-zip)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         the nutch core extension points (nutch-extensionpoints)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         OPIC Scoring Plug-in (scoring-opic)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository - Registered Extension-Points:
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Nutch Summarizer (org.apache.nutch.searcher.Summarizer)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Nutch Protocol (org.apache.nutch.protocol.Protocol)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Nutch URL Filter (org.apache.nutch.net.URLFilter)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Nutch Online Search Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Nutch Content Parser (org.apache.nutch.parse.Parser)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Ontology Model Loader (org.apache.nutch.ontology.Ontology)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
2008-09-05 09:32:47,342 INFO  searcher.NutchBean - opening segments in crawl/segments
2008-09-05 09:32:47,368 INFO  searcher.SummarizerFactory - Using the first summarizer extension found: Basic Summarizer
2008-09-05 09:32:47,371 INFO  searcher.NutchBean - opening linkdb in crawl/linkdb
2008-09-05 09:32:52,746 INFO  searcher.NutchBean - opening indexes in crawl/indexes
2008-09-05 09:32:52,791 INFO  plugin.PluginRepository - Plugins: looking in: /ok/appl/nutch-2008-09-04_04-01-27/plugins
2008-09-05 09:32:52,999 INFO  plugin.PluginRepository - Plugin Auto-activation mode: [true]
2008-09-05 09:32:52,999 INFO  plugin.PluginRepository - Registered Plugins:
2008-09-05 09:32:52,999 INFO  plugin.PluginRepository -         CyberNeko HTML Parser (lib-nekohtml)



> Subject: Re: Job failed!
> From: zhengsj03@163.com
> To: nutch-user@lucene.apache.org
> Date: Fri, 5 Sep 2008 17:28:47 +0800
> 
> Could you show the whole hadoop.log?
> On Fri, 2008-09-05 at 08:46 +0000, Edward Quick wrote:
> > Hi,
> > 
> > I ran a crawl last night 
> > 
> > bin/nutch crawl urls -dir crawl -depth 10
> > 
> > which collected 10612 pages, and then bailed out with the following error:
> > 
> > Exception in thread "main" java.io.IOException: Job failed!
> >         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1062)
> >         at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:552)
> >         at org.apache.nutch.crawl.Crawl.main(Crawl.java:122)
> > 
> > I checked there was enough space on the box, and there don't appear to be any errors in hadoop.log or the crawl output, so I'm stuck on what caused this.
> > 
> > Also, is there a way to pick up the crawl from where it stopped rather than having to rerun it all over again?
> > 
> > Thanks for any help.
> > 
> > Ed.
> > 
> > 
> > 
> 
> 


FW: Job failed!

Posted by Edward Quick <ed...@hotmail.com>.
For info only. I fixed this problem by removing the mapreduce directory in tmp before running another fetch.
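
A minimal sketch of that cleanup, for anyone hitting the same DiskErrorException. It assumes the stock Hadoop defaults, where hadoop.tmp.dir is /tmp/hadoop-<user> and local job data sits under mapred/local inside it. The function name and the default path are assumptions, not something stated in this thread, so check hadoop.tmp.dir and mapred.local.dir in your own configuration first, and only run it while no job is active.

```shell
# Hypothetical helper: clear leftover local MapReduce data before a new fetch.
# The default path is an assumption based on Hadoop's default hadoop.tmp.dir.
clean_mapred_local() {
    dir="${1:-/tmp/hadoop-${USER:-$(id -un)}/mapred/local}"
    if [ ! -d "$dir" ]; then
        echo "nothing to clean: $dir"
        return 0
    fi
    # Show free space on that filesystem first; a full temp filesystem is
    # another possible cause of this error (assumption, not confirmed here).
    df -k "$dir"
    rm -rf "$dir"
    echo "removed $dir"
}
```

Called with no argument it uses the assumed default; pass the directory explicitly if your local dir lives elsewhere, e.g. clean_mapred_local /data/tmp/mapred/local.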

From: edwardquick@hotmail.com
To: nutch-user@lucene.apache.org; nutch-dev@lucene.apache.org
Subject: FW: Job failed!
Date: Sat, 6 Sep 2008 07:10:11 +0000








Hi,

I reran the fetch and got this error again after 5 hours. Any ideas what causes this?


2008-09-06 04:10:23,062 WARN  mapred.LocalJobRunner - job_local_1
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/jobcache/job_local_1/job_local_1_map_0000/output/spill0.out in any of the configured local directories
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathToRead(LocalDirAllocator.java:359)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathToRead(LocalDirAllocator.java:138)
        at org.apache.hadoop.mapred.MapOutputFile.getSpillFile(MapOutputFile.java:94)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:972)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:698)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:220)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:157)
2008-09-06 04:10:23,860 FATAL fetcher.Fetcher - Fetcher: java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1062)
        at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:552)
        at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:587)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:559)





From: edwardquick@hotmail.com
To: nutch-user@lucene.apache.org
Subject: FW: Job failed!
Date: Fri, 5 Sep 2008 21:58:13 +0000








Sorry for all these posts. I found the problem. Had a dodgy segment, probably the one which was left after the last fetch bombed out.

From: edwardquick@hotmail.com
To: nutch-user@lucene.apache.org
Subject: FW: Job failed!
Date: Fri, 5 Sep 2008 21:49:38 +0000








Struth! Here's another problem as well. I'm trying to merge the segments I've created so far:

$ nutch mergesegs crawl/mergesegs_dir -dir crawl/segments
Merging 5 segments to crawl/mergesegs_dir/20080905223155
SegmentMerger:   adding file:/tmp/crawl/segments/20080905141605
SegmentMerger:   adding file:/tmp/crawl/segments/20080905141522
SegmentMerger:   adding file:/tmp/crawl/segments/20080905142231
SegmentMerger:   adding file:/tmp/crawl/segments/20080905153116
SegmentMerger:   adding file:/tmp/crawl/segments/20080905141348
SegmentMerger: using segment data from: crawl_generate

$ find crawl/mergesegs_dir
crawl/mergesegs_dir
crawl/mergesegs_dir/20080905223155
crawl/mergesegs_dir/20080905223155/crawl_generate
crawl/mergesegs_dir/20080905223155/crawl_generate/.part-00000.crc
crawl/mergesegs_dir/20080905223155/crawl_generate/part-00000

But when I run invertlinks, I get an error about a missing path:

$ mv crawl/segments crawl/BACKUPsegments
$ mv crawl/mergesegs_dir crawl/segments
$ nutch invertlinks crawl/linkdb -dir crawl/segments
LinkDb: starting
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/tmp/crawl/segments/20080905223155
LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path doesnt exist : file:/tmp/crawl/segments/20080905223155/parse_data
        at org.apache.hadoop.mapred.FileInputFormat.validateInput(FileInputFormat.java:215)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:705)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:973)
        at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:170)
        at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:285)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:248)
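
The merged segment above contains only crawl_generate, while invertlinks reads parse_data from each segment, which matches the missing-path error. Below is a small sketch for filtering out segments that were never fetched and parsed before merging or inverting. parsed_segments is a hypothetical helper. Passing an explicit segment list instead of -dir is, as far as I recall, accepted by both mergesegs and invertlinks in this Nutch version, so treat that as an assumption.

```shell
# Hypothetical helper: print the segments under a directory that contain
# parse_data; segments without it are reported on stderr and skipped.
parsed_segments() {
    for seg in "$1"/*; do
        [ -d "$seg" ] || continue
        if [ -d "$seg/parse_data" ]; then
            echo "$seg"
        else
            echo "skipping $seg (no parse_data)" >&2
        fi
    done
}

# Usage with the paths from the post (assumes segment-list arguments are accepted):
#   nutch invertlinks crawl/linkdb $(parsed_segments crawl/segments)
```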



From: edwardquick@hotmail.com
To: nutch-user@lucene.apache.org
Subject: FW: Job failed!
Date: Fri, 5 Sep 2008 21:09:13 +0000








Sort of figured out how to kickstart the crawl again.

Basically did:

s1=`ls -d crawl/segments/* | tail -1`
bin/nutch updatedb crawl/crawldb $s1
bin/nutch generate crawl/crawldb crawl/segments
s2=`ls -d crawl/segments/* | tail -1`
bin/nutch fetch $s2

But unfortunately this is fetching the same urls as the previous fetch. :(
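
The restart attempt above can be arranged as one repeatable round. This is only a sketch: it assumes the crawl/ layout and the bin/nutch commands quoted in the post, and the names latest_segment, crawl_round and NUTCH are made up for illustration. It runs updatedb after the fetch of the same segment, on the guess that updating the crawldb from a segment whose fetch never finished leaves those URLs unfetched, so the next generate selects them again.

```shell
# Hypothetical helpers; NUTCH and the crawl/ layout follow the commands above.
NUTCH="${NUTCH:-bin/nutch}"

# Newest segment = last name in sorted order (segment names are timestamps).
latest_segment() {
    ls -d "$1"/* 2>/dev/null | tail -1
}

# One round: generate a fetch list, fetch it, then fold the results into the crawldb.
crawl_round() {
    "$NUTCH" generate crawl/crawldb crawl/segments || return 1
    seg=$(latest_segment crawl/segments)
    "$NUTCH" fetch "$seg" || return 1
    "$NUTCH" updatedb crawl/crawldb "$seg"
}
```

Calling crawl_round repeatedly from the Nutch install directory gives roughly what the remaining depth levels of the one-shot crawl command would have done, minus the final invertlinks and index steps.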

From: edwardquick@hotmail.com
To: nutch-user@lucene.apache.org
Subject: RE: Job failed!
Date: Fri, 5 Sep 2008 09:45:00 +0000








Initially I only ran tail -10, so I thought there were no errors, but there are actually a few. The PDF errors are my fault: I updated the pdf plugin with the latest PDFBox and FontBox jars from CVS on sf.net and left parse-pdf.jar out of the rebuild. I'm not sure that is why the job failed, though. The log is 5MB, so I can't attach it all here, but hopefully the last 200 lines give an indication.
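
Since a short tail can miss the one line that matters, here is a rough sketch for summarising a log of this size instead. It assumes the log4j line layout visible in this thread (date, time, level, logger); log_problems is a hypothetical name. Stack-trace lines that do not start with a timestamp are not counted, but the WARN line that introduces them is.

```shell
# Hypothetical helper: count log lines at WARN or above, grouped by level and
# logger, so repeated parser noise does not hide a one-off failure.
log_problems() {
    grep -E '^[0-9-]+ [0-9:,]+ (WARN|ERROR|FATAL) ' "$1" |
        awk '{ print $3, $4 }' | sort | uniq -c | sort -rn
}

# Usage: log_problems hadoop.log.2008-09-05
```

Any logger that shows up only once or twice, such as mapred.LocalJobRunner here, is usually the place to start reading.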

By the way, is there a way to restart this crawl without crawling from the start again?


tail -200 hadoop.log.2008-09-05
2008-09-05 03:41:22,360 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:22,360 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:22,360 WARN  parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:22,360 WARN  parse.ParserFactory - ... 4 more
2008-09-05 03:41:22,360 WARN  parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:22,360 WARN  parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/Premium+Service+Training+insert/$FILE/Premium+training.pdf of type application/pdf
2008-09-05 03:41:22,362 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/CTP+-+Travel+Plan+Objectives?OpenDocument
2008-09-05 03:41:23,616 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptsal.nsf/Content/ctcBA+Home%5CBusTools%5CC%5Ccomp+tckts%5Ccr+comp+tickets?OpenDocument
2008-09-05 03:41:24,745 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/Notes+7+-+93+Rooms?OpenDocument
2008-09-05 03:41:26,033 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/SCCM+January+/$FILE/SCCMonthly+-+NovDec%2C+07+%2808+Jan%2C+08%29.pdf
2008-09-05 03:41:27,215 WARN  parse.ParserFactory - org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:166)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:133)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:67)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:355)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - Caused by: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at java.security.AccessController.doPrivileged(Native Method)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - ... 4 more
2008-09-05 03:41:27,216 WARN  parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:27,216 WARN  parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/SCCM+January+/$FILE/SCCMonthly+-+NovDec%2C+07+%2808+Jan%2C+08%29.pdf of type application/pdf
2008-09-05 03:41:27,216 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/bani.nsf/Content/XXXXLS%5FQ1Results%5F030807%5CXXXXLS%5FQ1Resultsvideo%5F030807?opendocument
2008-09-05 03:41:28,451 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptflt.nsf/AttachmentsByTitle/Flight+Ops+News+Aug+2008/$FILE/FLIGHT+OPS_AUGUST_08+intranet+live.pdf
2008-09-05 03:41:29,760 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/Virus+2+questions?OpenDocument
2008-09-05 03:41:30,789 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/Gender+Reass+the+process?OpenDocument
2008-09-05 03:41:32,066 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/LGW+Crew+Responsibilities/$FILE/Crew+Responsibilities.doc
2008-09-05 03:41:33,390 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptflt.nsf/Content/Flight+Ops+Home%5CBusiness+Tools%5CFlight+Technical+Services%5CAircraft+Weights+%26+Evaluation%5CFleet+Weights+-+Aircraft+Weighing+Schedules?OpenDocument
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:166)
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:133)
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:67)
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:355)
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - Caused by: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at java.security.AccessController.doPrivileged(Native Method)
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:34,562 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:34,563 WARN  parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:34,563 WARN  parse.ParserFactory - ... 4 more
2008-09-05 03:41:34,563 WARN  parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:34,563 WARN  parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptrix.nsf/AttachmentsByTitle/T5+Retail+-+T5+Ground+Level/$FILE/T5_Ground_Level.pdf of type application/pdf
2008-09-05 03:41:34,564 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/travel/stpg2.nsf/072561aa006322660725618c006b09a0/fc11f85e25deb736802574a30033c99e?OpenDocument
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:166)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:133)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:67)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:355)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - Caused by: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at java.security.AccessController.doPrivileged(Native Method)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - ... 4 more
2008-09-05 03:41:35,926 WARN  parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:35,926 WARN  parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptrix.nsf/AttachmentsByTitle/Diversity+dignity+at+work+booklet/$FILE/Dignity+at+work+booklet.pdf of type application/pdf
2008-09-05 03:41:35,928 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/communications/wtps1.nsf/$lookup/1D94AD9A45B463638025730100263FDF
2008-09-05 03:41:36,988 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/Barplus+Hints+and+Tips/$FILE/Barplus+Hints+and+Tips.pdf
2008-09-05 03:41:38,217 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/apteng.nsf/Content/Engineering+Home%5CDepartment+Information%5CEngineering+IT+Support+%26+Delivery+Homepage%5CEngineering+Solution+Group+%28ESG%29+Homepage%5CKey+user+Guides?OpenDocument
2008-09-05 03:41:41,143 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/Cultural+Awareness+Photo+Prize+Draw?OpenDocument
2008-09-05 03:41:42,278 WARN  parse.ParserFactory - org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:166)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:133)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:67)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:355)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - Caused by: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at java.security.AccessController.doPrivileged(Native Method)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - ... 4 more
2008-09-05 03:41:42,279 WARN  parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:42,279 WARN  parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptflt.nsf/AttachmentsByTitle/Flight+Ops+News+Aug+2008/$FILE/FLIGHT+OPS_AUGUST_08+intranet+live.pdf of type application/pdf
2008-09-05 03:41:42,313 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptsal.nsf/Content/ctcBA+Home%5CBusTools%5CB%5Cbah%5CPromos+pckge%5CFlrda+08+EBO+WTP+upgde?OpenDocument
2008-09-05 03:41:42,342 INFO  fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/PMA+EG904+timescales?OpenDocument
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:166)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:133)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:67)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:355)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - Caused by: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at java.security.AccessController.doPrivileged(Native Method)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - ... 4 more
2008-09-05 03:41:52,279 WARN  parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:52,279 WARN  parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/Barplus+Hints+and+Tips/$FILE/Barplus+Hints+and+Tips.pdf of type application/pdf
2008-09-05 03:41:55,927 WARN  mapred.LocalJobRunner - job_local_21
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for taskTracker/jobcache/job_local_21/job_local_21_map_0000/output/file.out
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:313)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
        at org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:61)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:982)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:698)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:220)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:157)
2008-09-05 09:32:46,906 INFO  searcher.NutchBean - opening indexes in crawl/indexes
2008-09-05 09:32:47,002 INFO  plugin.PluginRepository - Plugins: looking in: /ok/appl/nutch-2008-09-04_04-01-27/plugins
2008-09-05 09:32:47,305 INFO  plugin.PluginRepository - Plugin Auto-activation mode: [true]
2008-09-05 09:32:47,305 INFO  plugin.PluginRepository - Registered Plugins:
2008-09-05 09:32:47,305 INFO  plugin.PluginRepository -         CyberNeko HTML Parser (lib-nekohtml)
2008-09-05 09:32:47,305 INFO  plugin.PluginRepository -         MSPowerPoint Parse Plug-in (parse-mspowerpoint)
2008-09-05 09:32:47,305 INFO  plugin.PluginRepository -         Site Query Filter (query-site)
2008-09-05 09:32:47,305 INFO  plugin.PluginRepository -         Http / Https Protocol Plug-in (protocol-httpclient)
2008-09-05 09:32:47,305 INFO  plugin.PluginRepository -         MSWord Parse Plug-in (parse-msword)
2008-09-05 09:32:47,305 INFO  plugin.PluginRepository -         Basic URL Normalizer (urlnormalizer-basic)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Pass-through URL Normalizer (urlnormalizer-pass)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Html Parse Plug-in (parse-html)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Regex URL Filter Framework (lib-regex-filter)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Basic Indexing Filter (index-basic)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Pdf Parse Plug-in (parse-pdf)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Basic Summarizer Plug-in (summary-basic)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         MSExcel Parse Plug-in (parse-msexcel)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Text Parse Plug-in (parse-text)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Jakarta POI - Java API To Access Microsoft Format Files (lib-jakarta-poi)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Regex URL Filter (urlfilter-regex)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Basic Query Filter (query-basic)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         HTTP Framework (lib-http)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         URL Query Filter (query-url)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Regex URL Normalizer (urlnormalizer-regex)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Parse MS Documents Framework (lib-parsems)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Zip Parse Plug-in (parse-zip)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         the nutch core extension points (nutch-extensionpoints)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         OPIC Scoring Plug-in (scoring-opic)
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository - Registered Extension-Points:
2008-09-05 09:32:47,306 INFO  plugin.PluginRepository -         Nutch Summarizer (org.apache.nutch.searcher.Summarizer)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Nutch Protocol (org.apache.nutch.protocol.Protocol)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Nutch URL Filter (org.apache.nutch.net.URLFilter)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Nutch Online Search Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Nutch Content Parser (org.apache.nutch.parse.Parser)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Ontology Model Loader (org.apache.nutch.ontology.Ontology)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2008-09-05 09:32:47,307 INFO  plugin.PluginRepository -         Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
2008-09-05 09:32:47,342 INFO  searcher.NutchBean - opening segments in crawl/segments
2008-09-05 09:32:47,368 INFO  searcher.SummarizerFactory - Using the first summarizer extension found: Basic Summarizer
2008-09-05 09:32:47,371 INFO  searcher.NutchBean - opening linkdb in crawl/linkdb
2008-09-05 09:32:52,746 INFO  searcher.NutchBean - opening indexes in crawl/indexes
2008-09-05 09:32:52,791 INFO  plugin.PluginRepository - Plugins: looking in: /ok/appl/nutch-2008-09-04_04-01-27/plugins
2008-09-05 09:32:52,999 INFO  plugin.PluginRepository - Plugin Auto-activation mode: [true]
2008-09-05 09:32:52,999 INFO  plugin.PluginRepository - Registered Plugins:
2008-09-05 09:32:52,999 INFO  plugin.PluginRepository -         CyberNeko HTML Parser (lib-nekohtml)



> Subject: Re: Job failed!
> From: zhengsj03@163.com
> To: nutch-user@lucene.apache.org
> Date: Fri, 5 Sep 2008 17:28:47 +0800
> 
> Could you show the whole hadoop.log?
> On Fri, 2008-09-05 at 08:46 +0000, Edward Quick wrote:
> > Hi,
> > 
> > I ran a crawl last night 
> > 
> > bin/nutch crawl urls -dir crawl -depth 10
> > 
> > which collected 10612 pages, and then bailed out with the following error:
> > 
> > Exception in thread "main" java.io.IOException: Job failed!
> >         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1062)
> >         at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:552)
> >         at org.apache.nutch.crawl.Crawl.main(Crawl.java:122)
> > 
> > I checked there was enough space on the box, and there don't appear to be any errors in hadoop.log or the crawl output, so I'm stuck on what caused this.
> > 
> > Also, is there a way to pick up the crawl from where it stopped rather than having to rerun it all over again?
> > 
> > Thanks for any help.
> > 
> > Ed.
> > 
> > 
> > 
> 
> 


Re: Job failed!

Posted by zhengsj03 <zh...@163.com>.
Could you show the whole hadoop.log?
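
Since the log in this thread is dominated by repeated parse-pdf warnings, it may help to filter those out before posting. Below is a minimal sketch, assuming log4j-style lines of the form "date time,ms LEVEL logger - message" as seen above. The function name summarize_log and the default list of noisy loggers are illustrative only; they are not part of Nutch or Hadoop.

```python
import re
from collections import Counter

# Matches lines such as:
# 2008-09-05 03:41:55,927 WARN  mapred.LocalJobRunner - job_local_21
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+(\w+)\s+(\S+) - (.*)$"
)

def summarize_log(lines, noisy_loggers=("parse.ParserFactory",)):
    """Count WARN/ERROR/FATAL lines per logger and collect the non-noisy ones.

    Lines without a timestamp (e.g. a stack trace printed after a warning)
    are kept only when they follow a kept warning.
    Note: noisy_loggers is an assumption chosen for this thread's log.
    """
    counts = Counter()
    kept = []
    keep_continuation = False
    for raw in lines:
        line = raw.rstrip("\n")
        match = LINE_RE.match(line)
        if match:
            _, level, logger, _ = match.groups()
            if level in ("WARN", "ERROR", "FATAL"):
                counts[logger] += 1
                keep_continuation = logger not in noisy_loggers
                if keep_continuation:
                    kept.append(line)
            else:
                # INFO/DEBUG lines end any stack trace we were following.
                keep_continuation = False
        elif keep_continuation and line.strip():
            kept.append(line)
    return counts, kept
```

Passing an open file handle for the log (for example logs/hadoop.log, which is an assumed location) returns the per-logger counts and the remaining warnings with their stack traces, which should make an entry like the mapred.LocalJobRunner one easier to spot.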
On Fri, 2008-09-05 at 08:46 +0000, Edward Quick wrote:
> Hi,
> 
> I ran a crawl last night 
> 
> bin/nutch crawl urls -dir crawl -depth 10
> 
> which collected 10612 pages, and then bailed out with the following error:
> 
> Exception in thread "main" java.io.IOException: Job failed!
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1062)
>         at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:552)
>         at org.apache.nutch.crawl.Crawl.main(Crawl.java:122)
> 
> I checked there was enough space on the box, and there don't appear to be any errors in hadoop.log or the crawl output, so I'm stuck on what caused this.
> 
> Also, is there a way to pick up the crawl from where it stopped rather than having to rerun it all over again?
> 
> Thanks for any help.
> 
> Ed.
> 
> 
> 