Posted to user@nutch.apache.org by Edward Quick <ed...@hotmail.com> on 2008/09/05 10:46:07 UTC
Job failed!
Hi,
I ran a crawl last night
bin/nutch crawl urls -dir crawl -depth 10
which collected 10612 pages, and then bailed out with the following error:
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1062)
at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:552)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:122)
I checked there was enough space on the box, and there don't appear to be any errors in hadoop.log or the crawl output, so I'm stuck on what caused this.
Also, is there a way to pick up the crawl from where it stopped rather than having to rerun it all over again?
Thanks for any help.
Ed.
FW: Job failed!
Posted by Edward Quick <ed...@hotmail.com>.
For info only. I fixed this problem by removing the mapreduce directory in tmp before running another fetch.
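For anyone hitting the same thing, the cleanup described above might look roughly like the sketch below. It assumes the job ran in local mode with Hadoop's default hadoop.tmp.dir of /tmp/hadoop-<user>, so that the map output sits under mapred/local inside it. Neither path is stated in this thread, so check hadoop-site.xml first. clear_mapred_local is a made-up helper name.

```shell
#!/bin/sh
# Hypothetical helper: remove stale local map-reduce scratch data.
# ASSUMPTION: default layout <hadoop.tmp.dir>/mapred/local; the thread only
# says "the mapreduce directory in tmp".
clear_mapred_local() {
    tmp_dir=${1:-/tmp/hadoop-${USER:-$(id -un)}}
    local_dir="$tmp_dir/mapred/local"
    if [ -d "$local_dir" ]; then
        echo "removing $local_dir"
        rm -rf -- "$local_dir"
    else
        echo "nothing to remove at $local_dir"
    fi
}
```

Run it only while no Nutch job is active, for example: clear_mapred_local /tmp/hadoop-ed (the path here is illustrative).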
From: edwardquick@hotmail.com
To: nutch-user@lucene.apache.org; nutch-dev@lucene.apache.org
Subject: FW: Job failed!
Date: Sat, 6 Sep 2008 07:10:11 +0000
Hi,
I reran the fetch and got this error again after 5 hours. Any ideas what causes this?
2008-09-06 04:10:23,062 WARN mapred.LocalJobRunner - job_local_1
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/jobcache/job_local_1/job_local_1_map_0000/output/spill0.out in any of the configured local directories
at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathToRead(LocalDirAllocator.java:359)
at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathToRead(LocalDirAllocator.java:138)
at org.apache.hadoop.mapred.MapOutputFile.getSpillFile(MapOutputFile.java:94)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:972)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:698)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:220)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:157)
2008-09-06 04:10:23,860 FATAL fetcher.Fetcher - Fetcher: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1062)
at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:552)
at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:587)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:559)
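The LocalDirAllocator errors in this trace concern Hadoop's local scratch directories (mapred.local.dir), which may sit on a different filesystem from the crawl output, so a general "enough space on the box" check can miss them. A rough sketch for checking that filesystem, assuming the scratch directory lives somewhere under /tmp (not stated in the thread); free_kb and has_space are hypothetical helpers.

```shell
#!/bin/sh
# Print the free space, in KB, of the filesystem holding directory $1.
# df -P guarantees one line per filesystem; column 4 is "Available".
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Succeed if the filesystem holding $1 has at least $2 KB free.
has_space() {
    [ "$(free_kb "$1")" -ge "$2" ]
}
```

For example, has_space /tmp 1048576 || echo "less than 1GB free under /tmp" could be run before each fetch. The 1GB threshold is arbitrary.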
From: edwardquick@hotmail.com
To: nutch-user@lucene.apache.org
Subject: FW: Job failed!
Date: Fri, 5 Sep 2008 21:58:13 +0000
Sorry for all these posts. I found the problem. Had a dodgy segment, probably the one which was left after the last fetch bombed out.
From: edwardquick@hotmail.com
To: nutch-user@lucene.apache.org
Subject: FW: Job failed!
Date: Fri, 5 Sep 2008 21:49:38 +0000
Struth! Here's another problem as well. I'm trying to merge the segments I've created so far:
$ nutch mergesegs crawl/mergesegs_dir -dir crawl/segments
Merging 5 segments to crawl/mergesegs_dir/20080905223155
SegmentMerger: adding file:/tmp/crawl/segments/20080905141605
SegmentMerger: adding file:/tmp/crawl/segments/20080905141522
SegmentMerger: adding file:/tmp/crawl/segments/20080905142231
SegmentMerger: adding file:/tmp/crawl/segments/20080905153116
SegmentMerger: adding file:/tmp/crawl/segments/20080905141348
SegmentMerger: using segment data from: crawl_generate
$ find crawl/mergesegs_dir
crawl/mergesegs_dir
crawl/mergesegs_dir/20080905223155
crawl/mergesegs_dir/20080905223155/crawl_generate
crawl/mergesegs_dir/20080905223155/crawl_generate/.part-00000.crc
crawl/mergesegs_dir/20080905223155/crawl_generate/part-00000
But when I run invertlinks, I get an error about a missing path:
$ mv crawl/segments crawl/BACKUPsegments
$ mv crawl/mergesegs_dir crawl/segments
$ nutch invertlinks crawl/linkdb -dir crawl/segments
LinkDb: starting
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/tmp/crawl/segments/20080905223155
LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path doesnt exist : file:/tmp/crawl/segments/20080905223155/parse_data
at org.apache.hadoop.mapred.FileInputFormat.validateInput(FileInputFormat.java:215)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:705)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:973)
at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:170)
at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:285)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:248)
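The find listing above shows the merged segment holding only crawl_generate, while invertlinks looks for parse_data. A pre-flight check along these lines would flag such segments before LinkDb is run. segments_missing is a hypothetical helper, and the parts to check are passed in by the caller rather than taken from Nutch itself.

```shell
#!/bin/sh
# Print every segment under $1 that lacks one of the subdirectories named in
# the remaining arguments. Returns non-zero if anything was reported.
segments_missing() {
    seg_root=$1; shift
    rc=0
    for seg in "$seg_root"/*/; do
        [ -d "$seg" ] || continue          # no segments at all: glob stays literal
        for part in "$@"; do
            if [ ! -d "$seg$part" ]; then
                echo "${seg%/}: missing $part"
                rc=1
            fi
        done
    done
    return $rc
}
```

Usage might be: segments_missing crawl/segments parse_data && nutch invertlinks crawl/linkdb -dir crawl/segments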
From: edwardquick@hotmail.com
To: nutch-user@lucene.apache.org
Subject: FW: Job failed!
Date: Fri, 5 Sep 2008 21:09:13 +0000
Sort of figured out how to kickstart the crawl again.
Basically did:
s1=$(ls -d crawl/segments/* | tail -1)
bin/nutch updatedb crawl/crawldb "$s1"
bin/nutch generate crawl/crawldb crawl/segments
s2=$(ls -d crawl/segments/* | tail -1)
bin/nutch fetch "$s2"
But unfortunately this is fetching the same urls as the previous fetch. :(
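One possible reason for the repeated URLs (a guess, not confirmed anywhere in this thread): the newest segment is the one whose fetch died, so updatedb has no fetch results to record and generate picks the same URLs again. The sketch below selects the newest segment that actually has fetch output instead, assuming a completed fetch leaves a crawl_fetch directory in the segment. latest_fetched_segment is a made-up name.

```shell
#!/bin/sh
# Print the newest segment under $1 that contains a crawl_fetch directory.
# Segment names are timestamps, so the glob's sorted order is chronological
# and the last match wins. Returns non-zero if no segment qualifies.
latest_fetched_segment() {
    newest=
    for seg in "$1"/*/; do
        [ -d "${seg}crawl_fetch" ] && newest=${seg%/}
    done
    [ -n "$newest" ] && echo "$newest"
}
```

Its output could replace the first "ls -d ... | tail -1" in the commands above, for example: s1=$(latest_fetched_segment crawl/segments) && bin/nutch updatedb crawl/crawldb "$s1"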
From: edwardquick@hotmail.com
To: nutch-user@lucene.apache.org
Subject: RE: Job failed!
Date: Fri, 5 Sep 2008 09:45:00 +0000
Initially I just did a tail -10, so I thought there were no errors, but there are a few actually. The pdf errors are my fault: I updated the pdf plugin with the latest PDFBox and FontBox jars from cvs on sf.net and missed out parse-pdf.jar on the rebuild. I'm not sure that's the reason why the job failed though. The log is 5MB so I can't really attach it all here, but hopefully the last 200 lines give an indication.
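Since the tail of a large log can hide earlier problems, a count of non-INFO lines per logger gives a quicker overview than tail. This sketch assumes the layout seen in the hadoop.log lines in this thread (date, time, level and logger as the first four whitespace-separated fields); summarize_log is a hypothetical name.

```shell
#!/bin/sh
# Count WARN/ERROR/FATAL lines per level and logger, most frequent first.
# Lines without a level in field 3 (e.g. bare stack-trace lines) are ignored.
summarize_log() {
    awk '$3 ~ /^(WARN|ERROR|FATAL)$/ { n[$3 " " $4]++ }
         END { for (k in n) print n[k], k }' "$1" | sort -k1,1nr -k2
}
```

For example: summarize_log hadoop.log.2008-09-05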
By the way, is there a way to kickstart this crawl off again without crawling from the start again?
tail -200 hadoop.log.2008-09-05
2008-09-05 03:41:22,360 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:22,360 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:22,360 WARN parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:22,360 WARN parse.ParserFactory - ... 4 more
2008-09-05 03:41:22,360 WARN parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:22,360 WARN parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/Premium+Service+Training+insert/$FILE/Premium+training.pdf of type application/pdf
2008-09-05 03:41:22,362 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/CTP+-+Travel+Plan+Objectives?OpenDocument
2008-09-05 03:41:23,616 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptsal.nsf/Content/ctcBA+Home%5CBusTools%5CC%5Ccomp+tckts%5Ccr+comp+tickets?OpenDocument
2008-09-05 03:41:24,745 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/Notes+7+-+93+Rooms?OpenDocument
2008-09-05 03:41:26,033 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/SCCM+January+/$FILE/SCCMonthly+-+NovDec%2C+07+%2808+Jan%2C+08%29.pdf
2008-09-05 03:41:27,215 WARN parse.ParserFactory - org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:27,216 WARN parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:166)
2008-09-05 03:41:27,216 WARN parse.ParserFactory - at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:133)
2008-09-05 03:41:27,216 WARN parse.ParserFactory - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:67)
2008-09-05 03:41:27,216 WARN parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:355)
2008-09-05 03:41:27,216 WARN parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-09-05 03:41:27,216 WARN parse.ParserFactory - Caused by: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:27,216 WARN parse.ParserFactory - at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
2008-09-05 03:41:27,216 WARN parse.ParserFactory - at java.security.AccessController.doPrivileged(Native Method)
2008-09-05 03:41:27,216 WARN parse.ParserFactory - at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
2008-09-05 03:41:27,216 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:27,216 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:27,216 WARN parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:27,216 WARN parse.ParserFactory - ... 4 more
2008-09-05 03:41:27,216 WARN parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:27,216 WARN parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/SCCM+January+/$FILE/SCCMonthly+-+NovDec%2C+07+%2808+Jan%2C+08%29.pdf of type application/pdf
2008-09-05 03:41:27,216 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/bani.nsf/Content/XXXXLS%5FQ1Results%5F030807%5CXXXXLS%5FQ1Resultsvideo%5F030807?opendocument
2008-09-05 03:41:28,451 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptflt.nsf/AttachmentsByTitle/Flight+Ops+News+Aug+2008/$FILE/FLIGHT+OPS_AUGUST_08+intranet+live.pdf
2008-09-05 03:41:29,760 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/Virus+2+questions?OpenDocument
2008-09-05 03:41:30,789 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/Gender+Reass+the+process?OpenDocument
2008-09-05 03:41:32,066 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/LGW+Crew+Responsibilities/$FILE/Crew+Responsibilities.doc
2008-09-05 03:41:33,390 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptflt.nsf/Content/Flight+Ops+Home%5CBusiness+Tools%5CFlight+Technical+Services%5CAircraft+Weights+%26+Evaluation%5CFleet+Weights+-+Aircraft+Weighing+Schedules?OpenDocument
2008-09-05 03:41:34,562 WARN parse.ParserFactory - org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:34,562 WARN parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:166)
2008-09-05 03:41:34,562 WARN parse.ParserFactory - at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:133)
2008-09-05 03:41:34,562 WARN parse.ParserFactory - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:67)
2008-09-05 03:41:34,562 WARN parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:355)
2008-09-05 03:41:34,562 WARN parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-09-05 03:41:34,562 WARN parse.ParserFactory - Caused by: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:34,562 WARN parse.ParserFactory - at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
2008-09-05 03:41:34,562 WARN parse.ParserFactory - at java.security.AccessController.doPrivileged(Native Method)
2008-09-05 03:41:34,562 WARN parse.ParserFactory - at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
2008-09-05 03:41:34,562 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:34,562 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:34,563 WARN parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:34,563 WARN parse.ParserFactory - ... 4 more
2008-09-05 03:41:34,563 WARN parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:34,563 WARN parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptrix.nsf/AttachmentsByTitle/T5+Retail+-+T5+Ground+Level/$FILE/T5_Ground_Level.pdf of type application/pdf
2008-09-05 03:41:34,564 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/travel/stpg2.nsf/072561aa006322660725618c006b09a0/fc11f85e25deb736802574a30033c99e?OpenDocument
2008-09-05 03:41:35,926 WARN parse.ParserFactory - org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:166)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:133)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:67)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:355)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - Caused by: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at java.security.AccessController.doPrivileged(Native Method)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - ... 4 more
2008-09-05 03:41:35,926 WARN parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:35,926 WARN parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptrix.nsf/AttachmentsByTitle/Diversity+dignity+at+work+booklet/$FILE/Dignity+at+work+booklet.pdf of type application/pdf
2008-09-05 03:41:35,928 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/communications/wtps1.nsf/$lookup/1D94AD9A45B463638025730100263FDF
2008-09-05 03:41:36,988 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/Barplus+Hints+and+Tips/$FILE/Barplus+Hints+and+Tips.pdf
2008-09-05 03:41:38,217 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/apteng.nsf/Content/Engineering+Home%5CDepartment+Information%5CEngineering+IT+Support+%26+Delivery+Homepage%5CEngineering+Solution+Group+%28ESG%29+Homepage%5CKey+user+Guides?OpenDocument
2008-09-05 03:41:41,143 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/Cultural+Awareness+Photo+Prize+Draw?OpenDocument
2008-09-05 03:41:42,278 WARN parse.ParserFactory - org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:166)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:133)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:67)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:355)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - Caused by: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at java.security.AccessController.doPrivileged(Native Method)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - ... 4 more
2008-09-05 03:41:42,279 WARN parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:42,279 WARN parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptflt.nsf/AttachmentsByTitle/Flight+Ops+News+Aug+2008/$FILE/FLIGHT+OPS_AUGUST_08+intranet+live.pdf of type application/pdf
2008-09-05 03:41:42,313 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptsal.nsf/Content/ctcBA+Home%5CBusTools%5CB%5Cbah%5CPromos+pckge%5CFlrda+08+EBO+WTP+upgde?OpenDocument
2008-09-05 03:41:42,342 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/PMA+EG904+timescales?OpenDocument
2008-09-05 03:41:52,279 WARN parse.ParserFactory - org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:166)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:133)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:67)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:355)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - Caused by: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at java.security.AccessController.doPrivileged(Native Method)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - ... 4 more
2008-09-05 03:41:52,279 WARN parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:52,279 WARN parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/Barplus+Hints+and+Tips/$FILE/Barplus+Hints+and+Tips.pdf of type application/pdf
2008-09-05 03:41:55,927 WARN mapred.LocalJobRunner - job_local_21
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for taskTracker/jobcache/job_local_21/job_local_21_map_0000/output/file.out
at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:313)
at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
at org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:61)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:982)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:698)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:220)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:157)
2008-09-05 09:32:46,906 INFO searcher.NutchBean - opening indexes in crawl/indexes
2008-09-05 09:32:47,002 INFO plugin.PluginRepository - Plugins: looking in: /ok/appl/nutch-2008-09-04_04-01-27/plugins
2008-09-05 09:32:47,305 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
2008-09-05 09:32:47,305 INFO plugin.PluginRepository - Registered Plugins:
2008-09-05 09:32:47,305 INFO plugin.PluginRepository - CyberNeko HTML Parser (lib-nekohtml)
2008-09-05 09:32:47,305 INFO plugin.PluginRepository - MSPowerPoint Parse Plug-in (parse-mspowerpoint)
2008-09-05 09:32:47,305 INFO plugin.PluginRepository - Site Query Filter (query-site)
2008-09-05 09:32:47,305 INFO plugin.PluginRepository - Http / Https Protocol Plug-in (protocol-httpclient)
2008-09-05 09:32:47,305 INFO plugin.PluginRepository - MSWord Parse Plug-in (parse-msword)
2008-09-05 09:32:47,305 INFO plugin.PluginRepository - Basic URL Normalizer (urlnormalizer-basic)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Pass-through URL Normalizer (urlnormalizer-pass)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Html Parse Plug-in (parse-html)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Regex URL Filter Framework (lib-regex-filter)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Basic Indexing Filter (index-basic)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Pdf Parse Plug-in (parse-pdf)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Basic Summarizer Plug-in (summary-basic)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - MSExcel Parse Plug-in (parse-msexcel)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Text Parse Plug-in (parse-text)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Jakarta POI - Java API To Access Microsoft Format Files (lib-jakarta-poi)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Regex URL Filter (urlfilter-regex)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Basic Query Filter (query-basic)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - HTTP Framework (lib-http)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - URL Query Filter (query-url)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Regex URL Normalizer (urlnormalizer-regex)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Parse MS Documents Framework (lib-parsems)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Zip Parse Plug-in (parse-zip)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - OPIC Scoring Plug-in (scoring-opic)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Registered Extension-Points:
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Nutch Summarizer (org.apache.nutch.searcher.Summarizer)
2008-09-05 09:32:47,307 INFO plugin.PluginRepository - Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2008-09-05 09:32:47,307 INFO plugin.PluginRepository - Nutch Protocol (org.apache.nutch.protocol.Protocol)
2008-09-05 09:32:47,307 INFO plugin.PluginRepository - Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2008-09-05 09:32:47,307 INFO plugin.PluginRepository - Nutch URL Filter (org.apache.nutch.net.URLFilter)
2008-09-05 09:32:47,307 INFO plugin.PluginRepository - HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2008-09-05 09:32:47,307 INFO plugin.PluginRepository - Nutch Online Search Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
2008-09-05 09:32:47,307 INFO plugin.PluginRepository - Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2008-09-05 09:32:47,307 INFO plugin.PluginRepository - Nutch Content Parser (org.apache.nutch.parse.Parser)
2008-09-05 09:32:47,307 INFO plugin.PluginRepository - Ontology Model Loader (org.apache.nutch.ontology.Ontology)
2008-09-05 09:32:47,307 INFO plugin.PluginRepository - Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2008-09-05 09:32:47,307 INFO plugin.PluginRepository - Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
2008-09-05 09:32:47,342 INFO searcher.NutchBean - opening segments in crawl/segments
2008-09-05 09:32:47,368 INFO searcher.SummarizerFactory - Using the first summarizer extension found: Basic Summarizer
2008-09-05 09:32:47,371 INFO searcher.NutchBean - opening linkdb in crawl/linkdb
2008-09-05 09:32:52,746 INFO searcher.NutchBean - opening indexes in crawl/indexes
2008-09-05 09:32:52,791 INFO plugin.PluginRepository - Plugins: looking in: /ok/appl/nutch-2008-09-04_04-01-27/plugins
2008-09-05 09:32:52,999 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
2008-09-05 09:32:52,999 INFO plugin.PluginRepository - Registered Plugins:
2008-09-05 09:32:52,999 INFO plugin.PluginRepository - CyberNeko HTML Parser (lib-nekohtml)
> Subject: Re: Job failed!
> From: zhengsj03@163.com
> To: nutch-user@lucene.apache.org
> Date: Fri, 5 Sep 2008 17:28:47 +0800
>
> Could you show the whole hadoop.log?
> On Fri, 2008-09-05 at 08:46 +0000, Edward Quick wrote:
> > Hi,
> >
> > I ran a crawl last night
> >
> > bin/nutch crawl urls -dir crawl -depth 10
> >
> > which collected 10612 pages, and then bailed out with the following error:
> >
> > Exception in thread "main" java.io.IOException: Job failed!
> > at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1062)
> > at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:552)
> > at org.apache.nutch.crawl.Crawl.main(Crawl.java:122)
> >
> > I checked there was enough space on the box, and there don't appear to be any errors in hadoop.log or the crawl output, so I'm stuck on what caused this.
> >
> > Also, is there a way to pick up the crawl from where it stopped rather than having to rerun it all over again?
> >
> > Thanks for any help.
> >
> > Ed.
FW: Job failed!
Posted by Edward Quick <ed...@hotmail.com>.
Sorry for all these posts. I found the problem. Had a dodgy segment, probably the one which was left after the last fetch bombed out.
2008-09-05 03:41:27,216 WARN parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:27,216 WARN parse.ParserFactory - ... 4 more
2008-09-05 03:41:27,216 WARN parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:27,216 WARN parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/SCCM+January+/$FILE/SCCMonthly+-+NovDec%2C+07+%2808+Jan%2C+08%29.pdf of type application/pdf
2008-09-05 03:41:27,216 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/bani.nsf/Content/XXXXLS%5FQ1Results%5F030807%5CXXXXLS%5FQ1Resultsvideo%5F030807?opendocument
2008-09-05 03:41:28,451 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptflt.nsf/AttachmentsByTitle/Flight+Ops+News+Aug+2008/$FILE/FLIGHT+OPS_AUGUST_08+intranet+live.pdf
2008-09-05 03:41:29,760 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/Virus+2+questions?OpenDocument
2008-09-05 03:41:30,789 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/Gender+Reass+the+process?OpenDocument
2008-09-05 03:41:32,066 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/LGW+Crew+Responsibilities/$FILE/Crew+Responsibilities.doc
2008-09-05 03:41:33,390 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptflt.nsf/Content/Flight+Ops+Home%5CBusiness+Tools%5CFlight+Technical+Services%5CAircraft+Weights+%26+Evaluation%5CFleet+Weights+-+Aircraft+Weighing+Schedules?OpenDocument
2008-09-05 03:41:34,562 WARN parse.ParserFactory - org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:34,562 WARN parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:166)
2008-09-05 03:41:34,562 WARN parse.ParserFactory - at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:133)
2008-09-05 03:41:34,562 WARN parse.ParserFactory - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:67)
2008-09-05 03:41:34,562 WARN parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:355)
2008-09-05 03:41:34,562 WARN parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-09-05 03:41:34,562 WARN parse.ParserFactory - Caused by: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:34,562 WARN parse.ParserFactory - at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
2008-09-05 03:41:34,562 WARN parse.ParserFactory - at java.security.AccessController.doPrivileged(Native Method)
2008-09-05 03:41:34,562 WARN parse.ParserFactory - at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
2008-09-05 03:41:34,562 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:34,562 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:34,563 WARN parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:34,563 WARN parse.ParserFactory - ... 4 more
2008-09-05 03:41:34,563 WARN parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:34,563 WARN parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptrix.nsf/AttachmentsByTitle/T5+Retail+-+T5+Ground+Level/$FILE/T5_Ground_Level.pdf of type application/pdf
2008-09-05 03:41:34,564 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/travel/stpg2.nsf/072561aa006322660725618c006b09a0/fc11f85e25deb736802574a30033c99e?OpenDocument
2008-09-05 03:41:35,926 WARN parse.ParserFactory - org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:166)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:133)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:67)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:355)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - Caused by: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at java.security.AccessController.doPrivileged(Native Method)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - ... 4 more
2008-09-05 03:41:35,926 WARN parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:35,926 WARN parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptrix.nsf/AttachmentsByTitle/Diversity+dignity+at+work+booklet/$FILE/Dignity+at+work+booklet.pdf of type application/pdf
2008-09-05 03:41:35,928 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/communications/wtps1.nsf/$lookup/1D94AD9A45B463638025730100263FDF
2008-09-05 03:41:36,988 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/Barplus+Hints+and+Tips/$FILE/Barplus+Hints+and+Tips.pdf
2008-09-05 03:41:38,217 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/apteng.nsf/Content/Engineering+Home%5CDepartment+Information%5CEngineering+IT+Support+%26+Delivery+Homepage%5CEngineering+Solution+Group+%28ESG%29+Homepage%5CKey+user+Guides?OpenDocument
2008-09-05 03:41:41,143 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/Cultural+Awareness+Photo+Prize+Draw?OpenDocument
2008-09-05 03:41:42,278 WARN parse.ParserFactory - org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:166)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:133)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:67)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:355)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - Caused by: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at java.security.AccessController.doPrivileged(Native Method)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - ... 4 more
2008-09-05 03:41:42,279 WARN parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:42,279 WARN parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptflt.nsf/AttachmentsByTitle/Flight+Ops+News+Aug+2008/$FILE/FLIGHT+OPS_AUGUST_08+intranet+live.pdf of type application/pdf
2008-09-05 03:41:42,313 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptsal.nsf/Content/ctcBA+Home%5CBusTools%5CB%5Cbah%5CPromos+pckge%5CFlrda+08+EBO+WTP+upgde?OpenDocument
2008-09-05 03:41:42,342 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/PMA+EG904+timescales?OpenDocument
2008-09-05 03:41:52,279 WARN parse.ParserFactory - org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:166)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:133)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:67)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:355)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - Caused by: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at java.security.AccessController.doPrivileged(Native Method)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - ... 4 more
2008-09-05 03:41:52,279 WARN parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:52,279 WARN parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/Barplus+Hints+and+Tips/$FILE/Barplus+Hints+and+Tips.pdf of type application/pdf
2008-09-05 03:41:55,927 WARN mapred.LocalJobRunner - job_local_21
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for taskTracker/jobcache/job_local_21/job_local_21_map_0000/output/file.out
at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:313)
at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
at org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:61)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:982)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:698)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:220)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:157)
2008-09-05 09:32:46,906 INFO searcher.NutchBean - opening indexes in crawl/indexes
2008-09-05 09:32:47,002 INFO plugin.PluginRepository - Plugins: looking in: /ok/appl/nutch-2008-09-04_04-01-27/plugins
2008-09-05 09:32:47,305 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
2008-09-05 09:32:47,305 INFO plugin.PluginRepository - Registered Plugins:
2008-09-05 09:32:47,305 INFO plugin.PluginRepository - CyberNeko HTML Parser (lib-nekohtml)
2008-09-05 09:32:47,305 INFO plugin.PluginRepository - MSPowerPoint Parse Plug-in (parse-mspowerpoint)
2008-09-05 09:32:47,305 INFO plugin.PluginRepository - Site Query Filter (query-site)
2008-09-05 09:32:47,305 INFO plugin.PluginRepository - Http / Https Protocol Plug-in (protocol-httpclient)
2008-09-05 09:32:47,305 INFO plugin.PluginRepository - MSWord Parse Plug-in (parse-msword)
2008-09-05 09:32:47,305 INFO plugin.PluginRepository - Basic URL Normalizer (urlnormalizer-basic)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Pass-through URL Normalizer (urlnormalizer-pass)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Html Parse Plug-in (parse-html)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Regex URL Filter Framework (lib-regex-filter)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Basic Indexing Filter (index-basic)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Pdf Parse Plug-in (parse-pdf)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Basic Summarizer Plug-in (summary-basic)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - MSExcel Parse Plug-in (parse-msexcel)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Text Parse Plug-in (parse-text)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Jakarta POI - Java API To Access Microsoft Format Files (lib-jakarta-poi)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Regex URL Filter (urlfilter-regex)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Basic Query Filter (query-basic)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - HTTP Framework (lib-http)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - URL Query Filter (query-url)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Regex URL Normalizer (urlnormalizer-regex)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Parse MS Documents Framework (lib-parsems)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Zip Parse Plug-in (parse-zip)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - OPIC Scoring Plug-in (scoring-opic)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Registered Extension-Points:
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Nutch Summarizer (org.apache.nutch.searcher.Summarizer)
2008-09-05 09:32:47,307 INFO plugin.PluginRepository - Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2008-09-05 09:32:47,307 INFO plugin.PluginRepository - Nutch Protocol (org.apache.nutch.protocol.Protocol)
2008-09-05 09:32:47,307 INFO plugin.PluginRepository - Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2008-09-05 09:32:47,307 INFO plugin.PluginRepository - Nutch URL Filter (org.apache.nutch.net.URLFilter)
2008-09-05 09:32:47,307 INFO plugin.PluginRepository - HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2008-09-05 09:32:47,307 INFO plugin.PluginRepository - Nutch Online Search Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
2008-09-05 09:32:47,307 INFO plugin.PluginRepository - Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2008-09-05 09:32:47,307 INFO plugin.PluginRepository - Nutch Content Parser (org.apache.nutch.parse.Parser)
2008-09-05 09:32:47,307 INFO plugin.PluginRepository - Ontology Model Loader (org.apache.nutch.ontology.Ontology)
2008-09-05 09:32:47,307 INFO plugin.PluginRepository - Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2008-09-05 09:32:47,307 INFO plugin.PluginRepository - Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
2008-09-05 09:32:47,342 INFO searcher.NutchBean - opening segments in crawl/segments
2008-09-05 09:32:47,368 INFO searcher.SummarizerFactory - Using the first summarizer extension found: Basic Summarizer
2008-09-05 09:32:47,371 INFO searcher.NutchBean - opening linkdb in crawl/linkdb
2008-09-05 09:32:52,746 INFO searcher.NutchBean - opening indexes in crawl/indexes
2008-09-05 09:32:52,791 INFO plugin.PluginRepository - Plugins: looking in: /ok/appl/nutch-2008-09-04_04-01-27/plugins
2008-09-05 09:32:52,999 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
2008-09-05 09:32:52,999 INFO plugin.PluginRepository - Registered Plugins:
2008-09-05 09:32:52,999 INFO plugin.PluginRepository - CyberNeko HTML Parser (lib-nekohtml)
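The DiskChecker$DiskErrorException near the end of the fetch log above ("Could not find any valid local directory for ...") is raised when Hadoop cannot find a usable local directory for map output. As far as I know, in local mode that directory sits under hadoop.tmp.dir, which defaults to /tmp/hadoop-${user.name} unless overridden, so free space there matters more than free space where the crawl directory lives. A rough sketch for checking such a directory; free_kb and dir_usable are hypothetical helper names, and any threshold passed in is an arbitrary choice, not a Hadoop setting.

```shell
#!/bin/sh
# Print free space, in kilobytes, on the filesystem holding a directory.
# df -P gives one POSIX-format line per filesystem; column 4 is "Available".
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Succeed if the directory exists, is writable, and has at least min_kb free.
dir_usable() {
    dir=$1
    min_kb=$2
    [ -d "$dir" ] && [ -w "$dir" ] && [ "$(free_kb "$dir")" -ge "$min_kb" ]
}
```

For example, `dir_usable /tmp/hadoop-$USER 1048576 || echo "tmp dir looks too full"` would flag less than roughly 1 GB free, with both the path and the figure being assumptions about this setup.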
> Subject: Re: Job failed!
> From: zhengsj03@163.com
> To: nutch-user@lucene.apache.org
> Date: Fri, 5 Sep 2008 17:28:47 +0800
>
> Could you show the whole hadoop.log?
> On Fri, 2008-09-05 at 08:46 +0000, Edward Quick wrote:
> > Hi,
> >
> > I ran a crawl last night
> >
> > bin/nutch crawl urls -dir crawl -depth 10
> >
> > which collected 10612 pages, and then bailed out with the following error:
> >
> > Exception in thread "main" java.io.IOException: Job failed!
> > at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1062)
> > at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:552)
> > at org.apache.nutch.crawl.Crawl.main(Crawl.java:122)
> >
> > I checked there was enough space on the box, and there don't appear to be any errors in hadoop.log or the crawl output, so I'm stuck on what caused this.
> >
> > Also, is there a way to pick up the crawl from where it stopped rather than having to rerun it all over again?
> >
> > Thanks for any help.
> >
> > Ed.
RE: Job failed!
Posted by Edward Quick <ed...@hotmail.com>.
Initially I just did a tail -10, so I thought there were no errors, but there are actually a few. The pdf errors are my fault: I updated the pdf plugin with the latest PDFBox and FontBox jars from cvs on sf.net and missed out parse-pdf.jar on the rebuild. I'm not sure that's why the job failed, though. The log is 5MB so I can't really attach it all here, but hopefully the last 200 lines give an indication.
By the way, is there a way to kickstart this crawl off again without crawling from the start again?
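Since a short tail can miss problems in a 5MB log, a per-logger count of warnings and errors may give a quicker overview than paging through it. A small sketch, assuming every log line of interest has the "date time LEVEL logger - message" shape seen in hadoop.log; summarize_log is a made-up name. Stack-trace lines that lack the timestamp prefix are not counted.

```shell
#!/bin/sh
# Count WARN/ERROR/FATAL lines per logger in a log4j-style log whose lines
# look like: "2008-09-05 03:41:22,360 WARN parse.ParserFactory - ..."
# Fields: $1 date, $2 time, $3 level, $4 logger.
summarize_log() {
    awk '$3 == "WARN" || $3 == "ERROR" || $3 == "FATAL" { n[$3 " " $4]++ }
         END { for (k in n) print n[k], k }' "$1" | sort -rn
}
```

Running `summarize_log hadoop.log.2008-09-05` prints one line per level and logger, most frequent first, which makes a lone mapred.LocalJobRunner warning easier to spot among the parse-pdf noise.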
tail -200 hadoop.log.2008-09-05
2008-09-05 03:41:22,360 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:22,360 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:22,360 WARN parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:22,360 WARN parse.ParserFactory - ... 4 more
2008-09-05 03:41:22,360 WARN parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:22,360 WARN parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/Premium+Service+Training+insert/$FILE/Premium+training.pdf of type application/pdf
2008-09-05 03:41:22,362 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/CTP+-+Travel+Plan+Objectives?OpenDocument
2008-09-05 03:41:23,616 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptsal.nsf/Content/ctcBA+Home%5CBusTools%5CC%5Ccomp+tckts%5Ccr+comp+tickets?OpenDocument
2008-09-05 03:41:24,745 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/Notes+7+-+93+Rooms?OpenDocument
2008-09-05 03:41:26,033 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/SCCM+January+/$FILE/SCCMonthly+-+NovDec%2C+07+%2808+Jan%2C+08%29.pdf
2008-09-05 03:41:27,215 WARN parse.ParserFactory - org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:27,216 WARN parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:166)
2008-09-05 03:41:27,216 WARN parse.ParserFactory - at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:133)
2008-09-05 03:41:27,216 WARN parse.ParserFactory - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:67)
2008-09-05 03:41:27,216 WARN parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:355)
2008-09-05 03:41:27,216 WARN parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-09-05 03:41:27,216 WARN parse.ParserFactory - Caused by: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:27,216 WARN parse.ParserFactory - at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
2008-09-05 03:41:27,216 WARN parse.ParserFactory - at java.security.AccessController.doPrivileged(Native Method)
2008-09-05 03:41:27,216 WARN parse.ParserFactory - at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
2008-09-05 03:41:27,216 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:27,216 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:27,216 WARN parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:27,216 WARN parse.ParserFactory - ... 4 more
2008-09-05 03:41:27,216 WARN parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:27,216 WARN parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/SCCM+January+/$FILE/SCCMonthly+-+NovDec%2C+07+%2808+Jan%2C+08%29.pdf of type application/pdf
2008-09-05 03:41:27,216 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/bani.nsf/Content/XXXXLS%5FQ1Results%5F030807%5CXXXXLS%5FQ1Resultsvideo%5F030807?opendocument
2008-09-05 03:41:28,451 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptflt.nsf/AttachmentsByTitle/Flight+Ops+News+Aug+2008/$FILE/FLIGHT+OPS_AUGUST_08+intranet+live.pdf
2008-09-05 03:41:29,760 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/Virus+2+questions?OpenDocument
2008-09-05 03:41:30,789 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/Gender+Reass+the+process?OpenDocument
2008-09-05 03:41:32,066 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/LGW+Crew+Responsibilities/$FILE/Crew+Responsibilities.doc
2008-09-05 03:41:33,390 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptflt.nsf/Content/Flight+Ops+Home%5CBusiness+Tools%5CFlight+Technical+Services%5CAircraft+Weights+%26+Evaluation%5CFleet+Weights+-+Aircraft+Weighing+Schedules?OpenDocument
2008-09-05 03:41:34,562 WARN parse.ParserFactory - org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:34,562 WARN parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:166)
2008-09-05 03:41:34,562 WARN parse.ParserFactory - at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:133)
2008-09-05 03:41:34,562 WARN parse.ParserFactory - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:67)
2008-09-05 03:41:34,562 WARN parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:355)
2008-09-05 03:41:34,562 WARN parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-09-05 03:41:34,562 WARN parse.ParserFactory - Caused by: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:34,562 WARN parse.ParserFactory - at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
2008-09-05 03:41:34,562 WARN parse.ParserFactory - at java.security.AccessController.doPrivileged(Native Method)
2008-09-05 03:41:34,562 WARN parse.ParserFactory - at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
2008-09-05 03:41:34,562 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:34,562 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:34,563 WARN parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:34,563 WARN parse.ParserFactory - ... 4 more
2008-09-05 03:41:34,563 WARN parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:34,563 WARN parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptrix.nsf/AttachmentsByTitle/T5+Retail+-+T5+Ground+Level/$FILE/T5_Ground_Level.pdf of type application/pdf
2008-09-05 03:41:34,564 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/travel/stpg2.nsf/072561aa006322660725618c006b09a0/fc11f85e25deb736802574a30033c99e?OpenDocument
2008-09-05 03:41:35,926 WARN parse.ParserFactory - org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:166)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:133)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:67)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:355)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - Caused by: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at java.security.AccessController.doPrivileged(Native Method)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - ... 4 more
2008-09-05 03:41:35,926 WARN parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:35,926 WARN parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptrix.nsf/AttachmentsByTitle/Diversity+dignity+at+work+booklet/$FILE/Dignity+at+work+booklet.pdf of type application/pdf
2008-09-05 03:41:35,928 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/communications/wtps1.nsf/$lookup/1D94AD9A45B463638025730100263FDF
2008-09-05 03:41:36,988 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/Barplus+Hints+and+Tips/$FILE/Barplus+Hints+and+Tips.pdf
2008-09-05 03:41:38,217 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/apteng.nsf/Content/Engineering+Home%5CDepartment+Information%5CEngineering+IT+Support+%26+Delivery+Homepage%5CEngineering+Solution+Group+%28ESG%29+Homepage%5CKey+user+Guides?OpenDocument
2008-09-05 03:41:41,143 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/Cultural+Awareness+Photo+Prize+Draw?OpenDocument
2008-09-05 03:41:42,278 WARN parse.ParserFactory - org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:166)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:133)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:67)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:355)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - Caused by: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at java.security.AccessController.doPrivileged(Native Method)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - ... 4 more
2008-09-05 03:41:42,279 WARN parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:42,279 WARN parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptflt.nsf/AttachmentsByTitle/Flight+Ops+News+Aug+2008/$FILE/FLIGHT+OPS_AUGUST_08+intranet+live.pdf of type application/pdf
2008-09-05 03:41:42,313 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptsal.nsf/Content/ctcBA+Home%5CBusTools%5CB%5Cbah%5CPromos+pckge%5CFlrda+08+EBO+WTP+upgde?OpenDocument
2008-09-05 03:41:42,342 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/PMA+EG904+timescales?OpenDocument
2008-09-05 03:41:52,279 WARN parse.ParserFactory - org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:166)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:133)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:67)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:355)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - Caused by: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at java.security.AccessController.doPrivileged(Native Method)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - ... 4 more
2008-09-05 03:41:52,279 WARN parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:52,279 WARN parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/Barplus+Hints+and+Tips/$FILE/Barplus+Hints+and+Tips.pdf of type application/pdf
2008-09-05 03:41:55,927 WARN mapred.LocalJobRunner - job_local_21
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for taskTracker/jobcache/job_local_21/job_local_21_map_0000/output/file.out
at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:313)
at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
at org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:61)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:982)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:698)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:220)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:157)
2008-09-05 09:32:46,906 INFO searcher.NutchBean - opening indexes in crawl/indexes
2008-09-05 09:32:47,002 INFO plugin.PluginRepository - Plugins: looking in: /ok/appl/nutch-2008-09-04_04-01-27/plugins
2008-09-05 09:32:47,305 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
2008-09-05 09:32:47,305 INFO plugin.PluginRepository - Registered Plugins:
2008-09-05 09:32:47,305 INFO plugin.PluginRepository - CyberNeko HTML Parser (lib-nekohtml)
2008-09-05 09:32:47,305 INFO plugin.PluginRepository - MSPowerPoint Parse Plug-in (parse-mspowerpoint)
2008-09-05 09:32:47,305 INFO plugin.PluginRepository - Site Query Filter (query-site)
2008-09-05 09:32:47,305 INFO plugin.PluginRepository - Http / Https Protocol Plug-in (protocol-httpclient)
2008-09-05 09:32:47,305 INFO plugin.PluginRepository - MSWord Parse Plug-in (parse-msword)
2008-09-05 09:32:47,305 INFO plugin.PluginRepository - Basic URL Normalizer (urlnormalizer-basic)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Pass-through URL Normalizer (urlnormalizer-pass)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Html Parse Plug-in (parse-html)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Regex URL Filter Framework (lib-regex-filter)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Basic Indexing Filter (index-basic)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Pdf Parse Plug-in (parse-pdf)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Basic Summarizer Plug-in (summary-basic)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - MSExcel Parse Plug-in (parse-msexcel)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Text Parse Plug-in (parse-text)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Jakarta POI - Java API To Access Microsoft Format Files (lib-jakarta-poi)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Regex URL Filter (urlfilter-regex)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Basic Query Filter (query-basic)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - HTTP Framework (lib-http)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - URL Query Filter (query-url)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Regex URL Normalizer (urlnormalizer-regex)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Parse MS Documents Framework (lib-parsems)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Zip Parse Plug-in (parse-zip)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - OPIC Scoring Plug-in (scoring-opic)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Registered Extension-Points:
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Nutch Summarizer (org.apache.nutch.searcher.Summarizer)
2008-09-05 09:32:47,307 INFO plugin.PluginRepository - Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2008-09-05 09:32:47,307 INFO plugin.PluginRepository - Nutch Protocol (org.apache.nutch.protocol.Protocol)
2008-09-05 09:32:47,307 INFO plugin.PluginRepository - Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2008-09-05 09:32:47,307 INFO plugin.PluginRepository - Nutch URL Filter (org.apache.nutch.net.URLFilter)
2008-09-05 09:32:47,307 INFO plugin.PluginRepository - HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2008-09-05 09:32:47,307 INFO plugin.PluginRepository - Nutch Online Search Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
2008-09-05 09:32:47,307 INFO plugin.PluginRepository - Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2008-09-05 09:32:47,307 INFO plugin.PluginRepository - Nutch Content Parser (org.apache.nutch.parse.Parser)
2008-09-05 09:32:47,307 INFO plugin.PluginRepository - Ontology Model Loader (org.apache.nutch.ontology.Ontology)
2008-09-05 09:32:47,307 INFO plugin.PluginRepository - Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2008-09-05 09:32:47,307 INFO plugin.PluginRepository - Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
2008-09-05 09:32:47,342 INFO searcher.NutchBean - opening segments in crawl/segments
2008-09-05 09:32:47,368 INFO searcher.SummarizerFactory - Using the first summarizer extension found: Basic Summarizer
2008-09-05 09:32:47,371 INFO searcher.NutchBean - opening linkdb in crawl/linkdb
2008-09-05 09:32:52,746 INFO searcher.NutchBean - opening indexes in crawl/indexes
2008-09-05 09:32:52,791 INFO plugin.PluginRepository - Plugins: looking in: /ok/appl/nutch-2008-09-04_04-01-27/plugins
2008-09-05 09:32:52,999 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
2008-09-05 09:32:52,999 INFO plugin.PluginRepository - Registered Plugins:
2008-09-05 09:32:52,999 INFO plugin.PluginRepository - CyberNeko HTML Parser (lib-nekohtml)
> Subject: Re: Job failed!
> From: zhengsj03@163.com
> To: nutch-user@lucene.apache.org
> Date: Fri, 5 Sep 2008 17:28:47 +0800
>
> Could you show the whole hadoop.log?
> On Fri, 2008-09-05 at 08:46 +0000, Edward Quick wrote:
> > Hi,
> >
> > I ran a crawl last night
> >
> > bin/nutch crawl urls -dir crawl -depth 10
> >
> > which collected 10612 pages, and then bailed out with the following error:
> >
> > Exception in thread "main" java.io.IOException: Job failed!
> > at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1062)
> > at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:552)
> > at org.apache.nutch.crawl.Crawl.main(Crawl.java:122)
> >
> > I checked there was enough space on the box, and there don't appear to be any errors in hadoop.log or the crawl output, so I'm stuck on what caused this.
> >
> > Also, is there a way to pick up the crawl from where it stopped rather than having to rerun it all over again?
> >
> > Thanks for any help.
> >
> > Ed.
> >
> >
> >
FW: Job failed!
Posted by Edward Quick <ed...@hotmail.com>.
Struth! Here's another problem as well. I'm trying to merge the segments I've created so far:
$ nutch mergesegs crawl/mergesegs_dir -dir crawl/segments
Merging 5 segments to crawl/mergesegs_dir/20080905223155
SegmentMerger: adding file:/tmp/crawl/segments/20080905141605
SegmentMerger: adding file:/tmp/crawl/segments/20080905141522
SegmentMerger: adding file:/tmp/crawl/segments/20080905142231
SegmentMerger: adding file:/tmp/crawl/segments/20080905153116
SegmentMerger: adding file:/tmp/crawl/segments/20080905141348
SegmentMerger: using segment data from: crawl_generate
$ find crawl/mergesegs_dir
crawl/mergesegs_dir
crawl/mergesegs_dir/20080905223155
crawl/mergesegs_dir/20080905223155/crawl_generate
crawl/mergesegs_dir/20080905223155/crawl_generate/.part-00000.crc
crawl/mergesegs_dir/20080905223155/crawl_generate/part-00000
But when I run invertlinks, I get an error about a missing path:
$ mv crawl/segments crawl/BACKUPsegments
$ mv crawl/mergesegs_dir crawl/segments
$ nutch invertlinks crawl/linkdb -dir crawl/segments
LinkDb: starting
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/tmp/crawl/segments/20080905223155
LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path doesnt exist : file:/tmp/crawl/segments/20080905223155/parse_data
at org.apache.hadoop.mapred.FileInputFormat.validateInput(FileInputFormat.java:215)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:705)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:973)
at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:170)
at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:285)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:248)
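The merge output above reports only "using segment data from: crawl_generate", and the merged segment has no parse_data, which is exactly the path invertlinks complains about. Presumably at least one of the input segments had been generated but never fetched and parsed. A minimal sketch for spotting such segments before merging, assuming the segments sit on the local filesystem with one directory per segment; the function name list_unparsed_segments is made up for illustration and is not part of Nutch:

```shell
# Hypothetical helper (not a Nutch command): print every segment directory
# under the given path that has no parse_data subdirectory.
# Assumes a local filesystem layout such as crawl/segments/<timestamp>/.
list_unparsed_segments() {
  segments_dir="$1"
  for seg in "$segments_dir"/*/; do
    [ -d "$seg" ] || continue            # glob matched nothing
    if [ ! -d "${seg}parse_data" ]; then
      printf '%s\n' "${seg%/}"           # strip the trailing slash
    fi
  done
}
```

Running list_unparsed_segments crawl/BACKUPsegments would list the segments to leave out of (or fetch and parse before) the next mergesegs run.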
From: edwardquick@hotmail.com
To: nutch-user@lucene.apache.org
Subject: FW: Job failed!
Date: Fri, 5 Sep 2008 21:09:13 +0000
Sort of figured out how to kickstart the crawl again.
Basically did:
# most recent segment (the one whose fetch failed)
s1=$(ls -d crawl/segments/* | tail -1)
bin/nutch updatedb crawl/crawldb "$s1"
bin/nutch generate crawl/crawldb crawl/segments
# newly generated segment
s2=$(ls -d crawl/segments/* | tail -1)
bin/nutch fetch "$s2"
But unfortunately this is fetching the same urls as the previous fetch. :(
From: edwardquick@hotmail.com
To: nutch-user@lucene.apache.org
Subject: RE: Job failed!
Date: Fri, 5 Sep 2008 09:45:00 +0000
Initially I just did a tail -10, so I thought there were no errors, but there are actually a few. The PDF errors are my fault: I updated the pdf plugin with the latest PDFBox and FontBox jars from CVS on sf.net and left parse-pdf.jar out of the rebuild. I'm not sure that's why the job failed, though. The log is 5MB, so I can't really attach it all here, but hopefully the last 200 lines give an indication.
By the way, is there a way to kickstart this crawl off again without crawling from the start again?
tail -200 hadoop.log.2008-09-05
2008-09-05 03:41:22,360 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:22,360 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:22,360 WARN parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:22,360 WARN parse.ParserFactory - ... 4 more
2008-09-05 03:41:22,360 WARN parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:22,360 WARN parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/Premium+Service+Training+insert/$FILE/Premium+training.pdf of type application/pdf
2008-09-05 03:41:22,362 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/CTP+-+Travel+Plan+Objectives?OpenDocument
2008-09-05 03:41:23,616 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptsal.nsf/Content/ctcBA+Home%5CBusTools%5CC%5Ccomp+tckts%5Ccr+comp+tickets?OpenDocument
2008-09-05 03:41:24,745 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/Notes+7+-+93+Rooms?OpenDocument
2008-09-05 03:41:26,033 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/SCCM+January+/$FILE/SCCMonthly+-+NovDec%2C+07+%2808+Jan%2C+08%29.pdf
2008-09-05 03:41:27,215 WARN parse.ParserFactory - org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:27,216 WARN parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:166)
2008-09-05 03:41:27,216 WARN parse.ParserFactory - at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:133)
2008-09-05 03:41:27,216 WARN parse.ParserFactory - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:67)
2008-09-05 03:41:27,216 WARN parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:355)
2008-09-05 03:41:27,216 WARN parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-09-05 03:41:27,216 WARN parse.ParserFactory - Caused by: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:27,216 WARN parse.ParserFactory - at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
2008-09-05 03:41:27,216 WARN parse.ParserFactory - at java.security.AccessController.doPrivileged(Native Method)
2008-09-05 03:41:27,216 WARN parse.ParserFactory - at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
2008-09-05 03:41:27,216 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:27,216 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:27,216 WARN parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:27,216 WARN parse.ParserFactory - ... 4 more
2008-09-05 03:41:27,216 WARN parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:27,216 WARN parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/SCCM+January+/$FILE/SCCMonthly+-+NovDec%2C+07+%2808+Jan%2C+08%29.pdf of type application/pdf
2008-09-05 03:41:27,216 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/bani.nsf/Content/XXXXLS%5FQ1Results%5F030807%5CXXXXLS%5FQ1Resultsvideo%5F030807?opendocument
2008-09-05 03:41:28,451 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptflt.nsf/AttachmentsByTitle/Flight+Ops+News+Aug+2008/$FILE/FLIGHT+OPS_AUGUST_08+intranet+live.pdf
2008-09-05 03:41:29,760 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/Virus+2+questions?OpenDocument
2008-09-05 03:41:30,789 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/Gender+Reass+the+process?OpenDocument
2008-09-05 03:41:32,066 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/LGW+Crew+Responsibilities/$FILE/Crew+Responsibilities.doc
2008-09-05 03:41:33,390 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptflt.nsf/Content/Flight+Ops+Home%5CBusiness+Tools%5CFlight+Technical+Services%5CAircraft+Weights+%26+Evaluation%5CFleet+Weights+-+Aircraft+Weighing+Schedules?OpenDocument
2008-09-05 03:41:34,562 WARN parse.ParserFactory - org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:34,562 WARN parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:166)
2008-09-05 03:41:34,562 WARN parse.ParserFactory - at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:133)
2008-09-05 03:41:34,562 WARN parse.ParserFactory - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:67)
2008-09-05 03:41:34,562 WARN parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:355)
2008-09-05 03:41:34,562 WARN parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-09-05 03:41:34,562 WARN parse.ParserFactory - Caused by: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:34,562 WARN parse.ParserFactory - at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
2008-09-05 03:41:34,562 WARN parse.ParserFactory - at java.security.AccessController.doPrivileged(Native Method)
2008-09-05 03:41:34,562 WARN parse.ParserFactory - at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
2008-09-05 03:41:34,562 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:34,562 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:34,563 WARN parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:34,563 WARN parse.ParserFactory - ... 4 more
2008-09-05 03:41:34,563 WARN parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:34,563 WARN parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptrix.nsf/AttachmentsByTitle/T5+Retail+-+T5+Ground+Level/$FILE/T5_Ground_Level.pdf of type application/pdf
2008-09-05 03:41:34,564 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/travel/stpg2.nsf/072561aa006322660725618c006b09a0/fc11f85e25deb736802574a30033c99e?OpenDocument
2008-09-05 03:41:35,926 WARN parse.ParserFactory - org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:166)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:133)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:67)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:355)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - Caused by: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at java.security.AccessController.doPrivileged(Native Method)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - ... 4 more
2008-09-05 03:41:35,926 WARN parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:35,926 WARN parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptrix.nsf/AttachmentsByTitle/Diversity+dignity+at+work+booklet/$FILE/Dignity+at+work+booklet.pdf of type application/pdf
2008-09-05 03:41:35,928 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/communications/wtps1.nsf/$lookup/1D94AD9A45B463638025730100263FDF
2008-09-05 03:41:36,988 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/Barplus+Hints+and+Tips/$FILE/Barplus+Hints+and+Tips.pdf
2008-09-05 03:41:38,217 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/apteng.nsf/Content/Engineering+Home%5CDepartment+Information%5CEngineering+IT+Support+%26+Delivery+Homepage%5CEngineering+Solution+Group+%28ESG%29+Homepage%5CKey+user+Guides?OpenDocument
2008-09-05 03:41:41,143 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/Cultural+Awareness+Photo+Prize+Draw?OpenDocument
2008-09-05 03:41:42,278 WARN parse.ParserFactory - org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:166)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:133)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:67)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:355)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - Caused by: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at java.security.AccessController.doPrivileged(Native Method)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - ... 4 more
2008-09-05 03:41:42,279 WARN parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:42,279 WARN parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptflt.nsf/AttachmentsByTitle/Flight+Ops+News+Aug+2008/$FILE/FLIGHT+OPS_AUGUST_08+intranet+live.pdf of type application/pdf
2008-09-05 03:41:42,313 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptsal.nsf/Content/ctcBA+Home%5CBusTools%5CB%5Cbah%5CPromos+pckge%5CFlrda+08+EBO+WTP+upgde?OpenDocument
2008-09-05 03:41:42,342 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/PMA+EG904+timescales?OpenDocument
2008-09-05 03:41:52,279 WARN parse.ParserFactory - org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:166)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:133)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:67)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:355)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - Caused by: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at java.security.AccessController.doPrivileged(Native Method)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - ... 4 more
2008-09-05 03:41:52,279 WARN parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:52,279 WARN parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/Barplus+Hints+and+Tips/$FILE/Barplus+Hints+and+Tips.pdf of type application/pdf
2008-09-05 03:41:55,927 WARN mapred.LocalJobRunner - job_local_21
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for taskTracker/jobcache/job_local_21/job_local_21_map_0000/output/file.out
at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:313)
at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
at org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:61)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:982)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:698)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:220)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:157)
2008-09-05 09:32:52,999 INFO plugin.PluginRepository - CyberNeko HTML Parser (lib-nekohtml)
> Subject: Re: Job failed!
> From: zhengsj03@163.com
> To: nutch-user@lucene.apache.org
> Date: Fri, 5 Sep 2008 17:28:47 +0800
>
> Could you show the whole hadoop.log?
> On Fri, 2008-09-05 at 08:46 +0000, Edward Quick wrote:
> > Hi,
> >
> > I ran a crawl last night
> >
> > bin/nutch crawl urls -dir crawl -depth 10
> >
> > which collected 10612 pages, and then bailed out with the following error:
> >
> > Exception in thread "main" java.io.IOException: Job failed!
> > at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1062)
> > at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:552)
> > at org.apache.nutch.crawl.Crawl.main(Crawl.java:122)
> >
> > I checked there was enough space on the box, and there don't appear to be any errors in hadoop.log or the crawl output, so I'm stuck on what caused this.
> >
> > Also, is there a way to pick up the crawl from where it stopped rather than having to rerun it all over again?
> >
> > Thanks for any help.
> >
> > Ed.
FW: Job failed!
Posted by Edward Quick <ed...@hotmail.com>.
Sort of figured out how to kickstart the crawl again.
Basically did:
s1=$(ls -d crawl/segments/* | tail -1)
bin/nutch updatedb crawl/crawldb "$s1"
bin/nutch generate crawl/crawldb crawl/segments
s2=$(ls -d crawl/segments/* | tail -1)
bin/nutch fetch "$s2"
But unfortunately this is fetching the same urls as the previous fetch. :(
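One possible explanation, offered as a guess rather than a diagnosis: if the newest segment is the one left behind by the failed fetch, updatedb has no fetch results to apply, so the next generate selects the same URLs again. The sketch below picks the newest segment that actually contains a crawl_fetch directory. The function name latest_fetched_segment is hypothetical, and treating crawl_fetch as the sign of a finished fetch is an assumption about the segment layout, not something confirmed in this thread.

```shell
# Print the newest segment under $1 that contains a crawl_fetch directory.
# Segment names are timestamps, so the shell's sorted glob order is also
# chronological order. Returns non-zero if no such segment exists.
latest_fetched_segment() {
    segments_dir=$1
    newest=""
    for seg in "$segments_dir"/*/; do
        # Assumption: a fetch that ran to completion leaves crawl_fetch behind.
        if [ -d "${seg}crawl_fetch" ]; then
            newest=${seg%/}
        fi
    done
    if [ -n "$newest" ]; then
        printf '%s\n' "$newest"
    else
        return 1
    fi
}
```

With this in place, the updatedb step could read: s1=$(latest_fetched_segment crawl/segments) && bin/nutch updatedb crawl/crawldb "$s1"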
From: edwardquick@hotmail.com
To: nutch-user@lucene.apache.org
Subject: RE: Job failed!
Date: Fri, 5 Sep 2008 09:45:00 +0000
Initially I just did a tail -10 so thought there were no errors, but there are a few actually. The PDF errors are my fault: I updated the pdf plugin with the latest PDFBox and FontBox jars from CVS on sf.net and missed out parse-pdf.jar on the rebuild. I'm not sure that's why the job failed, though. The log is 5MB so I can't really attach it all here, but hopefully the last 200 lines give an indication.
By the way, is there a way to kickstart this crawl off again without crawling from the start again?
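Since a short tail can miss problems in a 5MB log, a quick tally of everything above INFO may be more telling. The following is a minimal sketch that assumes the "date time LEVEL logger - message" layout visible in the log lines of this thread; the function name summarize_log_problems is hypothetical.

```shell
# Count WARN/ERROR/FATAL lines per level and logger, most frequent first.
# $1: path to a hadoop.log file in the "date time LEVEL logger - msg" layout.
summarize_log_problems() {
    awk '$3 == "WARN" || $3 == "ERROR" || $3 == "FATAL" { n[$3 " " $4]++ }
         END { for (k in n) print n[k], k }' "$1" | sort -rn
}
```

Run as summarize_log_problems hadoop.log.2008-09-05, this would separate the many parse.ParserFactory warnings from rarer entries such as mapred.LocalJobRunner. Stack-trace lines that carry their own timestamp and level are counted too, so the totals are line counts, not event counts.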
tail -200 hadoop.log.2008-09-05
2008-09-05 03:41:22,360 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:22,360 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:22,360 WARN parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:22,360 WARN parse.ParserFactory - ... 4 more
2008-09-05 03:41:22,360 WARN parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:22,360 WARN parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/Premium+Service+Training+insert/$FILE/Premium+training.pdf of type application/pdf
2008-09-05 03:41:22,362 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/CTP+-+Travel+Plan+Objectives?OpenDocument
2008-09-05 03:41:23,616 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptsal.nsf/Content/ctcBA+Home%5CBusTools%5CC%5Ccomp+tckts%5Ccr+comp+tickets?OpenDocument
2008-09-05 03:41:24,745 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/Notes+7+-+93+Rooms?OpenDocument
2008-09-05 03:41:26,033 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/SCCM+January+/$FILE/SCCMonthly+-+NovDec%2C+07+%2808+Jan%2C+08%29.pdf
2008-09-05 03:41:27,215 WARN parse.ParserFactory - org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:27,216 WARN parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:166)
2008-09-05 03:41:27,216 WARN parse.ParserFactory - at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:133)
2008-09-05 03:41:27,216 WARN parse.ParserFactory - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:67)
2008-09-05 03:41:27,216 WARN parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:355)
2008-09-05 03:41:27,216 WARN parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-09-05 03:41:27,216 WARN parse.ParserFactory - Caused by: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:27,216 WARN parse.ParserFactory - at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
2008-09-05 03:41:27,216 WARN parse.ParserFactory - at java.security.AccessController.doPrivileged(Native Method)
2008-09-05 03:41:27,216 WARN parse.ParserFactory - at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
2008-09-05 03:41:27,216 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:27,216 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:27,216 WARN parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:27,216 WARN parse.ParserFactory - ... 4 more
2008-09-05 03:41:27,216 WARN parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:27,216 WARN parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/SCCM+January+/$FILE/SCCMonthly+-+NovDec%2C+07+%2808+Jan%2C+08%29.pdf of type application/pdf
2008-09-05 03:41:27,216 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/bani.nsf/Content/XXXXLS%5FQ1Results%5F030807%5CXXXXLS%5FQ1Resultsvideo%5F030807?opendocument
2008-09-05 03:41:28,451 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptflt.nsf/AttachmentsByTitle/Flight+Ops+News+Aug+2008/$FILE/FLIGHT+OPS_AUGUST_08+intranet+live.pdf
2008-09-05 03:41:29,760 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/Virus+2+questions?OpenDocument
2008-09-05 03:41:30,789 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/Gender+Reass+the+process?OpenDocument
2008-09-05 03:41:32,066 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/LGW+Crew+Responsibilities/$FILE/Crew+Responsibilities.doc
2008-09-05 03:41:33,390 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptflt.nsf/Content/Flight+Ops+Home%5CBusiness+Tools%5CFlight+Technical+Services%5CAircraft+Weights+%26+Evaluation%5CFleet+Weights+-+Aircraft+Weighing+Schedules?OpenDocument
2008-09-05 03:41:34,562 WARN parse.ParserFactory - org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:34,562 WARN parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:166)
2008-09-05 03:41:34,562 WARN parse.ParserFactory - at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:133)
2008-09-05 03:41:34,562 WARN parse.ParserFactory - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:67)
2008-09-05 03:41:34,562 WARN parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:355)
2008-09-05 03:41:34,562 WARN parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-09-05 03:41:34,562 WARN parse.ParserFactory - Caused by: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:34,562 WARN parse.ParserFactory - at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
2008-09-05 03:41:34,562 WARN parse.ParserFactory - at java.security.AccessController.doPrivileged(Native Method)
2008-09-05 03:41:34,562 WARN parse.ParserFactory - at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
2008-09-05 03:41:34,562 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:34,562 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:34,563 WARN parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:34,563 WARN parse.ParserFactory - ... 4 more
2008-09-05 03:41:34,563 WARN parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:34,563 WARN parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptrix.nsf/AttachmentsByTitle/T5+Retail+-+T5+Ground+Level/$FILE/T5_Ground_Level.pdf of type application/pdf
2008-09-05 03:41:34,564 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/travel/stpg2.nsf/072561aa006322660725618c006b09a0/fc11f85e25deb736802574a30033c99e?OpenDocument
2008-09-05 03:41:35,926 WARN parse.ParserFactory - org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:166)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:133)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:67)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:355)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - Caused by: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at java.security.AccessController.doPrivileged(Native Method)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - ... 4 more
2008-09-05 03:41:35,926 WARN parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:35,926 WARN parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptrix.nsf/AttachmentsByTitle/Diversity+dignity+at+work+booklet/$FILE/Dignity+at+work+booklet.pdf of type application/pdf
2008-09-05 03:41:35,928 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/communications/wtps1.nsf/$lookup/1D94AD9A45B463638025730100263FDF
2008-09-05 03:41:36,988 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/Barplus+Hints+and+Tips/$FILE/Barplus+Hints+and+Tips.pdf
2008-09-05 03:41:38,217 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/apteng.nsf/Content/Engineering+Home%5CDepartment+Information%5CEngineering+IT+Support+%26+Delivery+Homepage%5CEngineering+Solution+Group+%28ESG%29+Homepage%5CKey+user+Guides?OpenDocument
2008-09-05 03:41:41,143 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/Cultural+Awareness+Photo+Prize+Draw?OpenDocument
2008-09-05 03:41:42,278 WARN parse.ParserFactory - org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:166)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:133)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:67)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:355)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - Caused by: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at java.security.AccessController.doPrivileged(Native Method)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - ... 4 more
2008-09-05 03:41:42,279 WARN parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:42,279 WARN parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptflt.nsf/AttachmentsByTitle/Flight+Ops+News+Aug+2008/$FILE/FLIGHT+OPS_AUGUST_08+intranet+live.pdf of type application/pdf
2008-09-05 03:41:42,313 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptsal.nsf/Content/ctcBA+Home%5CBusTools%5CB%5Cbah%5CPromos+pckge%5CFlrda+08+EBO+WTP+upgde?OpenDocument
2008-09-05 03:41:42,342 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/PMA+EG904+timescales?OpenDocument
2008-09-05 03:41:52,279 WARN parse.ParserFactory - org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:166)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:133)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:67)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:355)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - Caused by: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at java.security.AccessController.doPrivileged(Native Method)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - ... 4 more
2008-09-05 03:41:52,279 WARN parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:52,279 WARN parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/Barplus+Hints+and+Tips/$FILE/Barplus+Hints+and+Tips.pdf of type application/pdf
2008-09-05 03:41:55,927 WARN mapred.LocalJobRunner - job_local_21
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for taskTracker/jobcache/job_local_21/job_local_21_map_0000/output/file.out
at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:313)
at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
at org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:61)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:982)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:698)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:220)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:157)
2008-09-05 09:32:46,906 INFO searcher.NutchBean - opening indexes in crawl/indexes
2008-09-05 09:32:47,002 INFO plugin.PluginRepository - Plugins: looking in: /ok/appl/nutch-2008-09-04_04-01-27/plugins
2008-09-05 09:32:47,305 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
2008-09-05 09:32:47,305 INFO plugin.PluginRepository - Registered Plugins:
2008-09-05 09:32:47,305 INFO plugin.PluginRepository - CyberNeko HTML Parser (lib-nekohtml)
2008-09-05 09:32:47,305 INFO plugin.PluginRepository - MSPowerPoint Parse Plug-in (parse-mspowerpoint)
2008-09-05 09:32:47,305 INFO plugin.PluginRepository - Site Query Filter (query-site)
2008-09-05 09:32:47,305 INFO plugin.PluginRepository - Http / Https Protocol Plug-in (protocol-httpclient)
2008-09-05 09:32:47,305 INFO plugin.PluginRepository - MSWord Parse Plug-in (parse-msword)
2008-09-05 09:32:47,305 INFO plugin.PluginRepository - Basic URL Normalizer (urlnormalizer-basic)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Pass-through URL Normalizer (urlnormalizer-pass)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Html Parse Plug-in (parse-html)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Regex URL Filter Framework (lib-regex-filter)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Basic Indexing Filter (index-basic)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Pdf Parse Plug-in (parse-pdf)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Basic Summarizer Plug-in (summary-basic)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - MSExcel Parse Plug-in (parse-msexcel)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Text Parse Plug-in (parse-text)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Jakarta POI - Java API To Access Microsoft Format Files (lib-jakarta-poi)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Regex URL Filter (urlfilter-regex)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Basic Query Filter (query-basic)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - HTTP Framework (lib-http)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - URL Query Filter (query-url)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Regex URL Normalizer (urlnormalizer-regex)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Parse MS Documents Framework (lib-parsems)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Zip Parse Plug-in (parse-zip)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - OPIC Scoring Plug-in (scoring-opic)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Registered Extension-Points:
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Nutch Summarizer (org.apache.nutch.searcher.Summarizer)
2008-09-05 09:32:47,307 INFO plugin.PluginRepository - Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2008-09-05 09:32:47,307 INFO plugin.PluginRepository - Nutch Protocol (org.apache.nutch.protocol.Protocol)
2008-09-05 09:32:47,307 INFO plugin.PluginRepository - Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2008-09-05 09:32:47,307 INFO plugin.PluginRepository - Nutch URL Filter (org.apache.nutch.net.URLFilter)
2008-09-05 09:32:47,307 INFO plugin.PluginRepository - HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2008-09-05 09:32:47,307 INFO plugin.PluginRepository - Nutch Online Search Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
2008-09-05 09:32:47,307 INFO plugin.PluginRepository - Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2008-09-05 09:32:47,307 INFO plugin.PluginRepository - Nutch Content Parser (org.apache.nutch.parse.Parser)
2008-09-05 09:32:47,307 INFO plugin.PluginRepository - Ontology Model Loader (org.apache.nutch.ontology.Ontology)
2008-09-05 09:32:47,307 INFO plugin.PluginRepository - Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2008-09-05 09:32:47,307 INFO plugin.PluginRepository - Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
2008-09-05 09:32:47,342 INFO searcher.NutchBean - opening segments in crawl/segments
2008-09-05 09:32:47,368 INFO searcher.SummarizerFactory - Using the first summarizer extension found: Basic Summarizer
2008-09-05 09:32:47,371 INFO searcher.NutchBean - opening linkdb in crawl/linkdb
2008-09-05 09:32:52,746 INFO searcher.NutchBean - opening indexes in crawl/indexes
2008-09-05 09:32:52,791 INFO plugin.PluginRepository - Plugins: looking in: /ok/appl/nutch-2008-09-04_04-01-27/plugins
2008-09-05 09:32:52,999 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
2008-09-05 09:32:52,999 INFO plugin.PluginRepository - Registered Plugins:
2008-09-05 09:32:52,999 INFO plugin.PluginRepository - CyberNeko HTML Parser (lib-nekohtml)
> Subject: Re: Job failed!
> From: zhengsj03@163.com
> To: nutch-user@lucene.apache.org
> Date: Fri, 5 Sep 2008 17:28:47 +0800
>
> Could you show the whole hadoop.log?
> On Fri, 2008-09-05 at 08:46 +0000, Edward Quick wrote:
> > Hi,
> >
> > I ran a crawl last night
> >
> > bin/nutch crawl urls -dir crawl -depth 10
> >
> > which collected 10612 pages, and then bailed out with the following error:
> >
> > Exception in thread "main" java.io.IOException: Job failed!
> > at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1062)
> > at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:552)
> > at org.apache.nutch.crawl.Crawl.main(Crawl.java:122)
> >
> > I checked there was enough space on the box, and there don't appear to be any errors in hadoop.log or the crawl output, so I'm stuck on what caused this.
> >
> > Also, is there a way to pick up the crawl from where it stopped rather than having to rerun it all over again?
> >
> > Thanks for any help.
> >
> > Ed.
FW: Job failed!
Posted by Edward Quick <ed...@hotmail.com>.
Hi,
I reran the fetch and got this error again after 5 hours. Any ideas what causes this?
2008-09-06 04:10:23,062 WARN mapred.LocalJobRunner - job_local_1
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/jobcache/job_local_1/job_local_1_map_0000/output/spill0.out in any of the configured local directories
at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathToRead(LocalDirAllocator.java:359)
at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathToRead(LocalDirAllocator.java:138)
at org.apache.hadoop.mapred.MapOutputFile.getSpillFile(MapOutputFile.java:94)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:972)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:698)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:220)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:157)
2008-09-06 04:10:23,860 FATAL fetcher.Fetcher - Fetcher: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1062)
at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:552)
at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:587)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:559)
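The DiskErrorException traces in this thread come from Hadoop's LocalDirAllocator, which works in the job's local scratch directories rather than in the crawl directory. So free space on the filesystem holding that scratch area may matter more than free space on the box as a whole. The sketch below reports it; the function name free_kb is hypothetical, and the /tmp/hadoop-$USER location in the usage note assumes the stock hadoop.tmp.dir setting, which this thread does not confirm.

```shell
# Print the free space, in kilobytes, of the filesystem that holds $1.
# df -P selects the POSIX output format: a header line, then one line per
# filesystem with the available 1024-byte blocks in the fourth column.
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}
```

For example, free_kb /tmp/hadoop-$USER before and during a long fetch would show whether the scratch area is filling up.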
From: edwardquick@hotmail.com
To: nutch-user@lucene.apache.org
Subject: FW: Job failed!
Date: Fri, 5 Sep 2008 21:58:13 +0000
Sorry for all these posts. I found the problem. Had a dodgy segment, probably the one which was left after the last fetch bombed out.
From: edwardquick@hotmail.com
To: nutch-user@lucene.apache.org
Subject: FW: Job failed!
Date: Fri, 5 Sep 2008 21:49:38 +0000
Struth! Here's another problem as well. I'm trying to merge the segments I've created so far:
$ nutch mergesegs crawl/mergesegs_dir -dir crawl/segments
Merging 5 segments to crawl/mergesegs_dir/20080905223155
SegmentMerger: adding file:/tmp/crawl/segments/20080905141605
SegmentMerger: adding file:/tmp/crawl/segments/20080905141522
SegmentMerger: adding file:/tmp/crawl/segments/20080905142231
SegmentMerger: adding file:/tmp/crawl/segments/20080905153116
SegmentMerger: adding file:/tmp/crawl/segments/20080905141348
SegmentMerger: using segment data from: crawl_generate
$ find crawl/mergesegs_dir
crawl/mergesegs_dir
crawl/mergesegs_dir/20080905223155
crawl/mergesegs_dir/20080905223155/crawl_generate
crawl/mergesegs_dir/20080905223155/crawl_generate/.part-00000.crc
crawl/mergesegs_dir/20080905223155/crawl_generate/part-00000
But when I run invertlinks, I get an error about a missing path:
$ mv crawl/segments crawl/BACKUPsegments
$ mv crawl/mergesegs_dir crawl/segments
$ nutch invertlinks crawl/linkdb -dir crawl/segments
LinkDb: starting
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/tmp/crawl/segments/20080905223155
LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path doesnt exist : file:/tmp/crawl/segments/20080905223155/parse_data
at org.apache.hadoop.mapred.FileInputFormat.validateInput(FileInputFormat.java:215)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:705)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:973)
at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:170)
at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:285)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:248)
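The find output above shows that the merged segment contains only crawl_generate, while invertlinks asks for parse_data, so it may help to check the source segments before merging. The sketch below lists the segments that lack a given subdirectory; the function name segments_missing_part is hypothetical, and parse_data is the default only because it is the path named in the error.

```shell
# List every segment under $1 that lacks the subdirectory named in $2
# (parse_data when $2 is omitted).
segments_missing_part() {
    segments_dir=$1
    part=${2:-parse_data}
    for seg in "$segments_dir"/*/; do
        if [ -d "$seg" ] && [ ! -d "${seg}${part}" ]; then
            printf '%s\n' "${seg%/}"
        fi
    done
}
```

Running segments_missing_part crawl/BACKUPsegments would name any segment that was generated but never fetched and parsed; such a segment is a candidate for the one left behind by the failed fetch.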
From: edwardquick@hotmail.com
To: nutch-user@lucene.apache.org
Subject: FW: Job failed!
Date: Fri, 5 Sep 2008 21:09:13 +0000
Sort of figured out how to kickstart the crawl again.
Basically did:
s1=$(ls -d crawl/segments/* | tail -1)
bin/nutch updatedb crawl/crawldb "$s1"
bin/nutch generate crawl/crawldb crawl/segments
s2=$(ls -d crawl/segments/* | tail -1)
bin/nutch fetch "$s2"
But unfortunately this is fetching the same urls as the previous fetch. :(
From: edwardquick@hotmail.com
To: nutch-user@lucene.apache.org
Subject: RE: Job failed!
Date: Fri, 5 Sep 2008 09:45:00 +0000
Initially I just did a tail -10 so thought there were no errors, but there are a few actually. The PDF errors are my fault: I updated the pdf plugin with the latest PDFBox and FontBox jars from CVS on sf.net and missed out parse-pdf.jar on the rebuild. I'm not sure that's why the job failed, though. The log is 5MB so I can't really attach it all here, but hopefully the last 200 lines give an indication.
By the way, is there a way to kickstart this crawl off again without crawling from the start again?
tail -200 hadoop.log.2008-09-05
2008-09-05 03:41:22,360 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:22,360 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:22,360 WARN parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:22,360 WARN parse.ParserFactory - ... 4 more
2008-09-05 03:41:22,360 WARN parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:22,360 WARN parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/Premium+Service+Training+insert/$FILE/Premium+training.pdf of type application/pdf
2008-09-05 03:41:22,362 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/CTP+-+Travel+Plan+Objectives?OpenDocument
2008-09-05 03:41:23,616 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptsal.nsf/Content/ctcBA+Home%5CBusTools%5CC%5Ccomp+tckts%5Ccr+comp+tickets?OpenDocument
2008-09-05 03:41:24,745 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/Notes+7+-+93+Rooms?OpenDocument
2008-09-05 03:41:26,033 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/SCCM+January+/$FILE/SCCMonthly+-+NovDec%2C+07+%2808+Jan%2C+08%29.pdf
2008-09-05 03:41:27,215 WARN parse.ParserFactory - org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:27,216 WARN parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:166)
2008-09-05 03:41:27,216 WARN parse.ParserFactory - at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:133)
2008-09-05 03:41:27,216 WARN parse.ParserFactory - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:67)
2008-09-05 03:41:27,216 WARN parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:355)
2008-09-05 03:41:27,216 WARN parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-09-05 03:41:27,216 WARN parse.ParserFactory - Caused by: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:27,216 WARN parse.ParserFactory - at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
2008-09-05 03:41:27,216 WARN parse.ParserFactory - at java.security.AccessController.doPrivileged(Native Method)
2008-09-05 03:41:27,216 WARN parse.ParserFactory - at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
2008-09-05 03:41:27,216 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:27,216 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:27,216 WARN parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:27,216 WARN parse.ParserFactory - ... 4 more
2008-09-05 03:41:27,216 WARN parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:27,216 WARN parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/SCCM+January+/$FILE/SCCMonthly+-+NovDec%2C+07+%2808+Jan%2C+08%29.pdf of type application/pdf
2008-09-05 03:41:27,216 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/bani.nsf/Content/XXXXLS%5FQ1Results%5F030807%5CXXXXLS%5FQ1Resultsvideo%5F030807?opendocument
2008-09-05 03:41:28,451 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptflt.nsf/AttachmentsByTitle/Flight+Ops+News+Aug+2008/$FILE/FLIGHT+OPS_AUGUST_08+intranet+live.pdf
2008-09-05 03:41:29,760 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/Virus+2+questions?OpenDocument
2008-09-05 03:41:30,789 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/Gender+Reass+the+process?OpenDocument
2008-09-05 03:41:32,066 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/LGW+Crew+Responsibilities/$FILE/Crew+Responsibilities.doc
2008-09-05 03:41:33,390 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptflt.nsf/Content/Flight+Ops+Home%5CBusiness+Tools%5CFlight+Technical+Services%5CAircraft+Weights+%26+Evaluation%5CFleet+Weights+-+Aircraft+Weighing+Schedules?OpenDocument
2008-09-05 03:41:34,562 WARN parse.ParserFactory - org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:34,562 WARN parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:166)
2008-09-05 03:41:34,562 WARN parse.ParserFactory - at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:133)
2008-09-05 03:41:34,562 WARN parse.ParserFactory - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:67)
2008-09-05 03:41:34,562 WARN parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:355)
2008-09-05 03:41:34,562 WARN parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-09-05 03:41:34,562 WARN parse.ParserFactory - Caused by: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:34,562 WARN parse.ParserFactory - at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
2008-09-05 03:41:34,562 WARN parse.ParserFactory - at java.security.AccessController.doPrivileged(Native Method)
2008-09-05 03:41:34,562 WARN parse.ParserFactory - at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
2008-09-05 03:41:34,562 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:34,562 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:34,563 WARN parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:34,563 WARN parse.ParserFactory - ... 4 more
2008-09-05 03:41:34,563 WARN parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:34,563 WARN parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptrix.nsf/AttachmentsByTitle/T5+Retail+-+T5+Ground+Level/$FILE/T5_Ground_Level.pdf of type application/pdf
2008-09-05 03:41:34,564 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/travel/stpg2.nsf/072561aa006322660725618c006b09a0/fc11f85e25deb736802574a30033c99e?OpenDocument
2008-09-05 03:41:35,926 WARN parse.ParserFactory - org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:166)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:133)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:67)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:355)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - Caused by: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at java.security.AccessController.doPrivileged(Native Method)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - ... 4 more
2008-09-05 03:41:35,926 WARN parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:35,926 WARN parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptrix.nsf/AttachmentsByTitle/Diversity+dignity+at+work+booklet/$FILE/Dignity+at+work+booklet.pdf of type application/pdf
2008-09-05 03:41:35,928 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/communications/wtps1.nsf/$lookup/1D94AD9A45B463638025730100263FDF
2008-09-05 03:41:36,988 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/Barplus+Hints+and+Tips/$FILE/Barplus+Hints+and+Tips.pdf
2008-09-05 03:41:38,217 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/apteng.nsf/Content/Engineering+Home%5CDepartment+Information%5CEngineering+IT+Support+%26+Delivery+Homepage%5CEngineering+Solution+Group+%28ESG%29+Homepage%5CKey+user+Guides?OpenDocument
2008-09-05 03:41:41,143 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/Cultural+Awareness+Photo+Prize+Draw?OpenDocument
2008-09-05 03:41:42,278 WARN parse.ParserFactory - org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:166)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:133)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:67)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:355)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - Caused by: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at java.security.AccessController.doPrivileged(Native Method)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - ... 4 more
2008-09-05 03:41:42,279 WARN parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:42,279 WARN parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptflt.nsf/AttachmentsByTitle/Flight+Ops+News+Aug+2008/$FILE/FLIGHT+OPS_AUGUST_08+intranet+live.pdf of type application/pdf
2008-09-05 03:41:42,313 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptsal.nsf/Content/ctcBA+Home%5CBusTools%5CB%5Cbah%5CPromos+pckge%5CFlrda+08+EBO+WTP+upgde?OpenDocument
2008-09-05 03:41:42,342 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/PMA+EG904+timescales?OpenDocument
2008-09-05 03:41:52,279 WARN parse.ParserFactory - org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:166)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:133)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:67)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:355)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - Caused by: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at java.security.AccessController.doPrivileged(Native Method)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - ... 4 more
2008-09-05 03:41:52,279 WARN parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:52,279 WARN parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/Barplus+Hints+and+Tips/$FILE/Barplus+Hints+and+Tips.pdf of type application/pdf
2008-09-05 03:41:55,927 WARN mapred.LocalJobRunner - job_local_21
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for taskTracker/jobcache/job_local_21/job_local_21_map_0000/output/file.out
at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:313)
at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
at org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:61)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:982)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:698)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:220)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:157)
2008-09-05 09:32:46,906 INFO searcher.NutchBean - opening indexes in crawl/indexes
2008-09-05 09:32:47,002 INFO plugin.PluginRepository - Plugins: looking in: /ok/appl/nutch-2008-09-04_04-01-27/plugins
2008-09-05 09:32:47,305 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
2008-09-05 09:32:47,305 INFO plugin.PluginRepository - Registered Plugins:
2008-09-05 09:32:47,305 INFO plugin.PluginRepository - CyberNeko HTML Parser (lib-nekohtml)
2008-09-05 09:32:47,305 INFO plugin.PluginRepository - MSPowerPoint Parse Plug-in (parse-mspowerpoint)
2008-09-05 09:32:47,305 INFO plugin.PluginRepository - Site Query Filter (query-site)
2008-09-05 09:32:47,305 INFO plugin.PluginRepository - Http / Https Protocol Plug-in (protocol-httpclient)
2008-09-05 09:32:47,305 INFO plugin.PluginRepository - MSWord Parse Plug-in (parse-msword)
2008-09-05 09:32:47,305 INFO plugin.PluginRepository - Basic URL Normalizer (urlnormalizer-basic)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Pass-through URL Normalizer (urlnormalizer-pass)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Html Parse Plug-in (parse-html)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Regex URL Filter Framework (lib-regex-filter)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Basic Indexing Filter (index-basic)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Pdf Parse Plug-in (parse-pdf)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Basic Summarizer Plug-in (summary-basic)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - MSExcel Parse Plug-in (parse-msexcel)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Text Parse Plug-in (parse-text)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Jakarta POI - Java API To Access Microsoft Format Files (lib-jakarta-poi)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Regex URL Filter (urlfilter-regex)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Basic Query Filter (query-basic)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - HTTP Framework (lib-http)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - URL Query Filter (query-url)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Regex URL Normalizer (urlnormalizer-regex)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Parse MS Documents Framework (lib-parsems)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Zip Parse Plug-in (parse-zip)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - OPIC Scoring Plug-in (scoring-opic)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Registered Extension-Points:
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Nutch Summarizer (org.apache.nutch.searcher.Summarizer)
2008-09-05 09:32:47,307 INFO plugin.PluginRepository - Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2008-09-05 09:32:47,307 INFO plugin.PluginRepository - Nutch Protocol (org.apache.nutch.protocol.Protocol)
2008-09-05 09:32:47,307 INFO plugin.PluginRepository - Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2008-09-05 09:32:47,307 INFO plugin.PluginRepository - Nutch URL Filter (org.apache.nutch.net.URLFilter)
2008-09-05 09:32:47,307 INFO plugin.PluginRepository - HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2008-09-05 09:32:47,307 INFO plugin.PluginRepository - Nutch Online Search Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
2008-09-05 09:32:47,307 INFO plugin.PluginRepository - Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2008-09-05 09:32:47,307 INFO plugin.PluginRepository - Nutch Content Parser (org.apache.nutch.parse.Parser)
2008-09-05 09:32:47,307 INFO plugin.PluginRepository - Ontology Model Loader (org.apache.nutch.ontology.Ontology)
2008-09-05 09:32:47,307 INFO plugin.PluginRepository - Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2008-09-05 09:32:47,307 INFO plugin.PluginRepository - Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
2008-09-05 09:32:47,342 INFO searcher.NutchBean - opening segments in crawl/segments
2008-09-05 09:32:47,368 INFO searcher.SummarizerFactory - Using the first summarizer extension found: Basic Summarizer
2008-09-05 09:32:47,371 INFO searcher.NutchBean - opening linkdb in crawl/linkdb
2008-09-05 09:32:52,746 INFO searcher.NutchBean - opening indexes in crawl/indexes
2008-09-05 09:32:52,791 INFO plugin.PluginRepository - Plugins: looking in: /ok/appl/nutch-2008-09-04_04-01-27/plugins
2008-09-05 09:32:52,999 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
2008-09-05 09:32:52,999 INFO plugin.PluginRepository - Registered Plugins:
2008-09-05 09:32:52,999 INFO plugin.PluginRepository - CyberNeko HTML Parser (lib-nekohtml)
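Because a short tail hid the warnings in this log, a per-level summary can be a quicker first look at a 5MB file than reading the end of it. The following is a minimal sketch, assuming the line layout seen above (date, time, level, logger, then " - message"); the function name summarize_log is made up for illustration and is not part of Nutch or Hadoop.

```shell
#!/usr/bin/env bash
# summarize_log LOGFILE
# Counts WARN, ERROR and FATAL lines per logger, most frequent first.
# Assumes each log line looks like:
#   <date> <time> <LEVEL> <logger> - <message>
# Stack-trace lines that carry no timestamp are ignored because their
# third field is not a level name.
summarize_log() {
  awk '$3 == "WARN" || $3 == "ERROR" || $3 == "FATAL" { count[$3 " " $4]++ }
       END { for (k in count) print count[k], k }' "$1" | sort -rn
}
```

It could be called as summarize_log hadoop.log.2008-09-05. Note that stack-trace lines logged with their own timestamp, like the parse.ParserFactory ones above, are each counted as a separate WARN line, so the totals overstate the number of distinct failures.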
> Subject: Re: Job failed!
> From: zhengsj03@163.com
> To: nutch-user@lucene.apache.org
> Date: Fri, 5 Sep 2008 17:28:47 +0800
>
> Could you show the whole hadoop.log?
> On Fri, 2008-09-05 at 08:46 +0000, Edward Quick wrote:
> > Hi,
> >
> > I ran a crawl last night
> >
> > bin/nutch crawl urls -dir crawl -depth 10
> >
> > which collected 10612 pages, and then bailed out with the following error:
> >
> > Exception in thread "main" java.io.IOException: Job failed!
> > at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1062)
> > at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:552)
> > at org.apache.nutch.crawl.Crawl.main(Crawl.java:122)
> >
> > I checked there was enough space on the box, and there don't appear to be any errors in hadoop.log or the crawl output, so I'm stuck on what caused this.
> >
> > Also, is there a way to pick up the crawl from where it stopped rather than having to rerun it all over again?
> >
> > Thanks for any help.
> >
> > Ed.
FW: Job failed!
Posted by Edward Quick <ed...@hotmail.com>.
Hi,
I reran the fetch and got this error again after 5 hours. Any ideas what causes this?
2008-09-06 04:10:23,062 WARN mapred.LocalJobRunner - job_local_1
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/jobcache/job_local_1/job_local_1_map_0000/output/spill0.out in any of the configured local directories
at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathToRead(LocalDirAllocator.java:359)
at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathToRead(LocalDirAllocator.java:138)
at org.apache.hadoop.mapred.MapOutputFile.getSpillFile(MapOutputFile.java:94)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:972)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:698)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:220)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:157)
2008-09-06 04:10:23,860 FATAL fetcher.Fetcher - Fetcher: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1062)
at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:552)
at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:587)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:559)
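The DiskErrorException above comes from Hadoop's LocalDirAllocator, which looks for task files under the configured local directories. In Hadoop releases of this era those default, as far as I know, to a mapred/local directory under hadoop.tmp.dir (/tmp/hadoop-${user.name}), so free space elsewhere on the box does not rule out a problem there. A minimal sketch of a pre-flight check, assuming that default location; the function name check_local_dir and the threshold are hypothetical:

```shell
#!/usr/bin/env bash
# check_local_dir DIR MIN_FREE_KB
# Returns 0 when DIR exists, is writable, and its filesystem has at least
# MIN_FREE_KB kilobytes available. Otherwise prints the reason and returns 1.
check_local_dir() {
  local dir=$1 min_kb=$2 free_kb
  if [ ! -d "$dir" ]; then
    echo "missing: $dir"; return 1
  fi
  if [ ! -w "$dir" ]; then
    echo "not writable: $dir"; return 1
  fi
  # df -Pk prints a header plus one data line; column 4 is available KB.
  free_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
  case $free_kb in
    ''|*[!0-9]*) echo "cannot read free space for $dir"; return 1 ;;
  esac
  if [ "$free_kb" -lt "$min_kb" ]; then
    echo "low space: $dir has ${free_kb} KB free"; return 1
  fi
  echo "ok: $dir has ${free_kb} KB free"
}
```

For example, check_local_dir "/tmp/hadoop-$USER/mapred/local" 1048576 would ask for roughly 1 GB free before starting a long fetch. The path is an assumption; a site that overrides hadoop.tmp.dir or mapred.local.dir would pass its own value.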
From: edwardquick@hotmail.com
To: nutch-user@lucene.apache.org
Subject: FW: Job failed!
Date: Fri, 5 Sep 2008 21:58:13 +0000
Sorry for all these posts. I found the problem: I had a dodgy segment, probably the one left behind when the last fetch bombed out.
From: edwardquick@hotmail.com
To: nutch-user@lucene.apache.org
Subject: FW: Job failed!
Date: Fri, 5 Sep 2008 21:49:38 +0000
Struth! Here's another problem as well. I'm trying to merge the segments I've created so far:
$ nutch mergesegs crawl/mergesegs_dir -dir crawl/segments
Merging 5 segments to crawl/mergesegs_dir/20080905223155
SegmentMerger: adding file:/tmp/crawl/segments/20080905141605
SegmentMerger: adding file:/tmp/crawl/segments/20080905141522
SegmentMerger: adding file:/tmp/crawl/segments/20080905142231
SegmentMerger: adding file:/tmp/crawl/segments/20080905153116
SegmentMerger: adding file:/tmp/crawl/segments/20080905141348
SegmentMerger: using segment data from: crawl_generate
$ find crawl/mergesegs_dir
crawl/mergesegs_dir
crawl/mergesegs_dir/20080905223155
crawl/mergesegs_dir/20080905223155/crawl_generate
crawl/mergesegs_dir/20080905223155/crawl_generate/.part-00000.crc
crawl/mergesegs_dir/20080905223155/crawl_generate/part-00000
But when I run invertlinks, I get an error about a missing path:
$ mv crawl/segments crawl/BACKUPsegments
$ mv crawl/mergesegs_dir crawl/segments
$ nutch invertlinks crawl/linkdb -dir crawl/segments
LinkDb: starting
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/tmp/crawl/segments/20080905223155
LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path doesnt exist : file:/tmp/crawl/segments/20080905223155/parse_data
at org.apache.hadoop.mapred.FileInputFormat.validateInput(FileInputFormat.java:215)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:705)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:973)
at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:170)
at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:285)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:248)
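The merge output above reports only crawl_generate as segment data, and the merged segment contains only that part, so invertlinks has no parse_data to read. That is consistent with at least one source segment never having been fetched and parsed. A minimal sketch of a check to run before mergesegs, assuming the usual six parts of a fetched and parsed Nutch 1.x segment; the function names are hypothetical:

```shell
#!/usr/bin/env bash
# Parts a fully fetched and parsed Nutch 1.x segment normally contains.
SEGMENT_PARTS="content crawl_fetch crawl_generate crawl_parse parse_data parse_text"

# missing_parts SEGMENT_DIR
# Prints the expected sub-directories that SEGMENT_DIR lacks, separated by
# spaces (an empty line when the segment is complete).
missing_parts() {
  local seg=$1 part missing=""
  for part in $SEGMENT_PARTS; do
    [ -d "$seg/$part" ] || missing="$missing $part"
  done
  echo "${missing# }"
}

# incomplete_segments SEGMENTS_DIR
# Lists every segment under SEGMENTS_DIR that lacks at least one part.
incomplete_segments() {
  local seg miss
  for seg in "$1"/*/; do
    [ -d "$seg" ] || continue
    seg=${seg%/}
    miss=$(missing_parts "$seg")
    [ -z "$miss" ] || echo "$seg: missing $miss"
  done
}
```

Running incomplete_segments crawl/segments before merging would point at any segment left behind by a failed fetch, which could then be moved aside rather than merged.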
From: edwardquick@hotmail.com
To: nutch-user@lucene.apache.org
Subject: FW: Job failed!
Date: Fri, 5 Sep 2008 21:09:13 +0000
Sort of figured out how to kickstart the crawl again.
Basically did:
s1=$(ls -d crawl/segments/* | tail -1)
bin/nutch updatedb crawl/crawldb "$s1"
bin/nutch generate crawl/crawldb crawl/segments
s2=$(ls -d crawl/segments/* | tail -1)
bin/nutch fetch "$s2"
But unfortunately this is fetching the same urls as the previous fetch. :(
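One plausible reason for the repeated URLs, not confirmed anywhere in this thread, is that the newest segment was the one left by the failed fetch: without fetch output there is nothing for updatedb to record, so generate selects the same URLs again. The sketch below runs one round with that guard in place. It uses only the updatedb, generate and fetch commands already shown; the NUTCH variable and the function names are assumptions made for illustration.

```shell
#!/usr/bin/env bash
NUTCH=${NUTCH:-bin/nutch}   # nutch launcher; override to suit your install

# latest_segment SEGMENTS_DIR
# Prints the newest segment. Segment names are timestamps, so the last
# directory in sorted glob order is the most recent one.
latest_segment() {
  local seg last=""
  for seg in "$1"/*/; do
    [ -d "$seg" ] && last=${seg%/}
  done
  [ -n "$last" ] && echo "$last"
}

# next_round CRAWL_DIR
# Updates the crawldb from the newest segment only if that segment holds
# fetch output, then generates a new segment and fetches it.
next_round() {
  local crawl=$1 prev new
  prev=$(latest_segment "$crawl/segments") || return 1
  if [ -d "$prev/crawl_fetch" ]; then
    "$NUTCH" updatedb "$crawl/crawldb" "$prev" || return 1
  else
    echo "skipping updatedb: $prev has no crawl_fetch" >&2
  fi
  "$NUTCH" generate "$crawl/crawldb" "$crawl/segments" || return 1
  new=$(latest_segment "$crawl/segments") || return 1
  "$NUTCH" fetch "$new"
}
```

Calling next_round crawl from the Nutch directory runs one generate and fetch cycle. A segment that triggers the "skipping updatedb" message is a candidate for moving out of crawl/segments before any later merge.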
From: edwardquick@hotmail.com
To: nutch-user@lucene.apache.org
Subject: RE: Job failed!
Date: Fri, 5 Sep 2008 09:45:00 +0000
Initially I just did a tail -10, so I thought there were no errors, but there are actually a few. The PDF errors are my fault: I updated the PDF plugin with the latest PDFBox and FontBox jars from CVS on sf.net and left parse-pdf.jar out of the rebuild. I'm not sure that is why the job failed, though. The log is 5MB, so I can't attach it all here, but hopefully the last 200 lines give an indication.
By the way, is there a way to restart this crawl without crawling from the beginning again?
tail -200 hadoop.log.2008-09-05
2008-09-05 03:41:22,360 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:22,360 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:22,360 WARN parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:22,360 WARN parse.ParserFactory - ... 4 more
2008-09-05 03:41:22,360 WARN parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:22,360 WARN parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/Premium+Service+Training+insert/$FILE/Premium+training.pdf of type application/pdf
2008-09-05 03:41:22,362 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/CTP+-+Travel+Plan+Objectives?OpenDocument
2008-09-05 03:41:23,616 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptsal.nsf/Content/ctcBA+Home%5CBusTools%5CC%5Ccomp+tckts%5Ccr+comp+tickets?OpenDocument
2008-09-05 03:41:24,745 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/Notes+7+-+93+Rooms?OpenDocument
2008-09-05 03:41:26,033 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/SCCM+January+/$FILE/SCCMonthly+-+NovDec%2C+07+%2808+Jan%2C+08%29.pdf
2008-09-05 03:41:27,215 WARN parse.ParserFactory - org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:27,216 WARN parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:166)
2008-09-05 03:41:27,216 WARN parse.ParserFactory - at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:133)
2008-09-05 03:41:27,216 WARN parse.ParserFactory - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:67)
2008-09-05 03:41:27,216 WARN parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:355)
2008-09-05 03:41:27,216 WARN parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-09-05 03:41:27,216 WARN parse.ParserFactory - Caused by: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:27,216 WARN parse.ParserFactory - at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
2008-09-05 03:41:27,216 WARN parse.ParserFactory - at java.security.AccessController.doPrivileged(Native Method)
2008-09-05 03:41:27,216 WARN parse.ParserFactory - at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
2008-09-05 03:41:27,216 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:27,216 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:27,216 WARN parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:27,216 WARN parse.ParserFactory - ... 4 more
2008-09-05 03:41:27,216 WARN parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:27,216 WARN parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/SCCM+January+/$FILE/SCCMonthly+-+NovDec%2C+07+%2808+Jan%2C+08%29.pdf of type application/pdf
2008-09-05 03:41:27,216 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/bani.nsf/Content/XXXXLS%5FQ1Results%5F030807%5CXXXXLS%5FQ1Resultsvideo%5F030807?opendocument
2008-09-05 03:41:28,451 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptflt.nsf/AttachmentsByTitle/Flight+Ops+News+Aug+2008/$FILE/FLIGHT+OPS_AUGUST_08+intranet+live.pdf
2008-09-05 03:41:29,760 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/Virus+2+questions?OpenDocument
2008-09-05 03:41:30,789 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/Gender+Reass+the+process?OpenDocument
2008-09-05 03:41:32,066 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/LGW+Crew+Responsibilities/$FILE/Crew+Responsibilities.doc
2008-09-05 03:41:33,390 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptflt.nsf/Content/Flight+Ops+Home%5CBusiness+Tools%5CFlight+Technical+Services%5CAircraft+Weights+%26+Evaluation%5CFleet+Weights+-+Aircraft+Weighing+Schedules?OpenDocument
2008-09-05 03:41:34,562 WARN parse.ParserFactory - org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:34,562 WARN parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:166)
2008-09-05 03:41:34,562 WARN parse.ParserFactory - at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:133)
2008-09-05 03:41:34,562 WARN parse.ParserFactory - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:67)
2008-09-05 03:41:34,562 WARN parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:355)
2008-09-05 03:41:34,562 WARN parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-09-05 03:41:34,562 WARN parse.ParserFactory - Caused by: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:34,562 WARN parse.ParserFactory - at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
2008-09-05 03:41:34,562 WARN parse.ParserFactory - at java.security.AccessController.doPrivileged(Native Method)
2008-09-05 03:41:34,562 WARN parse.ParserFactory - at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
2008-09-05 03:41:34,562 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:34,562 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:34,563 WARN parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:34,563 WARN parse.ParserFactory - ... 4 more
2008-09-05 03:41:34,563 WARN parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:34,563 WARN parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptrix.nsf/AttachmentsByTitle/T5+Retail+-+T5+Ground+Level/$FILE/T5_Ground_Level.pdf of type application/pdf
2008-09-05 03:41:34,564 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/travel/stpg2.nsf/072561aa006322660725618c006b09a0/fc11f85e25deb736802574a30033c99e?OpenDocument
2008-09-05 03:41:35,926 WARN parse.ParserFactory - org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:166)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:133)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:67)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:355)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - Caused by: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at java.security.AccessController.doPrivileged(Native Method)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - ... 4 more
2008-09-05 03:41:35,926 WARN parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:35,926 WARN parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptrix.nsf/AttachmentsByTitle/Diversity+dignity+at+work+booklet/$FILE/Dignity+at+work+booklet.pdf of type application/pdf
2008-09-05 03:41:35,928 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/communications/wtps1.nsf/$lookup/1D94AD9A45B463638025730100263FDF
2008-09-05 03:41:36,988 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/Barplus+Hints+and+Tips/$FILE/Barplus+Hints+and+Tips.pdf
2008-09-05 03:41:38,217 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/apteng.nsf/Content/Engineering+Home%5CDepartment+Information%5CEngineering+IT+Support+%26+Delivery+Homepage%5CEngineering+Solution+Group+%28ESG%29+Homepage%5CKey+user+Guides?OpenDocument
2008-09-05 03:41:41,143 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/Cultural+Awareness+Photo+Prize+Draw?OpenDocument
2008-09-05 03:41:42,278 WARN parse.ParserFactory - org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:166)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:133)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:67)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:355)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - Caused by: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at java.security.AccessController.doPrivileged(Native Method)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - ... 4 more
2008-09-05 03:41:42,279 WARN parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:42,279 WARN parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptflt.nsf/AttachmentsByTitle/Flight+Ops+News+Aug+2008/$FILE/FLIGHT+OPS_AUGUST_08+intranet+live.pdf of type application/pdf
2008-09-05 03:41:42,313 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptsal.nsf/Content/ctcBA+Home%5CBusTools%5CB%5Cbah%5CPromos+pckge%5CFlrda+08+EBO+WTP+upgde?OpenDocument
2008-09-05 03:41:42,342 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/PMA+EG904+timescales?OpenDocument
2008-09-05 03:41:52,279 WARN parse.ParserFactory - org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:166)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:133)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:67)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:355)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - Caused by: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at java.security.AccessController.doPrivileged(Native Method)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - ... 4 more
2008-09-05 03:41:52,279 WARN parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:52,279 WARN parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/Barplus+Hints+and+Tips/$FILE/Barplus+Hints+and+Tips.pdf of type application/pdf
2008-09-05 03:41:55,927 WARN mapred.LocalJobRunner - job_local_21
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for taskTracker/jobcache/job_local_21/job_local_21_map_0000/output/file.out
at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:313)
at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
at org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:61)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:982)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:698)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:220)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:157)
2008-09-05 09:32:46,906 INFO searcher.NutchBean - opening indexes in crawl/indexes
2008-09-05 09:32:47,002 INFO plugin.PluginRepository - Plugins: looking in: /ok/appl/nutch-2008-09-04_04-01-27/plugins
2008-09-05 09:32:47,305 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
2008-09-05 09:32:47,305 INFO plugin.PluginRepository - Registered Plugins:
2008-09-05 09:32:47,305 INFO plugin.PluginRepository - CyberNeko HTML Parser (lib-nekohtml)
2008-09-05 09:32:47,305 INFO plugin.PluginRepository - MSPowerPoint Parse Plug-in (parse-mspowerpoint)
2008-09-05 09:32:47,305 INFO plugin.PluginRepository - Site Query Filter (query-site)
2008-09-05 09:32:47,305 INFO plugin.PluginRepository - Http / Https Protocol Plug-in (protocol-httpclient)
2008-09-05 09:32:47,305 INFO plugin.PluginRepository - MSWord Parse Plug-in (parse-msword)
2008-09-05 09:32:47,305 INFO plugin.PluginRepository - Basic URL Normalizer (urlnormalizer-basic)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Pass-through URL Normalizer (urlnormalizer-pass)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Html Parse Plug-in (parse-html)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Regex URL Filter Framework (lib-regex-filter)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Basic Indexing Filter (index-basic)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Pdf Parse Plug-in (parse-pdf)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Basic Summarizer Plug-in (summary-basic)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - MSExcel Parse Plug-in (parse-msexcel)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Text Parse Plug-in (parse-text)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Jakarta POI - Java API To Access Microsoft Format Files (lib-jakarta-poi)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Regex URL Filter (urlfilter-regex)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Basic Query Filter (query-basic)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - HTTP Framework (lib-http)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - URL Query Filter (query-url)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Regex URL Normalizer (urlnormalizer-regex)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Parse MS Documents Framework (lib-parsems)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Zip Parse Plug-in (parse-zip)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - OPIC Scoring Plug-in (scoring-opic)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Registered Extension-Points:
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Nutch Summarizer (org.apache.nutch.searcher.Summarizer)
2008-09-05 09:32:47,307 INFO plugin.PluginRepository - Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2008-09-05 09:32:47,307 INFO plugin.PluginRepository - Nutch Protocol (org.apache.nutch.protocol.Protocol)
2008-09-05 09:32:47,307 INFO plugin.PluginRepository - Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2008-09-05 09:32:47,307 INFO plugin.PluginRepository - Nutch URL Filter (org.apache.nutch.net.URLFilter)
2008-09-05 09:32:47,307 INFO plugin.PluginRepository - HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2008-09-05 09:32:47,307 INFO plugin.PluginRepository - Nutch Online Search Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
2008-09-05 09:32:47,307 INFO plugin.PluginRepository - Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2008-09-05 09:32:47,307 INFO plugin.PluginRepository - Nutch Content Parser (org.apache.nutch.parse.Parser)
2008-09-05 09:32:47,307 INFO plugin.PluginRepository - Ontology Model Loader (org.apache.nutch.ontology.Ontology)
2008-09-05 09:32:47,307 INFO plugin.PluginRepository - Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2008-09-05 09:32:47,307 INFO plugin.PluginRepository - Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
2008-09-05 09:32:47,342 INFO searcher.NutchBean - opening segments in crawl/segments
2008-09-05 09:32:47,368 INFO searcher.SummarizerFactory - Using the first summarizer extension found: Basic Summarizer
2008-09-05 09:32:47,371 INFO searcher.NutchBean - opening linkdb in crawl/linkdb
2008-09-05 09:32:52,746 INFO searcher.NutchBean - opening indexes in crawl/indexes
2008-09-05 09:32:52,791 INFO plugin.PluginRepository - Plugins: looking in: /ok/appl/nutch-2008-09-04_04-01-27/plugins
2008-09-05 09:32:52,999 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
2008-09-05 09:32:52,999 INFO plugin.PluginRepository - Registered Plugins:
2008-09-05 09:32:52,999 INFO plugin.PluginRepository - CyberNeko HTML Parser (lib-nekohtml)
> Subject: Re: Job failed!
> From: zhengsj03@163.com
> To: nutch-user@lucene.apache.org
> Date: Fri, 5 Sep 2008 17:28:47 +0800
>
> Could you show the whole hadoop.log?
> On Fri, 2008-09-05 at 08:46 +0000, Edward Quick wrote:
> > Hi,
> >
> > I ran a crawl last night
> >
> > bin/nutch crawl urls -dir crawl -depth 10
> >
> > which collected 10612 pages, and then bailed out with the following error:
> >
> > Exception in thread "main" java.io.IOException: Job failed!
> > at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1062)
> > at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:552)
> > at org.apache.nutch.crawl.Crawl.main(Crawl.java:122)
> >
> > I checked there was enough space on the box, and there don't appear to be any errors in hadoop.log or the crawl output, so I'm stuck on what caused this.
> >
> > Also, is there a way to pick up the crawl from where it stopped rather than having to rerun it all over again?
> >
> > Thanks for any help.
> >
> > Ed.
FW: Job failed!
Posted by Edward Quick <ed...@hotmail.com>.
For info only. I fixed this problem by removing the mapreduce directory in tmp before running another fetch.
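For anyone who wants to script that cleanup, here is a minimal sketch. The helper name clear_mapred_scratch is made up for illustration, and the directory is passed in explicitly because the exact path is not given above. In Hadoop releases of that period hadoop.tmp.dir usually defaulted to /tmp/hadoop-${user.name}, with the MapReduce scratch space beneath it, but check your own configuration before deleting anything.

```shell
# Hypothetical helper: remove a stale local MapReduce scratch directory.
# The path is an argument on purpose; the thread does not state it exactly.
clear_mapred_scratch() {
    dir="$1"
    # Refuse an empty argument or the filesystem root.
    if [ -z "$dir" ] || [ "$dir" = "/" ]; then
        echo "usage: clear_mapred_scratch DIR" >&2
        return 2
    fi
    if [ ! -d "$dir" ]; then
        echo "nothing to remove: $dir" >&2
        return 0
    fi
    # Print free space on that filesystem before deleting, as a sanity check.
    df -k "$dir"
    rm -rf -- "$dir"
}
```

It should only be run while no fetch or other job is active, for example: clear_mapred_scratch /tmp/hadoop-$USER/mapred (path assumed, not taken from the thread).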
From: edwardquick@hotmail.com
To: nutch-user@lucene.apache.org; nutch-dev@lucene.apache.org
Subject: FW: Job failed!
Date: Sat, 6 Sep 2008 07:10:11 +0000
Hi,
I reran the fetch and got this error again after 5 hours. Any ideas what causes this?
2008-09-06 04:10:23,062 WARN mapred.LocalJobRunner - job_local_1
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/jobcache/job_local_1/job_local_1_map_0000/output/spill0.out in any of the configured local directories
at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathToRead(LocalDirAllocator.java:359)
at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathToRead(LocalDirAllocator.java:138)
at org.apache.hadoop.mapred.MapOutputFile.getSpillFile(MapOutputFile.java:94)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:972)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:698)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:220)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:157)
2008-09-06 04:10:23,860 FATAL fetcher.Fetcher - Fetcher: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1062)
at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:552)
at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:587)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:559)
From: edwardquick@hotmail.com
To: nutch-user@lucene.apache.org
Subject: FW: Job failed!
Date: Fri, 5 Sep 2008 21:58:13 +0000
Sorry for all these posts. I found the problem. Had a dodgy segment, probably the one which was left after the last fetch bombed out.
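A rough way to spot a segment like that is to list the segments that lack the usual subdirectories. This is only a sketch: the function name is made up, and the list of parts assumes a segment that was fetched with parsing enabled. The invertlinks error quoted further down in this thread is the same symptom, a segment with crawl_generate but no parse_data.

```shell
# Parts a fetched-and-parsed segment is assumed to contain (assumption, adjust as needed).
SEGMENT_PARTS="content crawl_fetch crawl_generate crawl_parse parse_data parse_text"

# Hypothetical helper: print every segment under $1 that is missing a part.
list_incomplete_segments() {
    segments_dir="$1"
    for seg in "$segments_dir"/*; do
        [ -d "$seg" ] || continue
        missing=""
        for part in $SEGMENT_PARTS; do
            [ -d "$seg/$part" ] || missing="$missing $part"
        done
        if [ -n "$missing" ]; then
            echo "$seg: missing$missing"
        fi
    done
}
```

Running list_incomplete_segments crawl/segments before mergesegs or invertlinks shows which segments to move aside or refetch.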
From: edwardquick@hotmail.com
To: nutch-user@lucene.apache.org
Subject: FW: Job failed!
Date: Fri, 5 Sep 2008 21:49:38 +0000
Struth! Here's another problem as well. I'm trying to merge the segments I've created so far:
$ nutch mergesegs crawl/mergesegs_dir -dir crawl/segments
Merging 5 segments to crawl/mergesegs_dir/20080905223155
SegmentMerger: adding file:/tmp/crawl/segments/20080905141605
SegmentMerger: adding file:/tmp/crawl/segments/20080905141522
SegmentMerger: adding file:/tmp/crawl/segments/20080905142231
SegmentMerger: adding file:/tmp/crawl/segments/20080905153116
SegmentMerger: adding file:/tmp/crawl/segments/20080905141348
SegmentMerger: using segment data from: crawl_generate
$ find crawl/mergesegs_dir
crawl/mergesegs_dir
crawl/mergesegs_dir/20080905223155
crawl/mergesegs_dir/20080905223155/crawl_generate
crawl/mergesegs_dir/20080905223155/crawl_generate/.part-00000.crc
crawl/mergesegs_dir/20080905223155/crawl_generate/part-00000
But when I run invertlinks, I get an error about a missing path:
$ mv crawl/segments crawl/BACKUPsegments
$ mv crawl/mergesegs_dir crawl/segments
$ nutch invertlinks crawl/linkdb -dir crawl/segments
LinkDb: starting
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/tmp/crawl/segments/20080905223155
LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path doesnt exist : file:/tmp/crawl/segments/20080905223155/parse_data
at org.apache.hadoop.mapred.FileInputFormat.validateInput(FileInputFormat.java:215)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:705)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:973)
at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:170)
at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:285)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:248)
From: edwardquick@hotmail.com
To: nutch-user@lucene.apache.org
Subject: FW: Job failed!
Date: Fri, 5 Sep 2008 21:09:13 +0000
Sort of figured out how to kickstart the crawl again.
Basically did:
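# newest existing segment (the one from the interrupted fetch)
s1=$(ls -d crawl/segments/* | tail -1)
bin/nutch updatedb crawl/crawldb "$s1"
bin/nutch generate crawl/crawldb crawl/segments
# newest segment again, i.e. the one just generated
s2=$(ls -d crawl/segments/* | tail -1)
bin/nutch fetch "$s2"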
But unfortunately this is fetching the same urls as the previous fetch. :(
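One possible reason, offered as a guess rather than a diagnosis: if the failed fetch never wrote its output into the segment, updatedb has nothing to record, so the next generate picks the same URLs again. In that case the refetch is the resume. The sketch below wraps one generate, fetch, updatedb round so the crawldb is always updated from the segment that was just fetched. The NUTCH variable and the function names are invented for this example; only the three nutch subcommands come from the commands above.

```shell
# Command used to invoke Nutch; override NUTCH if bin/nutch lives elsewhere.
NUTCH="${NUTCH:-bin/nutch}"

# Segment names are timestamps, so the last one in sorted order is the newest.
newest_segment() {
    ls -d "$1"/* | tail -1
}

# One crawl round: generate a fetch list, fetch it, then fold the results
# back into the crawldb so the next generate selects different URLs.
crawl_round() {
    crawldb="$1"
    segments="$2"
    "$NUTCH" generate "$crawldb" "$segments" || return 1
    seg=$(newest_segment "$segments")
    "$NUTCH" fetch "$seg" || return 1
    "$NUTCH" updatedb "$crawldb" "$seg"
}
```

Calling crawl_round crawl/crawldb crawl/segments once per remaining depth level continues the crawl without starting over.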
From: edwardquick@hotmail.com
To: nutch-user@lucene.apache.org
Subject: RE: Job failed!
Date: Fri, 5 Sep 2008 09:45:00 +0000
Initially I just did a tail -10, so I thought there were no errors, but there are actually a few. The pdf errors are my fault: I updated the pdf plugin with the latest PDFBox and FontBox jars from cvs on sf.net and missed out parse-pdf.jar on the rebuild. I'm not sure that's the reason the job failed, though. The log is 5MB, so I can't really attach it all here, but hopefully the last 200 lines give an indication.
By the way, is there a way to kickstart this crawl off again without crawling from the start again?
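Since the parser warnings drown out everything else, a filter along these lines makes the remaining problems easier to find than a plain tail. It is a sketch only: the function name is made up, and the excluded logger names are simply the two that flood this particular log.

```shell
# Hypothetical helper: list WARN/ERROR/FATAL lines with their line numbers,
# leaving out the parse.ParserFactory and parse.ParseUtil noise.
show_other_problems() {
    grep -n -E ' (WARN|ERROR|FATAL) ' "$1" \
        | grep -v -E ' parse\.(ParserFactory|ParseUtil) - '
}
```

For example, show_other_problems hadoop.log.2008-09-05 prints the line numbers of what is left, so the stack trace that follows each hit can be read in context.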
tail -200 hadoop.log.2008-09-05
2008-09-05 03:41:22,360 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:22,360 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:22,360 WARN parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:22,360 WARN parse.ParserFactory - ... 4 more
2008-09-05 03:41:22,360 WARN parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:22,360 WARN parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/Premium+Service+Training+insert/$FILE/Premium+training.pdf of type application/pdf
2008-09-05 03:41:22,362 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/CTP+-+Travel+Plan+Objectives?OpenDocument
2008-09-05 03:41:23,616 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptsal.nsf/Content/ctcBA+Home%5CBusTools%5CC%5Ccomp+tckts%5Ccr+comp+tickets?OpenDocument
2008-09-05 03:41:24,745 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/Notes+7+-+93+Rooms?OpenDocument
2008-09-05 03:41:26,033 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/SCCM+January+/$FILE/SCCMonthly+-+NovDec%2C+07+%2808+Jan%2C+08%29.pdf
2008-09-05 03:41:27,215 WARN parse.ParserFactory - org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:27,216 WARN parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:166)
2008-09-05 03:41:27,216 WARN parse.ParserFactory - at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:133)
2008-09-05 03:41:27,216 WARN parse.ParserFactory - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:67)
2008-09-05 03:41:27,216 WARN parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:355)
2008-09-05 03:41:27,216 WARN parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-09-05 03:41:27,216 WARN parse.ParserFactory - Caused by: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:27,216 WARN parse.ParserFactory - at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
2008-09-05 03:41:27,216 WARN parse.ParserFactory - at java.security.AccessController.doPrivileged(Native Method)
2008-09-05 03:41:27,216 WARN parse.ParserFactory - at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
2008-09-05 03:41:27,216 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:27,216 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:27,216 WARN parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:27,216 WARN parse.ParserFactory - ... 4 more
2008-09-05 03:41:27,216 WARN parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:27,216 WARN parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/SCCM+January+/$FILE/SCCMonthly+-+NovDec%2C+07+%2808+Jan%2C+08%29.pdf of type application/pdf
2008-09-05 03:41:27,216 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/bani.nsf/Content/XXXXLS%5FQ1Results%5F030807%5CXXXXLS%5FQ1Resultsvideo%5F030807?opendocument
2008-09-05 03:41:28,451 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptflt.nsf/AttachmentsByTitle/Flight+Ops+News+Aug+2008/$FILE/FLIGHT+OPS_AUGUST_08+intranet+live.pdf
2008-09-05 03:41:29,760 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/Virus+2+questions?OpenDocument
2008-09-05 03:41:30,789 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/Gender+Reass+the+process?OpenDocument
2008-09-05 03:41:32,066 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/LGW+Crew+Responsibilities/$FILE/Crew+Responsibilities.doc
2008-09-05 03:41:33,390 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptflt.nsf/Content/Flight+Ops+Home%5CBusiness+Tools%5CFlight+Technical+Services%5CAircraft+Weights+%26+Evaluation%5CFleet+Weights+-+Aircraft+Weighing+Schedules?OpenDocument
2008-09-05 03:41:34,562 WARN parse.ParserFactory - org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:34,562 WARN parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:166)
2008-09-05 03:41:34,562 WARN parse.ParserFactory - at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:133)
2008-09-05 03:41:34,562 WARN parse.ParserFactory - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:67)
2008-09-05 03:41:34,562 WARN parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:355)
2008-09-05 03:41:34,562 WARN parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-09-05 03:41:34,562 WARN parse.ParserFactory - Caused by: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:34,562 WARN parse.ParserFactory - at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
2008-09-05 03:41:34,562 WARN parse.ParserFactory - at java.security.AccessController.doPrivileged(Native Method)
2008-09-05 03:41:34,562 WARN parse.ParserFactory - at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
2008-09-05 03:41:34,562 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:34,562 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:34,563 WARN parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:34,563 WARN parse.ParserFactory - ... 4 more
2008-09-05 03:41:34,563 WARN parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:34,563 WARN parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptrix.nsf/AttachmentsByTitle/T5+Retail+-+T5+Ground+Level/$FILE/T5_Ground_Level.pdf of type application/pdf
2008-09-05 03:41:34,564 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/travel/stpg2.nsf/072561aa006322660725618c006b09a0/fc11f85e25deb736802574a30033c99e?OpenDocument
2008-09-05 03:41:35,926 WARN parse.ParserFactory - org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:166)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:133)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:67)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:355)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - Caused by: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at java.security.AccessController.doPrivileged(Native Method)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:35,926 WARN parse.ParserFactory - ... 4 more
2008-09-05 03:41:35,926 WARN parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:35,926 WARN parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptrix.nsf/AttachmentsByTitle/Diversity+dignity+at+work+booklet/$FILE/Dignity+at+work+booklet.pdf of type application/pdf
2008-09-05 03:41:35,928 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/communications/wtps1.nsf/$lookup/1D94AD9A45B463638025730100263FDF
2008-09-05 03:41:36,988 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/Barplus+Hints+and+Tips/$FILE/Barplus+Hints+and+Tips.pdf
2008-09-05 03:41:38,217 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/apteng.nsf/Content/Engineering+Home%5CDepartment+Information%5CEngineering+IT+Support+%26+Delivery+Homepage%5CEngineering+Solution+Group+%28ESG%29+Homepage%5CKey+user+Guides?OpenDocument
2008-09-05 03:41:41,143 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/Cultural+Awareness+Photo+Prize+Draw?OpenDocument
2008-09-05 03:41:42,278 WARN parse.ParserFactory - org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:166)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:133)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:67)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:355)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - Caused by: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at java.security.AccessController.doPrivileged(Native Method)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:42,279 WARN parse.ParserFactory - ... 4 more
2008-09-05 03:41:42,279 WARN parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:42,279 WARN parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptflt.nsf/AttachmentsByTitle/Flight+Ops+News+Aug+2008/$FILE/FLIGHT+OPS_AUGUST_08+intranet+live.pdf of type application/pdf
2008-09-05 03:41:42,313 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptsal.nsf/Content/ctcBA+Home%5CBusTools%5CB%5Cbah%5CPromos+pckge%5CFlrda+08+EBO+WTP+upgde?OpenDocument
2008-09-05 03:41:42,342 INFO fetcher.Fetcher - fetching http://planetba.baplc.com/general/aptrix/aptrix.nsf/Content/PMA+EG904+timescales?OpenDocument
2008-09-05 03:41:52,279 WARN parse.ParserFactory - org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:166)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:133)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:67)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:355)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - Caused by: java.lang.ClassNotFoundException: org.apache.nutch.parse.pdf.PdfParser
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at java.security.AccessController.doPrivileged(Native Method)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
2008-09-05 03:41:52,279 WARN parse.ParserFactory - ... 4 more
2008-09-05 03:41:52,279 WARN parse.ParserFactory - ParserFactory:PluginRuntimeException when initializing parser plugin parse-pdf instance in getParsers function: attempting to continue instantiating parsers
2008-09-05 03:41:52,279 WARN parse.ParseUtil - Unable to successfully parse content http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/Barplus+Hints+and+Tips/$FILE/Barplus+Hints+and+Tips.pdf of type application/pdf
2008-09-05 03:41:55,927 WARN mapred.LocalJobRunner - job_local_21
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for taskTracker/jobcache/job_local_21/job_local_21_map_0000/output/file.out
at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:313)
at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
at org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:61)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:982)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:698)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:220)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:157)
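
The trace above comes out of LocalDirAllocator.getLocalPathForWrite, which as far as I understand gives up when none of the configured local directories is usable (missing, not writable, or short of space). Below is a minimal sketch in Python for checking one candidate directory by hand. The function name check_local_dir, the 1 GB threshold, and the /tmp/hadoop-<user> location are assumptions made for illustration; they are not taken from this crawl's configuration.

```python
import os
import shutil
import tempfile


def check_local_dir(path, min_free_bytes=1 << 30):
    """Return a list of reasons why `path` may be unusable as a scratch directory.

    An empty list means no problem was detected by these simple checks.
    """
    problems = []
    if not os.path.isdir(path):
        # Nothing else can be checked if the directory is missing.
        problems.append("not a directory")
        return problems
    if not os.access(path, os.W_OK | os.X_OK):
        problems.append("not writable")
    free = shutil.disk_usage(path).free
    if free < min_free_bytes:
        problems.append(f"only {free} bytes free")
    return problems


if __name__ == "__main__":
    # Assumed location: <system temp dir>/hadoop-<user>. Adjust to whatever
    # hadoop.tmp.dir / mapred.local.dir point to in your own setup.
    user = os.environ.get("USER", "unknown")
    candidate = os.path.join(tempfile.gettempdir(), "hadoop-" + user)
    print(candidate, check_local_dir(candidate) or "looks usable")
```

This only inspects the directory at the moment it is run, so a disk that filled up during the fetch and was cleaned up afterwards would still look fine.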
2008-09-05 09:32:46,906 INFO searcher.NutchBean - opening indexes in crawl/indexes
2008-09-05 09:32:47,002 INFO plugin.PluginRepository - Plugins: looking in: /ok/appl/nutch-2008-09-04_04-01-27/plugins
2008-09-05 09:32:47,305 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
2008-09-05 09:32:47,305 INFO plugin.PluginRepository - Registered Plugins:
2008-09-05 09:32:47,305 INFO plugin.PluginRepository - CyberNeko HTML Parser (lib-nekohtml)
2008-09-05 09:32:47,305 INFO plugin.PluginRepository - MSPowerPoint Parse Plug-in (parse-mspowerpoint)
2008-09-05 09:32:47,305 INFO plugin.PluginRepository - Site Query Filter (query-site)
2008-09-05 09:32:47,305 INFO plugin.PluginRepository - Http / Https Protocol Plug-in (protocol-httpclient)
2008-09-05 09:32:47,305 INFO plugin.PluginRepository - MSWord Parse Plug-in (parse-msword)
2008-09-05 09:32:47,305 INFO plugin.PluginRepository - Basic URL Normalizer (urlnormalizer-basic)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Pass-through URL Normalizer (urlnormalizer-pass)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Html Parse Plug-in (parse-html)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Regex URL Filter Framework (lib-regex-filter)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Basic Indexing Filter (index-basic)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Pdf Parse Plug-in (parse-pdf)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Basic Summarizer Plug-in (summary-basic)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - MSExcel Parse Plug-in (parse-msexcel)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Text Parse Plug-in (parse-text)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Jakarta POI - Java API To Access Microsoft Format Files (lib-jakarta-poi)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Regex URL Filter (urlfilter-regex)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Basic Query Filter (query-basic)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - HTTP Framework (lib-http)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - URL Query Filter (query-url)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Regex URL Normalizer (urlnormalizer-regex)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Parse MS Documents Framework (lib-parsems)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Zip Parse Plug-in (parse-zip)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - OPIC Scoring Plug-in (scoring-opic)
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Registered Extension-Points:
2008-09-05 09:32:47,306 INFO plugin.PluginRepository - Nutch Summarizer (org.apache.nutch.searcher.Summarizer)
2008-09-05 09:32:47,307 INFO plugin.PluginRepository - Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2008-09-05 09:32:47,307 INFO plugin.PluginRepository - Nutch Protocol (org.apache.nutch.protocol.Protocol)
2008-09-05 09:32:47,307 INFO plugin.PluginRepository - Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2008-09-05 09:32:47,307 INFO plugin.PluginRepository - Nutch URL Filter (org.apache.nutch.net.URLFilter)
2008-09-05 09:32:47,307 INFO plugin.PluginRepository - HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2008-09-05 09:32:47,307 INFO plugin.PluginRepository - Nutch Online Search Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
2008-09-05 09:32:47,307 INFO plugin.PluginRepository - Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2008-09-05 09:32:47,307 INFO plugin.PluginRepository - Nutch Content Parser (org.apache.nutch.parse.Parser)
2008-09-05 09:32:47,307 INFO plugin.PluginRepository - Ontology Model Loader (org.apache.nutch.ontology.Ontology)
2008-09-05 09:32:47,307 INFO plugin.PluginRepository - Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2008-09-05 09:32:47,307 INFO plugin.PluginRepository - Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
2008-09-05 09:32:47,342 INFO searcher.NutchBean - opening segments in crawl/segments
2008-09-05 09:32:47,368 INFO searcher.SummarizerFactory - Using the first summarizer extension found: Basic Summarizer
2008-09-05 09:32:47,371 INFO searcher.NutchBean - opening linkdb in crawl/linkdb
2008-09-05 09:32:52,746 INFO searcher.NutchBean - opening indexes in crawl/indexes
2008-09-05 09:32:52,791 INFO plugin.PluginRepository - Plugins: looking in: /ok/appl/nutch-2008-09-04_04-01-27/plugins
2008-09-05 09:32:52,999 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
2008-09-05 09:32:52,999 INFO plugin.PluginRepository - Registered Plugins:
2008-09-05 09:32:52,999 INFO plugin.PluginRepository - CyberNeko HTML Parser (lib-nekohtml)
> Subject: Re: Job failed!
> From: zhengsj03@163.com
> To: nutch-user@lucene.apache.org
> Date: Fri, 5 Sep 2008 17:28:47 +0800
>
> Could you show the whole hadoop.log?
> On Fri, 2008-09-05 at 08:46 +0000, Edward Quick wrote:
> > Hi,
> >
> > I ran a crawl last night
> >
> > bin/nutch crawl urls -dir crawl -depth 10
> >
> > which collected 10612 pages, and then bailed out with the following error:
> >
> > Exception in thread "main" java.io.IOException: Job failed!
> > at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1062)
> > at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:552)
> > at org.apache.nutch.crawl.Crawl.main(Crawl.java:122)
> >
> > I checked there was enough space on the box, and there don't appear to be any errors in hadoop.log or the crawl output, so I'm stuck on what caused this.
> >
> > Also, is there a way to pick up the crawl from where it stopped rather than having to rerun it all over again?
> >
> > Thanks for any help.
> >
> > Ed.
> >
> >
> >
>
>
Re: Job failed!
Posted by zhengsj03 <zh...@163.com>.
Could you show the whole hadoop.log?
On Fri, 2008-09-05 at 08:46 +0000, Edward Quick wrote:
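
If the full log is too large to post, one option is to filter it down first. The following is a minimal sketch in Python, assuming the log4j line layout seen in this thread ("date time,millis LEVEL logger - message"); the function name find_problems and the default list of levels are illustrative choices, not part of Nutch or Hadoop. It keeps WARN/ERROR/FATAL lines together with the untimestamped stack-trace lines that follow them.

```python
import re

# Layout assumed from the log lines quoted in this thread:
# "2008-09-05 03:41:55,927 WARN mapred.LocalJobRunner - job_local_21"
LINE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) (\w+)\s+(\S+) - (.*)$"
)


def find_problems(lines, levels=("WARN", "ERROR", "FATAL")):
    """Yield flagged log lines plus the continuation lines that follow them.

    A continuation line is any non-blank line without a timestamp, such as
    the body of a stack trace; it is kept only if the most recent
    timestamped line had one of the requested levels.
    """
    keep = False
    for raw in lines:
        line = raw.rstrip("\n")
        match = LINE.match(line)
        if match:
            keep = match.group(2) in levels
        if keep and line.strip():
            yield line
```

For example, `for line in find_problems(open("logs/hadoop.log")): print(line)` would print only the flagged entries, where logs/hadoop.log is an assumed path relative to the Nutch install directory.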
> Hi,
>
> I ran a crawl last night
>
> bin/nutch crawl urls -dir crawl -depth 10
>
> which collected 10612 pages, and then bailed out with the following error:
>
> Exception in thread "main" java.io.IOException: Job failed!
> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1062)
> at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:552)
> at org.apache.nutch.crawl.Crawl.main(Crawl.java:122)
>
> I checked there was enough space on the box, and there don't appear to be any errors in hadoop.log or the crawl output, so I'm stuck on what caused this.
>
> Also, is there a way to pick up the crawl from where it stopped rather than having to rerun it all over again?
>
> Thanks for any help.
>
> Ed.
>
>
>