Posted to user@nutch.apache.org by BlackIce <bl...@gmail.com> on 2014/05/04 13:46:41 UTC
Nutch 1.8 CrawlDb update error
I get this error now when doing crawls at 120k each run:
2014-05-04 11:56:44,549 INFO crawl.CrawlDb - CrawlDb update: starting at 2014-05-04 11:56:44
2014-05-04 11:56:44,549 INFO crawl.CrawlDb - CrawlDb update: db: TestCrawl/crawldb
2014-05-04 11:56:44,549 INFO crawl.CrawlDb - CrawlDb update: segments: [TestCrawl/segments/20140504110143]
2014-05-04 11:56:44,550 INFO crawl.CrawlDb - CrawlDb update: additions allowed: true
2014-05-04 11:56:44,550 INFO crawl.CrawlDb - CrawlDb update: URL normalizing: false
2014-05-04 11:56:44,550 INFO crawl.CrawlDb - CrawlDb update: URL filtering: false
2014-05-04 11:56:44,550 INFO crawl.CrawlDb - CrawlDb update: 404 purging: false
2014-05-04 11:56:44,550 INFO crawl.CrawlDb - CrawlDb update: Merging segment data into db.
2014-05-04 11:57:49,615 ERROR mapred.MapTask - IO error in map input file file:/home/hduser/nutch-1.8/runtime/local/TestCrawl/segments/20140504110143/crawl_parse/part-00000
2014-05-04 11:58:36,732 WARN mapred.LocalJobRunner - job_local385844795_0001
java.lang.Exception: java.io.IOException: IO error in map input file file:/home/hduser/nutch-1.8/runtime/local/TestCrawl/segments/20140504110143/crawl_parse/part-00000
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:354)
Caused by: java.io.IOException: IO error in map input file file:/home/hduser/nutch-1.8/runtime/local/TestCrawl/segments/20140504110143/crawl_parse/part-00000
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:236)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:210)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:366)
    at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:223)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.fs.ChecksumException: Checksum error: file:/home/hduser/nutch-1.8/runtime/local/TestCrawl/segments/20140504110143/crawl_parse/part-00000 at 55756800
    at org.apache.hadoop.fs.FSInputChecker.verifySum(FSInputChecker.java:277)
    at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:241)
    at org.apache.hadoop.fs.FSInputChecker.fill(FSInputChecker.java:176)
    at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:193)
    at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:158)
    at java.io.DataInputStream.readFully(DataInputStream.java:195)
    at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
    at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
    at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1992)
    at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2124)
    at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:76)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:230)
    ... 10 more
2014-05-04 11:58:36,797 ERROR crawl.CrawlDb - CrawlDb update: java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
    at org.apache.nutch.crawl.CrawlDb.update(CrawlDb.java:105)
    at org.apache.nutch.crawl.CrawlDb.run(CrawlDb.java:207)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.CrawlDb.main(CrawlDb.java:166)
Re: Nutch 1.8 CrawlDb update error
Posted by Bayu Widyasanyata <bw...@gmail.com>.
I also experienced the same thing [checksum error] :(
I couldn't avoid deleting the segment and refetching it...
Deleting the .crc files, or other files inside the segment, didn't help much.
Thanks.-
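[For background, a sketch from the editor, not from the thread: Hadoop's LocalFileSystem keeps a hidden `.crc` sidecar file holding one CRC32 per fixed-size chunk of data (`io.bytes.per.checksum`, 512 bytes by default) and verifies each chunk on read. The toy Python imitation below, with hypothetical file names and a simplified text sidecar, illustrates why deleting the `.crc` file "didn't help much": it only silences the verification, while the payload itself is still corrupt, so a downstream reader such as the SequenceFile reader can still fail.]

```python
import os
import zlib

CHUNK = 512  # mirrors Hadoop's io.bytes.per.checksum default

def write_with_sidecar(path: str, data: bytes) -> None:
    """Write data plus a toy '.crc' sidecar of per-chunk CRC32 values."""
    with open(path, "wb") as f:
        f.write(data)
    crcs = [zlib.crc32(data[i:i + CHUNK]) for i in range(0, len(data), CHUNK)]
    with open(path + ".crc", "w") as f:
        f.write(",".join(map(str, crcs)))

def read_verified(path: str) -> bytes:
    """Read data back; verify against the sidecar only if it exists."""
    with open(path, "rb") as f:
        data = f.read()
    if os.path.exists(path + ".crc"):
        with open(path + ".crc") as f:
            crcs = [int(c) for c in f.read().split(",")]
        for n, i in enumerate(range(0, len(data), CHUNK)):
            if zlib.crc32(data[i:i + CHUNK]) != crcs[n]:
                raise IOError("Checksum error: %s at %d" % (path, i))
    return data

write_with_sidecar("part-00000", b"x" * 2048)

# Simulate on-disk corruption of one byte in the second chunk.
with open("part-00000", "r+b") as f:
    f.seek(600)
    f.write(b"y")

try:
    read_verified("part-00000")
except IOError as e:
    print(e)  # Checksum error: part-00000 at 512

os.remove("part-00000.crc")          # "deleting the .crc file"
blob = read_verified("part-00000")   # no error raised now...
print(blob == b"x" * 2048)           # ...but the data is still corrupt: False
```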
On Tue, May 6, 2014 at 2:55 AM, Sebastian Nagel <wa...@googlemail.com> wrote:
> > Caused by: org.apache.hadoop.fs.ChecksumException: Checksum error:
> > file:/home/hduser/nutch-1.8/runtime/local/TestCrawl/segments/20140504110143/crawl_parse/part-00000
>
> It may be caused by a broken disk or memory.
>
> Sebastian
>
> On 05/04/2014 01:46 PM, BlackIce wrote:
> > [...]
>
--
wassalam,
[bayu]
Re: Nutch 1.8 CrawlDb update error
Posted by Sebastian Nagel <wa...@googlemail.com>.
> Caused by: org.apache.hadoop.fs.ChecksumException: Checksum error:
> file:/home/hduser/nutch-1.8/runtime/local/TestCrawl/segments/20140504110143/crawl_parse/part-00000
It may be caused by a broken disk or memory.
Sebastian
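[A side note on the "at 55756800" in the ChecksumException: that offset is where the first failing checksum chunk starts, and it is a multiple of the 512-byte chunk size, which can help judge how much of the file was read cleanly before the corruption. A rough Python sketch of that bookkeeping follows; `first_bad_offset` is a hypothetical helper, not Hadoop's actual FSInputChecker.]

```python
import zlib

BYTES_PER_CHECKSUM = 512  # Hadoop's io.bytes.per.checksum default

def first_bad_offset(data: bytes, expected_crcs: list) -> int:
    """Offset of the first chunk whose CRC32 mismatches, or -1 if clean."""
    for n, crc in enumerate(expected_crcs):
        start = n * BYTES_PER_CHECKSUM
        if zlib.crc32(data[start:start + BYTES_PER_CHECKSUM]) != crc:
            return start
    return -1

good = bytes(range(256)) * 8  # 2048 bytes -> 4 chunks
crcs = [zlib.crc32(good[i:i + BYTES_PER_CHECKSUM])
        for i in range(0, len(good), BYTES_PER_CHECKSUM)]

bad = bytearray(good)
bad[1500] ^= 0x01  # flip one bit inside the third chunk

print(first_bad_offset(good, crcs))        # -1 (clean file)
print(first_bad_offset(bytes(bad), crcs))  # 1024 (third chunk starts here)
```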
On 05/04/2014 01:46 PM, BlackIce wrote:
> [...]