Posted to user@nutch.apache.org by BlackIce <bl...@gmail.com> on 2014/05/04 13:46:41 UTC

Nutch 1.8 CrawlDb update error

I now get this error when doing crawls of 120k each run:


2014-05-04 11:56:44,549 INFO  crawl.CrawlDb - CrawlDb update: starting at 2014-05-04 11:56:44
2014-05-04 11:56:44,549 INFO  crawl.CrawlDb - CrawlDb update: db: TestCrawl/crawldb
2014-05-04 11:56:44,549 INFO  crawl.CrawlDb - CrawlDb update: segments: [TestCrawl/segments/20140504110143]
2014-05-04 11:56:44,550 INFO  crawl.CrawlDb - CrawlDb update: additions allowed: true
2014-05-04 11:56:44,550 INFO  crawl.CrawlDb - CrawlDb update: URL normalizing: false
2014-05-04 11:56:44,550 INFO  crawl.CrawlDb - CrawlDb update: URL filtering: false
2014-05-04 11:56:44,550 INFO  crawl.CrawlDb - CrawlDb update: 404 purging: false
2014-05-04 11:56:44,550 INFO  crawl.CrawlDb - CrawlDb update: Merging segment data into db.
2014-05-04 11:57:49,615 ERROR mapred.MapTask - IO error in map input file file:/home/hduser/nutch-1.8/runtime/local/TestCrawl/segments/20140504110143/crawl_parse/part-00000
2014-05-04 11:58:36,732 WARN  mapred.LocalJobRunner - job_local385844795_0001
java.lang.Exception: java.io.IOException: IO error in map input file file:/home/hduser/nutch-1.8/runtime/local/TestCrawl/segments/20140504110143/crawl_parse/part-00000
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:354)
Caused by: java.io.IOException: IO error in map input file file:/home/hduser/nutch-1.8/runtime/local/TestCrawl/segments/20140504110143/crawl_parse/part-00000
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:236)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:210)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:366)
    at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:223)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.fs.ChecksumException: Checksum error: file:/home/hduser/nutch-1.8/runtime/local/TestCrawl/segments/20140504110143/crawl_parse/part-00000 at 55756800
    at org.apache.hadoop.fs.FSInputChecker.verifySum(FSInputChecker.java:277)
    at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:241)
    at org.apache.hadoop.fs.FSInputChecker.fill(FSInputChecker.java:176)
    at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:193)
    at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:158)
    at java.io.DataInputStream.readFully(DataInputStream.java:195)
    at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
    at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
    at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1992)
    at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2124)
    at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:76)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:230)
    ... 10 more
2014-05-04 11:58:36,797 ERROR crawl.CrawlDb - CrawlDb update: java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
    at org.apache.nutch.crawl.CrawlDb.update(CrawlDb.java:105)
    at org.apache.nutch.crawl.CrawlDb.run(CrawlDb.java:207)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.CrawlDb.main(CrawlDb.java:166)
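
If refetching the whole segment is not an option, the records before the corrupt chunk can sometimes be salvaged: read the damaged SequenceFile with checksum verification disabled and copy records until the first read failure. A minimal sketch, assuming the Hadoop 1.x APIs bundled with Nutch 1.8 (the class name and the salvage approach itself are illustrative, not something Nutch ships):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocalFileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class SalvageSequenceFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    LocalFileSystem fs = FileSystem.getLocal(conf);
    fs.setVerifyChecksum(false);   // read past the chunk whose .crc no longer matches

    Path in = new Path(args[0]);   // damaged part file, e.g. crawl_parse/part-00000
    Path out = new Path(args[1]);  // salvaged copy

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, in, conf);
    Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
    Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, out,
        reader.getKeyClass(), reader.getValueClass());

    long copied = 0;
    try {
      // next() returns false at a clean EOF and throws once it hits
      // a record that can no longer be deserialized
      while (reader.next(key, value)) {
        writer.append(key, value);
        copied++;
      }
    } catch (Exception e) {
      System.err.println("stopped after " + copied + " records: " + e);
    } finally {
      writer.close();
      reader.close();
    }
    System.out.println("salvaged " + copied + " records into " + out);
  }
}

With the log above this should stop somewhere near byte 55756800. The salvaged copy gets a fresh, consistent .crc written on create and can replace the broken part file before rerunning updatedb; records read with verification off may still contain garbage, so spot-check the output.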

Re: Nutch 1.8 CrawlDb update error

Posted by Bayu Widyasanyata <bw...@gmail.com>.
I also experienced the same thing (checksum error) :(
I couldn't avoid deleting the segment and refetching it...

Deleting the .crc files, or other files inside the segment, didn't help much.
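
That matches how local checksumming works: LocalFileSystem is a ChecksumFileSystem, so each data file is paired with a hidden .crc sidecar, and removing the sidecar only disables verification; the corrupt bytes in the data file remain and can still break record deserialization later. A small illustration, assuming the Hadoop 1.x API (the class name is made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocalFileSystem;
import org.apache.hadoop.fs.Path;

public class ShowCrcPairing {
  public static void main(String[] args) throws Exception {
    LocalFileSystem fs = FileSystem.getLocal(new Configuration());
    Path data = new Path(args[0]);  // e.g. .../crawl_parse/part-00000
    // LocalFileSystem extends ChecksumFileSystem, which exposes the pairing:
    System.out.println("data file:     " + data);
    System.out.println("checksum file: " + fs.getChecksumFile(data));
    System.out.println("bytes per CRC: " + fs.getBytesPerSum());
  }
}

For part-00000 this prints the sidecar path .part-00000.crc and the checksum chunk size, 512 bytes by default (io.bytes.per.checksum).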

Thanks.-


On Tue, May 6, 2014 at 2:55 AM, Sebastian Nagel <wa...@googlemail.com> wrote:

> > Caused by: org.apache.hadoop.fs.ChecksumException: Checksum error:
> > file:/home/hduser/nutch-1.8/runtime/local/TestCrawl/segments/20140504110143/crawl_parse/part-00000
>
> It may be caused by a broken disk or memory.
>
> Sebastian
>
> On 05/04/2014 01:46 PM, BlackIce wrote:
> > [original log and stack trace snipped; see the first message in this thread]


-- 
wassalam,
[bayu]

Re: Nutch 1.8 CrawlDb update error

Posted by Sebastian Nagel <wa...@googlemail.com>.
> Caused by: org.apache.hadoop.fs.ChecksumException: Checksum error:
> file:/home/hduser/nutch-1.8/runtime/local/TestCrawl/segments/20140504110143/crawl_parse/part-00000

It may be caused by a broken disk or memory.
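
A rough way to test that hypothesis: read the suspect part file twice with plain Java I/O and compare digests. Differing digests point at a flaky disk or RAM; identical digests mean the bad bytes are already on disk. A minimal sketch (note the second pass may be served from the OS page cache, so drop caches or reboot between runs for a stronger signal):

import java.io.FileInputStream;
import java.io.InputStream;
import java.security.MessageDigest;

public class DoubleReadCheck {
  // hash the whole file so two passes can be compared byte-for-byte
  static byte[] digest(String path) throws Exception {
    MessageDigest md = MessageDigest.getInstance("MD5");
    try (InputStream in = new FileInputStream(path)) {
      byte[] buf = new byte[1 << 20];
      for (int n; (n = in.read(buf)) > 0; ) {
        md.update(buf, 0, n);
      }
    }
    return md.digest();
  }

  public static void main(String[] args) throws Exception {
    byte[] first = digest(args[0]);
    byte[] second = digest(args[0]);
    System.out.println(MessageDigest.isEqual(first, second)
        ? "reads agree: the bad bytes are stable on disk"
        : "reads differ: suspect the disk or memory");
  }
}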

Sebastian

On 05/04/2014 01:46 PM, BlackIce wrote:
> [original log and stack trace snipped; see the first message in this thread]