Posted to user@nutch.apache.org by Daqing Zhao <de...@gmail.com> on 2005/12/06 14:54:08 UTC

try to restart aborted crawl

Hi All,

Happy Holidays!!!

I am a new user of Nutch and need some help restarting a failed
crawl without losing the work already done. I have attached the error
messages.

Thanks,

Daqing

---------------------------------------------------------------------

The initial crawl stopped with this error, without hanging:

$ tail crawl_dir.log
051204 170050 Processing pagesByURL: Sorted 59853.59626727705 instructions/second
Exception in thread "main" java.io.IOException: key out of order:
http://web.mit.edu/is/about/index.html after http://web.mit.edu/is/?ut/index.html
        at org.apache.nutch.io.MapFile$Writer.checkKey(MapFile.java:134)
        at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:120)
        at org.apache.nutch.db.WebDBWriter$PagesByURLProcessor.mergeEdits(WebDBWriter.java:736)
        at org.apache.nutch.db.WebDBWriter$CloseProcessor.closeDown(WebDBWriter.java:557)
        at org.apache.nutch.db.WebDBWriter.close(WebDBWriter.java:1544)
        at org.apache.nutch.tools.UpdateDatabaseTool.close(UpdateDatabaseTool.java:321)
        at org.apache.nutch.tools.UpdateDatabaseTool.main(UpdateDatabaseTool.java:371)
        at org.apache.nutch.tools.CrawlTool.main(CrawlTool.java:141)


I tried restarting the crawl, following the FAQ at
http://wiki.apache.org/nutch/FAQ:

% touch /index/segments/2005somesegment/fetcher.done

% bin/nutch updatedb /index/db/ /index/segments/2005somesegment/

% bin/nutch generate /index/db/ /index/segments/2005somesegment/

% bin/nutch fetch /index/segments/2005somesegment


At the fetch command, I got the following error:

$ ./bin/nutch fetch ./crawl.dir/segments/20051204030125/
051206 052124 parsing file:/F:/nutch/nutch-0.7.1/conf/nutch-default.xml
051206 052125 parsing file:/F:/nutch/nutch-0.7.1/conf/nutch-site.xml
051206 052125 No FS indicated, using default:local
Exception in thread "main" java.io.IOException: already exists:
.\crawl.dir\segments\20051204030125\fetcher
        at org.apache.nutch.io.MapFile$Writer.<init>(MapFile.java:86)
        at org.apache.nutch.io.MapFile$Writer.<init>(MapFile.java:75)
        at org.apache.nutch.io.ArrayFile$Writer.<init>(ArrayFile.java:34)
        at org.apache.nutch.fetcher.Fetcher.<init>(Fetcher.java:301)
        at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:475)

Re: try to restart aborted crawl

Posted by Daqing Zhao <de...@gmail.com>.
Thanks, Stefan. It works now.

I assume this will fetch the latest segment. When I used the ./bin/nutch crawl
command with a depth of 10, it also generated all the segments to fetch from
the urls list. I am guessing that now I need to bootstrap myself using generate
or something after this segment is done. Is that right?
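For reference, the per-depth loop in question can be sketched as follows. This is a dry-run illustration only: NUTCH is set to echo so the commands are printed rather than executed, and the paths are assumptions taken from this thread.

```shell
# Dry-run sketch of the generate/fetch/updatedb loop that
# "bin/nutch crawl -depth N" drives internally in 0.7.x. Paths are
# assumptions from this thread; set NUTCH=bin/nutch for real use.
NUTCH="echo bin/nutch"
DB=crawl.dir/db
SEGMENTS=crawl.dir/segments
mkdir -p "$DB" "$SEGMENTS"

for depth in 1 2 3; do                  # one pass per remaining level
  $NUTCH generate "$DB" "$SEGMENTS"     # writes a new timestamped segment
  SEGMENT=$(ls -d "$SEGMENTS"/* 2>/dev/null | sort | tail -n 1)
  $NUTCH fetch "$SEGMENT"               # fetch the newest segment
  $NUTCH updatedb "$DB" "$SEGMENT"      # fold results back into the db
done
```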

Thanks,

Daqing



On 12/6/05, Stefan Groschupf <sg...@media-style.com> wrote:
>
> You cannot continue a failed fetch; just restart it.
> Delete everything in the failed segment "2005XXXXXXXX" folder
> except the fetchlist folder.
> Then start your fetch again, using the step-by-step commands.
>
>
> On 06.12.2005, at 14:54, Daqing Zhao wrote:
>
> > [...]

Re: try to restart aborted crawl

Posted by Stefan Groschupf <sg...@media-style.com>.
You cannot continue a failed fetch; just restart it.
Delete everything in the failed segment "2005XXXXXXXX" folder
except the fetchlist folder.
Then start your fetch again, using the step-by-step commands.
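The cleanup can be sketched like this. It is a hedged illustration: the segment name comes from the error output earlier in the thread, and the first two commands mock the segment layout so the deletion step can be seen in isolation; on a real crawl directory you would skip them and run the bin/nutch commands at the end.

```shell
# Sketch of the recovery above, using a mock segment tree so the
# cleanup step is visible in isolation. The segment name comes from
# the error output in this thread; adjust it to your own.
SEGMENT=crawl.dir/segments/20051204030125

# Mock layout (skip these two lines on a real crawl directory):
mkdir -p "$SEGMENT/fetchlist" "$SEGMENT/fetcher" "$SEGMENT/content"
touch "$SEGMENT/fetcher.done"

# Delete everything in the segment except the fetchlist folder; the
# half-written fetcher output is what causes "already exists" later.
find "$SEGMENT" -mindepth 1 -maxdepth 1 ! -name fetchlist -exec rm -rf {} +

ls "$SEGMENT"    # only fetchlist remains

# Then rerun the step-by-step commands on the real segment:
# bin/nutch fetch "$SEGMENT"
# bin/nutch updatedb crawl.dir/db "$SEGMENT"
```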


On 06.12.2005, at 14:54, Daqing Zhao wrote:

> [...]