Posted to user@nutch.apache.org by Fred Zimmerman <wf...@nimblebooks.com> on 2011/09/22 20:00:02 UTC

not writing anything to crawldb

I had to delete the contents of the crawldb folder to recover from a failed
fetch (was this the best response? I doubt it). Now I have a fetch running
successfully, but I don't see any evidence that it is writing anything to
crawldb. Is it going to write all the crawldb stuff at the end, or should I
go ahead and kill the crawl now?

Re: not writing anything to crawldb

Posted by Fred Zimmerman <wf...@nimblebooks.com>.
OK, I found that the crawl is writing the crawldb to my home directory
instead of the crawldb folder, presumably because I ran it from the wrong
place, and presumably I will be able to index this in Solr from the current
location. So, good news! Thanks.
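
For the record, here is what I believe the fix looks like; this is only a
sketch, and the URL dir, crawl dir, and Solr URL are illustrative. With the
Nutch 1.x crawl command, passing an absolute path to -dir keeps the crawldb
out of whatever directory you happened to launch from, and solrindex (the
1.3-era positional form, if I have it right) can index from wherever the
data actually ended up:

# run from $NUTCH_HOME; crawldb, linkdb, and segments all go under the
# directory given to -dir, not the current working directory
bin/nutch crawl urls -dir /data/crawl -depth 3 -topN 1000

# index the existing crawl data into Solr from its current location
bin/nutch solrindex http://localhost:8983/solr/ /data/crawl/crawldb \
  /data/crawl/linkdb /data/crawl/segments/*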




On Thu, Sep 22, 2011 at 14:03, Markus Jelsma <ma...@openindex.io> wrote:

> That is not necessary. At most you would delete the failed segment, or
> delete all segment dirs except crawl_generate (or was it fetch_generate?)
> so you can restart the fetch from the beginning.
>
> What do you use? The crawl command? I don't see any evidence of you
> updating the DB ;). Anyway, never kill a running job unless you really
> have to. It cannot be resumed.
>
> > I had to delete the contents of the crawldb folder to recover from a
> > failed fetch (was this the best response? I doubt it). Now I have a
> > fetch running successfully, but I don't see any evidence that it is
> > writing anything to crawldb. Is it going to write all the crawldb stuff
> > at the end, or should I go ahead and kill the crawl now?

Re: not writing anything to crawldb

Posted by Markus Jelsma <ma...@openindex.io>.
Read up on OPIC scoring; it can indeed be confusing. I would not recommend
using OPIC for incremental crawls where you refetch pages over time.
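
Concretely, and assuming plain OPIC semantics: a page distributes its score
("cash") to its outlinks rather than minting new score, so the total score
mass stays roughly fixed while the number of known URLs grows. A quick
back-of-the-envelope check against the readdb output quoted below:

# average score times URL count recovers the score mass held in the db
echo '0.0049016923 * 1241' | bc -l
# 6.0830001443

With 1001 of the 1241 URLs still unfetched and carrying near-zero scores, a
sub-0.01 average is expected rather than a sign that something is wrong.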

> Ha! But out of curiosity, why is the average score so low out of 1.0? That
> seems pretty darned weak, whatever it is.
> 
> 
> TOTAL urls:     1241
> retry 0:        1241
> min score:      0.0
> avg score:      0.0049016923
> max score:      1.0
> status 1 (db_unfetched):        1001
> status 2 (db_fetched):  224
> status 3 (db_gone):     15
> status 5 (db_redir_perm):
> 
> On Thu, Sep 22, 2011 at 14:03, Markus Jelsma <ma...@openindex.io> wrote:
> > That is not necessary. At most you would delete the failed segment, or
> > delete all segment dirs except crawl_generate (or was it
> > fetch_generate?) so you can restart the fetch from the beginning.
> >
> > What do you use? The crawl command? I don't see any evidence of you
> > updating the DB ;). Anyway, never kill a running job unless you really
> > have to. It cannot be resumed.
> >
> > > I had to delete the contents of the crawldb folder to recover from a
> > > failed fetch (was this the best response? I doubt it). Now I have a
> > > fetch running successfully, but I don't see any evidence that it is
> > > writing anything to crawldb. Is it going to write all the crawldb
> > > stuff at the end, or should I go ahead and kill the crawl now?

Re: not writing anything to crawldb

Posted by Fred Zimmerman <wf...@nimblebooks.com>.
Ha! But out of curiosity, why is the average score so low out of 1.0? That
seems pretty darned weak, whatever it is.


TOTAL urls:     1241
retry 0:        1241
min score:      0.0
avg score:      0.0049016923
max score:      1.0
status 1 (db_unfetched):        1001
status 2 (db_fetched):  224
status 3 (db_gone):     15
status 5 (db_redir_perm):


On Thu, Sep 22, 2011 at 14:03, Markus Jelsma <ma...@openindex.io> wrote:

> That is not necessary. At most you would delete the failed segment, or
> delete all segment dirs except crawl_generate (or was it fetch_generate?)
> so you can restart the fetch from the beginning.
>
> What do you use? The crawl command? I don't see any evidence of you
> updating the DB ;). Anyway, never kill a running job unless you really
> have to. It cannot be resumed.
>
> > I had to delete the contents of the crawldb folder to recover from a
> > failed fetch (was this the best response? I doubt it). Now I have a
> > fetch running successfully, but I don't see any evidence that it is
> > writing anything to crawldb. Is it going to write all the crawldb stuff
> > at the end, or should I go ahead and kill the crawl now?

Re: not writing anything to crawldb

Posted by Markus Jelsma <ma...@openindex.io>.
That is not necessary. At most you would delete the failed segment, or
delete all segment dirs except crawl_generate (or was it fetch_generate?)
so you can restart the fetch from the beginning.

What do you use? The crawl command? I don't see any evidence of you
updating the DB ;). Anyway, never kill a running job unless you really have
to. It cannot be resumed.
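
A sketch of that recovery path, with an illustrative segment name; only
crawl_generate is needed to redo the fetch:

# a fully fetched and parsed segment looks like this
ls crawl/segments/20110922120000
# content  crawl_fetch  crawl_generate  crawl_parse  parse_data  parse_text

# drop everything except crawl_generate, then fetch the same segment again
cd crawl/segments/20110922120000
rm -r content crawl_fetch crawl_parse parse_data parse_text
cd -
bin/nutch fetch crawl/segments/20110922120000

# the crawldb is only written when updatedb runs against the segment,
# which is why a fetch by itself shows no crawldb activity
bin/nutch updatedb crawl/crawldb crawl/segments/20110922120000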

> I had to delete the contents of the crawldb folder to recover from a
> failed fetch (was this the best response? I doubt it). Now I have a fetch
> running successfully, but I don't see any evidence that it is writing
> anything to crawldb. Is it going to write all the crawldb stuff at the
> end, or should I go ahead and kill the crawl now?