Posted to user@nutch.apache.org by Fred Zimmerman <wf...@nimblebooks.com> on 2011/09/22 20:00:02 UTC
not writing anything to crawldb
I had to delete the contents of the crawldb folder to recover from a failed
fetch (was this the best response? I doubt it). Now I have a fetch running,
successfully, but I don't see any evidence that it is writing anything to
crawldb. Is it going to write all the crawldb stuff at the end, or should I
go ahead and kill the crawl now?
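For context, in a stock Nutch 1.x setup the crawldb is only touched by the updatedb step, which runs after fetch and parse finish, so seeing no crawldb writes mid-fetch is normal. A sketch of one cycle follows; the `crawl/` layout and segment timestamp are illustrative, and `run` is a dry-run stub standing in for a real invocation:

```shell
# One Nutch 1.x generate/fetch/parse/updatedb cycle. The crawldb is only
# rewritten by the final updatedb step, after the whole fetch completes.
run() { echo "+ $*"; }   # dry-run stub; drop it to invoke the real commands
run bin/nutch generate crawl/crawldb crawl/segments
run bin/nutch fetch crawl/segments/20110922120000     # illustrative segment name
run bin/nutch parse crawl/segments/20110922120000
run bin/nutch updatedb crawl/crawldb crawl/segments/20110922120000
```

Killing the fetch before updatedb runs therefore loses the pass entirely, since none of its results have reached the crawldb yet.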
Re: not writing anything to crawldb
Posted by Fred Zimmerman <wf...@nimblebooks.com>.
OK, I found that the crawl is writing the crawldb under my home directory
instead of under the crawl directory, presumably because I launched it from
the wrong place, and presumably I will be able to index it into Solr from
its current location. So, good news!
Thanks
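The mixup above is a working-directory issue: the crawl command resolves a relative path like `crawldb` against wherever it was launched. A tiny demonstration (the temp directory stands in for whatever directory the job was accidentally started from):

```shell
# Relative paths resolve against the current working directory, so launching
# the crawl from $HOME makes "crawldb" mean "$HOME/crawldb".
workdir=$(mktemp -d)     # stand-in for the accidental launch directory
cd "$workdir"
mkdir -p crawldb
echo "relative 'crawldb' resolves to: $(pwd)/crawldb"
```

Passing absolute paths to the crawl command sidesteps the surprise regardless of where it is launched from.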
On Thu, Sep 22, 2011 at 14:03, Markus Jelsma <ma...@openindex.io> wrote:

> That is not necessary. At most you would delete the failed segment, or
> delete all segment dirs except crawl_generate (or was it fetch_generate?)
> so you can restart the fetch from the beginning.
>
> What do you use? The crawl command? I don't see any evidence of you
> updating the DB ;). Anyway, never kill a running job unless you really
> have to. It cannot be resumed.
>
> > I had to delete the contents of the crawldb folder to recover from a
> > failed fetch (was this the best response? I doubt it). Now I have a
> > fetch running, successfully, but I don't see any evidence that it is
> > writing anything to crawldb. Is it going to write all the crawldb stuff
> > at the end, or should I go ahead and kill the crawl now?
Re: not writing anything to crawldb
Posted by Markus Jelsma <ma...@openindex.io>.
Read about OPIC scoring. It can be confusing indeed. I would not recommend
using OPIC for incremental crawls where you refetch pages over time.
> Ha! But out of curiosity, why is the average score so low out of 1.0? That
> seems pretty darned weak, whatever it is.
>
> TOTAL urls: 1241
> retry 0: 1241
> min score: 0.0
> avg score: 0.0049016923
> max score: 1.0
> status 1 (db_unfetched): 1001
> status 2 (db_fetched): 224
> status 3 (db_gone): 15
> status 5 (db_redir_perm):
>
> On Thu, Sep 22, 2011 at 14:03, Markus Jelsma <ma...@openindex.io> wrote:
> > That is not necessary. At most you would delete the failed segment, or
> > delete all segment dirs except crawl_generate (or was it fetch_generate?)
> > so you can restart the fetch from the beginning.
> >
> > What do you use? The crawl command? I don't see any evidence of you
> > updating the DB ;). Anyway, never kill a running job unless you really
> > have to. It cannot be resumed.
> >
> > > I had to delete the contents of the crawldb folder to recover from a
> > > failed fetch (was this the best response? I doubt it). Now I have a
> > > fetch running, successfully, but I don't see any evidence that it is
> > > writing anything to crawldb. Is it going to write all the crawldb stuff
> > > at the end, or should I go ahead and kill the crawl now?
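A back-of-the-envelope reading of those stats, assuming OPIC-style scoring where a page's score ("cash") is split among its outlinks rather than created anew: the total score mass stays small while the URL count grows, so the average is tiny even though the max is 1.0. A quick check with awk:

```shell
# avg score times URL count recovers the total score mass in the crawldb,
# using the figures quoted in the stats above.
avg=0.0049016923
urls=1241
awk -v a="$avg" -v n="$urls" \
  'BEGIN { printf "total score mass: %.2f across %d URLs\n", a * n, n }'
```

Roughly 6 units of score spread over 1241 URLs, most of them still unfetched, is consistent with a low average rather than with anything being "weak".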
Re: not writing anything to crawldb
Posted by Fred Zimmerman <wf...@nimblebooks.com>.
Ha! but out of curiosity, why is the average score so low out of 1.0? that
seems pretty darned weak, whatever it is.
TOTAL urls: 1241
retry 0: 1241
min score: 0.0
avg score: 0.0049016923
max score: 1.0
status 1 (db_unfetched): 1001
status 2 (db_fetched): 224
status 3 (db_gone): 15
status 5 (db_redir_perm):
On Thu, Sep 22, 2011 at 14:03, Markus Jelsma <ma...@openindex.io> wrote:

> That is not necessary. At most you would delete the failed segment, or
> delete all segment dirs except crawl_generate (or was it fetch_generate?)
> so you can restart the fetch from the beginning.
>
> What do you use? The crawl command? I don't see any evidence of you
> updating the DB ;). Anyway, never kill a running job unless you really
> have to. It cannot be resumed.
>
> > I had to delete the contents of the crawldb folder to recover from a
> > failed fetch (was this the best response? I doubt it). Now I have a
> > fetch running, successfully, but I don't see any evidence that it is
> > writing anything to crawldb. Is it going to write all the crawldb stuff
> > at the end, or should I go ahead and kill the crawl now?
Re: not writing anything to crawldb
Posted by Markus Jelsma <ma...@openindex.io>.
That is not necessary. At most you would delete the failed segment, or delete
all segment dirs except crawl_generate (or was it fetch_generate?) so you can
restart the fetch from the beginning.
What do you use? The crawl command? I don't see any evidence of you updating
the DB ;). Anyway, never kill a running job unless you really have to. It
cannot be resumed.
> I had to delete the contents of the crawldb folder to recover from a
> failed fetch (was this the best response? I doubt it). Now I have a fetch
> running, successfully, but I don't see any evidence that it is writing
> anything to crawldb. Is it going to write all the crawldb stuff at the
> end, or should I go ahead and kill the crawl now?
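Markus's suggestion (delete the failed segment's outputs but keep crawl_generate, so the fetch list survives and the fetch can be restarted) can be sketched like this; the directory layout and timestamp are made up for the example:

```shell
# Throwaway stand-in for crawl/segments/<timestamp> holding a partial fetch:
seg=$(mktemp -d)/20110922120000
mkdir -p "$seg"/crawl_generate "$seg"/crawl_fetch "$seg"/content
# Remove everything except crawl_generate; the fetch list it holds lets
# the fetch be restarted from the beginning:
for d in "$seg"/*; do
  [ "$(basename "$d")" = crawl_generate ] || rm -r "$d"
done
ls "$seg"   # only crawl_generate remains
```

This is gentler than wiping the crawldb: the crawldb's history of fetched and scheduled URLs is untouched, and only the half-finished segment output is discarded.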