Posted to user@nutch.apache.org by Fred Zimmerman <wf...@nimblebooks.com> on 2011/09/30 15:23:34 UTC

Interpreting Nutch results

What does this mean? Why is db_unfetched so high?

I want to know how I can be confident that the crawler has fetched all the
pages in the target site.

CrawlDb statistics start: crawl-20110930124111/crawldb
Statistics for CrawlDb: crawl-20110930124111/crawldb
TOTAL urls:     1237
retry 0:        1236
retry 1:        1
min score:      0.0
avg score:      0.005751819
max score:      1.0
status 1 (db_unfetched):        1040
status 2 (db_fetched):  179
status 3 (db_gone):     15
status 5 (db_redir_perm):       3
CrawlDb statistics: done

Re: Interpreting Nutch results

Posted by Fred Zimmerman <wf...@nimblebooks.com>.
thanks for the tip about filtering


Re: Interpreting Nutch results

Posted by lewis john mcgibbney <le...@gmail.com>.
What type of filtering is going on in your configuration?

It might be best to run readdb incrementally on smaller test fetches to make
sure you're fetching everything you want to.
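
One way to compare readdb output across test fetches is to compute the share
of URLs still in db_unfetched after each round. The sketch below is only an
illustration: it assumes the statistics text has the same layout as the output
quoted in this thread (for example from "bin/nutch readdb <crawldb> -stats"),
and the helper name parse_crawldb_stats is hypothetical, not part of Nutch.

```python
import re

# Statistics text copied from the original post in this thread.
SAMPLE = """\
CrawlDb statistics start: crawl-20110930124111/crawldb
Statistics for CrawlDb: crawl-20110930124111/crawldb
TOTAL urls:     1237
retry 0:        1236
retry 1:        1
min score:      0.0
avg score:      0.005751819
max score:      1.0
status 1 (db_unfetched):        1040
status 2 (db_fetched):  179
status 3 (db_gone):     15
status 5 (db_redir_perm):       3
CrawlDb statistics: done
"""


def parse_crawldb_stats(text):
    """Hypothetical helper: pull the total and per-status counts out of the text."""
    total = None
    status = {}
    for line in text.splitlines():
        m = re.match(r"\s*TOTAL urls:\s*(\d+)", line)
        if m:
            total = int(m.group(1))
            continue
        m = re.match(r"\s*status \d+ \((\w+)\):\s*(\d+)", line)
        if m:
            status[m.group(1)] = int(m.group(2))
    return {"total": total, "status": status}


stats = parse_crawldb_stats(SAMPLE)
unfetched_share = stats["status"]["db_unfetched"] / stats["total"]
print(f"unfetched: {unfetched_share:.1%} of {stats['total']} known URLs")
```

If that share keeps shrinking on each additional fetch round, the crawl is
probably still working through discovered links. If it stays flat, the URL
filters and crawl depth are the first things to check.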



-- 
*Lewis*