Posted to user@nutch.apache.org by webdev1977 <we...@gmail.com> on 2012/03/23 14:46:20 UTC

db_unfetched large number, but crawling not fetching any longer

I was under the impression that setting topN for crawl cycles would limit the
number of items each iteration of the crawl would fetch/parse, but that after
continuously running crawl cycles it would eventually get ALL the urls.
However, my continuous crawl has stopped fetching/parsing, and the stats from
the crawldb indicate that db_unfetched is 133,359.

Why is it no longer fetching urls if there are so many unfetched?
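
For reference, the crawldb stats mentioned above typically come from the
CrawlDb reader; the crawl/crawldb path below is an assumption:

  bin/nutch readdb crawl/crawldb -stats

This prints the per-status counts (including db_unfetched and db_fetched) and
the total number of urls in the crawldb.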


Re: db_unfetched large number, but crawling not fetching any longer

Posted by remi tassing <ta...@gmail.com>.
I'm not sure I totally understand what you meant.

1. If you know exactly what the relative urls are being translated into, you
can use the URL normalizer (urlnormalizer-regex, configured in
conf/regex-normalize.xml) to rewrite them into something that makes more 'sense'.
2. If you don't want those relative links to be included at all, you can use
the regex URL filter (urlfilter-regex, configured in conf/regex-urlfilter.txt)
to block Nutch from crawling them; a sketch follows below.
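
A minimal sketch of such a regex-urlfilter.txt rule, assuming the broken urls
all look like the 'myapp?id=...' example quoted below; the exact pattern is
hypothetical and needs adjusting to your urls:

  # regex-urlfilter.txt: the first matching rule decides; '-' rejects, '+' accepts
  # reject urls where a relative filename was glued onto the ?id=... query string
  -myapp\?id=[0-9]+[A-Za-z0-9_]+\.(htm|html|pdf)$
  # accept everything else
  +.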

Would that help?

Remi

On Tue, Mar 27, 2012 at 2:12 AM, webdev1977 <we...@gmail.com> wrote:

> I think I may have figured it out.. but I don't know how to fix it :-(
>
> I have many pdfs and html files that have relative links in them.  They are
> not from the originally hosted site, but are re-hosted.  Nutch/Tika is
> prepending the url of the page that contained the link to each relative url
> it encounters.
>
> So if the first page was: http://mysite.com/web/myapp?id=12345
> and that is an html file with this:
>
> <a href="link_to_new_place.htm">mylink</a>
>
> It is doing this:
>
> http://mysite.com/web/myapp?id=12345link_to_new_place.htm.
>
> It is getting into the crawldb this way, but it is obviously not a valid url.
> So my crawldb looks like it has 1,000,000 records, even though there should
> only be about 300,000.
>
> Is there any way to stop this behavior?
>

Re: db_unfetched large number, but crawling not fetching any longer

Posted by webdev1977 <we...@gmail.com>.
I think I may have figured it out.. but I don't know how to fix it :-(

I have many pdfs and html files that have relative links in them.  They are
not from the originally hosted site, but are re-hosted.  Nutch/Tika is
prepending the url of the page that contained the link to each relative url
it encounters.

So if the first page was: http://mysite.com/web/myapp?id=12345
and that is an html file with this:

<a href="link_to_new_place.htm">mylink</a>

It is doing this:

http://mysite.com/web/myapp?id=12345link_to_new_place.htm. 

It is getting into the crawldb this way, but it is obviously not a valid url.
So my crawldb looks like it has 1,000,000 records, even though there should
only be about 300,000.

Is there any way to stop this behavior?
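
For illustration, urls like that could also be rewritten (rather than dropped)
before they reach the crawldb, using the urlnormalizer-regex plugin and a rule
in conf/regex-normalize.xml; the pattern below is only a hypothetical sketch
keyed to the example above:

  <?xml version="1.0"?>
  <regex-normalize>
    <!-- strip a relative filename glued onto the ?id=... query string -->
    <regex>
      <pattern>(myapp\?id=[0-9]+)[A-Za-z0-9_]+\.(?:htm|html|pdf)$</pattern>
      <substitution>$1</substitution>
    </regex>
  </regex-normalize>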


Re: db_unfetched large number, but crawling not fetching any longer

Posted by webdev1977 <we...@gmail.com>.
I guess I STILL don't understand the topN setting.  Here is what I thought it
would do:

Seed: file:////myfileserver.com/share1

share1 Dir listing: 
file1.pdf ... file300.pdf, dir1 ... dir20

running the following in a never-ending shell script:

{generate crawl/crawldb crawl/segments -topN 1000
fetch
parse
updatedb
invertlinks
solrindex
solrdedup}
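
Spelled out, a sketch of what that loop could look like with actual bin/nutch
commands; the segment handling, the Solr URL, and the exact solrindex argument
order (which varies between Nutch versions) are assumptions, not taken from
the script above:

  while true; do
    bin/nutch generate crawl/crawldb crawl/segments -topN 1000
    SEGMENT=$(ls -d crawl/segments/* | sort | tail -1)   # newest segment just generated
    bin/nutch fetch $SEGMENT
    bin/nutch parse $SEGMENT
    bin/nutch updatedb crawl/crawldb $SEGMENT
    bin/nutch invertlinks crawl/linkdb -dir crawl/segments
    # check 'bin/nutch solrindex' usage for your version before relying on this line
    bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb -linkdb crawl/linkdb $SEGMENT
    bin/nutch solrdedup http://localhost:8983/solr/
  done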

On the first iteration it would fetch the top 1000 scoring urls.  After that
first iteration there would be 1000 fetched urls in the crawldb, and the next
iteration would choose the next 1000 top scoring urls... and so on and so
forth.

Which means that eventually it would crawl ALL of the urls.  I am running
this script and I see that my db_fetched, db_unfetched, and total urls are
growing in number, but I am not seeing any new content being sent to solr.
I'm not sure what is going on here.
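
One quick way to confirm whether documents are actually reaching solr is to
ask it for its document count; the solr url here is an assumption:

  curl 'http://localhost:8983/solr/select?q=*:*&rows=0&wt=json'

The numFound value in the response should grow after each solrindex step if
new content is being indexed.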


Re: db_unfetched large number, but crawling not fetching any longer

Posted by Sebastian Nagel <wa...@googlemail.com>.
Could you explain what is meant by "continuously running crawl cycles"?

Usually, you run a crawl with a certain "depth", i.e. a maximum number of cycles.
If the depth is reached, the crawler stops even if there are still unfetched
URLs. If the generator produces an empty fetch list in one cycle, the crawler
stops before the depth is reached. The reasons for an empty fetch list may be
(the readdb commands below help to check which one applies):
  - no more unfetched URLs (trivial, but not in your case)
  - recent temporary failures: after a temporary failure (network timeout, etc.)
    a URL is blocked for one day.
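
To check which case applies, the crawldb can be inspected directly; a small
sketch, with the crawl/crawldb path as an assumption:

  bin/nutch readdb crawl/crawldb -stats         # per-status counts (db_unfetched, ...)
  bin/nutch readdb crawl/crawldb -url <someurl> # status, fetch time and retries of one URL

If the fetch time of an unfetched URL lies in the future, the generator will
skip it until that time has passed.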

Does one of these suggestions answer your question?

Sebastian

On 03/23/2012 02:46 PM, webdev1977 wrote:
> I was under the impression that setting topN for crawl cycles would limit the
> number of items each iteration of the crawl would fetch/parse, but that after
> continuously running crawl cycles it would eventually get ALL the urls.
> However, my continuous crawl has stopped fetching/parsing, and the stats from
> the crawldb indicate that db_unfetched is 133,359.
>
> Why is it no longer fetching urls if there are so many unfetched?
>