Posted to user@nutch.apache.org by h b <hb...@gmail.com> on 2013/07/12 20:52:24 UTC

URL count in queue

Hi,
I am crawling a single URL. I also downloaded the page and counted the URLs
in it by simply running:

grep -c href page.html

I got 724 links.

So I ran inject/generate/fetch/parse/updatedb once. I believe this first
run collects all the links on this page, to be crawled on the next run.

Then I ran the next generate/fetch.

This is what I see in the fetch reducer on the JobTracker:

20/20 spinwaiting/active, 61 pages, 0 errors, 0.1 0 pages/s, 414 459 kb/s,
1000 URLs in 1 queues > reduce


So why are there 1000 URLs in the queue when the page has only 724 links?
This page does not have any AJAX content.

Re: URL count in queue

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi,

1) Link attributes are found not only in a elements but also in area, form, frame, iframe,
script, link, and img elements. The attribute is not always named "href"; it may also be
"src" or "action". Cf. the property parser.html.outlinks.ignore_tags: excluding
img,script,link is a good choice (but not the default).

2) grep is case-sensitive unless told otherwise (option -i). HTML may specify <A HREF="...">
in upper case.

Cheers,
Sebastian
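
As a rough cross-check of the two points above, the following Python sketch counts
link-carrying attributes with the standard library's html.parser instead of grep. The
tag-to-attribute table, the helper names (LinkCounter, count_links, grep_c_href) and the
sample markup are illustrative assumptions; this is not Nutch's parser, and its counts
need not match what Nutch actually extracts. Note also that grep -c reports matching
lines rather than matches, so several links on one line are counted once.

```python
from html.parser import HTMLParser

# Tag -> link attribute, based on the elements named in the reply above.
# Illustrative assumption only, not Nutch's internal table.
LINK_ATTRS = {
    "a": "href", "area": "href", "link": "href",
    "form": "action",
    "frame": "src", "iframe": "src", "script": "src", "img": "src",
}


class LinkCounter(HTMLParser):
    """Collects link attribute values, optionally skipping some tags."""

    def __init__(self, ignore_tags=()):
        super().__init__()
        self.ignore_tags = set(ignore_tags)
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <A HREF=...>
        # is seen here as tag "a" with attribute "href".
        wanted = LINK_ATTRS.get(tag)
        if wanted is None or tag in self.ignore_tags:
            return
        for name, value in attrs:
            if name == wanted and value:
                self.links.append(value)


def count_links(html, ignore_tags=()):
    parser = LinkCounter(ignore_tags)
    parser.feed(html)
    parser.close()
    return len(parser.links)


def grep_c_href(html):
    # Mimics `grep -c href`: number of LINES containing lowercase "href".
    return sum(1 for line in html.splitlines() if "href" in line)


SAMPLE = """<html><head><link href="style.css"><script src="app.js"></script></head>
<body><A HREF="/upper">x</A> <a href="/one">1</a><a href="/two">2</a>
<img src="logo.png"><form action="/search"></form></body></html>"""

print(grep_c_href(SAMPLE))                                      # 2
print(count_links(SAMPLE))                                      # 7
print(count_links(SAMPLE, ignore_tags=("img", "script", "link")))  # 4
```

On this small sample the grep-style count is 2 while the parser finds 7 link attributes,
or 4 once img, script and link are ignored, which shows how a page that yields 724 lines
with "href" can still produce more outlinks.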
