Posted to user@nutch.apache.org by h b <hb...@gmail.com> on 2013/07/12 20:52:24 UTC
URL count in queue
Hi
I am crawling a url. I downloaded the page as well. I counted the urls in
the page by simply doing...
grep -c href page.html
I got 724 links
So I run inject/generate/fetch/parse/updatedb once. I believe this first
run will collect all the links on this page to be crawled on next run.
So I run the next generate/fetch
This is what I see in the fetch reducer on jobtracker
20/20 spinwaiting/active, 61 pages, 0 errors, 0.1 0 pages/s, 414 459 kb/s,
1000 URLs in 1 queues > reduce
So why are there 1000 URLs in the queue when the page only has 724 links?
The page does not use any AJAX.
Re: URL count in queue
Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi,
1) Links are found in a, area, form, frame, iframe, script, link, and img
elements. The attribute is not always named "href"; it can also be "src" or "action".
Cf. the property parser.html.outlinks.ignore_tags:
excluding img,script,link is a good choice (but not the default).
2) grep is case-sensitive unless told otherwise (option -i). HTML may specify <A HREF="...">.
Cheers,
Sebastian
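
A rough way to get a count closer to what the two points above describe is sketched below. It is only an approximation based on text matching, not what Nutch's parser does. The function name count_link_attrs is hypothetical, and the -o option is assumed to be available (it is in GNU and BSD grep, but it is not part of POSIX). Note also that grep -c counts matching lines rather than matches, so several links on one line are counted once; the sketch counts each occurrence instead.

```shell
# Count link-bearing attributes in an HTML file, one per occurrence.
# -o prints each match on its own line (grep -c would count lines instead),
# -i ignores case, so HREF and href both match,
# -E enables the alternation over href, src and action.
# count_link_attrs is a hypothetical helper name, not part of Nutch.
count_link_attrs() {
  grep -oiE '(href|src|action)[[:space:]]*=' "$1" | wc -l | tr -d '[:space:]'
}
```

It could be called as count_link_attrs page.html. The result includes duplicate links and any matching text inside scripts or comments, so it will not necessarily equal the number of URLs Nutch puts in the queue.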