Posted to user@nutch.apache.org by Ankur Garg <ga...@gmail.com> on 2009/04/06 08:12:04 UTC

Problem crawling BBC Hindi Site

Hi All,

I am trying to crawl the BBC Hindi site, http://www.bbc.co.uk/hindi/,
but after depth 1 the crawl reports "stopping at depth-1, no more urls to fetch".

Looking at the dump for depth 1, I realised that no content was fetched from
the page. Could anyone help me figure out the root cause of the problem?
Why is it not fetching any content from the page?

Has anyone tried to crawl the site http://www.bbc.co.uk/hindi/ ?


Thanks in advance.

-- 
Ankur Garg
अँकुर गर्ग

Re: Problem crawling BBC Hindi Site

Posted by yanky young <ya...@gmail.com>.
Hi:

If you are just using the nutch crawl command, you should add your domain
name to crawl-urlfilter.txt,

like this:

+^http://([a-z0-9]*\.)*bbc\.co\.uk/hindi

or

+^http://www\.bbc\.co\.uk/hindi
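
To check which URLs a rule like this would let through before re-running the crawl, a rough sketch such as the one below may help. It is not Nutch code: Nutch is written in Java and uses Java regular expressions, so Python's re module is only a stand-in here. The names RULES and accepts are hypothetical, the "-." catch-all rule is added purely to show a rejection, and the sketch assumes that the first rule whose pattern is found anywhere in the URL decides the outcome. The pattern used is a variant of the first rule above with the dots escaped and the subdomain group made optional.

```python
import re

# Hypothetical rule list in the style of crawl-urlfilter.txt:
# a leading '+' accepts a URL, a leading '-' rejects it.
RULES = [
    r"+^http://([a-z0-9]*\.)*bbc\.co\.uk/hindi",
    r"-.",  # assumed catch-all: reject everything else
]


def accepts(url, rules=RULES):
    """Return True if the first rule whose pattern is found in url is a '+' rule.

    Assumption: first matching rule wins, and patterns are searched for
    anywhere in the URL (anchors such as '^' must be written explicitly).
    """
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False  # no rule matched at all: treat the URL as rejected


if __name__ == "__main__":
    for candidate in (
        "http://www.bbc.co.uk/hindi/",
        "http://bbc.co.uk/hindi/news",
        "http://www.bbc.co.uk/news/",
    ):
        print(candidate, accepts(candidate))
```

If the seed URL itself comes back as rejected in a check like this, the filter file is a likely reason for "no more urls to fetch".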

Good luck.


