Posted to user@nutch.apache.org by Paul Liddelow <pa...@gmail.com> on 2007/03/19 10:14:37 UTC

Problems crawling a URL

Hi
I have set up Nutch and the crawler (following the intranet tutorial) and
can fetch results OK for the few URLs I have tested, but for some reason I
cannot get any results returned when I try to crawl this URL:
http://www.comlaw.gov.au/ComLaw/legislation/actcompilation1.nsf/sh/browse&VIEW=current&ORDER=bytitle&CATEGORY=actcompilation


I think it might have something to do with the ".nsf" extension midway
through the URL; the crawler may not be able to deal with it. Has anybody
else had this problem, or can anyone help?

Much obliged if anybody knows the answer.

Cheers
Paul

Re: Problems crawling a URL

Posted by Jeroen Verhagen <je...@gmail.com>.
Hi Paul,

Someone had to point this out to me too: in conf/crawl-urlfilter.txt
there is a line

  -[?*!@=]

which lists characters that are not allowed in URLs; any URL containing
one of them is skipped, and your URL contains '='.

Try removing that line, or just remove the '=' from it.
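
As a rough sketch (the surrounding comment wording may differ between
Nutch versions), the relevant part of conf/crawl-urlfilter.txt would
change from something like

  # skip URLs containing certain characters as probable queries, etc.
  -[?*!@=]

to

  # skip URLs containing certain characters as probable queries, etc.
  # relaxed so that '=' is allowed, e.g. for .nsf-style query URLs:
  -[?*!@]

If I remember correctly, the step-by-step (whole-web) commands read the
equivalent line from conf/regex-urlfilter.txt instead, so check that
file too if you are not using the one-shot crawl command.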

regards,

Jeroen
