Posted to user@nutch.apache.org by Susam Pal <su...@gmail.com> on 2007/10/11 19:49:11 UTC

Re: nutch won't index urls to servlets

Check the URL filter (conf/crawl-urlfilter.txt if you are running
bin/nutch crawl; conf/regex-urlfilter.txt if you are running the crawl
script).

By default, every URL that looks like a query is excluded by the following rule.

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

You need to comment out this line by putting a # in front of it.
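
To see why your link is skipped, here is a rough sketch in Python (not Nutch code) that applies the same character class to the URL from your mail. It assumes the rule is applied as a "find anywhere in the URL" match and that the leading "-" means "exclude"; the helper name is_skipped is made up for illustration.

```python
import re

# Character class taken from the default rule "-[?*!@=]".
# Assumption: the leading "-" marks an exclude rule and is not part of the regex.
QUERY_CHARS = re.compile(r"[?*!@=]")


def is_skipped(url: str) -> bool:
    """Return True if the character class matches anywhere in the URL."""
    return QUERY_CHARS.search(url) is not None


# The link from the archive page: it contains both "?" and "=".
servlet_url = "/servlet/ShowContent?ResourceType=S&ServerLocation=1&ResourceId=1163280"

print(is_skipped(servlet_url))            # True, so the rule would exclude it
print(is_skipped("/archive/index.html"))  # False, none of the characters occur
```

So the "?" and "=" in the servlet URL are enough to trigger the rule, whether or not the link is relative.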

Regards,
Susam Pal
http://susam.in/

On 10/11/07, Rohit Trivedi <ro...@db.com> wrote:
> Hi,
>
> I have an archive page with a bunch of links in it like so:
>
> <a
> href="/servlet/ShowContent?ResourceType=S&ServerLocation=1&ResourceId=1163280">qcs
> Monthly</a>
>
> but nutch doesn't index them - it doesn't even try..no traces in the logs
> of it even trying to fetch this url..is it because it's relative? is it
> because it's a query??
>
> help much appreciated,
> Rohit