Posted to user@nutch.apache.org by Kevin Porter <ke...@tinternet.mobi> on 2015/01/02 12:11:13 UTC

regex-urlfilter problem

Hi,


I added a regex to conf/regex-urlfilter.txt because I want to stop it
crawling all pages with "highlight=" in the query part of the URL. This is
the regex:

-^http://9ballpool.co.uk/forums/.*\?.*highlight=

Now nutch crawls nothing, it's like every URL is matching that regex and so
being excluded. Why?


-- 
http://themapps.com

Re: regex-urlfilter problem

Posted by Kevin Porter <ke...@tinternet.mobi>.
Hey thanks for trying to help, but I've realised something else went wrong
and it was just coincidence I had added that regex at that time. Had me
scratching my head for a while, but the regex works fine :)
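
A quick way to sanity-check a pattern like this outside Nutch is to run it against a few URLs by hand. The sketch below is only an illustration in Python, not Nutch code: the sample URLs are made up, and it assumes the filter does a partial ("find"-style) match, with the leading "-" being the exclude marker rather than part of the regex.

```python
import re

# Pattern from this thread, without the leading "-" (assumed to be the
# filter's exclude marker, not part of the regular expression itself).
PATTERN = re.compile(r"^http://9ballpool.co.uk/forums/.*\?.*highlight=")

# Hypothetical sample URLs, invented purely for illustration.
SAMPLES = [
    "http://9ballpool.co.uk/forums/viewtopic.php?t=12&highlight=cue",
    "http://9ballpool.co.uk/forums/viewtopic.php?t=12",
    "http://9ballpool.co.uk/forums/",
    "http://9ballpool.co.uk/index.html?highlight=cue",
]


def is_excluded(url):
    """Return True if the exclusion pattern matches the URL.

    re.search performs a partial match anchored only where the pattern
    says so (here, by the leading ^).
    """
    return PATTERN.search(url) is not None


if __name__ == "__main__":
    for url in SAMPLES:
        label = "excluded   " if is_excluded(url) else "not matched"
        print(label, url)
```

Only the first sample has both the /forums/ prefix and "highlight=" after the "?", so it is the only one the pattern matches.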

On 2 January 2015 at 12:07, Talat Uyarer <ta...@uyarer.com> wrote:

> Can you share your regex-urlfilter.txt file? You should add an accept-all
> rule at the end of the file.
>
> 2015-01-02 13:11 GMT+02:00 Kevin Porter <ke...@tinternet.mobi>:
> > Hi,
> >
> >
> > I added a regex to conf/regex-urlfilter.txt because I want to stop it
> > crawling all pages with "highlight=" in the query part of the URL. This is
> > the regex:
> >
> > -^http://9ballpool.co.uk/forums/.*\?.*highlight=
> >
> > Now nutch crawls nothing, it's like every URL is matching that regex and so
> > being excluded. Why?
> >
> >
> > --
> > http://themapps.com
>
>
>
> --
> Talat UYARER
> Website: http://talat.uyarer.com
> Twitter: http://twitter.com/talatuyarer
> Linkedin: http://tr.linkedin.com/pub/talat-uyarer/10/142/304
>



-- 
http://themapps.com

Re: regex-urlfilter problem

Posted by Talat Uyarer <ta...@uyarer.com>.
Can you share your regex-urlfilter.txt file? You should add an accept-all rule at the end of the file.
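
To show why the accept-all rule matters, here is a minimal Python sketch of how such a rule file is assumed to behave: rules are tried top to bottom, the first matching rule decides, and a URL that matches no rule is rejected. This is a simplified model written for illustration, not Nutch's actual implementation; the function names parse_rules and accepts and the test URLs are hypothetical.

```python
import re

# Hypothetical rule file contents: the exclusion from this thread followed
# by an accept-all rule ("+.") as the last line.
RULES_TEXT = r"""
# skip forum URLs with highlight= in the query string
-^http://9ballpool.co.uk/forums/.*\?.*highlight=
# accept anything else
+.
"""


def parse_rules(text):
    """Parse '+regex' / '-regex' lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"unexpected rule prefix: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """First matching rule wins; a URL that matches no rule is rejected."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


if __name__ == "__main__":
    rules = parse_rules(RULES_TEXT)
    for url in (
        "http://9ballpool.co.uk/forums/viewtopic.php?t=1",
        "http://9ballpool.co.uk/forums/viewtopic.php?t=1&highlight=cue",
    ):
        print(accepts(rules, url), url)
```

Under this model, a file that contains only "-" rules rejects every URL, because nothing ever reaches a "+" rule. That would produce exactly the "crawls nothing" symptom.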

2015-01-02 13:11 GMT+02:00 Kevin Porter <ke...@tinternet.mobi>:
> Hi,
>
>
> I added a regex to conf/regex-urlfilter.txt because I want to stop it
> crawling all pages with "highlight=" in the query part of the URL. This is
> the regex:
>
> -^http://9ballpool.co.uk/forums/.*\?.*highlight=
>
> Now nutch crawls nothing, it's like every URL is matching that regex and so
> being excluded. Why?
>
>
> --
> http://themapps.com



-- 
Talat UYARER
Website: http://talat.uyarer.com
Twitter: http://twitter.com/talatuyarer
Linkedin: http://tr.linkedin.com/pub/talat-uyarer/10/142/304