Posted to user@nutch.apache.org by Dean Del Ponte <de...@gmail.com> on 2012/01/19 18:08:46 UTC

Regex help - exclude a url

I have a website and the home page's URL is:  http://www.homepage.com

I would like to crawl all pages, EXCEPT the home page.

What regular expression could I use that would exclude
http://www.homepage.com but include other pages like
http://www.homepage.com/stuff?

Thanks!

Dean

Re: Regex help - exclude a url

Posted by Dean Del Ponte <de...@gmail.com>.
Thanks everyone for your help!

On Thu, Jan 19, 2012 at 11:17 AM, Eddie Drapkin <ed...@wolfram.com> wrote:

> On 1/19/2012 11:08 AM, Dean Del Ponte wrote:
>
>> I have a website and the home page's URL is:  http://www.homepage.com
>>
>> I would like to crawl all pages, EXCEPT the home page.
>>
>> What regex expression may I use that would exclude
>> http://www.homepage.com but include other pages like
>> http://www.homepage.com/stuff
>>
>> Thanks!
>>
>> Dean
>>
>>
> An expression like:
>
> +^http://www.homepage.com/.+$
>
> will force there to be something after the trailing /
>
> A more general approach (assuming index pages) might be:
>
> -^http://www.homepage.com/(index.(php[3-6]?|html|htm|py|rb|cgi))?$
> +^http://www.homepage.com/
>
> The first expression here will block anything at http://www.homepage.com/
> and http://www.homepage.com/index.whatever (be sure to add more extensions
> there if you need them, I added as many as I could think of).  The second
> expression will allow anything at
> http://www.homepage.com/ which is fine at this point because we've
> already blocked the pages we don't want (url filters are executed top
> down).  The one problem I can foresee from this approach is that you may
> need to crawl that page that you're excluding to get links to other pages
> (but maybe not).
>
>
> (Be sure to note the leading + and -, I assume you're using one of the
> urlfilter plugins).
>
> Thanks,
> Eddie
>

Re: Regex help - exclude a url

Posted by Eddie Drapkin <ed...@wolfram.com>.
On 1/19/2012 11:08 AM, Dean Del Ponte wrote:
> I have a website and the home page's URL is:  http://www.homepage.com
>
> I would like to crawl all pages, EXCEPT the home page.
>
> What regex expression may I use that would exclude
> http://www.homepage.com but include other pages like
> http://www.homepage.com/stuff
>
> Thanks!
>
> Dean
>

An expression like:

+^http://www.homepage.com/.+$

will force there to be something after the trailing /
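
A quick way to sanity-check this rule is to run it against a few sample URLs. The sketch below uses Python's re module as a stand-in for the Java regex engine behind the urlfilter plugin, which is an assumption that should hold for a pattern this simple. The trailing-slash variant of the home page is an assumed sample, not something from the original question.

```python
import re

# The rule from above without the leading "+", which is urlfilter
# syntax and not part of the regular expression itself.
rule = re.compile(r"^http://www.homepage.com/.+$")

samples = [
    "http://www.homepage.com",        # home page, no trailing slash
    "http://www.homepage.com/",       # home page, trailing slash (assumed variant)
    "http://www.homepage.com/stuff",  # a page that should be crawled
]

# Record whether the rule matches each sample URL.
matches = {url: bool(rule.search(url)) for url in samples}

for url, matched in matches.items():
    print(("match     " if matched else "no match  ") + url)
```

Only the /stuff URL matches. Neither form of the home page does, so with this as the only + rule the home page would not be accepted, assuming the filter rejects URLs that match no rule.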

A more general approach (assuming index pages) might be:

-^http://www.homepage.com/(index.(php[3-6]?|html|htm|py|rb|cgi))?$
+^http://www.homepage.com/

The first expression here will block anything at 
http://www.homepage.com/ and http://www.homepage.com/index.whatever (be 
sure to add more extensions there if you need them, I added as many as I 
could think of).  The second expression will allow anything at 
http://www.homepage.com/ which is fine at this point because we've 
already blocked the pages we don't want (url filters are executed top 
down).  The one problem I can foresee from this approach is that you may 
need to crawl that page that you're excluding to get links to other 
pages (but maybe not).


(Be sure to note the leading + and -, I assume you're using one of the 
urlfilter plugins).
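
To see how the rule order plays out, here is a minimal sketch that emulates top-down evaluation in Python. It is not the plugin's code. It assumes that the first rule whose pattern is found in the URL decides the outcome, that a URL matching no rule is rejected, and that Python's re module behaves like the Java engine for these patterns. The accepts helper and the RULES list are hypothetical names used only for illustration.

```python
import re

# Rules in file order as (sign, regex) pairs, copied from the two
# expressions above. The sign is kept separate from the pattern.
RULES = [
    ("-", r"^http://www.homepage.com/(index.(php[3-6]?|html|htm|py|rb|cgi))?$"),
    ("+", r"^http://www.homepage.com/"),
]


def accepts(url, rules=RULES):
    """Return True if the first matching rule is a "+" rule.

    Assumption: rules are tried top down, the first match wins,
    and a URL that matches no rule at all is rejected.
    """
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False


for url in (
    "http://www.homepage.com/",
    "http://www.homepage.com/index.html",
    "http://www.homepage.com/stuff",
):
    print(("accept  " if accepts(url) else "reject  ") + url)
```

The home page and its index.html form hit the - rule first and are rejected, while /stuff falls through to the + rule. Note that http://www.homepage.com without a trailing slash matches neither rule, so under the assumptions above it would be rejected as well.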

Thanks,
Eddie

Re: Regex help - exclude a url

Posted by remi tassing <ta...@gmail.com>.
Your homepage is probably http://www.homepage.com/index.html, so try
-^http://www.homepage.com/index.html
+^http://www.homepage.com

On Thursday, January 19, 2012, Dean Del Ponte <de...@gmail.com>
wrote:
> I have a website and the home page's URL is:  http://www.homepage.com
>
> I would like to crawl all pages, EXCEPT the home page.
>
> What regex expression may I use that would exclude
> http://www.homepage.com but include other pages like
> http://www.homepage.com/stuff
>
> Thanks!
>
> Dean
>