Posted to dev@nutch.apache.org by dhoulker <da...@gmail.com> on 2011/02/04 16:02:03 UTC

Skipping certain URLs

Hi,

I'm trying to skip certain URLs in an intranet site.

I'd like to skip this URL (it actually serves default.aspx, since we have
the default document set up):

http://10.47.23.110:85/firm-info/bios/

However, when I try to block that page, it also blocks the entire section of
the site.

So URLs like this also get blocked:

http://10.47.23.110:85/firm-info/bios/2904/some-page.aspx

My regex skills aren't great, so I suspect it's just that.

I've tried the rules below, but to no avail:

-http://10.47.23.110:85/firm-info/bios/
-http://10.47.23.110:85/firm-info/bios/[^0-9]

Can anyone help, please?

Thanks

Dave
-- 
View this message in context: http://lucene.472066.n3.nabble.com/Skipping-certain-URLs-tp2424735p2424735.html
Sent from the Nutch - Dev mailing list archive at Nabble.com.

Re: Skipping certain URLs

Posted by mmartinek <mi...@gmail.com>.
Try throwing a dollar sign ($) at the end of your expression to indicate that
it's the end of the string.

For example, with the leading - from your rules:

-http://10.47.23.110:85/firm-info/bios/$

would block just that URL but still allow
http://10.47.23.110:85/firm-info/bios/2904/somethingelse.aspx
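
To see why the trailing $ matters, here is a minimal Python sketch. It assumes the URL filter looks for each pattern anywhere in the URL (find-style matching) rather than requiring a full match, which is consistent with the behaviour reported in the question. Python's re.search stands in for the filter here, and the helper name is_skipped is made up for illustration. The leading - is left off the patterns because it is the filter file's skip marker, not part of the regex.

```python
import re

# The two URLs from the thread.
INDEX_URL = "http://10.47.23.110:85/firm-info/bios/"
DETAIL_URL = "http://10.47.23.110:85/firm-info/bios/2904/some-page.aspx"

# The two patterns from the question, plus the $-anchored suggestion.
UNANCHORED = r"http://10.47.23.110:85/firm-info/bios/"
NON_DIGIT = r"http://10.47.23.110:85/firm-info/bios/[^0-9]"
ANCHORED = r"http://10.47.23.110:85/firm-info/bios/$"


def is_skipped(pattern: str, url: str) -> bool:
    """Hypothetical helper: True if the pattern is found anywhere in the URL."""
    return re.search(pattern, url) is not None


for name, pattern in [
    ("unanchored", UNANCHORED),
    ("non-digit", NON_DIGIT),
    ("anchored", ANCHORED),
]:
    # Print whether each pattern would skip the index page and the detail page.
    print(name, is_skipped(pattern, INDEX_URL), is_skipped(pattern, DETAIL_URL))
```

Under that assumption, the unanchored pattern hits both URLs because the index URL is a prefix of every page below it. The [^0-9] variant hits neither: it needs one more non-digit character after bios/, and the index URL has nothing left there. Only the $ version singles out the index page.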

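You could also experiment with a regex like this, which matches only URLs
under /firm-info/ that end in .aspx. Used as a + (accept) rule, it would let
those through and leave every other ending to the remaining rules: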

http://10\.47\.23\.110:85/firm-info/(.*?)\.(ASPX|aspx)$

The (.*?) matches any run of characters in a non-greedy fashion, up to a
literal .ASPX or .aspx, and the trailing $ requires that extension to be the
very end of the URL. This means that things like default.aspx?name=value
would fail the match, as would a mixed-case ending such as .Aspx.
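
A quick way to check that pattern against a few URLs is sketched below in Python. The find-style matching via re.search is again an assumption about how the filter applies the rule, the helper name ends_in_aspx is hypothetical, and the Default.ASPX, page.Aspx and query-string URLs are made-up examples on the same host.

```python
import re

# The pattern from the reply, unchanged.
ASPX_ONLY = re.compile(r"http://10\.47\.23\.110:85/firm-info/(.*?)\.(ASPX|aspx)$")

BASE = "http://10.47.23.110:85/firm-info/bios/"


def ends_in_aspx(url: str) -> bool:
    """Hypothetical helper: True if the pattern is found in the URL."""
    return ASPX_ONLY.search(url) is not None


samples = [
    BASE + "2904/some-page.aspx",      # from the thread
    BASE + "Default.ASPX",             # made-up: upper-case extension
    BASE + "default.aspx?name=value",  # made-up: query string after .aspx
    BASE + "page.Aspx",                # made-up: mixed-case extension
    BASE,                              # the index URL itself
]

for url in samples:
    print(ends_in_aspx(url), url)
```

Only the first two samples match. The query string, the mixed-case extension and the bare index URL all fail because nothing in them satisfies the literal .ASPX or .aspx immediately before the end of the string.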

--
View this message in context: http://lucene.472066.n3.nabble.com/Skipping-certain-URLs-tp2424735p2997664.html