Posted to dev@nutch.apache.org by dhoulker <da...@gmail.com> on 2011/02/04 16:02:03 UTC
Skipping certain URLs
Hi,
I'm trying to skip certain URLs on an intranet site.
I'd like to skip the following page (it is actually default.aspx, which we
have set up as the default document):
http://10.47.23.110:85/firm-info/bios/
However, when I try to block that page, it also blocks the entire section of
the site.
So URLs like this one also get blocked:
http://10.47.23.110:85/firm-info/bios/2904/some-page.aspx
My regex skills aren't great, so I suspect it's just that.
I've tried the rules below, but to no avail:
-http://10.47.23.110:85/firm-info/bios/
-http://10.47.23.110:85/firm-info/bios/[^0-9]
Can anyone help, please?
Thanks
Dave
--
View this message in context: http://lucene.472066.n3.nabble.com/Skipping-certain-URLs-tp2424735p2424735.html
Sent from the Nutch - Dev mailing list archive at Nabble.com.
Re: Skipping certain URLs
Posted by mmartinek <mi...@gmail.com>.
Try throwing a dollar sign ($) at the end of your expression to indicate that
it's the end of the string.
For example:
http://10.47.23.110:85/firm-info/bios/$
Would block just that URL but allow
http://10.47.23.110:85/firm-info/bios/2904/somethingelse.aspx
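To see why the anchor makes the difference, here is a rough sketch in Python. It assumes the URL filter does an unanchored "find"-style match, which would explain why the original rule caught the whole section. The helper is_blocked is hypothetical and only mimics that behaviour; it is not part of Nutch.

```python
import re

# Hypothetical helper: mimics an unanchored ("find"-style) regex check,
# i.e. the pattern may match anywhere in the URL unless it is anchored.
def is_blocked(pattern: str, url: str) -> bool:
    return re.search(pattern, url) is not None

SECTION = "http://10.47.23.110:85/firm-info/bios/"
PAGE = "http://10.47.23.110:85/firm-info/bios/2904/some-page.aspx"

# Dots are escaped here so they match a literal "." only.
unanchored = r"http://10\.47\.23\.110:85/firm-info/bios/"
anchored = unanchored + "$"  # "$" requires the URL to end right here

for name, pattern in (("unanchored", unanchored), ("anchored", anchored)):
    for url in (SECTION, PAGE):
        print(name, url, is_blocked(pattern, url))
```

Without the $, the pattern is a prefix of every URL under /bios/, so all of them match. With it, only the bare directory URL does. Note that a link written out as /bios/default.aspx would not be caught by the anchored rule and would need its own line.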
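You could also play around with a regex like this, which matches only URLs
ending in .aspx (as a + rule it accepts those URLs, leaving anything not
ending in ASPX to be blocked by the remaining rules):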
http://10\.47\.23\.110:85/firm-info/(.*?)\.(ASPX|aspx)$
The (.*?) matches any characters in a non-greedy fashion, but only up to a
.ASPX or .aspx, which must then be the very end of the URL.
This means that things like default.aspx?name=value would fail the match.
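A quick way to check that behaviour is a small Python sketch like the one below. It assumes Python's re module treats this particular pattern the same way the crawler's regex engine does, and the name ends_in_aspx is made up for illustration.

```python
import re

# The pattern from above: the URL must end in .ASPX or .aspx.
ASPX_ONLY = re.compile(
    r"http://10\.47\.23\.110:85/firm-info/(.*?)\.(ASPX|aspx)$"
)

# Hypothetical helper: True when the URL matches the pattern.
def ends_in_aspx(url: str) -> bool:
    return ASPX_ONLY.search(url) is not None

BASE = "http://10.47.23.110:85/firm-info/bios/"

samples = [
    BASE + "2904/some-page.aspx",      # ends in .aspx
    BASE + "2904/some-page.ASPX",      # upper-case variant
    BASE + "default.aspx?name=value",  # query string after .aspx
    BASE,                              # bare directory URL
]

for url in samples:
    print(url, ends_in_aspx(url))
```

The first two samples match and the last two do not. One thing to watch: the alternation (ASPX|aspx) only covers the all-upper and all-lower spellings, so a mixed-case extension such as .Aspx would not match.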
--
View this message in context: http://lucene.472066.n3.nabble.com/Skipping-certain-URLs-tp2424735p2997664.html