Posted to user@nutch.apache.org by Elwin <ma...@gmail.com> on 2006/02/23 10:49:46 UTC
About regex in the crawl-urlfilter.txt config file
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
Will this pattern accept a URL like http://MY.DOMAIN.NAME/([a-z0-9]*\.)*/?
I thought it would not, but in fact Nutch does crawl and fetch URLs like that
in an intranet crawl. Why?
Re: About regex in the crawl-urlfilter.txt config file
Posted by Elwin <ma...@gmail.com>.
Oh I have asked a silly question about regex, hehe.
2006/2/23, Jack Tang <hi...@gmail.com>:
>
> Hi
>
> I think in the url-filter it uses "contain" rather than "match".
>
> /Jack
>
> On 2/23/06, Elwin <ma...@gmail.com> wrote:
> > # accept hosts in MY.DOMAIN.NAME
> > +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
> >
> > Will this pattern accept url like this http://MY.DOMAIN.NAME/([a-z0-9]*\.)*/?
> > I think it's not, but in fact nutch can crawl and get urls like that in
> > intranet crawl. Why?
> >
> >
>
>
> --
> Keep Discovering ... ...
> http://www.jroller.com/page/jmars
>
--
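"盖世豪侠" (The Final Combat) won rave reviews and kept TVB's ratings high,
yet TVB, pleased as it was, still gave him no major roles. Stephen Chow was
never one to stay in a small pond: once his comic talent had shown itself,
he would not accept being sidelined, so he moved to the film industry to
shine on the big screen. TVB had found a thoroughbred and then lost it,
and naturally regretted it when it was too late.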
Re: About regex in the crawl-urlfilter.txt config file
Posted by Gal Nitzan <gn...@usa.net>.
if (matcher.find()) ....
On Thu, 2006-02-23 at 18:10 +0800, Jack Tang wrote:
> Hi
>
> I think in the url-filter it uses "contain" rather than "match".
>
> /Jack
>
> On 2/23/06, Elwin <ma...@gmail.com> wrote:
> > # accept hosts in MY.DOMAIN.NAME
> > +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
> >
> > Will this pattern accept url like this http://MY.DOMAIN.NAME/([a-z0-9]*\.)*/?
> > I think it's not, but in fact nutch can crawl and get urls like that in
> > intranet crawl. Why?
> >
> >
>
>
> --
> Keep Discovering ... ...
> http://www.jroller.com/page/jmars
>
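A minimal sketch of the difference the matcher.find() hint points at, using
plain java.util.regex. The class and method names (FindVsMatches,
acceptedByFind, acceptedByMatches) and the sample URLs are made up for
illustration; this is not Nutch's filter code. It assumes the leading "+"
in crawl-urlfilter.txt is only the accept marker and is not part of the
regex.

```java
import java.util.regex.Pattern;

public class FindVsMatches {
    // The rule from the thread, without the leading '+' (assumed to be the accept marker).
    static final Pattern HOST_RULE =
        Pattern.compile("^http://([a-z0-9]*\\.)*MY.DOMAIN.NAME/");

    // "contain" semantics: the pattern only has to match somewhere in the URL.
    // Because of '^', that somewhere is the beginning of the URL.
    static boolean acceptedByFind(String url) {
        return HOST_RULE.matcher(url).find();
    }

    // "match" semantics: the pattern has to cover the entire URL.
    static boolean acceptedByMatches(String url) {
        return HOST_RULE.matcher(url).matches();
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://MY.DOMAIN.NAME/",
            "http://MY.DOMAIN.NAME/docs/page.html",
            "http://www.MY.DOMAIN.NAME/index.html",
            "http://other.example.com/"
        };
        for (String url : urls) {
            System.out.println(url
                + " find=" + acceptedByFind(url)
                + " matches=" + acceptedByMatches(url));
        }
    }
}
```

With find(), any URL that starts with the host prefix is accepted, whatever
path follows. With matches(), only the bare "http://MY.DOMAIN.NAME/" style
URL would pass.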
Re: About regex in the crawl-urlfilter.txt config file
Posted by Jack Tang <hi...@gmail.com>.
Hi
I think in the url-filter it uses "contain" rather than "match".
/Jack
On 2/23/06, Elwin <ma...@gmail.com> wrote:
> # accept hosts in MY.DOMAIN.NAME
> +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
>
> Will this pattern accept url like this http://MY.DOMAIN.NAME/([a-z0-9]*\.)*/?
> I think it's not, but in fact nutch can crawl and get urls like that in
> intranet crawl. Why?
>
>
--
Keep Discovering ... ...
http://www.jroller.com/page/jmars
Re: About regex in the crawl-urlfilter.txt config file
Posted by Martin Gutbrod <gu...@ifalt.de>.
nutch-user@lucene.apache.org wrote:
> # accept hosts in MY.DOMAIN.NAME
> +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
>
> Will this pattern accept url like this http://MY.DOMAIN.NAME/([a-z0-9]*\.)*/?
Yes.
The regex in crawl-urlfilter.txt has only a start anchor (^) but no
end anchor ($). So only the beginning (the left part) of the URL has to
match the pattern; whatever follows the host part is not checked.
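To illustrate the point about the missing end anchor, here is a small
sketch in plain java.util.regex that compares the rule as written with a
hypothetical variant that ends in "$". The class name AnchorDemo, the two
constants, and the sample URLs are assumptions for illustration only. Both
patterns are tested with find(), as suggested earlier in the thread.

```java
import java.util.regex.Pattern;

public class AnchorDemo {
    // Rule as it appears in the thread: anchored at the start only.
    static final Pattern START_ONLY =
        Pattern.compile("^http://([a-z0-9]*\\.)*MY.DOMAIN.NAME/");

    // Hypothetical variant with an end anchor: nothing may follow the final '/'.
    static final Pattern START_AND_END =
        Pattern.compile("^http://([a-z0-9]*\\.)*MY.DOMAIN.NAME/$");

    static boolean accepts(Pattern rule, String url) {
        return rule.matcher(url).find();
    }

    public static void main(String[] args) {
        String root = "http://MY.DOMAIN.NAME/";
        String deep = "http://MY.DOMAIN.NAME/some/path/page.html";

        System.out.println("start only,    root: " + accepts(START_ONLY, root));
        System.out.println("start only,    deep: " + accepts(START_ONLY, deep));
        System.out.println("start and end, root: " + accepts(START_AND_END, root));
        System.out.println("start and end, deep: " + accepts(START_AND_END, deep));
    }
}
```

Leaving off the "$" is therefore what lets the rule accept every page on
the host, which is usually what a crawl rule of this kind is meant to do.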