Posted to user@nutch.apache.org by Elwin <ma...@gmail.com> on 2006/02/23 10:49:46 UTC

About regex in the crawl-urlfilter.txt config file

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

Will this pattern accept a URL like http://MY.DOMAIN.NAME/([a-z0-9]*\.)*/?
I think it should not, but in fact Nutch does crawl and fetch URLs like that
in an intranet crawl. Why?

Re: About regex in the crawl-urlfilter.txt config file

Posted by Elwin <ma...@gmail.com>.

Oh, I asked a silly question about regex, hehe.

2006/2/23, Jack Tang <hi...@gmail.com>:
>
> Hi
>
> I think in the url-filter it uses "contain" rather than "match".
>
> /Jack
>
> On 2/23/06, Elwin <ma...@gmail.com> wrote:
> > # accept hosts in MY.DOMAIN.NAME
> > +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
> >
> > Will this pattern accept url like this http://MY.DOMAIN.NAME/([a-z0-9]*\.)*/?
> > I think it's not, but in fact nutch can crawl and get urls like that in
> > intranet crawl. Why?
> >
> >
>
>
> --
> Keep Discovering ... ...
> http://www.jroller.com/page/jmars
>



--
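"The Final Combat" (《盖世豪侠》) won rave reviews and kept TVB's ratings high,
yet for all its delight TVB still did not give him major roles. Stephen Chow
was never one to stay in a small pond: with his comic talent now on show, he
would not settle for being sidelined, so he moved into film to shine on the
big screen. TVB gained a thoroughbred and then lost it, and naturally
regretted it too late.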

Re: About regex in the crawl-urlfilter.txt config file

Posted by Gal Nitzan <gn...@usa.net>.

if (matcher.find()) ....
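
To spell out what the snippet above implies, here is a minimal sketch using plain java.util.regex. It is not the Nutch source; the class and method names are made up for illustration, and it assumes the rule is applied with find() as the snippet suggests. find() succeeds as soon as the pattern occurs in the input, while matches() requires the pattern to cover the whole input.

```java
import java.util.regex.Pattern;

public class FindVsMatches {
    // The rule from crawl-urlfilter.txt without the leading '+'
    // (the '+' only marks the rule as "accept"; it is not part of the regex).
    static final Pattern RULE =
        Pattern.compile("^http://([a-z0-9]*\\.)*MY.DOMAIN.NAME/");

    // True if the pattern occurs in the URL; '^' pins it to the start,
    // but nothing constrains what follows the final '/'.
    static boolean found(String url) {
        return RULE.matcher(url).find();
    }

    // True only if the pattern consumes the entire URL.
    static boolean matchesWhole(String url) {
        return RULE.matcher(url).matches();
    }

    public static void main(String[] args) {
        String deep = "http://MY.DOMAIN.NAME/docs/page.html";
        System.out.println(found(deep));                    // true
        System.out.println(matchesWhole(deep));             // false
        System.out.println(found("http://other.example/")); // false
    }
}
```

So with find(), any URL that merely starts with the accepted host prefix passes the rule, whatever path comes after it.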


On Thu, 2006-02-23 at 18:10 +0800, Jack Tang wrote:
> Hi
> 
> I think in the url-filter it uses "contain" rather than "match".
> 
> /Jack
> 
> On 2/23/06, Elwin <ma...@gmail.com> wrote:
> > # accept hosts in MY.DOMAIN.NAME
> > +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
> >
> > Will this pattern accept url like this http://MY.DOMAIN.NAME/([a-z0-9]*\.)*/?
> > I think it's not, but in fact nutch can crawl and get urls like that in
> > intranet crawl. Why?
> >
> >
> 
> 
> --
> Keep Discovering ... ...
> http://www.jroller.com/page/jmars
>

Re: About regex in the crawl-urlfilter.txt config file

Posted by Jack Tang <hi...@gmail.com>.

Hi

I think the url-filter uses "contains" rather than "matches".

/Jack

On 2/23/06, Elwin <ma...@gmail.com> wrote:
> # accept hosts in MY.DOMAIN.NAME
> +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
>
> Will this pattern accept url like this http://MY.DOMAIN.NAME/([a-z0-9]*\.)*/?
> I think it's not, but in fact nutch can crawl and get urls like that in
> intranet crawl. Why?
>
>


--
Keep Discovering ... ...
http://www.jroller.com/page/jmars

(AW) About regex in the crawl-urlfilter.txt config file

Posted by Martin Gutbrod <gu...@ifalt.de>.

nutch-user@lucene.apache.org wrote:
> # accept hosts in MY.DOMAIN.NAME
> +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
> 
> Will this pattern accept url like this http://MY.DOMAIN.NAME/([a-z0-9]*\.)*/?

Yes.
The regex in crawl-urlfilter.txt has only a start anchor (^) but no
end anchor ($), so only the start (left part) of the URL
is compared.
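
The effect of the missing end anchor can be checked with a small sketch in plain java.util.regex. This is only an illustration, not Nutch code: the class name, the accepts() helper, and the variant rule ending in $ are hypothetical, and it assumes the filter searches the URL for the pattern rather than requiring a full match.

```java
import java.util.regex.Pattern;

public class AnchorDemo {
    // Original rule: anchored at the start only.
    static final Pattern PREFIX_ONLY =
        Pattern.compile("^http://([a-z0-9]*\\.)*MY.DOMAIN.NAME/");

    // Hypothetical variant with an end anchor added, for comparison.
    static final Pattern BOTH_ENDS =
        Pattern.compile("^http://([a-z0-9]*\\.)*MY.DOMAIN.NAME/$");

    // Assumed filter behaviour: accept when the pattern is found in the URL.
    static boolean accepts(Pattern rule, String url) {
        return rule.matcher(url).find();
    }

    public static void main(String[] args) {
        String root = "http://www.MY.DOMAIN.NAME/";
        String deep = "http://www.MY.DOMAIN.NAME/a/b.html";
        System.out.println(accepts(PREFIX_ONLY, root)); // true
        System.out.println(accepts(PREFIX_ONLY, deep)); // true
        System.out.println(accepts(BOTH_ENDS, root));   // true
        System.out.println(accepts(BOTH_ENDS, deep));   // false
    }
}
```

For a crawl restricted to one domain the start-only form is what you want: it accepts every page under the host, whereas the variant with $ would admit only the root URL.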