Posted to user@nutch.apache.org by Michael Ji <fj...@yahoo.com> on 2006/03/03 14:50:49 UTC

limit fetching by using crawl-urlfilter.txt

Hi,

I searched the mailing list archive, but I still have a
problem getting my test to run.

I want to limit my crawl to just two sites,

such as  *.abc.com/*
and      *.def.com/*

so I put two lines in crawl-urlfilter.txt:
+^http://([a-z0-9]*\.)*.abc.com/
+^http://([a-z0-9]*\.)*.def.com/

But when I ran the test, the crawl was not limited
to the above two sites.

In the log, I found "not found ...urlfilter-prefix"

I wonder whether the failure is because crawl-urlfilter.txt
is not included in my configuration XML, or because there is
a syntax error in the patterns above.

thanks,

Michael
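
One way to check the "syntax error" possibility is to run the posted patterns against a few sample URLs outside the crawler. The following is only a minimal sketch: it assumes Python's re module treats these simple patterns the same way as the regex engine used by the filter, and the sample URLs, the "adjusted" patterns, and the helper name any_match are hypothetical.

```python
import re

# The two patterns exactly as posted, without the leading '+' sign.
posted = [
    re.compile(r"^http://([a-z0-9]*\.)*.abc.com/"),
    re.compile(r"^http://([a-z0-9]*\.)*.def.com/"),
]

# Hypothetical variant: the extra '.' before the site name is removed
# and the literal dot in ".com" is escaped.
adjusted = [
    re.compile(r"^http://([a-z0-9]*\.)*abc\.com/"),
    re.compile(r"^http://([a-z0-9]*\.)*def\.com/"),
]


def any_match(patterns, url):
    """Return True if at least one pattern is found in the URL."""
    return any(p.search(url) for p in patterns)


# Hypothetical sample URLs, not taken from the original post.
samples = [
    "http://www.abc.com/index.html",
    "http://def.com/",
    "http://www.xabc.com/",
]

for url in samples:
    print(url, "posted:", any_match(posted, url),
          "adjusted:", any_match(adjusted, url))
```

Under these assumptions, the extra '.' before "abc" and "def" requires one additional character in front of the site name, so the posted patterns are stricter than intended rather than looser. That would not by itself explain other sites being crawled.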



Re: limit fetching by using crawl-urlfilter.txt

Posted by Ravi Chintakunta <ra...@gmail.com>.
You can put the inclusion and exclusion URL regexes on separate
lines or combine them with alternation (|); it does not make much
difference. Make sure that you have this line at the end:

-.

This makes sure that no other sites are crawled.
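
The effect of a trailing "-." rule can be sketched as follows. This is an illustration only, not the crawler's actual code: it assumes rules are tried from top to bottom, that the first matching rule decides, and that a URL matching no rule is dropped. The helper names load_rules and accepts and the sample URLs are hypothetical.

```python
import re

# Filter file contents: one inclusion rule plus the catch-all exclusion.
RULES_TEXT = r"""
# accept the two sites
+^http://([a-z0-9]*\.)*(abc\.net|def\.org)/
# reject everything else
-.
"""


def load_rules(text):
    """Parse '+regex' / '-regex' lines, skipping blanks and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in ("+", "-"):
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Assumed semantics: the first rule whose regex is found decides."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False  # assumption: a URL that matches no rule is dropped


rules = load_rules(RULES_TEXT)
for url in ("http://www.abc.net/page", "http://def.org/", "http://www.example.com/"):
    print(url, accepts(rules, url))
```

With first-match-wins ordering, the position of "-." matters: placed first, it would reject every URL before the "+" rule is ever consulted.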

- Ravi


Re: limit fetching by using crawl-urlfilter.txt

Posted by Jack Tang <hi...@gmail.com>.
On 3/3/06, Michael Ji <fj...@yahoo.com> wrote:
> Hi,
>
> I tried this. In my case, one site ends with
> .net and the other with .org,
>
> so I modified it to
>
> +^http://([a-z0-9]*\.)*(abc.net|def.org)/
I think '.' is a metacharacter in regular expressions, so please try
+^http://([a-z0-9]*\.)*(abc\.net|def\.org)/

Good luck!
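
The difference between the two patterns can be seen with a small check. This is a sketch that assumes Python's re module handles these patterns like the filter's regex engine; the host "abcxnet" is a contrived, hypothetical example.

```python
import re

# Pattern as first tried: the dots inside the group match any character.
unescaped = re.compile(r"^http://([a-z0-9]*\.)*(abc.net|def.org)/")

# Pattern with the dots escaped so they only match a literal '.'.
escaped = re.compile(r"^http://([a-z0-9]*\.)*(abc\.net|def\.org)/")

# Contrived host where the character between "abc" and "net" is not a dot.
lookalike = "http://abcxnet/"
intended = "http://www.abc.net/index.html"

print("unescaped, lookalike:", bool(unescaped.search(lookalike)))
print("escaped, lookalike:  ", bool(escaped.search(lookalike)))
print("unescaped, intended: ", bool(unescaped.search(intended)))
print("escaped, intended:   ", bool(escaped.search(intended)))
```

Both patterns accept the intended host; the unescaped one additionally accepts look-alike hosts. Escaping tightens the rule, but the unescaped form would only let through hosts that closely resemble the two site names.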

> and ran another test. It doesn't seem to work, because I
> saw a site other than abc and def being fetched.
>
> Any hints?
>
> Thanks,
>
> Michael


--
Keep Discovering ... ...
http://www.jroller.com/page/jmars

Re: limit fetching by using crawl-urlfilter.txt

Posted by Michael Ji <fj...@yahoo.com>.
Hi,

I tried this. In my case, one site ends with
.net and the other with .org,

so I modified it to

+^http://([a-z0-9]*\.)*(abc.net|def.org)/

and ran another test. It doesn't seem to work, because I
saw a site other than abc and def being fetched.

Any hints?

Thanks,

Michael


Re: limit fetching by using crawl-urlfilter.txt

Posted by sudhendra seshachala <su...@yahoo.com>.
Hi,
  Try the following pattern:
  +^http://([a-z0-9]*\.)*(abc|def).com/

  I was able to search a couple of sites using a similar pattern.
  Is this what you are asking?
  
  Sudhi Seshachala
  http://sudhilogs.blogspot.com/
   


		