Posted to user@nutch.apache.org by Ted Yu <yu...@gmail.com> on 2009/12/21 23:14:43 UTC

domain crawl using bin/nutch

Hi,
I found the db.ignore.external.links property.
How do I limit the crawl further by also excluding some links within the
same domain?

Thanks

RE: domain crawl using bin/nutch

Posted by Jun Mao <Ju...@symantec.com>.
But how can we tell Nutch to crawl this way every time?
I do not want to edit *-urlfilter.txt for every crawl.

Thanks,
 
Jun

-----Original Message-----
From: Jesse Hires [mailto:jhires@gmail.com] 
Sent: December 22, 2009 9:23
To: nutch-user@lucene.apache.org
Subject: Re: domain crawl using bin/nutch

You should be able to do this using one of the variations of the
*-urlfilter.txt files. Instead of using "+" in front of the regex, you can
tell it to exclude URLs that match the regex with a "-".

Just a guess, I haven't actually tried it, but you could probably use
something like the following. (I'm sure you would have to fiddle with it to
get it to work correctly.) The exclude rule goes first because the first
matching rule in the file decides whether a URL is kept.

-/(pagename1|pagename2)\.php
+^http://([a-z0-9]*\.)*mydomain\.com/



Jesse

int GetRandomNumber()
{
   return 4; // Chosen by fair roll of dice
                // Guaranteed to be random
} // xkcd.com




Re: domain crawl using bin/nutch

Posted by Jesse Hires <jh...@gmail.com>.
You should be able to do this using one of the variations of the
*-urlfilter.txt files. Instead of using "+" in front of the regex, you can
tell it to exclude URLs that match the regex with a "-".

Just a guess, I haven't actually tried it, but you could probably use
something like the following. (I'm sure you would have to fiddle with it to
get it to work correctly.) The exclude rule goes first because the first
matching rule in the file decides whether a URL is kept.

-/(pagename1|pagename2)\.php
+^http://([a-z0-9]*\.)*mydomain\.com/
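
To see how an include rule and an exclude rule interact before touching a
real crawl, here is a minimal sketch in Python. It is not Nutch code. It
assumes the filter reads rules top to bottom, lets the first matching rule
decide, matches the pattern anywhere in the URL, and rejects URLs that match
no rule. Python's re module stands in for Java regexes here, and
parse_rules, accepts and RULES are names made up for this illustration. The
exclude pattern is listed first and written as /(pagename1|pagename2)\.php,
since a pattern that begins with a bare * does not compile.

```python
import re

# Hypothetical rules in the same "+regex" / "-regex" form as a urlfilter file.
# The exclusion comes first on the assumption that the first match decides.
RULES = [
    r"-/(pagename1|pagename2)\.php",
    r"+^http://([a-z0-9]*\.)*mydomain\.com/",
]


def parse_rules(lines):
    """Turn '+regex' / '-regex' lines into (accept, compiled_pattern) pairs."""
    parsed = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        parsed.append((sign == "+", re.compile(pattern)))
    return parsed


def accepts(url, rules):
    """Return the verdict of the first rule whose pattern is found in url."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False  # assumption: a URL that matches no rule is dropped


if __name__ == "__main__":
    rules = parse_rules(RULES)
    for url in (
        "http://www.mydomain.com/index.html",
        "http://www.mydomain.com/pagename1.php",
        "http://other.example.org/index.html",
    ):
        print(url, "->", "keep" if accepts(url, rules) else "skip")
```

Swapping the two entries in RULES shows why the order matters under these
assumptions: the include rule would then match first and pagename1.php
would be kept.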



Jesse

int GetRandomNumber()
{
   return 4; // Chosen by fair roll of dice
                // Guaranteed to be random
} // xkcd.com


