Posted to user@nutch.apache.org by Vince Filby <vf...@gmail.com> on 2007/08/02 19:59:08 UTC
Domain URL Filtering
Hello,
Is there a way to tell Nutch to only follow links within the domain it is
currently crawling?
What I would like to do is pass a list of URLs, and Nutch should ignore any
outbound link that points to a domain other than the one the link came
from. Let's say that I am crawling www.test1.com: only links to
www.test1.com should be followed.
I realize that I can do this with the regex filter *if* I add a regex rule
for *each* site that I want to crawl, but that solution doesn't scale well
for my project. I have also read about a database-backed URL filter that
maintains a list of accepted URLs; that doesn't fit well either, since I
don't want to maintain both the crawl list and an accepted-domain database.
I could, but it would be rather clunky.
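For illustration (the site list here is hypothetical), the regex approach
would mean something like this in conf/regex-urlfilter.txt, one accept rule
per crawled site:

# accept only the sites being crawled, one rule per site
+^http://www\.test1\.com/
+^http://www\.test2\.com/
# ... hundreds more of these ...
# reject everything else
-.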
I have poked around the source, and it looks like the URL filtering
mechanism is only handed the link URL and returns a URL, so the filter
never knows which page the link came from. It appears this is not really
possible at the code level without source modifications. I would just like
to confirm that I am not missing anything obvious before I start reworking
the code.
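For reference, here is a simplified sketch of the filter interface as I
understand it (org.apache.nutch.net.URLFilter, trimmed down to the
relevant method):

// Simplified sketch of org.apache.nutch.net.URLFilter: the filter is
// handed only the URL string, with no reference to the page it came from.
public interface URLFilter {
  // Return the URL (possibly rewritten) to accept it, or null to reject it.
  String filter(String urlString);
}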
Cheers,
Vince
Re: Domain URL Filtering
Posted by Vince Filby <vf...@gmail.com>.
I will indeed! I hadn't thought to check through the properties, but I am
familiarizing myself with them now. There is certainly a treasure trove of
goodness in there.
Thank you for your assistance.
Cheers,
Vince
On 8/2/07, Renaud Richardet <re...@apache.org> wrote:
>
> hi Vince,
>
> have you tried this property?
>
> <name>db.ignore.external.links</name>
> [...]
Re: Domain URL Filtering
Posted by Renaud Richardet <re...@apache.org>.
hi Vince,
have you tried this property?
<property>
  <name>db.ignore.external.links</name>
  <value>false</value>
  <description>If true, outlinks leading from a page to external hosts
  will be ignored. This is an effective way to limit the crawl to include
  only initially injected hosts, without creating complex URLFilters.
  </description>
</property>
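Note that the snippet above is the default from nutch-default.xml, so the
value is false. To turn the behavior on, override it in conf/nutch-site.xml,
along these lines:

<!-- conf/nutch-site.xml: override the default from nutch-default.xml -->
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
</property>

With that set, outlinks to external hosts are ignored and the crawl stays
within the initially injected hosts, with no per-site regex rules needed.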
HTH,
Renaud
Vince Filby wrote:
> Hello,
>
> Is there a way to tell Nutch to only follow links within the domain it is
> currently crawling?
>
> [...]