Posted to user@nutch.apache.org by rubenll <ru...@hotmail.com> on 2007/11/02 18:17:37 UTC

restrict indexing only to a domain list with no using crawl-urlfilter

Hello, when crawling in the intranet style it is easy to restrict the crawl
to a list of domains, i.e. fetch only N levels deep and only within those
domains (no external links).

When doing whole-web crawling, is there any way to restrict indexing of
external links to a list of domains without using crawl-urlfilter? Maintaining
that file by hand seems like a lot of work to me.

Regards
rub
-- 
View this message in context: http://www.nabble.com/restrict-indexing-only-to-a-domain-list-with-no-using-crawl-urlfilter-tf4738836.html#a13551940
Sent from the Nutch - User mailing list archive at Nabble.com.


Re: restrict indexing only to a domain list with no using crawl-urlfilter

Posted by rubenll <ru...@hotmail.com>.
perfect, thank you very much

Rub



-- 
View this message in context: http://www.nabble.com/restrict-indexing-only-to-a-domain-list-with-no-using-crawl-urlfilter-tf4738836.html#a13562194
Sent from the Nutch - User mailing list archive at Nabble.com.


Re: restrict indexing only to a domain list with no using crawl-urlfilter

Posted by misc <mi...@robotgenius.net>.
Hello-

    From the wiki FAQ:

Is it possible to fetch only pages from some specific domains?
Please have a look at PrefixURLFilter. Adding some regular expressions to
the urlfilter.regex.file might work, but adding a list with thousands of
regular expressions would slow down your system excessively.
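As an illustration, the plain-text file read by the urlfilter-prefix plugin
(the filter behind PrefixURLFilter) is just a list of allowed URL prefixes,
one per line. The domains below are hypothetical:

```
# prefix-urlfilter.txt: a URL is accepted only if it starts
# with one of these prefixes; everything else is rejected.
http://www.example.com/
http://docs.example.org/
```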

Alternatively, you can set db.ignore.external.links to "true" and inject
seeds from the domains you wish to crawl (these seeds must link, directly or
indirectly, to all pages you wish to crawl). This lets the crawl stay within
those domains without following external links off them. Unfortunately there
is no way to record the external links encountered for future processing,
although a very small patch to the generator code can allow you to log them
to hadoop.log.
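A minimal sketch of the corresponding override, assuming the standard Nutch
configuration layout where conf/nutch-site.xml overrides nutch-default.xml;
only the property name comes from the post, the description text is mine:

```xml
<!-- conf/nutch-site.xml -->
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>Ignore outlinks that point outside the host of the
  page they were found on, so the crawl stays within the injected
  seed domains.</description>
</property>
```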



    I use the second method.

                        see you

                            -Jim



