Posted to user@nutch.apache.org by Alexandre <al...@gmail.com> on 2012/09/24 09:15:42 UTC

External domain redirection with db.ignore.external.links=true

Hi,

I have a question about redirects to an external domain.
I crawl several websites, but I don't want to crawl external links. For
that I set the option
db.ignore.external.links=true
It works fine. My problem is that websites which redirect to
an external domain are not crawled.
For example:
http://www.ikea.at is redirected to http://www.ikea.com/at/de/ and my
crawler ignores this website because of the option
db.ignore.external.links=true.

One solution would be to put the URL http://www.ikea.com/at/de/ directly in
the seed list, but that is not an option for me, because I cannot change
this list.

Is there any way in Nutch to allow crawling of websites that
redirect to external domains, while still ignoring external links?

Thanks for your help,

Alex.



--
View this message in context: http://lucene.472066.n3.nabble.com/External-domain-redirection-with-db-ignore-external-links-true-tp4009783.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: External domain redirection with db.ignore.external.links=true

Posted by Stefan Scheffler <ss...@avantgarde-labs.de>.

Hey, this depends on one question:
     Do you know all the redirect targets when you start the crawl?
         In that case you can simply edit conf/regex-urlfilter.txt. For example:
             +http://www.ikea.com/at/de/.*
         A leading '+' means that all URLs matching the regex are included;
         a leading '-' means that matching URLs are excluded.
      If you don't know all the redirect targets, it is a little more
complicated. I wrote my own URL filter plugin to make Nutch follow
redirects, but it slowed the crawl down a little.
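
To make the '+' / '-' rule semantics concrete, here is a minimal Java sketch of how such a rule list could be evaluated. It is not Nutch's own filter class: the name RegexRuleSketch is hypothetical, and the behaviour shown (first matching rule wins, substring-style regex matching, no match means the URL is dropped, null means rejected) is an assumption modelled on how the regex URL filter is usually described.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical stand-in for '+'/'-' rule evaluation; not a Nutch class. */
class RegexRuleSketch {
    private static final class Rule {
        final boolean include;
        final Pattern pattern;

        Rule(boolean include, Pattern pattern) {
            this.include = include;
            this.pattern = pattern;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    /** Parses lines in the style of conf/regex-urlfilter.txt. */
    RegexRuleSketch(List<String> lines) {
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue; // skip blank lines and comments
            }
            char sign = trimmed.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Rule must start with + or -: " + trimmed);
            }
            rules.add(new Rule(sign == '+', Pattern.compile(trimmed.substring(1))));
        }
    }

    /** Returns the URL if it is accepted, or null if it is rejected. */
    String filter(String url) {
        for (Rule rule : rules) {
            // Assumption: the first rule whose regex is found in the URL decides.
            if (rule.pattern.matcher(url).find()) {
                return rule.include ? url : null;
            }
        }
        return null; // assumption: no matching rule means the URL is dropped
    }

    public static void main(String[] args) {
        RegexRuleSketch filter = new RegexRuleSketch(List.of(
                "# allow the redirect target from the example above",
                "+http://www.ikea.com/at/de/.*"));
        System.out.println(filter.filter("http://www.ikea.com/at/de/catalog"));
        System.out.println(filter.filter("http://www.example.org/"));
    }
}
```

Note that the dots in the example rule are unescaped, so they match any character; writing www\.ikea\.com would make the rule stricter.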


On 24.09.2012 10:45, Markus Jelsma wrote:
> Hi - You can use the domain url filter to manually whitelist domains.

RE: External domain redirection with db.ignore.external.links=true

Posted by Markus Jelsma <ma...@openindex.io>.

Hi - You can use the domain url filter to manually whitelist domains.  
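
As a rough illustration of what whitelisting by domain means here, the following Java sketch accepts a URL only if its host, or one of the host's parent domains, is on an allow list. The class name DomainWhitelistSketch is hypothetical and the matching logic is a simplification, not the code of Nutch's domain URL filter; in Nutch itself the list would normally live in the filter's configuration file rather than in code.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;
import java.util.Set;

/** Hypothetical domain allow-list check; not a Nutch class. */
class DomainWhitelistSketch {
    private final Set<String> allowed;

    DomainWhitelistSketch(Set<String> allowed) {
        this.allowed = allowed;
    }

    /** Returns the URL if its host or a parent domain is allowed, else null. */
    String filter(String url) {
        String host;
        try {
            host = new URI(url).getHost();
        } catch (URISyntaxException e) {
            return null; // unparsable URLs are rejected
        }
        if (host == null) {
            return null;
        }
        String candidate = host.toLowerCase(Locale.ROOT);
        while (true) {
            if (allowed.contains(candidate)) {
                return url;
            }
            int dot = candidate.indexOf('.');
            if (dot < 0) {
                return null; // ran out of parent domains without a match
            }
            candidate = candidate.substring(dot + 1); // strip the leftmost label
        }
    }

    public static void main(String[] args) {
        // Both the seed domain and its redirect target are listed.
        DomainWhitelistSketch filter =
                new DomainWhitelistSketch(Set.of("ikea.at", "ikea.com"));
        System.out.println(filter.filter("http://www.ikea.com/at/de/"));
        System.out.println(filter.filter("http://www.example.org/"));
    }
}
```

With this approach both the seed domain (ikea.at) and the redirect target (ikea.com) have to be listed by hand, which is why it only helps when the redirect targets are known in advance.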
 