Posted to user@nutch.apache.org by Patricia Helmich <pa...@hotmail.com> on 2018/04/20 10:31:42 UTC

No internet connection in Nutch crawler: Proxy configuration - PAC file

Hi,

I am using Nutch and it used to work fine. Now some network settings have changed and I have to use a proxy. In my browser, I specify the proxy by providing a PAC file under the option "Automatic proxy configuration URL". I was searching for a similar option in Nutch's conf/nutch-default.xml file. I do find some proxy options (http.proxy.host, http.proxy.port, http.proxy.username, http.proxy.password, http.proxy.realm), but none of them seems to be the one I am searching for.
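
For reference, those static options would go into conf/nutch-site.xml roughly like this (a minimal sketch; the host and port are hypothetical values you would copy out of the PAC file by hand, since these properties take a single fixed proxy rather than a PAC URL):

  <property>
    <name>http.proxy.host</name>
    <!-- hypothetical: the proxy host your PAC file resolves to -->
    <value>proxy.example.com</value>
  </property>
  <property>
    <name>http.proxy.port</name>
    <!-- hypothetical: the matching port -->
    <value>8080</value>
  </property>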

So, my question is: where can I specify the PAC file in the Nutch configurations for the proxy?

Thanks for your help,

Patricia

Re: Ignore external links but allow redirections to external websites

Posted by Semyon Semyonov <se...@mail.com>.
There is one more thing.

You can also do it outside of Nutch (that is what we do): create a program that validates the seed list URLs and saves the redirect targets as the input for Nutch.
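
A minimal sketch of such a pre-crawl resolver (the class name and file paths are hypothetical; it assumes plain HTTP redirects and Java 11+):

  import java.io.IOException;
  import java.net.HttpURLConnection;
  import java.net.URL;
  import java.nio.file.Files;
  import java.nio.file.Paths;
  import java.util.ArrayList;
  import java.util.List;

  public class SeedRedirectResolver {

      // Follows the redirect chain of one URL and returns the final target.
      static String resolve(String url, int maxHops) throws IOException {
          String current = url;
          for (int hop = 0; hop < maxHops; hop++) {
              HttpURLConnection conn =
                      (HttpURLConnection) new URL(current).openConnection();
              conn.setInstanceFollowRedirects(false); // inspect each hop ourselves
              conn.setRequestMethod("HEAD");
              int code = conn.getResponseCode();
              String location = conn.getHeaderField("Location");
              conn.disconnect();
              if (code < 300 || code >= 400 || location == null) {
                  return current; // not a redirect: this is the final URL
              }
              // Resolve relative Location headers against the current URL.
              current = new URL(new URL(current), location).toString();
          }
          return current; // give up after maxHops and keep the last URL seen
      }

      public static void main(String[] args) throws IOException {
          List<String> resolved = new ArrayList<>();
          for (String seed : Files.readAllLines(Paths.get("urls/seed.txt"))) {
              String s = seed.trim();
              if (!s.isEmpty()) {
                  resolved.add(resolve(s, 5));
              }
          }
          // Inject this file instead of the original seed list.
          Files.write(Paths.get("urls/seed_resolved.txt"), resolved);
      }
  }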
 

Sent: Monday, November 26, 2018 at 2:43 PM
From: "Semyon Semyonov" <se...@mail.com>
To: user@nutch.apache.org
Subject: Re: Ignore external links but allow redirections to external websites
Hi Patricia,

I wish I had a generic solution for this problem, but I managed to fix the http://www.abc.com -> http://abc.com problem with an extension of the URL exemption filter for both ways (www.abc.com -> abc.com and abc.com -> www.abc.com).
https://jira.apache.org/jira/browse/NUTCH-2522

You need to replicate this logic in the indexer if you want to have www.abc.com and abc.com under the same hostname.
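
For illustration, the core of that both-ways check can be sketched like this (hypothetical class and method names, not Nutch's actual filter API):

  import java.net.URL;

  public class HostEquivalence {

      // Treats "www.abc.com" and "abc.com" as the same host.
      static String normalizeHost(String host) {
          String h = host.toLowerCase();
          return h.startsWith("www.") ? h.substring(4) : h;
      }

      // True if both URLs point at the same site modulo a leading "www.".
      static boolean sameSite(String fromUrl, String toUrl) throws Exception {
          return normalizeHost(new URL(fromUrl).getHost())
                  .equals(normalizeHost(new URL(toUrl).getHost()));
      }

      public static void main(String[] args) throws Exception {
          System.out.println(sameSite("http://www.abc.com/", "http://abc.com/page")); // true
          System.out.println(sameSite("http://www.abc.com/", "http://xyz.com/"));     // false
      }
  }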
 
Semyon

  
 

Sent: Monday, November 26, 2018 at 12:19 PM
From: "Patricia Helmich" <pa...@hotmail.com>
To: "user@nutch.apache.org" <us...@nutch.apache.org>
Subject: Ignore external links but allow redirections to external websites
Hi,

I am using Nutch with a seed set of URLs and I want to crawl all internal links found on the crawled websites. The external links should be ignored in my crawler, so I set "db.ignore.external.links" in nutch-site.xml to "true". This works perfectly for ignoring external links. However, when a seed URL redirects to another URL, I want to crawl the redirected URL, even if it is external. For example, if I have a seed URL like http://www.abc.com and it redirects to http://abc.com, the crawl process stops because the domain without www is an external link. (If I set "db.ignore.external.links" in nutch-site.xml to "false", the crawl process does continue, but in that case it also crawls all external links on the site, which I don't want.)
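
For reference, the setting in question sits in conf/nutch-site.xml (a minimal sketch using the property discussed above):

  <property>
    <name>db.ignore.external.links</name>
    <value>true</value>
    <!-- true keeps the crawl on the seed hosts; redirects to other
         hosts are then treated as external and dropped -->
  </property>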

So, my question is: Is there a possibility to ignore external links but allow redirections to external websites?

Thanks for your help,
Patricia
 

Re: Ignore external links but allow redirections to external websites

Posted by Semyon Semyonov <se...@mail.com>.
Hi Patricia,

I wish I had a generic solution for this problem, but I managed to fix the http://www.abc.com -> http://abc.com problem with an extension of the URL exemption filter for both ways (www.abc.com -> abc.com and abc.com -> www.abc.com).
https://jira.apache.org/jira/browse/NUTCH-2522

You need to replicate this logic in the indexer if you want to have www.abc.com and abc.com under the same hostname.
 
Semyon

  
 

Sent: Monday, November 26, 2018 at 12:19 PM
From: "Patricia Helmich" <pa...@hotmail.com>
To: "user@nutch.apache.org" <us...@nutch.apache.org>
Subject: Ignore external links but allow redirections to external websites
Hi,

I am using Nutch with a seed set of URLs and I want to crawl all internal links found on the crawled websites. The external links should be ignored in my crawler, so I set "db.ignore.external.links" in nutch-site.xml to "true". This works perfectly for ignoring external links. However, when a seed URL redirects to another URL, I want to crawl the redirected URL, even if it is external. For example, if I have a seed URL like http://www.abc.com and it redirects to http://abc.com, the crawl process stops because the domain without www is an external link. (If I set "db.ignore.external.links" in nutch-site.xml to "false", the crawl process does continue, but in that case it also crawls all external links on the site, which I don't want.)

So, my question is: Is there a possibility to ignore external links but allow redirections to external websites?

Thanks for your help,
Patricia
 

Ignore external links but allow redirections to external websites

Posted by Patricia Helmich <pa...@hotmail.com>.
Hi,

I am using Nutch with a seed set of URLs and I want to crawl all internal links found on the crawled websites. The external links should be ignored in my crawler, so I set "db.ignore.external.links" in nutch-site.xml to "true". This works perfectly for ignoring external links. However, when a seed URL redirects to another URL, I want to crawl the redirected URL, even if it is external. For example, if I have a seed URL like http://www.abc.com and it redirects to http://abc.com, the crawl process stops because the domain without www is an external link. (If I set "db.ignore.external.links" in nutch-site.xml to "false", the crawl process does continue, but in that case it also crawls all external links on the site, which I don't want.)

So, my question is: Is there a possibility to ignore external links but allow redirections to external websites?

Thanks for your help,
Patricia