Posted to dev@nutch.apache.org by "Andrzej Bialecki (JIRA)" <ji...@apache.org> on 2006/05/13 01:19:08 UTC
[jira] Created: (NUTCH-268) Generator and lib-http use different
definitions of "unique host"
Generator and lib-http use different definitions of "unique host"
-----------------------------------------------------------------
Key: NUTCH-268
URL: http://issues.apache.org/jira/browse/NUTCH-268
Project: Nutch
Type: Bug
Versions: 0.8-dev
Reporter: Andrzej Bialecki
Assigned to: Andrzej Bialecki
Fix For: 0.8-dev
Generator uses the host name, as extracted from the URL, to enforce the maximum number of URLs per unique host (when generate.max.per.host is set > 0). This is meant to prevent fetchlists from becoming dominated by URLs coming from the same hosts, which in turn would clash with "politeness" rules.
However, the http plugins (lib-http HttpBase.blockAddr) don't use the host name; they use its IP address instead (explicitly doing a DNS lookup on the host name extracted from the URL). This leads to the following undesirable behavior:
* if a DNS name resolves to different IPs (round-robin balancing), then technically we are in violation of the "politeness" rules, because lib-http doesn't see this as a conflict and permits concurrent accesses to the same host name.
* if different DNS names resolve to the same IP address (very common: CNAMEs, subdomains, web hosting, etc.), then the purpose of generate.max.per.host is defeated, because lib-http will block more frequently than intended, leading to excessive numbers of "Exceeded http.max.delays" exceptions.
Proposed solution: synchronize Generator and lib-http in their interpretation of "unique host". Introduce a boolean property that instructs both Generator and lib-http to use the same definition of "unique host": either IP addresses or host names.
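To make the proposal concrete, here is a minimal sketch of what a shared "unique host" key could look like. It is not the actual Nutch code: the class name UniqueHostKey, the byIp flag, and the injected resolver are all hypothetical names chosen for illustration. The resolver is passed in as a function so the example runs without network access; a real implementation would presumably do a DNS lookup there.

```java
import java.net.URI;
import java.util.Locale;
import java.util.function.Function;

// Hypothetical helper, not part of Nutch: a single place that decides
// what a "unique host" is, so that both callers agree.
final class UniqueHostKey {
    private UniqueHostKey() {}

    /**
     * @param url      the URL being counted (Generator) or fetched (lib-http)
     * @param byIp     hypothetical boolean switch: true keys on the IP address,
     *                 false keys on the host name
     * @param resolver maps a host name to an IP string (assumed; stands in
     *                 for a real DNS lookup)
     */
    static String of(URI url, boolean byIp, Function<String, String> resolver) {
        // Host names are case-insensitive, so normalize before using as a key.
        String host = url.getHost().toLowerCase(Locale.ROOT);
        return byIp ? resolver.apply(host) : host;
    }
}
```

If both Generator and lib-http obtained their key from one such helper with the same flag value, two names that share an IP would be counted and blocked either as one host (byIp = true) or as two hosts (byIp = false) in both places, rather than differently in each.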
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
http://www.atlassian.com/software/jira
[jira] Closed: (NUTCH-268) Generator and lib-http use different
definitions of "unique host"
Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/NUTCH-268?page=all ]
Andrzej Bialecki closed NUTCH-268:
-----------------------------------
Resolution: Fixed
Fixed in rev. 406757.
[jira] Commented: (NUTCH-268) Generator and lib-http use different
definitions of "unique host"
Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/NUTCH-268?page=comments#action_12383327 ]
Andrzej Bialecki commented on NUTCH-268:
-----------------------------------------
I forgot to add: if we change Generator to use IP addresses, then we should warn users that running a local caching DNS server becomes practically mandatory. Otherwise Generator would be very slow, and it would also generate a lot of DNS traffic to external servers.
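To illustrate why caching matters here, the following sketch memoizes host-to-IP lookups in-process and counts how often the slow upstream lookup is actually hit. It is only an illustration under stated assumptions: CachingResolver and its upstream function are hypothetical names, the cache ignores DNS TTLs, and it is not a substitute for the local caching DNS server recommended above.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical illustration, not Nutch code: memoize host -> IP lookups.
final class CachingResolver implements Function<String, String> {
    private final Function<String, String> upstream; // the slow lookup (assumed to be a DNS query)
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final AtomicInteger upstreamCalls = new AtomicInteger();

    CachingResolver(Function<String, String> upstream) {
        this.upstream = upstream;
    }

    @Override
    public String apply(String host) {
        // computeIfAbsent invokes the upstream lookup only on a cache miss.
        return cache.computeIfAbsent(host, h -> {
            upstreamCalls.incrementAndGet();
            return upstream.apply(h);
        });
    }

    // Number of lookups that actually reached the upstream resolver.
    int upstreamCalls() {
        return upstreamCalls.get();
    }
}
```

Since a fetchlist typically contains many URLs per host, the number of upstream lookups with a cache scales with the number of distinct hosts rather than the number of URLs.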