Posted to dev@nutch.apache.org by "Otis Gospodnetic (JIRA)" <ji...@apache.org> on 2008/04/10 06:12:04 UTC
[jira] Created: (NUTCH-627) Minimize host address lookup
Minimize host address lookup
----------------------------
Key: NUTCH-627
URL: https://issues.apache.org/jira/browse/NUTCH-627
Project: Nutch
Issue Type: Improvement
Components: generator
Reporter: Otis Gospodnetic
Attachments: NUTCH-627.patch
The simple patch that I'm about to attach keeps track of hosts for which we have already reached the "max URLs per host" limit, as well as hosts whose hostname->IP lookup has already failed. For such hosts, further DNS lookups are skipped:
- there is no point in looking up a hostname yet again if we already have the max number of URLs for that host
- there is little point in attempting to look up a hostname yet again if the previous lookup already failed
In a simple test, this saved a few hundred thousand lookups for the first case and a few hundred lookups for the second case.
If nobody complains, I'll commit by the end of the week.
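For readers without the attachment at hand, the idea described above can be sketched roughly as follows. This is an illustrative sketch only, not the code from NUTCH-627.patch or from Generator.java: the class name HostLookupSketch, the Resolver interface, and the accept and lookupCount methods are all hypothetical, and the sketch assumes the per-host limit is counted per resolved IP address.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of the lookup-skipping idea; not the actual Nutch patch. */
class HostLookupSketch {

    /** Pluggable hostname-to-IP resolver, so the sketch can be exercised without real DNS. */
    interface Resolver {
        String resolve(String host) throws UnknownHostException;
    }

    private final int maxPerHost;
    private final Resolver resolver;
    // Number of URLs accepted so far, keyed by resolved IP (an assumption of this sketch).
    private final Map<String, Integer> urlsPerIp = new HashMap<>();
    // Hosts that already hit the limit: no need to resolve them again.
    private final Set<String> maxedHosts = new HashSet<>();
    // Hosts whose lookup already failed: little point in retrying.
    private final Set<String> failedHosts = new HashSet<>();
    private int lookups = 0;

    HostLookupSketch(int maxPerHost, Resolver resolver) {
        this.maxPerHost = maxPerHost;
        this.resolver = resolver;
    }

    /** Convenience constructor that resolves through the JDK. */
    HostLookupSketch(int maxPerHost) {
        this(maxPerHost, host -> InetAddress.getByName(host).getHostAddress());
    }

    /** Returns true if a URL on this host should be kept, false if it is skipped. */
    boolean accept(String host) {
        // Both shortcuts avoid a DNS lookup entirely.
        if (maxedHosts.contains(host) || failedHosts.contains(host)) {
            return false;
        }
        lookups++;
        String ip;
        try {
            ip = resolver.resolve(host);
        } catch (UnknownHostException e) {
            failedHosts.add(host); // remember the failure
            return false;
        }
        int count = urlsPerIp.getOrDefault(ip, 0);
        if (count >= maxPerHost) {
            maxedHosts.add(host); // another host already filled this IP's quota
            return false;
        }
        urlsPerIp.put(ip, count + 1);
        if (count + 1 >= maxPerHost) {
            maxedHosts.add(host); // limit reached: skip lookups from now on
        }
        return true;
    }

    /** Number of resolver calls actually made. */
    int lookupCount() {
        return lookups;
    }
}
```

With a limit of 2, calling accept four times for the same host performs only two lookups; the third and fourth calls are answered from the set of maxed hosts. A host whose lookup fails is resolved once and skipped thereafter.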
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Resolved: (NUTCH-627) Minimize host address lookup
Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Otis Gospodnetic resolved NUTCH-627.
------------------------------------
Resolution: Fixed
Thanks Otis.
Sending CHANGES.txt
Sending src/java/org/apache/nutch/crawl/Generator.java
Transmitting file data ..
Committed revision 734257.
[jira] Closed: (NUTCH-627) Minimize host address lookup
Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Doğacan Güney closed NUTCH-627.
-------------------------------
[jira] Commented: (NUTCH-627) Minimize host address lookup
Posted by "Hudson (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12663619#action_12663619 ]
Hudson commented on NUTCH-627:
------------------------------
Integrated in Nutch-trunk #692 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/692/])
- Minimize host address lookup while running generate
[jira] Assigned: (NUTCH-627) Minimize host address lookup
Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Otis Gospodnetic reassigned NUTCH-627:
--------------------------------------
Assignee: Otis Gospodnetic
[jira] Commented: (NUTCH-627) Minimize host address lookup
Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662474#action_12662474 ]
Andrzej Bialecki commented on NUTCH-627:
-----------------------------------------
Otis, is the patch already applied? If not, +1 from me.
Re: [jira] Updated: (NUTCH-627) Minimize host address lookup
Posted by Chris Mattmann <ch...@jpl.nasa.gov>.
On 4/10/08 8:25 AM, "Dennis Kubes" <ku...@apache.org> wrote:
>
>
> Andrzej Bialecki wrote:
>> Otis Gospodnetic (JIRA) wrote:
>>
>>>> If nobody complains, I'll commit by the end of the week.
>>
>> Hi Otis,
>>
>> Thanks for helping with Nutch - we are indeed very shorthanded at the
>> moment, and any help is appreciated, and doubly so that of a person who
>> can commit things ...
>>
>> However, on the formal side I think the Nutch team needs to vote you in
>> as a Nutch committer (even though svn allows you to commit directly) -
>> witness the recent situation with Grant. If you wish I can start a vote,
>> and I'm sure it will be positive, and we will have a clean situation
>> from the formal POV. Ok?
>>
> +1
>>
+1, as well.
Cheers,
Chris
______________________________________________
Chris Mattmann, Ph.D.
Chris.Mattmann@jpl.nasa.gov
Cognizant Development Engineer
Early Detection Research Network Project
_________________________________________________
Jet Propulsion Laboratory Pasadena, CA
Office: 171-266B Mailstop: 171-246
_______________________________________________________
Disclaimer: The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.
Re: [jira] Updated: (NUTCH-627) Minimize host address lookup
Posted by Dennis Kubes <ku...@apache.org>.
Andrzej Bialecki wrote:
> Otis Gospodnetic (JIRA) wrote:
>
>>> If nobody complains, I'll commit by the end of the week.
>
> Hi Otis,
>
> Thanks for helping with Nutch - we are indeed very shorthanded at the
> moment, and any help is appreciated, and doubly so that of a person who
> can commit things ...
>
> However, on the formal side I think the Nutch team needs to vote you in
> as a Nutch committer (even though svn allows you to commit directly) -
> witness the recent situation with Grant. If you wish I can start a vote,
> and I'm sure it will be positive, and we will have a clean situation
> from the formal POV. Ok?
>
+1
>
Re: [jira] Updated: (NUTCH-627) Minimize host address lookup
Posted by Andrzej Bialecki <ab...@getopt.org>.
Otis Gospodnetic (JIRA) wrote:
>> If nobody complains, I'll commit by the end of the week.
Hi Otis,
Thanks for helping with Nutch - we are indeed very shorthanded at the
moment, and any help is appreciated, and doubly so that of a person who
can commit things ...
However, on the formal side I think the Nutch team needs to vote you in
as a Nutch committer (even though svn allows you to commit directly) -
witness the recent situation with Grant. If you wish I can start a vote,
and I'm sure it will be positive, and we will have a clean situation
from the formal POV. Ok?
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
[jira] Updated: (NUTCH-627) Minimize host address lookup
Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Otis Gospodnetic updated NUTCH-627:
-----------------------------------
Attachment: NUTCH-627.patch