Posted to dev@nutch.apache.org by "Otis Gospodnetic (JIRA)" <ji...@apache.org> on 2008/04/10 06:12:04 UTC

[jira] Created: (NUTCH-627) Minimize host address lookup

Minimize host address lookup
----------------------------

                 Key: NUTCH-627
                 URL: https://issues.apache.org/jira/browse/NUTCH-627
             Project: Nutch
          Issue Type: Improvement
          Components: generator
            Reporter: Otis Gospodnetic
         Attachments: NUTCH-627.patch

The simple patch that I'm about to attach keeps track of hosts whose "max URLs per host" limit we already reached, as well as hosts whose hostname->IP lookup already failed.  For such hosts, further DNS lookups are skipped:
- there is no point in looking up a hostname yet again if we already have the max number of URLs for that host
- there is little point in attempting to look up a hostname yet again if the previous lookup already failed

In a simple test, this saved a few hundred thousand lookups for the first case and a few hundred lookups for the second case.
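
For illustration, the bookkeeping described above might look roughly like the Java sketch below. It is not the contents of NUTCH-627.patch: the class HostLookupGuard, its Resolver interface, and the accept() and lookupCount() methods are hypothetical names, and the sketch assumes the generator resolves each URL's hostname before selecting the URL.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper; all names are illustrative, not taken from the patch. */
final class HostLookupGuard {

  /** Resolver abstraction so the sketch can be exercised without real DNS traffic. */
  interface Resolver {
    InetAddress resolve(String host) throws UnknownHostException;
  }

  private final int maxPerHost;
  private final Resolver resolver;
  private final Map<String, Integer> urlCounts = new HashMap<>();
  private final Set<String> maxedHosts = new HashSet<>();   // hosts that hit the limit
  private final Set<String> failedHosts = new HashSet<>();  // hosts whose lookup failed
  private int lookups = 0;

  HostLookupGuard(int maxPerHost, Resolver resolver) {
    this.maxPerHost = maxPerHost;
    this.resolver = resolver;
  }

  /** Returns true if another URL on this host should be selected. */
  boolean accept(String host) {
    // Both cases from the description: no lookup for hosts that are
    // already full or that already failed to resolve.
    if (maxedHosts.contains(host) || failedHosts.contains(host)) {
      return false;
    }
    try {
      lookups++;
      resolver.resolve(host);
    } catch (UnknownHostException e) {
      failedHosts.add(host);  // remember the failure, skip this host from now on
      return false;
    }
    int count = urlCounts.merge(host, 1, Integer::sum);
    if (count >= maxPerHost) {
      maxedHosts.add(host);   // limit reached, later URLs need no lookup
    }
    return true;
  }

  /** Number of lookups actually attempted so far. */
  int lookupCount() {
    return lookups;
  }
}
```

In real use the resolver could simply be InetAddress::getByName. With a limit of 2, a third URL on the same host is rejected without a lookup, and a host that failed once is not resolved again for the rest of the run. That second rule also hides transient DNS failures, which is presumably why the description says "little point" rather than "no point".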

If nobody complains, I'll commit by the end of the week.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Resolved: (NUTCH-627) Minimize host address lookup

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Otis Gospodnetic resolved NUTCH-627.
------------------------------------

    Resolution: Fixed

Thanks Otis.
Sending        CHANGES.txt
Sending        src/java/org/apache/nutch/crawl/Generator.java
Transmitting file data ..
Committed revision 734257.





[jira] Closed: (NUTCH-627) Minimize host address lookup

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney closed NUTCH-627.
-------------------------------





[jira] Commented: (NUTCH-627) Minimize host address lookup

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12663619#action_12663619 ] 

Hudson commented on NUTCH-627:
------------------------------

Integrated in Nutch-trunk #692 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/692/])
     - Minimize host address lookup while running generate





[jira] Assigned: (NUTCH-627) Minimize host address lookup

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Otis Gospodnetic reassigned NUTCH-627:
--------------------------------------

    Assignee: Otis Gospodnetic




[jira] Commented: (NUTCH-627) Minimize host address lookup

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662474#action_12662474 ] 

Andrzej Bialecki commented on NUTCH-627:
----------------------------------------

Otis, is the patch already applied? If not, +1 from me.




Re: [jira] Updated: (NUTCH-627) Minimize host address lookup

Posted by Chris Mattmann <ch...@jpl.nasa.gov>.
On 4/10/08 8:25 AM, "Dennis Kubes" <ku...@apache.org> wrote:

> 
> 
> Andrzej Bialecki wrote:
>> Otis Gospodnetic (JIRA) wrote:
>> 
>>>> If nobody complains, I'll commit by the end of the week.
>> 
>> Hi Otis,
>> 
>> Thanks for helping with Nutch - we are indeed very shorthanded at the
>> moment, and any help is appreciated, and doubly so that of a person who
>> can commit things ...
>> 
>> However, on the formal side I think the Nutch team needs to vote you in
>> as a Nutch committer (even though svn allows you to commit directly) -
>> witness the recent situation with Grant. If you wish I can start a vote,
>> and I'm sure it will be positive, and we will have a clean situation
>> from the formal POV. Ok?
>> 
> +1
>> 

+1, as well.

Cheers,
 Chris


______________________________________________
Chris Mattmann, Ph.D.
Chris.Mattmann@jpl.nasa.gov
Cognizant Development Engineer
Early Detection Research Network Project
_________________________________________________
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                     Mailstop:  171-246
_______________________________________________________

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.



Re: [jira] Updated: (NUTCH-627) Minimize host address lookup

Posted by Dennis Kubes <ku...@apache.org>.

Andrzej Bialecki wrote:
> Otis Gospodnetic (JIRA) wrote:
> 
>>> If nobody complains, I'll commit by the end of the week.
> 
> Hi Otis,
> 
> Thanks for helping with Nutch - we are indeed very shorthanded at the 
> moment, and any help is appreciated, and doubly so that of a person who 
> can commit things ...
> 
> However, on the formal side I think the Nutch team needs to vote you in 
> as a Nutch committer (even though svn allows you to commit directly) - 
> witness the recent situation with Grant. If you wish I can start a vote, 
> and I'm sure it will be positive, and we will have a clean situation 
> from the formal POV. Ok?
> 
+1
> 

Re: [jira] Updated: (NUTCH-627) Minimize host address lookup

Posted by Andrzej Bialecki <ab...@getopt.org>.
Otis Gospodnetic (JIRA) wrote:

>> If nobody complains, I'll commit by the end of the week.

Hi Otis,

Thanks for helping with Nutch - we are indeed very shorthanded at the 
moment, and any help is appreciated, and doubly so that of a person who 
can commit things ...

However, on the formal side I think the Nutch team needs to vote you in 
as a Nutch committer (even though svn allows you to commit directly) - 
witness the recent situation with Grant. If you wish I can start a vote, 
and I'm sure it will be positive, and we will have a clean situation 
from the formal POV. Ok?


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


[jira] Updated: (NUTCH-627) Minimize host address lookup

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Otis Gospodnetic updated NUTCH-627:
-----------------------------------

    Attachment: NUTCH-627.patch

