Posted to dev@nutch.apache.org by "Ned Rockson (JIRA)" <ji...@apache.org> on 2007/10/26 03:33:51 UTC

[jira] Updated: (NUTCH-570) Improvement of URL Ordering in Generator.java

     [ https://issues.apache.org/jira/browse/NUTCH-570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ned Rockson updated NUTCH-570:
------------------------------

    Attachment: GeneratorDiff.out

This is an improvement that orders URLs so that two URLs from the same host are separated by every other URL (hashed to the same machine) that can be fetched in parallel.  It gives a major speedup over the former ordering, especially if generate.max.per.host is set to a reasonable value.

To run with the optimal ordering, the following property must be added to nutch-default.xml:

<property>
  <name>generate.optimal.url.ordering</name>
  <value>true</value>
  <description>Generates URLs in an optimal ordering for whole web fetching
  by placing webpages from the same host as far apart as possible in the
  generated output list.</description>
</property>
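
For illustration only, here is a minimal single-machine sketch of the ordering idea described above: group URLs by host, then emit them round-robin so that consecutive URLs from one host are separated by a URL from every other host that still has URLs left. This is not the attached patch, which does the work as map/reduce jobs. The class name HostInterleave and the methods hostOf and interleaveByHost are hypothetical names made up for this example.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not part of Nutch or of the attached patch.
class HostInterleave {

    // Returns the host of a URL, or the raw string if it cannot be parsed.
    static String hostOf(String url) {
        try {
            String host = URI.create(url).getHost();
            return host != null ? host : url;
        } catch (IllegalArgumentException e) {
            return url;
        }
    }

    // Groups URLs by host, then emits one URL per host per round until
    // every per-host queue is drained.
    static List<String> interleaveByHost(List<String> urls) {
        Map<String, Deque<String>> byHost = new LinkedHashMap<>();
        for (String url : urls) {
            byHost.computeIfAbsent(hostOf(url), h -> new ArrayDeque<>()).add(url);
        }
        List<String> ordered = new ArrayList<>(urls.size());
        while (!byHost.isEmpty()) {
            Iterator<Deque<String>> it = byHost.values().iterator();
            while (it.hasNext()) {
                Deque<String> queue = it.next();
                ordered.add(queue.poll());
                if (queue.isEmpty()) {
                    it.remove(); // this host has no URLs left
                }
            }
        }
        return ordered;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
            "http://a.com/1", "http://a.com/2", "http://a.com/3",
            "http://b.org/1", "http://c.net/1", "http://b.org/2");
        for (String url : interleaveByHost(urls)) {
            System.out.println(url);
        }
    }
}
```

The sketch keeps everything in memory and ignores partitioning across machines and generate.max.per.host; it is only meant to show why round-robin by host spreads same-host URLs further apart than a random shuffle tends to.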

> Improvement of URL Ordering in Generator.java
> ---------------------------------------------
>
>                 Key: NUTCH-570
>                 URL: https://issues.apache.org/jira/browse/NUTCH-570
>             Project: Nutch
>          Issue Type: Improvement
>          Components: generator
>            Reporter: Ned Rockson
>            Priority: Minor
>         Attachments: GeneratorDiff.out
>
>
> [Copied directly from my email to nutch-dev list]
> Recently I switched to Fetcher2 over Fetcher for larger whole web fetches (50-100M at a time).  I found that the ordering of the generated URLs is not optimal because they are simply randomized by a hash comparator.  In one crawl on 24 machines it took about 3 days to crawl 30M URLs.  Compared with old benchmarks I had set with the regular Fetcher.java, this took at least three times as long.
> Anyway, I realized that randomization can approach the best ordering, but to get an optimal ordering, URLs from the same host should be as far apart in the list as possible.  So I wrote a series of 2 map/reduces to optimize the ordering; for a list of 25M documents it takes about 10 minutes on our cluster.  Right now I have it in its own class, but I figure it can go in Generator.java, with a flag added to nutch-default.xml that determines whether the user wants to use it.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.