Posted to dev@nutch.apache.org by "Ned Rockson (JIRA)" <ji...@apache.org> on 2007/10/26 02:00:52 UTC

[jira] Created: (NUTCH-570) Improvement of URL Ordering in Generator.java

Improvement of URL Ordering in Generator.java
---------------------------------------------

                 Key: NUTCH-570
                 URL: https://issues.apache.org/jira/browse/NUTCH-570
             Project: Nutch
          Issue Type: Improvement
          Components: generator
            Reporter: Ned Rockson
            Priority: Minor


[Copied directly from my email to nutch-dev list]

Recently I switched from Fetcher to Fetcher2 for larger whole-web fetches (50-100M URLs at a time).  I found that the order of the generated URLs is not optimal because they are simply randomized by a hash comparator.  In one crawl on 24 machines it took about 3 days to crawl 30M URLs.  Compared with old benchmarks I had set with the regular Fetcher.java, this took at least three times as long.

Anyway, I realized that randomization can approach the best ordering, but to get an optimal ordering, URLs from the same host should be as far apart in the list as possible.  So I wrote a series of two map/reduce jobs to optimize the ordering; for a list of 25M documents it takes about 10 minutes on our cluster.  Right now I have it in its own class, but I figured it can go in Generator.java, with a flag in nutch-default.xml that determines whether the user wants to use it.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (NUTCH-570) Improvement of URL Ordering in Generator.java

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851461#action_12851461 ] 

Otis Gospodnetic commented on NUTCH-570:
----------------------------------------

Serykh, what does your version of the patch do differently? (maybe it's just an update so it applies to trunk?)

Julien, want to take this?


[jira] Commented: (NUTCH-570) Improvement of URL Ordering in Generator.java

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854767#action_12854767 ] 

Chris A. Mattmann commented on NUTCH-570:
-----------------------------------------

Hi Otis:

I think your logic is perfectly rational here. Maybe you could leave it open for another 48 hours, and then close it out if you don't get any feedback from the original reporter or from those who were interested.

Cheers,
Chris


[jira] Updated: (NUTCH-570) Improvement of URL Ordering in Generator.java

Posted by "Ned Rockson (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/NUTCH-570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ned Rockson updated NUTCH-570:
------------------------------

    Attachment: GeneratorDiff.out

This is an improvement to order URLs such that two URLs from the same host are separated by every other URL (hashed to the same machine) that can be fetched in parallel.  It causes a major speedup over the former ordering, especially if generate.max.per.host is set to a reasonable value.

This requires an addition to nutch-default.xml to get it to run using the optimal ordering:

<property>
  <name>generate.optimal.url.ordering</name>
  <value>true</value>
  <description>Generates URLs in an optimal ordering for whole web fetching
  by separating webpages from the same host by as far as possible in the
  generated output list.</description>
</property>
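
The ordering described above can be illustrated outside MapReduce. The following is a minimal single-machine sketch, not the code in GeneratorDiff.out: it groups URLs by host and emits one URL per host in round-robin passes, so that two URLs from the same host are separated by one URL from every other host that still has URLs left. The names HostInterleaver and interleaveByHost are hypothetical, and the host function is supplied by the caller.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper for illustration only; not part of Nutch or the patch.
final class HostInterleaver {
    private HostInterleaver() {}

    /** Reorders urls so that consecutive entries come from different hosts where possible. */
    static List<String> interleaveByHost(List<String> urls, Function<String, String> hostOf) {
        // Group URLs by host, keeping hosts in first-seen order.
        Map<String, Deque<String>> byHost = new LinkedHashMap<>();
        for (String url : urls) {
            byHost.computeIfAbsent(hostOf.apply(url), h -> new ArrayDeque<>()).add(url);
        }
        // Take one URL from each host per pass until every queue is drained.
        List<String> ordered = new ArrayList<>(urls.size());
        while (ordered.size() < urls.size()) {
            for (Deque<String> queue : byHost.values()) {
                String next = queue.poll();
                if (next != null) {
                    ordered.add(next);
                }
            }
        }
        return ordered;
    }

    public static void main(String[] args) {
        List<String> urls = List.of(
            "http://a.example/1", "http://a.example/2", "http://a.example/3",
            "http://b.example/1", "http://c.example/1");
        // The demo assumes well-formed URLs; URI.create throws on malformed input.
        interleaveByHost(urls, u -> URI.create(u).getHost()).forEach(System.out::println);
    }
}
```

In this sketch, once the smaller hosts run out, the remaining URLs of the largest host end up next to each other at the tail of the list. That may be one reason a per-host cap such as generate.max.per.host matters for the speedup.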

[jira] Resolved: (NUTCH-570) Improvement of URL Ordering in Generator.java

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/NUTCH-570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Otis Gospodnetic resolved NUTCH-570.
------------------------------------

    Resolution: Won't Fix

[jira] Updated: (NUTCH-570) Improvement of URL Ordering in Generator.java

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/NUTCH-570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Otis Gospodnetic updated NUTCH-570:
-----------------------------------

    Assignee: Otis Gospodnetic

Another nudge for feedback from Ned or anyone else who tried this.
I've been using this patch without any problems, though I have not verified that it works as advertised and that it really orders URLs in a more optimal way.

Anyone?


[jira] Updated: (NUTCH-570) Improvement of URL Ordering in Generator.java

Posted by "Serykh Evgeniy (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/NUTCH-570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Serykh Evgeniy updated NUTCH-570:
---------------------------------

    Attachment: GeneratorDiff_v1.out

[jira] Commented: (NUTCH-570) Improvement of URL Ordering in Generator.java

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854665#action_12854665 ] 

Otis Gospodnetic commented on NUTCH-570:
----------------------------------------

I'm tempted to close this issue as Won't Fix, because:
* I have no way to test and verify this
* nobody seems to be using this
* this issue has only 2 votes and only 3 watchers
* the original reporter mentioned he noticed only marginal speedups


[jira] Commented: (NUTCH-570) Improvement of URL Ordering in Generator.java

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12587786#action_12587786 ] 

Otis Gospodnetic commented on NUTCH-570:
----------------------------------------

Ned - are you still using this?  Still happy with it?  Did you make any updates to this since October 2007?

The logic of spreading URLs from the same host as far apart as possible makes sense to me, and I think it really is correct.  I tried this and nothing broke, though I have no easy way to verify that the patch indeed does what you described.  I think my fetchlists of fewer than 10K URLs are too small for the improvement from this approach to be visible.

Has anyone else tried this?

I think this is extra - not used anywhere and thus not needed:

+  public static final String UPDATE_WITHOUT_CHECKING_TIME = "crawl.check.time";
+  public static final String MAX_URLS_PER_HOST = "generate.max.per.host";

And in here we count on the host name in the URL already being normalized/lowercased elsewhere, so there is no need to lowercase it here, right?

+      try {
+        u = new URL(((Text)key).toString());
+        host = u.getHost();
+      } catch (MalformedURLException e) {
+        host = ((Text)key).toString();
+      }

And I see 2 LOG.warn calls that look like they might be left-overs from debugging-through-logging:

+      LOG.warn("Current key: " + ((IntWritable)key).toString());
+      LOG.warn("current output: " + currUrl.toString());

Those should be changed to LOG.debug or they should be removed.
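
On the host-name question above: if the host cannot be assumed to be lowercased upstream, the grouping key could be normalized locally. A minimal sketch under that assumption follows. It uses java.net.URI instead of the java.net.URL call in the patch fragment, and HostKey and hostOf are hypothetical names that do not appear in the patch.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

// Hypothetical helper for illustration only; not part of Nutch or the patch.
final class HostKey {
    private HostKey() {}

    /** Returns the lowercased host of url, or url itself when no host can be parsed. */
    static String hostOf(String url) {
        try {
            String host = new URI(url).getHost();
            // getHost() returns null when the URI has no host component.
            return host == null ? url : host.toLowerCase(Locale.ROOT);
        } catch (URISyntaxException e) {
            // Same fallback as the patch fragment: use the raw key.
            return url;
        }
    }

    public static void main(String[] args) {
        System.out.println(hostOf("http://Example.COM/index.html"));
        System.out.println(hostOf("not a url"));
    }
}
```

Lowercasing with Locale.ROOT avoids locale-specific case mappings, so the key does not depend on the default locale of the machine running the job.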


[jira] Updated: (NUTCH-570) Improvement of URL Ordering in Generator.java

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/NUTCH-570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki  updated NUTCH-570:
------------------------------------

    Patch Info: [Patch Available]

[jira] Commented: (NUTCH-570) Improvement of URL Ordering in Generator.java

Posted by "Ned Rockson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12598809#action_12598809 ] 

Ned Rockson commented on NUTCH-570:
-----------------------------------

Hi Otis,

I actually moved away from using Nutch a while ago.  I did use this
patch for at least two months and saw speedups (marginal as they may be)
over the pseudo-random ordering that was in place before.  Essentially,
with random ordering you will most likely get a good ordering for the
head, but with this ordering you are guaranteed to get a good ordering.





[jira] Commented: (NUTCH-570) Improvement of URL Ordering in Generator.java

Posted by "Dmitry Lihachev (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851710#action_12851710 ] 

Dmitry Lihachev commented on NUTCH-570:
---------------------------------------

Yeah, Otis. It's just an update so it applies to trunk.

[jira] Commented: (NUTCH-570) Improvement of URL Ordering in Generator.java

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851545#action_12851545 ] 

Julien Nioche commented on NUTCH-570:
-------------------------------------

> Julien, want to take this?

Not particularly. I am busy with short-term issues for 1.1, so feel free to take it if you have a particular interest in this.
I would be curious to see some figures on the improvements from this patch; my impression is that NUTCH-776 would be quicker to implement and maintain and might give similar gains.
