Posted to dev@nutch.apache.org by "Doug Cook (JIRA)" <ji...@apache.org> on 2006/11/04 00:07:17 UTC

[jira] Created: (NUTCH-396) mergesegs sorts URLs, making segments useless for subsequent fetch

mergesegs sorts URLs, making segments useless for subsequent fetch
------------------------------------------------------------------

                 Key: NUTCH-396
                 URL: http://issues.apache.org/jira/browse/NUTCH-396
             Project: Nutch
          Issue Type: Bug
          Components: generator
    Affects Versions: 0.8
         Environment: Mac OS X 10.4.7
            Reporter: Doug Cook
            Priority: Minor


Mergesegs leaves the output segment in URL-sorted order.

This is a problem if the segment was just generated and not yet fetched - the fetcher likes the URLs to be in essentially random order (sort by URL hash or similar). If I fetch a segment created by mergesegs, my performance is extremely poor since all URLs from a given host will be grouped together and the per-host delays kill me.

I have a local fix which I am using: map using a key of MD5(URL) + URL, then, during the reduce phase, chop the MD5 off the front to get the original URL. This is simple, has essentially random order, no problems with collisions, and seems to work nicely.
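
For illustration only, here is a minimal standalone sketch of the key scheme described above. It is not the actual patch: the class and method names are hypothetical, it uses a hex-encoded MD5 prefix as an assumption about the encoding, and it simulates the map, sort, and reduce steps with an in-memory sort rather than Hadoop.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper; not part of Nutch.
final class RandomizedKeySketch {
    // A hex-encoded MD5 digest is always 32 characters long.
    static final int MD5_HEX_LENGTH = 32;

    // "Map" side: build the sort key MD5(URL) + URL.
    static String toRandomizedKey(String url) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(url.getBytes(StandardCharsets.UTF_8));
            StringBuilder key = new StringBuilder(MD5_HEX_LENGTH + url.length());
            for (byte b : digest) {
                key.append(String.format("%02x", b & 0xff));
            }
            // Appending the URL keeps keys unique even if two digests collide.
            return key.append(url).toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // "Reduce" side: chop the fixed-length digest off the front.
    static String toOriginalUrl(String key) {
        return key.substring(MD5_HEX_LENGTH);
    }

    // Stand-in for the framework's sort: order URLs by their randomized keys.
    static List<String> orderByHash(List<String> urls) {
        List<String> keys = new ArrayList<>();
        for (String url : urls) {
            keys.add(toRandomizedKey(url));
        }
        Collections.sort(keys);
        List<String> ordered = new ArrayList<>();
        for (String key : keys) {
            ordered.add(toOriginalUrl(key));
        }
        return ordered;
    }
}
```

Because the digest prefix has a fixed length, recovering the URL is a plain substring operation, and the sort order depends on the hash rather than on the host name, so URLs from one host are no longer adjacent.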

The only thing I don't know is whether some other tool expects the sorted order (I would expect not, since generate does not produce sorted output). Right now my fix is an option (-randomize), but if no other tool requires sorted order, it's probably cleaner to make this the default behavior with no option.

Thoughts?


deep limitation

Posted by an...@orbita1.ru.
Does Nutch 0.7.2 have any "depth limitation"?

I added a few pages. I need to process these pages and all pages that are
located within 3 (for example) clicks of the added pages.

I think I have explained it clearly ;-)