Posted to dev@nutch.apache.org by "Markus Jelsma (JIRA)" <ji...@apache.org> on 2011/04/01 16:41:06 UTC

[jira] [Closed] (NUTCH-396) mergesegs sorts URLs, making segments useless for subsequent fetch

     [ https://issues.apache.org/jira/browse/NUTCH-396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma closed NUTCH-396.
-------------------------------

    Resolution: Won't Fix

> mergesegs sorts URLs, making segments useless for subsequent fetch
> ------------------------------------------------------------------
>
>                 Key: NUTCH-396
>                 URL: https://issues.apache.org/jira/browse/NUTCH-396
>             Project: Nutch
>          Issue Type: Bug
>          Components: generator
>    Affects Versions: 0.8
>         Environment: Mac OS X 10.4.7
>            Reporter: Doug Cook
>            Priority: Minor
>
> Mergesegs leaves the output segment in URL-sorted order.
> This is a problem if the segment was just generated and not yet fetched: the fetcher works best when the URLs are in essentially random order (sorted by URL hash or similar). If I fetch a segment created by mergesegs, performance is extremely poor because all URLs from a given host are grouped together and the per-host delays dominate.
> I have a local fix which I am using: map using a key of MD5(URL) + URL, then, during the reduce phase, chop the MD5 off the front to get the original URL. This is simple, yields an essentially random order, has no problems with collisions (the full URL stays in the key), and seems to work nicely.
> The only thing I don't know is whether or not there is some other tool expecting the sorted order (I would expect not, since generate does not produce this). Right now I have my fix as an option (-randomize), but if there is no other tool requiring sorted order, it's probably cleaner to just make this non-optional.
> Thoughts?
>  
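
The keying trick described in the report can be sketched roughly as follows. This is only an illustration in Python, not the reporter's patch and not Nutch's Java MapReduce code: the function names are hypothetical, and a plain sort stands in for the shuffle/sort that happens between map and reduce.

```python
import hashlib

MD5_HEX_LEN = 32  # length of an MD5 digest written as hex characters


def to_shuffle_key(url: str) -> str:
    """Map side (hypothetical helper): prefix the URL with its MD5 hex digest."""
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return digest + url


def from_shuffle_key(key: str) -> str:
    """Reduce side (hypothetical helper): drop the fixed-length digest prefix."""
    return key[MD5_HEX_LEN:]


def randomized_order(urls):
    """Order URLs by MD5(URL) + URL, then recover the original URLs.

    sorted() stands in for the framework's sort on the map output key.
    """
    keys = sorted(to_shuffle_key(u) for u in urls)
    return [from_shuffle_key(k) for k in keys]


if __name__ == "__main__":
    sample = [
        "http://a.example/1",
        "http://a.example/2",
        "http://b.example/1",
        "http://b.example/2",
    ]
    for url in randomized_order(sample):
        print(url)
```

Because the digest has a fixed length, stripping it is unambiguous, and because the full URL remains part of the key, two URLs whose digests happened to collide would still produce distinct keys. The resulting order is deterministic but unrelated to host, which is the property the fetcher needs.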

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira