Posted to dev@nutch.apache.org by "Markus Jelsma (JIRA)" <ji...@apache.org> on 2019/08/16 10:43:00 UTC

[jira] [Created] (NUTCH-2730) SitemapProcessor to treat sitemap URLs as Set instead of List

Markus Jelsma created NUTCH-2730:
------------------------------------

             Summary: SitemapProcessor to treat sitemap URLs as Set instead of List
                 Key: NUTCH-2730
                 URL: https://issues.apache.org/jira/browse/NUTCH-2730
             Project: Nutch
          Issue Type: Improvement
          Components: sitemap
    Affects Versions: 1.15
            Reporter: Markus Jelsma
            Assignee: Markus Jelsma
             Fix For: 1.16


https://archive.epa.gov/robots.txt lists 160k sitemap URLs, which is absurd! Almost 160k of them are duplicates; there are no friendly words to describe this astonishing fact.

Although our Nutch chews through this list in 22s when run locally, the big job on Hadoop fails for some weird reason, though that job is also working on a lot more than this one list.

Maybe this is a problem, maybe it is not. Either way, treating the sitemap URLs as a Set rather than a List makes sense.
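
To illustrate the idea, here is a minimal sketch of collapsing duplicate sitemap URLs by copying the list into a Set. It is not the actual SitemapProcessor change: the class name SitemapUrlDedup, the method dedupe, and the example.org URLs are made up for illustration, and LinkedHashSet is an assumption chosen only because it keeps first-seen order.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class SitemapUrlDedup {

  // Hypothetical helper, not part of Nutch: collapses duplicate sitemap URLs
  // while keeping the order in which each URL first appeared.
  public static Set<String> dedupe(List<String> sitemapUrls) {
    return new LinkedHashSet<>(sitemapUrls);
  }

  public static void main(String[] args) {
    // Made-up stand-in for the sitemap lines read from a robots.txt.
    List<String> fromRobotsTxt = Arrays.asList(
        "https://example.org/sitemap.xml",
        "https://example.org/sitemap.xml",
        "https://example.org/sitemap-news.xml",
        "https://example.org/sitemap.xml");

    Set<String> unique = dedupe(fromRobotsTxt);

    // Prints "4 listed, 2 unique"
    System.out.println(fromRobotsTxt.size() + " listed, " + unique.size() + " unique");
  }
}
```

With a list like the one above, each distinct sitemap would be fetched and parsed once instead of once per listing.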


