Posted to dev@nutch.apache.org by misc <mi...@robotgenius.net> on 2007/11/10 02:46:11 UTC

Generator speed

Hi all-

    The generate phase has always taken a lot of time for me, and I wanted to report on it here.  (Note: this is not the really bad problem I mentioned earlier, where it was running an order of magnitude slower still; that problem went away and I cannot reproduce it.)

    I have a crawldb with about 40 million entries, so I expect everything to be slow, but generate is now the slowest step, taking up to 3 hours to complete.  A Linux "sort -n" on a file with 40 million lines finishes in about 20 minutes, and selecting the top-scoring urls is essentially what generate is doing.  In fact I think we can do better than a full sort, which is O(n log n): by scanning the list of urls once and only inserting into the topN list when an entry scores above the current cutoff, the cost is roughly n log(topN), which is near n log n when topN approaches the database size and close to plain n when topN is small compared to it (see the sketch below).  Shouldn't generate be able to go faster than "sort -n"?
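    To make the idea concrete, here is a minimal single-machine sketch of the bounded-heap selection I mean.  This is not how Nutch's Generator (a Hadoop MapReduce job) is actually implemented; the Entry class and the selectTopN method are hypothetical placeholders standing in for crawldb records, just to show the n log(topN) scan:

import java.util.PriorityQueue;

/**
 * Sketch of bounded-heap topN selection: scan every url/score pair once and
 * keep only the topN highest-scoring entries.  Cost is O(n log topN) rather
 * than the O(n log n) of a full sort.
 */
public class TopNSelector {

  /** Hypothetical url/score pair standing in for a crawldb record. */
  public static class Entry {
    final String url;
    final float score;
    Entry(String url, float score) { this.url = url; this.score = score; }
  }

  public static PriorityQueue<Entry> selectTopN(Iterable<Entry> crawlDb, int topN) {
    // Min-heap ordered by score: the lowest-scoring kept entry sits on top,
    // so it acts as the current cutoff.
    PriorityQueue<Entry> heap =
        new PriorityQueue<Entry>(topN, (a, b) -> Float.compare(a.score, b.score));

    for (Entry e : crawlDb) {
      if (heap.size() < topN) {
        heap.offer(e);                        // still filling up to topN
      } else if (e.score > heap.peek().score) {
        heap.poll();                          // drop the current cutoff entry
        heap.offer(e);                        // and keep the better one
      }
      // else: below the cutoff, skipped in O(1)
    }
    return heap;
  }
}

    When topN is small relative to the 40 million entries, most records fall below the cutoff and are skipped after a single comparison, which is why the scan behaves close to linear in practice.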

    Am I missing something?

                        see you
                            -Jim