Posted to user@nutch.apache.org by Georg Ochsner <g....@revolistic.com> on 2007/10/12 09:35:57 UTC

fast crawler / 100 million pages

Hello list members,

I am looking for a solution to crawl about 100 million internet pages with a
(focused) crawler. The crawler should be able to filter URLs with regular
expressions and to enforce a depth limit per domain (no real need for
sophisticated "topic" intelligence). The goal is to build an index in a
database (e.g. MySQL).
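To make the two requirements concrete, here is a minimal sketch of what I mean by regex URL filtering plus a per-domain depth limit. This is not code from any of the crawlers below; it is a toy breadth-first walk over an in-memory link graph (a dict standing in for real fetching), and the rule that the depth counter resets when a link crosses to a new domain is just one possible interpretation of "depth limit for each domain":

```python
import re
from collections import deque
from urllib.parse import urlparse

def crawl(seeds, links, accept=None, reject=None, max_depth=3):
    """Toy focused crawl: `links` maps URL -> outlinks (stands in for fetching).

    A URL is followed only if it matches some `accept` pattern, matches no
    `reject` pattern, and its depth within the current domain is <= max_depth.
    """
    accept = [re.compile(p) for p in (accept or [r".*"])]
    reject = [re.compile(p) for p in (reject or [])]

    def allowed(url):
        return (not any(r.search(url) for r in reject)
                and any(a.search(url) for a in accept))

    seen, order = set(), []
    queue = deque((url, 0) for url in seeds)
    while queue:
        url, depth = queue.popleft()
        if url in seen or depth > max_depth or not allowed(url):
            continue
        seen.add(url)
        order.append(url)
        domain = urlparse(url).netloc
        for out in links.get(url, []):
            # Reset the depth counter when a link crosses to a new domain,
            # so max_depth bounds the crawl depth within each domain.
            next_depth = 0 if urlparse(out).netloc != domain else depth + 1
            queue.append((out, next_depth))
    return order
```

In Nutch, for comparison, the regex part is handled declaratively via the regex-urlfilter configuration rather than in code.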

- Which crawler would be the fastest on a single Debian machine (AMD Opteron
1212 HE, Debian Etch, 2 GB RAM)? I have read about the following crawlers.
Which of them would be fastest for my purpose, or are there better
alternatives?

iVia Data Fountains
Nutch
Combine
DataparkSearch
Terrier
Sherlock Holmes


- Would having only 1 GB of RAM noticeably affect crawl speed?


Thank you very much!
Georg