Posted to user@nutch.apache.org by Webmaster <we...@axismedia.ca> on 2008/10/07 07:13:29 UTC

Extensive web crawl

Ok..

So I want to index the web..  All of it..

Any thoughts on how to automate this so I can just point the spider off on
its merry way and have it return 20 billion pages?
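(For context, a sketch of the whole-web loop I'm trying to automate, assuming
0.9-style commands -- the directory names (urls, crawl/crawldb, crawl/segments)
and the round/topN numbers here are placeholders, not my actual settings:

    bin/nutch inject crawl/crawldb urls            # seed the crawldb once

    for i in 1 2 3 4 5; do                         # one round per crawl depth
      bin/nutch generate crawl/crawldb crawl/segments -topN 100000
      segment=`ls -d crawl/segments/2* | tail -1`  # newest segment just generated
      bin/nutch fetch $segment
      bin/nutch updatedb crawl/crawldb $segment    # add newly discovered links
    done
)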

So far I've been injecting random portions of the DMOZ mixed with other URLs
like directory.yahoo.com and wiki.org.  I was hoping this would give me a
good return with an unrestricted URL filter, where MY.DOMAIN.COM was replaced
with *.* --  Perhaps this is my error, and that line should be left as-is
while the last line is changed to +. instead of -. ?
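(For concreteness, here's roughly what I think an unrestricted
crawl-urlfilter.txt would look like -- a sketch based on the stock 0.9-era
file, trimmed, so the exact default rules may differ:

    # conf/crawl-urlfilter.txt -- sketch of an unrestricted filter
    # skip file:, ftp:, and mailto: urls
    -^(file|ftp|mailto):
    # skip image and other binary suffixes we can't parse
    -\.(gif|GIF|jpg|JPG|png|PNG|ico|css|zip|ppt|mpg|gz|rpm|tgz|mov|MOV|exe)$
    # skip URLs containing certain characters as probable queries
    -[?*!@=]
    # the per-domain line (+^http://([a-z0-9]*\.)*MY.DOMAIN.COM/) is removed
    # entirely, and the final catch-all is flipped from -. to +.
    +.
)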

Anyhow, after injecting 2000 URLs and a few of my own, I still only get back
minimal results, in the range of 500k to 600k URLs.
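(Thinking out loud: if generate's -topN was left at something like 100,000
and the crawl ran 5 rounds, the fetch ceiling would be 5 x 100,000 = 500,000
pages no matter how many seeds were injected -- which would line up with the
500k to 600k I'm seeing.  Those numbers are guesses on my part, not measured.)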

Right now I have a new crawl going with 1 million URLs injected from the
DMOZ; I'm thinking this should return an index of at least 20 million
pages..  No?
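(By the same logic, reaching 20 million pages would need topN times the
number of rounds to multiply out to 20M -- e.g. 4 rounds at -topN 5000000,
where both numbers are purely illustrative:

    bin/nutch generate crawl/crawldb crawl/segments -topN 5000000
    # 4 rounds x 5,000,000 per round => up to 20M fetched pages
)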

Anyhow..  I have more HD space on the way and would like to get the index up
to 1 billion pages by the end of the week..

Any examples of how to set up crawl-urlfilter.txt and regex-urlfilter.txt
would be helpful..

Thanks..

Axel..