Posted to dev@lucene.apache.org by Clemens Marschner <cm...@lanlab.de> on 2002/04/24 21:47:25 UTC

Web Crawler

Hi,

I have been writing a web crawler in Java for quite some time now. Since
Lucene doesn't contain one itself, I wonder whether you would be interested
in having it contributed within the Lucene project.

I would probably call it a 0.4. It has quite a modular design, it's
multithreaded, and it's still pretty simple.

It's also optimized for speed. I spent some time with a profiler to make the
beast FAST and keep memory consumption low. It contains an optimized HTML
parser that extracts just the necessary information and wastes neither time
nor objects.
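
To give a rough idea of the kind of extraction meant here, the following is
an illustration only (not the actual parser code): a single pass over the
page text that picks out quoted href attribute values without building a
parse tree or allocating more than the result list.

    import java.util.ArrayList;
    import java.util.List;

    // Illustration only -- not the crawler's actual parser. A single scan
    // that collects quoted href values without building a DOM.
    public class HrefScanner {

        public static List<String> extractHrefs(String html) {
            List<String> links = new ArrayList<String>();
            String lower = html.toLowerCase();
            int pos = 0;
            while ((pos = lower.indexOf("href=", pos)) != -1) {
                pos += 5;                        // skip past "href="
                if (pos >= lower.length()) {
                    break;
                }
                char quote = lower.charAt(pos);
                if (quote == '"' || quote == '\'') {
                    int end = lower.indexOf(quote, pos + 1);
                    if (end == -1) {
                        break;
                    }
                    links.add(html.substring(pos + 1, end));
                    pos = end + 1;
                }
            }
            return links;
        }
    }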

I was able to reach a maximum of 3.7 MB/sec on a 100 Mbit line in a MAN-style
network (a university campus with about 150 web servers).

Its only purpose is to crawl documents and links and store them somewhere.
Nothing is done with the documents themselves (it would be easy to add
processing steps, though that would probably shift the balance between I/O
and CPU usage until one of them becomes a bottleneck). A connection to the
Lucene engine has yet to be provided.
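
Such a connection would probably be little more than a small sink class that
hands each fetched page to an IndexWriter. The sketch below shows what that
glue might look like; the class, the field names, and the index directory are
made up for illustration, and nothing like it exists in the crawler yet.

    import java.io.IOException;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    // Hypothetical glue code -- not part of the crawler. One way a fetched
    // page could be handed to Lucene for indexing.
    public class LuceneSink {

        private final IndexWriter writer;

        public LuceneSink(String indexDir) throws IOException {
            // true = create a new index in indexDir
            writer = new IndexWriter(indexDir, new StandardAnalyzer(), true);
        }

        // Called once per crawled page; the field names are just an example.
        public void store(String url, String title, String body)
                throws IOException {
            Document doc = new Document();
            doc.add(Field.Keyword("url", url));   // stored, indexed, not tokenized
            doc.add(Field.Text("title", title));  // stored, indexed, tokenized
            doc.add(Field.Text("body", body));    // stored, indexed, tokenized
            writer.addDocument(doc);
        }

        public void close() throws IOException {
            writer.close();
        }
    }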

I have also optimized RAM usage a lot, but some data structures are still
kept in main memory (notably the hash of visited URLs), which limits the
number of files that can be crawled.
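
Conceptually, that structure is no more than the following (a simplified
illustration, not the actual implementation): a set of normalized URLs that
every candidate link is checked against, and that grows for the lifetime of
the crawl.

    import java.util.Collections;
    import java.util.HashSet;
    import java.util.Set;

    // Simplified illustration -- not the crawler's actual code. The set grows
    // with every URL seen, so memory use puts a ceiling on the crawl size.
    public class VisitedUrls {

        private final Set<String> seen =
                Collections.synchronizedSet(new HashSet<String>());

        // Returns true the first time a URL is offered, false on repeats.
        public boolean markNew(String normalizedUrl) {
            return seen.add(normalizedUrl);
        }
    }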

Since it's not a production release yet, it still has some limitations. Some
work remains to be done, I still have a lot of ideas, and most of the
configuration is still done in the Java source code (at least it is
concentrated in the main() method). Since I have only used it myself, this
has been fine so far.

Cheers,

Clemens Marschner

