Posted to dev@lucene.apache.org by Clemens Marschner <cm...@lanlab.de> on 2002/09/20 15:57:38 UTC

Re: your crawler

>----- Original Message -----
>From: Halácsy Péter
>To: cmad@lanlab.de
>Sent: Friday, September 20, 2002 12:10 PM
>Subject: your crawler
>
>
>BTW what is the status of the LARM crawler? 2 months ago I promised I could
>help from September because I would be a PhD student at Budapest University
>of Technology. Did you choose Avalon as a component framework?


I'm in the last days of my master's thesis. I will get back to the crawler
after Oct. 2nd (and a week of vacation on Garda's beautiful lakeside).

Otis has played around with the crawler in the last two weeks, and we had
long email conversations. We have found some problems one has to cope with.
For example, LARM has a relatively high memory overhead per server (I
mentioned it was made for large intranets). Otis ran out of his 100 MB of RAM
after crawling about 40,000 URLs in the .hr domain.
I myself have crawled 500,000 files from 500 servers with about 400 MB
of main memory (by the way, that only takes about 2-3 hours [but imposes
some load on the servers...]).

We have talked about how the more or less linearly rising memory consumption
could be controlled. Two components use up memory: the URLVisitedFilter,
which at this time simply holds a HashMap of already visited URLs; and the
FetcherTaskQueue, which holds a CachingQueue with crawling tasks for each
server. The CachingQueue itself holds up to two blocks of the queue in RAM,
so memory use may rise fast if the number of servers rises (look at the
Javadoc, I recall it's well documented).
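
To illustrate why a per-server queue that keeps two blocks in RAM still adds
up across many servers, here is a minimal sketch of such a queue. It is not
LARM's CachingQueue: the class name TwoBlockQueue, the block size parameter,
and the choice of temp files holding one task per line are all assumptions
made for the example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical FIFO queue that keeps only a head and a tail block in RAM. */
final class TwoBlockQueue {
    private final int blockSize;
    private Deque<String> head = new ArrayDeque<>();   // block being read
    private Deque<String> tail = new ArrayDeque<>();   // block being filled
    private final Deque<Path> spilledBlocks = new ArrayDeque<>(); // full blocks on disk, oldest first

    TwoBlockQueue(int blockSize) {
        if (blockSize < 1) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        this.blockSize = blockSize;
    }

    /** Appends a task; a full tail block is written to a temp file. */
    void insert(String task) {
        tail.addLast(task);
        if (tail.size() == blockSize) {
            spillTail();
        }
    }

    /** Returns the oldest task, or null if the queue is empty. */
    String remove() {
        if (head.isEmpty()) {
            refillHead();
        }
        return head.pollFirst();
    }

    /** Number of tasks currently held in RAM (fewer than two blocks' worth). */
    int inMemory() {
        return head.size() + tail.size();
    }

    private void spillTail() {
        try {
            Path file = Files.createTempFile("queue-block", ".txt");
            Files.write(file, tail);            // one task per line
            spilledBlocks.addLast(file);
            tail = new ArrayDeque<>();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private void refillHead() {
        Path oldest = spilledBlocks.pollFirst();
        if (oldest == null) {                   // nothing on disk: read from the tail block
            head = tail;
            tail = new ArrayDeque<>();
            return;
        }
        try {
            head = new ArrayDeque<>(Files.readAllLines(oldest));
            Files.delete(oldest);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With one such queue per server, the worst case in RAM is roughly two blocks
times the number of servers, which is the growth described above.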

We thought about controlling this by a) compressing the visitedFilter's
contents, b) taking advantage of some locality property of URL distributions
(making it possible to move some of the URLs to secondary storage) and c)
binding a server to only one thread, minimizing the need for synchronization
(and providing more possibilities to move the tasks out of the RAM). a) can
be accomplished by compressing the sorted list of URLs (there are papers
about that on CiteSeer). Incoming URLs would have to be divided into blocks
(e.g. per server) and, when a time/space threshold is reached, the block is
compressed. I have done a little work on that already, although my
implementation only works in batch mode, not incrementally.
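
One common way to compress a sorted list of URLs is front coding, where each
entry stores only the number of leading characters it shares with its
predecessor plus the remaining suffix. The sketch below shows that technique
in batch mode; it is an assumption about what such a compressor could look
like, not the implementation mentioned above, and the class name
FrontCodedUrlBlock is made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical batch-mode front coding of one sorted block of URLs. */
final class FrontCodedUrlBlock {
    private final int[] sharedPrefixLengths; // chars shared with the previous URL
    private final String[] suffixes;         // the part that differs

    private FrontCodedUrlBlock(int[] sharedPrefixLengths, String[] suffixes) {
        this.sharedPrefixLengths = sharedPrefixLengths;
        this.suffixes = suffixes;
    }

    /** Compresses a block; the caller must pass the URLs already sorted. */
    static FrontCodedUrlBlock compress(List<String> sortedUrls) {
        int n = sortedUrls.size();
        int[] shared = new int[n];
        String[] rest = new String[n];
        String previous = "";
        for (int i = 0; i < n; i++) {
            String url = sortedUrls.get(i);
            int common = commonPrefixLength(previous, url);
            shared[i] = common;
            rest[i] = url.substring(common);
            previous = url;
        }
        return new FrontCodedUrlBlock(shared, rest);
    }

    /** Rebuilds the original sorted list. */
    List<String> decompress() {
        List<String> urls = new ArrayList<>(suffixes.length);
        String previous = "";
        for (int i = 0; i < suffixes.length; i++) {
            String url = previous.substring(0, sharedPrefixLengths[i]) + suffixes[i];
            urls.add(url);
            previous = url;
        }
        return urls;
    }

    /** Characters actually stored, as a rough measure of the saving. */
    int storedChars() {
        int total = 0;
        for (String suffix : suffixes) {
            total += suffix.length();
        }
        return total;
    }

    private static int commonPrefixLength(String a, String b) {
        int limit = Math.min(a.length(), b.length());
        int i = 0;
        while (i < limit && a.charAt(i) == b.charAt(i)) {
            i++;
        }
        return i;
    }
}
```

URLs from the same server share long prefixes, which is why blocking per
server suits this scheme. A lookup needs a sequential decode of the block, so
blocks would have to stay small or be searched rarely.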

Finally, the LuceneStorage is far from being optimized, and is a major
bottleneck. We thought about separating the crawling from the indexing
process.

btw: Has anybody used a profiler with the Lucene indexing part? I suppose
there is still a lot to optimize there.

Regarding Avalon: I haven't had the time to look at it thoroughly. Mehran
Mehr wanted to do that, but I haven't heard anything from him for weeks now.
Probably he wants to present us with the perfect solution very soon...

What I have done is try the Jakarta BeanUtils for loading the config files.
It works quite simply (just a few lines of code, very straightforward), but
then the check for mandatory parameters etc. would have to be done by hand
afterwards, something I would expect an XML reader to get from an xsd file
or something, at least optionally.
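
The "by hand" check for mandatory parameters could be as small as the sketch
below. It assumes the configuration has already been loaded into a
java.util.Properties object rather than through BeanUtils, and the class
name and the parameter keys in the usage note are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

/** Hypothetical post-load check for mandatory configuration parameters. */
final class MandatoryParams {
    private MandatoryParams() {
    }

    /** Returns every required key that is absent or blank, in the order given. */
    static List<String> missing(Properties config, String... requiredKeys) {
        List<String> missing = new ArrayList<>();
        for (String key : requiredKeys) {
            String value = config.getProperty(key);
            if (value == null || value.trim().isEmpty()) {
                missing.add(key);
            }
        }
        return missing;
    }
}
```

A caller would run something like MandatoryParams.missing(config,
"fetcher.threads", "storage.dir") right after loading and refuse to start if
the returned list is not empty. Collecting all missing keys at once avoids
fixing the config file one error at a time.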

Back to my 15 hour day... :-|

--Clemens



