Posted to dev@lucene.apache.org by Christian Schrader <sc...@evendi.de> on 2002/07/29 12:37:38 UTC
Re: LARM Web Crawler: LuceneStorage [experimental]

Has anybody been able to access the LuceneStorage data?
It seems to me that the Lucene index has to be closed before another
program can read it. The close would have to happen after the last crawler
thread has finished. I am not an experienced thread programmer, so maybe
someone has an idea how to do that.
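
To make the question concrete, here is a minimal sketch of "close once the
last thread has finished", assuming a fixed set of worker threads that can
simply be joined. DummyStorage and crawlAndClose are hypothetical names used
only for illustration, not LARM or Lucene classes, and the sketch uses Java 8
syntax. A real crawler whose threads pull from a shared queue would also need
a way to decide that the queue is permanently empty.

```java
import java.io.Closeable;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

public class CloseAfterCrawl {

    /** Hypothetical stand-in for a storage that must be closed exactly once. */
    static class DummyStorage implements Closeable {
        final AtomicInteger stored = new AtomicInteger();
        final AtomicBoolean closed = new AtomicBoolean(false);

        void store(String doc) {
            if (closed.get()) {
                throw new IllegalStateException("storage already closed");
            }
            stored.incrementAndGet();
        }

        @Override
        public void close() {
            closed.set(true);
        }
    }

    /** Starts the workers, waits for every one of them, then closes the storage. */
    static DummyStorage crawlAndClose(int numThreads, int docsPerThread) {
        DummyStorage storage = new DummyStorage();
        List<Thread> workers = new ArrayList<>();
        for (int i = 0; i < numThreads; i++) {
            final int id = i;
            Thread t = new Thread(() -> {
                for (int d = 0; d < docsPerThread; d++) {
                    storage.store("doc-" + id + "-" + d);
                }
            });
            workers.add(t);
            t.start();
        }
        try {
            // join() returns only after the given thread has terminated.
            for (Thread t : workers) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        } finally {
            // Runs after the last join, or on interruption, so the storage is always closed.
            storage.close();
        }
        return storage;
    }

    public static void main(String[] args) {
        DummyStorage s = crawlAndClose(5, 100);
        System.out.println("stored=" + s.stored.get() + " closed=" + s.closed.get());
    }
}
```

In a real storage, close() would be the place to close the underlying index
writer, so that a reader in another process sees a complete index.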

Chris
> -----Original Message-----
> From: Otis Gospodnetic [mailto:otis_gospodnetic@yahoo.com]
> Sent: 18 June 2002 23:55
> To: Lucene Developers List; Clemens Marschner
> Subject: Re: LARM Web Crawler: LuceneStorage [experimental]
>
>
> I see nice progress here.
> I will try it in the near future (time!).
>
> > I have added an experimental version of a LuceneStorage to the LARM
> > crawler,
> > available from CVS in lucene-sandbox. That means crawled documents
> > can now directly be indexed into a lucene index.
> >
> >
> >
> > Sorry, no configuration files yet. Config is done in
> > ...larm/FetcherMain.java
> > The main class FetcherMain is now configured to store the contents in
> > a lucene index called "luceneIndex".
> >
> >
> > Lots of open questions:
> > - LARM doesn't have the notion of closing everything down. What
> > happens if IndexWriter is interrupted?
>
> As in, what if it encounters an exception (e.g. somebody removes the
> index directory)?  I guess one of the items that should then get
> added to the to-do list is checkpointing, for starters.
>
> > - I haven't tried to read from the index yet...
>
> Heh, I'm familiar with that situation.
>
> > - How to configure the stuff from a config file
> > ... (it's late)
>
> Property file with name=value pairs and some init() method that is
> called at the beginning may be sufficient.
>
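
The property-file idea above could look roughly like the following sketch.
It assumes the keys start, restrictto and threads (borrowed from the -D
options used further down in this thread) and a hypothetical CrawlerConfig
class with an init() method; none of this is existing LARM code.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

public class CrawlerConfig {
    final String startUrl;
    final String restrictTo;
    final int threads;

    private CrawlerConfig(String startUrl, String restrictTo, int threads) {
        this.startUrl = startUrl;
        this.restrictTo = restrictTo;
        this.threads = threads;
    }

    /** Called once at startup; validates the name=value pairs and applies defaults. */
    static CrawlerConfig init(Properties props) {
        String start = props.getProperty("start");
        if (start == null) {
            throw new IllegalArgumentException("missing required property: start");
        }
        // Assumed defaults: no URL restriction and a single thread.
        String restrict = props.getProperty("restrictto", ".*");
        int threads = Integer.parseInt(props.getProperty("threads", "1").trim());
        return new CrawlerConfig(start.trim(), restrict.trim(), threads);
    }

    /** Convenience for the example: parse property-file text held in a string. */
    static CrawlerConfig fromString(String text) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return init(p);
    }

    public static void main(String[] args) {
        CrawlerConfig c = fromString(
            "start=http://localhost/\n"
          + "restrictto=http://localhost.*\n"
          + "threads=5\n");
        System.out.println(c.startUrl + " " + c.restrictTo + " " + c.threads);
    }
}
```

Reading from a file instead of a string only changes the Reader passed to
Properties.load(). Note that backslashes in a regular expression must be
doubled in a property file, because the format treats them as escapes.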
> > Please try it:
> >
> > To build and run it,
> > - put ANT in your path
> > - provide a build.properties with the location of the lucene Jar file
> >   (lucene.jar=), just like javacc in lucene/build.xml
> > - put HTTPClient.jar from http://innovation.ch/java and the jakarta-oro
> >   library into libs
> > - type:
> >
> > ant run -Dstart=<starturl> -Drestrictto=<restricttourl> -Dthreads=<numThreads>
> >
> > ex.:
> > ant run -Dstart=http://localhost/ -Drestrictto=http://localhost.* -Dthreads=5
> >
> > note: restrictto is a regular expression; the URLs tested against it
> > are normalized beforehand, which means they are converted to lower
> > case, index.* is removed, and some other corrections are applied
> > (see URLNormalizer.java for details)
>
> Removing index.* may be too bold and incorrect in some situations.
>
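
To illustrate the restrictto note above, here is a rough sketch of
"normalize, then test against the regular expression". It is not the code
in URLNormalizer.java: it covers only the two steps named in the note
(lower-casing and dropping a trailing index.* file name), it uses
java.util.regex rather than jakarta-oro, and it assumes the expression has
to match the whole URL. RestrictToSketch and its methods are hypothetical
names.

```java
import java.util.Locale;
import java.util.regex.Pattern;

public class RestrictToSketch {

    // A trailing "/index.<something>" with no query string or fragment after it.
    private static final Pattern INDEX_FILE = Pattern.compile("/index\\.[^/?#]*$");

    /** Lower-cases the URL and replaces a trailing index.* file name with "/". */
    static String normalize(String url) {
        String lower = url.toLowerCase(Locale.ROOT);
        return INDEX_FILE.matcher(lower).replaceFirst("/");
    }

    /** True if the normalized URL matches the restrictto expression as a whole. */
    static boolean isAllowed(String url, Pattern restrictTo) {
        return restrictTo.matcher(normalize(url)).matches();
    }

    public static void main(String[] args) {
        Pattern restrictTo = Pattern.compile("http://localhost.*");
        System.out.println(normalize("HTTP://LocalHost/Docs/Index.html"));
        System.out.println(isAllowed("http://LOCALHOST/a/index.htm", restrictTo));
        System.out.println(isAllowed("http://example.org/", restrictTo));
    }
}
```

The sketch leaves URLs such as index.php?id=3 untouched because of the
query string, which is one way to make the index.* removal less aggressive.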
> > note: LuceneStorage is dumb; it just takes the WebDocument and stores
> > it. That means with the current config it also stores tags, and only
> > one "content" field that contains everything. I plan to write another
> > storage that uses the HTMLDocument from the demo package to store HTML
> > documents.
>
> Nice.
> I found NekoHTML to do a nice job of 'dehtmlization'.
>
> > Please note that when adding this storage to the storage pipeline,
> > the whole crawling process becomes CPU-bound instead of I/O-bound.
> > We already have plans for how to do the distribution.
> >
> > Feel free to contact me if there are questions.
> > Still Looking For Contributors!
> >
> > Clemens
>
> Excellent!
>
> Otis
>
>
>
> --
> To unsubscribe, e-mail:
<ma...@jakarta.apache.org>
For additional commands, e-mail: <ma...@jakarta.apache.org>


