Posted to dev@lucene.apache.org by Clemens Marschner <cm...@lanlab.de> on 2002/06/18 03:11:38 UTC

LARM Web Crawler: LuceneStorage [experimental]

Hi,

I have added an experimental version of a LuceneStorage to the LARM crawler,
available from CVS in lucene-sandbox. That means crawled documents can now
be indexed directly into a Lucene index.



Sorry, no configuration files yet. Config is done in
...larm/FetcherMain.java
The main class FetcherMain is now configured to store the contents in a
Lucene index called "luceneIndex".


Lots of open questions:
- LARM doesn't yet have a notion of shutting everything down cleanly. What
happens if the IndexWriter is interrupted?
- I haven't tried to read from the index yet (a quick check is sketched
right after this list)...
- How to configure the stuff from a config file
... (it's late)
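For the second point, here is a minimal sketch of how one could peek into
the resulting index with plain Lucene calls (this is not LARM code; "content"
is just the field the current storage writes):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexReader;

    public class PeekIndex {
        public static void main(String[] args) throws Exception {
            // open the directory the crawler writes to
            IndexReader reader = IndexReader.open("luceneIndex");
            System.out.println("documents in index: " + reader.numDocs());
            if (reader.numDocs() > 0) {
                // "content" is the single field the current LuceneStorage fills
                Document doc = reader.document(0);
                System.out.println(doc.get("content"));
            }
            reader.close();
        }
    }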

Please try it:

To build and run it,
- put Ant on your path
- provide a build.properties with the location of the Lucene jar file
  (lucene.jar=), just like javacc in lucene/build.xml (a minimal example
  follows the commands below)
- put HTTPClient.jar from http://innovation.ch/java and the jakarta-oro
  library into libs
- type:

ant run -Dstart=<starturl> -Drestrictto=<restricttourl> -Dthreads=<numThreads>

e.g.:

ant run -Dstart=http://localhost/ -Drestrictto=http://localhost.* -Dthreads=5
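A minimal build.properties might look like this (the path is made up; point
it at wherever your Lucene jar actually lives):

    # build.properties
    lucene.jar=/path/to/lucene.jar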

note: restrictto is a regular expression; the URLs tested against it are
normalized beforehand, which means they are made lower case, index.* is
removed, and some other corrections are applied (see URLNormalizer.java for
details)
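As an illustration only (the real rules live in URLNormalizer.java and cover
more cases), the two normalizations named above could look roughly like this:

    public class NormalizeSketch {
        static String normalize(String url) {
            // lower-case the whole URL ...
            String u = url.toLowerCase();
            // ... and drop a trailing index.<ext> component, e.g. index.html
            return u.replaceAll("index\\.[a-z0-9]+$", "");
        }

        public static void main(String[] args) {
            System.out.println(normalize("http://Host/Path/Index.HTML"));
            // prints: http://host/path/
        }
    }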

note: LuceneStorage is dumb; it just takes the WebDocument and stores it.
That means with the current config it also stores tags and puts everything
into a single "content" field. I plan to write another storage that uses the
HTMLDocument from the demo package to store HTML documents.
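On the Lucene side, the "dumb" storage boils down to something like this
sketch (stand-alone for illustration; the actual LuceneStorage works on
LARM's WebDocument and its configured fields):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class DumbStoreSketch {
        public static void main(String[] args) throws Exception {
            // true = create the "luceneIndex" directory from scratch
            IndexWriter writer =
                new IndexWriter("luceneIndex", new StandardAnalyzer(), true);
            Document doc = new Document();
            doc.add(Field.Text("url", "http://localhost/"));
            // raw page content, tags included, all in one field
            doc.add(Field.Text("content",
                "<html><body>raw page, tags and all</body></html>"));
            writer.addDocument(doc);
            writer.close();
        }
    }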

Please note that when this storage is added to the storage pipeline, the
whole crawling process becomes CPU-bound instead of I/O-bound. We already
have plans for how to do the distribution.

Feel free to contact me if there are questions.
Still Looking For Contributors!

Clemens




--------------------------------------
http://www.cmarschner.net





Re: LARM Web Crawler: note on normalized URLs

Posted by Jack Park <ja...@thinkalong.com>.
At 02:14 PM 6/19/2002 -0700, you wrote:
>It may be even nicer to use some DB implemented in Java, such as
>HyperSQL (I think that's the name)

It used to be called HypersonicSQL, now it's just HSQLDB.

https://sourceforge.net/projects/hsqldb/




Re: LARM Web Crawler: note on normalized URLs

Posted by Otis Gospodnetic <ot...@yahoo.com>.
> > It may be even nicer to use some DB implemented in Java, such as
> > HyperSQL (I think that's the name) or Smyle
> > (https://sourceforge.net/projects/smyle/) or Berkeley DB
> > (http://www.sleepycat.com/), although MySQL may be simpler if you want
> > to create a crawler that can be run on a cluster of machines that share
> > a central link repository.
> 
> Hm, I'll think about it. But MySQL seems to be the KISS way...
> I don't think a central link repository makes sense. Looks like a
> bottleneck to me.

Well, yes, it could become a bottleneck.
However, your crawler is not distributed (yet?), so we don't have to
waste time talking about hypothetical situations.

Otis






Re: LARM Web Crawler: note on normalized URLs

Posted by Clemens Marschner <cm...@lanlab.de>.
> It may be even nicer to use some DB implemented in Java, such as
> HyperSQL (I think that's the name) or Smyle
> (https://sourceforge.net/projects/smyle/) or Berkeley DB
> (http://www.sleepycat.com/), although MySQL may be simpler if you want
> to create a crawler that can be run on a cluster of machines that share
> a central link repository.

Hm, I'll think about it. But MySQL seems to be the KISS way...
I don't think a central link repository makes sense. Looks like a bottleneck
to me.

Clemens




Re: LARM Web Crawler: note on normalized URLs

Posted by Otis Gospodnetic <ot...@yahoo.com>.
--- Clemens Marschner <cm...@lanlab.de> wrote:
> > > note: restrictto is a regular expression; the URLs tested against it
> > > are normalized beforehand, which means they are made lower case,
> > > index.* is removed, and some other corrections are applied
> > > (see URLNormalizer.java for details)
> >
> > Removing index.* may be too bold and incorrect in some situations.
> 
> Hm, but I think it's much more likely that http://host/ and
> http://host/index.* point to the same document than to different
> documents. It's also very unlikely that (UNIX) users have one "abc" and
> one "Abc" file in the same directory, although it's possible. That's why
> URLs are made lower case.
> Therefore, I think the cost of not crawling a document that falls out of
> this scheme is higher than that of crawling a document twice.
> Later on we could use, e.g., MD5 hashes to be sure.

I don't know, maybe.  I haven't done any tests nor read anything that
would confirm that this is correct (or wrong).

> I must point out that these normalized URLs are only used for comparing
> the already crawled URLs with new ones. The actual request sent to the
> server is the original URL. Removing index.* before sending the request
> would indeed be pretty bold.

Aha!
I thought you used normalized URLs for requests, too.

> I have a more detailed description of the URLNormalizer, but it's still
> in German; I might check it in after I have translated it; I need it for
> my master's thesis (see my homepage). Probably I'll write that in English
> anyway...
>
> By the way, I've made some very promising experiments with MySQL as a URL
> repository. It seems to be fast enough. When I did this with MS SQL Server
> in the first place, I was very disappointed. That's the basis for
> incremental crawling!

People at Senga.org developed something called Webbase (its CVS
repository is at sf.net) that used MySQL for this purpose as well.

It may be even nicer to use some DB implemented in Java, such as
HyperSQL (I think that's the name) or Smyle
(https://sourceforge.net/projects/smyle/) or Berkeley DB
(http://www.sleepycat.com/), although MySQL may be simpler if you want
to create a crawler that can be run on a cluster of machines that share
a central link repository.

Otis





Re: LARM Web Crawler: note on normalized URLs

Posted by Clemens Marschner <cm...@lanlab.de>.
> > note: restrictto is a regular expression; the URLs tested against it
> > are normalized beforehand, which means they are made lower case,
> > index.* is removed, and some other corrections are applied
> > (see URLNormalizer.java for details)
>
> Removing index.* may be too bold and incorrect in some situations.

Hm, but I think it's much more likely that http://host/ and
http://host/index.* point to the same document than to different documents.
It's also very unlikely that (UNIX) users have one "abc" and one "Abc" file
in the same directory, although it's possible. That's why URLs are made
lower case.
Therefore, I think the cost of not crawling a document that falls out of
this scheme is higher than that of crawling a document twice.
Later on we could use, e.g., MD5 hashes to be sure.
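A minimal sketch of that MD5 idea, using only java.security (not LARM code;
where the digests would be stored and compared is left out):

    import java.security.MessageDigest;

    public class ContentHash {
        // fingerprint fetched content; identical digests mean identical pages
        static String md5(byte[] content) throws Exception {
            byte[] digest = MessageDigest.getInstance("MD5").digest(content);
            StringBuffer hex = new StringBuffer();
            for (int i = 0; i < digest.length; i++) {
                hex.append(Integer.toHexString((digest[i] & 0xff) | 0x100)
                           .substring(1));
            }
            return hex.toString();
        }

        public static void main(String[] args) throws Exception {
            System.out.println(md5("<html>same page</html>".getBytes()));
        }
    }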

I must point out that these normalized URLs are only used for comparing the
already crawled URLs with new ones. The actual request sent to the server is
the original URL. Removing index.* before sending the request would indeed
be pretty bold.

I have a more detailed description of the URLNormalizer, but it's still in
German; I might check it in after I have translated it; I need it for my
master's thesis (see my homepage). Probably I'll write that in English
anyway...

By the way, I've made some very promising experiments with MySQL as a URL
repository. It seems to be fast enough. When I did this with MS SQL Server
in the first place, I was very disappointed. That's the basis for
incremental crawling!

--Clemens


http://www.cmarschner.net





AW: LARM Web Crawler: LuceneStorage [experimental]

Posted by Christian Schrader <sc...@evendi.de>.
Has anybody been able to access the LuceneStorage data?
It seems to me that the Lucene storage needs to be closed before it can be
accessed from another program. The closing would have to happen after the
last thread runs out. I am not an experienced thread programmer, so maybe
someone has an idea.
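Would something along these lines be the right direction? (Just a sketch
with plain Java threads and Lucene; how the fetcher threads are actually
managed in LARM may differ.)

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;

    public class ShutdownSketch {
        public static void main(String[] args) throws Exception {
            IndexWriter writer =
                new IndexWriter("luceneIndex", new StandardAnalyzer(), true);

            Thread[] fetchers = new Thread[5];
            for (int i = 0; i < fetchers.length; i++) {
                fetchers[i] = new Thread(new Runnable() {
                    public void run() { /* fetch pages, store documents */ }
                });
                fetchers[i].start();
            }

            // block until the last fetcher thread runs out ...
            for (int i = 0; i < fetchers.length; i++) {
                fetchers[i].join();
            }
            // ... then close the index so another program can open it
            writer.close();
        }
    }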

Chris
> -----Original Message-----
> From: Otis Gospodnetic [mailto:otis_gospodnetic@yahoo.com]
> Sent: 18 June 2002 23:55
> To: Lucene Developers List; Clemens Marschner
> Subject: Re: LARM Web Crawler: LuceneStorage [experimental]
>
>
> I see nice progress here.
> I will try it in the near future (time!).
>
> > I have added an experimental version of a LuceneStorage to the LARM
> > crawler,
> > available from CVS in lucene-sandbox. That means crawled documents
> > can now directly be indexed into a lucene index.
> >
> >
> >
> > Sorry, no configuration files yet. Config is done in
> > ...larm/FetcherMain.java
> > The main class FetcherMain is now configured to store the contents in
> > a lucene index called "luceneIndex".
> >
> >
> > Lots of open questions:
> > - LARM doesn't have the notion of closing everything down. What
> > happens if IndexWriter is interrupted?
>
> As in what if it encounters an exception (e.g. somebody removes the
> index directory)?  I guess one of the items that should then maybe get
> added to the to-do list is checkpointing, for starters.
>
> > - I haven't tried to read from the index yet...
>
> Heh, I'm familiar with that situation.
>
> > - How to configure the stuff from a config file
> > ... (it's late)
>
> Property file with name=value pairs and some init() method that is
> called at the beginning may be sufficient.
>
> > Please try it:
> >
> > To build and run it,
> > - put ANT in your path
> > - provide a build.properties with the location of the lucene Jar file
> > (lucene.jar=)
> >   (just like javacc in lucene/build.xml)
> > - put HTTPClient.jar from http://innovation.ch/java and jakarta-oro
> > library
> > into libs
> > - type:
> >
> > ant
> > run -Dstart=<starturl> -Drestrictto=<restricttourl>
> > -Dthreads=<numThreads>
> >
> > ex.:
> > ant
> > run -Dstart=http://localhost/ -Drestrictto=http://localhost.*
> > -Dthreads=5
> >
> > note: restrictto is a regular expression; the URLs tested against it
> > are
> > normalized beforehand, which means
> > they are made lower case, index.* are removed, and some other
> > corrections
> > (see URLNormalizer.java for details)
>
> Removing index.* may be too bold and incorrect in some situations.
>
> > note: LuceneStorage is dumb; it just takes the WebDocument and stores
> > it.
> > That means with the current config it also stores tags, and only one
> > "content" field that contains everything. I plan to write another
> > storage
> > that uses the HTMLDocument from the demo package to store HTML
> > documents.
>
> Nice.
> I found NekoHTML to do a nice job of 'dehtmlization'.
>
> > Please note that when adding this storage to the storage pipeline,
> > the whole
> > crawling process becomes
> > CPU- instead of I/O bound. We already have plans how to do the
> > distribution.
> >
> > Feel free to contact me if there are questions.
> > Still Looking For Contributors!
> >
> > Clemens
>
> Ausgezeichnet!
>
> Otis
>
>





Re: LARM Web Crawler: LuceneStorage [experimental]

Posted by Otis Gospodnetic <ot...@yahoo.com>.
I see nice progress here.
I will try it in the near future (time!).

> I have added an experimental version of a LuceneStorage to the LARM
> crawler,
> available from CVS in lucene-sandbox. That means crawled documents
> can now directly be indexed into a lucene index.
> 
> 
> 
> Sorry, no configuration files yet. Config is done in
> ...larm/FetcherMain.java
> The main class FetcherMain is now configured to store the contents in
> a lucene index called "luceneIndex".
> 
> 
> Lots of open questions:
> - LARM doesn't have the notion of closing everything down. What
> happens if IndexWriter is interrupted?

As in what if it encounters an exception (e.g. somebody removes the
index directory)?  I guess one of the items that should then maybe get
added to the to-do list is checkpointing, for starters.

> - I haven't tried to read from the index yet...

Heh, I'm familiar with that situation.

> - How to configure the stuff from a config file
> ... (it's late)

Property file with name=value pairs and some init() method that is
called at the beginning may be sufficient.
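Something along these lines, for instance (file name and keys are made up;
the real config will depend on what FetcherMain ends up needing):

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.util.Properties;

    public class CrawlerConfig {
        private Properties props = new Properties();

        // load the name=value pairs once, at startup
        public void init(String fileName) throws IOException {
            FileInputStream in = new FileInputStream(fileName);
            try {
                props.load(in);
            } finally {
                in.close();
            }
        }

        public String getStartUrl()   { return props.getProperty("start"); }
        public String getRestrictTo() { return props.getProperty("restrictto"); }
        public int getThreads() {
            return Integer.parseInt(props.getProperty("threads", "5"));
        }
    }

    // crawler.properties might then contain:
    //   start=http://localhost/
    //   restrictto=http://localhost.*
    //   threads=5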

> Please try it:
> 
> To build and run it,
> - put ANT in your path
> - provide a build.properties with the location of the lucene Jar file
> (lucene.jar=)
>   (just like javacc in lucene/build.xml)
> - put HTTPClient.jar from http://innovation.ch/java and jakarta-oro
> library
> into libs
> - type:
> 
> ant
> run -Dstart=<starturl> -Drestrictto=<restricttourl>
> -Dthreads=<numThreads>
> 
> ex.:
> ant
> run -Dstart=http://localhost/ -Drestrictto=http://localhost.*
> -Dthreads=5
> 
> note: restrictto is a regular expression; the URLs tested against it
> are
> normalized beforehand, which means
> they are made lower case, index.* are removed, and some other
> corrections
> (see URLNormalizer.java for details)

Removing index.* may be too bold and incorrect in some situations.

> note: LuceneStorage is dumb; it just takes the WebDocument and stores
> it.
> That means with the current config it also stores tags, and only one
> "content" field that contains everything. I plan to write another
> storage
> that uses the HTMLDocument from the demo package to store HTML
> documents.

Nice.
I found NekoHTML to do a nice job of 'dehtmlization'.
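Roughly like this, for example (a sketch only; it assumes NekoHTML's
DOMParser and Xerces are on the classpath and skips all error handling):

    import java.io.StringReader;
    import org.cyberneko.html.parsers.DOMParser;
    import org.w3c.dom.Node;
    import org.xml.sax.InputSource;

    public class Dehtmlize {
        // walk the DOM and collect the text nodes, i.e. drop the tags
        static void collectText(Node node, StringBuffer out) {
            if (node.getNodeType() == Node.TEXT_NODE) {
                out.append(node.getNodeValue()).append(' ');
            }
            for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
                collectText(c, out);
            }
        }

        public static void main(String[] args) throws Exception {
            DOMParser parser = new DOMParser();
            parser.parse(new InputSource(new StringReader(
                "<html><body><b>Hello</b> crawler</body></html>")));
            StringBuffer text = new StringBuffer();
            collectText(parser.getDocument(), text);
            System.out.println(text.toString().trim());
        }
    }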

> Please note that when adding this storage to the storage pipeline,
> the whole
> crawling process becomes
> CPU- instead of I/O bound. We already have plans how to do the
> distribution.
> 
> Feel free to contact me if there are questions.
> Still Looking For Contributors!
> 
> Clemens

Ausgezeichnet!

Otis


