Posted to dev@lucene.apache.org by Clemens Marschner <cm...@lanlab.de> on 2002/06/19 22:55:05 UTC

Re: LARM Web Crawler: note on normalized URLs

> > note: restrictto is a regular expression; the URLs tested against it
> > are normalized beforehand, which means they are made lower case,
> > index.* is removed, and some other corrections are applied
> > (see URLNormalizer.java for details)
>
> Removing index.* may be too bold and incorrect in some situations.

Hm, but I think it's much more likely that http://host/ and
http://host/index.* point to the same document than to different
documents. It's also very unlikely that (UNIX) users keep both an "abc"
and an "Abc" file in the same directory, although it's possible. That's
why URLs are made lower case.
Therefore, I think that even though the cost of missing a document that
falls outside this scheme is higher than the cost of crawling one twice,
such cases are rare enough to accept. Later on we could, for example,
use MD5 hashes to be sure.
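
Purely as illustration, here is a minimal sketch of those two rules plus
the MD5 idea; the class and method names are mine, not LARM's actual
URLNormalizer.java:

// Hypothetical sketch of the rules described above -- lower-casing and
// stripping an index.* file name -- not LARM's URLNormalizer.java.
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class SimpleUrlNormalizer {

    /** Lower-case the URL and drop a trailing index.* file name. */
    public static String normalize(String url) {
        String u = url.toLowerCase();
        // e.g. http://host/path/index.html -> http://host/path/
        return u.replaceAll("/index\\.[a-z0-9]+$", "/");
    }

    /** MD5 fingerprint of a fetched document, to catch duplicates
     *  that URL-level normalization cannot prove identical. */
    public static byte[] fingerprint(byte[] body)
            throws NoSuchAlgorithmException {
        return MessageDigest.getInstance("MD5").digest(body);
    }
}

With this, http://host/INDEX.HTM and http://host/ collapse to the same
key, which is exactly the intent.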

I must point out that these normalized URLs are only used for comparing
already crawled URLs with new ones. The actual request sent to the server
is the original URL. Removing index.* before sending the request would
indeed be pretty bold.
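
A sketch of that separation, again with made-up names: only the
normalized form goes into the visited set, while the fetch always uses
the untouched URL.

// Hypothetical sketch: normalized URLs are only the *keys* of the
// visited set; the request itself always uses the original URL.
import java.util.HashSet;
import java.util.Set;

public class VisitedUrls {
    private final Set<String> seen = new HashSet<>();

    /** @return the original URL if it should be fetched, or null if a
     *  URL with the same normalized form was crawled already. */
    public String offer(String originalUrl) {
        String key = SimpleUrlNormalizer.normalize(originalUrl);
        return seen.add(key) ? originalUrl : null;
    }
}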

I have a more detailed description of the URLNormalizer, but it's still
in German; I might check it in once I've translated it. I need it for my
master's thesis (see my homepage), which I'll probably write in English
anyway...

By the way, I've made some very promising experiments with MySQL as the
URL repository; it seems to be fast enough. When I first tried this with
MS SQL Server, I was very disappointed. This is the basis for incremental
crawling!
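
Roughly, such a repository could look like this over JDBC; the table,
column, and class names here are assumptions for illustration, not an
actual schema from those experiments:

// Hypothetical sketch of a MySQL-backed URL repository. Assumed schema:
//   CREATE TABLE urls (norm_url VARCHAR(255) PRIMARY KEY,
//                      last_crawled TIMESTAMP NULL)
// The last_crawled column is what would make incremental crawling possible.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class JdbcUrlRepository {
    private final Connection conn;

    public JdbcUrlRepository(String jdbcUrl, String user, String pass)
            throws SQLException {
        conn = DriverManager.getConnection(jdbcUrl, user, pass);
    }

    /** @return true if the URL was new and has been recorded. */
    public boolean record(String normUrl) throws SQLException {
        PreparedStatement ps = conn.prepareStatement(
                "INSERT IGNORE INTO urls (norm_url) VALUES (?)");
        try {
            ps.setString(1, normUrl);
            return ps.executeUpdate() == 1;  // 0 means already known
        } finally {
            ps.close();
        }
    }
}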

--Clemens


http://www.cmarschner.net





Re: LARM Web Crawler: note on normalized URLs

Posted by Jack Park <ja...@thinkalong.com>.
At 02:14 PM 6/19/2002 -0700, you wrote:
>It may be even nicer to use some DB implemented in Java, such as
>HyperSQL (I think that's the name)

It used to be called HypersonicSQL; now it's just HSQLDB.

https://sourceforge.net/projects/hsqldb/
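
For the curious, HSQLDB can run in-process with nothing but the jar on
the classpath; a minimal sketch (driver class and default credentials as
in HSQLDB 1.x, file name made up):

// Minimal sketch: HSQLDB as an embedded, in-process database.
import java.sql.Connection;
import java.sql.DriverManager;

public class HsqldbDemo {
    public static void main(String[] args) throws Exception {
        Class.forName("org.hsqldb.jdbcDriver");         // HSQLDB driver class
        Connection conn = DriverManager.getConnection(
                "jdbc:hsqldb:file:crawldb", "sa", "");  // creates crawldb.* files
        conn.close();
    }
}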




Re: LARM Web Crawler: note on normalized URLs

Posted by Otis Gospodnetic <ot...@yahoo.com>.
> > It may be even nicer to use some DB implemented in Java, such as
> > HyperSQL (I think that's the name) or Smyle
> > (https://sourceforge.net/projects/smyle/) or Berkeley DB
> > (http://www.sleepycat.com/), although MySQL may be simpler if you
> > want to create a crawler that can be run on a cluster of machines
> > that share a central link repository.
> 
> Hm, I'll think about it. But MySQL seems to be the KISS way...
> I don't think a central link repository makes sense. Looks like a
> bottleneck to me.

Well, yes, it could become a bottleneck.
However, your crawler is not distributed (yet?), so we don't have to
waste time talking about hypothetical situations.

Otis






Re: LARM Web Crawler: note on normalized URLs

Posted by Clemens Marschner <cm...@lanlab.de>.
> It may be even nicer to use some DB implemented in Java, such as
> HyperSQL (I think that's the name) or Smyle
> (https://sourceforge.net/projects/smyle/) or Berkeley DB
> (http://www.sleepycat.com/), although MySQL may be simpler if you want
> to create a crawler that can be run on a cluster of machines that share
> a central link repository.

Hm, I'll think about it. But MySQL seems to be the KISS way...
I don't think a central link repository makes sense. Looks like a bottleneck
to me.

Clemens




Re: LARM Web Crawler: note on normalized URLs

Posted by Otis Gospodnetic <ot...@yahoo.com>.
--- Clemens Marschner <cm...@lanlab.de> wrote:
> > > note: restrictto is a regular expression; the URLs tested against
> > > it are normalized beforehand, which means they are made lower
> > > case, index.* is removed, and some other corrections are applied
> > > (see URLNormalizer.java for details)
> >
> > Removing index.* may be too bold and incorrect in some situations.
> 
> Hm, but I think it's much more likely that http://host/ and
> http://host/index.* point to the same document than to different
> documents. It's also very unlikely that (UNIX) users keep both an
> "abc" and an "Abc" file in the same directory, although it's possible.
> That's why URLs are made lower case.
> Therefore, I think that even though the cost of missing a document
> that falls outside this scheme is higher than the cost of crawling one
> twice, such cases are rare enough to accept. Later on we could, for
> example, use MD5 hashes to be sure.

I don't know, maybe. I haven't done any tests or read anything that
would confirm that this is correct (or wrong).

> I must point out that these normalized URLs are only used for
> comparing already crawled URLs with new ones. The actual request sent
> to the server is the original URL. Removing index.* before sending
> the request would indeed be pretty bold.

Aha!
I thought you used normalized URLs for requests, too.

> I have a more detailed description of the URLNormalizer, but it's
> still in German; I might check it in once I've translated it. I need
> it for my master's thesis (see my homepage), which I'll probably
> write in English anyway...
> 
> By the way, I've made some very promising experiments with MySQL as
> the URL repository; it seems to be fast enough. When I first tried
> this with MS SQL Server, I was very disappointed. This is the basis
> for incremental crawling!

People at Senga.org developed something called Webbase (its CVS
repository is at sf.net) that used MySQL for this purpose as well.

It may be even nicer to use some DB implemented in Java, such as
HyperSQL (I think that's the name) or Smyle
(https://sourceforge.net/projects/smyle/) or Berkeley DB
(http://www.sleepycat.com/), although MySQL may be simpler if you want
to create a crawler that can be run on a cluster of machines that share
a central link repository.

Otis


