Posted to dev@lucene.apache.org by Clemens Marschner <cm...@lanlab.de> on 2002/06/19 22:36:28 UTC

Re: LARM Crawler: Status // Avalon?

> If you are interested, I can send you a class that is written as a
> NekoHTML Filter, which I use for extracting title, body, meta keywords
> and description.

Sure, send it over. But isn't the example packaged with Lucene doing the
same?

> Have I mentioned k2d2.org framework here before?
> I read about it in JavaPro a few months ago and chose it for an
> application that I was/am writing.  It allows for a very elegant and
> simple (in terms of use) producer/consumer pipeline.
> I've actually added a bit of functionality to the version that's at
> k2d2.org and sent it to the author who will, I believe, include it in
> the new version.
> Also, the framework allows for distributed consumer pipeline with
> different communication protocols (JMS, RMI, BEEP...).  That is
> something that is not available yet, but the author told me about it
> over a month ago.

Hmm.. I'll have a look at it. But keep in mind that the current solution is
working already, and we probably only need one very simple way to transfer
the data.

> We want this whole pipeline to be
> > configurable
> > (remember, most of it is still done from within the source code).
>...
> k2d2.org stuff doesn't have anything that allows for dynamic
> configurations, but it may be good to use because then you don't have
> to worry about developing, maintaining, fixing yet another component,
> which should really be just another piece of your infrastructure on top
> of which you can construct your specific application logic.

Yep, right. That's what I hate about C++ programs (also called
'yet-another-linked-list implementations' :-)). I'll have a look at it; I
just think the patterns used in LARM are probably too simple to be worth the
exchange. But I'll see.

By the way, I thought about the "putting all together in config files"
thing: It's probably sufficient to have a couple of applications (main
classes) that put the basic stuff together, and whose parts are then
configurable through property files. At least for now.
I just have this feeling, but I fear some things could become very nasty if
we have to invent a declarative configuration language that describes the
configuration of the pipelines, or at least whose components tell the
configuring class which other components they need to know of... (oh, that
looks like we need component based development...)...
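
As an illustration of the property-file approach (only a rough sketch; the
property names and the MessageFilter interface are made up, not LARM's actual
classes): a main class reads a property file, instantiates the pipeline stages
by class name, and lets each stage pull its own settings.

import java.io.FileInputStream;
import java.util.Properties;

public class CrawlerMain {

    // minimal stage interface; the real filter classes would look different
    public interface MessageFilter {
        void configure(Properties props);
    }

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.load(new FileInputStream(args.length > 0 ? args[0] : "crawler.properties"));

        // e.g. pipeline.filters=URLLengthFilter,URLVisitedFilter,FetcherTask
        String[] classNames = props.getProperty("pipeline.filters", "").split(",");
        MessageFilter[] pipeline = new MessageFilter[classNames.length];
        for (int i = 0; i < classNames.length; i++) {
            // instantiate each stage by class name, then let it read its own keys
            pipeline[i] = (MessageFilter) Class.forName(classNames[i].trim()).newInstance();
            pipeline[i].configure(props);
        }
        System.out.println("Configured " + pipeline.length + " pipeline stages");
    }
}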

>> Lots of open questions:
>> - LARM doesn't have the notion of closing everything down. What
>> happens if IndexWriter is interrupted?

I must add that in general I don't have experience with using Lucene
incrementally, that is, updating the index while others are using it. Is
that working smoothly?

>As in what if it encounters an exception (e.g. somebody removes the
>index directory)?  I guess one of the items that should then maybe get
>added to the to-do list is checkpointing, for starters.

Hm... what do you mean...?
From what I understand, you mean that the doc is then stored in a repository
until the index is available again...? [confused]


One last thought:
- the crawler should be started as a daemon process (at least optionally)
- it should wake up from time to time to crawl changed pages
- it should provide a management and status interface to the outside.
- it internally needs the ability to run service jobs while crawling
(keeping memory tidy, collecting stats, etc.)

from what I know, these matters could be addressed by the Apache
Avalon/Phoenix
project. Does anyone know anything about it?

Clemens







Re: LARM Crawler and the case of the missing HTTPClient.zip

Posted by Clemens Marschner <cm...@lanlab.de>.
I don't get a 404 with this URL:
http://www.innovation.ch/java/HTTPClient/

The URLs of the archives are
http://www.innovation.ch/java/HTTPClient/HTTPClient.tar.gz or
http://www.innovation.ch/java/HTTPClient/HTTPClient.zip

Regards,

Clemens

----- Original Message -----
From: "Matthew King" <ma...@gnik.com>
To: "Lucene Developers List" <lu...@jakarta.apache.org>
Sent: Thursday, June 20, 2002 10:21 PM
Subject: LARM Crawler and the case of the missing HTTPClient.zip


> Hi.  I was trying to get a hold of HTTPClient.zip to try out the LARM
> Crawler... however, the web page cited in the README.txt (and via Google
> search) - http://www.innovation.ch/java/HTTPClient - comes up 404.
>
> Anyone know of any copies lying around or where the project has moved to?
>
> thanks.
>
> - matt
>
>
>




LARM Crawler and the case of the missing HTTPClient.zip

Posted by Matthew King <ma...@gnik.com>.
Hi.  I was trying to get a hold of HTTPClient.zip to try out the LARM 
Crawler... however, the web page cited in the README.txt (and via Google 
search) - http://www.innovation.ch/java/HTTPClient - comes up 404.

Anyone know of any copies lying around or where the project has moved to?

thanks.

- matt





Re: Avalon anybody?

Posted by "Andrew C. Oliver" <ac...@apache.org>.
Otis Gospodnetic wrote:

>Jakarta's James and Cocoon projects are written on the Phoenix part of
>Avalon.  I just read an article about that on Tuesday.  The article was
>from http://www.onjava.com/, and it was just a very high-level overview
>of Avalon.
>
>Otis

+1

>--- Clemens Marschner <cm...@lanlab.de> wrote:
>
>>>>One last thought:
>>>>- the crawler should be started as a daemon process (at least
>>>>optionally)
>>>>- it should wake up from time to time to crawl changed pages
>>>>- it should provide a management and status interface to the outside.
>>>>- it internally needs the ability to run service jobs while crawling
>>>>(keeping memory tidy, collecting stats, etc.)
>>>>
>>>>from what I know, these matters could be addressed by the Apache
>>>>Avalon/Phoenix project. Does anyone know anything about it?
>>>
>>>To me Avalon looks relatively complex, but from what I've read it is a
>>>piece of software designed to allow applications like your crawler to
>>>run on top of it.  I'm stating the obvious, for some.
>>
>>Does anybody have experience with Avalon Phoenix?
>>
>>Some time ago I stumbled over an app that used it. Was it Slide?
>>Maybe.
>>
>>Regards,
>>
>>Clemens






Re: (VERY COOL IDEA) Re: Interesting idea

Posted by Erik Hatcher <li...@ehatchersolutions.com>.
Yes, I have this same idea floating around for my "copious free time".

But it's very similar to what Zoe (see previous posts on this) is, or at
least to my ideas of integrating James and Lucene.

    Erik

p.s. I hope to get my Ant <index> task finally into the Sandbox later this
week - finally done with the book and life now needs a new purpose!  :)


----- Original Message -----
From: "Otis Gospodnetic" <ot...@yahoo.com>
To: "Lucene Developers List" <lu...@jakarta.apache.org>
Sent: Monday, July 08, 2002 8:06 PM
Subject: Re: (VERY COOL IDEA) Re: Interesting idea


>
> --- "Andrew C. Oliver" <ac...@apache.org> wrote:
> > Very cool Cool!  Might make Lucene into a useful plugin for James
> > too.
>
> _That_ (James plugin) is what I have been thinking about lately and was
> wondering why nobody wrote it already.
>
> Otis
>
>
> > -Andy
> >
> > Jon Scott Stevens wrote:
> >
> > >Adding support to Lucene for Nilsimsa seems like a cool idea...
> > >
> > >http://ixazon.dynip.com/~cmeclax/nilsimsa.html
> > >
> > >The index would be the hash and one could use Lucene to rank
> > searches based
> > >on the Nilsimsa rating of the results...
> > >
> > >-jon




Re: (VERY COOL IDEA) Re: Interesting idea

Posted by "Andrew C. Oliver" <ac...@apache.org>.
No, I get it.  Was just thinking.

>I think you guys are missing the point of the idea with integrating Nilsimsa
>and Lucene.
>
>Imagine that the index will be a constant size and much smaller (and faster
>to search) if you simply save the Nilsimsa hash and then get a nilsimsa
>result...
>
>-jon


Re: (VERY COOL IDEA) Re: Interesting idea

Posted by "Andrew C. Oliver" <ac...@apache.org>.
ditto.

Otis Gospodnetic wrote:

>No, I think I know why you think it would be cool.
>I was just reacting to the word 'plugin' that, combined with the word
>'Lucene' triggered the James association in my mind.
>Anyhow, nice idea that Nilsimsa.
>
>Otis
>
>--- Jon Scott Stevens <jo...@latchkey.com> wrote:
>
>>on 7/8/02 5:06 PM, "Otis Gospodnetic" <ot...@yahoo.com> wrote:
>>
>>>--- "Andrew C. Oliver" <ac...@apache.org> wrote:
>>>
>>>>Very cool Cool!  Might make Lucene into a useful plugin for James
>>>>too.
>>>
>>>_That_ (James plugin) is what I have been thinking about lately and was
>>>wondering why nobody wrote it already.
>>>
>>>Otis
>>
>>I think you guys are missing the point of the idea with integrating
>>Nilsimsa and Lucene.
>>
>>Imagine that the index will be a constant size and much smaller (and
>>faster to search) if you simply save the Nilsimsa hash and then get a
>>Nilsimsa result...
>>
>>-jon






Re: (VERY COOL IDEA) Re: Interesting idea

Posted by Otis Gospodnetic <ot...@yahoo.com>.
No, I think I know why you think it would be cool.
I was just reacting to the word 'plugin' that, combined with the word
'Lucene' triggered the James association in my mind.
Anyhow, nice idea that Nilsimsa.

Otis

--- Jon Scott Stevens <jo...@latchkey.com> wrote:
> on 7/8/02 5:06 PM, "Otis Gospodnetic" <ot...@yahoo.com>
> wrote:
> 
> > 
> > --- "Andrew C. Oliver" <ac...@apache.org> wrote:
> >> Very cool Cool!  Might make Lucene into a useful plugin for James
> >> too.  
> > 
> > _That_ (James plugin) is what I have been thinking about lately and
> was
> > wondering why nobody wrote it already.
> > 
> > Otis
> 
> I think you guys are missing the point of the idea with integrating
> Nilsimsa
> and Lucene.
> 
> Imagine that the index will be a constant size and much smaller (and
> faster
> to search) if you simply save the Nilsimsa hash and then get a
> nilsimsa
> result...
> 
> -jon




Re: (VERY COOL IDEA) Re: Interesting idea

Posted by Jon Scott Stevens <jo...@latchkey.com>.
on 7/8/02 5:06 PM, "Otis Gospodnetic" <ot...@yahoo.com> wrote:

> 
> --- "Andrew C. Oliver" <ac...@apache.org> wrote:
>> Very cool Cool!  Might make Lucene into a useful plugin for James
>> too.  
> 
> _That_ (James plugin) is what I have been thinking about lately and was
> wondering why nobody wrote it already.
> 
> Otis

I think you guys are missing the point of the idea with integrating Nilsimsa
and Lucene.

Imagine that the index will be a constant size and much smaller (and faster
to search) if you simply save the Nilsimsa hash and then get a nilsimsa
result...

-jon
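
As an illustration of that constant-size idea (only a sketch; the digest
computation itself is omitted): every document is reduced to a 256-bit
Nilsimsa digest, and ranking boils down to counting agreeing bits, using the
usual Nilsimsa convention of matching bits minus 128.

public class NilsimsaCompare {

    // returns roughly -128..+128: matching bits minus 128
    public static int compare(byte[] a, byte[] b) {
        int matching = 0;
        for (int i = 0; i < 32; i++) {               // 256 bits = 32 bytes
            int differing = (a[i] ^ b[i]) & 0xff;    // bits that disagree in this byte
            matching += 8 - Integer.bitCount(differing);
        }
        return matching - 128;
    }
}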




Re: (VERY COOL IDEA) Re: Interesting idea

Posted by Otis Gospodnetic <ot...@yahoo.com>.
--- "Andrew C. Oliver" <ac...@apache.org> wrote:
> Very cool Cool!  Might make Lucene into a useful plugin for James
> too.  

_That_ (James plugin) is what I have been thinking about lately and was
wondering why nobody wrote it already.

Otis


> -Andy
> 
> Jon Scott Stevens wrote:
> 
> >Adding support to Lucene for Nilsimsa seems like a cool idea...
> >
> >http://ixazon.dynip.com/~cmeclax/nilsimsa.html
> >
> >The index would be the hash and one could use Lucene to rank
> searches based
> >on the Nilsimsa rating of the results...
> >
> >-jon


(VERY COOL IDEA) Re: Interesting idea

Posted by "Andrew C. Oliver" <ac...@apache.org>.
Very cool Cool!  Might make Lucene into a useful plugin for James too.  

-Andy

Jon Scott Stevens wrote:

>Adding support to Lucene for Nilsimsa seems like a cool idea...
>
>http://ixazon.dynip.com/~cmeclax/nilsimsa.html
>
>The index would be the hash and one could use Lucene to rank searches based
>on the Nilsimsa rating of the results...
>
>-jon


Re: Interesting idea

Posted by "Andrew C. Oliver" <ac...@apache.org>.
+1 -- Doug is a great source of information on all things indexing 
related.  Reading Doug's emails and articles is
very educational.

Jon Scott Stevens wrote:

>on 7/10/02 9:35 AM, "Doug Cutting" <cu...@lucene.com> wrote:
>
>  
>
>>Nilsimsa appears to use what is called a "signature file" approach in
>>the literature, while Lucene uses an "inverted file".  A search on
>>Google for "signature file versus inverted index" turns up a paper by
>>Zobel et al., which concludes:
>>
>> Our conclusions are unequivocal. For typical document indexing
>> applications, current signature file techniques do not perform well
>> compared to current implementations of inverted file indexes.
>>
>>See: http://www.cs.columbia.edu/~pirot/cs6111/Readings/zobel98.pdf
>>
>>Doug
>>    
>>
>
>Wow! Great response Doug. =) Learn something new every day!
>
>-jon


Re: Interesting idea

Posted by Jon Scott Stevens <jo...@latchkey.com>.
on 7/10/02 9:35 AM, "Doug Cutting" <cu...@lucene.com> wrote:

> Nilsimsa appears to use what is called a "signature file" approach in
> the literature, while Lucene uses an "inverted file".  A search on
> Google for "signature file versus inverted index" turns up a paper by
> Zobel et al., which concludes:
> 
>  Our conclusions are unequivocal. For typical document indexing
>  applications, current signature file techniques do not perform well
>  compared to current implementations of inverted file indexes.
> 
> See: http://www.cs.columbia.edu/~pirot/cs6111/Readings/zobel98.pdf
> 
> Doug

Wow! Great response Doug. =) Learn something new every day!

-jon




DateFieldYMD

Posted by Peter Carlson <ca...@bookandhammer.com>.
Hi,

Does anyone have an objection to (or a better idea than) adding a new class
called DateFieldYMD? This would be very similar to DateField, but return a
different format.

This would support dateToString(date), converting the date to the format

YYYYMMDD

It would also add dateTimeToString(date) with the format

YYYYMMDDTHHMMSS

where T is the delimiter between the date and time. I'm just trying to follow
a pseudo-convention and minimize the bits, so no other delimiters.

The reason for another class rather than additional methods is that the
supporting methods such as stringToDate() or stringToTime(), which decode the
string to a Date or a long, would be confusing in one class.

I think this would meet the needs of people who require support for dates
before 1970, and it would be more readable.

--Peter
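
Just to make the proposed formats concrete (this is only a sketch, not the
actual DateFieldYMD implementation):

import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

public class DateFieldYMDSketch {

    // dateToString(date) -> YYYYMMDD
    public static String dateToString(Date d) {
        return new SimpleDateFormat("yyyyMMdd").format(d);
    }

    // dateTimeToString(date) -> YYYYMMDD'T'HHMMSS
    public static String dateTimeToString(Date d) {
        return new SimpleDateFormat("yyyyMMdd'T'HHmmss").format(d);
    }

    public static Date stringToDate(String s) throws ParseException {
        return new SimpleDateFormat("yyyyMMdd").parse(s);
    }
}

Since the strings sort lexicographically in date order, range queries keep
working, and dates before 1970 are no problem for SimpleDateFormat.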




Re: Interesting idea

Posted by Doug Cutting <cu...@lucene.com>.
Jon Scott Stevens wrote:
> Adding support to Lucene for Nilsimsa seems like a cool idea...
> 
> http://ixazon.dynip.com/~cmeclax/nilsimsa.html
> 
> The index would be the hash and one could use Lucene to rank searches based
> on the Nilsimsa rating of the results...

Nilsimsa employs a very different model than Lucene.  So this would 
require a re-write of the indexing and search portions of Lucene, which 
is most of the code.

Nilsimsa appears to use what is called a "signature file" approach in 
the literature, while Lucene uses an "inverted file".  A search on 
Google for "signature file versus inverted index" turns up a paper by 
Zobel et al., which concludes:

   Our conclusions are unequivocal. For typical document indexing
   applications, current signature file techniques do not perform well
   compared to current implementations of inverted file indexes.

See: http://www.cs.columbia.edu/~pirot/cs6111/Readings/zobel98.pdf

Doug




Interesting idea

Posted by Jon Scott Stevens <jo...@latchkey.com>.
Adding support to Lucene for Nilsimsa seems like a cool idea...

http://ixazon.dynip.com/~cmeclax/nilsimsa.html

The index would be the hash and one could use Lucene to rank searches based
on the Nilsimsa rating of the results...

-jon




Re: Avalon anybody?

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Jakarta's James and Cocoon projects are written on the Phoenix part of
Avalon.  I just read an article about that on Tuesday.  The article was
from http://www.onjava.com/, and it was just a very high-level overview
of Avalon.

Otis

--- Clemens Marschner <cm...@lanlab.de> wrote:
>  
> > > One last thought:
> > > - the crawler should be started as a daemon process (at least
> > > optionally)
> > > - it should wake up from time to time to crawl changed pages
> > > - it should provide a management and status interface to the
> outside.
> > > - it internally needs the ability to run service jobs while
> crawling
> > > (keeping memory tidy, collecting stats, etc.)
> > > 
> > > from what I know, these matters could be addressed by the Apache
> > > Avalon/Phoenix project. Does anyone know anything about it?
> > 
> > To me Avalon looks relatively complex, but from what I've read it
> is a
> > piece of software designed to allow applications like your crawler
> to
> > run on top of it.  I'm stating the obvious, for some.
> 
> Does anybody have experience with Avalon Phoenix?
> 
> Some time ago I stumbled over an app that used it. Was it Slide?
> Maybe.
> 
> Regards,
> 
> Clemens
> 
> 


Avalon anybody?

Posted by Clemens Marschner <cm...@lanlab.de>.
 
> > One last thought:
> > - the crawler should be started as a daemon process (at least
> > optionally)
> > - it should wake up from time to time to crawl changed pages
> > - it should provide a management and status interface to the outside.
> > - it internally needs the ability to run service jobs while crawling
> > (keeping memory tidy, collecting stats, etc.)
> > 
> > from what I know, these matters could be addressed by the Apache
> > Avalon/Phoenix project. Does anyone know anything about it?
> 
> To me Avalon looks relatively complex, but from what I've read it is a
> piece of software designed to allow applications like your crawler to
> run on top of it.  I'm stating the obvious, for some.

Does anybody have experience with Avalon Phoenix?

Some time ago I stumbled over an app that used it. Was it Slide? Maybe.

Regards,

Clemens




Re: LARM Crawler: Repository

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hello,

--- Clemens Marschner <cm...@lanlab.de> wrote:
> Ok I think I got your point.
> 
> > You have MySQL to hold your links.
> > You have N crawler threads.
> > You don't want to hit MySQL a lot, so you get links to crawl in
> batches
> > (e.g. each crawler thread tells MySQL: give me 1000 links to
> crawl).
> 
> 
> [just to make it clear: this looks like the threads would be
> "pulling"
> tasks. In fact the MessageHandler pushes tasks that are then
> distributed to the thread pool.]

Something must be pulling links from MySQL by doing a SELECT...

> That's still one of the things I wanted to change in the next time:
> at the
> moment the message processing pipeline is not working in batch mode.
> Each
> URLMessage is transmitted on its own. I already wanted to change this
> in
> order to reduce the number of synchronization points. But it was not
> top
> priority because the overall process was still pretty much I/O bound.
> But from my early experiments with the Repository I see now that this
> becomes more and more important.

Uh, doing it one by one will make the DB the bottleneck quickly.
MySQL extends 'standard SQL' by allowing you to get N rows at a
time, as do MS-SQL Server and PostgreSQL (offset, limit).
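
As an illustration of the batched fetching (only a sketch; the table and
column names are made up): one SELECT with a LIMIT fetches a whole batch of
uncrawled links per round trip instead of one link at a time.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

public class LinkBatchFetcher {

    public static List<String> fetchBatch(Connection conn, int batchSize) throws Exception {
        List<String> urls = new ArrayList<String>();
        Statement stmt = conn.createStatement();
        // MySQL/PostgreSQL-style LIMIT; MS-SQL Server would use SELECT TOP instead
        ResultSet rs = stmt.executeQuery(
            "SELECT url FROM links WHERE state = 'new' LIMIT " + batchSize);
        while (rs.next()) {
            urls.add(rs.getString(1));
        }
        rs.close();
        stmt.close();
        return urls;
    }

    public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection(
            "jdbc:mysql://localhost/larm", "user", "password");
        System.out.println(fetchBatch(conn, 1000).size() + " links fetched");
        conn.close();
    }
}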

> I read the paper about the WebRace crawler
> (http://citeseer.nj.nec.com/zeinalipour-yazti01highperformance.html,
> pp.
> 12-15) and thought about the following way the repository should
> work:
> - for each URL put into the pipeline, look if it has already been
> crawled,
> and if so, save the lastModified date into the URLMessage
> - in the crawling task, look if the lastModified timestamp is set. if
> so,
> send an "If-Modified-Since" header along with the GET command
> -> if a 304 (not-modified-since) is the answer, load the outbound
> links out
> of the repository and put them back into the pipeline

Do you need to do that?  If you've seen this page before, and you have
already extracted and stored links found in it, then those links will
get their turn; they will be pulled out of the link repository when
their time comes.  In other words, you don't need to put them in the
pipeline at this point.

> -> if a 404 (or similar) statuscode is returned, delete the file and
> its links from the repository

Delete the file, sure, although you may want to consider giving it a
few strikes before throwing it out for good (e.g. 404 once - make a
note of it, 404 again - make a note of it, 404 for the third time -
you're out!), thus also giving it a chance to 'recover' (e.g. 404 once,
404 twice, but 200 next time - page stays in repository).
But links, why remove links found in a page that you are about to
delete?  They will likely still be valid, and what's more, perhaps this
now 404 page might have been the only path to them, so if you remove
them now, you may never find them again, if no other page points to
them.
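
The strike counting could be as simple as this (hypothetical helper, not
existing LARM code):

import java.util.HashMap;
import java.util.Map;

public class FailureTracker {
    private static final int MAX_STRIKES = 3;
    private final Map<String, Integer> strikes = new HashMap<String, Integer>();

    // returns true when the URL has struck out and should be removed
    public boolean recordFailure(String url) {
        Integer previous = strikes.get(url);
        int n = (previous == null) ? 1 : previous.intValue() + 1;
        strikes.put(url, n);
        return n >= MAX_STRIKES;
    }

    // a successful fetch lets the page 'recover'
    public void recordSuccess(String url) {
        strikes.remove(url);
    }
}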

> -> if it was modified, delete the old stored links, update the
> timestamp of
> the doc in the repository and continue as if the file was new

I wouldn't delete old links, I'd leave them, and just try to add links
anyway, ensuring that you don't end up with duplicate links in the
repository.

> - if the file is new, load it, parse it, save its links, and put them
> back into the pipeline

Why the pipeline, why not the repository?  (I may be missing something
about how you handle link storage.)
When you say pipeline, is this something that is persistent
(CachingQueue?) or in memory?
When I say link repository, I'm referring to your MySQL database.

> (Any better ideas?)
> 
> I have already implemented a rather naive approach of that today,
> which (by
> no surprise) turns out to be slower than crawling everything from the
> start...
> 
> What I've learned:
> - The repository must load the information about already crawled
> documents
> into main memory after the start (which means the main memory must be
> large
> enough to hold all these URLs + some extra info, which is already
> done in
> URLVisitedFilter at this time) and, more importantly...

It has to load all this information in order to be able to check if a
link extracted from a fetched page has already been visited?
If so, this approach will, obviously, not scale.
AltaVista folks use a smaller set of popular URLs and spatially(?)
close URLs in memory, and keep everything else on disk.  That way they
don't require a lot of RAM for storing that, and disk accesses for 'has
this link already been seen?' checks are infrequent.
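
A rough sketch of that two-tier check (only an illustration; the on-disk part
is left as an interface, e.g. backed by the MySQL table):

import java.util.HashSet;
import java.util.Set;

public class VisitedUrlCache {

    // stand-in for the persistent store
    public interface DiskStore {
        boolean contains(String url);
        void add(String url);
    }

    private final Set<String> hot = new HashSet<String>();  // popular/recent URLs
    private final DiskStore disk;
    private final int maxInMemory;

    public VisitedUrlCache(DiskStore disk, int maxInMemory) {
        this.disk = disk;
        this.maxInMemory = maxInMemory;
    }

    // returns true if the URL was seen before; records it either way
    public boolean checkAndMark(String url) {
        if (hot.contains(url)) {
            return true;                       // cheap in-memory hit
        }
        boolean seen = disk.contains(url);     // the infrequent disk access
        if (!seen) {
            disk.add(url);
        }
        if (hot.size() < maxInMemory) {
            hot.add(url);                      // remember it for next time
        }
        return seen;
    }
}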

> - have a more efficient means of accessing the links than in a
> regular SQL
> table with {referer, target} pairs. The Meta-Info store mentioned in
> the
> WebRace crawler may be a solution (it's a plain text file that
> contains all
> document meta-data and whose index is held in main memory), but it
> prevents
> the URLs from being sorted other ways (e.g. all INlinks to a
> document), which is what I need for my further studies.

I think pulling links out of the DB is not such a huge problem if you
do it in batches, but updating each individual link's row in the DB
will be.  I don't know a way around that.

> > The crawler fetches all pages, and they go through your component
> > pipeline and get processed.
> > What happens if after fetching 100 links from this batch of 1000
> the
> > crawler thread dies?  Do you keep track of which links in that
> batch
> > you've crawled, so that in case the thread dies you don't recrawl
> > those?
> > That's roughly what I meant.
> 
> First of all, I have invested a lot of time to prevent any threads
> from
> dying. That's a reason why I took HTTPClient, because it has never
> hung so
> far. A lot of exceptions are caught on the task level. I've had a lot
> of
> problems with hanging threads when I still used the
> java.net.URLConnection
> classes, but no more.
> I have also learned that "whatever can go wrong, will go wrong, very
> soon".
> That is why I patched the HTTPClient classes to introduce a maximum
> file
> size to be fetched.
> I can imagine some sort of crawler trap when a server process sends
> characters very slowly, as it is used in some spam filters. That's
> where the
> ThreadMonitor comes in. Each task publishes its state (i.e. "loading
> data"),
> and the ThreadMonitor restarts it when it remains in a state for too
> long.
> That's the place where the ThreadMonitor could save the rest of the
> batch.
> This way, the ThreadMonitor could become the single point of failure,
> but
> the risk that this thread is hanging is reduced by keeping it simple.
> Just
> like a watchdog hardware that looks that traffic lights work at a
> street crossing...
> 
> Regards,
> 
> Clemens

Otis




LARM Crawler: Repository

Posted by Clemens Marschner <cm...@lanlab.de>.
Ok I think I got your point.

> You have MySQL to hold your links.
> You have N crawler threads.
> You don't want to hit MySQL a lot, so you get links to crawl in batches
> (e.g. each crawler thread tells MySQL: give me 1000 links to crawl).


[just to make it clear: this looks like the threads would be "pulling"
tasks. In fact the MessageHandler pushes tasks that are then distributed to
the thread pool.]

That's still one of the things I wanted to change next: at the
moment the message processing pipeline is not working in batch mode. Each
URLMessage is transmitted on its own. I already wanted to change this in
order to reduce the number of synchronization points. But it was not top
priority because the overall process was still pretty much I/O bound.
But from my early experiments with the Repository I see now that this
becomes more and more important.

I read the paper about the WebRace crawler
(http://citeseer.nj.nec.com/zeinalipour-yazti01highperformance.html, pp.
12-15) and thought about the following way the repository should work:
- for each URL put into the pipeline, look if it has already been crawled,
and if so, save the lastModified date into the URLMessage
- in the crawling task, look if the lastModified timestamp is set. if so,
send an "If-Modified-Since" header along with the GET command
-> if a 304 (not-modified-since) is the answer, load the outbound links out
of the repository and put them back into the pipeline
-> if a 404 (or similar) statuscode is returned, delete the file and its
links from the repository
-> if it was modified, delete the old stored links, update the timestamp of
the doc in the repository and continue as if the file was new
- if the file is new, load it, parse it, save its links, and put them back
into the pipeline

(Any better ideas?)
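
For illustration, the decision could look roughly like this (the Repository
interface and method names are made up; the real crawler talks to HTTPClient
and the MySQL store instead):

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

public class RecrawlDecision {

    // hypothetical view of the repository
    public interface Repository {
        Date lastModified(String url);                 // null if never crawled
        String[] storedLinks(String url);
        void remove(String url);
        void update(String url, Date modified, String[] links);
    }

    // builds the If-Modified-Since header value for a previously seen URL
    public static String ifModifiedSince(Date lastModified) {
        SimpleDateFormat fmt =
            new SimpleDateFormat("EEE, dd MMM yyyy HH:mm:ss zzz", Locale.US);
        fmt.setTimeZone(TimeZone.getTimeZone("GMT"));
        return fmt.format(lastModified);
    }

    // maps the response code to the actions listed above; returns the links
    // that should go back into the pipeline
    public static String[] handleStatus(Repository repo, String url, int status,
                                        Date modified, String[] newLinks) {
        if (status == 304) {
            return repo.storedLinks(url);      // unchanged: re-feed stored outlinks
        } else if (status == 404) {
            repo.remove(url);                  // gone: drop the doc (and decide about its links)
            return new String[0];
        } else {                               // e.g. 200: treat as new/modified
            repo.update(url, modified, newLinks);
            return newLinks;
        }
    }
}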

I have already implemented a rather naive approach to that today, which
(unsurprisingly) turns out to be slower than crawling everything from the
start...

What I've learned:
- The repository must load the information about already crawled documents
into main memory after the start (which means the main memory must be large
enough to hold all these URLs + some extra info, which is already done in
URLVisitedFilter at this time) and, more importantly...
- have a more efficient means of accessing the links than in a regular SQL
table with {referer, target} pairs. The Meta-Info store mentioned in the
WebRace crawler may be a solution (it's a plain text file that contains all
document meta-data and whose index is held in main memory), but it prevents
the URLs from being sorted other ways (e.g. all INlinks to a document),
which is what I need for my further studies.

> The crawler fetches all pages, and they go through your component
> pipeline and get processed.
> What happens if after fetching 100 links from this batch of 1000 the
> crawler thread dies?  Do you keep track of which links in that batch
> you've crawled, so that in case the thread dies you don't recrawl
> those?
> That's roughly what I meant.

First of all, I have invested a lot of time to prevent any threads from
dying. That's a reason why I took HTTPClient, because it has never hung so
far. A lot of exceptions are caught on the task level. I've had a lot of
problems with hanging threads when I still used the java.net.URLConnection
classes, but no more.
I have also learned that "whatever can go wrong, will go wrong, very soon".
That is why I patched the HTTPClient classes to introduce a maximum file
size to be fetched.
I can imagine some sort of crawler trap where a server process sends
characters very slowly, as is done in some spam filters. That's where the
ThreadMonitor comes in. Each task publishes its state (e.g. "loading data"),
and the ThreadMonitor restarts it when it remains in one state for too long.
That's the place where the ThreadMonitor could save the rest of the batch.
This way, the ThreadMonitor could become the single point of failure, but
the risk of this thread hanging is reduced by keeping it simple. Just like
watchdog hardware that makes sure the traffic lights at a street crossing
keep working...
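
Something along these lines, just as a sketch (the interface is illustrative,
not the actual ThreadMonitor code):

public class Watchdog extends Thread {

    // a task periodically publishes its state and when it entered it
    public interface MonitoredTask {
        String getState();                 // e.g. "loading data"
        long getStateSince();              // System.currentTimeMillis() of the last change
        void restart();                    // abort the task; its unfinished batch could be saved here
    }

    private final MonitoredTask[] tasks;
    private final long maxMillisInState;

    public Watchdog(MonitoredTask[] tasks, long maxMillisInState) {
        this.tasks = tasks;
        this.maxMillisInState = maxMillisInState;
        setDaemon(true);
    }

    public void run() {
        while (true) {
            long now = System.currentTimeMillis();
            for (int i = 0; i < tasks.length; i++) {
                // a task stuck in one state too long (e.g. a slow-drip server) gets restarted
                if (now - tasks[i].getStateSince() > maxMillisInState) {
                    tasks[i].restart();
                }
            }
            try {
                Thread.sleep(5000);        // keep the monitor itself dead simple
            } catch (InterruptedException e) {
                return;
            }
        }
    }
}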

Regards,

Clemens







Re: LARM Crawler: Status // Avalon?

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hello,

--- Clemens Marschner <cm...@lanlab.de> wrote:
> > If you are interested, I can send you a class that is written as a
> > NekoHTML Filter, which I use for extracting title, body, meta
> keywords
> > and description.
> 
> Sure, send it over. But isn't the example packaged with Lucene doing
> the same?

It's attached.  I'm sending it to the list, in case anyone searches the
list archives and needs code like this.
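
Roughly, such an extractor can look like the following sketch (this is not the
attached class, just an outline using NekoHTML's SAX parser):

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class HtmlExtractor extends DefaultHandler {
    private final StringBuffer title = new StringBuffer();
    private final StringBuffer body = new StringBuffer();
    private String keywords = "";
    private String description = "";
    private boolean inTitle = false;

    public void startElement(String uri, String local, String qName, Attributes atts) {
        String name = qName.toLowerCase();
        if ("title".equals(name)) {
            inTitle = true;
        } else if ("meta".equals(name)) {
            String metaName = atts.getValue("name");
            String content = atts.getValue("content");
            if ("keywords".equalsIgnoreCase(metaName) && content != null) keywords = content;
            if ("description".equalsIgnoreCase(metaName) && content != null) description = content;
        }
    }

    public void endElement(String uri, String local, String qName) {
        if ("title".equalsIgnoreCase(qName)) inTitle = false;
    }

    public void characters(char[] ch, int start, int length) {
        (inTitle ? title : body).append(ch, start, length);  // collect title vs. body text
    }

    public static void main(String[] args) throws Exception {
        // NekoHTML ships a SAX parser that tolerates real-world HTML
        XMLReader parser = (XMLReader)
            Class.forName("org.cyberneko.html.parsers.SAXParser").newInstance();
        HtmlExtractor handler = new HtmlExtractor();
        parser.setContentHandler(handler);
        parser.parse(new InputSource(args[0]));
        System.out.println("title: " + handler.title);
    }
}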

> > Have I mentioned k2d2.org framework here before?
> > I read about it in JavaPro a few months ago and chose it for an
> > application that I was/am writing.  It allows for a very elegant
> and
> > simple (in terms of use) producer/consumer pipeline.
> > I've actually added a bit of functionality to the version that's at
> > k2d2.org and sent it to the author who will, I believe, include it
> in
> > the new version.
> > Also, the framework allows for distributed consumer pipeline with
> > different communication protocols (JMS, RMI, BEEP...).  That is
> > something that is not available yet, but the author told me about
> it
> > over a month ago.
> 
> Hmm.. I'll have a look at it. But keep in mind that the current
> solution is
> working already, and we probably only need one very simple way to
> transfer the data.

I know, if it works it may not need fixing, but I thought you may want
to get rid of the infrastructure part of your code if there is
something that does it nicely already.

> > We want this whole pipeline to be
> > > configurable
> > > (remember, most of it is still done from within the source code).
> >...
> > k2d2.org stuff doesn't have anything that allows for dynamic
> > configurations, but it may be good to use because then you don't
> have
> > to worry about developing, maintaining, fixing yet another
> component,
> > which should really be just another piece of your infrastructure on
> top
> > of which you can construct your specific application logic.
> 
> yep, right. that's what i hate about c++ programs (also called
> 'yet-another-linked-list-implementation's :-)) i'll have a look at
> it; I
> just think the patterns used in LARM are probably too simple to be
> worth the exchange. But I'll see.

This k2d2 framework is super simple to use.  Register consumers, put
something in the front queue, extend a base class and override a single
method that takes an object and returns an object (or null if it
consumes it).  Pipeline done.
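
In spirit it's something like this (a rough modern sketch of the pattern, not
the actual k2d2 API):

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public abstract class Stage extends Thread {
    private final BlockingQueue<Object> in = new LinkedBlockingQueue<Object>();
    private Stage next;

    // override this: return a result to pass on, or null to consume the message
    protected abstract Object process(Object message);

    public void connect(Stage next) { this.next = next; }

    public void put(Object message) throws InterruptedException { in.put(message); }

    public void run() {
        try {
            while (true) {
                Object result = process(in.take());
                if (result != null && next != null) {
                    next.put(result);          // hand the message to the next stage
                }
            }
        } catch (InterruptedException e) {
            // shut down quietly
        }
    }
}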

> By the way, I thought about the "putting all together in config
> files"
> thing: It's probably sufficient to have a couple of applications
> (main
> classes) that put the basic stuff together, and whose parts are then
> configurable through property files. At least now.
> I just have this feeling, but I fear some things could become very
> nasty if
> we have to invent a declarative configuration language that describes
> the
> configuration of the pipelines, or at least whose components tell the
> configuring class which other components they need to know of... (oh,
> that looks like we need component based development...)...

I don't have a better suggestion right now.

> >> Lots of open questions:
> >> - LARM doesn't have the notion of closing everything down. What
> >> happens if IndexWriter is interrupted?
> 
> I must add that in general I don't have experience with using Lucene
> incrementally, that is, updating the index while others are using it.
> Is that working smoothly?

Yes, in my experience it works without problems.
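
For example, roughly like this (a sketch against the Lucene API of that time,
assuming the IndexWriter(String, Analyzer, boolean create) constructor; the
path and field names are made up):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class IncrementalIndexer {
    public static void main(String[] args) throws Exception {
        // create=false appends to the existing index instead of overwriting it
        IndexWriter writer =
            new IndexWriter("/tmp/index", new StandardAnalyzer(), false);
        Document doc = new Document();
        doc.add(Field.Text("contents", "newly crawled page text"));
        doc.add(Field.Keyword("url", "http://example.com/"));
        writer.addDocument(doc);
        writer.close();   // readers opened earlier still see their old snapshot
    }
}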

> >As in what if it encounters an exception (e.g. somebody removes the
> >index directory)?  I guess one of the items that should them maybe
> get
> >added to the to-do list is checkpointing for starters.
> 
> Hm... what do you mean...?
> From what I understand you mean that then the doc is stored in a
> repository
> until the index is available again...? [confused]

What I meant was this.
You have MySQL to hold your links.
You have N crawler threads.
You don't want to hit MySQL a lot, so you get links to crawl in batches
(e.g. each crawler thread tells MySQL: give me 1000 links to crawl).
The crawler fetches all pages, and they go through your component
pipeline and get processed.
What happens if after fetching 100 links from this batch of 1000 the
crawler thread dies?  Do you keep track of which links in that batch
you've crawled, so that in case the thread dies you don't recrawl
those?
That's roughly what I meant.

> One last thought:
> - the crawler should be started as a daemon process (at least
> optionally)
> - it should wake up from time to time to crawl changed pages
> - it should provide a management and status interface to the outside.
> - it internally needs the ability to run service jobs while crawling
> (keeping memory tidy, collecting stats, etc.)
> 
> from what I know, these matters could be addressed by the Apache
> Avalon/Phoenix project. Does anyone know anything about it?

To me Avalon looks relatively complex, but from what I've read it is a
piece of software designed to allow applications like your crawler to
run on top of it.  I'm stating the obvious, for some.

Otis

