Posted to dev@lucene.apache.org by Halácsy Péter <ha...@axelero.com> on 2002/03/03 01:10:13 UTC

RE: Proposal for Lucene / new component

> -----Original Message-----
> From: Andrew C. Oliver [mailto:acoliver@apache.org]
> Sent: Tuesday, February 26, 2002 2:13 PM
> To: Lucene Developers List
> Subject: Re: Proposal for Lucene / new component
> 
> 
> Humm.  Well said.  I'm not against using Avalon.  My approach to
> software is this though:  Get a working draft.  Refactor it into that
> *stand the test of time* for your second or third release.  Things
> change...iterate.  Not against a super configurable masterpiece...but
> first I want to crawl and index web pages over httpd in various
> pluggable mime formats.. Once we get there...
> 

Hello,
I was abroad last week, and it took me at least 30 minutes to read the discussion about Avalon. It's great!

Someone mentioned that Avalon is only used by Cocoon. Well, we are using Cocoon, and I'm very happy that it is Avalon based; I think that is the main reason for its flexibility. By the way, Cocoon also uses Lucene; please see http://xml.apache.org/cocoon/userdocs/generators/search-generator.html

I think that if you need logging, configuration, threading, and pooling (for the crawler), and you want to be component based, you need a framework, something like Avalon. It took me one day to understand Avalon and write my first Hello World application, but it can save a lot of time while coding.
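
To give you an idea of how small a component is, here is roughly what such a Hello World looks like. This is just a sketch from memory, so please check the exact Avalon package and method names; the point is that the container hands the component its logger and its piece of the config file, with no setup code of your own:

import org.apache.avalon.framework.configuration.Configurable;
import org.apache.avalon.framework.configuration.Configuration;
import org.apache.avalon.framework.configuration.ConfigurationException;
import org.apache.avalon.framework.logger.LogEnabled;
import org.apache.avalon.framework.logger.Logger;

public class HelloWorld implements LogEnabled, Configurable {
    private Logger logger;
    private String greeting;

    // the container gives us a ready-made logger
    public void enableLogging(Logger logger) {
        this.logger = logger;
    }

    // the container gives us our part of the config file
    public void configure(Configuration conf) throws ConfigurationException {
        greeting = conf.getChild("greeting").getValue("Hello world");
    }

    public void sayHello() {
        logger.info(greeting);
    }
}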

Iteration is a very good practice in software development, and it can be applied to an Avalon-based application as well. First you write only the interfaces. Initially you can implement a fake component that behaves like the real one; after a while you can swap in the working component just by rewriting the config file.
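
To make that concrete (all the class names below are my invention, not existing code, and each public class would go in its own file): suppose the fetcher I describe below is defined by a tiny interface. You write a fake against it on the first day, so the parser and loader people can start immediately; when the real HTTP fetcher is ready, only the config file changes.

import java.io.IOException;

// the contract the rest of the crawler codes against
public interface PageFetcher {
    FetchedPage fetch(String url) throws IOException;
}

// what a fetch returns
public class FetchedPage {
    public final String url;
    public final int httpStatus;
    public final String body;

    public FetchedPage(String url, int httpStatus, String body) {
        this.url = url;
        this.httpStatus = httpStatus;
        this.body = body;
    }
}

// the fake: canned data, good enough to develop the other components against
public class FakePageFetcher implements PageFetcher {
    public FetchedPage fetch(String url) {
        return new FetchedPage(url, 200, "<html><body>dummy</body></html>");
    }
}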

For example, I think the HTTP crawler should be built from more than one component:
1. a fetcher that connects to the web server and gets the page for a URL
responsible for: downloading the page as is (handling network errors) and handling HTTP status codes (for example, redirects)

configurable by: proxy server, max open sockets
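
A sketch of what the real fetcher could look like, reusing the PageFetcher interface from above. The proxy settings come from the Avalon config; again, this is only an illustration, and a real one would also pool connections and enforce the max-open-sockets limit:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import org.apache.avalon.framework.configuration.Configurable;
import org.apache.avalon.framework.configuration.Configuration;
import org.apache.avalon.framework.configuration.ConfigurationException;

public class HttpPageFetcher implements PageFetcher, Configurable {

    // called by the container with this component's part of the config file
    public void configure(Configuration conf) throws ConfigurationException {
        String proxyHost = conf.getChild("proxy-host").getValue(null);
        if (proxyHost != null) {
            // route java.net connections through the proxy
            System.setProperty("http.proxyHost", proxyHost);
            System.setProperty("http.proxyPort",
                    conf.getChild("proxy-port").getValue("8080"));
        }
    }

    public FetchedPage fetch(String url) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) new URL(url).openConnection();
        int status = conn.getResponseCode(); // redirects are followed by default
        BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()));
        StringBuffer body = new StringBuffer();
        String line;
        while ((line = in.readLine()) != null) {
            body.append(line).append('\n');
        }
        in.close();
        return new FetchedPage(url, status, body.toString());
    }
}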

2. a component that parses the fetched page and extracts relevant metadata

3. a component that is an interface to the loader; it gets the fetched and parsed pages from the parser (or gets commands from the fetcher to delete pages from the search database)

this interface can be implemented by several components:
one that puts the data in files (if the loader and the search database are on another box)
one that hands the data to a loader component in the same JVM
and so on
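
In code, components 2 and 3 could be as small as this (hypothetical names again; the point is that the crawler only ever sees the interfaces):

// what the parser extracts from a fetched page
public class ParsedPage {
    public String url;
    public String title;
    public String text;
    public java.util.List links; // the URLs found in the page
}

public interface PageParser {
    ParsedPage parse(FetchedPage page);
}

public interface Loader {
    void add(ParsedPage page);
    void delete(String url);
}
// implementations of Loader: one that serializes the pages to files (for a
// loader on another box), one that calls the indexer in the same JVM, etc.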
 
4. one that feeds URLs to the crawler's database
responsible for:
extracting links from the downloaded pages
handling manually submitted URLs (submitted by users or sysadmins)
filtering out the excluded URLs

configurable by: exclusion rules
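
The exclusion filtering could be a small class of its own; here is a naive sketch (the patterns would of course come from the component's configuration):

import java.util.Iterator;
import java.util.List;

public class UrlFilter {
    private List excludePatterns; // e.g. "*.gif", "http://private.example.com/*"

    public UrlFilter(List excludePatterns) {
        this.excludePatterns = excludePatterns;
    }

    // true if the URL may go into the crawler's database
    public boolean accept(String url) {
        for (Iterator i = excludePatterns.iterator(); i.hasNext();) {
            String pattern = (String) i.next();
            if (matches(pattern, url)) {
                return false;
            }
        }
        return true;
    }

    // naive leading/trailing wildcard matching, enough for a sketch
    private boolean matches(String pattern, String url) {
        if (pattern.startsWith("*")) {
            return url.endsWith(pattern.substring(1));
        }
        if (pattern.endsWith("*")) {
            return url.startsWith(pattern.substring(0, pattern.length() - 1));
        }
        return url.equals(pattern);
    }
}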

5. one that reads URLs from the database and feeds them to the fetcher
this is the most sophisticated component; it is responsible for choosing the right URL to crawl:
 - it can use a priority list based on URL patterns
 - it must not fetch too many pages from the same server (say, max 1 request/min)
 - it must obey the robots.txt file
configurable by: priority lists, max URLs per host
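
Here is a sketch of the politeness rule alone (priority lists and robots.txt handling would live in the same component); the scheduler asks it before handing a URL to the fetcher:

import java.net.MalformedURLException;
import java.net.URL;
import java.util.HashMap;
import java.util.Map;

public class HostThrottle {
    private final long minDelayMillis;           // e.g. 60000 for 1 request/min
    private final Map lastFetch = new HashMap(); // host -> Long(timestamp)

    public HostThrottle(long minDelayMillis) {
        this.minDelayMillis = minDelayMillis;
    }

    // true if enough time has passed since we last hit this URL's host
    public synchronized boolean mayFetch(String url)
            throws MalformedURLException {
        String host = new URL(url).getHost();
        Long last = (Long) lastFetch.get(host);
        long now = System.currentTimeMillis();
        if (last != null && now - last.longValue() < minDelayMillis) {
            return false; // too soon; the scheduler picks a URL from another host
        }
        lastFetch.put(host, new Long(now));
        return true;
    }
}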

6. and the last component is the database itself; it can be a JDBC-compliant database or something file-system based
responsible for: adding/deleting URLs to/from the database (per URL: last fetched date, last HTTP status code, last action [add or delete])
answering host-related questions: how many URLs were fetched from the host, when the last URL was fetched, and the robots.txt of the host
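
As an interface it could look like this (hypothetical again; a JDBC implementation and a file-system one would both hide behind it):

// per-URL state kept by the crawl database
public class UrlState {
    public long lastFetched;   // millis since epoch, 0 = never fetched
    public int lastHttpStatus;
    public String lastAction;  // "add" or "delete"
}

public interface CrawlDatabase {
    void add(String url);
    void delete(String url);
    UrlState getState(String url);

    // host-related questions
    int countFetchedFromHost(String host);
    long lastFetchTimeForHost(String host);
    String getRobotsTxt(String host);
}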

I know this is not a model of a complete working HTTP crawler, but please notice:
1. using Avalon you can change the implementation of a component in 30 seconds (if someone has implemented it ;)
2. you don't have to work on implementing logging, a configuration system, or database pooling for JDBC
3. the crawler is a component that needs no information about the search database (and the loader/indexer doesn't know about the crawler)
4. the parser and the loader interface component can be reused in a file-based HTML crawler (one that reads static HTML pages from the web server's directory, if the engine is used on an intranet)
5. with different loader components you can build a search engine for a single JVM or for a distributed system (and you don't need to implement both in the first iteration cycle)

OK, this mail is already too long and I'm tired.

peter



RE: Proposal for Lucene / new component

Posted by "Andrew C. Oliver" <ac...@apache.org>.
On Sat, 2002-03-02 at 19:10, Halácsy Péter wrote:
> 
> > -----Original Message-----
> > From: Andrew C. Oliver [mailto:acoliver@apache.org]
> > Sent: Tuesday, February 26, 2002 2:13 PM
> > To: Lucene Developers List
> > Subject: Re: Proposal for Lucene / new component
> > 
> > 
> > Humm.  Well said.  I'm not against using Avalon.  My approach to
> > software is this though:  Get a working draft.  Refactor it into that
> > *stand the test of time* for your second or third release.  Things
> > change...iterate.  Not against a super configurable masterpiece...but
> > first I want to crawl and index web pages over httpd in various
> > pluggable mime formats.. Once we get there...
> > 
> 
> Hello,
> I was abroad last week, and it took me at least 30 minutes to read the discussion about Avalon. It's great!
> 
> Someone mentioned that Avalon is only used by Cocoon. Well, we are using Cocoon, and I'm very happy that it is Avalon based; I think that is the main reason for its flexibility. By the way, Cocoon also uses Lucene; please see http://xml.apache.org/cocoon/userdocs/generators/search-generator.html
> 
> I think that if you need logging, configuration, threading, and pooling (for the crawler), and you want to be component based, you need a framework, something like Avalon. It took me one day to understand Avalon and write my first Hello World application, but it can save a lot of time while coding.
> 

Great!  Can you post your work on the Hello Avalon app somewhere?
If you could document along those lines as well, then I'll be happy to go
and write a "getting started" guide for Avalon.

I'm not objecting to using Avalon provided I can actually understand
it.  I'm really close thanks to the fine work of Ken Barrozzi
(http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/poi/cocoon-poi/), but
I'm one step away from actually being able to start using Avalon.  It's
not an "I won't", it's an "I can't" issue.


> Iteration is a very good practice in software development, and it can be applied to an Avalon-based application as well. First you write only the interfaces. Initially you can implement a fake component that behaves like the real one; after a while you can swap in the working component just by rewriting the config file.
> 

I kinda believe in writing components that work or do something useful
early on.

> For example, I think the HTTP crawler should be built from more than one component:
> 1. a fetcher that connects to the web server and gets the page for a URL
> responsible for: downloading the page as is (handling network errors) and handling HTTP status codes (for example, redirects)
> 
> configurable by: proxy server, max open sockets
> 
> 2. a component that parses the fetched page and extracts relevant metadata
> 
> 3. a component that is an interface to the loader; it gets the fetched and parsed pages from the parser (or gets commands from the fetcher to delete pages from the search database)
> 
> this interface can be implemented by several components:
> one that puts the data in files (if the loader and the search database are on another box)
> one that hands the data to a loader component in the same JVM
> and so on
>  
> 4. one that feeds URLs to the crawler's database
> responsible for:
> extracting links from the downloaded pages
> handling manually submitted URLs (submitted by users or sysadmins)
> filtering out the excluded URLs
> 
> configurable by: exclusion rules
> 

Awesome!  Can you patch the proposal with how you propose to do that?

> 5. one that reads URLs from the database and feeds them to the fetcher
> this is the most sophisticated component; it is responsible for choosing the right URL to crawl:
>  - it can use a priority list based on URL patterns
>  - it must not fetch too many pages from the same server (say, max 1 request/min)
>  - it must obey the robots.txt file
> configurable by: priority lists, max URLs per host
> 
> 6. and the last component is the database itself; it can be a JDBC-compliant database or something file-system based
> responsible for: adding/deleting URLs to/from the database (per URL: last fetched date, last HTTP status code, last action [add or delete])
> answering host-related questions: how many URLs were fetched from the host, when the last URL was fetched, and the robots.txt of the host
> 
> I know this is not a model of a complete working HTTP crawler, but please notice:
> 1. using Avalon you can change the implementation of a component in 30 seconds (if someone has implemented it ;)
> 2. you don't have to work on implementing logging, a configuration system, or database pooling for JDBC
> 3. the crawler is a component that needs no information about the search database (and the loader/indexer doesn't know about the crawler)
> 4. the parser and the loader interface component can be reused in a file-based HTML crawler (one that reads static HTML pages from the web server's directory, if the engine is used on an intranet)
> 5. with different loader components you can build a search engine for a single JVM or for a distributed system (and you don't need to implement both in the first iteration cycle)
> 
> OK, this mail is already too long and I'm tired.
> 
> peter

Cool.  My only problem is this: if I'm to participate in development
involving Avalon, I must understand Avalon; some folks have already
written/donated some tremendous code that does some of these things.
I'd like to reuse this code -- I'm happy to help refactor it to
Avalon...but it goes back to #1.

Anyhow, maybe I'm just not skilled enough to grasp Avalon (I've thought
it was just a poor-documentation issue).  If that prevents me from
contributing to this effort in a meaningful way, then no big deal.  My
goal is to help facilitate the work in any way I can.  If that means
Avalon, fine, but up until now I've mostly failed to get it.  If you're
able, then how about getting us started with some Avalon-esque
interfaces?

Thanks,

-Andy

-- 
http://www.superlinksoftware.com
http://jakarta.apache.org - port of Excel/Word/OLE 2 Compound Document 
                            format to java
http://developer.java.sun.com/developer/bugParade/bugs/4487555.html 
			- fix java generics!
The avalanche has already started. It is too late for the pebbles to
vote.
-Ambassador Kosh

