
Posted to dev@lucene.apache.org by Clemens Marschner <cm...@lanlab.de> on 2003/01/31 13:46:33 UTC

Fw: LARM / Re: Avalonized WebCrawler

While sipping a grande latte, I came up with the following questions (more
to come):

1. I wonder how ...crawl.fetcher is working, since there seem to be some
typos:
  - DefaultFetcherTaskFacotry.xinfo (o<->t)
    contains a reference to
      com.celavi.crawl.fetcher.FetcherTaskFacotry
    which doesn't exist

2. Why crawl.Main and crawl.CrawlMain?

3. Do you think dynamically configuring the whole pipeline from a config
file would be possible? The contents of com.celavi.crawl.Main.service()
should come from a config file, say pipeline_xy.xml (more than one pipeline
config should be possible, say crawler_pipeline and indexer_pipeline).
Depending on the contents of this file, another config file should contain
the config values for each component (say crawl_full, crawl_incrementally)

4. What is your rule of thumb what becomes a component and what stays a
class?

5. Why Fortress and not a different container (just curious, I don't have
any preference)?

6. It appears to me that Fortress is creating proxy components that act as
facades to the underlying component interfaces (am I right here?). This is
exactly what I wanted to avoid. It simply becomes too heavyweight (unless
we use typical component patterns). Since we may well create 100,000
URLMessages per second, it would kill us to send every call to
urlMessageFactory.createURLMessage through a proxy. I wonder if the other
available containers work the same way? (I know Phoenix doesn't do this)

By the way, I compiled it with Eclipse (first time...). A walk in the park...

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org

Re: Fw: LARM

Posted by David Worms <da...@simpledesign.com>.

Otis,

Yes, I'm here. I sent an email to Clemens earlier, telling him I'll
look at the docs he provided on the wiki, and thanking him for the
invitation. (My email is working now.)



Re: Fw: LARM

Posted by Otis Gospodnetic <ot...@yahoo.com>.

Is David Worms reading this? (emails to the old address bounce)

Otis

--- Clemens Marschner <cm...@lanlab.de> wrote:
> 
> 
> After some sabbatical time, I created a project on Sourceforge to restart
> the LARM development. LARM will be a full-featured search engine based on
> Lucene. The scope is corporate intranets or portions of the web, databases,
> and file systems, for people that want a Java open source solution. The URL
> is http://www.sf.net/projects/larm (still pretty empty)
>
> If anybody is interested please contact me. We need somebody with practical
> experience in IR for the architecture, people that are familiar with papers
> like the ones at
> http://nagoya.apache.org/wiki/apachewiki.cgi?LuceneLARMPages/PapersOnCrawlers.
> The current architecture ideas in
> http://cvs.apache.org/viewcvs.cgi/jakarta-lucene-sandbox/projects/larm/docs/
> are far from enough.
>
> Regards,
>
> Clemens



Re: Fw: LARM

Posted by Clemens Marschner <cm...@lanlab.de>.


After some sabbatical time, I created a project on Sourceforge to restart
the LARM development. LARM will be a full-featured search engine based on
Lucene. The scope is corporate intranets or portions of the web, databases,
and file systems, for people that want a Java open source solution. The URL
is http://www.sf.net/projects/larm (still pretty empty)


If anybody is interested please contact me. We need somebody with practical
experience in IR for the architecture, people that are familiar with papers
like the ones at
http://nagoya.apache.org/wiki/apachewiki.cgi?LuceneLARMPages/PapersOnCrawlers.
The current architecture ideas in
http://cvs.apache.org/viewcvs.cgi/jakarta-lucene-sandbox/projects/larm/docs/
are far from enough.

Regards,

Clemens



Re: [LARM] next steps

Posted by David Worms <da...@simpledesign.com>.

On Friday, January 31, 2003, at 03:48  PM, Clemens Marschner wrote:
>
> Great, so how should we go on?
>
> I suggest we wait for you, David, so that you can make the code a little
> more stable and change the things you mentioned. You said something about
> two weeks (?)

Two weeks is the time I need to become more familiar with the crawler,
set up some config, try Merlin, and take a deeper look at the Excalibur
event package. At that point, I could send similar but cleaner code.

> I would say we should then be at a point where we could get rid of
> de.lanlab.* packages and move the rest to something like org.apache.larm and
> then put it into the sandbox.

or maybe incubator.apache.org

> We should also check if performance is a problem, especially with those
> factory methods.

We could easily avoid the message factories and use regular constructors.
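
A minimal sketch of the two options, for comparison. The URLMessage fields
and the createURLMessage signature are assumptions made for illustration;
the real LARM classes may look different.

```java
final class MessageCreationSketch {

    /** Minimal stand-in for a URL message; the two fields are assumptions. */
    static final class URLMessage {
        final String url;
        final String referer;

        URLMessage(String url, String referer) {
            this.url = url;
            this.referer = referer;
        }
    }

    /** Factory indirection; hypothetical signature. */
    interface URLMessageFactory {
        URLMessage createURLMessage(String url, String referer);
    }

    public static void main(String[] args) {
        // Option 1: go through a factory component (an extra call per message,
        // plus whatever wrapping the container adds around the component).
        URLMessageFactory factory = URLMessage::new;
        URLMessage viaFactory =
            factory.createURLMessage("http://example.org/a", "http://example.org/");

        // Option 2: plain constructor, no indirection on the hot path.
        URLMessage direct = new URLMessage("http://example.org/a", "http://example.org/");

        System.out.println(viaFactory.url.equals(direct.url));
    }
}
```

The trade-off is the usual one: the factory keeps the message type
swappable through configuration, while the constructor ties callers to one
concrete class but costs nothing extra per message.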

> Within this time we should also review the docs and adapt the LARM-speak
> (MessageHandler? MessageListener? MessageProcessor? Stage? Storage?)
>
> After this time I would like to concentrate on two things:
>
> The next thing I would like to do is to break up FetcherTask into at least
> two pieces (move parsing out) and change Messages such that they contain
> lists of URLs. This means StoragePipeline becomes a ProcessingPipeline.
>
> The other big issue I would like to take care of is the
> URLVisitedFilter/URLVisitedManager/URLSeenFilter. Its RAM usage must be
> optimized. I already have some ideas for that.
>
> That's only my part; I know that Otis, Kelvin and Peter wanted to work on
> other parts. May I suggest we all become familiar with David's work and read
> our docs once again?
>
> Btw, I also learned a lot from your code, David.
>
> Cheers,
>
> Clemens
>
>



[LARM] next steps

Posted by Clemens Marschner <cm...@lanlab.de>.

Great, so how should we go on?

I suggest we wait for you, David, so that you can make the code a little
more stable and change the things you mentioned. You said something about
two weeks (?)
I would say we should then be at a point where we could get rid of
de.lanlab.* packages and move the rest to something like org.apache.larm and
then put it into the sandbox.
We should also check if performance is a problem, especially with those
factory methods.
Within this time we should also review the docs and adapt the LARM-speak
(MessageHandler? MessageListener? MessageProcessor? Stage? Storage?)

After this time I would like to concentrate on two things:

The next thing I would like to do is to break up FetcherTask into at least
two pieces (move parsing out) and change Messages such that they contain
lists of URLs. This means StoragePipeline becomes a ProcessingPipeline.
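
A rough sketch of what that split could look like. All class names here
(URLListMessage, FetchedDocument, FetchStage, ParseStage) are hypothetical
and not taken from the LARM code, and the fetch stage reads from an
in-memory map instead of doing HTTP so the example stays self-contained.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ProcessingPipelineSketch {

    /** Hypothetical message carrying a batch of URLs rather than a single one. */
    static final class URLListMessage {
        final List<String> urls;

        URLListMessage(List<String> urls) {
            this.urls = Collections.unmodifiableList(new ArrayList<>(urls));
        }
    }

    /** What the fetch stage hands on to the parse stage. */
    static final class FetchedDocument {
        final String url;
        final String body;

        FetchedDocument(String url, String body) {
            this.url = url;
            this.body = body;
        }
    }

    /** Fetch stage: only retrieves content. Backed by a map here, not by HTTP. */
    static final class FetchStage {
        private final Map<String, String> fakeWeb;

        FetchStage(Map<String, String> fakeWeb) {
            this.fakeWeb = fakeWeb;
        }

        List<FetchedDocument> process(URLListMessage message) {
            List<FetchedDocument> docs = new ArrayList<>();
            for (String url : message.urls) {
                String body = fakeWeb.get(url);
                if (body != null) {
                    docs.add(new FetchedDocument(url, body));
                }
            }
            return docs;
        }
    }

    /** Parse stage: extracts links and emits them as one URL-list message. */
    static final class ParseStage {
        private static final Pattern HREF = Pattern.compile("href=\"([^\"]+)\"");

        URLListMessage process(FetchedDocument doc) {
            List<String> links = new ArrayList<>();
            Matcher m = HREF.matcher(doc.body);
            while (m.find()) {
                links.add(m.group(1));
            }
            return new URLListMessage(links);
        }
    }

    public static void main(String[] args) {
        FetchStage fetch = new FetchStage(Map.of(
            "http://example.org/",
            "<a href=\"http://example.org/a\">a</a> <a href=\"http://example.org/b\">b</a>"));
        ParseStage parse = new ParseStage();

        URLListMessage seed = new URLListMessage(List.of("http://example.org/"));
        for (FetchedDocument doc : fetch.process(seed)) {
            System.out.println(doc.url + " -> " + parse.process(doc).urls);
        }
    }
}
```

With parsing in its own stage, all links found in one document travel as a
single message, so later stages see one message per page instead of one per
link.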

The other big issue I would like to take care of is the
URLVisitedFilter/URLVisitedManager/URLSeenFilter. Its RAM usage must be
optimized. I already have some ideas for that.

That's only my part; I know that Otis, Kelvin and Peter wanted to work on
other parts. May I suggest we all become familiar with David's work and read
our docs once again?

Btw, I also learned a lot from your code, David.

Cheers,

Clemens




Re: Fw: LARM / Re: Avalonized WebCrawler

Posted by David Worms <da...@simpledesign.com>.


I forwarded this email to the Avalon-users list in the hope that they will
correct or build on our discussion. Much of the time I am talking about
Merlin without deep knowledge of it.

> 1. I wonder how ...crawl.fetcher is working, since there seem to be some
> typos:
>   - DefaultFetcherTaskFacotry.xinfo (o<->t)
>     contains a reference to
>       com.celavi.crawl.fetcher.FetcherTaskFacotry
>     which doesn't exist

OK, you are right. It took me a while to understand what the .xinfo
files do before I finally reached the conclusion: nothing. I started
learning Fortress with the crawler (I had used Phoenix before) and worked
from the examples in the Fortress CVS, which contain .xinfo files. But
those files are only used by Phoenix (auto-generated via XDoclet) and
Merlin.

> 2. Why crawl.Main and crawl.CrawlMain?

Here is the idea:
- crawl.Main is the entry point and a temporary hack. The "service"
method in which I manually initialize one component after the other
should not be there and will be removed at some point.
- crawl.CrawlMain has the ability to become a component of its own. Or
maybe the term "block" is more appropriate. Both Phoenix and Merlin
have this concept. A block can export different services. So our
CrawlMain block initializes our (inner) crawl container (Merlin or
Fortress) and makes its most relevant interface visible to a (super)
(LARM) container (Merlin, Fortress, or Phoenix).

> 3. Do you think dynamically configuring the whole pipeline from a config
> file would be possible? The contents of com.celavi.crawl.Main.service()
> should come from a config file, say pipeline_xy.xml (more than one pipeline
> config should be possible, say crawler_pipeline and indexer_pipeline).
> Depending on the contents of this file, another config file should contain
> the config values for each component (say crawl_full, crawl_incrementally)

Yes, that should be possible. The pipeline I created is a very simple
one. It is easy to configure as long as each stage implements the same
"MessageListener" interface (with the additional lifecycles). You
mention the ability to configure different pipelines. Interesting, I was
just looking at this yesterday. I spent some time trying to find out
what the hell the Excalibur "event" package could bring us: the promise
of a SEDA architecture. But I am not sure how it all works. I have some
sample code at http://67.116.155.180/~wdavidw/mySearch-event.zip.
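
A minimal sketch of a pipeline assembled from configuration, assuming every
stage implements one common interface. The MessageListener signature, the
two example stages, and the build/run helpers are made up for illustration.
In practice the list of class names would be read from something like
pipeline_xy.xml, and a container would handle instantiation and lifecycle
instead of plain reflection.

```java
import java.util.ArrayList;
import java.util.List;

final class PipelineConfigSketch {

    /** Hypothetical stage contract; the real MessageListener may differ. */
    interface MessageListener {
        /** Returns the (possibly modified) message, or null to drop it. */
        String handle(String message);
    }

    /** Example stage: lower-cases the message. */
    static final class LowerCaseStage implements MessageListener {
        public String handle(String message) {
            return message.toLowerCase();
        }
    }

    /** Example stage: drops anything that is not an http URL. */
    static final class HttpOnlyStage implements MessageListener {
        public String handle(String message) {
            return message.startsWith("http://") ? message : null;
        }
    }

    /** Instantiates one stage per configured class name, in the order given. */
    static List<MessageListener> build(List<String> classNames) {
        List<MessageListener> stages = new ArrayList<>();
        for (String name : classNames) {
            try {
                Object stage = Class.forName(name).getDeclaredConstructor().newInstance();
                stages.add((MessageListener) stage);
            } catch (ReflectiveOperationException | ClassCastException e) {
                throw new IllegalArgumentException("Cannot create stage " + name, e);
            }
        }
        return stages;
    }

    /** Passes a message through every stage until one of them drops it. */
    static String run(List<MessageListener> stages, String message) {
        String current = message;
        for (MessageListener stage : stages) {
            if (current == null) {
                break;
            }
            current = stage.handle(current);
        }
        return current;
    }

    public static void main(String[] args) {
        // Stand-in for the contents of a pipeline config file.
        List<String> config = List.of(
            LowerCaseStage.class.getName(),
            HttpOnlyStage.class.getName());
        List<MessageListener> pipeline = build(config);

        System.out.println(run(pipeline, "HTTP://Example.org/"));
        System.out.println(run(pipeline, "ftp://example.org/"));
    }
}
```

A second list of names (say, for an indexer pipeline) would produce a
different pipeline from the same code, which is the point of the question.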

> 4. What is your rule of thumb what becomes a component and what stays a
> class?

To me, a component is an instance that should be instantiated at
application startup and that should be accessible to many other units
(components). This is not how I would define a component in general, but
it is the approach I took when I started to refactor code I didn't
understand at the time (and there is still some stuff I am not familiar
with).
I did not step back from your code, take a breath, and work out how to
decompose the system into components. Instead, I treat any object that
is instantiated at application startup and destroyed at application
shutdown as a candidate.
More or less everything in your "FetcherMain" object became a
component.


> 5. Why Fortress and not a different container (just curious, I don't have
> any preference)?

I learned Avalon with Phoenix first. Great, I love it. It is extremely
easy to access Phoenix through AltRMI without a change in your code, and
just as easy to configure your app with JMX. However, what if we want the
crawler embedded inside another application? Phoenix can only be run
standalone. Here is where Merlin and Fortress can help. We can have our
Fortress-based application run from a main method, inside a servlet, or
even better, inside Phoenix as a block. I chose Fortress over Merlin
because it is closer to a release.

> 6. It appears to me that Fortress is creating proxy components that act as
> facades to the underlying component interfaces (am I right here?). This is
> exactly what I wanted to avoid. It simply becomes too heavyweight (unless
> we use typical component patterns). Since we may well create 100,000
> URLMessages per second, it would kill us to send every call to
> urlMessageFactory.createURLMessage through a proxy. I wonder if the other
> available containers work the same way? (I know Phoenix doesn't do this)

I am not sure about this. Can someone help us? I think we should look
at the component handlers (the lifestyles) in Fortress, in the
org.apache.excalibur.fortress.handler package.
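
Until that is clarified, the cost in question can at least be measured. The
sketch below wraps a factory in a standard java.lang.reflect.Proxy and
times it against a direct call. Whether Fortress wraps components this way
is not confirmed here; the URLMessageFactory signature and the helper names
are assumptions, and the timing loop is a rough indication, not a proper
benchmark.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.util.concurrent.atomic.AtomicLong;

final class ProxyOverheadSketch {

    /** Hypothetical factory interface; the real signature may differ. */
    interface URLMessageFactory {
        String createURLMessage(String url);
    }

    static final class DefaultURLMessageFactory implements URLMessageFactory {
        public String createURLMessage(String url) {
            return "msg:" + url;
        }
    }

    /** Wraps the target in a dynamic proxy that counts calls and forwards them. */
    static URLMessageFactory proxied(URLMessageFactory target, AtomicLong calls) {
        InvocationHandler handler = (proxy, method, methodArgs) -> {
            calls.incrementAndGet();
            return method.invoke(target, methodArgs);
        };
        return (URLMessageFactory) Proxy.newProxyInstance(
            URLMessageFactory.class.getClassLoader(),
            new Class<?>[] { URLMessageFactory.class },
            handler);
    }

    /** Rough wall-clock time for n calls; JIT warm-up is not accounted for. */
    static long nanosFor(URLMessageFactory factory, int n) {
        long checksum = 0;
        long start = System.nanoTime();
        for (int i = 0; i < n; i++) {
            checksum += factory.createURLMessage("http://example.org/" + i).length();
        }
        long elapsed = System.nanoTime() - start;
        if (checksum < 0) {
            throw new IllegalStateException("unreachable; keeps the loop result in use");
        }
        return elapsed;
    }

    public static void main(String[] args) {
        URLMessageFactory direct = new DefaultURLMessageFactory();
        AtomicLong calls = new AtomicLong();
        URLMessageFactory viaProxy = proxied(direct, calls);

        int n = 100_000; // the per-second message rate mentioned above
        System.out.println("direct:  " + nanosFor(direct, n) + " ns");
        System.out.println("proxied: " + nanosFor(viaProxy, n) + " ns");
        System.out.println("proxied calls: " + calls.get());
    }
}
```

Running it several times on the target JVM gives a first idea of whether
the proxy hop matters at this message rate.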

> 7. As far as I can see, each MessageProcessor (State/MessageListener in your
> terms) adds _itself_ to a message handler that it has to know about (as
> defined in DefaultMessageListenerSelector.xinfo). Doesn't this violate the
> IoC pattern? Shouldn't an external component initialize the message handler
> with the listeners according to a defined order? (the order is at the moment
> given only implicitly by the order the config files are processed).

You are right. It is the logical move. At first, each stage registered
itself with the MessageHandler. Then I introduced the
MessageListenerSelector, which instantiates each stage and then registers
it. Now the MessageHandler should register the stages itself by calling
MessageListenerSelector.selectAll() during its own initialization.
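
A minimal sketch of that last step, with the handler pulling its stages
from the selector. Only the names MessageHandler, MessageListener,
MessageListenerSelector and selectAll() come from this thread; the method
signatures, the initialize/dispatch methods and the trace list are
assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class IocRegistrationSketch {

    /** Hypothetical stage contract. */
    interface MessageListener {
        void onMessage(String url, List<String> trace);
    }

    /** Owns the stages and their order; assumed to return them in pipeline order. */
    interface MessageListenerSelector {
        List<MessageListener> selectAll();
    }

    /** Asks the selector for its stages; stages never register themselves. */
    static final class MessageHandler {
        private final MessageListenerSelector selector;
        private List<MessageListener> stages = new ArrayList<>();

        MessageHandler(MessageListenerSelector selector) {
            this.selector = selector;
        }

        /** Called once during the handler's own initialization. */
        void initialize() {
            stages = new ArrayList<>(selector.selectAll());
        }

        /** Passes the URL to every stage in order and returns what they recorded. */
        List<String> dispatch(String url) {
            List<String> trace = new ArrayList<>();
            for (MessageListener stage : stages) {
                stage.onMessage(url, trace);
            }
            return trace;
        }
    }

    /** Builds a trivial stage that records its name next to the URL. */
    static MessageListener named(String name) {
        return (url, trace) -> trace.add(name + ":" + url);
    }

    public static void main(String[] args) {
        MessageListenerSelector selector =
            () -> Arrays.asList(named("filter"), named("fetch"), named("store"));

        MessageHandler handler = new MessageHandler(selector);
        handler.initialize();
        System.out.println(handler.dispatch("http://example.org/"));
    }
}
```

The stage order is now explicit in one place (the selector) rather than
implied by the order in which config files happen to be processed.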

Still trying to figure out a lot of stuff... I am really learning a lot
from your code...

David

