Posted to dev@cocoon.apache.org by Upayavira <uv...@upaya.co.uk> on 2003/03/04 18:14:48 UTC

Re: [RT] Fixing the CLI

On Mon, 24 Feb 2003 Nicola Ken wrote:
> Traversing optimizations
> -------------------------
> 
> As you know, the Cocoon CLI gets the content of a page 3 times.
> I had refactored these three calls to Cocoon in the methods (in call
> order):
> ....... 
> Now, with the -e option we basically don't need step 2. If done 
> correctly, this will increase the speed! :-)
> 
> So we have two steps left: getting links and getting the page.
> If we can make them into a single step we're done.
> 
> Cocoon has the concept of pluggable pipelines. And each pipeline is
> responsible for connecting the various components. If we used a
> pipeline that simply inserts between the source and the next
> components a pipe that records all links
> (org.apache.cocoon.xml.xlink.ExtendedXLinkPipe.java) into the
> Environment, we can effectively get both the result and the links in a
> single pass.

I have gone ahead and coded my attempt at Nicola Ken's CLI traversal 
optimisation. I didn't use pluggable pipelines. Maybe I should have.

Basically, if you switch off extension checking, then it is possible to gather links 
from within the pipeline. So, I hunted out the locations in the pipeline code 
(SerializeNode.java) where the LinkTranslator is added to the pipeline, and added 
an optional LinkGatherer in its place which, instead of translating links using a 
map in the ObjectModel, places the found links into a List in the ObjectModel. That 
list is then available to the CLI, which adds the URIs to the list of 
links to be scanned. And it seems to work.
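
To give a rough picture (an illustrative sketch only, not the code in the 
patch: the class name and the plain href/src handling are made up), the 
gathering pipe boils down to something like this:

   import java.util.List;
   import org.xml.sax.Attributes;
   import org.xml.sax.SAXException;
   import org.xml.sax.helpers.XMLFilterImpl;

   // Records every link it sees into a shared List instead of rewriting it.
   public class LinkGatheringPipe extends XMLFilterImpl {

       private final List gatheredLinks;   // the List placed in the ObjectModel

       public LinkGatheringPipe(List gatheredLinks) {
           this.gatheredLinks = gatheredLinks;
       }

       public void startElement(String uri, String localName, String qName,
                                Attributes atts) throws SAXException {
           String href = atts.getValue("href");
           if (href == null) {
               href = atts.getValue("src");
           }
           if (href != null && !gatheredLinks.contains(href)) {
               gatheredLinks.add(href);
           }
           // Pass the event through unchanged so the page still gets serialized.
           super.startElement(uri, localName, qName, atts);
       }
   }

The real LinkGatherer sits where the LinkTranslator normally goes, so it sees 
exactly the same links; it just records them rather than translating them.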

This means that so long as one does not want to confirm that extensions match 
the mime-type of the document, one can build a site generating each page only 
once, which is great.

So what now? Is anyone interested in seeing it?

Regards, Upayavira


Re: [RT] Fixing the CLI

Posted by Jeremy Quinn <je...@media.demon.co.uk>.
On Thursday, March 6, 2003, at 06:25 PM, Upayavira wrote:

>> Will any of this work you are doing, rationalising the site crawling,
>> have any effect on the Lucene Indexer, which also uses the Crawler?
>>
>> It takes about 12 hours to generate a full index of my site ATM!
>
> I'm afraid they have completely independent crawlers, so it won't help.

How strange, I wonder why this is so ....

I was hoping the reduction in the number of hits you had achieved 
would help here too. Oh well.

Thanks

regards Jeremy


Re: [RT] Fixing the CLI

Posted by Upayavira <uv...@upaya.co.uk>.
> Will any of this work you are doing, rationalising the site crawling,
> have any effect on the Lucene Indexer, which also uses the Crawler?
> 
> It takes about 12 hours to generate a full index of my site ATM!

I'm afraid they have completely independent crawlers, so it won't help.

Regards, Upayavira


Re: [RT] Fixing the CLI

Posted by Nicola Ken Barozzi <ni...@apache.org>.
Upayavira wrote, On 05/03/2003 16.33:
> Nicola Ken wrote:
> 
>>So, from a quick glance, it seems that the way it's done is IMHO the
>>right way.
> 
> Glad you think so!

:-D

>>Upayavira wrote in bugzilla:
>
> I ran this on a site I built some time ago (with nasty things like javascript: links), and 
> got files generated for:
> 
> #US
> #nonUS
> http_
> javascript_form.submit()
> mailto_editorial@dharmalife.com
> mailto_info@dharmalife.com
> mailto_windhorse@compuserve.com
> http_\www.magamall.com=client
> javascript_form.submit()
> 
> None of which should have been generated. I would therefore ignore links that begin 
> with #, javascript: or mailto:. 

I see... what does the "old" CLI do? If it doesn't do it, why? Hmmm...

> The controversial one I presume is ignoring links that 
> begin with http://. We could get around this by adding a configuration parameter that 
> specifies the name of the server that the site is based upon. So, when generating the 
> Cocoon site, we could specify that URIs that begin with http://xml.apache.org/cocoon 
> should be spidered, but references to (for example) http://www.w3.org should be 
> ignored. By default, I'd just ignore any links that begin with http://.
> 
> I haven't tried this using the old behaviour. I will, and will let you know.

Ah, ok :-)

>>Also, have you yet measured the speed increases?
> 
> I haven't measured it. I'll do that and report back. I'll add some code to report the time 
> taken to generate the site (much like the build script).

Very good. Can't wait to see a real-life comparison report :-)

>>Is it possible to also have the same 3-step behaviour as
>>before?
>
> Yes. I've left the original behaviour as the default. All other behaviours can be 
> configured in the xconf file.

Excellent :-)

< happy but frustrated I can't get my hands on it to test it,
   just email...  grrrrr ;-))  >

-- 
Nicola Ken Barozzi                   nicolaken@apache.org
             - verba volant, scripta manent -
    (discussions get forgotten, just code remains)
---------------------------------------------------------------------


Re: [RT] Fixing the CLI

Posted by Jeremy Quinn <je...@media.demon.co.uk>.
On Wednesday, March 5, 2003, at 03:33 PM, Upayavira wrote:

> Nicola Ken wrote:
>> So, from a quick glance, it seems that the way it's done is IMHO the
>> right way.
>
> Glad you think so!
>

Will any of this work you are doing, rationalising the site crawling, 
have any effect on the Lucene Indexer, which also uses the Crawler?

It takes about 12 hours to generate a full index of my site ATM!

thanks

regards Jeremy


Re: [RT] Fixing the CLI

Posted by Upayavira <uv...@upaya.co.uk>.
Nicola Ken wrote:
> So, from a quick glance, it seems that the way it's done is IMHO the
> right way.

Glad you think so!

> Upayavira wrote in bugzilla:
> "
> This code appears to try to check pages that begin with #, javascript:
> or http://. I plan to prevent this, and probably sort other things
> too, but I'd like to see what people think of this code before I do
> anything else. "

> Could you please explain it a bit more, and the changes you'd like to
> make. Especially, is this behaviour different from the previous one?

I ran this on a site I built some time ago (with nasty things like javascript: links), and 
got files generated for:

#US
#nonUS
http_
javascript_form.submit()
mailto_editorial@dharmalife.com
mailto_info@dharmalife.com
mailto_windhorse@compuserve.com
http_\www.magamall.com=client
javascript_form.submit()

None of which should have been generated. I would therefore ignore links that begin 
with #, javascript: or mailto:. The controversial one I presume is ignoring links that 
begin with http://. We could get around this by adding a configuration parameter that 
specifies the name of the server that the site is based upon. So, when generating the 
Cocoon site, we could specify that URIs that begin with http://xml.apache.org/cocoon 
should be spidered, but references to (for example) http://www.w3.org should be 
ignored. By default, I'd just ignore any links that begin with http://.
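
Roughly, the rule I have in mind is something like this (a sketch only; the 
class name, the shouldCrawl method and the includePrefix parameter are made 
up for illustration, not what the patch will actually use):

   // Decides whether the CLI should follow a gathered link.
   public class LinkFilter {

       private final String includePrefix;   // e.g. "http://xml.apache.org/cocoon", or null

       public LinkFilter(String includePrefix) {
           this.includePrefix = includePrefix;
       }

       public boolean shouldCrawl(String uri) {
           if (uri.startsWith("#")
                   || uri.startsWith("javascript:")
                   || uri.startsWith("mailto:")) {
               return false;                  // never generate files for these
           }
           if (uri.startsWith("http://")) {
               // Follow absolute links only if they stay on the configured server.
               return includePrefix != null && uri.startsWith(includePrefix);
           }
           return true;                       // relative links are always followed
       }
   }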

I haven't tried this using the old behaviour. I will, and will let you know.

> Also, have you yet measured the speed increases?

I haven't measured it. I'll do that and report back. I'll add some code to report the time 
taken to generate the site (much like the build script).
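
The report itself needs nothing fancier than something like this (illustrative 
only; the class name and message wording are made up):

   // Prints how long the whole generation run took, much like the build script.
   public class GenerationTimer {

       public static void main(String[] args) throws InterruptedException {
           long start = System.currentTimeMillis();

           // ... crawl and generate the whole site here ...
           Thread.sleep(1500);   // stand-in for the real work

           long elapsed = System.currentTimeMillis() - start;
           System.out.println("Total time: "
                   + (elapsed / 60000) + " minutes "
                   + ((elapsed % 60000) / 1000) + " seconds");
       }
   }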

> Is it possible to also have the same 3-step behaviour as
> before?

Yes. I've left the original behaviour as the default. All other behaviours can be 
configured in the xconf file.

Regards, Upayavira

Re: [RT] Fixing the CLI

Posted by Nicola Ken Barozzi <ni...@apache.org>.

Nicola Ken Barozzi wrote, On 05/03/2003 15.12:
> 
> 
> Upayavira wrote, On 05/03/2003 14.20:
...
>> I've posted my code as a patch to bugzilla. Let me know what you 
>> think. Should this be done using pluggable pipelines? 
...

Ok, I took a look (cannot run the patch yet ATM, so only looking):

In FileSavingEnvironment you do (among other things)

   this.objectModel.put(Constants.LINK_COLLECTION_OBJECT, gatheredLinks);

and in the interpreted sitemap you do this:

 
if (env.getObjectModel().containsKey(Constants.LINK_COLLECTION_OBJECT)) {
    context.getProcessingPipeline().addTransformer(
        "<gatherer>", null,
        Parameters.EMPTY_PARAMETERS, Parameters.EMPTY_PARAMETERS
    );
}

So basically we are adding a contract to the sitemap, by saying that 
each sitemap implementation has to provide a list of links if requested 
to do so (as seen above).
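
Seen from the caller's side, the contract boils down to the following (a 
simplified sketch: a plain Map stands in for the real object model and 
Environment classes, and the constant value and link names are made up):

   import java.util.ArrayList;
   import java.util.HashMap;
   import java.util.List;
   import java.util.Map;

   public class LinkCollectionContract {

       // Stand-in for Constants.LINK_COLLECTION_OBJECT
       static final String LINK_COLLECTION_OBJECT = "link-collection";

       public static void main(String[] args) {
           Map objectModel = new HashMap();

           // 1. The caller (the CLI) asks for link gathering by putting an
           //    empty List into the object model before processing.
           List gatheredLinks = new ArrayList();
           objectModel.put(LINK_COLLECTION_OBJECT, gatheredLinks);

           // 2. The sitemap implementation sees the key, adds the gathering
           //    transformer, and that transformer fills the List while the
           //    page is produced (simulated here with two hard-coded links).
           gatheredLinks.add("index.html");
           gatheredLinks.add("faq/index.html");

           // 3. Afterwards the caller reads the links back and queues them
           //    for crawling.
           List links = (List) objectModel.get(LINK_COLLECTION_OBJECT);
           System.out.println("Links to crawl next: " + links);
       }
   }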

Thinking about pluggable pipelines: since they are set in the sitemap, 
it's better if we don't mess with them and instead use the ones that 
the sitemap requests. The approach in the patch is better.

Also, the LinkTranslator.java is in the sitemap package. I would gather 
that in fact it makes sense to fully include this requirement in a 
sitemap contract.

So, from a quick glance, it seems that the way it's done is IMHO the right 
way.

You wrote in bugzilla:
"
This code appears to try to check pages that begin with #, javascript: 
or http://. I plan to prevent this, and probably sort other things too, 
but I'd like to see what people think of this code before I do anything 
else.
"

Could you please explain it a bit more, and the changes you'd like to 
make. Especially, is this behaviour different from the previous one?

Also, have you yet measured the speed increases?
Is it possible to also have the same 3-step behaviour as before?

Thanks :-)

-- 
Nicola Ken Barozzi                   nicolaken@apache.org
             - verba volant, scripta manent -
    (discussions get forgotten, just code remains)
---------------------------------------------------------------------


Re: [RT] Fixing the CLI

Posted by Berin Loritsch <bl...@apache.org>.
Nicola Ken Barozzi wrote:

>> I've posted my code as a patch to bugzilla. Let me know what you
>> think. Should this be done using pluggable pipelines? Actually, I think
>> you'll see that the actual code changes are pretty minimal within the
>> Cocoon core, a bit more substantial within CocoonBean.java.
>
> I'll look into it, thanks a bunch :-)


By not changing the links to match some preconceived notion of
mime-type and extension, we can get those links to *.cgi scripts
working for the downloads without resorting to hackery like specifying
the link with a full URI--we can take advantage of the Forrest
URN features!




Re: [RT] Fixing the CLI

Posted by Nicola Ken Barozzi <ni...@apache.org>.

Upayavira wrote, On 05/03/2003 14.20:
>>>So what now? Is anyone interested in seeing it?
>>
>>Absolutely!
> 
> 
> I've posted my code as a patch to bugzilla. Let me know what you think. 
> Should this be done using pluggable pipelines? Actually, I think you'll see that the actual code 
> changes are pretty minimal within the Cocoon core, a bit more substantial within 
> CocoonBean.java.

I'll look into it, thanks a bunch :-)

-- 
Nicola Ken Barozzi                   nicolaken@apache.org
             - verba volant, scripta manent -
    (discussions get forgotten, just code remains)
---------------------------------------------------------------------


Re: [RT] Fixing the CLI

Posted by Upayavira <uv...@upaya.co.uk>.
> > So what now? Is anyone interested in seeing it?
> 
> Absolutely!

I've posted my code as a patch to bugzilla. Let me know what you think. Should this 
be done using pluggable pipelines? Actually, I think you'll see that the actual code 
changes are pretty minimal within the Cocoon core, a bit more substantial within 
CocoonBean.java.

Regards, Upayavira

Re: [RT] Fixing the CLI

Posted by Nicola Ken Barozzi <ni...@apache.org>.

Upayavira wrote, On 04/03/2003 18.14:
> On Mon, 24 Feb 2003 Nicola Ken wrote:
> 
>>Traversing optimizations
...
> 
> I have gone ahead and coded my attempt at Nicola Ken's CLI traversal 
> optimisation. 

Yeah! :-)

> So what now? Is anyone interested in seeing it?

Come on, don't let us wait too long ;-)

-- 
Nicola Ken Barozzi                   nicolaken@apache.org
             - verba volant, scripta manent -
    (discussions get forgotten, just code remains)
---------------------------------------------------------------------


Re: [RT] Fixing the CLI

Posted by Berin Loritsch <bl...@apache.org>.
Upayavira wrote:
> On Mon, 24 Feb 2003 Nicola Ken wrote:
> 
>>Traversing optimizations
>>-------------------------
>>
>>As you know, the Cocoon CLI gets the content of a page 3 times.
>>I had refactored these three calls to Cocoon in the methods (in call
>>order):
>>....... 
>>Now, with the -e option we basically don't need step 2. If done 
>>correctly, this will increase the speed! :-)
>>
>>So we have two steps left: getting links and getting the page.
>>If we can make them into a single step we're done.
>>
>>Cocoon has the concept of pluggable pipelines. And each pipeline is
>>responsible for connecting the various components. If we used a
>>pipeline that simply inserts between the source and the next
>>components a pipe that records all links
>>(org.apache.cocoon.xml.xlink.ExtendedXLinkPipe.java) into the
>>Environment, we can effectively get both the result and the links in a
>>single pass.
> 
> 
> I have gone ahead and coded my attempt at Nicola Ken's CLI traversal 
> optimisation. I didn't use pluggable pipelines. Maybe I should have.
> 
> Basically, if you switch off extension checking, then it is possible to gather links 
> from within the pipeline. So, I hunted out the locations in the pipeline code 
> (SerializeNode.java) where the LinkTranslator is added to the pipeline, and added 
> an optional LinkGatherer in its place which, instead of translating links using a 
> map in the ObjectModel, places the found links into a List in the ObjectModel. That 
> list is then available to the CLI, which adds the URIs to the list of 
> links to be scanned. And it seems to work.
> 
> This means that so long as one does not want to confirm that extensions match 
> the mime-type of the document, one can build a site generating each page only 
> once, which is great.
> 
> So what now? Is anyone interested in seeing it?

Absolutely!