Posted to dev@cocoon.apache.org by Jeff Turner <je...@apache.org> on 2003/08/02 14:08:21 UTC

cli.xconf questions

Hi,

I'm tinkering around with the CLI, thinking how to add
don't-crawl-this-page support, and have some questions on how cli.xconf
currently works.  The following block in cli.xconf has me confused..


  |  The old behaviour - appends uri to the specified destination
  |  directory (as specified in <dest-dir>):
  |
  |   <uri>documents/index.html</uri>

Do we still want this <uri>...</uri> behaviour?  Currently the CLI only
accepts <uri src="..."/>.  Come to think of it, the attribute name 'src'
doesn't really make sense.  What is the "source" of a Cocoon URI?  It
would be the XML (documents/index.xml), which is not what we're
specifying in @src.

  |  Append: append the generated page's URI to the end of the 
  |  source URI:
  |
  |   <uri type="append" src-prefix="documents/" src="index.html"
  |   dest="build/dest/"/>

What is a 'source URI' here, and why would we want to append another URI
(URIs are not additive)?  Does this mean documents/index.html would be
written to build/dest/?  If so, why separate @src-prefix and @src?

  |
  |  Replace: Completely ignore the generated page's URI - just 
  |  use the destination URI:
  |
  |   <uri type="replace" src-prefix="documents/" src="index.html" 
  |   dest="build/dest/docs.html"/>

Sounds fine, but again, since we know the whole URI
(documents/index.html), why separate into @src-prefix and @src?

  |
  |  Insert: Insert generated page's URI into the destination 
  |  URI at the point marked with a * (example uses fictional 
  |  zip protocol)
  |
  |   <uri type="insert" src-prefix="documents/" src="index.html" 
  |   dest="zip://*.zip/page.html"/>

Leaves me very confused.. what would be the result here?  An index.zip
file, containing the bytes from documents/index.html saved as page.html?
Is there a non-fictional scenario where this makes more sense? :)


Anyway, on to the subject of excluding certain URIs.. are there any
preferred ways of doing it?  I've currently got:

  <ignore-uri>....</ignore-uri>

working, which seems crude but effective.  Ideally I'd like to:
 - Use wildcards ("don't crawl '*.xml' URLs")
 - be able to exclude links based on which page they originate from
   ("ignore broken links from sitemap-ref.html")

I was thinking of some sort of nesting notation for indicating links from
a certain page:

  <!-- Ignore *.xml links from sitemap-ref.* -->
  <ignore from-uri="sitemap-ref.*"> 
      <uri>*.xml</uri>   
  </ignore>
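
And perhaps a flat, site-wide form as well (just sketching syntax here;
wildcard support in <ignore-uri> doesn't exist yet):

  <!-- Hypothetical: wildcards in the existing <ignore-uri> element -->
  <ignore-uri>*.xml</ignore-uri>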

Sorry I don't have any answers or even particularly coherent questions ;)
I have the feeling that cli.xconf's job, mapping URIs to the filesystem,
could potentially be quite intricate.  It is roughly an inverse of what
the sitemap does.  Perhaps we need an analogous syntax?


--Jeff


Re: cli.xconf questions

Posted by Upayavira <uv...@upaya.co.uk>.
On Mon, 4 Aug 2003 22:38:55 +1000, "Jeff Turner" <je...@apache.org> said:
> On Mon, Aug 04, 2003 at 08:25:01AM +0000, Upayavira wrote:
> > On Sat, 2 Aug 2003 22:08:21 +1000, "Jeff Turner" <je...@apache.org> said:
> > > Hi,
> > > 
> > > I'm tinkering around with the CLI, thinking how to add
> > > don't-crawl-this-page support, and have some questions on how cli.xconf
> > > currently works.  The following block in cli.xconf has me confused..
> > 
> > Jeff. Great to see you're engaging with it!
> 
> It doubled Forrest's speed - I love it ;)

Great. And there's more we can do.

> > I have also been working on the CLI. I've spent my week's spare time
> > completely reworking it. I'll post separately about what I've been up to,
> > but basically the whole thing should be much easier to understand, with a
> > separate crawler class, a separate class for handling Cocoon
> > initialisation, and another for handling URI arithmetic (which you're
> > talking about below). As to adding exclusions, I think it should merely
> > be a question of identifying the syntax. The rest, with my new code,
> > should be pretty easy (e.g. tell the crawler what to ignore with a set of
> > wildcard parameters).
> 
> Sounds marvellous.

I've started debugging now. I'll aim to commit later this week.
 
<snip/> 

> > When I've got this going, I'm going to convert the xconf code to use a
> > Configuration object, and then write an Ant task to do the same
> > ProcessXConf, so that you can have the xconf code directly in your Ant
> > script. This Ant task will be a simple wrapper around the bean, and
> > should be pretty trivial.
> 
> Mmm.. nice.  Might be some ideas to steal from Ant here, notably the idea
> of PatternSets and Mappers.

Yup. I'm keen to see what we can steal. Unfortunately, we'll have to code
it twice - it doesn't seem to be possible to share code between Ant and
Cocoon.

> > I have also, I think, just sorted my problem with my caching code not
> > working. Basically, the Cocoon cache is transient. It is therefore
> > lost every time Cocoon starts. And Cocoon is started every time the CLI
> > starts. So if we want to have the CLI only generate new pages based upon
> > the cache, we've got to make the cache for the CLI persistent. Again, see
> > separate thread.
> 
> This would be really awesome :)  Lots of people have asked if Forrest
> could only regenerate pages that have changed.  I'll defer further
> thoughts till the other thread.

Thread will come when I've got the basic code working.
 
> ...
> > > Come to think of it, the attribute name 'src'
> > > doesn't really make sense.  What is the "source" of a Cocoon URI?  It
> > > would be the XML (documents/index.xml), which is not what we're
> > > specifying in @src.
> > 
> > It is the source for a source/destination pair. You could see it as a
> > cocoon: protocol source (almost). Would you suggest something different?
> 
> No, makes sense given that explanation.

Great.

> > > I have the feeling that cli.xconf's job, mapping URIs to the filesystem,
> > > could potentially be quite intricate.  It is roughly an inverse of what
> > > the sitemap does.  Perhaps we need an analogous syntax?
> > 
> > Perhaps. I think we've only just started trying to work out what is
> > possible here. I'd be pleased to carry on the conversation, as what we
> > have at the moment is purely what I thought best, and not the result of
> > much community discussion.
> >
> > There's a lot we could discuss here. For example, how do we handle the
> > situation where we want to crawl a number of pages, but don't want to
> > have to repeat the destination for each of them? I think we could come up
> > with an elegant configuration for this. My <uri> thing is only the
> > beginning. 
> 
> There is ${variable} interpolation code in Avalon, if that helps.  Eg.
> ${context-root} in logkit.xconf.

I'll look into that.
 
> > The first thing to do is to start identifying the possible use cases for
> > URI mappings, so that we can see the range of the problem we're trying to
> > solve (and take it beyond the scope of just fixing my problems only!).
> 
> Well, two observations:
> 
> 1) Hosting a live Cocoon site is a PITA:
> 
>  - One has to fight with sysadmins to install JVMs.  Many site hosts
>    (like SF) don't even offer Java-based services.
>  - JVMs permanently chew up vast amounts of memory
>  - Servlet containers hang, crash, throw OutOfMemoryExceptions and are
>    generally unreliable.
>  - Cocoon is not particularly fast
> 
> 2) A surprising number of sites **don't need to be dynamic**
> 
> So in walks our hero, the CLI.  We can get most of the magic of Cocoon,
> with none of the pain.  Develop a site with a live Cocoon, and when
> you're ready to deploy, serialize it to disk and serve through Apache.
> 
> That's why I think the CLI is very important.  More than *anything* else,
> it has the potential to vastly widen Cocoon's audience.
> 
> So from this perspective, the need is simple.  We need the CLI to provide
> as accurate a representation of the live site as possible.  Generally
> this means simply mirroring the URI structure to disk.
 
> Currently, the biggest unmet need is the ability to exclude certain URLs.
> There is usually non-Cocoon-generated content like Javadocs, or other
> parts of the site, which needs to be excluded.

Well, let's get that working well.

Are you willing to test my new version when it's ready?

Regards, Upayavira


Re: cli.xconf questions

Posted by Jeff Turner <je...@apache.org>.
On Mon, Aug 04, 2003 at 08:25:01AM +0000, Upayavira wrote:
> On Sat, 2 Aug 2003 22:08:21 +1000, "Jeff Turner" <je...@apache.org> said:
> > Hi,
> > 
> > I'm tinkering around with the CLI, thinking how to add
> > don't-crawl-this-page support, and have some questions on how cli.xconf
> > currently works.  The following block in cli.xconf has me confused..
> 
> Jeff. Great to see you're engaging with it!

It doubled Forrest's speed - I love it ;)

> I have also been working on the CLI. I've spent my week's spare time
> completely reworking it. I'll post separately about what I've been up to,
> but basically the whole thing should be much easier to understand, with a
> separate crawler class, a separate class for handling Cocoon
> initialisation, and another for handling URI arithmetic (which you're
> talking about below). As to adding exclusions, I think it should merely
> be a question of identifying the syntax. The rest, with my new code,
> should be pretty easy (e.g. tell the crawler what to ignore with a set of
> wildcard parameters).

Sounds marvellous.

> I haven't been able to debug this, as my copy of Eclipse insists on
> entering Java's Classloader code when I try to debug it. When I've worked
> out how to stop Eclipse doing that, I'll get it debugged, and put it into
> the scratchpad. 

IDEA also steps into JDK code, but can't you just 'step over' the code
instead of diving into it?  F6 I think.

> When I've got this going, I'm going to convert the xconf code to use a
> Configuration object, and then write an Ant task to do the same
> ProcessXConf, so that you can have the xconf code directly in your Ant
> script. This Ant task will be a simple wrapper around the bean, and
> should be pretty trivial.

Mmm.. nice.  Might be some ideas to steal from Ant here, notably the idea
of PatternSets and Mappers.

> I have also, I think, just sorted my problem with my caching code not
> working. Basically, the Cocoon cache is transient. It is therefore
> lost every time Cocoon starts. And Cocoon is started every time the CLI
> starts. So if we want to have the CLI only generate new pages based upon
> the cache, we've got to make the cache for the CLI persistent. Again, see
> separate thread.

This would be really awesome :)  Lots of people have asked if Forrest
could only regenerate pages that have changed.  I'll defer further
thoughts till the other thread.

...
> > Come to think of it, the attribute name 'src'
> > doesn't really make sense.  What is the "source" of a Cocoon URI?  It
> > would be the XML (documents/index.xml), which is not what we're
> > specifying in @src.
> 
> It is the source for a source/destination pair. You could see it as a
> cocoon: protocol source (almost). Would you suggest something different?

No, makes sense given that explanation.

[snip enlightening description of cli.xconf syntax - thanks!]

> > I have the feeling that cli.xconf's job, mapping URIs to the filesystem,
> > could potentially be quite intricate.  It is roughly an inverse of what
> > the sitemap does.  Perhaps we need an analogous syntax?
> 
> Perhaps. I think we've only just started trying to work out what is
> possible here. I'd be pleased to carry on the conversation, as what we
> have at the moment is purely what I thought best, and not the result of
> much community discussion.
>
> There's a lot we could discuss here. For example, how do we handle the
> situation where we want to crawl a number of pages, but don't want to
> have to repeat the destination for each of them? I think we could come up
> with an elegant configuration for this. My <uri> thing is only the
> beginning. 

There is ${variable} interpolation code in Avalon, if that helps.  Eg.
${context-root} in logkit.xconf.
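
For instance (purely hypothetical - cli.xconf has no variable support
today, I'm just illustrating the idea):

  <!-- Hypothetical: define the destination once, reuse it per entry -->
  <variable name="dest" value="build/dest/"/>

  <uri type="append" src-prefix="documents/" src="index.html"
       dest="${dest}"/>
  <uri type="append" src-prefix="documents/" src="news.html"
       dest="${dest}"/>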

> The first thing to do is to start identifying the possible use cases for
> URI mappings, so that we can see the range of the problem we're trying to
> solve (and take it beyond the scope of just fixing my problems only!).

Well, two observations:

1) Hosting a live Cocoon site is a PITA:

 - One has to fight with sysadmins to install JVMs.  Many site hosts
   (like SF) don't even offer Java-based services.
 - JVMs permanently chew up vast amounts of memory
 - Servlet containers hang, crash, throw OutOfMemoryExceptions and are
   generally unreliable.
 - Cocoon is not particularly fast

2) A surprising number of sites **don't need to be dynamic**


So in walks our hero, the CLI.  We can get most of the magic of Cocoon,
with none of the pain.  Develop a site with a live Cocoon, and when
you're ready to deploy, serialize it to disk and serve through Apache.

That's why I think the CLI is very important.  More than *anything* else,
it has the potential to vastly widen Cocoon's audience.

So from this perspective, the need is simple.  We need the CLI to provide
as accurate a representation of the live site as possible.  Generally
this means simply mirroring the URI structure to disk.

Currently, the biggest unmet need is the ability to exclude certain URLs.
There is usually non-Cocoon-generated content like Javadocs, or other
parts of the site, which needs to be excluded.


--Jeff

> I have said previously that the Bean interface should be declared
> alpha/unstable. By the sounds of it, we need to declare the xconf
> structure unstable too. See separate thread!
> 
> Regards, Upayavira

Re: cli.xconf questions

Posted by Upayavira <uv...@upaya.co.uk>.
On Sat, 2 Aug 2003 22:08:21 +1000, "Jeff Turner" <je...@apache.org> said:
> Hi,
> 
> I'm tinkering around with the CLI, thinking how to add
> don't-crawl-this-page support, and have some questions on how cli.xconf
> currently works.  The following block in cli.xconf has me confused..

Jeff. Great to see you're engaging with it!

I have also been working on the CLI. I've spent my week's spare time
completely reworking it. I'll post separately about what I've been up to,
but basically the whole thing should be much easier to understand, with a
separate crawler class, a separate class for handling Cocoon
initialisation, and another for handling URI arithmetic (which you're
talking about below). As to adding exclusions, I think it should merely
be a question of identifying the syntax. The rest, with my new code,
should be pretty easy (e.g. tell the crawler what to ignore with a set of
wildcard parameters).

I haven't been able to debug this, as my copy of Eclipse insists on
entering Java's Classloader code when I try to debug it. When I've worked
out how to stop Eclipse doing that, I'll get it debugged, and put it into
the scratchpad. 

When I've got this going, I'm going to convert the xconf code to use a
Configuration object, and then write an Ant task to do the same
ProcessXConf, so that you can have the xconf code directly in your Ant
script. This Ant task will be a simple wrapper around the bean, and
should be pretty trivial.
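
Something like this, perhaps (a sketch only - the task name, class name
and attributes below are all made up, since none of this is written yet):

  <!-- Hypothetical Ant wrapper around the bean -->
  <taskdef name="cocoon"
           classname="org.apache.cocoon.bean.helpers.CocoonTask"/>

  <cocoon context-dir="build/webapp" dest-dir="build/dest">
    <uri type="append" src-prefix="documents/" src="index.html"/>
  </cocoon>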

I have also, I think, just sorted my problem with my caching code not
working. Basically, the Cocoon cache is transient. It is therefore
lost every time Cocoon starts. And Cocoon is started every time the CLI
starts. So if we want to have the CLI only generate new pages based upon
the cache, we've got to make the cache for the CLI persistent. Again, see
separate thread.

>   |  The old behaviour - appends uri to the specified destination
>   |  directory (as specified in <dest-dir>):
>   |
>   |   <uri>documents/index.html</uri>
> 
> Do we still want this <uri>...</uri> behaviour?  Currently the CLI only
> accepts <uri src="..."/>.  

I think someone (Joerg?) fixed a bug that might also have disabled the
old behaviour. I would be happy to let it go, but the benefit of it is
where you have a lot of pages that share a destination. Otherwise you'd
have to repeat the destination URI for each page.
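
That is, something like this (sketching the old syntax from memory):

   <dest-dir>build/dest</dest-dir>

   <uri>documents/index.html</uri>
   <uri>documents/news.html</uri>
   <uri>documents/faq.html</uri>

rather than repeating dest="build/dest/" on every entry.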

> Come to think of it, the attribute name 'src'
> doesn't really make sense.  What is the "source" of a Cocoon URI?  It
> would be the XML (documents/index.xml), which is not what we're
> specifying in @src.

It is the source for a source/destination pair. You could see it as a
cocoon: protocol source (almost). Would you suggest something different?
 
>   |  Append: append the generated page's URI to the end of the 
>   |  source URI:
>   |
>   |   <uri type="append" src-prefix="documents/" src="index.html"
>   |   dest="build/dest/"/>
> 
> What is a 'source URI' here, and why would we want to append another URI
> (URIs are not additive)?  Does this mean documents/index.html would be
> written to build/dest/?  If so, why separate @src-prefix and @src?

This is what I've started calling (after Bernard) URI Arithmetic:
different ways to calculate your destination URI from your source URI.

I have to say, I haven't yet found the best language for explaining this,
so please do bear with me.

Let's take the example of Cocoon documentation. The Cocoon URI is
documents/index.html. We want the URI of the file produced to be
build/dest/index.html. So we don't want 'documents' in the destination
URI. But we need it in the source URI. So we therefore use this as the
src-prefix, i.e. it is included in the source URI, but excluded from the
destination URI.

Now, why have 'append', 'replace', etc? Well, sometimes you will want to
append the source URI to the destination URI - in our case appending
'index.html' to 'build/dest/' gives 'build/dest/index.html', which is
what we want. But also, if we crawl on to news.html, adding that to
'build/dest/' will give us 'build/dest/news.html', which again is what we
want.
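
To put that concretely (this is my reading of the current syntax, so
treat it as a sketch):

   <uri type="append" src-prefix="documents/" src="index.html"
        dest="build/dest/"/>

   source URI      = src-prefix + src  = documents/index.html
   destination URI = dest + src        = build/dest/index.html

and the crawled news.html under the same entry lands at
build/dest/news.html.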

However, a scenario I have is where no crawling is taking place, and
there is no relationship between the source and destination URIs. So for
example: /site/page1.html could be saved as /foobar/client1.html. In that
scenario one would use 'REPLACE' as the type.
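
In cli.xconf terms that might read (paths taken from the example above;
I've left out src-prefix, since it plays no part in the destination):

   <uri type="replace" src="site/page1.html"
        dest="foobar/client1.html"/>

   destination URI = dest = foobar/client1.html   (src is ignored)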

>   |  Replace: Completely ignore the generated page's URI - just 
>   |  use the destination URI:
>   |
>   |   <uri type="replace" src-prefix="documents/" src="index.html" 
>   |   dest="build/dest/docs.html"/>
> 
> Sounds fine, but again, since we know the whole URI
> (documents/index.html), why separate into @src-prefix and @src?

In this scenario, the src-prefix isn't really needed, as the src is
ignored when calculating the destination URI.
 
>   |  Insert: Insert generated page's URI into the destination 
>   |  URI at the point marked with a * (example uses fictional 
>   |  zip protocol)
>   |
>   |   <uri type="insert" src-prefix="documents/" src="index.html" 
>   |   dest="zip://*.zip/page.html"/>
> 
> Leaves me very confused.. what would be the result here?  An index.zip
> file, containing the bytes from documents/index.html saved as page.html?
> Is there a non-fictional scenario where this makes more sense? :)

Fraid there isn't a non-fictional one ATM. This one was put there simply
for completeness (only took minutes to implement). To my mind, it is
append and replace that are the most important features.

> Anyway, on to the subject of excluding certain URIs.. are there any
> preferred ways of doing it?  I've currently got:
> 
>   <ignore-uri>....</ignore-uri>
> 
> working, which seems crude but effective.  Ideally I'd like to:
>  - Use wildcards ("don't crawl '*.xml' URLs")
>  - be able to exclude links based on which page they originate from
>    ("ignore broken links from sitemap-ref.html")
> 
> I was thinking of some sort of nesting notation for indicating links from
> a certain page:
> 
>   <!-- Ignore *.xml links from sitemap-ref.* -->
>   <ignore from-uri="sitemap-ref.*"> 
>       <uri>*.xml</uri>   
>   </ignore>


> Sorry I don't have any answers or even particularly coherent questions ;)

Neither have I!

> I have the feeling that cli.xconf's job, mapping URIs to the filesystem,
> could potentially be quite intricate.  It is roughly an inverse of what
> the sitemap does.  Perhaps we need an analogous syntax?

Perhaps. I think we've only just started trying to work out what is
possible here. I'd be pleased to carry on the conversation, as what we
have at the moment is purely what I thought best, and not the result of
much community discussion.

There's a lot we could discuss here. For example, how do we handle the
situation where we want to crawl a number of pages, but don't want to
have to repeat the destination for each of them? I think we could come up
with an elegant configuration for this. My <uri> thing is only the
beginning. 

The first thing to do is to start identifying the possible use cases for
URI mappings, so that we can see the range of the problem we're trying to
solve (and take it beyond the scope of just fixing my problems only!).

I have said previously that the Bean interface should be declared
alpha/unstable. By the sounds of it, we need to declare the xconf
structure unstable too. See separate thread!

Regards, Upayavira