Posted to dev@cocoon.apache.org by Stefano Mazzocchi <st...@apache.org> on 2000/05/23 14:58:48 UTC

Cocoon Offline mode and sitemap changes

Instead of waiting for Paul to initiate the discussion, I went ahead and
did it myself after taking a deep look at his (very nice) code.

Paul added an offline processing module for Cocoon2 and he did it
following the site-walking model of web spiders.

Here is the doc fragment that he added to the sitemap:

  <offline target="target">
    <sitewalker class="org.apache.cocoon.offline.spider.SpiderSiteWalker">
      <startpoint uri="/welcome.html"/>
      <handler type="text/html"
               class="org.apache.cocoon.offline.spider.HtmlMimeHandler"/>
    </sitewalker>
  </offline>

which says:

for offline operation, use the specified sitewalker, start from
/welcome.html and process what's returned as the MIME type text/html
with the given handler.
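
To make that reading concrete, here is a minimal sketch (in Python, just for brevity) of how such a fragment could be turned into a plain configuration structure. The function name read_offline_config and the dictionary keys are hypothetical and are not part of Paul's code; only the element and attribute names come from the fragment above.

```python
import xml.etree.ElementTree as ET

# The fragment from the sitemap, reproduced as a string for the example.
OFFLINE_FRAGMENT = """
<offline target="target">
  <sitewalker class="org.apache.cocoon.offline.spider.SpiderSiteWalker">
    <startpoint uri="/welcome.html"/>
    <handler type="text/html"
             class="org.apache.cocoon.offline.spider.HtmlMimeHandler"/>
  </sitewalker>
</offline>
"""


def read_offline_config(xml_text):
    """Hypothetical helper: collect target, walker, start points and handlers."""
    root = ET.fromstring(xml_text.strip())
    walker = root.find("sitewalker")
    return {
        "target": root.get("target"),
        "walker_class": walker.get("class"),
        # There may be several start points; keep them in document order.
        "startpoints": [s.get("uri") for s in walker.findall("startpoint")],
        # One handler class per MIME type.
        "handlers": {h.get("type"): h.get("class")
                     for h in walker.findall("handler")},
    }


config = read_offline_config(OFFLINE_FRAGMENT)
```

The per-MIME-type handlers table is exactly the part questioned further down: every output format needs its own entry.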

After close analysis, three actors can be identified:

1) the offline generator
2) the crawler
3) the link parser

Paul wrote all three of them and they do their job very well. The problem is
that they are totally XML-unaware. And this is, IMO, a big design fault.

Let's get deeper:

1) the offline generator. The class that implements this is

 org.apache.cocoon.offline.CocoonOffline

I don't have problems in keeping this as it is, but suggestions are
welcome.

2) the crawler (Paul called it sitewalker, but I like crawler much more)

Paul identified the need for multiple crawlers to generate a site. Is
this flexibility syndrome? Should each target have one crawler? Should
we have more than one entry point?

3) the link parser.

This is the most important design decision, and I believe that while
clever, Paul's idea of using MIME-driven link parsing may become very
dangerous. Suppose we generate FO + SVG: do we have to parse it back to
have the links? Do we have to create a link parser for every meaningful
MIME-type our formatters support?

I still believe XLink is the solution.

Cocoon must be able to recognize crawlers and give them the "original"
XML view of the file, before adaptation.

But how can Cocoon enforce the creation of a semantic view before
adaptation?

I believe the sitemap needs to be changed to allow this but I still
don't know how to do it.

Something like

<process uri="hello">
 <source>
  <generator name="file">
   <parameter name="location" value="../hello.xml"/>
  </generator>
 </source>
 <view>
  <filter name="xslt">
   <parameter name="stylesheet" value="..."/>
  </filter>
  <serializer name="html">
   <parameter name="contentType" value="text/html"/>
  </serializer>
 </view>
</process>

<process uri="data/report">
 <source>
  <generator name="file">
   <parameter name="location" value="../report.xsp"/>
  </generator>
  <filter name="xsp">
   <parameter name="logicsheet" value="..."/>
  </filter>
  <filter name="rdf-izer"/>
 </source>
 <view>
  <filter name="xslt">
   <parameter name="stylesheet" value="..."/>
  </filter>
  <serializer name="html">
   <parameter name="contentType" value="text/html"/>
  </serializer>
 </view>
</process>

which indicates -clearly- the difference between an original XML source
and some adapted view (which is optional, of course).

This is because the generator/filter/serializer chain doesn't
indicate clearly _where_ semantic information is added, transformed or
lost, so we must indicate it explicitly.
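
A minimal sketch of what the source/view split buys us, written in Python only to keep it short. The Process class and the stand-in stages are hypothetical; they are not Cocoon classes, and real stages would pass SAX events rather than strings.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# A stage takes the document so far and returns the next version of it.
Stage = Callable[[str], str]


@dataclass
class Process:
    """Hypothetical model of a <process> with <source> and <view> parts."""
    uri: str
    source: List[Stage]
    view: List[Stage] = field(default_factory=list)

    def semantic_view(self, data=""):
        # Only the <source> stages run: this is what a crawler would ask for.
        for stage in self.source:
            data = stage(data)
        return data

    def adapted_view(self, data=""):
        # The <view> stages are applied on top of the semantic view.
        data = self.semantic_view(data)
        for stage in self.view:
            data = stage(data)
        return data


# Stand-ins for a file generator and an XSLT filter plus HTML serializer.
def file_generator(_):
    return "<greeting>hello</greeting>"


def to_html(xml):
    return xml.replace("greeting", "p")


hello = Process(uri="hello", source=[file_generator], view=[to_html])
semantic = hello.semantic_view()
adapted = hello.adapted_view()
```

The point is only that the same process definition can answer two different questions: "what is the original XML?" and "what does this client get?".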

Comments?

-- 
Stefano Mazzocchi      One must still have chaos in oneself to be
                          able to give birth to a dancing star.
<st...@apache.org>                             Friedrich Nietzsche
--------------------------------------------------------------------
 Missed us in Orlando? Make it up with ApacheCON Europe in London!
------------------------- http://ApacheCon.Com ---------------------



Re: Cocoon Offline mode and sitemap changes

Posted by Stefano Mazzocchi <st...@apache.org>.
Paul Russell wrote:
> 
> Hi All,
> 
> Firstly, apologies for the delay, I *started* to write this
> e-mail this morning, and got buried by other things (bank
> managers, mainly). Secondly, I have a feeling this e-mail
> is going to end up rambling somewhat, and for this I
> apologise in advance.
> 
> Thirdly, this is taking things from close to the top. I'm
> explaining a fair bit of the basics of Cocoon2 here to make
> sure that as many people as possible can think about this
> (hey, I'm lazy, if everyone else is using their braincells,
> I can give mine a rest ;)
> 
> Since Stefano took the opportunity to fill you in on what
> I did first time around, I'll not go into too much detail
> on that front (he's pretty much got that licked).
> 
> On Tue, May 23, 2000 at 02:58:48PM +0200, Stefano Mazzocchi wrote:
> > Paul wrote all three of them and they do their job very well. The problem is
> > that they are totally XML-unaware. And this is, IMO, a big design fault.
> 
> Yep, totally agree. I'm still (even now) getting to grips
> with all the semantics of some of the XML architecture,
> particularly XLink etc. The current offline module was
> written in what felt like the only way to do it given
> the current Cocoon2 architecture. Because I was new to
> Cocoon at the time, I didn't really feel confident enough
> to start suggesting changes to the sitemap ;)

I totally agree, and I find your approach very interesting because it is
radically different from the Stylebook mindset (which I find powerful
but very misleading in many ways).
 
> > 1) the offline generator. The class that implements this is
> >  org.apache.cocoon.offline.CocoonOffline
> > I don't have problems in keeping this as it is, but suggestions are
> > welcome.
> 
> I'm not keen on the way CocoonOffline works, currently. At
> present, it extends Cocoon, and I think this is semantically
> dubious. I'd much rather it *used* Cocoon. Again, the reason
> it was done like this initially was (a) to get it out the door,
> and (b) to avoid having to change too much of Pier's code. Now
> Cocoon2 is a bit more open, I think we can move it over to what
> IMO makes more sense. Do you guys agree that content providers
> (servlets, offline, [something else?!]) should *use*, rather
> than extend Cocoon?

No, I don't.

Cocoon (just like Servlets) is based on the "inversion of control"
design principle:

 "don't call me, I'll call you."

In fact, servlets are called by the servlet engine, generators are
called by Cocoon, etc...

This is why servlets _should_not_ call Cocoon as an API but transform
themselves into Cocoon components and let Cocoon call them.

Why? Simply because it's _much_ easier to integrate stuff that way,
without you having to build the whole pipeline yourself.
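
As a toy illustration of the principle (Python for brevity; MiniCocoon and hello_generator are made-up names, not Cocoon APIs): the component only registers itself, and the framework decides when to call it and what to do with the result.

```python
class MiniCocoon:
    """Toy stand-in for the framework: it owns the control flow."""

    def __init__(self):
        self._generators = {}

    def register(self, name, generator):
        # Components hand themselves over to the framework...
        self._generators[name] = generator

    def process(self, name, request):
        # ...and the framework calls them when a request needs them.
        xml = self._generators[name](request)
        # Stand-in for the rest of the pipeline the component never sees.
        return "<html>" + xml + "</html>"


def hello_generator(request):
    # The component knows nothing about filters or serializers.
    return "<greeting>hello %s</greeting>" % request["user"]


cocoon = MiniCocoon()
cocoon.register("hello", hello_generator)
page = cocoon.process("hello", {"user": "Paul"})
```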

But I agree that offline generation is somewhat different. (well, more
below)
 
> > 2) the crawler (Paul called it sitewalker, but I like crawler much more)
> > Paul identified the need for multiple crawlers to generate a site. Is
> > this flexibility syndrome?
> 
> This was me saying "I don't like crawling the site; there must
> be a better way, but I can't think what it is just yet, so I'm
> going to abstract that away as much as possible." I *think* I
> ended up polluting the abstraction slightly, looking back on it.

Site crawling and web robots are perceived as poor agents, but only
because HTML carries very little semantic meaning and bot-aware
behavior is almost nonexistent.

But if both Crawler and Server are written in the same package and are
aware of each other, all the problems robots have just disappear.
 
> > Should each target have one crawler? Should we have more than
> > one entry point?
> 
> Not sure what Stefano means exactly here, so if I've
> misunderstood, ignore me ;).
> 
> The primary reason for having multiple 'startpoints' as I called
> them was because sometimes a crawler won't find a certain page,
> either because the link is absolute (which my crawler simply
> ignored on the basis that it wasn't safe to try and handle that)
> or because there simply isn't a way of getting to it from the
> root.
> 
> When I refer to a 'target', I mean 'somewhere to put the result'.
> The initial implementation focused on output to a particular
> directory, specified by the 'target' attribute of the 'offline'
> tag. One possibility I looked at was to abstract the target
> so that the module could pump code directly onto a webserver,
> or store it in a database (mummy! scary!) or HTTP PUT it, or
> whatever other interesting ideas you guys come up with. For me,
> this is a double edged sword. Most of me says KISS (Keep It
> Simple, Stupid), and keep it to filesystems, the other side
> of me (the OO design side) goes with the Lock Stock and Two
> Smoking Barrels principle:
> 
>    "If it moves, abstract it; if it doesn't move, abstract
>     it anyway... Understand? Good, cos if you don't, I'm
>     gonna abstract ya."
> 
> What do you guys think? Are the potential risks of letting
> people target whatever they want (and risking codebase bloat)
> worth it? I'm inclined to say abstract it, but don't include
> any targets other than FileSystemTarget in the base system,
> unless we're really really sure it's A Good Thing.

My KISS principle would be

CocoonOffline [user-agent mask] [starting uri] [output directory]

and no offline parameters in the sitemap.

Or, some sort of Linkset that provides a complete "crawling base" for
the robot.

But I'm wide open to suggestions here.
 
> > 3) the link parser.
> >
> > This is the most important design decision, and I believe that while
> > clever, Paul's idea of using MIME-driven link parsing may become very
> > dangerous. Suppose we generate FO + SVG: do we have to parse it back to
> > have the links? Do we have to create a link parser for every meaningful
> > MIME-type our formatters support?
> 
> Again, I agree. (Stefano, could you kindly stop being right
> all the damned time? ;)
> 
> > I still believe XLink is the solution.
> 
> *Again*, I agree. (see above)
> 
> What I can't quite get my head around is how to actually get XLink
> into the equation. Linking one XML document to another is quite
> another thing to preserving those links through the XSLT translations
> that we're putting them through before they get to the client
> (which in the case of the offline code, happens to be a file)
> and working out what the request we need to give to the Cocoon
> object to generate the required result is.

I think that link translation is a very _big_ design mistake. I will
explain it better in a later email.
 
> ===============
> 
> Okay, at this point, I'm going to leave Stefano's e-mail, and
> basically explain the issues as I see them for offline
> generation. There is a distinct possibility I'll drift off
> into a few other things I've been thinking about recently,
> but consider them to be Random Thoughts (© Stefano ;).
> 
> Both Cocoon1.x and Cocoon2 work on what I call the "request,
> response" principle. This works absolutely wonderfully for
> servlets and most other internet/web based scenarios, however
> it isn't ideal for offline work.
> 
> Let's turn the thing on its head.
> 
> How do I make an offline site? Well, I take a load of XML
> sources (notice they aren't necessarily static), I transform
> them and manipulate them in various ways, and then I serialize
> them into their final binary file format. The sitemap as it
> stands takes us about half way there. Given a target URI and
> a source URI, it can tell me what to do to get from one to
> the other.
> 
> At present, the sitemap works like this:
> 
> <sitemap>
>   <partition>
>     <process uri="/**" src="**.xml">
>       <generator name="file"/>
>       <filter name="xslt">
>         <parameter
>           name="stylesheet"
>           value="def.xsl"/>
>       </filter>
>       <serializer name="html">
>         <parameter
>           name="contentType"
>           value="text/html"/>
>       </serializer>
>     </process>
>   </partition>
> </sitemap>
> 
> So, what does this actually mean? To follow this, it might help
> to understand how Cocoon2 requests work...
> 
> Request Object  \
> Response Object  |--> Cocoon.process()
> Output Stream   /
> 
> Cocoon then works out from the sitemap (well, technically this
> is all handled within the Sitemap and SitemapPartition classes,
> but that's fairly academic at this stage) what the src URI is,
> and what processes to put the XML found in that source URI
> through.
> 
> So, for example, using the above sitemap, say I requested
> '/index'. Cocoon2 would work out that the XML source came from
> a generator called 'file', and the URI to give that generator
> is 'index.xml' (note the matching sets of asterisks).
> It would then parse the XML file, and pass the resulting SAX
> stream through an XSLT translator and into an HTML serializer.
> 
> For servlets, and other 'live' requests, this works great.
> When a user asks for something, we generate it. If we can't,
> we keel over.
> 
> The problem comes when we attempt to do things the other way
> around. When we're generating a site offline, we have to
> work out all the possible combinations of requests users could
> throw at Cocoon. In the above case, where the XML is coming
> from a file, it's trivial - we just translate backwards from
> the files we can see on disk. If, however, the XML content
> comes from somewhere else, or we're using matching code that
> enables 'many to one' mappings, the whole thing falls apart.
> We can't 'guess' what 'source' URIs the generator supports,
> and we can't translate the 'one' to the 'many' without
> generating every permutation. I don't know about you lot, but
> I don't have a quantum computer, so I don't fancy that last
> option ;)
> 
> The only answer I've come up with so far, is to 'Spider' or
> 'crawl' the site, in a similar way to my initial
> implementation. If anyone can think of a better one, I'd love it,
> HintHint (any Wiki fans out there? ;).
> 
> Now, the way this worked in my implementation was to spider
> the *result* (post serialization) of the request, depending
> on mime type. This worked well(ish) for HTML, but it isn't
> going to work nearly well enough long term. How can I excuse
> myself? I was young and naive, and it seemed like a sensible
> solution at the time ;)
> 
> As Stefano has said, XLink is the answer. This would enable
> the offline processor (name please!!) to spider over the
> source XML relatively easily. The problem with this comes
> with pluggable matchers - what if one source XML file/
> generator produces a number of target documents? 

Ok, let's try to build constraints, otherwise we'll never cover any
ground at all.

I think Paul's idea of "crawling to avoid permutating" is the best idea
so far in this area since Stylebook. Both Pier and I thought about it
several times but failed to see the power of crawling when combined with
XLink and semantic views.

The problem was that we were so aware of the abstraction problems that we
could not find a way to crawl the site in any significant way. But your
idea, mixed with a newer XLink working draft, opened my eyes.

On the other hand, there is _NO_POSSIBLE_WAY_ for an offline processor
to match Cocoon behavior completely. So we MUST NOT TRY to do so.

Let's make an example:

Suppose you have a site that is half dynamic and half static and you
support both HTML and WAP clients (going to be very common in Europe as
soon as UMTS gets implemented).

Now, you want to have a "site snapshot" that is able to work for both
user-agents at the same time, out of the same URI space.

This is not just hard, it's simply impossible for Cocoon by itself! You need to
do some user-agent matching, at the very least.

A possible solution?

CocoonOffline behave-like-mozilla / /www/mysite.com/html/
CocoonOffline behave-like-wap     / /www/mysite.com/wap/

then use mod_rewrite (or equivalent) to redirect from the root page into
one or the other. Or you could use virtual hosts, or anything.

But CocoonOffline _must_ maintain the crawling behavior neutral during
its operation.

> This might
> not seem like that likely an occurrence, but Cocoon2 is
> designed to be a pretty damned serious piece of kit. 

you bet :)

> I fully intend to be generating title images, SVG
> visualisations, and god knows what else. Some of this is
> likely to come from inline data (particularly the title
> images), and so we have to consider this.

This is what stopped me from considering crawling in the first place: there
were too many degrees of freedom and too much XLink fog around.
 
> It's at this point that I get a bit stuck. I can't see a
> way around this problem. I could really do with you guys
> having a good hard think about it, to see if you visualise
> a way around it. I might just be being stupid or missing
> something simple (heck, I *hope* that's the case :) but
> I could do with a bit of external input on it, frankly.

Ok, we should set a couple of things in stone:

1) Cocoon shouldn't do link rewriting.
2) CocoonOffline mimics a single user-agent at a time.

Ok, that said, we need:

a) a way to ask for "the original semantic view" as well as "the adapted
view for the current request parameters".

this implies:

b) a way to discriminate the semantic view from the adapted view in the
sitemap so that Cocoon can know what to do. 

So, the crawling algorithm should be:

1) start from the location given on the command line or in the crawler
configuration parameters.
2) create a fake CocoonRequest depending on the user-agent parameters
that we would like to fake.
3) current URI = starting URI
4) ask for the semantic view of the current URI
5) parse the returned XML for xlinks
6) ask for the adapted view of the current URI
7) save the obtained response on disk
8) for each unprocessed link, make it the current URI and go to 4)
9) stop when all links have been processed

Even if this requires some interesting code to be written (especially
for #4), it works provided that documents are written using xlink
attributes (either directly or in their DTD/Schema).
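
The loop above can be sketched roughly as follows (Python, for brevity). This is only an illustration under assumptions: semantic_view, adapted_view and save are hypothetical callbacks standing in for whatever Cocoon would expose, the fake request is a plain dictionary, and links are assumed to be absolute paths within the site, so no relative URI resolution is done.

```python
import xml.etree.ElementTree as ET
from collections import deque

# Attribute name as ElementTree reports it for xlink:href.
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"


def extract_xlinks(xml_text):
    """Step 5: collect every xlink:href found in the semantic view."""
    root = ET.fromstring(xml_text)
    return [el.get(XLINK_HREF) for el in root.iter()
            if el.get(XLINK_HREF) is not None]


def crawl(start_uri, user_agent, semantic_view, adapted_view, save):
    request = {"user-agent": user_agent}        # step 2: fake request
    pending = deque([start_uri])                # steps 1 and 3
    seen = {start_uri}
    while pending:                              # steps 8 and 9
        uri = pending.popleft()
        links = extract_xlinks(semantic_view(uri, request))  # steps 4, 5
        save(uri, adapted_view(uri, request))   # steps 6 and 7
        for link in links:
            if link not in seen:                # only unprocessed links
                seen.add(link)
                pending.append(link)
    return seen


# A three-page in-memory "site" used only for the example.
NS = 'xmlns:xlink="http://www.w3.org/1999/xlink"'
SITE = {
    "/": '<page %s><a xlink:href="/about"/><a xlink:href="/news"/></page>' % NS,
    "/about": '<page %s><a xlink:href="/"/></page>' % NS,
    "/news": "<page/>",
}
saved = {}
crawl("/", "behave-like-mozilla",
      semantic_view=lambda uri, req: SITE[uri],
      adapted_view=lambda uri, req: "<html>%s for %s</html>" % (uri, req["user-agent"]),
      save=saved.__setitem__)
```

Note that the crawler never parses the adapted output: links come only from the semantic view, and the user-agent stays fixed for the whole run.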

There is only one big problem.

Since Cocoon doesn't do link translation for dynamic operation, the URI
spaces should avoid using extensions like the plague. For this reason, you
end up having a bunch of resources on disk with no extension, which is
usually a big pain for web servers since they have to use mod_mime_magic to
find out which MIME type the resource has.

This wouldn't be a problem on more modern file systems where the MIME type
could be attached directly at the FS level, instead of using the hack of
modifying the file name... unfortunately, such file systems are not available
on most OS platforms.

A solution for this problem is to augment the views for resources:

1) semantic view
2) on-line view
3) off-line view

which indicates that off-line viewing should perform "MIME-type -> .ext"
expansion at the semantic level, introducing link translation where
possible (before losing the semantic linking information due to format
transformations).
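
A rough sketch of the "MIME-type -> .ext" expansion (Python, for brevity). The EXTENSIONS table, the function names and the choice of "index" as the file name for directory URIs are all assumptions made for the example, not something Cocoon defines.

```python
import posixpath

# Hypothetical MIME type to extension table; a real one would be configurable.
EXTENSIONS = {
    "text/html": ".html",
    "text/vnd.wap.wml": ".wml",
    "application/pdf": ".pdf",
}


def offline_name(uri, mime_type):
    """Map an extension-less URI to the file name used by the off-line view."""
    ext = EXTENSIONS.get(mime_type, "")
    if uri.endswith("/"):
        # Assumption: directory URIs are stored as an index file.
        return uri + "index" + ext
    if posixpath.splitext(uri)[1]:
        # The URI already carries an extension; leave it alone.
        return uri
    return uri + ext


def translate_links(links, mime_types):
    """Build the link translation table while the links are still semantic."""
    return {link: offline_name(link, mime_types[link]) for link in links}


table = translate_links(
    ["/", "/data/report", "/logo.png"],
    {"/": "text/html", "/data/report": "text/html", "/logo.png": "image/png"},
)
```

Unknown MIME types fall through unchanged, which leaves those resources in the same situation as the extension-less files described above.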

Sheesh, sounds like the end, doesn't it?

Now we just have to redesign the sitemap to match these needs and
write millions of lines of code to implement it :)

But I think it would be worth the effort, don't you think?

> Okay, it's now gone midnight over here, and I think it's
> time I got some sleep. I hope the above has given everyone
> something to think about, and hasn't confused people even
> more!!
> 
> All thoughts *very* gratefully received <g>

Same here :)




Re: Cocoon Offline mode and sitemap changes

Posted by Stefano Mazzocchi <st...@apache.org>.
Ross Burton wrote:

> > What do you guys think? Are the potential risks of letting
> > people target whatever they want (and risking codebase bloat)
> > worth it? I'm inclined to say abstract it, but don't include
> > any targets other than FileSystemTarget in the base system,
> > unless we're really really sure it's A Good Thing.
> 
> I'd say pluggable.  A DAV target would be very cool, then I could run the
> offline generator (can we call it OG for now?) and it would put the pages
> directly on the web server.

Yes, but let's keep this for 2.1 or later, ok? We already have enough to
do.
 
> > What I can't quite get my head around is how to actually get XLink
> > into the equation. Linking one XML document to another is quite
> > another thing to preserving those links through the XSLT translations
> > that we're putting them through before they get to the client
> > (which in the case of the offline code, happens to be a file)
> > and working out what the request we need to give to the Cocoon
> > object to generate the required result is.
> 
> I think the <source> and <view> tags which Stefano suggested are required.
> We spider the source, then cache the view.  Very verbose, but I can't see
> any other way of doing this.  For example, a page may begin as a file read
> from disk, then go through XSP, then LDAP, then XSLT to XHTML, and then be
> serialized.  The source view would still need to go through the XSP and LDAP
> filters, but not XHTML.

Exactly!
 
> > The only answer I've come up with so far, is to 'Spider' or
> > 'crawl' the site, in a similar way to my initial
> > implementation. If anyone can think of a better one, I'd love it,
> > HintHint (any Wiki fans out there? ;).
> 
> I liked the Stylebook <book> idea for small sites, but I don't think it will
> scale well at all.  I think crawling the site from a number of start points
> (so that URLs which are not linked to but required are crawled too) is the
> only solution.

Yes, totally.

Stylebook is incredibly simple and powerful "if you do what the skin
designer wanted you to do".

In practice, it's hardcore XSLT programming: not impossible to write for
a good XSLT programmer, but a total pain in the ass to manage.

Completely impossible for anybody but a hard-core programmer. In fact,
Stylebook placed _much_ of the site-generation logic directly into the
stylesheet, providing the worst possible case of context overlapping:
style and logic.

Instead, Cocoon is designed to totally overlap (of course) but also to
remove the need for a direct contract between logic and style contexts.

This is why Stylebook was abandoned.

> > As Stefano has said, XLink is the answer. This would enable
> > the offline processor (name please!!) to spider over the
> > source XML relatively easily. The problem with this comes
> > with pluggable matchers - what if one source XML file/
> > generator produces a number of target documents? This might
> > not seem like that likely an occurrence, but Cocoon2 is
> > designed to be a pretty damned serious piece of kit. I
> > fully intend to be generating title images, SVG
> > visualisations, and god knows what else. Some of this is
> > likely to come from inline data (particularly the title
> > images), and so we have to consider this
> 
> The code I was planning on writing (I was waiting for Pier's "recent"
> commit) implemented the matchers, as a test I was writing IPAddressMatcher
> and UserAgentMatcher.  Both are capable of turning a single request to a set
> of responses.  Could be fun...

Can you elaborate more?
