Posted to dev@forrest.apache.org by Nicola Ken Barozzi <ni...@apache.org> on 2002/12/13 17:31:59 UTC

Cocoon CLI - how to generate the whole site (Re: The Mythical Javadoc generator (Re: Conflict resolution))

Jeff Turner wrote:

> The javadocs are _already_ generated, and <javadoc> has already put them
> in build/site/apidocs/.  Now how is Cocoon (via the CLI) going to
> "publish" them?

Ok, now we finally get to the actual technical point. I will take this 
discussion in a general way, because the issue is in fact quite general.

                               -oOo-

ATM, the Cocoon CLI system is completely crawler based. This means that
it starts from a list of URLs, and "crawls" the site by getting the
links from these pages, putting them in the list, purging the visited
ones, and restarting the process with those.
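
The crawl loop described above can be sketched roughly as follows. This
is only an illustration of the worklist idea, not the Cocoon CLI code:
the function names, the get_links callback and the toy SITE table are
all assumptions made for the example.

```python
from collections import deque

def crawl(start_uris, get_links):
    """Visit every URI reachable from start_uris exactly once.

    get_links(uri) is a hypothetical callback that renders one page
    and returns the URIs that page links to.
    """
    pending = deque(start_uris)   # URIs still to be generated
    visited = set()               # URIs already generated
    order = []
    while pending:
        uri = pending.popleft()
        if uri in visited:
            continue              # purge the ones already visited
        visited.add(uri)
        order.append(uri)
        for link in get_links(uri):
            if link not in visited:
                pending.append(link)
    return order

# A toy site standing in for real rendered pages (hypothetical data).
SITE = {
    "index.html": ["faq.html", "apidocs/index.html"],
    "faq.html": ["index.html"],
    "apidocs/index.html": [],
}
print(crawl(["index.html"], lambda uri: SITE.get(uri, [])))
# prints ['index.html', 'faq.html', 'apidocs/index.html']
```

Anything that no page links to is never reached by such a loop, which
is exactly the difficulty with files that should be published in bulk.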

If we only have XML documents, the system can be made to be very fast 
and semantically rich.

   - fast
    if we get the links while processing the file, we don't
    have to reparse it later for the crawling

   - semantically rich
     we get the links not from the output, but from the real source.
     In the sitemap, the source content, with all semantics, is
     tagged and used for the link gathering. So we can even gather
     links from an svg file that will become a jpeg image!
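
As a rough sketch of the "gather links while processing" idea, the
snippet below collects hrefs during a single streaming pass over the
source XML, so the rendered output never has to be reparsed. The
element name "link" and the sample document are assumptions for the
example; in Cocoon the gathering would happen inside the SAX pipeline
rather than in a standalone function like this.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical source document, before any transformation to HTML/PDF.
SOURCE = """<document>
  <body>
    <p>See the <link href="faq.html">FAQ</link> and the
       <link href="apidocs/index.html">API docs</link>.</p>
  </body>
</document>"""

def collect_links(xml_bytes):
    """Return the href of every <link> element, in document order."""
    links = []
    # "start" events fire as each element opens; attributes are
    # already available at that point, so one pass is enough.
    for _event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("start",)):
        if elem.tag == "link" and "href" in elem.attrib:
            links.append(elem.attrib["href"])
    return links

print(collect_links(SOURCE.encode("utf-8")))
# prints ['faq.html', 'apidocs/index.html']
```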

Things start breaking down a bit when we have to use resources that are
not transformed to XML. Examples are CSS and massive docs to be included 
like javadocs.

The problem is not *reading* these files via Cocoon, but getting the
links from them. In the case of CSS we need the links; in the case of
Javadocs, we know the dir structure and possibly would not need them.

For the CSS, the best thing is actually parsing them and passing them in 
the SAX pipeline. I see no technical nor conceptual problem with it.

The problem arises when we need to pass files in "bulk". In this case 
they are javadocs, but what about jars, binaries, images, all things 
that are not necessarily linked in the site, or that we simply want to 
dump in the resulting system?

This is the answer that I seek.

-- 
Nicola Ken Barozzi                   nicolaken@apache.org
             - verba volant, scripta manent -
    (discussions get forgotten, just code remains)
---------------------------------------------------------------------

Re: File prefix again (Re: Cocoon CLI - how to generate the whole site)

Posted by Nicola Ken Barozzi <ni...@apache.org>.

Jeff Turner wrote:
> On Mon, Dec 16, 2002 at 04:30:47PM +0100, Nicola Ken Barozzi wrote:
> ...
> 
>>>The way it works is a hack. I like the file: approach much better, 
>>
>>Why?
>>
>>I'm a user. I take a file. Put it in the directory. Link to it. See it 
>>in the result.
>>
>>What do you not like of this? Why is it better if I write the link with 
>>file: in it?
> 
> 
> You are perfectly right, this _should_ be how it works.  It is simple and
> intuitive.

Ok then, let's make it work :-)

> But think about it: when you said "I take a file. Put it in the
> directory. Link to it.", you're admitting that you're linking to the
> _Source_ URI.  Which is good, because you shouldn't be relying on the
> destination location.  

Ok, you have my support here. Links are always made relative to the
current source location. I like this.

> But unfortunately, unprefixed links have a
> 'cocoon:' scheme, so <link href="index.pdf"> will not link to
> src/documentation/content/index.pdf.  That is why we need this file:
> prefix.

This is an implementation problem, not a conceptual one.

It could be that we will be forced by ignorance and impotence to use it 
because we cannot find a technical way of dealing with it.
But IMHO we are not there yet.

resource-exists is not a hack IMHO. If the user can put any file in the
directory and wants it to be picked up by *name without extension*, we
cannot do without it, because we don't have enough metadata in the
filesystem to keep MIME types alongside files; the info is instead
encoded in the file itself and in its name. Thus, this info has to be
collected via *probing*, which is what resource-exists and CAPs do.

I should be able to ask the source to give me a file, without extension,
and have it tell me what MIME type it is and other info, and then process
it based on that. Not having this, we use resource-exists. If you have a
better method of probing, I'm all for it.
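
To make the probing idea concrete, here is a minimal sketch that looks
up a file by name without extension and guesses its MIME type from
whatever extension it finds. It is not how resource-exists is
implemented in Cocoon; the probe function and the sample file name are
assumptions for illustration only.

```python
import mimetypes
import os
import tempfile

def probe(directory, name):
    """Find a file called `name` plus any extension and guess its type.

    Returns (filename, mime_type), or None when nothing matches.
    When several files share the name, the first in sorted order wins.
    """
    matches = sorted(
        entry for entry in os.listdir(directory)
        if os.path.splitext(entry)[0] == name
    )
    if not matches:
        return None
    filename = matches[0]
    # The type is inferred from the extension, because the filesystem
    # itself stores no MIME metadata alongside the file.
    mime_type, _encoding = mimetypes.guess_type(filename)
    return filename, mime_type

with tempfile.TemporaryDirectory() as content_dir:
    open(os.path.join(content_dir, "manual.pdf"), "w").close()
    print(probe(content_dir, "manual"))
    print(probe(content_dir, "missing"))
```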

If file systems had proper metadata, we wouldn't need all this, but 
these "hacks" as you call them are necessary given the reality of things.

-- 
Nicola Ken Barozzi                   nicolaken@apache.org
             - verba volant, scripta manent -
    (discussions get forgotten, just code remains)
---------------------------------------------------------------------

Re: File prefix again (Re: Cocoon CLI - how to generate the whole site)

Posted by Jeff Turner <je...@apache.org>.

On Mon, Dec 16, 2002 at 04:30:47PM +0100, Nicola Ken Barozzi wrote:
...
> >The way it works is a hack. I like the file: approach much better, 
> 
> Why?
> 
> I'm a user. I take a file. Put it in the directory. Link to it. See it 
> in the result.
> 
> What do you not like of this? Why is it better if I write the link with 
> file: in it?

You are perfectly right, this _should_ be how it works.  It is simple and
intuitive.

But think about it: when you said "I take a file. Put it in the
directory. Link to it.", you're admitting that you're linking to the
_Source_ URI.  Which is good, because you shouldn't be relying on the
destination location.  But unfortunately, unprefixed links have a
'cocoon:' scheme, so <link href="index.pdf"> will not link to
src/documentation/content/index.pdf.  That is why we need this file:
prefix.
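
A small sketch of the distinction, assuming that a link without a
prefix is treated as 'cocoon:' and everything else keeps its own
scheme. The split_link helper and DEFAULT_SCHEME are hypothetical
names, not Forrest or Cocoon APIs.

```python
from urllib.parse import urlsplit

# Assumption for this sketch: unprefixed links are resolved through
# the sitemap, i.e. they carry an implied 'cocoon:' scheme.
DEFAULT_SCHEME = "cocoon"

def split_link(href):
    """Return (scheme, path) for a link href, applying the default."""
    parts = urlsplit(href)
    scheme = parts.scheme or DEFAULT_SCHEME
    return scheme, parts.path

print(split_link("index.pdf"))        # prints ('cocoon', 'index.pdf')
print(split_link("file:index.pdf"))   # prints ('file', 'index.pdf')
```

With the default in place, "index.pdf" and "file:index.pdf" name two
different sources even though the paths are identical.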

--Jeff

Re: File prefix again (Re: Cocoon CLI - how to generate the whole site)

Posted by Jeff Turner <je...@apache.org>.

On Wed, Dec 18, 2002 at 04:57:00PM +0100, Nicola Ken Barozzi wrote:
> 
> Jeff Turner wrote:
> >On Thu, Dec 19, 2002 at 02:17:40AM +1100, Jeff Turner wrote:
> >...
> >
> >>Popping the argument stack a bit, remember that this whole silly example
> >>of index.xml/index.pdf is a pathological case, that won't have the
> >>desired effect no matter what the URI is.  You have ignored my main
> >>argument, that the 'cocoon:' prefix is implicit and _conceptually_ a
> >>file: scheme is required.
> >
> >For your convenience, here is the conceptual justification for 'file:',
> >11 emails ago:
> [...]
> ><<<<<
> >
> >To that, your response started:
> >
> >>First distinction: schemes are not IMV in the source URI space, but in
> >>the destination URI space
> >
> >In the intervening 11 emails, I hope I have at least convinced you of the
> >wrongness of that statement, and hence the position you held back then,
> >based on it.
> 
> I have already said that I have changed my mind on this particular 
> point.

Then do please respond to the snippet, and point out exactly where my
logic fails.  It is a clear set of logical inferences.

> Moreover, there were other comments in that mail, and on those the
> discussion has not changed my mind.
> 
> A part that is still being discussed, for example, started here
> 
> "...since we have decided that link URIs should not end in extensions, 
> because of many reasons one of which is the fact that a URI can 
> reference different formats at different times in history, having a 
> scheme that effectively makes me serve two different versions of the 
> same file is totally off-target.
> "

Extensions describe _what_ the file content is.  Schemes describe how to
get the resource.  They are not the same.  The "extensions are bad"
argument (which, if you recall, was my answer to your "let's have multiple
extensions") has no relevance here.  I described at length the solution
to "different formats at different times": have multiple output URIs.
However that is an implementation issue; the conceptual issue is the bit
you ignored the first time, and snipped this time.


> Address those. I do change my mind. But I have to be convinced, as 
> everyone here.

Strangely I don't see them -1'ing things.


--Jeff

> Don't try to short-circuit the discussion because it simply doesn't
> work.
>

Re: File prefix again (Re: Cocoon CLI - how to generate the whole site)

Posted by Nicola Ken Barozzi <ni...@apache.org>.

Jeff Turner wrote:
> On Thu, Dec 19, 2002 at 02:17:40AM +1100, Jeff Turner wrote:
> ...
> 
>>Popping the argument stack a bit, remember that this whole silly example
>>of index.xml/index.pdf is a pathological case, that won't have the
>>desired effect no matter what the URI is.  You have ignored my main
>>argument, that the 'cocoon:' prefix is implicit and _conceptually_ a
>>file: scheme is required.
> 
> For your convenience, here is the conceptual justification for 'file:',
> 11 emails ago:
[...]
> <<<<<
> 
> To that, your response started:
> 
>>First distinction: schemes are not IMV in the source URI space, but in
>>the destination URI space
> 
> In the intervening 11 emails, I hope I have at least convinced you of the
> wrongness of that statement, and hence the position you held back then,
> based on it.

I have already said that I have changed my mind on this particular
point. Moreover, there were other comments in that mail, and on those
the discussion has not changed my mind.

A part that is still being discussed, for example, started here

"...since we have decided that link URIs should not end in extensions, 
because of many reasons one of which is the fact that a URI can 
reference different formats at different times in history, having a 
scheme that effectively makes me serve two different versions of the 
same file is totally off-target.
"

Address those. I do change my mind. But I have to be convinced, like
everyone here. Don't try to short-circuit the discussion because it
simply doesn't work.

-- 
Nicola Ken Barozzi                   nicolaken@apache.org
             - verba volant, scripta manent -
    (discussions get forgotten, just code remains)
---------------------------------------------------------------------

Re: File prefix again (Re: Cocoon CLI - how to generate the whole site)

Posted by Jeff Turner <je...@apache.org>.

On Thu, Dec 19, 2002 at 02:17:40AM +1100, Jeff Turner wrote:
...
> Popping the argument stack a bit, remember that this whole silly example
> of index.xml/index.pdf is a pathological case, that won't have the
> desired effect no matter what the URI is.  You have ignored my main
> argument, that the 'cocoon:' prefix is implicit and _conceptually_ a
> file: scheme is required.

For your convenience, here is the conceptual justification for 'file:',
11 emails ago:

>>>>>
> Why would we need to rewrite "file:"s?

Given the above definition, what do you think the implied scheme for
<link href="hello.pdf"> is?  What syntactic and semantic restrictions are
there?  Can we link to anything?  No: we can only link to URIs defined by
sitemap rules.  Therefore the implied scheme is 'cocoon:'.  I need to
invoke Cocoon to get 'hello.pdf'.  If my editor were written in Java as
an Avalon component, it might really be able to invoke Cocoon and
retrieve 'hello.pdf'.

What about when a file is sitting on my harddisk?  Do I need Cocoon to
view it?  No; I can open it in an editor.  Hence the 'file:' protocol is
implied.  In fact, in vim I can type 'gf' and automatically traverse the
link.  My editor is a 'browser' of the Source URI space, just like
Mozilla browses the Destination URI space.

That is the important concept: the Source URI space is distinct from the
Destination URI space.  In the Source URI space (XML docs + <link>
elems), we have all sorts of schemes (linkmap:, java:, file:, person:
etc), but in the Destination URI space (HTML docs + <a> elems), we have
only one protocol, usually http: or file:.

I described this notion of separating the Source and Destination URI
space in a RT: http://marc.theaimsgroup.com/?t=103959284100002&r=1&w=2

So that is the theory: it is better to have an explicit file: scheme,
because it distinguishes those URIs from the implied 'cocoon:' scheme,
and fits in better in a world where there are schemes everywhere.

<<<<<

To that, your response started:

> First distinction: schemes are not IMV in the source URI space, but in
> the destination URI space

In the intervening 11 emails, I hope I have at least convinced you of the
wrongness of that statement, and hence the position you held back then,
based on it.

--Jeff

Re: File prefix again (Re: Cocoon CLI - how to generate the whole site)

Posted by Jeff Turner <je...@apache.org>.

On Wed, Dec 18, 2002 at 04:52:34PM +0100, Nicola Ken Barozzi wrote:
> 
> Jeff Turner wrote:
> >On Wed, Dec 18, 2002 at 03:23:03PM +0100, Nicola Ken Barozzi wrote:
> >...
> >
> >>>Firstly: do you agree that there _are_ two Sources?  That the user
> >>>_could_ create an index.pdf?  In fact, considering that the user isn't
> >>>meant to know that index.xml even *has* a PDF rendition, why shouldn't
> >>>they create an index.pdf?
> >>
> >>I don't agree here. The user creates documents to explain a concept. 
> >>"index" means it's the index.
> >
> >Since when do semantics come into the business of ensuring every source
> >has a URI?
> 
> A source is a piece of information. The name is a token that identifies 
> that piece of information.

Identification has absolutely zippo to do with meaning.  I can create
good URIs and I can create bad URIs.  Forrest should allow both, but
discourage the latter.

> It is placed in a context that is also named (directory). Where you
> place it has a sense -> semantics. The path is a moniker to what the
> piece of information *means*.
> 
> >Fact: users _can_ create an index.pdf.  Whether this is a good idea is
> >irrelevant: as a source of content, it deserves a source URI.
> 
> I'd say that from the discussion it comes out that users should not be 
> allowed to do it, and a check done as part of the validation, to ensure 
> that double-named files are not there.
>
> >We can
> >then say, "by the way, it's really dumb creating index.pdf when you've
> >got index.xml", but that's a layer above the raw URI space addressing
> >issue.
> 
> Not IMHO. Since we decided to link to "concepts", we have actually IMHO 
> decided that it's the filename that identifies the file, without the 
> extension.

That does not follow at all.  *Only* URIs starting with 'linkmap:' are
semantic URIs.  A linkmap is a map from semantic addresses to source
filenames.  Let's say we have the following linkmap:

<site>
  <welcome src="index.xml"/>
  <product_catalog src="index.pdf"/>
</site>

A contrived example: imagine I have a product cataloging tool that
insists on naming its output 'index.pdf'.  With the above linkmap, I have
mapped two different concepts to two different sources.  Who cares if the
filenames are similar?
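
As an illustration only, the linkmap above could be resolved like this.
The 'linkmap:welcome' address syntax and both helper functions are
assumptions made for the sketch, not an existing Forrest API.

```python
import xml.etree.ElementTree as ET

LINKMAP = """<site>
  <welcome src="index.xml"/>
  <product_catalog src="index.pdf"/>
</site>"""

def load_linkmap(xml_text):
    """Map each semantic name (element name) to its source filename."""
    root = ET.fromstring(xml_text)
    return {child.tag: child.get("src") for child in root}

def resolve(href, linkmap):
    """Turn a 'linkmap:' address into a source filename.

    Any other href is returned unchanged.
    """
    prefix = "linkmap:"
    if not href.startswith(prefix):
        return href
    return linkmap[href[len(prefix):]]

linkmap = load_linkmap(LINKMAP)
print(resolve("linkmap:welcome", linkmap))          # prints index.xml
print(resolve("linkmap:product_catalog", linkmap))  # prints index.pdf
```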


--Jeff

Re: File prefix again (Re: Cocoon CLI - how to generate the whole site)

Posted by Nicola Ken Barozzi <ni...@apache.org>.

Jeff Turner wrote:
> On Wed, Dec 18, 2002 at 03:23:03PM +0100, Nicola Ken Barozzi wrote:
> ...
> 
>>>Firstly: do you agree that there _are_ two Sources?  That the user
>>>_could_ create an index.pdf?  In fact, considering that the user isn't
>>>meant to know that index.xml even *has* a PDF rendition, why shouldn't
>>>they create an index.pdf?
>>
>>I don't agree here. The user creates documents to explain a concept. 
>>"index" means it's the index.
> 
> Since when do semantics come into the business of ensuring every source
> has a URI?

A source is a piece of information. The name is a token that identifies 
that piece of information. It is placed in a context that is also named 
(directory). Where you place it has a sense -> semantics. The path is a 
moniker to what the piece of information *means*.

> Fact: users _can_ create an index.pdf.  Whether this is a good idea is
> irrelevant: as a source of content, it deserves a source URI.

I'd say that what comes out of the discussion is that users should not
be allowed to do it, and that a check should be done as part of the
validation to ensure that double-named files are not there.
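
Such a check could look roughly like the sketch below, which reports
every name that occurs with more than one extension. The function name
and the sample file list are made up for the example.

```python
import os
from collections import defaultdict

def find_double_named(filenames):
    """Return {stem: [files]} for stems used by more than one file."""
    by_stem = defaultdict(list)
    for name in filenames:
        stem, _ext = os.path.splitext(name)
        by_stem[stem].append(name)
    return {
        stem: sorted(names)
        for stem, names in by_stem.items()
        if len(names) > 1
    }

print(find_double_named(["index.xml", "index.pdf", "faq.xml"]))
# prints {'index': ['index.pdf', 'index.xml']}
```

A validation step could fail the build whenever the returned mapping is
not empty.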

> We can
> then say, "by the way, it's really dumb creating index.pdf when you've
> got index.xml", but that's a layer above the raw URI space addressing
> issue.

Not IMHO. Since we decided to link to "concepts", we have actually IMHO 
decided that it's the filename that identifies the file, without the 
extension.

> Popping the argument stack a bit, remember that this whole silly example
> of index.xml/index.pdf is a pathological case, that won't have the
> desired effect no matter what the URI is.  You have ignored my main
> argument, that the 'cocoon:' prefix is implicit and _conceptually_ a
> file: scheme is required.

I have not ignored it. I keep thinking that conceptually the file scheme
is not required, for all the reasons I have explained.

Yes, the 'cocoon:' prefix is implicit. No, _conceptually_ it's not 
required *if* we decide that we cannot have more than one source file.

-- 
Nicola Ken Barozzi                   nicolaken@apache.org
             - verba volant, scripta manent -
    (discussions get forgotten, just code remains)
---------------------------------------------------------------------

Re: File prefix again (Re: Cocoon CLI - how to generate the whole site)

Posted by Jeff Turner <je...@apache.org>.

On Wed, Dec 18, 2002 at 03:23:03PM +0100, Nicola Ken Barozzi wrote:
...
> >Firstly: do you agree that there _are_ two Sources?  That the user
> >_could_ create an index.pdf?  In fact, considering that the user isn't
> >meant to know that index.xml even *has* a PDF rendition, why shouldn't
> >they create an index.pdf?
> 
> I don't agree here. The user creates documents to explain a concept. 
> "index" means it's the index.

Since when do semantics come into the business of ensuring every source
has a URI?

Fact: users _can_ create an index.pdf.  Whether this is a good idea is
irrelevant: as a source of content, it deserves a source URI.  We can
then say, "by the way, it's really dumb creating index.pdf when you've
got index.xml", but that's a layer above the raw URI space addressing
issue.

Popping the argument stack a bit, remember that this whole silly example
of index.xml/index.pdf is a pathological case, that won't have the
desired effect no matter what the URI is.  You have ignored my main
argument, that the 'cocoon:' prefix is implicit and _conceptually_ a
file: scheme is required.

--Jeff

> Who cares what the rendition is.
> Imagine the user making an index.xml and index.xhtml file in the same 
> dir. Does it make sense?
> 
> >Secondly, do you agree that conceptually, any source of content should be
> >assigned a Source URI?  _Regardless_ of whether it has a Destination URI?
> >Because Source and Destination URI spaces have no direct relation.  Heck,
> >I could generate a single PDF containing the entire site, thus mapping
> >lots of Source URIs to a single Destination URI.
> 
> Yes, on this I agree. We should always link to source URIs, so that what 
> you explain about a single PDF can be possible. And it's also easier for 
> the user. +1
> 
> -- 
> Nicola Ken Barozzi                   nicolaken@apache.org
>             - verba volant, scripta manent -
>    (discussions get forgotten, just code remains)
> ---------------------------------------------------------------------
>

Re: PDF transforms (was: Re: File prefix again)

Posted by Jeremias Maerki <de...@greenmail.ch>.

Hi Keiron

On 20.12.2002 09:41:52 Keiron Liddle wrote:
> On Thu, 2002-12-19 at 21:15, Jeremias Maerki wrote:
> > All cool, but how exactly is that better than having a PDF template that
> > is stitched behind or in front of the FOP result using iText or PJ?
> > Works well. Ok, PDF reading with our own library is a bonus as is better
> > XML output for debugging. But I don't see any immediate need for this at
> > the moment given our limited resources. Or do I miss anything?
> 
> Well I'm not really suggesting it is high priority, just an idea.
> One thing is that the XML and the additions can work both in and out of
> Fop.
>
> At least outputting SAX in the XMLRenderer would probably be an
> improvement.

Ok then. Will you put this on the todo list?

Jeremias Maerki


---------------------------------------------------------------------
To unsubscribe, e-mail: fop-dev-unsubscribe@xml.apache.org
For additional commands, email: fop-dev-help@xml.apache.org

Re: PDF transforms (was: Re: File prefix again)

Posted by Keiron Liddle <ke...@aftexsw.com>.

On Thu, 2002-12-19 at 21:15, Jeremias Maerki wrote:
> All cool, but how exactly is that better than having a PDF template that
> is stitched behind or in front of the FOP result using iText or PJ?
> Works well. Ok, PDF reading with our own library is a bonus as is better
> XML output for debugging. But I don't see any immediate need for this at
> the moment given our limited resources. Or do I miss anything?

Well I'm not really suggesting it is high priority, just an idea.
One thing is that the XML and the additions can work both in and out of
Fop.

At least outputting SAX in the XMLRenderer would probably be an
improvement.





Re: PDF transforms (was: Re: File prefix again)

Posted by Jeremias Maerki <de...@greenmail.ch>.

All cool, but how exactly is that better than having a PDF template that
is stitched behind or in front of the FOP result using iText or PJ?
Works well. Ok, PDF reading with our own library is a bonus as is better
XML output for debugging. But I don't see any immediate need for this at
the moment given our limited resources. Or do I miss anything?

On 19.12.2002 08:05:54 Keiron Liddle wrote:
> On Wed, 2002-12-18 at 15:23, Nicola Ken Barozzi wrote:
> > > I don't get this.  How can PDFs be transformed?
> > 
> > There are Java libraries that read PDFs. What would be really cool is to 
> > have a reader or something like it that uses a PDF as a template.
> > Using FOP for just filling out forms is overkill, we just need templating.
> > 
> > This is a general use case of PDF transformation, and another that I 
> > would really like to see is to generate a "non-controlled copy" stamp on 
> > the PDF for the management of ISO9001 documentation.
> > 
> > Or simply by adding a copyright statement.
> 
> Sounds like some good ideas.
> 
> It would be possible to do some work with Fop so that it can:
> - convert xsl:fo to paged xml
> - convert paged xml to pdf (or other formats)
> - define templates with the paged xml
> - append paged xml to a current document
> 
> So it would be possible to create the paged xml from fo. Then to do a
> transform or directly convert or append the paged xml to pdf.
> Also the extensions and foreign xml can be passed through directly so
> that both formats support the same extensions, such as svg.
> 
> So the changes that would need to be made are:
> - improve and update xml renderer so that it can output SAX
> - improve and update AreaTreeBuilder so that it takes SAX input
> - make some additions to the pdf lib so it can load and read pdf
> documents
> 
> Then it shouldn't be so hard to add in extensions for pdf forms etc.


Jeremias Maerki



AW: PDF transforms (was: Re: File prefix again)

Posted by "J.U. Anderegg" <ha...@bluewin.ch>.

Hi Keiron,

> On Sun, 2002-12-22 at 02:18, Kevin O'Neill wrote:
> > Is the paged XML a new or existing format?
>
> A new format for now at least.
>
> It is possible there will be a w3c defined format.

Please give some pointers to w3c activities in this area. What exactly is
this thing supposed to do? What are the externals supposed to look like? etc...

Hansuli Anderegg




Re: PDF transforms (was: Re: File prefix again)

Posted by Keiron Liddle <ke...@aftexsw.com>.

On Sun, 2002-12-22 at 02:18, Kevin O'Neill wrote:
> > It would be possible to do some work with Fop so that it can:
> > - convert xsl:fo to paged xml
> 
> Is the paged XML a new or existing format?

A new format for now at least.

It is possible there will be a w3c defined format.




Re: PDF transforms (was: Re: File prefix again)

Posted by Kevin O'Neill <ke...@rocketred.com.au>.

On Thu, 2002-12-19 at 18:05, Keiron Liddle wrote:
> On Wed, 2002-12-18 at 15:23, Nicola Ken Barozzi wrote:
> > > I don't get this.  How can PDFs be transformed?
> > 
> > There are Java libraries that read PDFs. What would be really cool is to 
> > have a reader or something like it that uses a PDF as a template.
> > Using FOP for just filling out forms is overkill, we just need templating.
> > 
> > This is a general use case of PDF transformation, and another that I 
> > would really like to see is to generate a "non-controlled copy" stamp on 
> > the PDF for the management of ISO9001 documentation.
> > 
> > Or simply by adding a copyright statement.
> 
> Sounds like some good ideas.
> 
> It would be possible to do some work with Fop so that it can:
> - convert xsl:fo to paged xml

Is the paged XML a new or existing format?

> - convert paged xml to pdf (or other formats)
> - define templates with the paged xml
> - append paged xml to a current document
> 
> So it would be possible to create the paged xml from fo. Then to do a
> transform or directly convert or append the paged xml to pdf.
> Also the extensions and foreign xml can be passed through directly so
> that both formats support the same extensions, such as svg.
> 
> So the changes that would need to be made are:
> - improve and update xml renderer so that it can output SAX
> - improve and update AreaTreeBuilder so that it takes SAX input
> - make some additions to the pdf lib so it can load and read pdf
> documents
> 
> Then it shouldn't be so hard to add in extensions for pdf forms etc.
-- 
If you don't test then your code is only a collection of bugs which 
apparently behave like a working program. 

Website: http://www.rocketred.com.au/blogs/kevin/



PDF transforms (was: Re: File prefix again)

Posted by Keiron Liddle <ke...@aftexsw.com>.

On Wed, 2002-12-18 at 15:23, Nicola Ken Barozzi wrote:
> > I don't get this.  How can PDFs be transformed?
> 
> There are Java libraries that read PDFs. What would be really cool is to 
> have a reader or something like it that uses a PDF as a template.
> Using FOP for just filling out forms is overkill, we just need templating.
> 
> This is a general use case of PDF transformation, and another that I 
> would really like to see is to generate a "non-controlled copy" stamp on 
> the PDF for the management of ISO9001 documentation.
> 
> Or simply by adding a copyright statement.

Sounds like some good ideas.

It would be possible to do some work with Fop so that it can:
- convert xsl:fo to paged xml
- convert paged xml to pdf (or other formats)
- define templates with the paged xml
- append paged xml to a current document

So it would be possible to create the paged xml from fo. Then to do a
transform or directly convert or append the paged xml to pdf.
Also the extensions and foreign xml can be passed through directly so
that both formats support the same extensions, such as svg.

So the changes that would need to be made are:
- improve and update xml renderer so that it can output SAX
- improve and update AreaTreeBuilder so that it takes SAX input
- make some additions to the pdf lib so it can load and read pdf
documents

Then it shouldn't be so hard to add in extensions for pdf forms etc.


Re: File prefix again (Re: Cocoon CLI - how to generate the whole site)

Posted by Nicola Ken Barozzi <ni...@apache.org>.


Jeff Turner wrote:
> On Tue, Dec 17, 2002 at 03:07:57PM +0100, Nicola Ken Barozzi wrote:
> 
>>Jeff Turner wrote:
>>
>>>On Mon, Dec 16, 2002 at 04:08:37PM +0100, Nicola Ken Barozzi wrote:

[...]

>>Static or generated there is no difference. The user should not even know 
>>if Cocoon does something with it.
> 
> Yes! +1000.  But first, the user needs to identify what "it" is.  Is "it"
> the PDF rendition of index.xml, or the index.pdf file sitting on my
> harddisk?  They are two different Sources, containing completely
> different content, and they deserve different Source URIs.

My point is that there should be just one "index" file, whatever 
extension it has.

>>This is important. This is why I say that you are mixing concerns.
> 
> Identifying the source is the user's concern.  That is the I in URI.  We
> have two different Sources, we need two different URIs.

Exactly the point. I say that we can have only one source with the
same name. I don't see the need for having two.

>>What if the sitemap guy would want to take the PDF and transform it; 
>>with the file: protocol you are making this not possible. You are taking 
>>away from the sitemap the possibility of doing what the heck it wants 
>>with the files.
> 
> I don't get this.  How can PDFs be transformed?

There are Java libraries that read PDFs. What would be really cool is to 
have a reader or something like it that uses a PDF as a template.
Using FOP for just filling out forms is overkill, we just need templating.

This is a general use case of PDF transformation, and another that I 
would really like to see is to generate a "non-controlled copy" stamp on 
the PDF for the management of ISO9001 documentation.

Or simply to add a copyright statement.

[...]

>>Imagine I have
>>
>> ./index.xml
>> ./index.pdf
>>
>>If I link like this
>>
>>  <link href="index"/>
>>
>>Cocoon serves only one, as defined in the sitemap rules.
>>
>>If I introduce the file: protocol, I can do:
>>
>>  <link href="index"/>           ->  serve index.xml
>>  <link href="site:index.pdf"/>  ->  serve index.pdf
>>
>>Problem is, how can the browser ask for
>>
>>  http://domain.ext/path/to/index
>>
>>and have one or other result?
>>
>>What would the above URL yield?
> 
> 
> Excellent point :)  One I completely missed.  So you're saying that
> disambiguating 'cocoon:index.pdf' and 'file:index.pdf' is well and good,
> but it causes a name clash in the Destination URI space.
> 
> Simple enough answer: we need to create two destination URIs, because
> there are two Source URIs.  Eg, generate:
> 
> http://localhost:8888/index.pdf    # The static index.pdf
> http://localhost:8888/index~.pdf   # index.pdf generated from XML   
> 
> But this is an implementation detail.  What I'm concerned about now is
> whether disambiguating the sources makes sense _conceptually_.
> 
> So say we have two distinct Source URIs: a static index.pdf file, and the
> PDF rendition of index.xml.  In "ideal world" syntax, we can write those
> two as:
> 
> <link href="index.pdf">
> <link href="index.xml" type="application/pdf">
> 
> In "real world: Jeff style" syntax, they'd be written as:
> 
> <link href="file:index.pdf">
> <link href="index.pdf">
> 
> In "real world: Nicola style" syntax, there'd just be:
> 
> <link href="index.pdf">
> 
> and you simply can't have an index.pdf file.

Not exactly.
If you have index.xml, that becomes the index.pdf.
If you have index.pdf, that becomes the index.pdf.

One filename, one result.

> Firstly: do you agree that there _are_ two Sources?  That the user
> _could_ create an index.pdf?  In fact, considering that the user isn't
> meant to know that index.xml even *has* a PDF rendition, why shouldn't
> they create an index.pdf?

I don't agree here. The user creates documents to explain a concept. 
"index" means it's the index. Who cares what the rendition is.
Imagine the user making an index.xml and index.xhtml file in the same 
dir. Does it make sense?

> Secondly, do you agree that conceptually, any source of content should be
> assigned a Source URI?  _Regardless_ of whether it has a Destination URI?
> Because Source and Destination URI spaces have no direct relation.  Heck,
> I could generate a single PDF containing the entire site, thus mapping
> lots of Source URIs to a single Destination URI.

Yes, on this I agree. We should always link to source URIs, so that what 
you explain about a single PDF can be possible. And it's also easier for 
the user. +1

-- 
Nicola Ken Barozzi                   nicolaken@apache.org
             - verba volant, scripta manent -
    (discussions get forgotten, just code remains)
---------------------------------------------------------------------

Re: File prefix again (Re: Cocoon CLI - how to generate the whole site)

Posted by Jeff Turner <je...@apache.org>.

On Tue, Dec 17, 2002 at 03:07:57PM +0100, Nicola Ken Barozzi wrote:
> 
> Jeff Turner wrote:
> >On Mon, Dec 16, 2002 at 04:08:37PM +0100, Nicola Ken Barozzi wrote:
> [...]
> >
> >Your view is perfectly clear and simple: schemes are aliasing mechanisms
> >to simplify linking to the destination URI space.
> >
> >My view only makes sense once you a) buy into the notion that the Source
> >URI space exists and is distinct from the Destination URI space, b)
> >understand that, given a), the implied *source* protocol for links is
> >currently 'cocoon:'.  Only then does the reason for file: become
> >apparent: static links do _not_ have the implied 'cocoon:' scheme.  We
> >need a different scheme to disambiguate, say, a static index.pdf, and an
> >index.pdf generated from index.xml.
> 
> Static or generated there is no difference. The user should not even know 
> if Cocoon does something with it.

Yes! +1000.  But first, the user needs to identify what "it" is.  Is "it"
the PDF rendition of index.xml, or the index.pdf file sitting on my
harddisk?  They are two different Sources, containing completely
different content, and they deserve different Source URIs.

> This is important. This is why I say that you are mixing concerns.

Identifying the source is the user's concern.  That is the I in URI.  We
have two different Sources, we need two different URIs.

> What if the sitemap guy would want to take the PDF and transform it; 
> with the file: protocol you are making this not possible. You are taking 
> away from the sitemap the possibility of doing what the heck it wants 
> with the files.

I don't get this.  How can PDFs be transformed?

...
> >>>Secondly, introducing a 'file:' prefix fixes the current name clash
> >>>problem.  What if I have a static file called 'index.pdf'?  How do I
> >>>access the index.pdf generated from XML?  I can't, because the
> >>>resource-exists will always choose for me.
> >>
> >>Which is another seemingly good point, but since we have decided that 
> >>link URIs should not end in extensions, because of many reasons one of 
> >>which is the fact that a URI can reference different formats at 
> >>different times in history, having a scheme that effectively makes me 
> >>serve two different versions of the same file is totally off-target.
> >
> >See above.  There is _no way_ that a sitemap, with MIMETypeActions and
> >resource-exists and any other crazy hacks you care to name, can 100%
> >correctly choose between a static index.pdf and one generated from
> >index.xml.  Simply cannot, because there is missing info only the user
> >knows.  That is what the file: prefix adds.
> 
> Reread my point.
> 
> Imagine I have
> 
>  ./index.xml
>  ./index.pdf
> 
> If I link like this
> 
>   <link href="index"/>
> 
> Cocoon serves only one, as defined in the sitemap rules.
> 
> If I introduce the file: protocol, I can do:
> 
>   <link href="index"/>           ->  serve index.xml
>   <link href="site:index.pdf"/>  ->  serve index.pdf
> 
> Problem is, how can the browser ask for
> 
>   http://domain.ext/path/to/index
> 
> and have one or other result?
> 
> What would the above URL yield?

Excellent point :)  One I completely missed.  So you're saying that
disambiguating 'cocoon:index.pdf' and 'file:index.pdf' is well and good,
but it causes a name clash in the Destination URI space.

Simple enough answer: we need to create two destination URIs, because
there are two Source URIs.  Eg, generate:

http://localhost:8888/index.pdf    # The static index.pdf
http://localhost:8888/index~.pdf   # index.pdf generated from XML   

But this is an implementation detail.  What I'm concerned about now is
whether disambiguating the sources makes sense _conceptually_.
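
To make the "two Sources, two Destinations" idea concrete, here is a
minimal sketch in Python. It is not Forrest or Cocoon code: the
(scheme, name) pairs, the function name destination_uris and the '~'
suffix rule are all assumptions made up for illustration.

```python
def destination_uris(sources):
    """Give every Source URI its own destination path.

    `sources` is an ordered list of (scheme, name) pairs. When two sources
    want the same destination name, later ones get a '~' inserted before
    the extension (an illustrative rule, not an agreed convention).
    """
    taken = set()
    result = {}
    for scheme, name in sources:
        dest = name
        while dest in taken:
            stem, dot, ext = dest.rpartition(".")
            # No extension at all: just append the marker.
            dest = f"{stem}~{dot}{ext}" if dot else dest + "~"
        taken.add(dest)
        result[(scheme, name)] = dest
    return result


mapping = destination_uris([("file", "index.pdf"), ("cocoon", "index.pdf")])
```

With this rule the static file keeps its name and the generated
rendition gets the '~' variant; any other collision-free naming rule
would do equally well.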

So say we have two distinct Source URIs: a static index.pdf file, and the
PDF rendition of index.xml.  In "ideal world" syntax, we can write those
two as:

<link href="index.pdf">
<link href="index.xml" type="application/pdf">

In "real world: Jeff style" syntax, they'd be written as:

<link href="file:index.pdf">
<link href="index.pdf">

In "real world: Nicola style" syntax, there'd just be:

<link href="index.pdf">

and you simply can't have an index.pdf file.

Firstly: do you agree that there _are_ two Sources?  That the user
_could_ create an index.pdf?  In fact, considering that the user isn't
meant to know that index.xml even *has* a PDF rendition, why shouldn't
they create an index.pdf?

Secondly, do you agree that conceptually, any source of content should be
assigned a Source URI?  _Regardless_ of whether it has a Destination URI?
Because Source and Destination URI spaces have no direct relation.  Heck,
I could generate a single PDF containing the entire site, thus mapping
lots of Source URIs to a single Destination URI.

If you agree to both of those, then you'll agree that adding a file:
prefix to address static files makes conceptual sense.  If, in
pathological cases, that causes conflicts in the destination URI space,
well that's too bad; we'll fix it eventually.  Conceptually we did the
right thing.

--Jeff

Re: File prefix again (Re: Cocoon CLI - how to generate the whole site)

Posted by Nicola Ken Barozzi <ni...@apache.org>.

Jeff Turner wrote:
> On Mon, Dec 16, 2002 at 04:08:37PM +0100, Nicola Ken Barozzi wrote:
[...]
> 
> Your view is perfectly clear and simple: schemes are aliasing mechanisms
> to simplify linking to the destination URI space.
> 
> My view only makes sense once you a) buy into the notion that the Source
> URI space exists and is distinct from the Destination URI space, b)
> understand that, given a), the implied *source* protocol for links is
> currently 'cocoon:'.  Only then does the reason for file: become
> apparent: static links do _not_ have the implied 'cocoon:' scheme.  We
> need a different scheme to disambiguate, say, a static index.pdf, and an
> index.pdf generated from index.xml.

Static or generated there is no difference. The user should not even know 
if Cocoon does something with it. This is important. This is why I say 
that you are mixing concerns.

What if the sitemap guy would want to take the PDF and transform it; 
with the file: protocol you are making this not possible. You are taking 
away from the sitemap the possibility of doing what the heck it wants 
with the files.

>>>I described this notion of separating the Source and Destination URI
>>>space in a RT: http://marc.theaimsgroup.com/?t=103959284100002&r=1&w=2
>>
>>I read it, and I basically agree with it, except the above distinction 
>>which wasn't clear to me in the first place.
>>
>>
>>>So that is the theory: it is better to have an explicit file: scheme,
>>>because it distinguishes those URIs from the implied 'cocoon:' scheme,
>>>and fits in better in a world where there are schemes everywhere.
>>
>>Please expand on this. Do you mean file scheme=sources and cocoon 
>>scheme=resulting URI space?
> 
> Yes.
> 
> In a perfect world, the default scheme would be file:, not cocoon:.  So
> we could have <link href="primer.xml">, or <link href="hello.pdf">.
> Then, a linkmap would genuinely be an aliasing mechanism, but aliasing in
> the _Source_ URI space.  Eg, <link href="site:/primer"> would be exactly
> equivalent to <link href="primer.xml"> (or ../primer.xml or
> ../../primer.xml etc).  Ignore this paragraph if it doesn't make sense..

It kinda does.
I buy into the idea that I should link only to source files, and have the 
resulting URI space be created by the sitemap. But I don't buy the idea 
that in the perfect world I use the extension to reference the file, 
because of the last comment below.

>>>Practically, right now, what is the difference?
>>>
>>>Well for a start, if we consistently used 'file:' for URIs identifying
>>>static files, we could throw away the current resource-exists action:
>>>
>>> <map:match pattern="**">
>>>
>>>   <map:act type="resource-exists">
>>>    <map:parameter name="url" value="content/{1}"/>
>>>    <map:read src="content/{../1}"/>
>>>   </map:act>
>>>   ....
>>>
>>>And replace it with a simple sitemap rule:
>>>
>>> <map:match pattern="file:**">
>>>   <map:read src="content/{1}"/>
>>> </map:match>
>>
>>Which is something I don't like.
>>
>>Again, you are telling Cocoon how to treat that file, which is not a 
>>concern of the editor.
> 
> The implied URI scheme is 'cocoon:'.  By adding a 'file:' prefix, the
> user is saying "no, this file is local".  There is nothing wrong with
> this, and no other way to distinguish between, say, a static index.pdf
> and one generated from index.xml.  

And there should not be. See below again.

>>>Having to interrogate the filesystem to decide a URI's scheme is a total 
>>>hack.
>>>What happens if our docs are stored in Xindice, or anything other than a
>>>filesystem?  Resource-exists is going to break.
>>
>>Hmmm, this is a good point, but not a resource-exists "conceptual" 
>>problem. I can test if a resource exists also in remote repositories.
>>If the "file:" thing takes care of different backends, there is no reason 
>>why a better resource-exists cannot. So it seems to be more about the 
>>deficiencies of the resource-exists implementation than the need 
>>for a site: scheme.
> 
> 
> Say I want to link to a static index.pdf, but I forget to create it.  I
> want that link to break!  I don't want Cocoon to be clever, and create
> one from index.xml.  Resource-exists is an utter hack that doesn't
> (cannot!) meet use-cases like this, because ultimately, only the user can
> know if they are referring to a local file, or one generated by Cocoon.

Given that we have ruled out extensions in the links, there can be only 
one file with the same name in the same dir. Hence there is no ambiguity.

>>>Secondly, introducing a 'file:' prefix fixes the current name clash
>>>problem.  What if I have a static file called 'index.pdf'?  How do I
>>>access the index.pdf generated from XML?  I can't, because the
>>>resource-exists will always choose for me.
>>
>>Which is another seemingly good point, but since we have decided that 
>>link URIs should not end in extensions, because of many reasons one of 
>>which is the fact that a URI can reference different formats at 
>>different times in history, having a scheme that effectively makes me 
>>serve two different versions of the same file is totally off-target.
> 
> See above.  There is _no way_ that a sitemap, with MIMETypeActions and
> resource-exists and any other crazy hacks you care to name, can 100%
> correctly choose between a static index.pdf and one generated from
> index.xml.  Simply cannot, because there is missing info only the user
> knows.  That is what the file: prefix adds.

Reread my point.

Imagine I have

  ./index.xml
  ./index.pdf

If I link like this

   <link href="index"/>

Cocoon serves only one, as defined in the sitemap rules.

If I introduce the file: protocol, I can do:

   <link href="index"/>           ->  serve index.xml
   <link href="site:index.pdf"/>  ->  serve index.pdf

Problem is, how can the browser ask for

   http://domain.ext/path/to/index

and have one or other result?

What would the above URL yield?
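
A rough sketch of what "one filename, one result" means in practice,
written in Python rather than as a sitemap. The function name resolve
and the precedence order (.xml before .pdf) are assumptions for
illustration; in reality the order would be whatever the sitemap rules
define.

```python
def resolve(name, files, precedence=(".xml", ".pdf")):
    """Pick the single source that an extensionless link refers to.

    `files` is the set of file names present in the directory. The first
    extension in `precedence` that matches wins, so a request for 'index'
    can only ever yield one result.
    """
    for ext in precedence:
        candidate = name + ext
        if candidate in files:
            return candidate
    return None  # nothing to serve: the link is broken
```

Whichever order is chosen, the second file with the same base name is
simply unreachable through the extensionless link.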

-- 
Nicola Ken Barozzi                   nicolaken@apache.org
             - verba volant, scripta manent -
    (discussions get forgotten, just code remains)
---------------------------------------------------------------------

Re: File prefix again (Re: Cocoon CLI - how to generate the whole site)

Posted by Jeff Turner <je...@apache.org>.

On Mon, Dec 16, 2002 at 04:08:37PM +0100, Nicola Ken Barozzi wrote:
...
> >>Why would we need to rewrite "file:"s?
> >
> >Given the above definition, what do you think the implied scheme for
> ><link href="hello.pdf"> is?  What syntactic and semantic restrictions are
> >there?  Can we link to anything?  No: we can only link to URIs defined by
> >sitemap rules.  Therefore the implied scheme is 'cocoon:'.  I need to
> >invoke Cocoon to get 'hello.pdf'.  If my editor were written in Java as
> >an Avalon component, it might really be able to invoke Cocoon and
> >retrieve 'hello.pdf'.
> >
> >What about when a file is sitting on my harddisk?  Do I need Cocoon to
> >view it?  No; I can open it in an editor.  Hence the 'file:' protocol is
> >implied.  In fact, in vim I can type 'gf' and automatically traverse the
> >link.  My editor is a 'browser' of the Source URI space, just like
> >Mozilla browses the Destination URI space.
> >
> >That is the important concept: the Source URI space is distinct from the
> >Destination URI space.  In the Source URI space (XML docs + <link>
> >elems), we have all sorts of schemes (linkmap:, java:, file:, person:
> >etc), but in the Destination URI space (HTML docs + <a> elems), we have
> >only one protocol, usually http: or file:.
> 
> First distinction: schemes are not IMV in the source URI space, but in 
> the destination URI space

In the destination URI space (HTML files), all our linkmap:, java:,
person:, mail: schemes have vanished.  They only exist in the source URI
space (XML files).

> hence my definition of link rewriting. Links are always seen from the
> outside IMV.

I edit XML files, which are source docs.  I edit the source links.
Currently, most source links are identical to destination links, but that
is what will change completely once we introduce schemes.  There is no
way you can pretend <link href="linkmap:/primer"> is a destination link,
because browsers don't understand the 'linkmap' protocol.  Only Cocoon
can.  Just as Cocoon translates source docs (XML) to destination docs
(HTML), it translates source URIs (link:, java:, etc URIs) to destination
URIs.
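
As a hedged illustration of that translation step (Python, not the
actual Forrest implementation): the LINKMAP table, its single entry and
the function name rewrite are hypothetical, and only the linkmap: and
file: schemes are handled.

```python
from urllib.parse import urlsplit

# Hypothetical linkmap: source-side alias -> destination path.
LINKMAP = {"/primer": "primer.html"}


def rewrite(href):
    """Translate a source link into a destination link (illustrative only)."""
    parts = urlsplit(href)
    if parts.scheme == "linkmap":
        # Alias lookup; an unknown alias raises KeyError, i.e. a broken link.
        return LINKMAP[parts.path]
    if parts.scheme == "file":
        # Static file: drop the scheme, keep the path as the destination.
        return parts.path
    # No scheme: pass the link through unchanged.
    return href
```

The browser only ever sees the return value; the scheme itself never
reaches the destination side.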

> With this in mind, you can infer why I don't see the need for a file:
> scheme.
> 
> Thus I link to the resulting URI space, not the source one.

You do currently.  <link href="primer.html"> is a link to the destination
URI space.  But we have agreed that that is wrong.

> The resulting URI space can be complicated, so to ease the linking I
> use schemes to make linking easier.
> 
> Well, it may well not be the best thing to do, but this is what 
> I've been saying till now, so I see why we didn't really understand each 
> other.

Your view is perfectly clear and simple: schemes are aliasing mechanisms
to simplify linking to the destination URI space.

My view only makes sense once you a) buy into the notion that the Source
URI space exists and is distinct from the Destination URI space, b)
understand that, given a), the implied *source* protocol for links is
currently 'cocoon:'.  Only then does the reason for file: become
apparent: static links do _not_ have the implied 'cocoon:' scheme.  We
need a different scheme to disambiguate, say, a static index.pdf, and an
index.pdf generated from index.xml.

> >I described this notion of separating the Source and Destination URI
> >space in a RT: http://marc.theaimsgroup.com/?t=103959284100002&r=1&w=2
> 
> I read it, and I basically agree with it, except the above distinction 
> which wasn't clear to me in the first place.
> 
> >So that is the theory: it is better to have an explicit file: scheme,
> >because it distinguishes those URIs from the implied 'cocoon:' scheme,
> >and fits in better in a world where there are schemes everywhere.
> 
> Please expand on this. Do you mean file scheme=sources and cocoon 
> scheme=resulting URI space?

Yes.

In a perfect world, the default scheme would be file:, not cocoon:.  So
we could have <link href="primer.xml">, or <link href="hello.pdf">.
Then, a linkmap would genuinely be an aliasing mechanism, but aliasing in
the _Source_ URI space.  Eg, <link href="site:/primer"> would be exactly
equivalent to <link href="primer.xml"> (or ../primer.xml or
../../primer.xml etc).  Ignore this paragraph if it doesn't make sense..

> >Practically, right now, what is the difference?
> >
> >Well for a start, if we consistently used 'file:' for URIs identifying
> >static files, we could throw away the current resource-exists action:
> >
> >  <map:match pattern="**">
> >
> >    <map:act type="resource-exists">
> >     <map:parameter name="url" value="content/{1}"/>
> >     <map:read src="content/{../1}"/>
> >    </map:act>
> >    ....
> >
> >And replace it with a simple sitemap rule:
> >
> >  <map:match pattern="file:**">
> >    <map:read src="content/{1}"/>
> >  </map:match>
> 
> Which is something I don't like.
> 
> Again, you are telling Cocoon how to treat that file, which is not a 
> concern of the editor.

The implied URI scheme is 'cocoon:'.  By adding a 'file:' prefix, the
user is saying "no, this file is local".  There is nothing wrong with
this, and no other way to distinguish between, say, a static index.pdf
and one generated from index.xml.  The sitemap simply takes advantage of
the lexical difference.

> We decided to take away the extension from files, but this file: thing 
> does the same conceptual thing, it selects the sitemap to use inside the 
> link.

The difference is, the file: scheme is not added to make the sitemap
simpler.  That is just a nice side-effect.

> >Having to interrogate the filesystem to decide a URI's scheme is a total 
> >hack.
> >What happens if our docs are stored in Xindice, or anything other than a
> >filesystem?  Resource-exists is going to break.
> 
> Hmmm, this is a good point, but not a resource-exists "conceptual" 
> problem. I can test if a resource exists also in remote repositories.
> If the "file:" thing takes care of different backends, there is no reason 
> why a better resource-exists cannot. So it seems to be more about the 
> deficiencies of the resource-exists implementation than the need 
> for a site: scheme.

Say I want to link to a static index.pdf, but I forget to create it.  I
want that link to break!  I don't want Cocoon to be clever, and create
one from index.xml.  Resource-exists is an utter hack that doesn't
(cannot!) meet use-cases like this, because ultimately, only the user can
know if they are referring to a local file, or one generated by Cocoon.
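
The difference between the two behaviours can be sketched like this. It
is a toy model in Python, assuming a set of static file names and a
dict of names Cocoon could generate; neither function corresponds to a
real Cocoon action.

```python
def resolve_resource_exists(name, static_files, generators):
    """resource-exists style: serve the static file if present,
    otherwise silently fall back to generating it."""
    if name in static_files:
        return ("read", name)
    if name in generators:
        return ("generate", generators[name])
    raise LookupError(name)


def resolve_with_scheme(href, static_files, generators):
    """Explicit style: a 'file:' link must point at an existing static
    file and never falls back to generation."""
    if href.startswith("file:"):
        name = href[len("file:"):]
        if name not in static_files:
            raise LookupError(f"broken static link: {href}")
        return ("read", name)
    if href in generators:
        return ("generate", generators[href])
    raise LookupError(href)
```

With no static index.pdf on disk, the first function happily returns
the rendition generated from index.xml, while the second raises for
"file:index.pdf", which is the broken link I want to see.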

> >Secondly, introducing a 'file:' prefix fixes the current name clash
> >problem.  What if I have a static file called 'index.pdf'?  How do I
> >access the index.pdf generated from XML?  I can't, because the
> >resource-exists will always choose for me.
> 
> Which is another seemingly good point, but since we have decided that 
> link URIs should not end in extensions, because of many reasons one of 
> which is the fact that a URI can reference different formats at 
> different times in history, having a scheme that effectively makes me 
> serve two different versions of the same file is totally off-target.

See above.  There is _no way_ that a sitemap, with MIMETypeActions and
resource-exists and any other crazy hacks you care to name, can 100%
correctly choose between a static index.pdf and one generated from
index.xml.  Simply cannot, because there is missing info only the user
knows.  That is what the file: prefix adds.

--Jeff

Re: File prefix again (Re: Cocoon CLI - how to generate the whole site)

Posted by Nicola Ken Barozzi <ni...@apache.org>.

Steven Noels wrote:
> Nicola Ken Barozzi wrote:
> 
>> Jeff Turner wrote:
> 
> 
>>> Having to interrogate the filesystem to decide a URI's scheme is a 
>>> total hack.
>>> What happens if our docs are stored in Xindice, or anything other than a
>>> filesystem?  Resource-exists is going to break.
> 
> 
>> Hmmm, this is a good point, but not a resource-exists "conceptual" 
>> problem. I can test if a resource exists also in remote repositories.
>> If the "file:" thing takes care of different backends, there is no reason 
>> why a better resource-exists cannot. So it seems to be more about the 
>> deficiencies of the resource-exists implementation than the 
>> need for a site: scheme.
> 
> 
> The way resource-exists was brought into Forrest was based on a hackish 
> idea. 

Please explain why.

> The way it works is a hack. I like the file: approach much better, 

Why?

I'm a user. I take a file. Put it in the directory. Link to it. See it 
in the result.

What do you not like about this? Why is it better if I write the link with 
file: in it? Because that will be the only difference to the user.

> and I don't feel like I don't understand Cocoon or anything else because 
> of that. It's on the same level of letting the user put hints in his 
> documents as we currently inform people about some obscure XLink 
> attribute which can be set to stop crawling. At the very least, file: 
> will have been designed & coded by a community.

I don't get this.

-- 
Nicola Ken Barozzi                   nicolaken@apache.org
             - verba volant, scripta manent -
    (discussions get forgotten, just code remains)
---------------------------------------------------------------------

Re: File prefix again (Re: Cocoon CLI - how to generate the whole site)

Posted by Steven Noels <st...@outerthought.org>.

Nicola Ken Barozzi wrote:

> Jeff Turner wrote:

>> Having to interrogate the filesystem to decide a URI's scheme is a 
>> total hack.
>> What happens if our docs are stored in Xindice, or anything other than a
>> filesystem?  Resource-exists is going to break.

> Hmmm, this is a good point, but not a resource-exists "conceptual" 
> problem. I can test if a resource exists also in remote repositories.
> If the "file:" thing takes care of different backends, there is no reason 
> why a better resource-exists cannot. So it seems to be more about the 
> deficiencies of the resource-exists implementation than the need 
> for a site: scheme.

The way resource-exists was brought into Forrest was based on a hackish 
idea. The way it works is a hack. I like the file: approach much better, 
and I don't feel like I don't understand Cocoon or anything else because 
of that. It's on the same level of letting the user put hints in his 
documents as we currently inform people about some obscure XLink 
attribute which can be set to stop crawling. At the very least, file: 
will have been designed & coded by a community.

</Steven>
-- 
Steven Noels                            http://outerthought.org/
Outerthought - Open Source, Java & XML Competence Support Center
Read my weblog at              http://radio.weblogs.com/0103539/
stevenn at outerthought.org                stevenn at apache.org

Re: File prefix again (Re: Cocoon CLI - how to generate the whole site)

Posted by Nicola Ken Barozzi <ni...@apache.org>.

Jeff Turner wrote:
> On Mon, Dec 16, 2002 at 02:01:52PM +0100, Nicola Ken Barozzi wrote:
> 
>>
>>Jeff Turner wrote:
>>
[...]

>>>The file: patch has two effects:
>>>
>>>- Introduce schemes in xdocs, starting with a 'file:' scheme.  I think
>>>  that schemes in general are uncontroversial.  When linkmaps arrive,
>>>  90% of links are going to be linkmap links, so having a scheme prefix
>>>  should be the norm. 
>>
>>I'm totally for the scheme concept. But schemes are IMHV only link 
>>rewriting rules, and should not address other concerns.
>>A file: scheme would not do any rewriting, so I don't see the need ATM.
> 
> ...
> 
>>>What we really need to agree on is the first point; whether we want to
>>>prefix static links with 'file:'.  When xdocs are swarming with linkmap:,
>>>java:, person:, mail:, etc links, why not have file:?  Conversely, if we
>>>want to "infer" the file: scheme, are we going to try to infer all the
>>>other schemes?
>>
>>Hmmm, I don't see the big problem here, but I may well be wrong.
>>
>>The schemes are link-rewriting systems.
> 
> Schemes are what the URI RFC defines them to be:
> 
>   "The URI scheme (Section 3.1) defines the namespace of the URI, and
>   thus may further restrict the syntax and semantics of identifiers using
>   that scheme."
>     http://www.ietf.org/rfc/rfc2396.txt

Corrected: Forrest schemes IMV are link-rewriting systems. This is to 
make the resulting URI space be completely decoupled from the source space.

>>Why would we need to rewrite "file:"s?
> 
> Given the above definition, what do you think the implied scheme for
> <link href="hello.pdf"> is?  What syntactic and semantic restrictions are
> there?  Can we link to anything?  No: we can only link to URIs defined by
> sitemap rules.  Therefore the implied scheme is 'cocoon:'.  I need to
> invoke Cocoon to get 'hello.pdf'.  If my editor were written in Java as
> an Avalon component, it might really be able to invoke Cocoon and
> retrieve 'hello.pdf'.
> 
> What about when a file is sitting on my harddisk?  Do I need Cocoon to
> view it?  No; I can open it in an editor.  Hence the 'file:' protocol is
> implied.  In fact, in vim I can type 'gf' and automatically traverse the
> link.  My editor is a 'browser' of the Source URI space, just like
> Mozilla browses the Destination URI space.
> 
> That is the important concept: the Source URI space is distinct from the
> Destination URI space.  In the Source URI space (XML docs + <link>
> elems), we have all sorts of schemes (linkmap:, java:, file:, person:
> etc), but in the Destination URI space (HTML docs + <a> elems), we have
> only one protocol, usually http: or file:.

First distinction: schemes are not IMV in the source URI space, but in 
the destination URI space, hence my definition of link rewriting. Links 
are always seen from the outside IMV. With this in mind, you can infer 
why I don't see the need for a file: scheme.

Thus I link to the resulting URI space, not the source one. The 
resulting URI space can be complicated, so to ease the linking I use 
schemes to make linking easier.

Well, it may well not be the best thing to do, but this is what 
I've been saying till now, so I see why we didn't really understand each 
other.

> I described this notion of separating the Source and Destination URI
> space in a RT: http://marc.theaimsgroup.com/?t=103959284100002&r=1&w=2

I read it, and I basically agree with it, except the above distinction 
which wasn't clear to me in the first place.

> So that is the theory: it is better to have an explicit file: scheme,
> because it distinguishes those URIs from the implied 'cocoon:' scheme,
> and fits in better in a world where there are schemes everywhere.

Please expand on this. Do you mean file scheme=sources and cocoon 
scheme=resulting URI space?

> Practically, right now, what is the difference?
> 
> Well for a start, if we consistently used 'file:' for URIs identifying
> static files, we could throw away the current resource-exists action:
> 
>   <map:match pattern="**">
> 
>     <map:act type="resource-exists">
>      <map:parameter name="url" value="content/{1}"/>
>      <map:read src="content/{../1}"/>
>     </map:act>
>     ....
> 
> And replace it with a simple sitemap rule:
> 
>   <map:match pattern="file:**">
>     <map:read src="content/{1}"/>
>   </map:match>

Which is something I don't like.

Again, you are telling Cocoon how to treat that file, which is not a 
concern of the editor.

We decided to take away the extension from files, but this file: thing 
does the same conceptual thing: it selects the sitemap to use inside the 
link.

> Having to interrogate the filesystem to decide a URI's scheme is a total hack.
> What happens if our docs are stored in Xindice, or anything other than a
> filesystem?  Resource-exists is going to break.

Hmmm, this is a good point, but not a resource-exists "conceptual" 
problem. I can test if a resource exists also in remote repositories.
If the "file:" thing takes care of different backends, there is no reason 
why a better resource-exists cannot. So it seems to be more about the 
deficiencies of the resource-exists implementation than the need 
for a site: scheme.

> Secondly, introducing a 'file:' prefix fixes the current name clash problem.
> What if I have a static file called 'index.pdf'?  How do I access the index.pdf
> generated from XML?  I can't, because the resource-exists will always choose
> for me.

Which is another seemingly good point, but since we have decided that 
link URIs should not end in extensions, because of many reasons one of 
which is the fact that a URI can reference different formats at 
different times in history, having a scheme that effectively makes me 
serve two different versions of the same file is totally off-target.

> So there are two practical reasons, and a bunch of theory, as to why we should
> have a 'file:' prefix.

-- 
Nicola Ken Barozzi                   nicolaken@apache.org
             - verba volant, scripta manent -
    (discussions get forgotten, just code remains)
---------------------------------------------------------------------

File prefix again (Re: Cocoon CLI - how to generate the whole site)

Posted by Jeff Turner <je...@apache.org>.

On Mon, Dec 16, 2002 at 02:01:52PM +0100, Nicola Ken Barozzi wrote:
> 
> 
> Jeff Turner wrote:
> >On Mon, Dec 16, 2002 at 08:59:32AM +0100, Nicola Ken Barozzi wrote:
> >
> >>Jeff Turner wrote:
> >
> >...
> >
> >>>>We've established that Cocoon is not going to be invoking Javadoc.  That
> >>>>means that the user could generate the Javadocs _after_ they generate 
> >>>>the
> >>>>Cocoon docs.
> >>>>
> >>>>To handle this possibility, the only course of action is to ignore links
> >>>>to external directories like Javadocs.  What alternative is there?
> >>
> >>Yes, but I don't want this to happen, as I said in other mails.
> >>The fact is that for every URI sub-space we take away from Cocoon, we 
> >>should have something that manages it for Cocoon, and that's for *all* 
> >>the environments Cocoon has to offer, because Forrest is made to run in 
> >>all of them.
> >
> >
> >Ah, gotcha :)
> 
> Pfew, it took a long time didn't it?

;P

> >The file: patch has two effects:
> >
> > - Introduce schemes in xdocs, starting with a 'file:' scheme.  I think
> >   that schemes in general are uncontroversial.  When linkmaps arrive,
> >   90% of links are going to be linkmap links, so having a scheme prefix
> >   should be the norm. 
> 
> I'm totally for the scheme concept. But schemes are IMHV only link 
> rewriting rules, and should not address other concerns.
> A file: scheme would not do any rewriting, so I don't see the need ATM.
...
> >What we really need to agree on is the first point; whether we want to
> >prefix static links with 'file:'.  When xdocs are swarming with linkmap:,
> >java:, person:, mail:, etc links, why not have file:?  Conversely, if we
> >want to "infer" the file: scheme, are we going to try to infer all the
> >other schemes?
> 
> Hmmm, I don't see the big problem here, but I may well be wrong.
> 
> The schemes are link-rewriting systems.

Schemes are what the URI RFC defines them to be:

  "The URI scheme (Section 3.1) defines the namespace of the URI, and
  thus may further restrict the syntax and semantics of identifiers using
  that scheme."
    http://www.ietf.org/rfc/rfc2396.txt

> Why would we need to rewrite "file:"s?

Given the above definition, what do you think the implied scheme for
<link href="hello.pdf"> is?  What syntactic and semantic restrictions are
there?  Can we link to anything?  No: we can only link to URIs defined by
sitemap rules.  Therefore the implied scheme is 'cocoon:'.  I need to
invoke Cocoon to get 'hello.pdf'.  If my editor were written in Java as
an Avalon component, it might really be able to invoke Cocoon and
retrieve 'hello.pdf'.

What about when a file is sitting on my hard disk?  Do I need Cocoon to
view it?  No; I can open it in an editor.  Hence the 'file:' protocol is
implied.  In fact, in vim I can type 'gf' and automatically traverse the
link.  My editor is a 'browser' of the Source URI space, just like
Mozilla browses the Destination URI space.

That is the important concept: the Source URI space is distinct from the
Destination URI space.  In the Source URI space (XML docs + <link>
elems), we have all sorts of schemes (linkmap:, java:, file:, person:
etc), but in the Destination URI space (HTML docs + <a> elems), we have
only one protocol, usually http: or file:.
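
A rough sketch of that separation, in Python for brevity. The function 
name and the concrete rewriting rules are assumptions made up for 
illustration (in particular the mapping of java: links onto an apidocs/ 
path); only the idea of many Source schemes collapsing into plain 
Destination hrefs comes from the discussion.

```python
def rewrite_link(href: str) -> str:
    """Map a Source-space href to a Destination-space href (illustrative rules)."""
    scheme, sep, rest = href.partition(":")
    if not sep:
        # No explicit scheme: the implied 'cocoon:' case, left to sitemap rules.
        return href
    if scheme == "file":
        # A static file: drop the scheme, keep the path.
        return rest
    if scheme == "java":
        # Hypothetical rule: a package or class name becomes a Javadoc path.
        return "apidocs/" + rest.replace(".", "/") + ".html"
    # Anything else (http: and friends) is passed through untouched.
    return href
```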

I described this notion of separating the Source and Destination URI
space in an RT: http://marc.theaimsgroup.com/?t=103959284100002&r=1&w=2

So that is the theory: it is better to have an explicit file: scheme,
because it distinguishes those URIs from the implied 'cocoon:' scheme,
and fits in better in a world where there are schemes everywhere.

Practically, right now, what is the difference?

Well for a start, if we consistently used 'file:' for URIs identifying
static files, we could throw away the current resource-exists action:

  <map:match pattern="**">

    <map:act type="resource-exists">
      <map:parameter name="url" value="content/{1}"/>
      <map:read src="content/{../1}"/>
    </map:act>
    ....

And replace it with a simple sitemap rule:

  <map:match pattern="file:**">
    <map:read src="content/{1}"/>
  </map:match>

Having to interrogate the filesystem to decide a URI's scheme is a total hack.
What happens if our docs are stored in Xindice, or anything other than a
filesystem?  Resource-exists is going to break.

Secondly, introducing a 'file:' prefix fixes the current name clash problem.
What if I have a static file called 'index.pdf'?  How do I access the index.pdf
generated from XML?  I can't, because the resource-exists will always choose
for me.

So there are two practical reasons, and a bunch of theory, as to why we should
have a 'file:' prefix.
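
To make the name clash concrete, here is a small Python model of the two 
approaches. It is a sketch, not Cocoon code: resolve_by_probing imitates 
the resource-exists rule, resolve_by_scheme imitates the 'file:**' rule, 
and the ("read" / "generate") return values are invented labels.

```python
def resolve_by_probing(uri, static_files):
    """Imitate resource-exists: a static file, if present, always wins."""
    if uri in static_files:
        return ("read", "content/" + uri)
    return ("generate", uri)


def resolve_by_scheme(uri):
    """Imitate the 'file:**' rule: the link itself says which one it wants."""
    if uri.startswith("file:"):
        return ("read", "content/" + uri[len("file:"):])
    return ("generate", uri)
```

With a static index.pdf present, resolve_by_probing("index.pdf", 
{"index.pdf"}) can only ever return the static file, whereas 
resolve_by_scheme distinguishes "file:index.pdf" from "index.pdf".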

--Jeff

Re: Cocoon CLI - how to generate the whole site (Re: The Mythical Javadoc generator (Re: Conflict resolution))

Posted by Nicola Ken Barozzi <ni...@apache.org>.


Jeff Turner wrote:
> On Mon, Dec 16, 2002 at 08:59:32AM +0100, Nicola Ken Barozzi wrote:
> 
>>Jeff Turner wrote:
> 
> ...
> 
>>>>We've established that Cocoon is not going to be invoking Javadoc.  That
>>>>means that the user could generate the Javadocs _after_ they generate the
>>>>Cocoon docs.
>>>>
>>>>To handle this possibility, the only course of action is to ignore links
>>>>to external directories like Javadocs.  What alternative is there?
>>
>>Yes, but I don't want this to happen, as I said in other mails.
>>The fact is that for every URI sub-space we take away from Cocoon, we 
>>should have something that manages it for Cocoon, and that's for *all* 
>>the environments Cocoon has to offer, because Forrest is made to run in 
>>all of them.
> 
> 
> Ah, gotcha :)

Phew, it took a long time, didn't it?

> Though remember, with the file: patch, the sitemap *did* serve up files,
> through this rule:
> 
> <map:match pattern="**">
>   <map:act type="resource-exists">
>     <map:parameter name="url" value="content/{1}"/>
>     <map:read src="content/{../1}"/>
>   </map:act>
> 
> So it worked in both command-line and webapp.  The command-line solution
> just happened to bypass the Cocoon CLI.

Which is the point :-)

> The file: patch has two effects:
> 
>  - Introduce schemes in xdocs, starting with a 'file:' scheme.  I think
>    that schemes in general are uncontroversial.  When linkmaps arrive,
>    90% of links are going to be linkmap links, so having a scheme prefix
>    should be the norm. 

I'm totally for the scheme concept. But schemes are IMHV only link 
rewriting rules, and should not address other concerns.
A file: scheme would not do any rewriting, so I don't see the need ATM.

>  - Routes around a CLI bug, by copying static files with Ant, rather than
>    through the CLI.

Yup, that's the major point that I didn't like.

> What we really need to agree on is the first point; whether we want to
> prefix static links with 'file:'.  When xdocs are swarming with linkmap:,
> java:, person:, mail:, etc links, why not have file:?  Conversely, if we
> want to "infer" the file: scheme, are we going to try to infer all the
> other schemes?

Hmmm, I don't see the big problem here, but I may well be wrong.

The schemes are link-rewriting systems. Why would we need to rewrite 
"file:"s? Remember that to get a specific type of "view" on the file we 
have the mime-type attribute in links.

>>If we had a CLI-only Forrest, I could say ok, let's do it, let's make 
>>Ant handle that, but I don't want to see different "special cases" of 
>>handling these spaces. Your proposal has IMHO the same drawbacks as it 
>>had before nevertheless.
> 
> Yes I see.  It hacks around a CLI bug, and introduces a mechanism by
> which further potentially-hack-requiring schemes (like java:) could be
> implemented.

I'm quite confident that we won't use "hack-requiring schemes".
At least that's my goal.

>>>>One thing we could do, is record all 'unprocessable' links in an external
>>>>file, and then the Ant script responsible for invoking Cocoon can look at
>>>>that, and ensure that the links won't break.  For example, say Cocoon
>>>>encounters an unprocessable 'java:org.apache.foo' link.  Cocoon records
>>>>that in unprocessed-files.txt, and otherwise ignores it.  Then, after the
>>>><java> task has finished running Cocoon, an Ant task examines
>>>>unprocessed-files.txt, and if any java: links are recorded, it invokes a
>>>>Javadoc task.
>>>>
>>>>So we have a kind of loose coupling between Cocoon and other doc
>>>>generators.  Cocoon isn't _responsible_ for generating Javadocs, but it
>>>>can _cause_ Javadocs to be generated, by recording the fact that it
>>>>encountered a java: link and couldn't handle it.
>>
>>Hmmm... this idea is somewhat new... the problem is that it breaks down 
>>with the Cocoon webapp.
> 
> It doesn't break down.  It makes the CLI solution independent of the
> webapp solution.  In the case of file:, the webapp happened to have
> solved the problem.
> 
> 
>>My point is IMHO simple: if the webapp Cocoon can handle it, the CLI 
>>should similarly handle it. No special cases. If Cocoon has to trigger 
>>some outer system, we already have Generators, Transformers, Actions, 
>>etc, no need to create another system that BTW bypasses all Cocoon 
>>environment abstractions.
> 
> 
> Yes, that's the ideal.
> 
> 
>>IMHO, Cocoon is the last step, the publishing step. This is the only way 
>>I see to keep consistency between the different Cocoon running modes. 
>>Hence I don't think that triggering actions after Cocoon CLI is going 
>>to solve problems, but instead create more since it breaks the sitemap.
> 
> Not break, just doesn't solve the problem with the same mechanism.
> Remember we only have two 'running modes': webapp and CLI.

Not for long. Gianugo is probably gonna work on an EJB environment soon, 
we have an Any one in the works, and in the future an 
Avalon-native-component version.

>>You say that the webapp is the primary Cocoon-Forrest method, and as you 
>>know I agree. The CLI is just a way of recreating the same 
>>user-experience by acting as a user that clicks on all links.
>>
>>BUT the user doesn't necessarily work like this, the user can also type 
>>in a URL in the address field, even if it's not linked, but CLI won't 
>>generate this.
>>Why?
>>Because Cocoon is not an invertible function. That means that given 
>>sources and a sitemap, we *cannot* create all the possible positive 
>>requests. Which in turn means that the Cocoon CLI will never be able to 
>>create a fully equivalent site as the webapp.
>>
>>So we should acknowledge that we need a mechanism that given some rules, 
>>can reasonably create an equivalent site. Crawling is it, and it 
>>generally works well, since usually sites need to be linked from a 
>>homepage to be accessed. Site usage goes through navigation, ie links.
>>
>>Now, Cocoon is not invertible, and this is IMHO a fact. But *parts* of 
>>the sitemap *are* invertible. These parts are basically those where a 
>>complete URI sub-space is mapped to a specific pipeline, and when no 
>>parts of it have been matched before.
>>
>>
>>    <map:match pattern="sub/URI/space/**">
>>       ...
>>    </map:match>
>>
>>
>>This means that we can safely invert Cocoon here, and look at the 
>>sources to know what the result will look like.
>>
>>Conceptually, this gives me the theoretical possibility of doing CLI 
>>optimizations for crawling without changing the Cocoon usage patterns. 
>>It's an optimization inside the CLI, and nothing outside changes.
> 
> Yes!  Today's Mr Clever Award goes to Nicola, for working all this out
> and presenting it so clearly :)
> 
> So really, the CLI could short-cut any URI served with <map:read>.

Not exactly. Non-reads can also be dealt with this way. It's not the 
read part that it short-cuts, but the URI space handling.
I.e., if a pipeline handles the whole URI space, the CLI can safely 
invert that *match* (not the pipeline). See below.

> The "how to invert a sitemap" question also pops up when trying to
> auto-generate a linkmap (specifically, link targets), so a general
> solution (insofar as one is possible) would be very useful.
> 
> One thing I don't see: how does the CLI know that when one Javadoc file
> is referenced, it must copy all of them across?  Remember, you stripped
> the 'java:' scheme in step 1.

Actually, it simply would not crawl that URI space.

This is how it could do it as a start:

1) get all the "matches" in the sitemap; attention must be paid to 
nested matches.

2) the ones ending in ** are to be taken into account.

3) for each of those matches, it inverts the match and is able to "map" 
the source and output spaces. Basically it scans all the subdirs defined 
in the match, gathers all the filenames, rewrites them as URIs using the 
inverted match, and calls cocoon on them one by one.

3b) [second optimization] *If* the pipeline is a read, it can simply 
copy the files across and change filenames according to the inverted 
match rule.

4) then it can start crawling the docs, remembering not to follow links 
in the spaces already generated.

In essence, we are able to not use crawling to generate parts of a 
website, so it's done much faster.
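
The steps above can be sketched as follows, in Python for brevity. This 
is only an illustration under simplifying assumptions: the function name 
invert_match is made up, the pattern may contain nothing but a single 
trailing '**', and the directory that the match reads from is passed in 
by hand instead of being extracted from the sitemap. It covers step 3 
(listing the URIs of a sub-space from its sources); the copy in step 3b 
and the crawl in step 4 are left out.

```python
import os


def invert_match(pattern: str, source_dir: str) -> list:
    """List the URIs that a 'prefix/**' match would serve from source_dir."""
    if not pattern.endswith("**") or "*" in pattern[:-2]:
        raise ValueError("only a single trailing '**' is handled in this sketch")
    prefix = pattern[:-2]
    uris = []
    # Scan all subdirectories, gather the filenames, rewrite them as URIs.
    for dirpath, _dirnames, filenames in os.walk(source_dir):
        for name in filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), source_dir)
            uris.append(prefix + rel.replace(os.sep, "/"))
    return sorted(uris)
```

Each returned URI could then be requested from Cocoon one by one, and the 
crawler told to skip every link that starts with the same prefix.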

>>Now, since the theory is solved, the question slides to how to do it, 
>>especially because the pattern can have sitemap variable substitutions 
>>in it.
> 
> So we have two options:
> 
> 1) Implement a sitemap inverter, use it to create a 'lookup table' of
> shortcuttable URIs, and then integrate this into the CLI.
> 2) Say "life's too short, let's just copy the files with Ant".
> 
> Now, practically, solution 1) is going to take a _long_ time to be
> developed.  If it comes down to me, it will be developed when the linkmap
> needs it.
> 
> So, given that 2) is dead simple and 90% implemented, how about going
> with it for now, and replacing it with 1) when that arrives?  As long as
> the public interface (link syntax) is maintained, we can switch
> implementations without affecting users.

Let's define the syntax then. I don't see the need for a "file:" scheme, 
so let's argue about that.

As for individual files, we should be able to fix it by using a 
MimeTypeAction that defines the actual mime-type of the file and/or 
fixing the CLI so that it doesn't append the html extension to stuff 
with an unknown mime-type.

-- 
Nicola Ken Barozzi                   nicolaken@apache.org
             - verba volant, scripta manent -
    (discussions get forgotten, just code remains)
---------------------------------------------------------------------

Re: Cocoon CLI - how to generate the whole site (Re: The Mythical Javadoc generator (Re: Conflict resolution))

Posted by Jeff Turner <je...@apache.org>.

On Mon, Dec 16, 2002 at 08:59:32AM +0100, Nicola Ken Barozzi wrote:
> 
> Jeff Turner wrote:
...
> >>We've established that Cocoon is not going to be invoking Javadoc.  That
> >>means that the user could generate the Javadocs _after_ they generate the
> >>Cocoon docs.
> >>
> >>To handle this possibility, the only course of action is to ignore links
> >>to external directories like Javadocs.  What alternative is there?
> 
> Yes, but I don't want this to happen, as I said in other mails.
> The fact is that for every URI sub-space we take away from Cocoon, we 
> should have something that manages it for Cocoon, and that's for *all* 
> the environments Cocoon has to offer, because Forrest is made to run in 
> all of them.

Ah, gotcha :)

Though remember, with the file: patch, the sitemap *did* serve up files,
through this rule:

<map:match pattern="**">
  <map:act type="resource-exists">
    <map:parameter name="url" value="content/{1}"/>
    <map:read src="content/{../1}"/>
  </map:act>

So it worked in both command-line and webapp.  The command-line solution
just happened to bypass the Cocoon CLI.

The file: patch has two effects:

 - Introduce schemes in xdocs, starting with a 'file:' scheme.  I think
   that schemes in general are uncontroversial.  When linkmaps arrive,
   90% of links are going to be linkmap links, so having a scheme prefix
   should be the norm. 

 - Routes around a CLI bug, by copying static files with Ant, rather than
   through the CLI.

What we really need to agree on is the first point; whether we want to
prefix static links with 'file:'.  When xdocs are swarming with linkmap:,
java:, person:, mail:, etc links, why not have file:?  Conversely, if we
want to "infer" the file: scheme, are we going to try to infer all the
other schemes?

> If we had a CLI-only Forrest, I could say ok, let's do it, let's make 
> Ant handle that, but I don't want to see different "special cases" of 
> handling these spaces. Your proposal has IMHO the same drawbacks as it 
> had before nevertheless.

Yes I see.  It hacks around a CLI bug, and introduces a mechanism by
which further potentially-hack-requiring schemes (like java:) could be
implemented.

> >>One thing we could do, is record all 'unprocessable' links in an external
> >>file, and then the Ant script responsible for invoking Cocoon can look at
> >>that, and ensure that the links won't break.  For example, say Cocoon
> >>encounters an unprocessable 'java:org.apache.foo' link.  Cocoon records
> >>that in unprocessed-files.txt, and otherwise ignores it.  Then, after the
> >><java> task has finished running Cocoon, an Ant task examines
> >>unprocessed-files.txt, and if any java: links are recorded, it invokes a
> >>Javadoc task.
> >>
> >>So we have a kind of loose coupling between Cocoon and other doc
> >>generators.  Cocoon isn't _responsible_ for generating Javadocs, but it
> >>can _cause_ Javadocs to be generated, by recording the fact that it
> >>encountered a java: link and couldn't handle it.
> 
> Hmmm... this idea is somewhat new... the problem is that it breaks down 
> with the Cocoon webapp.

It doesn't break down.  It makes the CLI solution independent of the
webapp solution.  In the case of file:, the webapp happened to have
solved the problem.

> My point is IMHO simple: if the webapp Cocoon can handle it, the CLI 
> should similarly handle it. No special cases. If Cocoon has to trigger 
> some outer system, we already have Generators, Transformers, Actions, 
> etc, no need to create another system that BTW bypasses all Cocoon 
> environment abstractions.

Yes, that's the ideal.

> IMHO, Cocoon is the last step, the publishing step. This is the only way 
> I see to keep consistency between the different Cocoon running modes. 
> Hence I don't think that triggering actions after Cocoon CLI is going 
> to solve problems, but instead create more since it breaks the sitemap.

Not break, just doesn't solve the problem with the same mechanism.
Remember we only have two 'running modes': webapp and CLI.

> You say that the webapp is the primary Cocoon-Forrest method, and as you 
> know I agree. The CLI is just a way of recreating the same 
> user-experience by acting as a user that clicks on all links.
> 
> BUT the user doesn't necessarily work like this, the user can also type 
> in a URL in the address field, even if it's not linked, but CLI won't 
> generate this.
> Why?
> Because Cocoon is not an invertible function. That means that given 
> sources and a sitemap, we *cannot* create all the possible positive 
> requests. Which in turn means that the Cocoon CLI will never be able to 
> create a fully equivalent site as the webapp.
> 
> So we should acknowledge that we need a mechanism that given some rules, 
> can reasonably create an equivalent site. Crawling is it, and it 
> generally works well, since usually sites need to be linked from a 
> homepage to be accessed. Site usage goes through navigation, ie links.
> 
> Now, Cocoon is not invertible, and this is IMHO a fact. But *parts* of 
> the sitemap *are* invertible. These parts are basically those where a 
> complete URI sub-space is mapped to a specific pipeline, and when no 
> parts of it have been matched before.
> 
> 
>     <map:match pattern="sub/URI/space/**">
>        ...
>     </map:match>
> 
> 
> This means that we can safely invert Cocoon here, and look at the 
> sources to know what the result will look like.
> 
> Conceptually, this gives me the theoretical possibility of doing CLI 
> optimizations for crawling without changing the Cocoon usage patterns. 
> It's an optimization inside the CLI, and nothing outside changes.

Yes!  Today's Mr Clever Award goes to Nicola, for working all this out
and presenting it so clearly :)

So really, the CLI could short-cut any URI served with <map:read>.

The "how to invert a sitemap" question also pops up when trying to
auto-generate a linkmap (specifically, link targets), so a general
solution (insofar as one is possible) would be very useful.

One thing I don't see: how does the CLI know that when one Javadoc file
is referenced, it must copy all of them across?  Remember, you stripped
the 'java:' scheme in step 1.

> Now, since the theory is solved, the question slides to how to do it, 
> especially because the pattern can have sitemap variable substitutions 
> in it.

So we have two options:

1) Implement a sitemap inverter, use it to create a 'lookup table' of
shortcuttable URIs, and then integrate this into the CLI.
2) Say "life's too short, let's just copy the files with Ant".

Now, practically, solution 1) is going to take a _long_ time to be
developed.  If it comes down to me, it will be developed when the linkmap
needs it.

So, given that 2) is dead simple and 90% implemented, how about going
with it for now, and replacing it with 1) when that arrives?  As long as
the public interface (link syntax) is maintained, we can switch
implementations without affecting users.

--Jeff

Re: Cocoon CLI - how to generate the whole site (Re: The Mythical Javadoc generator (Re: Conflict resolution))

Posted by Nicola Ken Barozzi <ni...@apache.org>.

Jeff Turner wrote:
> Nicola,
> 
> Mind replying to this?  It describes why some links are unprocessable by
> the Cocoon CLI, and proposes a general system for handling these links,
> of which my file: patch was an example.

Np. These days I have difficulty processing all the mail that passes 
through my inbox; I get more than 300 mails a day, so please do draw my 
attention to important mails like these if I fail to see them.

> --Jeff
> 
> On Sat, Dec 14, 2002 at 04:06:18AM +1100, Jeff Turner wrote:
> 
>>On Fri, Dec 13, 2002 at 05:31:59PM +0100, Nicola Ken Barozzi wrote:
>>
>>>Jeff Turner wrote:
>>>
>>>
>>>>The javadocs are _already_ generated, and <javadoc> has already put them
>>>>in build/site/apidocs/.  Now how is Cocoon (via the CLI) going to
>>>>"publish" them?
>>>
>>>Ok, now we finally get to the actual technical point. I will take this 
>>>discussion in a general way, because the issue is in fact quite general.
>>>
>>>                              -oOo-
>>>
>>>ATM, the Cocoon CLI system is completely crawler based. This means that
>>>it starts from a list of URLs, and "crawls" the site by getting the 
>>>links from these pages, putting them in the list, purging the visited 
>>>ones, and restarting the process with those.
>>>
>>>If we only have XML documents, the system can be made to be very fast 
>>>and semantically rich.
>>>
>>>  - fast
>>>   if we get the links while processing the file, we don't
>>>   have to reparse it later for the crawling
>>>
>>>  - semantically rich
>>>    we get the links not from the output, but from the real source.
>>>    In the sitemap, the source content, with all semantics, is
>>>    tagged and used for the link gathering. So we can even gather
>>>    links from an svg file that will become a jpeg image!
>>>
>>>Things start breaking a bit down when we have to use resources that are 
>>>not transformed to XML. Examples are CSS and massive docs to be included 
>>>like javadocs.
>>>
>>>The problem is not *reading* these files via Cocoon, but getting the 
>>>links from them. In the case of CSS we need the links, in case of 
>>>Javadocs, we know the dir structure and eventually would not need them.
>>>
>>>For the CSS, the best thing is actually parsing them and passing them in 
>>>the SAX pipeline. I see no technical nor conceptual problem with it.
>>>
>>>The problem arises when we need to pass files in "bulk". In this case 
>>>they are javadocs, but what about jars, binaries, images, all things 
>>>that are not necessarily linked in the site, or that we simply want to 
>>>dump in the resulting system?
>>>
>>>This is the answer that I seek.
>>
>>There is only one answer.
>>
>>We've established that Cocoon is not going to be invoking Javadoc.  That
>>means that the user could generate the Javadocs _after_ they generate the
>>Cocoon docs.
>>
>>To handle this possibility, the only course of action is to ignore links
>>to external directories like Javadocs.  What alternative is there?

Yes, but I don't want this to happen, as I said in other mails.
The fact is that for every URI sub-space we take away from Cocoon, we 
should have something that manages it for Cocoon, and that's for *all* 
the environments Cocoon has to offer, because Forrest is made to run in 
all of them.

If we had a CLI-only Forrest, I could say ok, let's do it, let's make 
Ant handle that, but I don't want to see different "special cases" of 
handling these spaces. Your proposal has IMHO the same drawbacks as it 
had before nevertheless.

>>One thing we could do, is record all 'unprocessable' links in an external
>>file, and then the Ant script responsible for invoking Cocoon can look at
>>that, and ensure that the links won't break.  For example, say Cocoon
>>encounters an unprocessable 'java:org.apache.foo' link.  Cocoon records
>>that in unprocessed-files.txt, and otherwise ignores it.  Then, after the
>><java> task has finished running Cocoon, an Ant task examines
>>unprocessed-files.txt, and if any java: links are recorded, it invokes a
>>Javadoc task.
>>
>>So we have a kind of loose coupling between Cocoon and other doc
>>generators.  Cocoon isn't _responsible_ for generating Javadocs, but it
>>can _cause_ Javadocs to be generated, by recording the fact that it
>>encountered a java: link and couldn't handle it.

Hmmm... this idea is somewhat new... the problem is that it breaks down 
with the Cocoon webapp.

My point is IMHO simple: if the webapp Cocoon can handle it, the CLI 
should similarly handle it. No special cases. If Cocoon has to trigger 
some outer system, we already have Generators, Transformers, Actions, 
etc, no need to create another system that BTW bypasses all Cocoon 
environment abstractions.

IMHO, Cocoon is the last step, the publishing step. This is the only way 
I see to keep consistency between the different Cocoon running modes. 
Hence I don't think that triggering actions after Cocoon CLI is going 
to solve problems, but instead create more since it breaks the sitemap.

You say that the webapp is the primary Cocoon-Forrest method, and as you 
know I agree. The CLI is just a way of recreating the same 
user-experience by acting as a user that clicks on all links.

BUT the user doesn't necessarily work like this: the user can also type 
a URL in the address field, even if it's not linked, but the CLI won't 
generate this.
Why?
Because Cocoon is not an invertible function. That means that given 
sources and a sitemap, we *cannot* create all the possible positive 
requests. Which in turn means that the Cocoon CLI will never be able to 
create a site fully equivalent to the webapp.

So we should acknowledge that we need a mechanism that given some rules, 
can reasonably create an equivalent site. Crawling is it, and it 
generally works well, since usually sites need to be linked from a 
homepage to be accessed. Site usage goes through navigation, ie links.
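
A minimal model of that crawl, in Python for brevity. It is not the 
Cocoon CLI code: crawl and links_of are hypothetical names, and the site 
is reduced to a function that returns the links found on a page. The 
point it illustrates is that a page nobody links to is never generated.

```python
from collections import deque


def crawl(start_uris, links_of):
    """Visit every URI reachable from start_uris by following links."""
    queue = deque()
    seen = set()
    for uri in start_uris:
        if uri not in seen:
            seen.add(uri)
            queue.append(uri)
    visited = []
    while queue:
        uri = queue.popleft()
        visited.append(uri)  # this is where the page would be generated
        for link in links_of(uri):
            if link not in seen:  # skip pages already queued or visited
                seen.add(link)
                queue.append(link)
    return visited
```

For a site where index.html links to a.html and b.html, and orphan.html 
is linked from nowhere, crawling from index.html yields the first three 
pages and never reaches orphan.html.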

Now, Cocoon is not invertible, and this is IMHO a fact. But *parts* of 
the sitemap *are* invertible. These parts are basically those where a 
complete URI sub-space is mapped to a specific pipeline, and when no 
parts of it have been matched before.

     <map:match pattern="sub/URI/space/**">
        ...
     </map:match>

This means that we can safely invert Cocoon here, and look at the 
sources to know what the result will look like.

Conceptually, this gives me the theoretical possibility of doing CLI 
optimizations for crawling without changing the Cocoon usage patterns. 
It's an optimization inside the CLI, and nothing outside changes.

Now, since the theory is solved, the question slides to how to do it, 
especially because the pattern can have sitemap variable substitutions 
in it.

-- 
Nicola Ken Barozzi                   nicolaken@apache.org
             - verba volant, scripta manent -
    (discussions get forgotten, just code remains)
---------------------------------------------------------------------

Re: Cocoon CLI - how to generate the whole site (Re: The Mythical Javadoc generator (Re: Conflict resolution))

Posted by Jeff Turner <je...@apache.org>.

Nicola,

Mind replying to this?  It describes why some links are unprocessable by
the Cocoon CLI, and proposes a general system for handling these links,
of which my file: patch was an example.


--Jeff

On Sat, Dec 14, 2002 at 04:06:18AM +1100, Jeff Turner wrote:
> On Fri, Dec 13, 2002 at 05:31:59PM +0100, Nicola Ken Barozzi wrote:
> > 
> > Jeff Turner wrote:
> > 
> > >The javadocs are _already_ generated, and <javadoc> has already put them
> > >in build/site/apidocs/.  Now how is Cocoon (via the CLI) going to
> > >"publish" them?
> > 
> > Ok, now we finally get to the actual technical point. I will take this 
> > discussion in a general way, because the issue is in fact quite general.
> > 
> >                               -oOo-
> > 
> > ATM, the Cocoon CLI system is completely crawler based. This means that
> > it starts from a list of URLs, and "crawls" the site by getting the 
> > links from these pages, putting them in the list, purging the visited 
> > ones, and restarting the process with those.
> > 
> > If we only have XML documents, the system can be made to be very fast 
> > and semantically rich.
> > 
> >   - fast
> >    if we get the links while processing the file, we don't
> >    have to reparse it later for the crawling
> > 
> >   - semantically rich
> >     we get the links not from the output, but from the real source.
> >     In the sitemap, the source content, with all semantics, is
> >     tagged and used for the link gathering. So we can even gather
> >     links from an svg file that will become a jpeg image!
> > 
> > Things start breaking a bit down when we have to use resources that are 
> > not transformed to XML. Examples are CSS and massive docs to be included 
> > like javadocs.
> > 
> > The problem is not *reading* these files via Cocoon, but getting the 
> > links from them. In the case of CSS we need the links, in case of 
> > Javadocs, we know the dir structure and eventually would not need them.
> > 
> > For the CSS, the best thing is actually parsing them and passing them in 
> > the SAX pipeline. I see no technical nor conceptual problem with it.
> > 
> > The problem arises when we need to pass files in "bulk". In this case 
> > they are javadocs, but what about jars, binaries, images, all things 
> > that are not necessarily linked in the site, or that we simply want to 
> > dump in the resulting system?
> > 
> > This is the answer that I seek.
> 
> There is only one answer.
> 
> We've established that Cocoon is not going to be invoking Javadoc.  That
> means that the user could generate the Javadocs _after_ they generate the
> Cocoon docs.
> 
> To handle this possibility, the only course of action is to ignore links
> to external directories like Javadocs.  What alternative is there?
> 
> 
> One thing we could do, is record all 'unprocessable' links in an external
> file, and then the Ant script responsible for invoking Cocoon can look at
> that, and ensure that the links won't break.  For example, say Cocoon
> encounters an unprocessable 'java:org.apache.foo' link.  Cocoon records
> that in unprocessed-files.txt, and otherwise ignores it.  Then, after the
> <java> task has finished running Cocoon, an Ant task examines
> unprocessed-files.txt, and if any java: links are recorded, it invokes a
> Javadoc task.
> 
> So we have a kind of loose coupling between Cocoon and other doc
> generators.  Cocoon isn't _responsible_ for generating Javadocs, but it
> can _cause_ Javadocs to be generated, by recording the fact that it
> encountered a java: link and couldn't handle it.
> 
> 
> --Jeff

Re: Cocoon CLI - how to generate the whole site (Re: The Mythical Javadoc generator (Re: Conflict resolution))

Posted by Jeff Turner <je...@apache.org>.

On Fri, Dec 13, 2002 at 05:31:59PM +0100, Nicola Ken Barozzi wrote:
> 
> Jeff Turner wrote:
> 
> >The javadocs are _already_ generated, and <javadoc> has already put them
> >in build/site/apidocs/.  Now how is Cocoon (via the CLI) going to
> >"publish" them?
> 
> Ok, now we finally get to the actual technical point. I will take this 
> discussion in a general way, because the issue is in fact quite general.
> 
>                               -oOo-
> 
> ATM, the Cocoon CLI system is completely crawler based. This means that
> it starts from a list of URLs, and "crawls" the site by getting the 
> links from these pages, putting them in the list, purging the visited 
> ones, and restarting the process with those.
> 
> If we only have XML documents, the system can be made to be very fast 
> and semantically rich.
> 
>   - fast
>    if we get the links while processing the file, we don't
>    have to reparse it later for the crawling
> 
>   - semantically rich
>     we get the links not from the output, but from the real source.
>     In the sitemap, the source content, with all semantics, is
>     tagged and used for the link gathering. So we can even gather
>     links from an svg file that will become a jpeg image!
> 
> Things start to break down a bit when we have to use resources that are 
> not transformed to XML. Examples are CSS and massive docs to be included 
> like javadocs.
> 
> The problem is not *reading* these files via Cocoon, but getting the 
> links from them. In the case of CSS we need the links; in the case of 
> Javadocs, we know the dir structure and may not need them at all.
> 
> For the CSS, the best thing is actually parsing them and passing them in 
> the SAX pipeline. I see no technical nor conceptual problem with it.
> 
> The problem arises when we need to pass files in "bulk". In this case 
> they are javadocs, but what about jars, binaries, images, all things 
> that are not necessarily linked in the site, or that we simply want to 
> dump in the resulting system?
> 
> This is the answer that I seek.

There is only one answer.

We've established that Cocoon is not going to be invoking Javadoc.  That
means that the user could generate the Javadocs _after_ they generate the
Cocoon docs.

To handle this possibility, the only course of action is to ignore links
to external directories like Javadocs.  What alternative is there?

One thing we could do is record all 'unprocessable' links in an external
file, and then the Ant script responsible for invoking Cocoon can look at
that, and ensure that the links won't break.  For example, say Cocoon
encounters an unprocessable 'java:org.apache.foo' link.  Cocoon records
that in unprocessed-files.txt, and otherwise ignores it.  Then, after the
<java> task has finished running Cocoon, an Ant task examines
unprocessed-files.txt, and if any java: links are recorded, it invokes a
Javadoc task.
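
To make the idea concrete, here is a rough sketch in Python (not Cocoon
code, and not how the CLI is actually implemented).  It models the crawl
as a worklist, sets aside any link it cannot process, writes those links
to unprocessed-files.txt, and then lets a separate step decide whether
Javadoc needs to run.  The function names (crawl, record_unprocessed,
javadoc_needed) and the one-link-per-line file format are assumptions
made up for illustration; in a real build the last step would be an Ant
target rather than a Python function.

```python
from collections import deque


def crawl(start_uris, get_links, can_process):
    """Worklist crawl. Returns (visited, unprocessed) in discovery order.

    get_links(uri)   -> iterable of links found in that page (hypothetical hook)
    can_process(uri) -> False for links the crawler cannot handle, e.g. java:
    """
    queue = deque(dict.fromkeys(start_uris))  # de-duplicate, keep order
    seen = set(queue)
    visited, unprocessed = [], []
    while queue:
        uri = queue.popleft()
        if not can_process(uri):
            # Record the link and otherwise ignore it.
            unprocessed.append(uri)
            continue
        visited.append(uri)
        for link in get_links(uri):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited, unprocessed


def record_unprocessed(path, links):
    """Write one unprocessable link per line (assumed file format)."""
    with open(path, "w", encoding="utf-8") as fh:
        for link in links:
            fh.write(link + "\n")


def javadoc_needed(path):
    """Post-crawl check: was any java: link recorded?"""
    with open(path, encoding="utf-8") as fh:
        return any(line.startswith("java:") for line in fh)
```

The point of the split is that crawl() never knows anything about
Javadoc.  It only reports what it could not handle, and whatever reads
the file afterwards decides what to do about it.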

So we have a kind of loose coupling between Cocoon and other doc
generators.  Cocoon isn't _responsible_ for generating Javadocs, but it
can _cause_ Javadocs to be generated, by recording the fact that it
encountered a java: link and couldn't handle it.

--Jeff