Posted to users@cocoon.apache.org by Tony Collen <tc...@hist.umn.edu> on 2002/11/07 23:57:56 UTC

URL Theory & Best Practices

Apologies for the extra long post, but this has been bugging me for a 
while.

First, some background:

I'm attempting to put together a URL space using cocoon that will allow 
users to drop an XML file into a directory, say 
$TOMCAT_HOME/webapps/cocoon/documents/ and have it published.  This is 
easy enough:

<map:match pattern="*.html>
    <map:generate src="documents/{1}.html"/>
    <map:transform src="stylesheets/page2html.xsl"/>
    <map:serialize type="xhtml"/>
</map:match>

So then I decide that for organization's sake, I want to allow people to 
create subdirectories under documents/ any number of levels deep, and 
still have cocoon publish them. This is also fairly simple:

<map:match pattern="**/*.html">
    <map:generate src="documents/{1}.html"/>
    <map:transform src="stylesheets/page2html.xsl"/>
    <map:serialize type="xhtml"/>
</map:match>

However, later I realize that using file extensions is "bad".  Read 
http://www.alistapart.com/stories/slashforward/ for more info on this 
idea.  

This creates problems with how I automatically generate content using 
Cocoon.  I want to allow people to create content arbitrarily deep in 
the documents/ directory, but I run into a bunch of questions.

Should trailing slashes always be used? I think so.  

Therefore: Consider an HTTP request for "/a/b/c/".

    1. Is it a request for the discrete resource named "c" which is 
contained in "b"?
    2. Is it a request for the listing of all the contents of the "c" 
resource (which is in turn contained within "b")?
    3. Is this equivalent to a request for "/a/b/c"?  
        3b. Should a request for something without a trailing slash be 
redirected to the same URL, but with a trailing slash added? (See the 
sketch below.)
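
If the answer to 3b is yes, one way to express it in the sitemap could 
look like the sketch below.  This is only a rough sketch: it assumes a 
regexp matcher is declared in the sitemap as type="regexp", and the 
match has to come after the more specific matches (such as the *.html 
ones above) so it does not swallow them.

<map:match type="regexp" pattern="^(.*[^/])$">
    <!-- {1} is the request URI, which does not end in a slash;
         send the client to the same URI with a slash appended. -->
    <map:redirect-to uri="{1}/"/>
</map:match>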

Using the "best practice" of always having trailing slashes creates 
problems when mapping the virtual URL space to a physical directory 
structure.  Considering a request for "/a/b/c/", do I go into 
documents/a/b/c/ and generate from index.xml?  Or do I go to 
documents/a/b/ and generate from c.xml?  Having every "leaf" be a 
directory with an index.xml gets to be unmaintainable, IMO.

Likewise, do I generate from documents/a/b/d.xml or 
documents/a/b/d/index.xml for a request of "/a/b/d"?  Additionally, what 
should happen when there's a request for "/a/b/"?  Obviously, if the 
subdirectory "b" exists, it would not be correct to go to documents/a/ 
and look for b.xml.
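
One way to resolve this directory-versus-leaf ambiguity is to probe 
the filesystem at request time.  The sketch below assumes your Cocoon 
version ships the resource-exists selector and that it is declared in 
the sitemap's components section; the paths mirror the examples above.

<map:match pattern="**/">
    <map:select type="resource-exists">
        <map:when test="documents/{1}/index.xml">
            <!-- "a/b/c/" is a directory: serve its index document. -->
            <map:generate src="documents/{1}/index.xml"/>
        </map:when>
        <map:otherwise>
            <!-- Otherwise treat "c" as a leaf next to its siblings. -->
            <map:generate src="documents/{1}.xml"/>
        </map:otherwise>
    </map:select>
    <map:transform src="stylesheets/page2html.xsl"/>
    <map:serialize type="xhtml"/>
</map:match>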

Part of my reasoning behind all these questions lies in my quest for 
creating an uber-flexible "drop-in" directory structure where people can 
simply add their .xml files to the "documents" directory and have Cocoon 
automagically publish them, as I stated above.  The other reason for 
this is that I'm trying to devise a system which automatically creates 
navigation, as well. I've looked at the Bonebreaker example, and it's 
good, but has some limitations.  What if I don't want to use the naming 
scheme they have?

Oh well, thanks for listening to my ramblings, and hopefully I can get 
some light shed on this situation, as well as have a nifty autonavbar 
work eventually :)  

Regards,
Tony




Re: URL Theory & Best Practices

Posted by Kjetil Kjernsmo <kj...@kjernsmo.net>.
On Saturday 09 November 2002 23:57, Barbara Post wrote:
> Oh, I get 406 code, I didn't know this one !!
> I have IE6 SP1 on Windows 2000 Pro.

Hehe, oh well, that's another browser quirk, but a much less serious 
one.  I use language negotiation too, so what everybody _should_ do is 
go into their settings and make sure they enable all languages they 
know how to read... Check out http://www.debian.org/intro/cn for a 
howto... Failing that, browser vendors should add an *;q=0.001 to 
their language strings to avoid this error, but that is much more of 
an IMHO than the other things I've written in this thread... :-) 

I tried to talk the Mozilla folks into that in Bug 55800, 
http://bugzilla.mozilla.org/show_bug.cgi?id=55800 and the fact that 
you're getting 406 is proof that they are wrong... :-)
Anyway, I've been logging language settings for a long time on one of my 
sites, and in fact very few users had browsers where it would break, 
and language negotiation is quite cool, so I decided to use it. 
Besides, those users who have it wrong could be catered for with a good 
error handler, if I had bothered... :-) 

Thanks for checking! :-)

Best,

Kjetil
-- 
Kjetil Kjernsmo
Astrophysicist/IT Consultant/Skeptic/Ski-orienteer/Orienteer/Mountaineer
kjetil@kjernsmo.net  webmaster@skepsis.no  editor@learn-orienteering.org
Homepage: http://www.kjetil.kjernsmo.net/




Re: URL Theory & Best Practices

Posted by Barbara Post <ba...@ifrance.com>.
Oh, I get a 406 code, I didn't know that one!!
I have IE6 SP1 on Windows 2000 Pro.

Babs
--
website : www.babsfrance.fr.st
ICQ : 135868405
----- Original Message -----
From: "Kjetil Kjernsmo" <kj...@kjernsmo.net>
To: <co...@xml.apache.org>
Sent: Saturday, November 09, 2002 11:42 PM
Subject: Re: URL Theory & Best Practices


>
> Uh-oh.... I'm catching some bad vibs... Can someone do me a favour of
> going to http://www.kjernsmo.net/ with IE6 and see what happens?
>
> The mainpage isn't a big thing, it is pure XHTML, but per the XHTML 1.0
> spec, it is served as text/html, but it is using simple Apache content
> negotation to set that. So, I've got this bad feeling that IE is going
> to ignore the content-type header and just list it as raw XML with no
> stylesheet, because that is what would be a logical consequence of what
> you write. But I can't for the life of me understand how it can be
> standards-compliant...
>
> Best,
>
> Kjetil
> --
> Kjetil Kjernsmo
> Astrophysicist/IT Consultant/Skeptic/Ski-orienteer/Orienteer/Mountaineer
> kjetil@kjernsmo.net  webmaster@skepsis.no  editor@learn-orienteering.org
> Homepage: http://www.kjetil.kjernsmo.net/






Re: URL Theory & Best Practices

Posted by "Antonio A. Gallardo Rivera" <ag...@agsoftware.dnsalias.com>.
I don't know if MS IE 6.0 is standards compliant or not. I don't care. I
think MS IE tries to be compliant. I develop under Linux. But I also know
that MS IE 6.0 is the most used browser in the world. So? I also check
how things look in MS IE.

I don't want to keep arguing about whether this will work or not. Build a
simple example and you will see that:

"Opening a PDF from a filename without an extension in MS IE 6.0 SP1
currently does not work." This is a fact!

Maybe in the future we can use URIs without extensions. But I am
developing for current browsers. Every developer must take that into
account.

Again, I agree with the theory of not using extensions. But for now, it
does not work in the PDF case under MS Windows.

I already use FOP and PDF serialization. Why give people (mainly newbies)
wrong advice about how to do it when it does not work?

In the end, please don't take me wrong. I am just trying to help. :-D

Regards,

Antonio Gallardo.

Kjetil Kjernsmo wrote:
> On Saturday 09 November 2002 23:41, Antonio A. Gallardo Rivera wrote:
>> The true is that I wrote. If you dont believe me, I recommend you to
>> check the archive of this mailing list. This was not my fault. Not
>> only I found this error many other people had the same problem with IE
>> 6.0 SP1. I fighted with generation of PDF the content for a day after
>> I realize that the extension must be .pdf or it will not work!
>>
>> This is why I told you about the fine theory and the cruel reality.
>> :-D
>
> Uh-oh.... I'm catching some bad vibs... Can someone do me a favour of
> going to http://www.kjernsmo.net/ with IE6 and see what happens?
>
> The mainpage isn't a big thing, it is pure XHTML, but per the XHTML 1.0
> spec, it is served as text/html, but it is using simple Apache content
> negotation to set that. So, I've got this bad feeling that IE is going
> to ignore the content-type header and just list it as raw XML with no
> stylesheet, because that is what would be a logical consequence of what
> you write. But I can't for the life of me understand how it can be
> standards-compliant...
>
> Best,
>
> Kjetil
> --
> Kjetil Kjernsmo
> Astrophysicist/IT Consultant/Skeptic/Ski-orienteer/Orienteer/Mountaineer
> kjetil@kjernsmo.net  webmaster@skepsis.no  editor@learn-orienteering.org
> Homepage: http://www.kjetil.kjernsmo.net/






Re: URL Theory & Best Practices

Posted by "J.Pietschmann" <j3...@yahoo.de>.
Kjetil Kjernsmo wrote:
> So, I've got this bad feeling that IE is going 
> to ignore the content-type header ...
> But I can't for the life of me understand how it can be
> standards-compliant...  

Well, IEx does not in general ignore the content-type header, and it 
is, more or less, standards compliant, just in a somewhat special way.

From various rumours and gossip I compiled the following story: IEx 
uses a variety of COM components for handling content. A correct 
implementation would be to open the network connection, read the 
headers including the content-type header, decide which component 
handles the content, and then hand over the relevant headers and the 
open connection to the component. It seems that handing open 
connections to arbitrary COM components is difficult, or was difficult 
at the time the architecture of IEx was decided, so the browser 
component takes a look at the URL, extracts what it thinks could be a 
"file extension", looks up whatever component is registered for this 
string in the Windows registry (note that MIME types are not keys 
there) and then hands the URL to that component.

Obviously it's up to the component what happens if the content type 
does not match one of the types the component can handle, or whether 
it even honors the content-type header. In many cases a mismatch 
causes the connection to be closed, and another component, determined 
by the content-type, gets the URL. BTW, this is the mechanism the Klez 
virus uses to get into Windows systems. Some components seem to take a 
second look at the URL, and sometimes they return errors or something 
which causes the browser component to fall back to the default HTML 
renderer, which then most often draws a blank. Caching plays a role 
too. Also, the algorithms for extracting a "file extension" and 
perhaps content negotiation seem to be implemented multiple times, and 
probably in different ways, in various components, or perhaps the 
components don't have access to the necessary data (like cookies) all 
the time.

The user usually doesn't notice anything. Problems arise if the URL 
points to dynamic content, where a second GET can cause different 
stuff to be retrieved, in particular if the content wasn't completely 
read or wasn't cached for other reasons (like SSL).
Disclaimer: most of the above is second hand knowledge.

HTH
J.Pietschmann




Re: URL Theory & Best Practices

Posted by "Antonio A. Gallardo Rivera" <ag...@agsoftware.dnsalias.com>.

Erik Bruchez wrote:
>> Uh-oh.... I'm catching some bad vibs... Can someone do me a favour of
>> going to http://www.kjernsmo.net/ with IE6 and see what happens?
>>
> I don't think IE ignores the content-type header. It may be more lax
> than NS when it sees a file extension, but the home page of your site
> does not have any. The page displays fine with IE 6.
>
> -Erik

I talked about the case of PDF serialization!

Antonio Gallardo






Re: URL Theory & Best Practices

Posted by Erik Bruchez <er...@bruchez.org>.
> Uh-oh.... I'm catching some bad vibs... Can someone do me a favour of
> going to http://www.kjernsmo.net/ with IE6 and see what happens?
>
I don't think IE ignores the content-type header. It may be more lax 
than NS when it sees a file extension, but the home page of your site 
does not have any. The page displays fine with IE 6.

-Erik





Re: URL Theory & Best Practices

Posted by Kjetil Kjernsmo <kj...@kjernsmo.net>.
On Saturday 09 November 2002 23:41, Antonio A. Gallardo Rivera wrote:
> The true is that I wrote. If you dont believe me, I recommend you to
> check the archive of this mailing list. This was not my fault. Not
> only I found this error many other people had the same problem with
> IE 6.0 SP1. I fighted with generation of PDF the content for a day
> after I realize that the extension must be .pdf or it will not work!
>
> This is why I told you about the fine theory and the cruel reality.
> :-D

Uh-oh.... I'm catching some bad vibes... Can someone do me a favour and 
go to http://www.kjernsmo.net/ with IE6 and see what happens?

The main page isn't a big thing; it is pure XHTML and, per the XHTML 1.0 
spec, it is served as text/html, using simple Apache content 
negotiation to set that. So I've got this bad feeling that IE is going 
to ignore the content-type header and just list it as raw XML with no 
stylesheet, because that would be a logical consequence of what you 
write. But I can't for the life of me understand how that can be 
standards-compliant...  

Best,

Kjetil
-- 
Kjetil Kjernsmo
Astrophysicist/IT Consultant/Skeptic/Ski-orienteer/Orienteer/Mountaineer
kjetil@kjernsmo.net  webmaster@skepsis.no  editor@learn-orienteering.org
Homepage: http://www.kjetil.kjernsmo.net/




Re: URL Theory & Best Practices

Posted by "Antonio A. Gallardo Rivera" <ag...@agsoftware.dnsalias.com>.
What I wrote is true. If you don't believe me, I recommend you check the
archive of this mailing list. This was not my fault: not only did I find
this error, many other people had the same problem with IE 6.0 SP1. I
fought with PDF generation for a whole day before I realized that the
extension must be .pdf or it will not work!

This is why I told you about the fine theory and the cruel reality. :-D


Antonio Gallardo.

Kjetil Kjernsmo wrote:
> On Saturday 09 November 2002 21:33, Miles Elam wrote:
>> Antonio A. Gallardo Rivera wrote:
>> >Kjetil Kjernsmo dijo:
>> >>On Thursday 07 November 2002 23:57, Tony Collen wrote:
>> >>>However, later I realize that using file extensions is "bad".
>> >>> Read http://www.alistapart.com/stories/slashforward/ for more info
>> on this idea.
>> >
>> >I know about that. The theory is fine, but in the real world... Are
>> > you tried to open a PDF file without the .pdf extension with MS IE
>> 6.0 SP1?
>
> No, I have barely touched IE since 3.0.
>
>> > It does not work. MS IE relays mainly on the extension of
>> > the file to open a pdf file.
>
> What?!? What you're saying is that IE is ignoring the content-type?
> That's just incredibly silly...
>
>> How we can address this? I already
>> > know that Carsten and Mathew in his book dont recommend the use of
>> extension and I agree. But how we can tell MS Internet Explorer
>> about that?
>>
>> PDF isn't IE's normal method of receiving information (ease of use
>> with Acrobat aside).  If you specifically want the PDF
>> representation, specify *.pdf.  If what you want is the resource, then
>> you aren't asking specifically for PDF.  If all you have is PDF and
>> PDF is the only representation, then having your URL specify that you
>> are serving PDF hurts no one and corrupts no URLs.
>
> Yes it does! What representation is chosen should only depend on the
> Accept header, and what the UA should do with a file it receives should
> have nothing to do with the filename whatsoever, it should be based on
> the Content-Type-header in the response, solely. It's been a while
> since I read the HTTP 1.1 spec, but IIRC, it is pretty clearly spelled
> out there. It should only depend on the MIME-type. On the server and
> client sides, separately, how it is done is of no concern of anybody,
> but that the client depends on what file extension the server uses has
> to be a violation of the spec, again IIRC.
>
> During content negotation, an extensionless URL should be responded to
> with 200 if the server has a representation which is acceptable
> according to the client's Accept*-headers, with a Location-header
> saying where to find the best file, and that file may well have a .pdf
> extension. If no appropriate representation is found, the server should
> respond with 406.
>
> </rant>
>
> Cheers,
>
> Kjetil
> --
> Kjetil Kjernsmo
> Astrophysicist/IT Consultant/Skeptic/Ski-orienteer/Orienteer/Mountaineer
> kjetil@kjernsmo.net  webmaster@skepsis.no  editor@learn-orienteering.org
> Homepage: http://www.kjetil.kjernsmo.net/






Re: URL Theory & Best Practices

Posted by Kjetil Kjernsmo <kj...@kjernsmo.net>.
On Saturday 09 November 2002 21:33, Miles Elam wrote:
> Antonio A. Gallardo Rivera wrote:
> >Kjetil Kjernsmo dijo:
> >>On Thursday 07 November 2002 23:57, Tony Collen wrote:
> >>>However, later I realize that using file extensions is "bad". 
> >>> Read http://www.alistapart.com/stories/slashforward/ for more
> >>> info on this idea.
> >
> >I know about that. The theory is fine, but in the real world... Are
> > you tried to open a PDF file without the .pdf extension with MS IE
> > 6.0 SP1?

No, I have barely touched IE since 3.0.

> > It does not work. MS IE relays mainly on the extension of
> > the file to open a pdf file. 

What?!? What you're saying is that IE is ignoring the content-type? 
That's just incredibly silly... 

> How we can address this? I already
> > know that Carsten and Mathew in his book dont recommend the use of
> > extension and I agree. But how we can tell MS Internet Explorer
> > about that?
>
> PDF isn't IE's normal method of receiving information (ease of use
> with Acrobat aside).  If you specifically want the PDF
> representation, specify *.pdf.  If what you want is the resource,
> then you aren't asking specifically for PDF.  If all you have is PDF
> and PDF is the only representation, then having your URL specify that
> you are serving PDF hurts no one and corrupts no URLs.

Yes it does! Which representation is chosen should depend only on the 
Accept header, and what the UA should do with a file it receives should 
have nothing to do with the filename whatsoever; it should be based 
solely on the Content-Type header in the response. It's been a while 
since I read the HTTP 1.1 spec, but IIRC it is pretty clearly spelled 
out there: it should depend only on the MIME type. How it is done on 
the server and client sides, separately, is nobody else's concern, but 
a client that depends on what file extension the server uses has to be 
in violation of the spec, again IIRC. 

During content negotiation, an extensionless URL should be responded to 
with 200 if the server has a representation which is acceptable 
according to the client's Accept*-headers, with a Location header 
saying where to find the best file, and that file may well have a .pdf 
extension. If no appropriate representation is found, the server should 
respond with 406. 

</rant>

Cheers,

Kjetil
-- 
Kjetil Kjernsmo
Astrophysicist/IT Consultant/Skeptic/Ski-orienteer/Orienteer/Mountaineer
kjetil@kjernsmo.net  webmaster@skepsis.no  editor@learn-orienteering.org
Homepage: http://www.kjetil.kjernsmo.net/




Re: URL Theory & Best Practices

Posted by Miles Elam <mi...@pcextremist.com>.
Antonio A. Gallardo Rivera wrote:

>Kjetil Kjernsmo dijo:
>  
>
>>On Thursday 07 November 2002 23:57, Tony Collen wrote:
>>    
>>
>>>However, later I realize that using file extensions is "bad".  Read
>>>http://www.alistapart.com/stories/slashforward/ for more info on this
>>>idea.
>>>      
>>>
>
>I know about that. The theory is fine, but in the real world... Are you
>tried to open a PDF file without the .pdf extension with MS IE 6.0 SP1? It
>does not work. MS IE relays mainly on the extension of the file to open a
>pdf file. How we can address this? I already know that Carsten and Mathew
>in his book dont recommend the use of extension and I agree. But how we
>can tell MS Internet Explorer about that?
>
PDF isn't IE's normal method of receiving information (ease of use with 
Acrobat aside).  If you specifically want the PDF representation, 
specify *.pdf.  If what you want is the resource, then you aren't asking 
specifically for PDF.  If all you have is PDF and PDF is the only 
representation, then having your URL specify that you are serving PDF 
hurts no one and corrupts no URLs.
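
In Cocoon terms that can be as simple as keeping an explicit *.pdf 
match next to the extensionless one.  The sketch below is only an 
illustration: the source layout, the page2fo.xsl stylesheet and the 
use of the standard fo2pdf serializer are assumptions, not something 
taken from this thread.

<map:match pattern="**/*.pdf">
    <!-- Explicit PDF representation of the same source document. -->
    <map:generate src="documents/{1}/{2}.xml"/>
    <map:transform src="stylesheets/page2fo.xsl"/>
    <map:serialize type="fo2pdf"/>
</map:match>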

- Miles





Re: URL Theory & Best Practices

Posted by "Antonio A. Gallardo Rivera" <ag...@agsoftware.dnsalias.com>.
Kjetil Kjernsmo wrote:
> Hi!
>
> Interesting thread! Most things has been said allready, but I'll just
> add a little .02 (whatever currency) :-)
>
> On Thursday 07 November 2002 23:57, Tony Collen wrote:
>
>> However, later I realize that using file extensions is "bad".  Read
>> http://www.alistapart.com/stories/slashforward/ for more info on this
>> idea.

I know about that. The theory is fine, but in the real world... Have you
tried to open a PDF file without the .pdf extension with MS IE 6.0 SP1? It
does not work. MS IE relies mainly on the file extension to open a PDF
file. How can we address this? I already know that Carsten and Mathew
don't recommend the use of extensions in their book, and I agree. But how
can we tell MS Internet Explorer about that?

Antonio Gallardo.
>
> The article is interesting, but a little too narrow. Indeed, using file
> extensions are Bad[tm] for URIs because you tie the address to a
> specific technology, which you may not be using in some years. For that
> reason not changing the default "cocoon" in Cocoon URIs are also a  Bad
> Thing[tm].
>
> The authorative reference on this topic is TimBL's rant "Cool URIs don't
>  change": http://www.w3.org/Provider/Style/URI :-)
>
> So, using directories for everything is one possibility, and if you do,
> make sure to include the trailing /, to avoid useless 301 redirects.
> Another option is to use Content Negotation, which is well defined in
> HTTP 1.1 (and earlier, IIRC), it's weird that it isn't more widely
> used.
>
> But, both content negotation and using directories for everything are
> both solutions that exists mainly because URIs have been so strongly
> tied to the file system of the server, and the mentioned article seems
> to take as granted that this connection is a necessity, but as Cocoon
> proves, this is not so. Just use sensible matches. It also means that
> requiring a trailing slash on every URI is a bit too much, I only do
> that if there is logically a hierarchal substructure.
>
> As for the problem of serving different formats to the client, I really
> have no good solution. What the user agents should do, was to let the
> user easily manipulate the Accept-header, so if the user wants a
> PDF-file, he would send only application/pdf in the Accept header, and
> the server would know that the user wanted a PDF-file, and send that.
> Given that it doesn't exist, appending the type to the URI is probably
> not too bad.
>
> Best,
>
> Kjetil
> --
> Kjetil Kjernsmo
> Astrophysicist/IT Consultant/Skeptic/Ski-orienteer/Orienteer/Mountaineer
> kjetil@kjernsmo.net  webmaster@skepsis.no  editor@learn-orienteering.org
> Homepage: http://www.kjetil.kjernsmo.net/






Re: URL Theory & Best Practices

Posted by Kjetil Kjernsmo <kj...@kjernsmo.net>.
Hi!

Interesting thread! Most things have been said already, but I'll just 
add a little .02 (whatever currency) :-)

On Thursday 07 November 2002 23:57, Tony Collen wrote:

> However, later I realize that using file extensions is "bad".  Read
> http://www.alistapart.com/stories/slashforward/ for more info on this
> idea.

The article is interesting, but a little too narrow. Indeed, using file 
extensions is Bad[tm] for URIs, because you tie the address to a 
specific technology which you may not be using in a few years. For that 
reason, not changing the default "cocoon" in Cocoon URIs is also a 
Bad Thing[tm].

The authoritative reference on this topic is TimBL's rant "Cool URIs don't 
change": http://www.w3.org/Provider/Style/URI :-)

So, using directories for everything is one possibility, and if you do, 
make sure to include the trailing /, to avoid useless 301 redirects. 
Another option is to use content negotiation, which is well defined in 
HTTP 1.1 (and earlier, IIRC); it's weird that it isn't more widely 
used. 

But content negotiation and using directories for everything are both 
solutions that exist mainly because URIs have been so strongly tied to 
the file system of the server, and the mentioned article seems to take 
for granted that this connection is a necessity; as Cocoon proves, it 
is not. Just use sensible matches. It also means that requiring a 
trailing slash on every URI is a bit too much; I only do that if there 
is logically a hierarchical substructure. 

As for the problem of serving different formats to the client, I really 
have no good solution. What user agents should do is let the user 
easily manipulate the Accept header, so if the user wants a PDF file, 
he would send only application/pdf in the Accept header, and the 
server would know that the user wanted a PDF file and send that. Given 
that this doesn't exist, appending the type to the URI is probably not 
too bad. 

Best,

Kjetil
-- 
Kjetil Kjernsmo
Astrophysicist/IT Consultant/Skeptic/Ski-orienteer/Orienteer/Mountaineer
kjetil@kjernsmo.net  webmaster@skepsis.no  editor@learn-orienteering.org
Homepage: http://www.kjetil.kjernsmo.net/




Re: URL Theory & Best Practices

Posted by Miles Elam <mi...@pcextremist.com>.
Nevermind.  I take it all back (well...some of it).  I admit that the 
trailing slash is an artifact of my experience and not of utility.

Damned mailing lists...you can't remove things you wish you hadn't said. 
 Accountability's overrated.  ;-)

- Miles

Miles Elam wrote:

> Justin Fagnani-Bell wrote:
>
>> This is definitely where we differ. I don't see why an intrinsic 
>> resource should always end in a '/'. If /a/b.pdf is the PDF 
>> representation then why shouldn't /a/b be the intrinsic resource? The 
>> only reason I see why the trailing slash is recommended is because 
>> developers are used to having their URI space tied to their 
>> filesystem structure with a static server like Apache. The trailing 
>> slash, from our experience with filesystems,  indicates that 
>> something is a directory, that it has children. But in a URI a 
>> resource can be both a viewable resource and a container node at the 
>> same time. There's certainly nothing stopping /a/b/, /a/b, /a/b.pdf 
>> and /a/b/c.pdf from all being valid URI's in the same space. To me 
>> the trailing slash simple indicates that there's more to come at 
>> lower levels, and the absence of it means the resource is a leaf.
>
> You're right in that it is what we are used to but not necessarily 
> because of the filesystem.  I misspoke in this case where /a/b could 
> indeed be a resource in some cases.  One major problem lies in clients 
> like IE (for better or for worse the dominant viewer) which don't 
> always behave correctly even when the correct MIME type is sent.  The 
> other is when the resource references other resources.
>
> Take a web article by Oreilly for example.  These articles have 
> images, multiple pages, talkbacks, etc.  If /a/b is the intrinsic 
> resource, how do we logically access the first figure in that 
> article?  How do we access the third page?  Aren't multiple pages just 
> another representation of the resource?  PDFs can encompass multiple 
> pages.  A web page made for printout would encompass only one long 
> page.  Would it be /a/b/printable.html?
>
> Is one more correct than another?  I don't think so -- it seems to 
> come down to personal preference, all other things being equal.  I 
> think IE would have fewer problems with the slash.  I personally don't 
> view the trailing slash as a directory but as a resource collection.  
> Perhaps a collection of representations?  But that's just semantics 
> and I'm grasping here.
>
> In the example of the Oreilly article, I think that there is more to 
> come at the lower levels, there is no absence of lower levels when 
> representations are considered lower levels, and that it's a node and 
> not a leaf.  I can only think that a resource would be a leaf if it 
> and its siblings never have inline constituents like images, multiple 
> pages, plugins, etc..






Re: URL Theory & Best Practices

Posted by Miles Elam <mi...@pcextremist.com>.
Justin Fagnani-Bell wrote:

>>>> This is where we differ slightly.  In my mind /a/b/ is the 
>>>> intrinsic resource.  /a/b/index.html is the explicit call for HTML 
>>>> represention of /a/b/.  If you redirect a client to /a/b/index.html 
>>>> and the client bookmarks it, they are bookmarking the HTML 
>>>> representation, not the intrinsic resource.
>>>
> This is definitely where we differ. I don't see why an intrinsic 
> resource should always end in a '/'. If /a/b.pdf is the PDF 
> representation then why shouldn't /a/b be the intrinsic resource? The 
> only reason I see why the trailing slash is recommended is because 
> developers are used to having their URI space tied to their filesystem 
> structure with a static server like Apache. The trailing slash, from 
> our experience with filesystems,  indicates that something is a 
> directory, that it has children. But in a URI a resource can be both a 
> viewable resource and a container node at the same time. There's 
> certainly nothing stopping /a/b/, /a/b, /a/b.pdf and /a/b/c.pdf from 
> all being valid URI's in the same space. To me the trailing slash 
> simple indicates that there's more to come at lower levels, and the 
> absence of it means the resource is a leaf.

You're right in that it is what we are used to but not necessarily 
because of the filesystem.  I misspoke in this case where /a/b could 
indeed be a resource in some cases.  One major problem lies in clients 
like IE (for better or for worse the dominant viewer) which don't always 
behave correctly even when the correct MIME type is sent.  The other is 
when the resource references other resources.

Take a web article by Oreilly for example.  These articles have images, 
multiple pages, talkbacks, etc.  If /a/b is the intrinsic resource, how 
do we logically access the first figure in that article?  How do we 
access the third page?  Aren't multiple pages just another 
representation of the resource?  PDFs can encompass multiple pages.  A 
web page made for printout would encompass only one long page.  Would it 
be /a/b/printable.html?

Is one more correct than another?  I don't think so -- it seems to come 
down to personal preference, all other things being equal.  I think IE 
would have fewer problems with the slash.  I personally don't view the 
trailing slash as a directory but as a resource collection.  Perhaps a 
collection of representations?  But that's just semantics and I'm 
grasping here.

In the example of the Oreilly article, I think that there is more to 
come at the lower levels, there is no absence of lower levels when 
representations are considered lower levels, and that it's a node and 
not a leaf.  I can only think that a resource would be a leaf if it and 
its siblings never have inline constituents like images, multiple pages, 
plugins, etc..

>> P.S.  Thank god for the mailing lists.  They actually encourages me 
>> to write down some of my thoughts.  Even they are off the mark more 
>> often than not...  Does this make email better than web or simply 
>> justify the need for more discussion on the web? 
>
But it apparently has a detrimental effect on my proper use of English 
grammar.  *sigh*

- Miles





Re: URL Theory & Best Practices

Posted by Kjetil Kjernsmo <kj...@kjernsmo.net>.
On Sunday 10 November 2002 01:23, Justin Fagnani-Bell wrote:
> file extensions can (and IMO should) exist side-by-side. 

Is there a well-recognized standardization of file extensions? RFC? ISO 
standard? W3C recommendation? I'm curious, because I'm not aware of any 
such standard. 

Best,

Kjetil
-- 
Kjetil Kjernsmo
Astrophysicist/IT Consultant/Skeptic/Ski-orienteer/Orienteer/Mountaineer
kjetil@kjernsmo.net  webmaster@skepsis.no  editor@learn-orienteering.org
Homepage: http://www.kjetil.kjernsmo.net/




Re: URL Theory & Best Practices

Posted by Justin Fagnani-Bell <ju...@paraliansoftware.com>.
On Saturday, November 9, 2002, at 12:08  PM, Miles Elam wrote:

> Tony Collen wrote:
>
>> Comments inline...
>>
>> Miles Elam wrote:
>>
>>> But can't delivered types differ by the incoming client?
>>
>> Yes, but a problem then arises when someone is using IE and they want 
>> a PDF, when your user-agent rules will only serve a PDF for FooCo PDF 
>> Browser 1.0.  IMO browsers should respect the mime-type header.  I 
>> believe the mime-type headers is very useful when you want to use 
>> something like a PHP script to send an image or a .tar.gz file.  In 
>> fact, it's essential for it to work, otherwise the browser interprets 
>> the data as garbage.
>
> No, that's wasn't my intention at all.  If someone is using IE and 
> they want a pdf (not a default expectation for that particular browser 
> like html or xml), then the URL they would get directed to would be 
> *.pdf. This is not the intrinsic resource.  You are explicitly asking 
> for the PDF representation of that resource.
>
> If the browser's default expectation is PDF (like in your FooCo PDF 
> Browser 1.0 example), the trailing slash resource would give it PDF. 
> However, it could still be pointed to *.pdf if you wanted to make it 
> explicit.

This is very well put, Miles, and it was my intention with the previous 
email I wrote. Content negotiation and file extensions can (and IMO 
should) exist side by side. There is no precedent for a browser 
changing its Accept header on a per-request basis, as someone 
suggested, nor is there a way to specify this behavior in a hyperlink. 
If you have a link on a site that says "Click here for a PDF" then I 
would expect the URI to end in .pdf; at least that's what makes 
the most sense to me.


> In those cases where only PDF is available (common when it's not 
> dynamically generated), I see no reason why the URI wouldn't be *.pdf.

Exactly.

>>> This is where we differ slightly.  In my mind /a/b/ is the intrinsic 
>>> resource.  /a/b/index.html is the explicit call for HTML 
>>> represention of /a/b/.  If you redirect a client to /a/b/index.html 
>>> and the client bookmarks it, they are bookmarking the HTML 
>>> representation, not the intrinsic resource.

This is definitely where we differ. I don't see why an intrinsic 
resource should always end in a '/'. If /a/b.pdf is the PDF 
representation then why shouldn't /a/b be the intrinsic resource? The 
only reason I see why the trailing slash is recommended is that 
developers are used to having their URI space tied to their filesystem 
structure with a static server like Apache. The trailing slash, from 
our experience with filesystems, indicates that something is a 
directory, that it has children. But in a URI a resource can be both a 
viewable resource and a container node at the same time. There's 
certainly nothing stopping /a/b/, /a/b, /a/b.pdf and /a/b/c.pdf from 
all being valid URIs in the same space. To me the trailing slash 
simply indicates that there's more to come at lower levels, and the 
absence of it means the resource is a leaf.

As for redirects, I don't see it being too much of a problem with more 
recent protocols. Also it should only happen when a visitor is being 
referred from an external page, since all the URLs in your site should 
be in the correct form. If you are linking to the intrinsic resource, I 
don't see the need for a redirect (as long as the browser correctly 
understands the mime-type header), so I don't see a problem with 
bookmarking.

> - Miles
>
> P.S.  Thank god for the mailing lists.  They actually encourages me to 
> write down some of my thoughts.  Even they are off the mark more often 
> than not...  Does this make email better than web or simply justify 
> the need for more discussion on the web?

Hear, hear. I say both: linear discussion is good, and so is collaboration. 
A discussion board combined with a wiki would be awesome. Discuss a 
topic and collaborate on a document summing up the ideas at the same 
time. Hmm... Cocoon could do that :)

-Justin




Re: URL Theory & Best Practices

Posted by Miles Elam <mi...@pcextremist.com>.
Tony Collen wrote:

> Comments inline...
>
> Miles Elam wrote:
>
>> But can't delivered types differ by the incoming client?
>
> Yes, but a problem then arises when someone is using IE and they want 
> a PDF, when your user-agent rules will only serve a PDF for FooCo PDF 
> Browser 1.0.  IMO browsers should respect the mime-type header.  I 
> believe the mime-type headers is very useful when you want to use 
> something like a PHP script to send an image or a .tar.gz file.  In 
> fact, it's essential for it to work, otherwise the browser interprets 
> the data as garbage. 

No, that wasn't my intention at all.  If someone is using IE and they 
want a pdf (not a default expectation for that particular browser like 
html or xml), then the URL they would get directed to would be *.pdf. 
 This is not the intrinsic resource.  You are explicitly asking for the 
PDF representation of that resource.

If the browser's default expectation is PDF (like in your FooCo PDF 
Browser 1.0 example), the trailing slash resource would give it PDF. 
 However, it could still be pointed to *.pdf if you wanted to make it 
explicit.

In those cases where only PDF is available (common when it's not 
dynamically generated), I see no reason why the URI wouldn't be *.pdf. 
 In fact, if in the future more presentation types are added, a special 
case for *.pdf to return a static resource and all other variations 
being dynamically generated (or some other mixing and matching) would 
still be valid and a stable URI space.

As far as a php script returning an image, that's fine, but if the URL 
ends with (or even contains) any reference to "php", you are tying your 
URI to a particular technology/delivery method.  With Cocoon, why not 
map /foo/bar/alpha.png to the PHP script that returns a PNG image?  In 
this case, I'm not advocating the trailing slash.  I am advocating that 
you not have PHP even mentioned in the URL.  In this case, the resource 
is a PNG image without regard to client -- have the URL reflect this.
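
As a rough sketch of that idea, the sitemap could hide the script 
behind the clean URL with a reader; the host name and query string 
below are invented purely for illustration.

<map:match pattern="foo/bar/alpha.png">
    <!-- Clean .png URI; the bytes actually come from a legacy PHP
         script somewhere else. -->
    <map:read src="http://legacy.example.com/images.php?name=alpha"
              mime-type="image/png"/>
</map:match>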

>> This is where we differ slightly.  In my mind /a/b/ is the intrinsic 
>> resource.  /a/b/index.html is the explicit call for HTML represention 
>> of /a/b/.  If you redirect a client to /a/b/index.html and the client 
>> bookmarks it, they are bookmarking the HTML representation, not the 
>> intrinsic resource.  I understand the efficiency issues, but a user 
>> agent match when viewed in the context of sitemap matches, 
>> server-side logic, servlet request and response object creation and 
>> other assorted methods calls is just a couple of string comparisons.
>
> This is pretty much the original problem I was trying to solve.  Sure, 
> having a clean URL space that always ends in a / is useful, but if you 
> look at how that would work on the server, side, it means you create a 
> physical directory for each page and then create an index.html.  You 
> have tons of files named index.html on your web server, but at least 
> it's all organized with the directories. 

Hmmm...  Why is it that your physical directory structure must have 
ANYTHING to do with the URL?  This flies right in the face of the reason 
for Cocoon's sitemap and the resources made available from Apache's 
httpd.conf.  You would indeed have many URLs that point to a resource 
called index.html, but your filesystem need not have any.  Your 
filesystem could be flat without any directories at all.  It could be 
replaced with a database.  ...or LDAP or xmldb or PHP...

If your filesystem is to be 1:1 with your URLs, why use Cocoon and a 
servlet engine at all?  A flat file webserver would serve things much 
faster.  The reason I want to use Cocoon is that it makes things 
*better* and not faster -- although I have methods for getting extra speed.

>> In my opinion, URLs should not change.
>
> As further explained at http://www.useit.com/alertbox/990321.html 
> The rundown:
>
>    - URLs should not change
>    - URLs are easy to remember (and therefore are organized logically)
>    - URLs are easy to type and are generally all in lowercase
>
>> That is one of the main things that drew me to Cocoon: URI 
>> abstraction.  Once the URL is abstracted enough to act as a true URI, 
>> it can start acting as a true indentifier instead of an ad hoc, vague 
>> gobbledygook.  Of course this also assumes that the URL/URI remains 
>> set in stone and not a moving target.
>
>
> Yes! This is exactly the conclusion I was coming to on my own. URIs 
> are no more than data abstractions.  They usually provide a view to 
> some data, and more often than not, a URL on a web server directly 
> correlates with a physical file on a disk (e.g. index.html).  Cocoon 
> allows one to create a purely virtual URL space in which no real files 
> on the server could exist.  It probably doesn't matter how the 
> underlying data is abstracted, whether it be a one-to-one correlation 
> to a directory tree on a disk somewhere, or an xpath statement into an 
> xml file, or arguments to a CGI script that accesses a database 
> depending on the order of the items in the request. Imagine a request 
> for /articles/bydate/2002/10/31/ mapping to 
> articles.php?mode=bydate&year=2002&month=10&day=31, which in turn 
> queries a database. 
> Accessing a URL can provide a default view of the data, and depending 
> on the request, the data can be presented different ways.  In the case 
> of things like PHP and CGI scripts, the URL sometimes accepts incoming 
> data (GET or POST data) and will return different results based on the 
> messages passed to it.  Cocoon allows you to provide different views 
> of a resource based on the User-Agent string which is supplied by the 
> browser.  URLs represent objects.  

We are in agreement.

>>> This way the extension isn't revealing the underlying technology of 
>>> the site, but the type of file the client is expecting, and this 
>>> goes for directories too.
>>
> If all we're really serving up is data, and XML is "just data" 
> (http://radio.weblogs.com/0101679/), then perhaps all of our matches 
> should match for *.xml.  Based on other things, like the User-Agent 
> string, or request parameters, we can provide different views of the 
> data (PDF, SVG, HTML etc).  A page named "foo.xml" could be an 
> instance of intelligent data, whereby Cocoon supplies the "smarts" to 
> change the data depending on any number of conditions. 
> In the end, it probably doesn't matter how the data is abstracted, as 
> long as it's consistent, easy to use, and is mostly permanent (or 
> rather, will be flexible if the abstraction changes in the future)
>
> Life will be so much easier in 5 years when we're just serving up 
> straight up xml files.   Unfortunately this puts Cocoon out of 
> business ;) 

No, a match for *.xml would be a request for the XML *representation* of 
the resource.  XML is not intrinsic.  It may be the starting point for 
Cocoon's pipelines.  It may be the contracts all through the pipelines. 
 It may be a format that can represent the semantic meaning behind a 
resource.  It is not the resource.

All of your matchers may indeed be for *.xml if your client base fits 
what you are serving.  Still, that's not the intrinsic resource.  Your 
starting point in a pipeline could be a simple, tab-delimited text file 
that you export as XML.  Plain text is still not the intrinsic resource. 
 The intrinsic resource is the information.  Period.  As soon as it's 
serialized in some format, as soon as it is marked up, as soon as it is 
generated, it ceases to be pure information -- an intrinsic resource. 
 This is the point I am trying to drive home.  An intrinsic resource can 
never be what people see.  You might as well try to draw a picture of 
someone's brain to illustrate what they know.

It may be that clients in the future are all XML/XSLT/XInclude/XForms 
capable.  Doesn't change much for Cocoon.  The only way that serving XML 
files might kill Cocoon is if there was no dynamic data.  With the sheer 
volume of information today let alone tomorrow, I don't see that happening.

Life won't be easier in five years;  It'll be the same with different 
trappings.  Accessing information may be easier in five years though as 
long as people try to make it more accessible.

- Miles

P.S.  Thank god for the mailing lists.  They actually encourage me to 
write down some of my thoughts, even if they are off the mark more often 
than not...  Does this make email better than the web, or does it simply 
justify the need for more discussion on the web?





Re: URL Theory & Best Practices

Posted by Tony Collen <tc...@hist.umn.edu>.
Comments inline...

Miles Elam wrote:

> Justin Fagnani-Bell wrote:
>
>>   I've wrestled with similar problems for a while with my content 
>> management system, which uses a database for content and structure. 
>> I'm in the process of setting the system to use file extensions for 
>> the client to specify the file type and have Cocoon return that type. 
>> If they request /a.html, they get html, /a.pdf and they get pdf, and 
>> so on. This seems elegant, but it has problems when you consider the 
>> points covered in the slashforward article. Here's the compromise 
>> I've come up with so far, adapted to a filesystem like you're using. 
>> I'm still toying with these ideas, so i'd like to hear comments.
>>
>> 1) Instead of having directories with index.xml files, have a 
>> directory and an xml file with the same name at the same level.
>> so you have /a/b/ actually returning /a/b.xml. you could map a 
>> request for /a/b/index.html to /a/b.xml as well. This way you can add 
>> a leaf, and if you need to later add sub-nodes, and turn the leaf 
>> into a node, you just add a directory and some files underneath it. 
>
>
> sounds good to me
>
>> 2) Redirect all urls to *not* end in a slash. I see the point of the 
>> article you've linked to, and agree with it, but the file extension 
>> is the only form of file meta data that's pretty standard. Ending all 
>> urls in slashes only works, in my opinion, if all the files are the 
>> same type, if not it's really nice to have a way of identifying the 
>> type from the url, not just the mime-type response header. So 
>> considering that any request is going to point to a leaf (or an error 
>> page), then I would redirect /a/b/ to /a/b.html 
>
>
> But can't delivered types differ by the incoming client?

Yes, but a problem then arises when someone is using IE and they want a 
PDF, when your user-agent rules will only serve a PDF for FooCo PDF 
Browser 1.0.  IMO browsers should respect the mime-type header.  I 
believe the mime-type header is very useful when you want to use 
something like a PHP script to send an image or a .tar.gz file.  In 
fact, it's essential for it to work, otherwise the browser interprets 
the data as garbage.

> This is where we differ slightly.  In my mind /a/b/ is the intrinsic 
> resource.  /a/b/index.html is the explicit call for HTML represention 
> of /a/b/.  If you redirect a client to /a/b/index.html and the client 
> bookmarks it, they are bookmarking the HTML representation, not the 
> intrinsic resource.  I understand the efficiency issues, but a user 
> agent match when viewed in the context of sitemap matches, server-side 
> logic, servlet request and response object creation and other assorted 
> methods calls is just a couple of string comparisons.

This is pretty much the original problem I was trying to solve.  Sure, 
having a clean URL space that always ends in a / is useful, but if you 
look at how that would work on the server side, it means you create a 
physical directory for each page and then create an index.html.  You 
have tons of files named index.html on your web server, but at least 
it's all organized with the directories.

> In particular, as new clients become more and more capable, a give and 
> take can take place when the resource identifier is left ambiguous.  
> For example giving Opera the XHTML/CSS version and IE6 the XML w/ XSLT 
> processing instruction.  I'm sure we're all aware of IE's fixation on 
> file extension (or at least anyone who's fought with serving PDFs when 
> the URL didn't end in PDF).  If you pass XML w/ processing instruction 
> from a URL tagged with .html, I'm not entirely convinced that IE will 
> get this straight.  The file extension can become a straightjacket.
>
> As clients become more advanced, some work (ie. XSLT processing, 
> XInclude work, etc) can be offloaded from the server.  If someone has 
> the .html version bookmarked or copied to email, we have basically 
> made a contract with the user that they will always receive HTML for 
> this resource no matter the capabilities of the client.


> In my opinion, URLs should not change. 

As further explained at http://www.useit.com/alertbox/990321.html  

The rundown:

    - URLs should not change
    - URLs should be easy to remember (and therefore organized logically)
    - URLs should be easy to type, and should generally be all lowercase

> That is one of the main things that drew me to Cocoon: URI 
> abstraction.  Once the URL is abstracted enough to act as a true URI, 
> it can start acting as a true identifier instead of ad hoc, vague 
> gobbledygook.  Of course this also assumes that the URL/URI remains 
> set in stone and not a moving target.

Yes! This is exactly the conclusion I was coming to on my own. URIs are 
no more than data abstractions.  They usually provide a view of some 
data, and more often than not a URL on a web server correlates directly 
with a physical file on a disk (e.g. index.html).  Cocoon allows one to 
create a purely virtual URL space in which no real files need exist on 
the server.  It probably doesn't matter how the underlying data is 
abstracted, whether it is a one-to-one correlation with a directory tree 
on a disk somewhere, an XPath expression into an XML file, or arguments 
to a CGI script that queries a database based on the items in the 
request.  Imagine a request for /articles/bydate/2002/10/31/ mapping to 
articles.php?mode=bydate&year=2002&month=10&day=31, which in turn 
queries a database.
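
In Cocoon the same abstraction can live in a matcher instead of a query 
string.  A hedged sketch (the articles/ source layout and 
articles2page.xsl are names I'm inventing for illustration; behind the 
match the data could just as well come from a database as from files):

<map:match pattern="articles/bydate/*/*/*/">
    <!-- {1} = year, {2} = month, {3} = day -->
    <map:generate src="articles/{1}/{2}/{3}.xml"/>
    <map:transform src="stylesheets/articles2page.xsl"/>
    <map:serialize type="xhtml"/>
</map:match>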

Accessing a URL can provide a default view of the data, and depending on 
the request, the data can be presented in different ways.  In the case 
of things like PHP and CGI scripts, the URL sometimes accepts incoming 
data (GET or POST) and returns different results based on the messages 
passed to it.  Cocoon allows you to provide different views of a 
resource based on the User-Agent string supplied by the browser.  URLs 
represent objects.

>
>> This way the extension isn't revealing the underlying technology of 
>> the site, but the type of file the client is expecting, and this goes 
>> for directories too.
>

If all we're really serving up is data, and XML is "just data" 
(http://radio.weblogs.com/0101679/), then perhaps all of our matches 
should match for *.xml.  Based on other things, like the User-Agent 
string or request parameters, we can provide different views of the 
data (PDF, SVG, HTML, etc.).  A page named "foo.xml" could be an 
instance of intelligent data, whereby Cocoon supplies the "smarts" to 
change the data depending on any number of conditions.
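
A rough sketch of what that might look like, assuming a request 
parameter selector is declared under the name "request-parameter" and a 
"format" request parameter picks the view (all of the names here are my 
assumptions, not an existing sitemap):

<!-- Sketch: one canonical *.xml URL, different views of the same data -->
<map:match pattern="**.xml">
    <map:select type="request-parameter">
        <map:parameter name="parameter-name" value="format"/>
        <map:when test="pdf">
            <map:generate src="documents/{1}.xml"/>
            <map:transform src="stylesheets/page2fo.xsl"/>
            <map:serialize type="fo2pdf"/>
        </map:when>
        <map:otherwise>
            <map:generate src="documents/{1}.xml"/>
            <map:transform src="stylesheets/page2html.xsl"/>
            <map:serialize type="xhtml"/>
        </map:otherwise>
    </map:select>
</map:match>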

In the end, it probably doesn't matter how the data is abstracted, as 
long as it's consistent, easy to use, and mostly permanent (or rather, 
flexible enough to cope if the abstraction has to change in the future).

Life will be so much easier in 5 years when we're just serving up 
straight XML files.   Unfortunately this puts Cocoon out of business ;)


Phew. THAT was way more than I was hoping to write :)


Tony






---------------------------------------------------------------------
Please check that your question  has not already been answered in the
FAQ before posting.     <http://xml.apache.org/cocoon/faq/index.html>

To unsubscribe, e-mail:     <co...@xml.apache.org>
For additional commands, e-mail:   <co...@xml.apache.org>


Re: URL Theory & Best Practices

Posted by Miles Elam <mi...@pcextremist.com>.
Justin Fagnani-Bell wrote:

>   I've wrestled with similar problems for a while with my content 
> management system, which uses a database for content and structure. 
> I'm in the process of setting the system to use file extensions for 
> the client to specify the file type and have Cocoon return that type. 
> If they request /a.html, they get html, /a.pdf and they get pdf, and 
> so on. This seems elegant, but it has problems when you consider the 
> points covered in the slashforward article. Here's the compromise I've 
> come up with so far, adapted to a filesystem like you're using. I'm 
> still toying with these ideas, so i'd like to hear comments.
>
> 1) Instead of having directories with index.xml files, have a 
> directory and an xml file with the same name at the same level.
> so you have /a/b/ actually returning /a/b.xml. you could map a request 
> for /a/b/index.html to /a/b.xml as well. This way you can add a leaf, 
> and if you need to later add sub-nodes, and turn the leaf into a node, 
> you just add a directory and some files underneath it. 

sounds good to me

> 2) Redirect all urls to *not* end in a slash. I see the point of the 
> article you've linked to, and agree with it, but the file extension is 
> the only form of file meta data that's pretty standard. Ending all 
> urls in slashes only works, in my opinion, if all the files are the 
> same type, if not it's really nice to have a way of identifying the 
> type from the url, not just the mime-type response header. So 
> considering that any request is going to point to a leaf (or an error 
> page), then I would redirect /a/b/ to /a/b.html 

But can't delivered types differ by the incoming client?

This is where we differ slightly.  In my mind /a/b/ is the intrinsic 
resource.  /a/b/index.html is the explicit call for the HTML 
representation of /a/b/.  If you redirect a client to /a/b/index.html 
and the client bookmarks it, they are bookmarking the HTML 
representation, not the intrinsic resource.  I understand the efficiency 
issues, but a user agent match, when viewed in the context of sitemap 
matches, server-side logic, servlet request and response object creation 
and other assorted method calls, is just a couple of string comparisons.

In particular, as new clients become more and more capable, a give and 
take can take place when the resource identifier is left ambiguous.  For 
example giving Opera the XHTML/CSS version and IE6 the XML w/ XSLT 
processing instruction.  I'm sure we're all aware of IE's fixation on 
file extension (or at least anyone who's fought with serving PDFs when 
the URL didn't end in PDF).  If you pass XML w/ processing instruction 
from a URL tagged with .html, I'm not entirely convinced that IE will 
get this straight.  The file extension can become a straightjacket.

As clients become more advanced, some work (ie. XSLT processing, 
XInclude work, etc) can be offloaded from the server.  If someone has 
the .html version bookmarked or copied to email, we have basically made 
a contract with the user that they will always receive HTML for this 
resource no matter the capabilities of the client.

In my opinion, URLs should not change.  That is one of the main things 
that drew me to Cocoon: URI abstraction.  Once the URL is abstracted 
enough to act as a true URI, it can start acting as a true identifier 
instead of ad hoc, vague gobbledygook.  Of course this also assumes 
that the URL/URI remains set in stone and not a moving target.

> This way the extension isn't revealing the underlying technology of 
> the site, but the type of file the client is expecting, and this goes 
> for directories too.

Yup, although I think people underestimate the utility of the default 
directory listing when there is no index.html (or default.htm, 
home.html, etc.).  If you think back to the beginnings of the web, what 
was index.html but a dressed up view of all resources in the general area?

> The matchers would look something like this: (i might have this wrong)
>
> <map:match pattern="**/">
>   <map:redirect-to uri="{1}.html"/>
> </map:match>
>
> <map:match pattern="**/*.html">
>   <map:generate src="documents/{1}.xml"/>
>   <map:transform src="stylesheets/page2html.xsl"/>
>   <map:serialize type="xhtml"/>
> </map:match> 

Shouldn't this be <map:generate src="documents/{1}/{2}.xml"/>?  But 
yeah, that's assuming that the resource will be HTML.  A valid 
assumption for most sites...for the time being.  A lot has changed in 
the last few years and a lot of new clients have jumped on the scene.  
As I mentioned before, I believe URLs should be as permanent as 
possible, and hard-coding .html leaves no flexibility for the future.
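
For reference, the corrected matcher might read as follows (still 
hard-wired to HTML, which is exactly the flexibility problem I'm 
describing):

<map:match pattern="**/*.html">
  <map:generate src="documents/{1}/{2}.xml"/>
  <map:transform src="stylesheets/page2html.xsl"/>
  <map:serialize type="xhtml"/>
</map:match>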

This is based upon the sitemap we're using as a working model (I might 
also have something wrong):

<!-- Index is a directory listing with the assumption that files
     in the same directory are at least somewhat related -->
<map:match pattern="**/index.xml">
  <map:generate type="directory" src="{1}"/>
  <map:transform src="dir2page.xsl"/>
  <map:transform src="stylesheets/processing-instruction.xsl">
    <map:parameter name="stylesheet" value="stylesheets/page2xhtml.xsl"/>
  </map:transform>
  <map:serialize type="xml"/>
</map:match>

<!-- Index views -->
<map:match pattern="**/index.*">
  <map:generate src="cocoon:/{1}/index.xml"/>
  <map:transform src="stylesheets/page2{2}.xsl"/>
  <map:serialize type="{2}"/>
</map:match>

<!-- Raw XML access : Including the processing instruction here for
       simplicity's sake -->
<map:match pattern="**/page.xml">
  <map:generate src="{1}.xml"/>
  <map:transform src="stylesheets/processing-instruction.xsl">
    <map:parameter name="stylesheet" value="stylesheets/page2xhtml.xsl"/>
  </map:transform>
  <map:serialize type="xml"/>
</map:match>

<!-- resource views -->
<map:match pattern="**/page.*">
  <map:generate src="cocoon:/{1}.xml"/>
  <map:transform src="stylesheets/page2{2}.xsl"/>
  <map:serialize type="{2}"/>
</map:match>

<!-- Client sniff -->
<map:match pattern="**/">
  <map:select type="browser">
    <map:when test="wap">
      <map:generate src="cocoon:/{1}/page.wml"/>
      <map:serialize type="wml"/>
    </map:when>
    <map:when test="xslt">
      <map:generate src="cocoon:/{1}/page.xml"/>
      <map:serialize type="xml"/><!-- processing instruction will render 
it -->
    </map:when>
    <map:when test="html"><!-- For older browsers that aren't up to 
snuff -->
      <map:generate src="cocoon:/{1}/page.html"/>
      <map:serialize type="html"/>
    </map:when>
    <map:otherwise>
      <map:generate src="cocoon:/{1}/page.xhtml"/>
      <map:serialize type="xhtml"/>
    </map:otherwise>
  </map:select>
</map:match>

This all works on the following assumptions:

  "/a/b/d/" refers to a resource independant of presentation.  From 
here, we do browser type checking for the appropriate output type.

  "/a/b/d/index.xml" refers to a list of resources associated with "/a/b/d/"

  "/a/b/d/page.xml" refers to the resource explicitly as XML.

  "/a/b/d/page.html" refers to the resource explicitly as HTML.

--------------

This also reflects the change we made to the browser selector.  In 
effect, we've turned it into a poor man's Deli.

<map:selector logger="sitemap.selector.browser" name="client" 
src="org.apache.cocoon.selection.BrowserSelector">
    <browser name="xslt" useragent="MSIE 6"/>
    <browser name="xhtml" useragent="MSIE"/>
    <browser name="xhtml" useragent="MSPIE"/>
    <browser name="xhtml" useragent="HandHTTP"/>
    <browser name="xhtml" useragent="AvantGo"/>
    <browser name="xhtml" useragent="DoCoMo"/>
    <browser name="xhtml" useragent="Opera"/>
    <browser name="xhtml" useragent="Lynx"/>
    <browser name="xhtml" useragent="Java"/>
    <browser name="wap" useragent="Nokia"/>
    <browser name="wap" useragent="UP"/>
    <browser name="wap" useragent="Wapalizer"/>
    <browser name="xhtml" useragent="Mozilla/5"/>
    <browser name="xhtml" useragent="Netscape6/"/>
    <browser name="xhtml" useragent="Netscape7"/>
    <browser name="html" useragent="Mozilla"/>
</map:selector>

This basically designates what class of content each client gets, 
rather than keying on the client name itself.  FYI: I know that Mozilla 
supports XSLT as well, but I ran into a CSS rendering bug with regard to 
background-color on the body tag that prevents its use.

--------------

In other news, I found that just using the filesystem, while simple, 
lacks some flexibility for other purposes: for example, being able to 
list files published in a certain time frame or by a particular author.  
We ended up looking at database solutions before too long.  If, however, 
this really is just a document bucket, no problem.

The examples I gave were tailored to what I gather is the original 
problem set, where it seems mostly plain text documents are being handled.

For documents where it is assumed that images, media files, etc. will be 
associated, I'd actually recommend a setup where index.* refers to the 
document itself instead of a directory listing.  The document would have 
references to the media pieces and thus would fit the description of an 
overview or listing.

For our site, the directory structure is more like:

  /articles/00000001/

with index.xml, index.html, index.xhtml, etc. being views for the 
article.  Any images et al would be referenced as:

  /articles/00000001/image1.png
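
A minimal sketch of how that layout could be wired up (the documents/ 
prefix, the stylesheet name and the image/png mapping are assumptions on 
my part, not our actual sitemap):

<!-- Sketch: article views plus associated media in one directory -->
<map:match pattern="articles/*/index.html">
  <map:generate src="documents/articles/{1}/index.xml"/>
  <map:transform src="stylesheets/page2html.xsl"/>
  <map:serialize type="html"/>
</map:match>

<!-- Media files live next to the article and are streamed through -->
<map:match pattern="articles/*/*.png">
  <map:read src="documents/articles/{1}/{2}.png" mime-type="image/png"/>
</map:match>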

This precludes hierarchy, however.  We got around this by having 
alternate hierarchies independent of this one.  This would be a listing 
of all articles having to do with software reviews:

  /articles/reviews/software/

This would be the articles published last month:

  /articles/2002/10/

This would be articles written by me:

  /users/melam/articles/
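
Purely as a sketch of how the alternate hierarchies could hang off a 
shared pipeline (the listings/ sources and listing2page.xsl below are 
invented names; in our case the listings come out of the database rather 
than flat files):

<!-- Sketch: several URL hierarchies, one listing pipeline -->
<map:match pattern="articles/*/*/"><!-- e.g. /articles/2002/10/ -->
  <map:generate src="listings/by-date/{1}-{2}.xml"/>
  <map:transform src="stylesheets/listing2page.xsl"/>
  <map:serialize type="xhtml"/>
</map:match>

<map:match pattern="users/*/articles/"><!-- e.g. /users/melam/articles/ -->
  <map:generate src="listings/by-author/{1}.xml"/>
  <map:transform src="stylesheets/listing2page.xsl"/>
  <map:serialize type="xhtml"/>
</map:match>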

This fits for articles being fixed, independent items (via article ID), 
and available by other organizational means.  Unfortunately this isn't 
really feasible without a database -- we tried and realized that flat 
files weren't any easier or simpler for what we wanted to do.  Whether 
it's a relational database, object database, or XML database doesn't 
really force a URL/URI change, as the URLs are relative to the resource 
as a concept instead of a filesystem layout.

Okay...  Anyone going to poke holes in my arguments?  Our setup is 
pretty young and the database is relatively empty.  I'd love to hear 
about problems before our dataset gets much larger and more unwieldy.

- Miles

P.S.  That was far longer than I was originally planning...



---------------------------------------------------------------------
Please check that your question  has not already been answered in the
FAQ before posting.     <http://xml.apache.org/cocoon/faq/index.html>

To unsubscribe, e-mail:     <co...@xml.apache.org>
For additional commands, e-mail:   <co...@xml.apache.org>


Re: URL Theory & Best Practices

Posted by Justin Fagnani-Bell <ju...@paraliansoftware.com>.
Tony,

   I've wrestled with similar problems for a while with my content 
management system, which uses a database for content and structure. I'm 
in the process of setting the system to use file extensions for the 
client to specify the file type and have Cocoon return that type. If 
they request /a.html, they get html, /a.pdf and they get pdf, and so 
on. This seems elegant, but it has problems when you consider the 
points covered in the slashforward article. Here's the compromise I've 
come up with so far, adapted to a filesystem like you're using. I'm 
still toying with these ideas, so i'd like to hear comments.

1) Instead of having directories with index.xml files, have a directory 
and an xml file with the same name at the same level.
so you have /a/b/ actually returning /a/b.xml. you could map a request 
for /a/b/index.html to /a/b.xml as well. This way you can add a leaf, 
and if you need to later add sub-nodes, and turn the leaf into a node, 
you just add a directory and some files underneath it.

2) Redirect all urls to *not* end in a slash. I see the point of the 
article you've linked to, and agree with it, but the file extension is 
the only form of file meta data that's pretty standard. Ending all urls 
in slashes only works, in my opinion, if all the files are the same 
type, if not it's really nice to have a way of identifying the type 
from the url, not just the mime-type response header. So considering 
that any request is going to point to a leaf (or an error page), then I 
would redirect /a/b/ to /a/b.html

This way the extension isn't revealing the underlying technology of the 
site, but the type of file the client is expecting, and this goes for 
directories too.

The matchers would look something like this: (i might have this wrong)

<map:match pattern="**/">
   <map:redirect-to uri="{1}.html"/>
</map:match>

<map:match pattern="**/*.html">
   <map:generate src="documents/{1}.xml/>
   <map:transform src="stylesheets/page2html.xsl"/>
   <map:serialize type="xhtml"/>
</map:match>

Add matchers, or use selectors, for more file types.



-Justin


On Thursday, November 7, 2002, at 02:57  PM, Tony Collen wrote:

> Apologies for the extra long post, but this has been bugging me for a 
> while.
>
> First, some background:
>
> I'm attempting to put together a URL space using cocoon that will 
> allow users to drop an XML file into a directory, say 
> $TOMCAT_HOME/webapps/cocoon/documents/ and have it published.  This is 
> easy enough:
>
> <map:match pattern="*.html>
>    <map:generate src="documents/{1}.html"/>
>    <map:transform src="stylesheets/page2html.xsl"/>
>    <map:serialize type="xhtml"/>
> </map:match>
>
> So then I decide that for organization's sake, I want to allow people 
> to create subdirectories under documents/ any number of levels deep, 
> and still have cocoon publish them. This is also fairly simple:
>
> <map:match pattern="**/*.html">
>    <map:generate src="documents/{1}.html"/>
>    <map:transform src="stylesheets/page2html.xsl"/>
>    <map:serialize type="xhtml"/>
> </map:match>
>
> However, later I realize that using file extensions is "bad".  Read 
> http://www.alistapart.com/stories/slashforward/ for more info on this 
> idea.
> This creates problems with how I automatically generate content using 
> Cocoon.  I want to allow people to create content arbitrarily deep in 
> the documents/ directory, but I run into a bunch of questions.
>
> Should trailing slashes always be used? I think so.
> Therefore: Consider an HTTP request for "/a/b/c/".
>
>    1. Is it a request for the discreet resource named "c" which is 
> contained in "b"?
>    2. Is it a request for the listing of all the contents of the "c" 
> resource (which is in turn contained within "b")?
>    3. Is this equivalent to a request for "/a/b/c"?         3b. Should 
> a request for something w/o a trailing slash be redirected to the same 
> URL, but with a trailing slash added?
>
> Using the "best practice" of always having trailing slashes creates 
> problems when mapping the virtual URL space to a physical directory 
> structure.  Considering a request for "/a/b/c/", do I go into 
> documents/a/b/c/ and generate from index.xml?  Or do I go to 
> documents/a/b/ and generate from c.xml?  Having every "leaf" be a 
> directory with an index.xml gets to be unmaintainable, IMO.
>
> Likewise, do I generate from documents/a/b/d.xml or 
> documents/a/b/d/index.xml for a request of "/a/b/d"?  Additionally, 
> what should happen when there's a request for "/a/b/"?  Obviously, if 
> the subdirectory "b" exists, it would not be correct to go to 
> documents/a/ and look for b.xml.
>
> Part of my reasoning behind all these questions lies in my quest for 
> creating an uber-flexible "drop-in" directory structure where people 
> can simply add their .xml files to the "documents" directory and have 
> Cocoon automagically publish them, as I stated above.  The other 
> reason for this is that I'm trying to devise a system which 
> automatically creates navigation, as well. I've looked at the 
> Bonebreaker example, and it's good, but has some limitations.  What if 
> I don't want to use the naming scheme they have?
>
> Oh well, thanks for listening to my ramblings, and hopefully I can get 
> some light shed on this situation, as well as have a nifty autonavbar 
> work eventually :)
> Regards,
> Tony
>
>
> ---------------------------------------------------------------------
> Please check that your question  has not already been answered in the
> FAQ before posting.     <http://xml.apache.org/cocoon/faq/index.html>
>
> To unsubscribe, e-mail:     <co...@xml.apache.org>
> For additional commands, e-mail:   <co...@xml.apache.org>
>


---------------------------------------------------------------------
Please check that your question  has not already been answered in the
FAQ before posting.     <http://xml.apache.org/cocoon/faq/index.html>

To unsubscribe, e-mail:     <co...@xml.apache.org>
For additional commands, e-mail:   <co...@xml.apache.org>