Posted to dev@httpd.apache.org by "William A. Rowe, Jr." <wr...@covalent.net> on 2001/10/03 03:33:21 UTC

Re: [PATCH] mod_negotiation, suffix order

Francis,

  I don't see any of my earlier replies on this topic.  I think I may
have confused your contribution with a post by Brian Pane.  In any case,

  I am very impressed by this idea for Apache 2.0.  But I don't like the
many to many mapping.  If we change your underlying rule here to require
that each filename extension is passed in sequence, I would be _very_ 
happy to commit this patch :)  E.g. index.en _could_ match index.html.en.
But index.en.html would _not_ match index.html.en.

  Thanks for this submission, and sharing your ideas,

Bill

----- Original Message ----- 
From: "Francis Daly" <de...@daoine.org>
To: <ne...@apache.org>
Sent: Monday, April 30, 2001 1:09 PM
Subject: [PATCH] mod_negotiation, suffix order


> 
> Hi there,
> 
> this is essentially a repost of some mails earlier this month with the
> same patch and a similar Subject:.
> 
> Appended to this mail is a patch to remove the requirements on the
> order of suffixes when using MultiViews / mod_negotiation.  It does
> have the down side of increasing the number of valid URLs for the same
> content, but to a limited extent that is implicit in mod_negotiation
> anyway.
> 
> The patch is relative to the version of mod_negotiation.c distributed
> with apache-2.0.16.  There is a newer version in CVS, but the patch
> should still apply cleanly.
> 
> But first, some notes:
> 
> The current method takes the "file" part of r->filename (either the
> bit after the final / in the URI, or the value of DirectoryIndex).
> First, if the exact filename matches, mod_negotiation declines to
> handle it.  Second, for each file in the directory, it tries to match
> /^file\./.
> 
> This patched method does an extra strchr(), and uses a few extra
> int's and char *'s; and then for the requested file "file" does the
> same thing.
> 
> However, if the r->filename is actually "file.s1.s2.sZ" (with dots),
> the current way looks for /^file\.s1\.s2\.sZ\./; the patched way looks
> for each of /^file\./, /\.s1/, /\.s2/, /\.sZ/.  It bails out at the
> first failure.
> 
> Extra pointer and string manipulation is needed to do this, per dot in
> the requested file name, per file in the directory.
> 
> Some consequences of this implementation are:
> 
> Current method: file "name.html.en" is only accessible through
> (partial) URIs "name", "name.html", or "name.html.en"
> 
> Patched method: The same three work, as do "name.en" and
> "name.en.html".  That is good.  However: so do "name.htm",
> "name.htm.en", and "name.en.htm".  That may be considered good.  More
> however: so do "name.h", "name.h.h", "name...h.e.e..e.h.h.", and an
> infinite number of similar variations.  That may not be considered
> good.
> 
> In fact, the infinite number of possibilities is limited by the
> requirement that the length of the file name must be at least the
> length of the request in order to be considered, so a request with a
> dozen trailing dots will only have the hit of many strstr()s for files
> that match the prefix and have long enough names.
> 
> In each case, the content is returned with a Content-Location: header
> indicating the canonical filename.
> 
> The requirements are (1)r->filename up to the first dot must match the
> real filename up to the first dot; (2)r->filename may not be longer
> than the real filename; (3)each .suffix in r->filename must exist
> (string match) in the real filename; (4)the real filename must
> correspond to a known mime-type, encoding, etc -- which I think means
> that the final suffix must be known, and only suffixes followed by
> known suffixes are considered.
> 
> As a real example, testing with the apache "It worked!" page (named
> index.html.LANG), if I request index.html.fr, I get the page back.  If
> I request index.fr.html, or just index.fr, I get back the 406 Not
> Acceptable page, with a link to index.html.fr, _unless_ I include fr
> as an acceptable language.  If I include fr as a language, I can
> request /index.fr, /index.fr.html, or /index.html.fr successfully.  If
> I include fr as my preferred language, I can additionally request /,
> /index, and /index.html.  (As well as the .h, .ht, .htm, .f variants
> referred to earlier).  If I request /index.d, I get a 406 with links
> to index.html.de and index.html.dk
> 
> As a faked example, consider five files in the DocumentRoot, with no
> special customisations to the (MIME) configuration:
> 
> files a.b.c, d.e.html, g.h.i.j.k.en, m.n.o.p.q.html, s.t.html.u.v
> 
> The following requests have the indicated results:
> 
> GET /a            -> not found
> GET /a.b          -> not found
> GET /a.c          -> not found
> GET /a.b.c        -> success
> GET /d            -> success
> GET /d.e          -> success
> GET /d.h          -> success
> GET /d.html       -> success
> GET /d....html    -> not found
> GET /g            -> not found
> GET /g.h          -> not found
> GET /g.h.i.j.k    -> not found
> GET /g.h.i.j.k.en -> success
> GET /g.h.i.k.j.en -> not found
> GET /m            -> success
> GET /m.html       -> success
> GET /m.o.q.p.n    -> success
> GET /m.o.r.p.n    -> not found
> GET /s.t.html.u.v -> success
> GET /s            -> not found
> GET /s.t.html.u   -> not found
> 
> note that in the "not found" cases there (except for /m.o.r.p.n and
> /d....html), the patched code does pass the file down as being
> potentially valid -- it's later code which decides that it doesn't
> know how to treat the final suffix, and fails it.
> 
> As another faked example, with files ..d.f.html and .e.txt, I can
> successfully issue GETs for /.d, /.f, /.h, /.e and /.t, as well as
> things like /....t. (whether or not the final . there is punctuation). 
> 
> So that's it.  If I've missed something obvious, like r->filename being
> read-only or something, I'll head back to the drawing board.
> 
> All the best,
> 
> f
> -- 
> Francis Daly        deva@daoine.org
> 


Re: [PATCH] mod_negotiation, suffix order

Posted by "William A. Rowe, Jr." <wr...@covalent.net>.
From: "Lars Eilebrecht" <la...@hyperreal.org>
Sent: Wednesday, October 03, 2001 11:57 AM


> According to Rodent of Unusual Size:
> 
> >  Negotiation is done using the header field values, NOT the
> >  URI.  
> [...]
> >  If the URI is "index.en", an explicitly English variant must
> >  match "index.en*.en*".
> >  
> >  Ordering is an issue for sure, but playing games, by decomposing
> >  the URI and trying to guess what it means, only complicates
> >  matters and is not on.
> 
> I tend to agree.
> Fiddling around with the extensions just makes negotiation
> more complicated and a source for problems and bugs.

I share your general fear.  That's why I'm against Francis' proposal to
allow index.html.en to match index.en.html.  I believe this can be much
simpler and more trustworthy by requiring the matched extensions to occur
in their given order.  That code is more easily proven than a many-to-many
matching logic.

Bill


Re: [PATCH] mod_negotiation, suffix order

Posted by Lars Eilebrecht <la...@hyperreal.org>.
According to Rodent of Unusual Size:

>  Negotiation is done using the header field values, NOT the
>  URI.  
[...]
>  If the URI is "index.en", an explicitly English variant must
>  match "index.en*.en*".
>  
>  Ordering is an issue for sure, but playing games, by decomposing
>  the URI and trying to guess what it means, only complicates
>  matters and is not on.

I tend to agree.
Fiddling around with the extensions just makes negotiation
more complicated and a source for problems and bugs.


ciao...
-- 
Lars Eilebrecht           - Life is both difficult and time consuming.
lars@hyperreal.org


Re: [PATCH] mod_negotiation, suffix order

Posted by Rodent of Unusual Size <Ke...@Golux.Com>.
"William A. Rowe, Jr." wrote:
> 
> > That is, if the URI is index.bak, we can only negotiate
> > amongst variants matching index.bak* -- NOT index.*.bak*.
> 
> What's your rationale?  I agree that index[.*].bak[.*] is broader
> than index.bak[.*] --- but I'm wondering why you feel this way?
> 
> Say that we want to point the user to the English index page.
> Why shouldn't a request for index.en discover index.html.en or
> index.cgi.en?

Negotiation is done using the header field values, NOT the
URI.  "index.en" is NOT a request for an English variant
of "index", it is a request for [possibly some variant of]
an object named "index.en" -- period.  That a portion of the
specified URI happens to match a value that has meaning in
negotiation is completely coincidental -- and irrelevant.  We
cannot co-opt nor interpret nor decompose the value of the
URI in negotiation; all we can use are the parameters in the
header and the resource (read: file) names.

If it were meant to be used as a negotiation axis, it would
be in the header fields and absent from the URI.  That it is
explicit in the URI removes it from participation in any
negotiation axes.

If the URI is "index.en", an explicitly English variant must
match "index.en*.en*".

Ordering is an issue for sure, but playing games, by decomposing
the URI and trying to guess what it means, only complicates
matters and is not on.

All IM[NS?]HO..  although Roy may have something to say on
this.
-- 
#ken	P-)}

Ken Coar, Sanagendamgagwedweinini  http://Golux.Com/coar/
Author, developer, opinionist      http://Apache-Server.Com/

"All right everyone!  Step away from the glowing hamburger!"

Re: [PATCH] mod_negotiation, suffix order

Posted by "William A. Rowe, Jr." <wr...@covalent.net>.
Bringing us back from random stream of consciousness... here's the thread
to date (goes back to April, so I suppose a full repost is in order.)

My fresh commentary is inline with Ken's comments below.

----- Original Message ----- 
From: "Francis Daly" <de...@daoine.org>
To: <ne...@apache.org>
Sent: Monday, April 30, 2001 1:09 PM
Subject: [PATCH] mod_negotiation, suffix order


> Hi there,
> 
> this is essentially a repost of some mails earlier this month with the
> same patch and a similar Subject:.
> 
> Appended to this mail is a patch to remove the requirements on the
> order of suffixes when using MultiViews / mod_negotiation.  It does
> have the down side of increasing the number of valid URLs for the same
> content, but to a limited extent that is implicit in mod_negotiation
> anyway.
> 
> The patch is relative to the version of mod_negotiation.c distributed
> with apache-2.0.16.  There is a newer version in CVS, but the patch
> should still apply cleanly.
> 
> But first, some notes:
> 
> The current method takes the "file" part of r->filename (either the
> bit after the final / in the URI, or the value of DirectoryIndex).
> First, if the exact filename matches, mod_negotiation declines to
> handle it.  Second, for each file in the directory, it tries to match
> /^file\./.
> 
> This patched method does an extra strchr(), and uses a few extra
> int's and char *'s; and then for the requested file "file" does the
> same thing.
> 
> However, if the r->filename is actually "file.s1.s2.sZ" (with dots),
> the current way looks for /^file\.s1\.s2\.sZ\./; the patched way looks
> for each of /^file\./, /\.s1/, /\.s2/, /\.sZ/.  It bails out at the
> first failure.
> 
> Extra pointer and string manipulation is needed to do this, per dot in
> the requested file name, per file in the directory.
> 
> Some consequences of this implementation are:
> 
> Current method: file "name.html.en" is only accessible through
> (partial) URIs "name", "name.html", or "name.html.en"
> 
> Patched method: The same three work, as do "name.en" and
> "name.en.html".  That is good.  However: so do "name.htm",
> "name.htm.en", and "name.en.htm".  That may be considered good.  More
> however: so do "name.h", "name.h.h", "name...h.e.e..e.h.h.", and an
> infinite number of similar variations.  That may not be considered
> good.
> 
> In fact, the infinite number of possibilities is limited by the
> requirement that the length of the file name must be at least the
> length of the request in order to be considered, so a request with a
> dozen trailing dots will only have the hit of many strstr()s for files
> that match the prefix and have long enough names.
> 
> In each case, the content is returned with a Content-Location: header
> indicating the canonical filename.
> 
> The requirements are (1)r->filename up to the first dot must match the
> real filename up to the first dot; (2)r->filename may not be longer
> than the real filename; (3)each .suffix in r->filename must exist
> (string match) in the real filename; (4)the real filename must
> correspond to a known mime-type, encoding, etc -- which I think means
> that the final suffix must be known, and only suffixes followed by
> known suffixes are considered.
> 
> As a real example, testing with the apache "It worked!" page (named
> index.html.LANG), if I request index.html.fr, I get the page back.  If
> I request index.fr.html, or just index.fr, I get back the 406 Not
> Acceptable page, with a link to index.html.fr, _unless_ I include fr
> as an acceptable language.  If I include fr as a language, I can
> request /index.fr, /index.fr.html, or /index.html.fr successfully.  If
> I include fr as my preferred language, I can additionally request /,
> /index, and /index.html.  (As well as the .h, .ht, .htm, .f variants
> referred to earlier).  If I request /index.d, I get a 406 with links
> to index.html.de and index.html.dk
> 
> As a faked example, consider five files in the DocumentRoot, with no
> special customisations to the (MIME) configuration:
> 
> files a.b.c, d.e.html, g.h.i.j.k.en, m.n.o.p.q.html, s.t.html.u.v
> 
> The following requests have the indicated results:
> 
> GET /a            -> not found
> GET /a.b          -> not found
> GET /a.c          -> not found
> GET /a.b.c        -> success
> GET /d            -> success
> GET /d.e          -> success
> GET /d.h          -> success
> GET /d.html       -> success
> GET /d....html    -> not found
> GET /g            -> not found
> GET /g.h          -> not found
> GET /g.h.i.j.k    -> not found
> GET /g.h.i.j.k.en -> success
> GET /g.h.i.k.j.en -> not found
> GET /m            -> success
> GET /m.html       -> success
> GET /m.o.q.p.n    -> success
> GET /m.o.r.p.n    -> not found
> GET /s.t.html.u.v -> success
> GET /s            -> not found
> GET /s.t.html.u   -> not found
> 
> note that in the "not found" cases there (except for /m.o.r.p.n and
> /d....html), the patched code does pass the file down as being
> potentially valid -- it's later code which decides that it doesn't
> know how to treat the final suffix, and fails it.
> 
> As another faked example, with files ..d.f.html and .e.txt, I can
> successfully issue GETs for /.d, /.f, /.h, /.e and /.t, as well as
> things like /....t. (whether or not the final . there is punctuation). 
> 
> So that's it.  If I've missed something obvious, like r->filename being
> read-only or something, I'll head back to the drawing board.


----- Original Message ----- 
From: "William A. Rowe, Jr." <wr...@covalent.net>
To: <ne...@apache.org>; <de...@daoine.org>
Sent: Tuesday, October 02, 2001 8:33 PM
Subject: Re: [PATCH] mod_negotiation, suffix order

> Francis,
> 
>   I don't see any of my earlier replies on this topic.  I think I may
> have confused your contribution with a post by Brian Pane.  In any case,
> 
>   I am very impressed by this idea for Apache 2.0.  But I don't like the
> many to many mapping.  If we change your underlying rule here to require
> that each filename extension is passed in sequence, I would be _very_ 
> happy to commit this patch :)  E.g. index.en _could_ match index.html.en.
> But index.en.html would _not_ match index.html.en.


----- Original Message ----- 
From: "Rodent of Unusual Size" <Ke...@Golux.Com>
To: <de...@httpd.apache.org>
Sent: Wednesday, October 03, 2001 8:54 AM
Subject: Re: .asis handler isn't driven


> "William A. Rowe, Jr." wrote:
> > 
> > [There is a weakness.  We need to evaluate the exception
> > list by component, right now we simply strcmp.  There is
> > a note in status to that effect.  E.g. requesting index.bak
> > -should- match index.html.bak
> 
> Um, no, I definitely think not.  I think the portion of
> the filename that's specified in the URL should be
> considered opaque, and that we can only negotiate using
> the bits that are tailed on the file names but not the
> URL.

This post didn't mean what you expected [see my reply on the subject
".asis handler isn't driven"], but your interpretation is relevant to
this thread.
 
> That is, if the URI is index.bak, we can only negotiate
> amongst variants matching index.bak* -- NOT index.*.bak*.



What's your rationale?  I agree that index[.*].bak[.*] is broader
than index.bak[.*] --- but I'm wondering why you feel this way?

Say that we want to point the user to the English index page.
Why shouldn't a request for index.en discover index.html.en or
index.cgi.en?

This would resolve a _major_ Headache (with a capital H) on Win32,
since we can't handle index.html.en by filename extension.  However,
anyone could read the document win_service.en.html by double-clicking
a local copy of that file.  

The historical problem has been ordering, since we know the index page
will summon win_service.html.  Because the wildcards can only tail the
filename, we cannot serve win_service.en.html from that request.

I've really got problems with the attitude that "Well, that's win32's
brokenness, to hell with letting them double click on the docs ... they
ought to know how to start the server before they read the docs."  That's
pretty bogus.  Contrariwise, I don't disagree with allowing index.html to
find index.html.en or index.en.html, and not breaking anyone.

I'm arguing against a many-to-many, but not against allowing the parser
to test for unspecified segments between the filename and last given
extension.  The CPU hit will be negligible for mismatches, and only
slightly larger for matches.  
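
One possible reading of that rule, as a rough sketch only: treat both names
as dot-separated components, require the base names to be identical, and
require every extension in the request to appear whole and in order among
the file's extensions, with unrequested extensions allowed in between.
components_in_order() is a hypothetical name, not anything in
mod_negotiation.c, and the comparison here is case-sensitive.

```c
#include <string.h>

/* Hypothetical helper (not part of mod_negotiation.c).
 * Returns 1 if the base name of `request` equals the base name of `name`
 * and each later dot-separated component of `request` occurs, as a whole
 * component and in the same order, among the components of `name`. */
static int components_in_order(const char *request, const char *name)
{
    size_t rlen = strcspn(request, ".");
    size_t nlen = strcspn(name, ".");

    /* The part before the first dot must be identical. */
    if (rlen != nlen || strncmp(request, name, rlen) != 0)
        return 0;
    request += rlen;
    name += nlen;

    /* For each requested extension, scan forward through the file's
     * remaining extensions until an exact match is found. */
    while (*request == '.') {
        int found = 0;

        ++request;
        rlen = strcspn(request, ".");
        while (*name == '.' && !found) {
            ++name;
            nlen = strcspn(name, ".");
            found = (nlen == rlen && strncmp(request, name, rlen) == 0);
            name += nlen;   /* never revisit an earlier extension */
        }
        if (!found)
            return 0;
        request += rlen;
    }
    return 1;
}
```

Under this sketch index.en and index.html both find index.html.en, and
index.html also finds index.en.html, but index.en.html does not find
index.html.en, and a partial extension such as index.h finds nothing.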

Bill


Re: [PATCH] mod_negotiation, suffix order

Posted by Francis Daly <de...@daoine.org>.
On Tue, Oct 02, 2001 at 08:33:21PM -0500, William A.  Rowe, Jr.  wrote:

>   I am very impressed by this idea for Apache 2.0.  But I don't like the
> many to many mapping.  If we change your underlying rule here to require
> that each filename extension is passed in sequence, I would be _very_ 
> happy to commit this patch :)  E.g. index.en _could_ match index.html.en.
> But index.en.html would _not_ match index.html.en.

By that, I take it you mean something like a requirement (5), or
perhaps (3b), to be added to the description below, along the lines of
(3b)"each .suffix in r->filename must exist in the real filename, in
the same sequence as they were in r->filename"?  (r->filename here
means "the bit of it after the final /")

>> The requirements are (1)r->filename up to the first dot must match the
>> real filename up to the first dot; (2)r->filename may not be longer
>> than the real filename; (3)each .suffix in r->filename must exist
>> (string match) in the real filename; (4)the real filename must
>> correspond to a known mime-type, encoding, etc -- which I think means
>> that the final suffix must be known, and only suffixes followed by
>> known suffixes are considered.

[ I note that others feel that (1) above should be replaced with
something more like "all of r->filename must match the start of the
real filename", which would make the remainder of this mail
irrelevant.  I'll continue anyway, but feel free to bin it if this is
the Wrong Thing ]

In case my interpretation of (3b) is unclear: as a (hopefully)
complete example, given the file "name.a.b.x.cd.e.f", and presuming
that "x" is the only suffix which is _not_ a recognised mime extension
(type, language, encoding, whatever) then which of the following
requests should be accepted, and which not?

name
name.a
name.a.b
name.a.c
name.a.cd
name.b.c.e
name.b.x.e.f
name.e
name.x.c
name.x.f
name.a.b...f

name.b.a
name.a.b.f.c
name.x.b
name.c.x
name.f.e

name.e.
name.e.e
name.f.

(my understanding is that the first group should all be passed down as
possibilities, the second group shouldn't, and the third group could
be anything.  I'd plump for "yes" for all three, probably.)

And for extra fun, which would be different if the file were called
"name.a.b.x.cd.e.f.e".

(my understanding is that the third group definitely becomes "yes",
name.f.e from the second group becomes "yes", and the rest stay as
they were.)

As mentioned in the earlier mail, this patch just decides whether or
not to allow the file as a possibility -- later code gets a shot at
deciding how to handle the suffixes, so if any of the trailing
not-explicitly-listed-in-r->filename suffixes aren't actually
recognised, the only way to get the file would be to request it
by the full name, and therefore bypass mod_negotiation.

For the specific example above, this means that the only requests that
would actually return the file would be name.b.x.e.f, name.x.c, and
name.x.f

The change to the patch to limit the matches as described above is
mostly straightforward -- instead of starting each strstr() at the
start of "name", start it at the point of the previous match (either
the start or the end -- it'd presumably make a difference if someone
requests "file.html.html").
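
To make the start-versus-end choice concrete, here is a minimal standalone
sketch of the ordered search.  It is not the patch itself: match_in_order()
and its resume_at_end flag are hypothetical names, it copies each suffix
into a local buffer rather than editing the request string in place, and it
only covers requirements (1), (2) and the ordered form of (3).

```c
#include <string.h>

/* Hypothetical helper (not part of mod_negotiation.c).
 * request:        the part of r->filename after the final '/'
 * name:           a candidate directory entry name
 * resume_at_end:  0 = next search starts at the start of the previous
 *                 match, 1 = it starts just past the previous match
 * Returns 1 if `name` should be passed on as a possibility, else 0. */
static int match_in_order(const char *request, const char *name,
                          int resume_at_end)
{
    const char *rdot = strchr(request, '.');
    size_t base_len = rdot ? (size_t)(rdot - request) : strlen(request);
    const char *dpos;
    char suffix[256];

    /* (1) everything before the first dot must match, then a dot */
    if (strncmp(name, request, base_len) != 0 || name[base_len] != '.')
        return 0;

    /* (2) the request may not be longer than the real filename */
    if (strlen(request) > strlen(name))
        return 0;

    /* (3b) each ".suffix" must be found at or after the previous match */
    dpos = name + base_len;
    while (rdot != NULL) {
        const char *next = strchr(rdot + 1, '.');
        size_t len = next ? (size_t)(next - rdot) : strlen(rdot);

        if (len >= sizeof(suffix))
            return 0;                 /* refuse absurdly long suffixes */
        memcpy(suffix, rdot, len);    /* copy ".suffix" including its dot */
        suffix[len] = '\0';

        dpos = strstr(dpos, suffix);  /* substring match, as in the patch */
        if (dpos == NULL)
            return 0;
        if (resume_at_end)
            dpos += len;
        rdot = next;
    }
    return 1;
}
```

Against the file name.a.b.x.cd.e.f this accepts name.a.c and name.b.c.e
(".c" is found inside ".cd" because strstr() matches substrings) and rejects
name.b.a and name.c.x.  The request name.e.e is accepted when the search
resumes at the start of the previous match and rejected when it resumes at
the end, which is the same kind of difference a request for file.html.html
would show.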

A new const char * which points into dirent.name is the only addition
over the previous patch.  However, unless someone has a
case-insensitive strstr() lying around, the CASE_BLIND_FILESYSTEM
cases won't work sensibly -- the "name" part would match
insensitively, but each suffix won't.
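
For reference, a case-insensitive strstr() is short enough to sketch in
portable C.  The helper below is hypothetical (ci_strstr() is an invented
name, and this is not what the patch uses); it folds case with tolower(),
so it is only as locale-aware as tolower() is.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: like strstr(), but compares characters with
 * tolower().  Returns a pointer to the first match in `haystack`,
 * or NULL if `needle` does not occur. */
static const char *ci_strstr(const char *haystack, const char *needle)
{
    size_t i;

    if (*needle == '\0')
        return haystack;              /* empty needle matches at the start */

    for (; *haystack != '\0'; ++haystack) {
        for (i = 0; needle[i] != '\0'; ++i) {
            if (haystack[i] == '\0')
                return NULL;          /* too little haystack left to match */
            if (tolower((unsigned char)haystack[i]) !=
                tolower((unsigned char)needle[i]))
                break;
        }
        if (needle[i] == '\0')
            return haystack;          /* whole needle matched here */
    }
    return NULL;
}
```

Swapping something like this in for strstr() under CASE_BLIND_FILESYSTEM
would let a request for name.HTML find name.html.en, at the cost of a
slower inner loop.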

I'm including the reworked patch below, in case it's considered
useful.  Written and somewhat tested against mod_negotiation.c from
httpd-2.0.25; it applies cleanly to CVS version 1.84.

	f
-- 
Francis Daly        deva@daoine.org

=============================

--- mod_negotiation.c.orig	Tue Aug 28 04:08:31 2001
+++ mod_negotiation.c	Wed Oct  3 21:44:12 2001
@@ -1019,6 +1019,11 @@
     struct var_rec mime_info;
     struct accept_rec accept_info;
     void *new_var;
+    char *pos;
+    int pos_len;
+    int not_this_dirent;        /* actually, boolean. */
+    int dots_in_request = 0;    /* 1 == one dot, 2 == some dots */
+    const char *dpos;           /* points into the dirent.name */
     int anymatch = 0;
 
     clean_var_rec(&mime_info);
@@ -1041,20 +1046,92 @@
         return HTTP_FORBIDDEN;
     }
 
+    if ((pos = strchr(filp, '.'))) {
+        dots_in_request = 1;
+        if (strchr(++pos, '.')) {
+            dots_in_request = 2;
+        }
+    }
+
     while (apr_dir_read(&dirent, APR_FINFO_DIRENT, dirp) == APR_SUCCESS) {
         apr_array_header_t *exception_list;
         request_rec *sub_req;
         
-        /* Do we have a match? */
+        if (!dots_in_request) {
+
+            /* Given "name", check for "name." */
 #ifdef CASE_BLIND_FILESYSTEM
-        if (strncasecmp(dirent.name, filp, prefix_len)) {
+            if (strncasecmp(dirent.name, filp, prefix_len)) {
 #else
-        if (strncmp(dirent.name, filp, prefix_len)) {
+            if (strncmp(dirent.name, filp, prefix_len)) {
 #endif
-            continue;
-        }
-        if (dirent.name[prefix_len] != '.') {
-            continue;
+                continue;
+            }
+            if (dirent.name[prefix_len] != '.') {
+                continue;
+            }
+
+        } else {
+
+            /* Given "name.suffixes", check for "name." */
+            pos = strchr(filp, '.');
+            pos_len = pos - filp + 1;
+#ifdef CASE_BLIND_FILESYSTEM
+            if (strncasecmp(dirent.name, filp, pos_len)) {
+#else
+            if (strncmp(dirent.name, filp, pos_len)) {
+#endif
+                continue;
+            }
+
+            /* Next search can start at the first dot in dirent.name */
+            dpos = &dirent.name[pos_len-1];
+            not_this_dirent = 0;
+            filp = ++pos;
+
+            /* Given "name.suf1.suf2.suffix", check for each ".sufN",
+               somewhere after the previous match */
+            if (2 == dots_in_request) {
+                /* Give up now if the request is longer than the file */
+                if (prefix_len > strlen(dirent.name)) {
+                    filp -= pos_len;
+                    continue;
+                }
+
+                while ((pos = strchr(filp, '.'))) {
+
+                    --filp;
+                    pos_len = pos - filp ;
+                    filp[pos_len]='\0';
+                    if ((dpos = strstr(dpos, filp)) == NULL) {
+                        not_this_dirent=1;
+                    }
+
+                    filp[pos_len] = '.';
+                    filp += pos_len + 1;
+                    
+                    if (not_this_dirent) {
+                        /* get to next dirent */
+                        break;
+                    }
+                }
+                if (not_this_dirent) {
+                    /* reset filp before trying next dirent */
+                    pos_len = strlen(filp);
+                    filp -= prefix_len - pos_len;
+                    continue;
+                }
+            }
+            --filp;
+            pos_len = strlen(filp);
+
+            /* Check for the final ".suffix" */
+
+            if (!strstr(dpos, filp)) {
+                filp -= prefix_len - pos_len;
+                continue;
+            }
+            filp -= prefix_len - pos_len;
         }
 
         /* Ok, something's here.  Maybe nothing useful.  Remember that