You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@httpd.apache.org by Martin Kraemer <Ma...@mch.sni.de> on 1997/10/06 18:37:22 UTC

[PATCH] Parsing URI into its components - comments welcome

Hi again,

Please find attached my first try at parsing the several fields present
in a request's URL into a common structure which can then later be used
by other modules. The proxy module will profit most of this (the
XXX_canon_netloc() routines will become unneccessary).

This is a WIP and I just would like you to utter a short comment on the
concept. Do you think this will de-stabilize apache, or is it a sensible
way to go? In all areas where I have been missing information like the
decomposed URL I think it would be a useful addition (http_protocol:
decision if a proxy request is really a local request and should be
rewritten; proxy module: extracting user and password from URI;
http_log/mod_log_config (not yet): conceal the password so that it
cannot be misused by scanning the logfiles;...).

Currently, I coded the routines directly into http_protocol.c . A better
place IMO would be to add it to util.c because a routine like
parse_uri_components() could be used to parse config file parameters
as well (e.g., "ProxyRemote * http://proxy.mch.sni.de", it would
simplify implementation of more specific arguments to directives like
"NoProxy http://my.host:81"- right now, NoProxy only allows
hosts/domains, not scheme and port)

If you find the time, have a glimpse at the diffs and tell me if it
makes sense to continue with this.

On Sun, Oct 05, 1997 at 06:47:07PM -0700, Dean Gaudet wrote:
> Open issues:
> 
>     * PR#1049: name-based, multi-port servers don't work
>       Dean posted a patch but Martin mentioned it was wrong and proposed a
>       better solution.  Waiting for his work.

Anyway, I think that Dean's patch (removing
-    if (port && (port != r->server->port))
-        return;
from the beginning check_hostalias()) should be committed in any case.
Even if it not the final solution, it much improves current behavior.
Maybe it should be coupled with always using
    port = ntohs(r->connection->local_addr.sin_port);
instead of the request's port string?


    Martin

====
What I did to httpd.h:

    *   Added HTTP_PROTO(major,minor) macro as an abstraction layer for
	the internal representation (proto_num) of a protocol string.

    *   Added uri_components structure. It contains variables for the
	syntactic elements found in a typical request. Parsing these
	once simplifies handling in the modules.

What I did to http_protocol.c:

    *   Changed all port references to ignore the port string supplied
	in the request, and use the actually used port instead.
    @@@ Please review these changes: is it sensible to always use the
    @@@ actual port? In which situation would it make sense to look at
    @@@ the port given in the request?

    *   rfc1739 states:
	    ; the scheme is in lower case; interpreters should use case-ignore
	    scheme         = 1*[ lowalpha | digit | "+" | "-" | "." ]
	Therefore I changed references like strcmp(scheme,"http") to use
	strcasecmp() instead.

    *   Changed references to r->proto_num to use the HTTP_PROTO() macro.
	Fixed initial parsing of the protocol string to not accept
	"HTTP/0.1001" as a valid HTTP/1.1 request. Well, it still
	defaults to HTTP/1.0 as before.

    *   Changed most of the HTTP_ return codes from old to new style
	(e.g., SERVER_ERROR -> HTTP_INTERNAL_SERVER_ERROR)

    *   Added a short "Side Effects:" header to some routines where
	knowing the side effects helps understanding the server logic.

    *   Added a pair of routines to parse/unparse an uri string to/from
	an uri_components structure:
	    int parse_uri_components(pool *p, const char *uri, uri_components *c, int *pHostlen);
	    char *unparse_uri_components(pool *p, const uri_components *c, int *pHostlen);
	These probably should go to into util.c. Right now they still
	live in http_protocol.c .

    *   Used the new unparse_uri_components() to rewrite r->the_request
	in the case where a password is contained in the request. This
	password should never be logged in plaintext, so
	unparse_uri_components() rewrites r->the_request to just contain
	XXXXXXXX.
	("GET ftp://john.doe:XXXXXXXX@some.host/some/file HTTP/1.0")

    *   At the moment, changes which are based on the r->parsed_uri fields
	are encapsulated in #ifdef MnKr_uri_components (even in comments
	if that seemed to clarify things). The other changes are not
	#ifdef'ed

    *   Changed parse_uri(), check_fulluri(), read_request_line(),
	and check_hostalias() to make use of the additional information.
-- 
| S I E M E N S |  <Ma...@mch.sni.de>  |      Siemens Nixdorf
| ------------- |   Voice: +49-89-636-46021     |  Informationssysteme AG
| N I X D O R F |   FAX:   +49-89-636-44994     |   81730 Munich, Germany
~~~~~~~~~~~~~~~~My opinions only, of course; pgp key available on request

[CGI Security] Hole in Netscape Commerce Server

Posted by Martin Kraemer <Ma...@mch.sni.de>.
The german computer magazine c't reported in an article published on
24-Sep-97 <URL:http://www.heise.de/newsticker/data/hb-24.09.97-000/>
(in german!) about a hole that exists in Netscape Commerce- and
Communication Server Version 1.12 and 2.0 (and which was stuffed
in apache long ago -- was it?!?):
when a CGI document is access protected, the protection can be
circumvented just by appending a /xxx path info to the CGI URL. The
examples which are demonstrated by c't can be tested at
<URL:http://www.heise.de/bin/showsecrets>   (protected CGI script) and
<URL:http://www.heise.de/bin/showsecrets/foobar> (accessible)....

Didn't see a notice about it on this list yet.

    Martin
-- 
| S I E M E N S |  <Ma...@mch.sni.de>  |      Siemens Nixdorf
| ------------- |   Voice: +49-89-636-46021     |  Informationssysteme AG
| N I X D O R F |   FAX:   +49-89-636-44994     |   81730 Munich, Germany
~~~~~~~~~~~~~~~~My opinions only, of course; pgp key available on request

Re: [PATCH] Parsing URI into its components - comments welcome

Posted by Dean Gaudet <dg...@arctic.org>.
I'm pretty certain that my vhost revamp covers Ed's patch.

Dean

On Mon, 6 Oct 1997, Martin Kraemer wrote:

> On Mon, Oct 06, 1997 at 07:37:22PM +0200, Martin Kraemer wrote:
> > Anyway, I think that Dean's patch (removing
> > -    if (port && (port != r->server->port))
> > -        return;
> > from the beginning check_hostalias()) should be committed in any case.
> 
> Oooh... I just realized I haven't integrated Ed Korthof's patch to
> check_hostalias() yet....
> (<Pi...@eat.organic.com>) Yes, it is a
> security patch and I think it should go into 1.3b1 as well.
> 
> 
>     Martin
> -- 
> | S I E M E N S |  <Ma...@mch.sni.de>  |      Siemens Nixdorf
> | ------------- |   Voice: +49-89-636-46021     |  Informationssysteme AG
> | N I X D O R F |   FAX:   +49-89-636-44994     |   81730 Munich, Germany
> ~~~~~~~~~~~~~~~~My opinions only, of course; pgp key available on request
> 


Re: [PATCH] Parsing URI into its components - comments welcome

Posted by Martin Kraemer <Ma...@mch.sni.de>.
On Mon, Oct 06, 1997 at 07:37:22PM +0200, Martin Kraemer wrote:
> Anyway, I think that Dean's patch (removing
> -    if (port && (port != r->server->port))
> -        return;
> from the beginning check_hostalias()) should be committed in any case.

Oooh... I just realized I haven't integrated Ed Korthof's patch to
check_hostalias() yet....
(<Pi...@eat.organic.com>) Yes, it is a
security patch and I think it should go into 1.3b1 as well.


    Martin
-- 
| S I E M E N S |  <Ma...@mch.sni.de>  |      Siemens Nixdorf
| ------------- |   Voice: +49-89-636-46021     |  Informationssysteme AG
| N I X D O R F |   FAX:   +49-89-636-44994     |   81730 Munich, Germany
~~~~~~~~~~~~~~~~My opinions only, of course; pgp key available on request

Re: [PATCH] Parsing URI into its components - comments welcome

Posted by Martin Kraemer <Ma...@mch.sni.de>.
Thanks a lot, Dean, for taking the time and looking at my patch. Roy,
thanks to you, too, I just printed your url.txt and am going to look at
it later. BTW: No, I do't think it's a good solution to use regex
parsing, because then we have the same situation as before, that the
parsing is done repeatedly at different code locations, and every
code location does it "a bit differently". My solution will parse the URL
once, extracting everything that any module would need, and providing
an unparser (similar to the one described in your text) which can
re-create an URL from the single parts.

> -    sscanf(r->protocol, "HTTP/%d.%d", &major, &minor);
> -    r->proto_num = 1000 * major + minor;
> +    if (2 == sscanf(r->protocol, "HTTP/%d.%d", &major, &minor)
> +      && minor < HTTP_PROTO(1,0))	/* don't allow HTTP/0.1000 */
> +	r->proto_num = HTTP_PROTO(major,minor);
> +    else
> +	r->proto_num = HTTP_PROTO(1,0);
>  
> This breaks, for example, on HTTP/1.1000, which is a valid input and
> should be treated like HTTP/1.1.

RFC2068 says...:
## The version of an HTTP message is indicated by an HTTP-Version field
## in the first line of the message.
##
##        HTTP-Version   = "HTTP" "/" 1*DIGIT "." 1*DIGIT
##
## Note that the major and minor numbers MUST be treated as separate
## integers and that each may be incremented higher than a single digit.
## Thus, HTTP/2.4 is a lower version than HTTP/2.13, which in turn is
## lower than HTTP/12.3. Leading zeros MUST be ignored by recipients and
## MUST NOT be sent.

So, HTTP/1.1000 would the 999th update to, and be compatible with,
but not be a synonym for, HTTP/1.1;  It is therefore indeed the
safest game to check sensible input like you propose:

> How about this:
> 
>     if (2 == sscan(r->protocol, "HTTP/%u.%u", &major, &minor)) {
> 	if (minor < HTTP_PROTO(1,0)) {
> 	    r->proto_num = HTTP_PROTO(major,minor);
> 	}
> 	else if (major < 1) {
> 	    bogosity, die with a 505
...because before HTTP/1.0, no version string was given anyway, right.
> 	}
> 	else {
> 	    /* it has to be HTTP/1.1 backwards compliant */
>...
> Oh yeah, multiplying by 1000 is silly, it'd be better as 256, which
> shaves off many cycles :)

It's not _that_ bad, since the multiplication is executed only once per
request. Oh, you had a smilie there... ;-)

I kept the 1000 factor in my HTTP_PROTO() macro to be compatible
with all the other code locations where the macro is not used yet.

    Martin
-- 
| S I E M E N S |  <Ma...@mch.sni.de>  |      Siemens Nixdorf
| ------------- |   Voice: +49-89-636-46021     |  Informationssysteme AG
| N I X D O R F |   FAX:   +49-89-636-44994     |   81730 Munich, Germany
~~~~~~~~~~~~~~~~My opinions only, of course; pgp key available on request

Re: [PATCH] Parsing URI into its components - comments welcome

Posted by Dean Gaudet <dg...@arctic.org>.

On Mon, 6 Oct 1997, Martin Kraemer wrote:

> Anyway, I think that Dean's patch (removing
> -    if (port && (port != r->server->port))
> -        return;
> from the beginning check_hostalias()) should be committed in any case.
> Even if it not the final solution, it much improves current behavior.
> Maybe it should be coupled with always using
>     port = ntohs(r->connection->local_addr.sin_port);
> instead of the request's port string?

Hey Martin, I'll study your patch later, but it looks like you and I are
trouncing in the same section of code, my bad.  We're doing different
things though -- it'll take a hand merge.  But if you look at my vhost
patch recently you'll see I did exactly what you suggest there. 

Dean

Re: [PATCH] Parsing URI into its components - comments welcome

Posted by Dean Gaudet <dg...@arctic.org>.
On Mon, 6 Oct 1997, Martin Kraemer wrote:

> This is a WIP and I just would like you to utter a short comment on the
> concept. Do you think this will de-stabilize apache, or is it a sensible
> way to go? In all areas where I have been missing information like the
> decomposed URL I think it would be a useful addition (http_protocol:
> decision if a proxy request is really a local request and should be
> rewritten; proxy module: extracting user and password from URI;
> http_log/mod_log_config (not yet): conceal the password so that it
> cannot be misused by scanning the logfiles;...).

I think this is a Good Thing.  You've exposed a few bugs and this helps
clarify how some of the uri parsing happens.

> Currently, I coded the routines directly into http_protocol.c . A better
> place IMO would be to add it to util.c because a routine like
> parse_uri_components() could be used to parse config file parameters
> as well (e.g., "ProxyRemote * http://proxy.mch.sni.de", it would
> simplify implementation of more specific arguments to directives like
> "NoProxy http://my.host:81"- right now, NoProxy only allows
> hosts/domains, not scheme and port)

Yes the new routines would be better off in util.c.  Or even a new file
util_uri.c. 

> If you find the time, have a glimpse at the diffs and tell me if it
> makes sense to continue with this.

Yup!  Specific comments follow.

> Anyway, I think that Dean's patch (removing
> -    if (port && (port != r->server->port))
> -        return;
> from the beginning check_hostalias()) should be committed in any case.
> Even if it not the final solution, it much improves current behavior.
> Maybe it should be coupled with always using
>     port = ntohs(r->connection->local_addr.sin_port);
> instead of the request's port string?

Yeah as I mentioned this is superceded by my vhosts revamp.


> What I did to http_protocol.c:
> 
>     *   Changed all port references to ignore the port string supplied
> 	in the request, and use the actually used port instead.
>     @@@ Please review these changes: is it sensible to always use the
>     @@@ actual port? In which situation would it make sense to look at
>     @@@ the port given in the request?
> 
>     *   rfc1739 states:
> 	    ; the scheme is in lower case; interpreters should use case-ignore
> 	    scheme         = 1*[ lowalpha | digit | "+" | "-" | "." ]
> 	Therefore I changed references like strcmp(scheme,"http") to use
> 	strcasecmp() instead.
> 
>     *   Changed references to r->proto_num to use the HTTP_PROTO() macro.
> 	Fixed initial parsing of the protocol string to not accept
> 	"HTTP/0.1001" as a valid HTTP/1.1 request. Well, it still
> 	defaults to HTTP/1.0 as before.
> 
>     *   Changed most of the HTTP_ return codes from old to new style
> 	(e.g., SERVER_ERROR -> HTTP_INTERNAL_SERVER_ERROR)
> 
>     *   Added a short "Side Effects:" header to some routines where
> 	knowing the side effects helps understanding the server logic.
> 
>     *   Added a pair of routines to parse/unparse an uri string to/from
> 	an uri_components structure:
> 	    int parse_uri_components(pool *p, const char *uri, uri_components *c, int *pHostlen);
> 	    char *unparse_uri_components(pool *p, const uri_components *c, int *pHostlen);
> 	These probably should go to into util.c. Right now they still
> 	live in http_protocol.c .
> 
>     *   Used the new unparse_uri_components() to rewrite r->the_request
> 	in the case where a password is contained in the request. This
> 	password should never be logged in plaintext, so
> 	unparse_uri_components() rewrites r->the_request to just contain
> 	XXXXXXXX.
> 	("GET ftp://john.doe:XXXXXXXX@some.host/some/file HTTP/1.0")

I think I'd prefer if it were the caller's responsibility to do this
... or at least have some way for the caller to override this action to
get the real uri.  You've got 10 bitfields in there already, you could
add an 11th reveal_password, which defaults to 0.

I'd prefer the pHostLen to also be stored in the uri_components structure.

BTW you should reorder the structure so that the char *s appear first,
then the port, and then the bitfields... and then it'll be 2 bytes
shorter on almost all architectures.

> 
>     *   At the moment, changes which are based on the r->parsed_uri fields
> 	are encapsulated in #ifdef MnKr_uri_components (even in comments
> 	if that seemed to clarify things). The other changes are not
> 	#ifdef'ed
> 
>     *   Changed parse_uri(), check_fulluri(), read_request_line(),
> 	and check_hostalias() to make use of the additional information.

--- apachen/src/main/httpd.h	Sun Sep 28 01:33:35 1997
+++ src/main/httpd.h	Mon Oct  6 15:46:06 1997
+    /* XXX: Hmmmm... Should we accept the given port,
+     * or always use the port the request actually came in on?!
+     */
+/*XXX:
+    port = (r->parsed_uri.has_port) ? r->parsed_uri.port : ntohs(r->connection->local_addr.sin_port);
+*/
+    port = ntohs(r->connection->local_addr.sin_port);

I'm not sure if this, in check_fulluri, is right.  You're trying
to determine if you should proxy or not ... and if the given port is
different than the one it came in on then you should proxy ... besides,
I enforce the port restrictions later on in the new vhost code.

+       if (r->parsed_uri.dns_looked_up)
+           /* looked up earlier; re-use: */
+           hp = &r->parsed_uri.hostaddr;
+       else {
+           hp = gethostbyname(host);
+           r->parsed_uri.dns_looked_up = 1;
+           r->parsed_uri.dns_resolved = (hp != NULL);
+           if (hp != NULL)
+               r->parsed_uri.hostaddr = *hp;
+       }
+

I don't think you can just copy hp here, because the gethostbyname result
uses static storage ... even for things like h_name, h_aliases, and
h_addr_list.

+	if (r->parsed_uri.dns_resolved) {
+	    for (n = 0; hp->h_addr_list[n] != NULL; n++) {
+		if (r->connection->local_addr.sin_addr.s_addr ==
+		    (((struct in_addr *) (hp->h_addr_list[n]))->s_addr)
+/* XXX:@@MnKr@@ does this make sense? Or should we compare against all VHosts?
+   If a port was given, we check if it matches the actually used port.
+		    && (port == ntohs(r->connection->local_addr.sin_port))
+   Could this be a Security Risk if a wrong port was given in the request?
+   Or should only ntohs(r->connection->local_addr.sin_port) ever be used?
+ */
+		    ) {
+		    r->proxyreq = 0;
+		    r->uri += r->hostlen;
+		    r->hostlen = 0;
+		    break;
+		}
+	    }
+	}

You see, this entire section of code has never made sense to me.  It is a
terribly incomplete comparison to see if you're proxying to the same
address and doesn't do anything right in my opinion.  It doesn't deal
with ServerAlias, it doesn't handle all the hostnames/addresses from
the <VirtualHost> line ... yadda yadda.  Can someone explain what it's
supposed to do?

 
     r->assbackwards = (ll[0] == '\0');
     r->protocol = pstrdup(r->pool, ll[0] ? ll : "HTTP/0.9");
-    sscanf(r->protocol, "HTTP/%d.%d", &major, &minor);
-    r->proto_num = 1000 * major + minor;
+    if (2 == sscanf(r->protocol, "HTTP/%d.%d", &major, &minor)
+      && minor < HTTP_PROTO(1,0))	/* don't allow HTTP/0.1000 */
+	r->proto_num = HTTP_PROTO(major,minor);
+    else
+	r->proto_num = HTTP_PROTO(1,0);
 
This breaks, for example, on HTTP/1.1000, which is a valid input and
should be treated like HTTP/1.1.  How about this:

    if (2 == sscan(r->protocol, "HTTP/%u.%u", &major, &minor)) {
	if (minor < HTTP_PROTO(1,0)) {
	    r->proto_num = HTTP_PROTO(major,minor);
	}
	else if (major < 1) {
	    bogosity, die with a 505
	}
	else {
	    /* it has to be HTTP/1.1 backwards compliant */
	    r->proto_num = HTTP_PROTO(1,1);
	}
    }
    else {
	bogosity, die with a 400
    }

Oh yeah, multiplying by 1000 is silly, it'd be better as 256, which
shaves off many cycles :)

Dean