Posted to dev@hc.apache.org by Todd Wilson <to...@screen-scraper.com> on 2006/08/09 05:13:10 UTC

"Unable to parse header" issue

Greetings,

Can't seem to find anything on this in Bugzilla or the list archives, so
I thought I'd throw it out to the group before submitting a bug.

This code:

----------------------------------------------------------
import org.apache.commons.httpclient.*;
import org.apache.commons.httpclient.methods.*;
import org.apache.commons.httpclient.params.HttpMethodParams;

public class ParseHeaderRepro
{
  public static void main(String[] args)
  {
    HttpClient client = new HttpClient();

    GetMethod method = new GetMethod(
      "http://212.50.188.107/cgi-win/vebra.cgi?details1?src=vebra&PropertyCode=1007003/ASHGR/38878/3"
    );

    try
    {
      // Execute the method.
      int statusCode = client.executeMethod(method);

      if (statusCode != HttpStatus.SC_OK)
      {
        System.out.println("Method failed: " + method.getStatusLine());
      }

      // Read the response body.
      byte[] responseBody = method.getResponseBody();

      // Deal with the response (decoded with the platform default charset).
      String response = new String(responseBody);

      System.out.println("Result from get: " + response);
    }
    catch (Exception e)
    {
      System.out.println("Error: " + e.getMessage());
    }
    finally
    {
      // Release the connection.
      method.releaseConnection();
    }
  }
}
----------------------------------------------------------

Produces this exception:

----------------------------------------------------------
org.apache.commons.httpclient.ProtocolException: Unable to parse header:
HTTP/1.0 200 OK
	at org.apache.commons.httpclient.HttpParser.parseHeaders(Ljava.io.InputStream;Ljava.lang.String;)[Lorg.apache.commons.httpclient.Header;(Unknown Source)
	at org.apache.commons.httpclient.HttpMethodBase.readResponseHeaders(Lorg.apache.commons.httpclient.HttpState;Lorg.apache.commons.httpclient.HttpConnection;)V(Unknown Source)
	at org.apache.commons.httpclient.HttpMethodBase.readResponse(Lorg.apache.commons.httpclient.HttpState;Lorg.apache.commons.httpclient.HttpConnection;)V(Unknown Source)
	at org.apache.commons.httpclient.HttpMethodBase.execute(Lorg.apache.commons.httpclient.HttpState;Lorg.apache.commons.httpclient.HttpConnection;)I(Unknown Source)
	at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(Lorg.apache.commons.httpclient.HttpMethod;)V(Unknown Source)
	at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(Lorg.apache.commons.httpclient.HttpMethod;)V(Unknown Source)
	at org.apache.commons.httpclient.HttpClient.executeMethod(Lorg.apache.commons.httpclient.HostConfiguration;Lorg.apache.commons.httpclient.HttpMethod;Lorg.apache.commons.httpclient.HttpState;)I(Unknown Source)
	at org.apache.commons.httpclient.HttpClient.executeMethod(Lorg.apache.commons.httpclient.HttpMethod;)I(Unknown Source)
----------------------------------------------------------

If I try the HTTP request manually via telnet, here's what I get for the
HTTP response:

----------------------------------------------------------
HTTP/1.1 200 OK
Server: Microsoft-IIS/4.0
Date: Tue, 08 Aug 2006 16:31:46 GMT
HTTP/1.0 200 OK
Content-type: Text/HTML

<HTML>
  <HEAD>
    <META HTTP-EQUIV="Content-Type" CONTENT="text/html;
    charset=iso-8859-1">
<title>Burwell -   1 bed Flat/ Maisonette</title>
<script language="JavaScript" >
<!--
function MM_openBrWindow(theURL,winName,features) { //v2.0
  window.open(theURL,winName,features);
}
//-->
.
.
.
----------------------------------------------------------

What do you think?  I'm honestly not sure what the cause is.  I copied
the HTTP response directly from a command prompt window, but it's
possible there could be some other white space in there that I didn't
include.
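
One way to rule out missing white space is to dump the raw bytes with
the line terminators made visible, instead of copying from a telnet
window. The following is a minimal sketch using only java.net.Socket;
the class name RawResponseDump, the 2048-byte limit, and the use of
port 80 are assumptions made for illustration, not part of HttpClient.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class RawResponseDump {

    // Builds a minimal HTTP/1.0 GET request for the given host and path.
    public static String buildRequest(String host, String path) {
        return "GET " + path + " HTTP/1.0\r\n"
             + "Host: " + host + "\r\n"
             + "\r\n";
    }

    // Makes CR and LF visible so the exact line terminators can be inspected.
    public static String escape(int b) {
        if (b == '\r') {
            return "\\r";
        }
        if (b == '\n') {
            return "\\n\n";
        }
        return String.valueOf((char) b);
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.out.println("usage: RawResponseDump <host> [path]");
            return;
        }
        String host = args[0];
        String path = args.length > 1 ? args[1] : "/";

        try (Socket socket = new Socket(host, 80)) {
            OutputStream out = socket.getOutputStream();
            out.write(buildRequest(host, path).getBytes(StandardCharsets.ISO_8859_1));
            out.flush();

            // Print only the first bytes; the response head is what matters here.
            InputStream in = socket.getInputStream();
            int remaining = 2048;
            int b;
            while (remaining-- > 0 && (b = in.read()) != -1) {
                System.out.print(escape(b));
            }
        }
    }
}
```

Each CR shows up as \r and each LF as \n followed by a real line
break, so a bare \n or a stray blank line in the response head becomes
easy to spot.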

By the way, I realize the URL is completely malformed, but it seems to
work just fine in a browser.  The server is also probably doing who
knows what contrary to the HTTP spec, but such it is.

Thanks,

Todd Wilson

---------------------------------------------------------------------
To unsubscribe, e-mail: httpclient-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: httpclient-dev-help@jakarta.apache.org


Re: "Unable to parse header" issue

Posted by Roland Weber <ht...@dubioso.net>.
Hello Robert,

> Only handling strict http/1.x will mean that httpclient can not be used
> for many things.

Like making coffee :-) It can still be used for what it is meant
for: communicating over HTTP.

> For now rabbit has a flag to
> enable/disable strict http. That flag has to be set to false as default.

We wish we could do that as well. But we can't. The HttpClient 3.x
code base is monolithic. We can't disable a "strict" check, we'd have
to add code for tolerance.

> I have not checked whether httpclient handles newline issues strictly
> or is forgiving there. Most of the problems I saw with strict http
> handling were with CGI sending \n\n instead of \r\n\r\n.

It is forgiving, see HttpParser.readLine(InputStream, String).
We are also forgiving about spurious newlines at the beginning
of a response, which is necessary for keep-alive with servers
that send an extra newline after the response.
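
The kind of tolerance described here can be illustrated with a small
stand-alone reader. This is a sketch only and not the HttpClient
source: the class LenientLineReader and its two methods are
hypothetical names, and ISO-8859-1 is an assumed charset for the
header bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class LenientLineReader {

    // Reads one line terminated by "\n" or "\r\n" and returns it without
    // the terminator. Returns null if the stream is already at its end.
    public static String readLine(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        boolean sawAny = false;
        int b;
        while ((b = in.read()) != -1) {
            sawAny = true;
            if (b == '\n') {
                break;
            }
            buf.write(b);
        }
        if (!sawAny) {
            return null;
        }
        byte[] raw = buf.toByteArray();
        int len = raw.length;
        // Drop a trailing CR so that "\r\n" and "\n" are treated alike.
        if (len > 0 && raw[len - 1] == '\r') {
            len--;
        }
        return new String(raw, 0, len, StandardCharsets.ISO_8859_1);
    }

    // Skips empty lines in front of the status line, such as an extra
    // newline left over from the previous response on a kept-alive
    // connection, and returns the first non-empty line (or null).
    public static String readStatusLine(InputStream in) throws IOException {
        String line;
        while ((line = readLine(in)) != null) {
            if (!line.isEmpty()) {
                return line;
            }
        }
        return null;
    }

    public static void main(String[] args) throws IOException {
        // A spurious leading newline, then a mix of "\n" and "\r\n" endings.
        byte[] wire = "\r\nHTTP/1.1 200 OK\nServer: test\r\n\n"
                .getBytes(StandardCharsets.ISO_8859_1);
        InputStream in = new ByteArrayInputStream(wire);
        System.out.println(readStatusLine(in));
        System.out.println(readLine(in));
    }
}
```

Both line endings yield the same strings, and the blank line in front
of the status line is silently skipped.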

> If I were to try to build a web proxy with httpclient it would not work
> well, so all I can recommend is to at least make it possible to handle
> broken headers similar to how browsers treat them.

We are working on HttpComponents, which introduces a modular design.
When that is ready for prime time, it shouldn't be too hard to plug
in a different header parser that does something useful with
non-header lines. One could then implement two parsers: one that
treats a junk line as junk, skips it, and continues to read headers,
and one that treats a junk line as a sign of a missing empty line
after the headers and stops header parsing there.
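
The two parser behaviours described above might look roughly like the
following. This is a stand-alone sketch, not HttpComponents code: the
class TolerantHeaderParsing, its method names, and the simple
"name: value" test are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TolerantHeaderParsing {

    // Treats a line as a header if it has the shape "name: value" with a
    // non-empty name that contains no whitespace. Folded continuation
    // lines are deliberately not handled in this sketch.
    public static boolean looksLikeHeader(String line) {
        int colon = line.indexOf(':');
        if (colon <= 0) {
            return false;
        }
        for (int i = 0; i < colon; i++) {
            if (Character.isWhitespace(line.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    // Variant 1: ignore junk lines and keep reading until the empty line.
    public static List<String> skipJunk(List<String> lines) {
        List<String> headers = new ArrayList<>();
        for (String line : lines) {
            if (line.isEmpty()) {
                break;
            }
            if (looksLikeHeader(line)) {
                headers.add(line);
            }
        }
        return headers;
    }

    // Variant 2: treat the first junk line as the start of the body.
    public static List<String> stopAtJunk(List<String> lines) {
        List<String> headers = new ArrayList<>();
        for (String line : lines) {
            if (line.isEmpty() || !looksLikeHeader(line)) {
                break;
            }
            headers.add(line);
        }
        return headers;
    }

    public static void main(String[] args) {
        // The lines that follow the first status line in the reported response.
        List<String> lines = Arrays.asList(
                "Server: Microsoft-IIS/4.0",
                "Date: Tue, 08 Aug 2006 16:31:46 GMT",
                "HTTP/1.0 200 OK",
                "Content-type: Text/HTML",
                "",
                "<HTML>");
        System.out.println(skipJunk(lines));
        System.out.println(stopAtJunk(lines));
    }
}
```

For the response reported in this thread, the first variant keeps
Server, Date and Content-type, while the second keeps only Server and
Date and would hand everything from the stray status line onward to
the body.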

> Has anyone tried to run something like the co-advisor http tests with
> httpclient?
> http://coad.measurement-factory.com/

Not that I know of. We know that the design of HttpClient 3.x is
broken, so why should we run tests for things we wouldn't implement
there anyway? The redesign is our first priority, and it will keep
us very busy for months to come. Once we've got something we feel
comfortable with, we can consider adding features, including fault
tolerance. Thanks a lot for the link - sounds very interesting.
I'll remember it until the time is ripe.

> Just my thoughts, now I need to read through the nio-version to see how
> httpclient handles that, just to compare it to my proxy (which uses full
> nio for all handling).

HttpClient doesn't handle it at all. HttpComponents will, but
NIO support is only just in the making.

cheers,
  Roland



Re: "Unable to parse header" issue

Posted by Robert Olofsson <ro...@khelekore.org>.
Ortwin Glück wrote:
> The question is not *how* to handle this in your code, but rather *if*!
> That system is not speaking HTTP, but something else. Why bother at all?

Only handling strict http/1.x will mean that httpclient can not be used
for many things. In many ways I agree that only strict http should be
handled, but when I made that the default in my http proxy, rabbit, too
many sites stopped working. Rabbit is not based on httpclient and it is
for the moment only a web proxy. For now rabbit has a flag to
enable/disable strict http. That flag has to be set to false as default.

I have not checked whether httpclient handles newline issues strictly
or is forgiving there. Most of the problems I saw with strict http
handling were with CGI sending \n\n instead of \r\n\r\n.

If I were to try to build a web proxy with httpclient it would not work
well, so all I can recommend is to at least make it possible to handle
broken headers similar to how browsers treat them.

Has anyone tried to run something like the co-advisor http tests with 
httpclient?
http://coad.measurement-factory.com/

That is for http proxies so it may not be usable directly. The tests 
cost money, but "...free online access for qualified open-source 
software projects are available...".

> If you can, you should speak to the operators of that system and make
> them comply with (at least the basics of) the HTTP standard.

That is always worth trying, but it will not always work, at least not
fast enough. Servers today are usually better than they were a few years
ago though, so in time maybe we will have conforming clients and servers.

Just my thoughts, now I need to read through the nio-version to see how 
httpclient handles that, just to compare it to my proxy (which uses full 
nio for all handling).

/robo



Re: "Unable to parse header" issue

Posted by Ortwin Glück <od...@odi.ch>.
Todd,

The question is not *how* to handle this in your code, but rather *if*!

That system is not speaking HTTP, but something else. Why bother at all?

If you can, you should speak to the operators of that system and make
them comply with (at least the basics of) the HTTP standard.

Odi


Todd Wilson wrote:
> I can certainly understand not wanting to deal with this.  From the
> standpoint of an HttpClient user, the tricky part is that I can't think
> of a workaround if I still want to be able to work with this site.  The
> only option I can think of would be to fork HttpClient and provide my
> own fix, which I really have no desire to do.
> 
> I guess the question becomes, to what degree should provisions be made
> to deal with non-conforming web servers?  HttpClient already does this
> to some degree in the way it works with cookies (e.g., a "Compatibility"
> setting), among other things.  In this particular case, the server's
> response is obviously very flawed, so it may fall outside of this
> threshold of scenarios you're willing to deal with.  Again, the trouble
> is that I have no way of elegantly handling this in my own code.  If I
> want to use HttpClient I simply wouldn't be able to work with this site.
> 
> Todd



Re: "Unable to parse header" issue

Posted by Todd Wilson <to...@screen-scraper.com>.
I can certainly understand not wanting to deal with this.  From the
standpoint of an HttpClient user, the tricky part is that I can't think
of a workaround if I still want to be able to work with this site.  The
only option I can think of would be to fork HttpClient and provide my
own fix, which I really have no desire to do.

I guess the question becomes, to what degree should provisions be made
to deal with non-conforming web servers?  HttpClient already does this
to some degree in the way it works with cookies (e.g., a "Compatibility"
setting), among other things.  In this particular case, the server's
response is obviously very flawed, so it may fall outside of this
threshold of scenarios you're willing to deal with.  Again, the trouble
is that I have no way of elegantly handling this in my own code.  If I
want to use HttpClient I simply wouldn't be able to work with this site.

Todd


On Wed, 09 Aug 2006 10:42:42 +0200, "Oleg Kalnichevski"
<ol...@apache.org> said:
> On Wed, 2006-08-09 at 10:34 +0200, Ortwin Glück wrote:
> > Oleg,
> > 
> > Of course I agree. But I remember that we had seen this before. And I 
> > thought that there was code to check for duplicate status lines. But I 
> > can't seem to remember any details. Does anyone know more?
> > 
> > Odi
> > 
> 
> We saw something similar a couple of years ago. This kind of
> problem is not that uncommon, especially in HTTP responses generated by
> CGI scripts. As far as I remember the argument was all about "common
> browsers tolerate such protocol violations", which I personally do not
> find very convincing.
> 
> Oleg
> 
> 
> > Oleg Kalnichevski wrote:
> > > Todd,
> > > 
> > > The response head is completely messed up. Note the second status
> > > line (HTTP/1.0 200 OK) between the Date and Content-type headers.
> > > HttpClient is absolutely correct in rejecting this response as malformed.
> > > 
> > > Oleg


Re: "Unable to parse header" issue

Posted by Oleg Kalnichevski <ol...@apache.org>.
On Wed, 2006-08-09 at 10:34 +0200, Ortwin Glück wrote:
> Oleg,
> 
> Of course I agree. But I remember that we had seen this before. And I 
> thought that there was code to check for duplicate status lines. But I 
> can't seem to remember any details. Does anyone know more?
> 
> Odi
> 

We saw something similar a couple of years ago. This kind of
problem is not that uncommon, especially in HTTP responses generated by
CGI scripts. As far as I remember the argument was all about "common
browsers tolerate such protocol violations", which I personally do not
find very convincing.

Oleg


> Oleg Kalnichevski wrote:
> > Todd,
> > 
> > The response head is completely messed up. Note the second status
> > line (HTTP/1.0 200 OK) between the Date and Content-type headers.
> > HttpClient is absolutely correct in rejecting this response as malformed.
> > 
> > Oleg


Re: "Unable to parse header" issue

Posted by Ortwin Glück <od...@odi.ch>.
Oleg,

Of course I agree. But I remember that we had seen this before. And I 
thought that there was code to check for duplicate status lines. But I 
can't seem to remember any details. Does anyone know more?

Odi

Oleg Kalnichevski wrote:
> Todd,
> 
> The response head is completely messed up. Note the second status
> line (HTTP/1.0 200 OK) between the Date and Content-type headers.
> HttpClient is absolutely correct in rejecting this response as malformed.
> 
> Oleg



Re: "Unable to parse header" issue

Posted by Oleg Kalnichevski <ol...@apache.org>.
On Tue, 2006-08-08 at 21:13 -0600, Todd Wilson wrote:
> Greetings,
> 
> Can't seem to find anything on this in Bugzilla or the list archives, so
> I thought I'd throw it out to the group before submitting a bug.
...

> If I try the HTTP request manually via telnet, here's what I get for the
> HTTP response:
> 
> ----------------------------------------------------------
> HTTP/1.1 200 OK
> Server: Microsoft-IIS/4.0
> Date: Tue, 08 Aug 2006 16:31:46 GMT
> HTTP/1.0 200 OK
> Content-type: Text/HTML
> 
Todd,

The response head is completely messed up. Note the second status
line (HTTP/1.0 200 OK) between the Date and Content-type headers.
HttpClient is absolutely correct in rejecting this response as malformed.

Oleg


> <HTML>
>   <HEAD>
>     <META HTTP-EQUIV="Content-Type" CONTENT="text/html;
>     charset=iso-8859-1">
> <title>Burwell -   1 bed Flat/ Maisonette</title>
> <script language="JavaScript" >
> <!--
> function MM_openBrWindow(theURL,winName,features) { //v2.0
>   window.open(theURL,winName,features);
> }
> //-->
> .
> .
> .
> ----------------------------------------------------------
> 
> What do you think?  I'm honestly not sure what the cause is.  I copied
> the HTTP response directly from a command prompt window, but it's
> possible there could be some other white space in there that I didn't
> include.
> 
> By the way, I realize the URL is completely malformed, but it seems to
> work just fine in a browser.  The server is also probably doing who
> knows what contrary to the HTTP spec, but such it is.
> 
> Thanks,
> 
> Todd Wilson