Posted to dev@hc.apache.org by "Kalnichevski, Oleg" <ol...@bearingpoint.com> on 2003/03/21 15:00:25 UTC

[PATCH] HttpParser.readRawLine

This patch fixes the handling of line terminators that do not comply with RFC 822 (bare LF instead of CRLF), a problem reported by Carl A. Dunham.

Cheers

Oleg

-----Original Message-----
From: Carl A. Dunham [mailto:httpc@strategyforward.com]
Sent: Thursday, 20 March 2003 20:44
To: 'Commons HttpClient Project'
Subject: CRLF and Connection: close


First, let me say that finding HttpClient was this week's bacon-saver. The 
other alternatives out there are, shall we say politely, a bit lacking.

I have, however, run into a couple of things. Apologies if these have been 
covered before. I tried searching the archives, and found some close hits, but 
none dead on.

The application I'm working on is a simple page downloader, not really a 
spider, just downloads pages from a list and parses the text out of the html. 
So, it ends up having to deal with a bunch of less-than-ideal page constructs, 
and generally will not visit the same site more than once. It also will visit 
thousands of sites in a session.

The first thing has to do with the specs for Internet text messages (RFC 822), 
which HTTP messages are. Apparently this is not really followed, at least in 
terms of line termination. It is pretty clear that every line is supposed to 
end with CRLF (\r\n), yet even a brief look at real-world messages will show 
you that this is routinely ignored. In fact, it seems that a majority of 
messages are terminated only with LF (\n).

Where this comes into play is in HttpParser.readLine() and readRawLine(), 
which assume standards-compliant messages. Fortunately, most of the time the 
non-compliant messages work, because they are transmitted as single lines, so 
inputStream.read() returns -1 at the end, and everything is fine. However, 
occasionally this will not work, for example if the server-side code has 
something like:

setHeader("Something: value\nSomething-else: value\n\n");

Depending on the implementation, unexpected things can happen. Here is an 
example from a log:

2003/03/20 00:24:58:765 EST [TRACE] HttpParser - -enter HttpConnection.readLine()
2003/03/20 00:24:58:765 EST [TRACE] HttpParser - -enter HttpConnection.readRawLine()
2003/03/20 00:24:58:769 EST [DEBUG] wire - -<< "Content-type: text/html
Page-Completion-Status: Normal

 
" [\r\n]
2003/03/20 00:24:58:769 EST [TRACE] HttpParser - -enter HttpConnection.readLine()
2003/03/20 00:24:58:769 EST [TRACE] HttpParser - -enter HttpConnection.readRawLine()
2003/03/20 00:24:58:770 EST [DEBUG] wire - -<< "<html> 
" [\r\n]
2003/03/20 00:24:58:772 EST [WARN] HttpMethod - -Recoverable exception caught when reading response
2003/03/20 00:24:58:773 EST [DEBUG] HttpMethod - -Closing the connection.

The recoverable exception was an invalid header line "<html>".

I can provide example URLs, if this will help.
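
For illustration only, here is a minimal Java sketch of a line reader that 
accepts either CRLF or a bare LF as the terminator. It is not the attached 
patch and not the actual HttpParser code; the class name LenientLineReader 
and its readLine() method are hypothetical, and US-ASCII decoding is an 
assumption made to keep the example short.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of HttpClient.
public class LenientLineReader {

    /**
     * Reads one line from the stream, treating LF as the terminator and
     * dropping a CR that immediately precedes it, so both "\r\n" and "\n"
     * end a line. Returns the line without its terminator, or null if the
     * stream was already at its end.
     */
    public static String readLine(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        boolean sawLf = false;
        int count = 0;
        int ch;
        while ((ch = in.read()) != -1) {
            count++;
            if (ch == '\n') {
                sawLf = true;
                break;
            }
            buf.write(ch);
        }
        if (count == 0) {
            return null; // end of stream, nothing read
        }
        byte[] raw = buf.toByteArray();
        int len = raw.length;
        // Strip the CR only when it is part of a CRLF terminator.
        if (sawLf && len > 0 && raw[len - 1] == '\r') {
            len--;
        }
        return new String(raw, 0, len, StandardCharsets.US_ASCII);
    }
}
```

With a reader like this, the header block from the log above would be split 
into separate lines, and the empty line before "<html>" would be seen as the 
end of the headers rather than "<html>" being parsed as a header line.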

The second thing has to do with how Keep-alive connections behave. This is a 
multi-threaded app, using MultiThreadedHttpConnectionManager. It works great, 
however I don't get much benefit of the shared Connections, because I'm not 
connecting to the same site more than once, generally. That's OK, the problem 
I run into is that after running for not very long, I suddenly start getting 
everything timing out. It's hard to really pinpoint the timing, given all the 
activity, and no thread identifiers in the log messages, but I think what is 
happening is that the system is simply running out of file handles or 
system-level connections. A quick "netstat -n" shows a whole bunch of open, 
TIME_WAIT, and other connections. It seems that the Connection Manager is 
keeping them around for re-use, and following HTTP/1.1. One fix was to send 
"Connection: close" as a RequestHeader, which really fixed things up, but now 
I am running into sites that are not responding, and not timing out. The log 
traces into readRawLine() and just sits there. I am still tracking this down, 
but I wonder whether anyone else has seen this as well.
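
As a rough sketch of the two ideas in the paragraph above, the following Java 
code sends "Connection: close" with a request and sets a socket read timeout 
so that a silent server cannot block a read forever. It uses plain 
java.net.Socket rather than the HttpClient API, and the class name 
CloseAndTimeoutSketch, its methods, and the TIMED_OUT marker are all 
hypothetical names chosen for this example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.SocketTimeoutException;
import java.nio.charset.StandardCharsets;

// Hypothetical example class, not part of HttpClient.
public class CloseAndTimeoutSketch {

    /** Marker returned when no response byte arrived within the timeout. */
    public static final int TIMED_OUT = -2;

    /** Builds a minimal HTTP/1.1 GET request that asks the server to close the connection. */
    public static String buildRequest(String host, String path) {
        return "GET " + path + " HTTP/1.1\r\n"
             + "Host: " + host + "\r\n"
             + "Connection: close\r\n"
             + "\r\n";
    }

    /**
     * Sends the request and returns the first response byte, -1 if the
     * server closed the stream without sending anything, or TIMED_OUT if
     * nothing arrived within timeoutMillis.
     */
    public static int firstByteOrTimeout(String host, int port, String path,
                                         int timeoutMillis) throws IOException {
        try (Socket socket = new Socket()) {
            // Bound both the connect and every subsequent blocking read.
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            socket.setSoTimeout(timeoutMillis);

            OutputStream out = socket.getOutputStream();
            out.write(buildRequest(host, path).getBytes(StandardCharsets.US_ASCII));
            out.flush();

            InputStream in = socket.getInputStream();
            try {
                return in.read();
            } catch (SocketTimeoutException e) {
                return TIMED_OUT; // the server accepted the connection but never answered
            }
        } // the socket is closed here, releasing the file handle
    }
}
```

The try-with-resources block closes the socket on every path, which is the 
part that matters when visiting thousands of sites in one session.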

Well, sorry to be so long-winded, and thanks!

Carl
