Posted to user@nutch.apache.org by sdeck <sc...@gmail.com> on 2007/01/31 05:09:27 UTC

httpresponse + xml = not reading all bytes

Hey all,
I have been playing around with the HttpResponse object for a few days now,
trying to figure this out. This is the http plugin, not the httpclient plugin.
It seems that certain RSS feeds don't get fully read. Here is an example
URL: http://blog.news-record.com/sportsextra/index.xml

It does not happen on all of my feeds, just some of them. Say the
Content-Length comes back as 5K: the response may read something like 3K,
then the read returns -1 (EOF) and processing just carries on. No timeout
exception, no exception at all.
I have tried many different things, such as adding sleeps to pause and then
trying to keep reading data. I have also tried switching to httpclient, and
it does the same thing. The weird thing is that when I put the URL into my
browser, it loads fine.
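
To make the symptom concrete, here is a minimal sketch of a read loop that compares the bytes received against the advertised Content-Length and fails loudly on an early EOF instead of silently continuing. The class and method names are hypothetical and are not taken from the Nutch http plugin.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not part of Nutch: reads exactly contentLength bytes.
final class LengthCheckedReadSketch {

    static byte[] readBody(InputStream in, int contentLength) throws IOException {
        byte[] body = new byte[contentLength];
        int off = 0;
        while (off < contentLength) {
            // read() may return fewer bytes than requested, so keep looping.
            int n = in.read(body, off, contentLength - off);
            if (n == -1) {
                // EOF before the advertised length: report it instead of
                // returning a truncated body.
                throw new EOFException(
                    "expected " + contentLength + " bytes, got " + off);
            }
            off += n;
        }
        return body;
    }
}
```

With a check like this, a 5K Content-Length followed by EOF at 3K would surface as an exception rather than as a quietly truncated feed.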

So, the question is: has anyone run into a socket not returning all of its
data without throwing an exception? Or can someone try the above URL and
see if they also run into the issue?
I have more example URLs. The only connection I can find is that they all
map to application/xhtml+xml.

Thoughts anyone?
Scott
-- 
View this message in context: http://www.nabble.com/httpresponse-%2B-xml-%3D-not-reading-all-bytes-tf3146593.html#a8722984
Sent from the Nutch - User mailing list archive at Nabble.com.


Re: httpresponse + xml = not reading all bytes

Posted by sdeck <sc...@gmail.com>.
I always like answering my own questions =)
So, the way I fixed this was to hack at the HttpResponse object in the http
protocol plugin.

Basically, I added:
  - Pragma no-cache headers
  - keep-alive and keep-alive connection time values
  - a last-modified-since header
All of that seemed to work well.
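
As a rough illustration only, a request carrying headers of that kind might be assembled as below. The exact header names and values (Keep-Alive: 300, the use of If-Modified-Since, HTTP/1.1 with a Host header) are assumptions for the sketch, and the class is hypothetical rather than the actual change made to the plugin.

```java
// Hypothetical sketch: builds a raw GET request with cache and keep-alive headers.
final class RequestHeaderSketch {

    // ifModifiedSince must already be a formatted HTTP date string,
    // e.g. "Wed, 31 Jan 2007 05:09:27 GMT"; pass null to omit the header.
    static String buildRequest(String host, String path, String ifModifiedSince) {
        StringBuilder req = new StringBuilder();
        req.append("GET ").append(path).append(" HTTP/1.1\r\n");
        req.append("Host: ").append(host).append("\r\n");
        req.append("Pragma: no-cache\r\n");
        req.append("Connection: keep-alive\r\n");
        req.append("Keep-Alive: 300\r\n"); // assumed value, in seconds
        if (ifModifiedSince != null) {
            req.append("If-Modified-Since: ").append(ifModifiedSince).append("\r\n");
        }
        req.append("\r\n"); // blank line ends the header section
        return req.toString();
    }
}
```

One thing to keep in mind with keep-alive: the server is not expected to close the connection after the response, so the reader has to rely on Content-Length or chunked framing to know where the body ends.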
Then I also found another issue: we were not checking for a transfer
encoding of "chunked". So, if that came in, I sent the stream to the
readChunkedEncoding method.
All of my feed readers seem to work now.
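
For anyone hitting the same thing, here is a minimal, self-contained sketch of what decoding a chunked body involves. It is not the plugin's own method; the class and method names are hypothetical, and it assumes the stream is positioned right after the response headers.

```java
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch of a chunked transfer-encoding decoder.
final class ChunkedBodySketch {

    static boolean isChunked(String transferEncodingHeader) {
        return transferEncodingHeader != null
            && transferEncodingHeader.trim().equalsIgnoreCase("chunked");
    }

    // Reads one line without its CR/LF; returns null at EOF with no data.
    static String readLine(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = in.read()) != -1) {
            if (c == '\n') return sb.toString();
            if (c != '\r') sb.append((char) c);
        }
        return sb.length() == 0 ? null : sb.toString();
    }

    static byte[] decode(InputStream in) throws IOException {
        ByteArrayOutputStream body = new ByteArrayOutputStream();
        while (true) {
            String sizeLine = readLine(in);
            if (sizeLine == null) throw new EOFException("missing chunk size");
            int semi = sizeLine.indexOf(';'); // drop any chunk extensions
            if (semi >= 0) sizeLine = sizeLine.substring(0, semi);
            int size = Integer.parseInt(sizeLine.trim(), 16); // size is hex
            if (size == 0) break; // zero-length chunk marks the end
            byte[] buf = new byte[size];
            int off = 0;
            while (off < size) {
                int n = in.read(buf, off, size - off);
                if (n == -1) throw new EOFException("truncated chunk");
                off += n;
            }
            body.write(buf, 0, size);
            readLine(in); // consume the CRLF that follows the chunk data
        }
        // Skip optional trailer headers up to the terminating blank line.
        String trailer;
        while ((trailer = readLine(in)) != null && !trailer.isEmpty()) {
            // trailers are ignored in this sketch
        }
        return body.toByteArray();
    }
}
```

A chunked body carries no usable Content-Length, so a reader that only counts bytes against that header can end up with a short or garbled body, which would match the behaviour described in the original post.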

Now I just have issues with the Fetcher (and Fetcher2) blocking on socket
reads. 1-5 threads seem to work fine, but I start getting thread waits once
I pass the 10-thread mark. Very strange.



