Posted to dev@nutch.apache.org by Stefan Groschupf <sg...@media-style.com> on 2006/05/08 04:36:59 UTC

http chunked content

Hi,

It looks like the http protocol plugin does not handle chunked content. :(
The method readChunkedContent is never used, and readPlainContent does
not handle chunked content.
As far as I know, many http servers respond with chunked content, at
least all those that return dynamically generated pages.
Should I file a bug?
Any thoughts?
Stefan 

Re: http chunked content

Posted by Stefan Groschupf <sg...@media-style.com>.
I'm almost sure that this is not related to http 1.0 requests.

On 08.05.2006 at 03:20, Jérôme Charron wrote:

>> As far as I know, many http servers respond with chunked content, at
>> least all those that return dynamically generated pages.
>> Should I file a bug?
>> Any thoughts?
>
> In fact, the requests issued by the http plugin use HTTP 1.0, so the
> servers should never return chunked content.
> I think that readChunkedContent was included in the code for future
> use.
>
> Regards
>
> Jérôme
>
> --
> http://motrech.free.fr/
> http://www.frutch.org/


Re: http chunked content

Posted by Chris Fellows <cc...@sbcglobal.net>.
Okay, saw the code in the protocol-http plugin. I
remember looking at this about a year ago. RFC 2616
(HTTP/1.1) does say, as Jérôme pointed out:

"A server MUST NOT send transfer-codings to an
HTTP/1.0 client."

Even so, I can attest that there are servers out
there that return chunked content regardless of the
client.

We had a socket implementation akin to the
HttpResponse.java in the protocol-http plugin and were
stumped on how to identify whether the response was
chunked or not, as we could not reliably use the
Transfer-Encoding header. The only way we could
see was trying to use the initial hex characters
denoting the size of the first chunk.
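
A minimal Java sketch of that kind of sniffing heuristic follows. It is
only an illustration under stated assumptions: the names ChunkSniffer and
looksChunked are hypothetical and come neither from Nutch nor from the
implementation described here, and the limit of 8 hex digits is an
arbitrary choice.

```java
/** Hypothetical helper (not part of Nutch): guesses whether a body is chunked. */
final class ChunkSniffer {

    /**
     * Returns true if the first bytes of a response body look like a
     * chunk-size line: 1 to 8 hex digits, an optional ";extension",
     * then CRLF. This is a guess, not a proof.
     */
    static boolean looksChunked(byte[] bodyStart) {
        int i = 0;
        // Count leading ASCII hex digits (non-ASCII bytes are negative and yield -1).
        while (i < bodyStart.length && Character.digit(bodyStart[i], 16) >= 0) {
            i++;
        }
        if (i == 0 || i > 8) {
            return false; // no size field, or implausibly long (assumed limit)
        }
        // Skip an optional chunk extension such as ";name=value".
        if (i < bodyStart.length && bodyStart[i] == ';') {
            while (i < bodyStart.length && bodyStart[i] != '\r') {
                i++;
            }
        }
        // The size line must end with CRLF.
        return i + 1 < bodyStart.length
                && bodyStart[i] == '\r'
                && bodyStart[i + 1] == '\n';
    }
}
```

Note that a plain, un-chunked body whose first line happens to consist
only of hex digits (for example the word "cafe" on a line by itself) is
also reported as chunked, so a check like this can only ever be a hint.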

"The chunk-size field is a string of hex digits
indicating the size of the chunk. The chunked encoding
is ended by any chunk whose size is zero, followed by
the trailer, which is terminated by an empty line." -
more from RFC 2616

But in practice this was error-prone. Switching over
to Apache HttpClient eliminated this problem, as it
transparently handles chunked and un-chunked content.
But HttpClient is much more heavyweight, and so the
conversion could only be done after implementing some
basic resource pooling on the primary HttpClient
object.

It does look like this would be a serious refactoring
job, as Nutch uses java.net classes throughout. On the
other hand, it might simplify some areas of the Nutch
protocol classes, and HttpClient does have some
interesting built-in support for multi-threading and
performance tuning of requests.
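
For reference, the chunk format quoted from RFC 2616 above can be decoded
with a fairly small loop. The following Java sketch is an assumption-laden
illustration, not the readChunkedContent method from protocol-http: the
class ChunkedBodyDecoder and its methods are hypothetical names, chunk
extensions and trailer lines are simply discarded, and there is no limit
on chunk size, which real crawler code would need.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper (not part of Nutch): decodes an RFC 2616 chunked body. */
final class ChunkedBodyDecoder {

    /** Reads one line terminated by LF and returns it without CR/LF. */
    private static String readLine(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = in.read()) != -1) {
            if (c == '\n') {
                return sb.toString();
            }
            if (c != '\r') {
                sb.append((char) c);
            }
        }
        throw new EOFException("stream ended inside a line");
    }

    /** Decodes a chunked body positioned at the first chunk-size line. */
    static byte[] decode(InputStream in) throws IOException {
        ByteArrayOutputStream body = new ByteArrayOutputStream();
        while (true) {
            // chunk-size is a string of hex digits; drop any ";extension".
            String sizeLine = readLine(in);
            int semi = sizeLine.indexOf(';');
            String hex = (semi >= 0 ? sizeLine.substring(0, semi) : sizeLine).trim();
            int size;
            try {
                size = Integer.parseInt(hex, 16);
            } catch (NumberFormatException e) {
                throw new IOException("bad chunk size: '" + hex + "'");
            }
            if (size < 0) {
                throw new IOException("negative chunk size: '" + hex + "'");
            }
            if (size == 0) {
                break; // a zero-sized chunk ends the chunked encoding
            }
            byte[] buf = new byte[size];
            int off = 0;
            while (off < size) {
                int n = in.read(buf, off, size - off);
                if (n < 0) {
                    throw new EOFException("stream ended inside a chunk");
                }
                off += n;
            }
            body.write(buf, 0, size);
            // Each chunk's data is followed by CRLF.
            if (!readLine(in).isEmpty()) {
                throw new IOException("missing CRLF after chunk data");
            }
        }
        // The trailer (possibly empty) is terminated by an empty line.
        while (!readLine(in).isEmpty()) {
            // trailer headers are ignored in this sketch
        }
        return body.toByteArray();
    }

    /** Convenience wrapper for in-memory input, treating bytes as ISO-8859-1. */
    static String decodeString(String raw) {
        try {
            byte[] in = raw.getBytes(StandardCharsets.ISO_8859_1);
            byte[] out = decode(new ByteArrayInputStream(in));
            return new String(out, StandardCharsets.ISO_8859_1);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this sketch, ChunkedBodyDecoder.decodeString("4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n")
yields "Wikipedia". The hard part discussed in this thread, deciding
whether to apply such a decoder at all when the headers cannot be
trusted, is not addressed by it.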

I hope this helps towards a solution.

Best Regards,

Chris

--- Andrzej Bialecki <ab...@getopt.org> wrote:

> Chris Fellows wrote:
> > Just remembered, got around it by using HTTPClient
> > which handles reading the response (chunked or not)
> > transparently. Haven't looked at the nutch code, but
> > if we were to use HTTPClient 3.0.x or later, should
> > take care of it.
>
> Take a look at protocol-httpclient. This discussion is on whether/how to
> fix protocol-http. The other plugin already supports this.


Re: http chunked content

Posted by Andrzej Bialecki <ab...@getopt.org>.
Chris Fellows wrote:
> Just remembered, got around it by using HTTPClient
> which handles reading the response (chunked or not)
> transparently. Haven't looked at the nutch code, but
> if we were to use HTTPClient 3.0.x or later, should
> take care of it.
>
>   

Take a look at protocol-httpclient. This discussion is on whether/how to 
fix protocol-http. The other plugin already supports this.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: http chunked content

Posted by Chris Fellows <cc...@sbcglobal.net>.
Just remembered, got around it by using HTTPClient
which handles reading the response (chunked or not)
transparently. Haven't looked at the nutch code, but
if we were to use HTTPClient 3.0.x or later, should
take care of it.



Re: http chunked content

Posted by Chris Fellows <cc...@sbcglobal.net>.
> Furthermore, the HTTP/1.1 specification states that
> "A server MUST NOT send transfer-codings to an
> HTTP/1.0 client".

I once did a socket implementation against
Anonymizer. This is a well-established proxy service
that services $100K+ government and private contracts.

Their server always sent chunked content, no matter
which request headers we sent. I'm pretty sure that
there are other well-established servers that send
chunked content despite the RFC.

I'm guessing that it might have something to do with
wanting to control content compression. All the
browsers can handle it, and that's probably all Apple
is concerned with, even though they're overriding an
RFC requirement.

Chris

--- Jérôme Charron <je...@gmail.com> wrote:

> > http://www.apple.com, for example, answers with chunked content even if
> > you send an HTTP 1.0 request.
>
>
> Stefan,
>
> I don't see any "Transfer-Encoding: chunked" header in responses from
> www.apple.com.
> Furthermore, the HTTP/1.1 specification states that "A server MUST NOT
> send transfer-codings to an HTTP/1.0 client".
> 
> Jérôme
> 
> --
> http://motrech.free.fr/
> http://www.frutch.org/
> 


Re: http chunked content

Posted by Jérôme Charron <je...@gmail.com>.
> http://www.apple.com, for example, answers with chunked content even if
> you send an HTTP 1.0 request.


Stefan,

I don't see any "Transfer-Encoding: chunked" header in responses from
www.apple.com.
Furthermore, the HTTP/1.1 specification states that "A server MUST NOT
send transfer-codings to an HTTP/1.0 client".
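
A check for that header on a raw response could look roughly like the
Java sketch below. It is an illustration only: TransferEncodingCheck and
isChunked are hypothetical names, not code from the http plugin, and
folded (continued) header lines are not handled.

```java
/** Hypothetical helper (not part of Nutch): inspects raw response headers. */
final class TransferEncodingCheck {

    /**
     * Returns true if the header section of a raw HTTP response declares
     * "chunked" among its Transfer-Encoding values. Header names and
     * coding names are compared case-insensitively.
     */
    static boolean isChunked(String rawResponse) {
        for (String line : rawResponse.split("\r?\n")) {
            if (line.isEmpty()) {
                break; // an empty line ends the header section
            }
            int colon = line.indexOf(':');
            if (colon <= 0) {
                continue; // status line or malformed header
            }
            String name = line.substring(0, colon).trim();
            if (!name.equalsIgnoreCase("Transfer-Encoding")) {
                continue;
            }
            // The value may list several codings, e.g. "gzip, chunked".
            for (String coding : line.substring(colon + 1).split(",")) {
                if (coding.trim().equalsIgnoreCase("chunked")) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

Such a check only reports what the server declares. It cannot detect a
server that sends a chunked body without the corresponding header.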

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/

Re: http chunked content

Posted by Stefan Groschupf <sg...@media-style.com>.
http://www.apple.com, for example, answers with chunked content even if
you send an HTTP 1.0 request.

On 08.05.2006 at 03:20, Jérôme Charron wrote:

>> As far as I know, many http servers respond with chunked content, at
>> least all those that return dynamically generated pages.
>> Should I file a bug?
>> Any thoughts?
>
> In fact, the requests issued by the http plugin use HTTP 1.0, so the
> servers should never return chunked content.
> I think that readChunkedContent was included in the code for future
> use.
>
> Regards
>
> Jérôme
>
> --
> http://motrech.free.fr/
> http://www.frutch.org/


Re: http chunked content

Posted by Jérôme Charron <je...@gmail.com>.
> As far as I know, many http servers respond with chunked content, at
> least all those that return dynamically generated pages.
> Should I file a bug?
> Any thoughts?

In fact, the requests issued by the http plugin use HTTP 1.0, so the
servers should never return chunked content.
I think that readChunkedContent was included in the code for future
use.

Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/