Posted to dev@nutch.apache.org by Doug Cutting <cu...@nutch.org> on 2005/11/09 19:19:18 UTC

protocol-http versus protocol-httpclient

I was recently benchmarking fetching at a site with lots of bandwidth, 
and it seemed to me that protocol-http is capable of faster crawling 
than protocol-httpclient.  So I don't think we should discard 
protocol-http just yet.  But there's a lot of duplicate code between 
these, which is difficult to maintain.

I think we should thus merge these, with a configuration parameter 
determining which http backend is used, much like parse-html, which can 
switch between neko and tagsoup.
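
A minimal sketch of what such a switch might look like, in Java. Every name here is hypothetical: the `http.backend` property, the `HttpBackend` interface, and the two backend classes are stand-ins for illustration, not existing Nutch code or settings.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical common interface that both fetching backends would implement. */
interface HttpBackend {
    String name();
}

/** Stand-in for the raw-socket implementation (protocol-http). */
final class SocketBackend implements HttpBackend {
    public String name() { return "socket"; }
}

/** Stand-in for the commons-httpclient implementation (protocol-httpclient). */
final class HttpClientBackend implements HttpBackend {
    public String name() { return "httpclient"; }
}

final class HttpBackendFactory {
    // Hypothetical property name and default; not actual Nutch settings.
    static final String BACKEND_KEY = "http.backend";
    static final String DEFAULT_BACKEND = "socket";

    private static final Map<String, Supplier<HttpBackend>> BACKENDS = new HashMap<>();
    static {
        BACKENDS.put("socket", SocketBackend::new);
        BACKENDS.put("httpclient", HttpClientBackend::new);
    }

    /** Picks the backend named in the configuration, falling back to the default. */
    static HttpBackend create(Map<String, String> conf) {
        String choice = conf.getOrDefault(BACKEND_KEY, DEFAULT_BACKEND);
        Supplier<HttpBackend> supplier = BACKENDS.get(choice);
        if (supplier == null) {
            throw new IllegalArgumentException("Unknown " + BACKEND_KEY + ": " + choice);
        }
        return supplier.get();
    }
}
```

With this shape, the shared logic would live in one plugin and only the code behind `HttpBackend` would differ per backend.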

What do others think?

Doug

Re: protocol-http versus protocol-httpclient

Posted by Matt Kangas <ka...@gmail.com>.

+1

I've been planning to switch my crawler over to use
protocol-httpclient, but haven't got there yet. Interesting that there seems
to be a performance impact with the new plugin.

(In my crawl setup, I override the default HTTP plugin so I can  
modify HTML content before it is written to a segment. I'd prefer if  
there was a hook for rewriting content regardless of protocol, but  
this works for now.)

--Matt

On Nov 9, 2005, at 1:19 PM, Doug Cutting wrote:

> I was recently benchmarking fetching at a site with lots of  
> bandwidth, and it seemed to me that protocol-http is capable of  
> faster crawling than protocol-httpclient.  So I don't think we  
> should discard protocol-http just yet.  But there's a lot of  
> duplicate code between these, which is difficult to maintain.
>
> I think we should thus merge these, with a configuration parameter  
> determining which http backend is used, much like parse-html, which  
> can switch between neko and tagsoup.
>
> What do others think?
>
> Doug

--
Matt Kangas / kangas@gmail.com

Re: protocol-http versus protocol-httpclient

Posted by Andrzej Bialecki <ab...@getopt.org>.

Ken Krugler wrote:

> 1. We needed to modify the commons-httpclient code to fix one hang 
> that sometimes occurs in

[...]

> So the question here is what to do with these changes. I will try to 
> get them integrated into the commons-httpclient code, but that might 
> take a while before they circle back into Nutch. Suggestions for what 
> to do in the short term?
>

Please submit them to the commons-httpclient people - I found them very 
responsive to my bug reports. Even before they accept the patches we 
could use a "fixed" version of the library - see e.g. parse-rss where a 
similar situation occurred.

> 2. Our other changes are a mixture of dealing more effectively with 
> bad hosts so fetcher threads don't get hung up, and changes to do a 
> better job of crawling a limited domain space (vertical crawl).
>
> The first set of changes seem like something that could get merged in 
> (if deemed useful) without too much effort. The second set are more 
> architectural in nature - and I'm a bit worried about what happens 
> when we try to integrate these into 0.8. Plus we're still in the 
> middle of getting the wrinkles ironed out, so it would be premature to 
> submit any patches.
>

The Fetcher in 0.8 (or rather in mapred branch) is somewhat different 
from 0.7.

> But are we going to be running into trouble by waiting? Would it make 
> sense to send out patches of what we've done to date, even if the code 
> isn't ready for prime time?


IMHO you should definitely submit the patches to commons-httpclient. 
Regarding our code - please create a bug issue and attach the patches. 
This gives a chance for others to work on them.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

Re: protocol-http versus protocol-httpclient

Posted by Ken Krugler <kk...@transpac.com>.

>I was recently benchmarking fetching at a site with lots of 
>bandwidth, and it seemed to me that protocol-http is capable of 
>faster crawling than protocol-httpclient. So I don't think we should 
>discard protocol-http just yet.  But there's a lot of duplicate code 
>between these, which is difficult to maintain.
>
>I think we should thus merge these, with a configuration parameter 
>determining which http backend is used, much like parse-html, which 
>can switch between neko and tagsoup.
>
>What do others think?

Merging would be great - at least then there's only one plug-in to 
focus debugging energies on.

BTW, we've been tweaking code in this area to fix some issues we've 
run into. Some of the changes are minor, others are more significant. 
Some questions:

1. We needed to modify the commons-httpclient code to fix one hang 
that sometimes occurs in ChunkedInputStream.exhaustInputStream(). We 
found sites that were trickling lots of data back to us (e.g. 
60Kbits/sec), so we'd wind up waiting a really long time (up to two 
hours) for a fetcher thread to terminate.

What we did was have this routine throw an HttpException (cause is 
InterruptedException) whenever it notices that its thread has been 
interrupted. Then we monitor performance in fetcher.Fetcher.run() and 
interrupt any thread that has been working on a URL past a 
configurable time limit.
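
The pattern can be sketched roughly as follows, using only the Java standard library. This is an illustration under assumptions, not the actual patch: `FetchWatchdog` and its methods are hypothetical names, and the drain loop throws `InterruptedIOException` where the change described above throws an HttpException.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InterruptedIOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical watchdog: tracks when each fetcher thread started its current URL. */
final class FetchWatchdog {
    private final Map<Thread, Long> startedAt = new ConcurrentHashMap<>();
    private final long limitMillis;

    FetchWatchdog(long limitMillis) {
        this.limitMillis = limitMillis;
    }

    /** Called by a fetcher thread when it starts working on a URL. */
    void begin(Thread t, long nowMillis) {
        startedAt.put(t, nowMillis);
    }

    /** Called by a fetcher thread when it finishes a URL. */
    void end(Thread t) {
        startedAt.remove(t);
    }

    /** Interrupts every tracked thread that is over the limit; returns how many. */
    int sweep(long nowMillis) {
        int interrupted = 0;
        for (Map.Entry<Thread, Long> e : startedAt.entrySet()) {
            if (nowMillis - e.getValue() > limitMillis) {
                e.getKey().interrupt();
                interrupted++;
            }
        }
        return interrupted;
    }

    /**
     * Reads the stream to its end, but checks the interrupt flag before every
     * read and gives up as soon as the flag is set.
     */
    static long drain(InputStream in) throws IOException {
        byte[] buf = new byte[1024];
        long total = 0;
        while (true) {
            if (Thread.currentThread().isInterrupted()) {
                throw new InterruptedIOException("interrupted after " + total + " bytes");
            }
            int n = in.read(buf);
            if (n < 0) {
                return total;
            }
            total += n;
        }
    }
}
```

The flag is only checked between reads, so this helps with a server that trickles data (each read returns eventually) but would not by itself unblock a read that never returns.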

We also modified HttpMethodDirector.HttpMethodDirector(). It now sets 
the connection manager time-out (http.connection-manager.timeout HTTP 
parameter) to 10 minutes, rather than letting this default to 0. This 
prevents the connection manager from looping forever when it doesn't 
have a free connection to satisfy the client. It's not obvious how we 
get into this state of no free connections, but it has happened, and 
at least we don't hang now.
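
The general idea of bounding the wait for a free connection can be shown with a small standard-library sketch. It is not commons-httpclient code: `BoundedConnectionPool` is a hypothetical class that only models the "wait at most N milliseconds for a slot" behaviour, and it does not reproduce the unbounded wait that the default of 0 leads to.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

/** Hypothetical pool that hands out a fixed number of connection slots. */
final class BoundedConnectionPool {
    private final Semaphore slots;
    private final long timeoutMillis;

    BoundedConnectionPool(int maxConnections, long timeoutMillis) {
        this.slots = new Semaphore(maxConnections);
        this.timeoutMillis = timeoutMillis;
    }

    /**
     * Waits up to the configured timeout for a free slot.
     * Returns false instead of waiting forever when none becomes available.
     */
    boolean acquire() {
        try {
            return slots.tryAcquire(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            // Preserve the interrupt for the caller and report failure.
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Returns a slot to the pool. */
    void release() {
        slots.release();
    }
}
```

A caller that gets `false` can log the URL and move on, so a leaked connection costs one timeout rather than a hung fetcher thread.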

Plus some minor changes to tone down the level of logging for some 
messages, so our logs (when running at INFO) show only important 
status and real warnings/errors.

So the question here is what to do with these changes. I will try to 
get them integrated into the commons-httpclient code, but that might 
take a while before they circle back into Nutch. Suggestions for what 
to do in the short term?

2. Our other changes are a mixture of dealing more effectively with 
bad hosts so fetcher threads don't get hung up, and changes to do a 
better job of crawling a limited domain space (vertical crawl).

The first set of changes seem like something that could get merged in 
(if deemed useful) without too much effort. The second set are more 
architectural in nature - and I'm a bit worried about what happens 
when we try to integrate these into 0.8. Plus we're still in the 
middle of getting the wrinkles ironed out, so it would be premature 
to submit any patches.

But are we going to be running into trouble by waiting? Would it make 
sense to send out patches of what we've done to date, even if the 
code isn't ready for prime time?

Thanks for any advice,

-- Ken
-- 
Ken Krugler
Krugle, Inc.
+1 530-470-9200

Re: protocol-http versus protocol-httpclient

Posted by Andrzej Bialecki <ab...@getopt.org>.

Doug Cutting wrote:

> I was recently benchmarking fetching at a site with lots of bandwidth, 
> and it seemed to me that protocol-http is capable of faster crawling 
> than protocol-httpclient.  So I don't think we should discard 
> protocol-http just yet.  But there's a lot of duplicate code between 
> these, which is difficult to maintain.
>

Where do you think the performance loss is in protocol-httpclient?

> I think we should thus merge these, with a configuration parameter 
> determining which http backend is used, much like parse-html, which 
> can switch between neko and tagsoup.
>
> What do others think?

I think it's a good idea. Things like authentication, robots, redirects, 
SSL setup and HTTP result code handling logic are nearly the same.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

RE: protocol-http versus protocol-httpclient

Posted by Fuad Efendi <fu...@efendi.ca>.

Doug Cutting wrote:
>... protocol-http is capable of faster crawling than protocol-httpclient.
> So I don't think we should discard protocol-http just yet. 

>What do others think?

I think:

The HttpClient-based [protocol-httpclient] uses its own threads;
[protocol-http] does not create threads.

We should manage this. [protocol-httpclient] is just a temporary solution for
cookies, proxies, HTTPS, etc.; it still caches DNS-to-IP
mappings forever, and thread-related issues are very important...

Additionally, we should have a setting such as:
"Wait 5 seconds between requests to SLOW servers"

- meaning that Nutch could dynamically classify servers as fast or slow and
crawl them faster or slower accordingly.
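
One way this could be sketched in Java: keep a smoothed response time per host and pick the delay from it. The class name, the threshold, and the smoothing factor are all assumptions made for illustration; nothing here is an existing Nutch setting.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical per-host politeness delay that adapts to observed response times. */
final class AdaptiveHostDelay {
    // Weight given to the newest sample in the moving average (assumed value).
    private static final double SMOOTHING = 0.5;

    private final long fastDelayMillis;
    private final long slowDelayMillis;
    private final long slowThresholdMillis;
    private final Map<String, Double> avgResponseMillis = new ConcurrentHashMap<>();

    AdaptiveHostDelay(long fastDelayMillis, long slowDelayMillis, long slowThresholdMillis) {
        this.fastDelayMillis = fastDelayMillis;
        this.slowDelayMillis = slowDelayMillis;
        this.slowThresholdMillis = slowThresholdMillis;
    }

    /** Folds a new response time into the host's exponential moving average. */
    void record(String host, long responseMillis) {
        avgResponseMillis.merge(host, (double) responseMillis,
                (old, cur) -> SMOOTHING * cur + (1 - SMOOTHING) * old);
    }

    /** Hosts with no samples, or an average at or below the threshold, count as fast. */
    long delayFor(String host) {
        Double avg = avgResponseMillis.get(host);
        if (avg == null || avg <= slowThresholdMillis) {
            return fastDelayMillis;
        }
        return slowDelayMillis;
    }
}
```

For example, `new AdaptiveHostDelay(1000, 5000, 2000)` would wait 5 seconds between requests to any host whose smoothed response time is above 2 seconds, and 1 second otherwise.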

Fuad