Posted to dev@nutch.apache.org by Doug Cutting <cu...@nutch.org> on 2005/11/09 19:19:18 UTC
protocol-http versus protocol-httpclient
I was recently benchmarking fetching at a site with lots of bandwidth,
and it seemed to me that protocol-http is capable of faster crawling
than protocol-httpclient. So I don't think we should discard
protocol-http just yet. But there's a lot of duplicate code between
these, which is difficult to maintain.
I think we should thus merge these, with a configuration parameter
determining which http backend is used, much like parse-html, which can
switch between neko and tagsoup.
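A minimal sketch of what such a switch could look like, in plain Java. The property name "http.backend", the HttpBackend interface and both backend classes are hypothetical stand-ins chosen for illustration; they are not existing Nutch classes or settings.

```java
import java.util.Properties;

// Hypothetical common interface for the code the two plugins would share.
interface HttpBackend {
    String name();
}

// Stand-in for the backend currently in protocol-http.
final class SocketBackend implements HttpBackend {
    public String name() { return "socket"; }
}

// Stand-in for the backend currently in protocol-httpclient.
final class HttpClientBackend implements HttpBackend {
    public String name() { return "httpclient"; }
}

final class HttpBackendFactory {
    // Assumed property name, not an actual Nutch configuration key.
    static final String BACKEND_KEY = "http.backend";

    // Picks the backend named in the configuration; defaults to "socket".
    static HttpBackend create(Properties conf) {
        String choice = conf.getProperty(BACKEND_KEY, "socket");
        switch (choice) {
            case "socket":
                return new SocketBackend();
            case "httpclient":
                return new HttpClientBackend();
            default:
                throw new IllegalArgumentException("Unknown HTTP backend: " + choice);
        }
    }
}
```

With this shape, everything outside the factory would depend only on the shared interface, so the duplicated logic could live in one place.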
What do others think?
Doug
Re: protocol-http versus protocol-httpclient
Posted by Matt Kangas <ka...@gmail.com>.
+1
I've been planning to switch my crawler over to use
protocol-httpclient, but haven't got there yet. Interesting that there
seems to be a performance impact with the new plugin.
(In my crawl setup, I override the default HTTP plugin so I can
modify HTML content before it is written to a segment. I'd prefer if
there was a hook for rewriting content regardless of protocol, but
this works for now.)
--Matt
On Nov 9, 2005, at 1:19 PM, Doug Cutting wrote:
> I was recently benchmarking fetching at a site with lots of
> bandwidth, and it seemed to me that protocol-http is capable of
> faster crawling than protocol-httpclient. So I don't think we
> should discard protocol-http just yet. But there's a lot of
> duplicate code between these, which is difficult to maintain.
>
> I think we should thus merge these, with a configuration parameter
> determining which http backend is used, much like parse-html, which
> can switch between neko and tagsoup.
>
> What do others think?
>
> Doug
--
Matt Kangas / kangas@gmail.com
Re: protocol-http versus protocol-httpclient
Posted by Andrzej Bialecki <ab...@getopt.org>.
Ken Krugler wrote:
> 1. We needed to modify the commons-httpclient code to fix one hang
> that sometimes occurs in
[...]
> So the question here is what to do with these changes. I will try to
> get them integrated into the commons-httpclient code, but that might
> take a while before they circle back into Nutch. Suggestions for what
> to do in the short term?
>
Please submit them to the commons-httpclient people - I found them very
responsive to my bug reports. Even before they accept the patches we
could use a "fixed" version of the library - see e.g. parse-rss where a
similar situation occurred.
> 2. Our other changes are a mixture of dealing more effectively with
> bad hosts so fetcher threads don't get hung up, and changes to do a
> better job of crawling a limited domain space (vertical crawl).
>
> The first set of changes seem like something that could get merged in
> (if deemed useful) without too much effort. The second set are more
> architectural in nature - and I'm a bit worried about what happens
> when we try to integrate these into 0.8. Plus we're still in the
> middle of getting the wrinkles ironed out, so it would be premature to
> submit any patches.
>
The Fetcher in 0.8 (or rather in mapred branch) is somewhat different
from 0.7.
> But are we going to be running into trouble by waiting? Would it make
> sense to send out patches of what we've done to date, even if the code
> isn't ready for prime time?
IMHO you should definitely submit the patches to commons-httpclient.
Regarding our code - please create a bug issue and attach the patches.
This gives a chance for others to work on them.
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
Re: protocol-http versus protocol-httpclient
Posted by Ken Krugler <kk...@transpac.com>.
>I was recently benchmarking fetching at a site with lots of
>bandwidth, and it seemed to me that protocol-http is capable of
>faster crawling than protocol-httpclient. So I don't think we should
>discard protocol-http just yet. But there's a lot of duplicate code
>between these, which is difficult to maintain.
>
>I think we should thus merge these, with a configuration parameter
>determining which http backend is used, much like parse-html, which
>can switch between neko and tagsoup.
>
>What do others think?
Merging would be great - at least then there's only one plug-in to
focus debugging energies on.
BTW, we've been tweaking code in this area to fix some issues we've
run into. Some of the changes are minor, others are more significant.
Some questions:
1. We needed to modify the commons-httpclient code to fix one hang
that sometimes occurs in ChunkedInputStream.exhaustInputStream(). We
found sites that were trickling lots of data back to us (e.g.
60Kbits/sec), so we'd wind up waiting a really long time (up to two
hours) for a fetcher thread to terminate.
What we did was have this routine throw an HttpException (cause is
InterruptedException) whenever it notices that its thread has been
interrupted. Then we monitor performance in fetcher.Fetcher.run() and
interrupt any thread that has been working on a URL past a
configurable time limit.
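The interrupt-and-abort pattern described above can be sketched roughly as follows. This is not the actual patch: FetchWatchdog and SlowReader are hypothetical names, the read loop only simulates a trickling stream, and a plain IOException stands in for commons-httpclient's HttpException so the example stays self-contained.

```java
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical monitor: remembers when each thread started its current URL.
final class FetchWatchdog {
    private final Map<Thread, Long> startTimes = new ConcurrentHashMap<>();
    private final long limitMillis;

    FetchWatchdog(long limitMillis) { this.limitMillis = limitMillis; }

    // Called by a fetcher thread when it starts and finishes a URL.
    void begin() { startTimes.put(Thread.currentThread(), System.currentTimeMillis()); }
    void end() { startTimes.remove(Thread.currentThread()); }

    // Interrupts every thread that has been busy longer than the limit.
    // Returns the number of threads interrupted.
    int interruptOverdue(long nowMillis) {
        int interrupted = 0;
        for (Map.Entry<Thread, Long> e : startTimes.entrySet()) {
            if (nowMillis - e.getValue() > limitMillis) {
                e.getKey().interrupt();
                interrupted++;
            }
        }
        return interrupted;
    }
}

// Simulated slow read: checks the interrupt flag between chunks and gives up
// with an exception whose cause is an InterruptedException.
final class SlowReader {
    static int drain(int chunks, long millisPerChunk) throws IOException {
        for (int i = 0; i < chunks; i++) {
            if (Thread.currentThread().isInterrupted()) {
                throw new IOException("fetch aborted", new InterruptedException());
            }
            try {
                Thread.sleep(millisPerChunk);
            } catch (InterruptedException e) {
                throw new IOException("fetch aborted", e);
            }
        }
        return chunks;
    }
}
```

A monitoring loop would call interruptOverdue(System.currentTimeMillis()) periodically; the read loop cooperates by checking the flag, since an interrupt alone does not stop a thread.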
We also modified HttpMethodDirector.HttpMethodDirector(). It now sets
the connection manager time-out (http.connection-manager.timeout HTTP
parameter) to 10 minutes, rather than letting this default to 0. This
prevents the connection manager from looping forever when it doesn't
have a free connection to satisfy the client. It's not obvious how we
get into this state of no free connections, but it has happened, and
at least we don't hang now.
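To illustrate why a non-zero timeout avoids the hang, here is a small stand-alone sketch of a bounded pool. It does not use commons-httpclient at all: BoundedPool is a hypothetical class built on java.util.concurrent.Semaphore, and it only mimics the convention from the text that a timeout of 0 means "wait indefinitely".

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical stand-in for a connection manager with a fixed number of slots.
final class BoundedPool {
    private final Semaphore permits;

    BoundedPool(int size) { this.permits = new Semaphore(size); }

    // timeoutMillis == 0 blocks until a slot is free, which can be forever.
    // Any positive value gives up with a TimeoutException instead.
    void acquire(long timeoutMillis) throws InterruptedException, TimeoutException {
        if (timeoutMillis == 0) {
            permits.acquire();
            return;
        }
        if (!permits.tryAcquire(timeoutMillis, TimeUnit.MILLISECONDS)) {
            throw new TimeoutException("no free connection after " + timeoutMillis + " ms");
        }
    }

    void release() { permits.release(); }
}
```

With a bounded wait, a caller that never gets a connection sees an exception it can log and recover from, rather than blocking a fetcher thread indefinitely.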
Plus some minor changes to tone down the level of logging for some
messages, so our logs (when running at INFO) show only important
status and real warnings/errors.
So the question here is what to do with these changes. I will try to
get them integrated into the commons-httpclient code, but that might
take a while before they circle back into Nutch. Suggestions for what
to do in the short term?
2. Our other changes are a mixture of dealing more effectively with
bad hosts so fetcher threads don't get hung up, and changes to do a
better job of crawling a limited domain space (vertical crawl).
The first set of changes seem like something that could get merged in
(if deemed useful) without too much effort. The second set are more
architectural in nature - and I'm a bit worried about what happens
when we try to integrate these into 0.8. Plus we're still in the
middle of getting the wrinkles ironed out, so it would be premature
to submit any patches.
But are we going to be running into trouble by waiting? Would it make
sense to send out patches of what we've done to date, even if the
code isn't ready for prime time?
Thanks for any advice,
-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-470-9200
Re: protocol-http versus protocol-httpclient
Posted by Andrzej Bialecki <ab...@getopt.org>.
Doug Cutting wrote:
> I was recently benchmarking fetching at a site with lots of bandwidth,
> and it seemed to me that protocol-http is capable of faster crawling
> than protocol-httpclient. So I don't think we should discard
> protocol-http just yet. But there's a lot of duplicate code between
> these, which is difficult to maintain.
>
Where do you think the performance loss is in protocol-httpclient?
> I think we should thus merge these, with a configuration parameter
> determining which http backend is used, much like parse-html, which
> can switch between neko and tagsoup.
>
> What do others think?
I think it's a good idea. Things like authentication, robots, redirects,
SSL setup and HTTP result code handling logic are nearly the same.
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
RE: protocol-http versus protocol-httpclient
Posted by Fuad Efendi <fu...@efendi.ca>.
Doug Cutting wrote:
>... protocol-http is capable of faster crawling than protocol-httpclient.
> So I don't think we should discard protocol-http just yet.
>What do others think?
I think:
The HttpClient-based plugin [protocol-httpclient] uses its own threads;
[protocol-http] does not create threads.
We should manage this. [protocol-httpclient] is just a temporary solution for
cookies, proxies, HTTPS, etc.; it still caches DNS-to-IP
mappings forever. Thread-related issues are very important...
Additionally, we should have a setting such as:
"Wait 5 seconds between requests to SLOW servers"
- meaning that Nutch could dynamically classify servers as fast or slow and
crawl them faster or slower accordingly.
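One way such a setting might work, sketched in plain Java. AdaptiveDelay, the smoothing factor and the thresholds are all assumptions made for illustration; nothing here reflects existing Nutch code.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper: tracks a smoothed response time per host and picks
// a longer delay for hosts that currently look slow.
final class AdaptiveDelay {
    private final Map<String, Double> avgMillis = new ConcurrentHashMap<>();
    private final long slowThresholdMillis;
    private final long fastDelayMillis;
    private final long slowDelayMillis;

    AdaptiveDelay(long slowThresholdMillis, long fastDelayMillis, long slowDelayMillis) {
        this.slowThresholdMillis = slowThresholdMillis;
        this.fastDelayMillis = fastDelayMillis;
        this.slowDelayMillis = slowDelayMillis;
    }

    // Folds a new measurement into an exponential moving average
    // (80% old value, 20% new value; the weights are an arbitrary choice).
    void record(String host, long responseMillis) {
        avgMillis.merge(host, (double) responseMillis,
                (old, cur) -> 0.8 * old + 0.2 * cur);
    }

    // Hosts with no measurements yet are treated as fast.
    long delayFor(String host) {
        Double avg = avgMillis.get(host);
        return (avg != null && avg > slowThresholdMillis) ? slowDelayMillis : fastDelayMillis;
    }
}
```

For example, new AdaptiveDelay(2000, 1000, 5000) would wait 5 seconds between requests to any host whose smoothed response time exceeds 2 seconds, and 1 second otherwise.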
Fuad