Posted to dev@nutch.apache.org by Doug Cutting <cu...@nutch.org> on 2006/01/05 21:34:32 UTC

Re: problems http-client

Stefan Groschupf wrote:
> However, if it is known to be buggy, maybe we should not keep it as the
> default http protocol plugin, as it is today.

+1

I have found protocol-http to be more reliable for large crawls than 
protocol-httpclient and would be in favor of switching the default back 
to protocol-http.  When folks need advanced features then they can 
switch to protocol-httpclient.  Thoughts?

A related issue is that these two plugins replicate a lot of code.  At 
some point we should try to fix that.  See:

http://www.nabble.com/protocol-http-versus-protocol-httpclient-t521282.html

Doug

Re: problems http-client

Posted by Andrzej Bialecki <ab...@getopt.org>.
Jérôme Charron wrote:

>>> A related issue is that these two plugins replicate a lot of code.  At
>>> some point we should try to fix that.  See:
>>>
>>> http://www.nabble.com/protocol-http-versus-protocol-httpclient-t521282.html
>
> I have begun working on this. Nobody else? Can I go on?

Please do go on!

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: problems http-client

Posted by Jérôme Charron <je...@gmail.com>.
> > A related issue is that these two plugins replicate a lot of code.  At
> > some point we should try to fix that.  See:
> >
> > http://www.nabble.com/protocol-http-versus-protocol-httpclient-t521282.html

I have begun working on this. Nobody else? Can I go on?

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/

Re: problems http-client

Posted by Doug Cutting <cu...@nutch.org>.
Andrzej Bialecki wrote:
> Hmm... I'm not saying it's flawless, there were surely some mysterious 
> things going on with it. That large crawl you mention, was it with the 
> (recently updated in Nutch) release 3.0? What were the issues?

No, it was in early December, with the previous version.  I don't recall 
the details, but it seemed slower, had a higher error rate, and seemed 
to result in more hung thread incidents.

> The main advantage of protocol-http is that it's so simple that few 
> things can go wrong, but this also means it's relatively 
> unsophisticated, and adding more advanced features, namely support for 
> https, cookies and authentication, could mean a lot of work.

These are all good reasons to use protocol-httpclient.  But if you don't 
need any of those features, protocol-http seems to presently work better.

Perhaps we should get more feedback on the 3.0 version before we make a 
decision?

Doug

Re: problems http-client

Posted by Andrzej Bialecki <ab...@getopt.org>.
Doug Cutting wrote:

> Stefan Groschupf wrote:
>
>> However, if it is known to be buggy, maybe we should not keep it as the
>> default http protocol plugin, as it is today.
>
>
> +1
>
> I have found protocol-http to be more reliable for large crawls than 
> protocol-httpclient and would be in favor of switching the default 
> back to protocol-http.  When folks need advanced features then they 
> can switch to protocol-httpclient.  Thoughts?
>

Hmm... I'm not saying it's flawless, there were surely some mysterious 
things going on with it. That large crawl you mention, was it with the 
(recently updated in Nutch) release 3.0? What were the issues?

The main advantage of protocol-http is that it's so simple that few 
things can go wrong, but this also means it's relatively 
unsophisticated, and adding more advanced features, namely support for 
https, cookies and authentication, could mean a lot of work.

> A related issue is that these two plugins replicate a lot of code.  At 
> some point we should try to fix that.  See:
>
> http://www.nabble.com/protocol-http-versus-protocol-httpclient-t521282.html 
>


Yes.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com