
Posted to user@nutch.apache.org by webdev1977 <we...@gmail.com> on 2011/08/01 21:28:27 UTC

protocol-httpclient

I have just recently learned that protocol-httpclient is not recommended
because of problems with the underlying commons http library.

I am very disappointed to learn this, as about half of the domains I crawl
use https and require certs.  Does anyone know how much effort it would be
to port it to the Apache HTTP client?

Also, are there any JIRA issues open that might describe some of the
problems we are having with it?  I had it working perfectly fine in 1.2,
upgraded to 1.3 and now it is not working :-(

--
View this message in context: http://lucene.472066.n3.nabble.com/protocol-httpclient-tp3216821p3216821.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: protocol-httpclient

Posted by webdev1977 <we...@gmail.com>.

Are there any plans to fix the protocol-httpclient plugin? I do not have the
time or the expertise necessary to upgrade it myself.  I mean I COULD do it,
but it would take me eons :-)


Re: protocol-httpclient

Posted by webdev1977 <we...@gmail.com>.

Thanks for your reply!

I had not seen any weird exceptions before, when using it in v. 1.2.  With
this version I am able to fetch the first page from an https site, but then
it doesn't find any outlinks.  I tried the ParserChecker and got the same
results.

So the crawl stops after this first round.  I have tried changing my filters
to allow everything (just to make sure that wasn't the issue) and still get
nothing.

Another strange thing is that it seems to think that I have already fetched
the seed url: I get the "-shouldFetch rejected" message in the logs for it.
I am not sure how it is determining this, since I am using a new directory
for each test crawl.  I even deleted the temporary hadoop folders just to be
sure and I got the same result.
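
For reference, a minimal sketch of the kind of time-based check that could
produce a "-shouldFetch rejected" line: a url is skipped when its scheduled
fetch time is still in the future.  The class and method names are
hypothetical and this is not Nutch's actual code; it only assumes that the
decision compares a stored fetch time with the current time.

```java
// Hypothetical illustration only; not the Nutch implementation.
final class FetchTimeCheck {

    /**
     * Returns true when the url is due for fetching, i.e. its scheduled
     * fetch time is not later than the current time (both in milliseconds).
     */
    static boolean shouldFetch(long fetchTimeMillis, long curTimeMillis) {
        return fetchTimeMillis <= curTimeMillis;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        long oneDay = 24L * 60 * 60 * 1000;

        // Scheduled a day ago: due for fetching.
        System.out.println(shouldFetch(now - oneDay, now));

        // Scheduled a day from now: rejected.
        System.out.println(shouldFetch(now + oneDay, now));
    }
}
```

If that assumption holds, a rejected seed url would suggest that an entry
with a future fetch time already exists for it somewhere.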


Re: protocol-httpclient

Posted by Julien Nioche <li...@gmail.com>.

There haven't been any changes to it between 1.2 and 1.3, and it was already
broken then. It does not handle multithreading well, which leads to all sorts
of random exceptions. A good replacement would be to take the code in
Crawler-Commons that Ken contributed and wrap it as a protocol endpoint. I am
not entirely sure whether it can already handle certificates, but if not,
that would be a good thing to add to CC.

Sorry if you've already done so but would you mind explaining what doesn't
work for you anymore and what exceptions you are getting?


-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com