Posted to user@nutch.apache.org by Theral Mackey <tm...@zetta.net> on 2011/07/08 23:28:51 UTC

Alternative to httpclient for crawling with basic auth?

Is there an alternative to protocol-httpclient that can do basic auth? I am
running into a wall right now trying to get Nutch to fetch anything past the
seed URL of my site. The site requires auth, so I configured
protocol-httpclient, which (according to the Apache logs) correctly sends
credentials when the server returns a 401 challenge, but after fetching '/',
it quits with:

Stopping at depth=1 - no more URLs to fetch.

Running again stops at depth=0. The target page is an Apache mod_autoindex
page with 15 or so directories listed, so it should not be hitting any limit
since it is only fetching 1 page total (I turned off the
db.ignore.internal.links option, even though I think I read it only applies
to index scoring, not the crawlDB). I thought it might be one of the regex
URL filters blocking, so I trimmed them down to +.*, and still nothing.

I also pointed it at a server that does not require auth, and it spit out an
"unzipBestEffort returned null" error, even though nothing on the page is a
zip/gz/tgz and server compression is not on. I traced this to NUTCH-990,
which is marked "won't fix", and everything pointing at upgrading to
HttpClient 4 says it won't happen... so is there an alternative, or some way
to get this working? Crawling the non-auth site with protocol-http works as
expected: Nutch starts crawling the autoindex pages and I can watch from the
console or the Apache access log.
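
For reference, this is the shape of the auth config I am using. Host, port,
realm, and credentials below are placeholders; the file is picked up via the
http.auth.file property, which as far as I can tell defaults to
httpclient-auth.xml in nutch-default.xml:

  <!-- httpclient-auth.xml: the realm should match the realm the server
       sends back in its WWW-Authenticate header -->
  <auth-configuration>
    <credentials username="crawler" password="secret">
      <authscope host="files.example.com" port="80" realm="Restricted Files"/>
    </credentials>
  </auth-configuration>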

-T

Re: Alternative to httpclient for crawling with basic auth?

Posted by Markus Jelsma <ma...@openindex.io>.
You could open an issue for porting protocol-httpclient to HttpClient 4 and
maybe submit a patch ;)
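
For whoever picks that up, the HttpClient 4.x credentials API that a port
would build on looks roughly like this; host, port, URL, and credentials
below are placeholders, so treat it as a sketch rather than drop-in plugin
code:

  import org.apache.http.HttpResponse;
  import org.apache.http.auth.AuthScope;
  import org.apache.http.auth.UsernamePasswordCredentials;
  import org.apache.http.client.methods.HttpGet;
  import org.apache.http.impl.client.DefaultHttpClient;

  public class Hc4BasicAuthDemo {
    public static void main(String[] args) throws Exception {
      DefaultHttpClient client = new DefaultHttpClient();
      // Register credentials for the protected host; HttpClient 4 replays
      // them when the server answers with a 401 challenge.
      client.getCredentialsProvider().setCredentials(
          new AuthScope("files.example.com", 80),
          new UsernamePasswordCredentials("crawler", "secret"));
      HttpResponse rsp =
          client.execute(new HttpGet("http://files.example.com/"));
      System.out.println(rsp.getStatusLine());
    }
  }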

A dirty fix would be hacking protocol-http to send a cookie or HTTP auth 
credentials along with its requests.
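
If someone wants to try that, the header itself is easy to build with
commons-codec, which is already on Nutch's classpath. The credentials below
are placeholders; in a real hack you would read them from config and append
the line wherever protocol-http writes its raw request headers:

  import org.apache.commons.codec.binary.Base64;

  public class BasicAuthHeaderDemo {
    public static void main(String[] args) throws Exception {
      String user = "crawler";  // placeholder credentials
      String pass = "secret";
      String token = new String(
          Base64.encodeBase64((user + ":" + pass).getBytes("UTF-8")), "UTF-8");
      // protocol-http would append this line to each request it sends:
      System.out.println("Authorization: Basic " + token);
    }
  }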

> Is there an alternative to protocol-httpclient that can do basic auth? [...]
> 
> -T