Posted to user@nutch.apache.org by Guruprasad Iyer <mu...@gmail.com> on 2006/10/12 11:50:02 UTC

crawling sites which require authentication

Hi,

I need to know how to crawl (intranet) sites which require authentication.
One suggestion was that I replace protocol-http with protocol-httpclient in
the value field of plugin.includes tag in the nutch-default.xml file.
However, this did not solve the problem.
Can you help me out on this? Thanks.

Regards,
Guruprasad
-- 
If you want to build a ship, don't drum up the men to gather wood, divide
the work and give orders. Instead, teach them the desire for the sea.

- Antoine-Marie-Roger de Saint-Exupery

Re: crawling sites which require authentication

Posted by Jim Wilson <wi...@gmail.com>.
Standard community response: it's not built in, but you could write an
extension!

(I asked this myself a few months back).

-- Jim

On 10/12/06, Guruprasad Iyer <mu...@gmail.com> wrote:
>
> Hi,
>
> I need to know how to crawl (intranet) sites which require authentication.
> One suggestion was that I replace protocol-http with protocol-httpclient
> in
> the value field of plugin.includes tag in the nutch-default.xml file.
> However, this did not solve the problem.
> Can you help me out on this? Thanks.
>
> Regards,
> Guruprasad
> --
> If you want to build a ship, don't drum up the men to gather wood, divide
> the work and give orders. Instead, teach them the desire for the sea.
>
> - Antoine-Marie-Roger de Saint-Exupery
>
>

Re: crawling sites which require authentication

Posted by Ravi Chintakunta <ra...@gmail.com>.
Switching from protocol-http to protocol-httpclient will help in
crawling secured sites (https).

If your site supports HTTP Basic authentication, then you can modify
the HTTP class in the protocol-httpclient plugin.

These are minor changes in the configureClient method:

// Required if your site does not send an authentication challenge.
client.getParams().setAuthenticationPreemptive(true);

client.getState().setCredentials(
    new AuthScope("site.com", AuthScope.ANY_PORT, AuthScope.ANY_REALM),
    new UsernamePasswordCredentials(username, password));

Replace the site with your site name (without the http or https
prefix), and include your login credentials for username and password.

You may also store the login credentials in the Nutch conf file and read
them from there.
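
As a rough illustration of what the preemptive setting changes: with it
enabled, the client sends an Authorization header on the first request
instead of waiting for a 401 challenge. For HTTP Basic, that header is just
the Base64 encoding of "username:password". The sketch below uses only the
JDK; the class and method names are made up for illustration and are not
part of Nutch or HttpClient.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class BasicAuthHeader {
    /** Builds the value of the Authorization header for HTTP Basic authentication. */
    public static String basicAuthHeader(String username, String password) {
        String pair = username + ":" + password;
        String encoded = Base64.getEncoder()
                .encodeToString(pair.getBytes(StandardCharsets.UTF_8));
        return "Basic " + encoded;
    }

    public static void main(String[] args) {
        // "crawler" and "secret" are placeholder credentials.
        System.out.println("Authorization: " + basicAuthHeader("crawler", "secret"));
    }
}
```

Since the credentials are only encoded, not encrypted, this is best kept to
https or a trusted intranet.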

Hope this helps.

- Ravi Chintakunta


On 10/12/06, Tomi NA <he...@gmail.com> wrote:
> 2006/10/12, Guruprasad Iyer <mu...@gmail.com>:
> > Hi,
> >
> > I need to know how to crawl (intranet) sites which require authentication.
> > One suggestion was that I replace protocol-http with protocol-httpclient in
> > the value field of plugin.includes tag in the nutch-default.xml file.
> > However, this did not solve the problem.
> > Can you help me out on this? Thanks.
>
> I don't know what kind of authentication scheme you're up against, but
> recently I had to work with NTLM authentication in an intranet and
> worked around it using an ntlmaps proxy. You tell nutch to use the
> proxy and you provide the proxy with adequate access privileges. As
> simple as that and works like a charm. I imagine the nutch proxy
> support could be extended so that e.g. it selects a proxy based on
> regexp matching of urls. That way it would be possible to provide all
> the login/password pairs needed to crawl all of the sites you're
> interested in.
>
> t.n.a.
>

Re: crawling sites which require authentication

Posted by Jim Wilson <wi...@gmail.com>.
Yeah seriously - if NTLM auth (or HTTP Basic for that matter) is supported
natively by Nutch, I'd love to read the documentation on it!

-- Jim

On 10/14/06, Tomi NA <he...@gmail.com> wrote:
>
> 2006/10/14, Toufeeq Hussain <to...@gmail.com>:
>
> > From internal tests with ntlmaps + Nutch the conclusion we came to was
> > that though it "kinda-works" it puts a huge load on the Nutch server
> > as ntlmaps is a major memory-hog and the mixture of the two leads to
> > performance issues. For a PoC this will do but for
> > production-deployments I would not suggest one goes the ntlmaps way.
> >
> > An alternative would be to have a separate ntlmaps server, a dedicated
> > machine acting as the NTLM proxy for the Nutch-box which sits behind
> > it.
>
> I haven't noticed the added resource drain, but then again, I haven't
> really tested all that much: the constraints on the particular project
> where I implemented the approach weren't very strict.
> I'll keep my eye on the cpu usage.
>
> > The right way would be to use the in-built authentication features of
> > Nutch for Auth based crawling.
>
> Nutch supports ntlm authentication? I see I've got some reading to
> catch up on...
>
> t.n.a.
>

Re: crawling sites which require authentication

Posted by Toufeeq Hussain <to...@gmail.com>.
Hi Tomi,

On 10/22/06, Tomi NA <he...@gmail.com> wrote:
>
> Toufeeq, could you say anything more on the topic of nutch in-built
> NTLM authentication support?

My work has been limited to 0.7.X version of nutch. Below are some of
my findings..

The file src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/Http.java
mentions configuration options called 'http.auth.ntlm.username'
and 'http.auth.ntlm.passwd'.

I'm guessing that if these options are set in the default.xml or
site.xml config files, the Nutch crawler should pick them up.

I did try crawling with these options in the conf files, but it didn't
help much. I haven't given it much thought either. :)
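
For anyone who wants to repeat the experiment, the entries would look
roughly like this. The two property names are taken from the Http.java
mention above; the values are placeholders, and nothing here confirms that
the crawler actually honours them.

```xml
<!-- Goes inside the root element of the site config file. -->
<!-- Values are placeholders; whether Nutch reads these is unverified. -->
<property>
  <name>http.auth.ntlm.username</name>
  <value>crawler-account</value>
</property>
<property>
  <name>http.auth.ntlm.passwd</name>
  <value>changeme</value>
</property>
```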

Please update the list if you are successful in getting NTLM auth to
work while crawling.

-Toufeeq
--
blog @ http://toufeeq.net

Re: crawling sites which require authentication

Posted by Tomi NA <he...@gmail.com>.
2006/10/14, Toufeeq Hussain <to...@gmail.com>:

> From internal tests with ntlmaps + Nutch the conclusion we came to was
> that though it "kinda-works" it puts a huge load on the Nutch server
> as ntlmaps is a major memory-hog and the mixture of the two leads to
> performance issues. For a PoC this will do but for
> production-deployments I would not suggest one goes the ntlmaps way.
>
> An alternative would be to have a separate ntlmaps server, a dedicated
> machine acting as the NTLM proxy for the Nutch-box which sits behind
> it.

I haven't noticed the added resource drain, but then again, I haven't
really tested all that much: the constraints on the particular project
where I implemented the approach weren't very strict.
I'll keep my eye on the cpu usage.

> The right way would be to use the in-built authentication features of
> Nutch for Auth based crawling.

Nutch supports ntlm authentication? I see I've got some reading to
catch up on...

t.n.a.

Re: crawling sites which require authentication

Posted by Toufeeq Hussain <to...@gmail.com>.
Hi Tomi,

On 10/13/06, Tomi NA <he...@gmail.com> wrote:
> Guruprasad,
> please use "reply-all" so your messages end up on the list as well. As
> far as ntlmaps is concerned, you can read about it here
> http://ntlmaps.sourceforge.net/ or download it here
> http://sourceforge.net/project/showfiles.php?group_id=69259&package_id=68110&release_id=303755.
> If you're using linux chances are all you need to do is issue a
> command like "emerge ntlmaps" or "apt-get install ntlmaps".
> Read the ntlmaps documentation on how you set it up or just follow the
> comments in its config file: /etc/ntlmaps/server.cfg.

From internal tests with ntlmaps + Nutch the conclusion we came to was
that though it "kinda-works" it puts a huge load on the Nutch server
as ntlmaps is a major memory-hog and the mixture of the two leads to
performance issues. For a PoC this will do but for
production-deployments I would not suggest one goes the ntlmaps way.

An alternative would be to have a separate ntlmaps server, a dedicated
machine acting as the NTLM proxy for the Nutch-box which sits behind
it.

The right way would be to use the in-built authentication features of
Nutch for Auth based crawling.

-Toufeeq
-- 
blog @ http://toufeeq.net

Re: crawling sites which require authentication

Posted by Tomi NA <he...@gmail.com>.
2006/10/13, Guruprasad Iyer <mu...@gmail.com>:
> Hi Tomi,
>
> "using a ntlmaps proxy"
> How do I get this proxy?
>
> "You tell nutch to use the proxy and you provide the proxy with adequate
> access privileges."
> How do I do this? Can you elaborate?
>
> I am a new Nutch user and am very much in the learning phase. Thanks.
>
> Cheers,
> Guruprasad

Guruprasad,
please use "reply-all" so your messages end up on the list as well. As
far as ntlmaps is concerned, you can read about it here
http://ntlmaps.sourceforge.net/ or download it here
http://sourceforge.net/project/showfiles.php?group_id=69259&package_id=68110&release_id=303755.
If you're using linux chances are all you need to do is issue a
command like "emerge ntlmaps" or "apt-get install ntlmaps".
Read the ntlmaps documentation on how you set it up or just follow the
comments in its config file: /etc/ntlmaps/server.cfg.
The only thing left for you to do is to edit the nutch-site.xml file
and set the http.proxy.host to (probably) "localhost" and
http.proxy.port to whatever port you set the proxy to listen on.
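
A minimal sketch of those two entries, assuming ntlmaps runs on the same
machine as Nutch. The port value is only a placeholder; use whatever port
your ntlmaps instance is configured to listen on.

```xml
<!-- Goes inside the root element of conf/nutch-site.xml. -->
<property>
  <name>http.proxy.host</name>
  <value>localhost</value>
</property>
<!-- Placeholder port: match the listen port in /etc/ntlmaps/server.cfg. -->
<property>
  <name>http.proxy.port</name>
  <value>5865</value>
</property>
```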

Looking at what I've written, I should have just said google is your
friend...ah well, what's done is done. :)

Hope this helps,
t.n.a.

Re: crawling sites which require authentication

Posted by Tomi NA <he...@gmail.com>.
2006/10/12, Guruprasad Iyer <mu...@gmail.com>:
> Hi,
>
> I need to know how to crawl (intranet) sites which require authentication.
> One suggestion was that I replace protocol-http with protocol-httpclient in
> the value field of plugin.includes tag in the nutch-default.xml file.
> However, this did not solve the problem.
> Can you help me out on this? Thanks.

I don't know what kind of authentication scheme you're up against, but
recently I had to work with NTLM authentication in an intranet and
worked around it using an ntlmaps proxy. You tell nutch to use the
proxy and you provide the proxy with adequate access privileges. As
simple as that and works like a charm. I imagine the nutch proxy
support could be extended so that e.g. it selects a proxy based on
regexp matching of urls. That way it would be possible to provide all
the login/password pairs needed to crawl all of the sites you're
interested in.
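
To make the regexp idea concrete, here is a small sketch of how such a
selection could work. This is purely hypothetical: Nutch does not provide
it, and the class name, method names, hosts, and ports are all invented for
illustration. It assumes one proxy instance per set of credentials.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class ProxySelectorSketch {
    /** One rule: URLs matching the pattern go through the given proxy. */
    private static final class Rule {
        final Pattern pattern;
        final String proxy;

        Rule(Pattern pattern, String proxy) {
            this.pattern = pattern;
            this.proxy = proxy;
        }
    }

    // Rules are checked in the order they were added; the first match wins.
    private final List<Rule> rules = new ArrayList<>();

    public void addRule(String urlRegex, String proxyHostPort) {
        rules.add(new Rule(Pattern.compile(urlRegex), proxyHostPort));
    }

    /** Returns the proxy ("host:port") for the URL, or null for a direct connection. */
    public String select(String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).matches()) {
                return rule.proxy;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        ProxySelectorSketch selector = new ProxySelectorSketch();
        // Hypothetical setup: each proxy instance holds a different login.
        selector.addRule("https?://intranet\\.example\\.com/.*", "localhost:5865");
        selector.addRule("https?://wiki\\.example\\.com/.*", "localhost:5866");
        System.out.println(selector.select("http://wiki.example.com/Home"));
        System.out.println(selector.select("http://www.example.org/"));
    }
}
```

The fetcher would then have to consult something like select() per URL
instead of reading a single http.proxy.host value.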

t.n.a.

Re: crawling sites which require authentication

Posted by Toufeeq Hussain <to...@gmail.com>.
Oops..

Sorry about the mail below. Did not know reply-to munging was being done. :)

-Toufeeq

On 10/30/06, Toufeeq Hussain <to...@gmail.com> wrote:
> dude..
>
> You got Nutch working with NTLM ?
>
> -Toufeeq
>
> On 10/12/06, Guruprasad Iyer <mu...@gmail.com> wrote:
> > Hi,
> >
> > I need to know how to crawl (intranet) sites which require authentication.
> > One suggestion was that I replace protocol-http with protocol-httpclient in
> > the value field of plugin.includes tag in the nutch-default.xml file.
> > However, this did not solve the problem.
> > Can you help me out on this? Thanks.
> >
> > Regards,
> > Guruprasad
> > --
> > If you want to build a ship, don't drum up the men to gather wood, divide
> > the work and give orders. Instead, teach them the desire for the sea.
> >
> > - Antoine-Marie-Roger de Saint-Exupery
> >
> >
>
>
> --
> blog @ http://toufeeq.net
>


-- 
blog @ http://toufeeq.net

Re: crawling sites which require authentication

Posted by Toufeeq Hussain <to...@gmail.com>.
dude..

You got Nutch working with NTLM ?

-Toufeeq

On 10/12/06, Guruprasad Iyer <mu...@gmail.com> wrote:
> Hi,
>
> I need to know how to crawl (intranet) sites which require authentication.
> One suggestion was that I replace protocol-http with protocol-httpclient in
> the value field of plugin.includes tag in the nutch-default.xml file.
> However, this did not solve the problem.
> Can you help me out on this? Thanks.
>
> Regards,
> Guruprasad
> --
> If you want to build a ship, don't drum up the men to gather wood, divide
> the work and give orders. Instead, teach them the desire for the sea.
>
> - Antoine-Marie-Roger de Saint-Exupery
>
>


-- 
blog @ http://toufeeq.net