Posted to user@nutch.apache.org by Michael Piccuirro <mi...@gmail.com> on 2008/09/03 17:10:32 UTC

Re: A problem for web site needing username & password

If you're talking about HTTP basic authentication, I had the same problem
using Nutch 0.9. I saw a few articles explaining how to do it by modifying
config files, but nothing worked. So as a messy quick fix I just modified
this file:

src\plugin\protocol-httpclient\src\java\org\apache\nutch\protocol\httpclient\Http.java

I just grab a username and password from the config:

   String basicUsername = conf.get("http.auth.basic.username");
   String basicPassword = conf.get("http.auth.basic.password");

// then set the credentials like this:

   // NTLM credentials (the ntlm* variables are defined elsewhere in Http.java)
   Credentials ntCreds = new NTCredentials(ntlmUsername, ntlmPassword,
                                           ntlmHost, ntlmDomain);
   client.getState().setCredentials(
       new AuthScope(ntlmHost, AuthScope.ANY_PORT), ntCreds);

   // Basic auth credentials, sent preemptively for the given host
   if (LOG.isInfoEnabled()) {
     LOG.info("**** setting basic auth credentials ****");
   }
   client.getParams().setAuthenticationPreemptive(true);
   client.getState().setCredentials(
       new AuthScope("www.mydomain.com", AuthScope.ANY_PORT,
                     AuthScope.ANY_REALM),
       new UsernamePasswordCredentials(basicUsername, basicPassword));


Not the best way to do this, but it'll work.

Change www.mydomain.com to your own domain.
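
The hard-coded host could instead come from the configuration. Here is a
minimal, library-free sketch of that idea. Everything beyond the two property
names quoted above is an assumption: java.util.Properties only stands in for
the Nutch conf object, and the http.auth.basic.host property and the
BasicAuthSettings class are hypothetical names, not part of Nutch. It also
fails early when a value is missing instead of passing null on to the HTTP
client.

```java
import java.util.Properties;

// Hypothetical helper that collects the basic-auth settings in one place.
// java.util.Properties is only a stand-in for the Nutch configuration object.
final class BasicAuthSettings {
    final String host;
    final String username;
    final String password;

    private BasicAuthSettings(String host, String username, String password) {
        this.host = host;
        this.username = username;
        this.password = password;
    }

    // Reads the two property names used in the post, plus an assumed
    // "http.auth.basic.host" property (not a real Nutch setting).
    static BasicAuthSettings fromConfig(Properties conf) {
        String host = require(conf, "http.auth.basic.host");
        String username = require(conf, "http.auth.basic.username");
        String password = require(conf, "http.auth.basic.password");
        return new BasicAuthSettings(host, username, password);
    }

    // Throws if the property is absent or blank, so a misconfiguration
    // is reported before any request is made.
    private static String require(Properties conf, String key) {
        String value = conf.getProperty(key);
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalStateException("Missing config property: " + key);
        }
        return value;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty("http.auth.basic.host", "www.mydomain.com");
        conf.setProperty("http.auth.basic.username", "crawler");
        conf.setProperty("http.auth.basic.password", "secret");

        BasicAuthSettings settings = BasicAuthSettings.fromConfig(conf);
        System.out.println("Would set credentials for host " + settings.host);
    }
}
```

In Http.java the three fields would then take the place of the literal
"www.mydomain.com", basicUsername and basicPassword in the snippet above.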


Another way around it is to have Nutch go through a proxy and have the
proxy tack on the auth header. I was using Charles Proxy. Again, not the
best way to do this, but it'll get you going.
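
For reference, the header such a proxy would add for HTTP basic
authentication is "Authorization: Basic " followed by the Base64 encoding of
"username:password". A minimal sketch using only the Java standard library
(java.util.Base64 needs Java 8 or later); the class name BasicAuthHeader and
the sample credentials are made up for illustration, and UTF-8 is assumed as
the charset:

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper that builds the value of the Authorization header
// used by HTTP basic authentication.
final class BasicAuthHeader {

    // Returns "Basic " + base64("username:password").
    static String value(String username, String password) {
        String pair = username + ":" + password;
        String encoded = Base64.getEncoder()
                .encodeToString(pair.getBytes(StandardCharsets.UTF_8));
        return "Basic " + encoded;
    }

    public static void main(String[] args) {
        // Sample credentials only; substitute your own.
        System.out.println("Authorization: " + value("crawler", "secret"));
    }
}
```

Basic authentication only encodes the credentials, it does not encrypt them,
so a header like this is readable by anyone who can see the traffic unless
the connection uses HTTPS.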


On Mon, Jul 28, 2008 at 2:53 AM, zhengsj03 User <zh...@163.com> wrote:

> Hi!
> Many web sites require a username and password to log in. If I want to
> crawl such a site and I know the username and password, how can I let
> the crawler log in to the site the way a human would? How should I
> change the configuration files?
> Thanks!
>
>
>

Re: A problem for web site needing username & password

Posted by zhengsj03 User <zh...@163.com>.
For the past few days I have tried to solve the problem by modifying the
source code, but failed.
I think your method will help me. I will try it. Thanks!