Posted to user@nutch.apache.org by Laura McCord <lm...@ucmerced.edu> on 2014/03/21 18:32:49 UTC

Crawling an authenticated site

Hi,

I have another question. I have an authenticated site that I want to
crawl, and I can access it with my username/password. Is there a
configuration step where I would add my credentials, or is this something
that has to be customized on my end?

Thanks Again,
  Laura

Re: Crawling an authenticated site

Posted by Laura McCord <lm...@ucmerced.edu>.
It is form-based, but it uses Jasig CAS SSO rather than basic authentication. If a request to the application lacks either a valid session or a service ticket parameter, it is redirected to the login page. What I’m trying to do is create a servlet on webserver#1 that takes a user to the login page to create a session and, upon successful authentication, runs a Nutch script against webserver#2. However, I’m not sure if that will work.
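
A minimal sketch of one piece of such a flow, assuming the CAS login page is an HTML form whose hidden inputs (for example a login ticket) must be sent back together with the credentials. The class name CasLoginForm and the field names lt, execution, username and password are illustrative assumptions, not taken from the site or from Nutch. The sketch only parses hidden inputs and builds the URL-encoded POST body; fetching the page, sending the POST and keeping the session cookie are left out.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch or CAS.
class CasLoginForm {

    // Matches whole <input ...> tags; attributes are extracted separately.
    private static final Pattern INPUT_TAG =
        Pattern.compile("<input\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // Returns a double-quoted attribute value from a tag, or null if absent.
    static String attr(String tag, String name) {
        Pattern p = Pattern.compile(
            "\\s" + name + "\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(tag);
        return m.find() ? m.group(1) : null;
    }

    // Collects name/value pairs of all hidden inputs, in document order.
    static Map<String, String> hiddenFields(String html) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = INPUT_TAG.matcher(html);
        while (m.find()) {
            String tag = m.group();
            if ("hidden".equalsIgnoreCase(attr(tag, "type"))) {
                String name = attr(tag, "name");
                String value = attr(tag, "value");
                if (name != null) {
                    fields.put(name, value == null ? "" : value);
                }
            }
        }
        return fields;
    }

    // Encodes the fields as an application/x-www-form-urlencoded body.
    static String formBody(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (sb.length() > 0) {
                sb.append('&');
            }
            sb.append(encode(e.getKey())).append('=').append(encode(e.getValue()));
        }
        return sb.toString();
    }

    private static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException ex) {
            // UTF-8 is always available on a compliant JVM.
            throw new IllegalStateException(ex);
        }
    }
}
```

The caller would add the username and password entries to the map before encoding it, then POST the body to the form's action URL while keeping any cookies the server sets.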

Thanks

On Mar 22, 2014, at 4:35 AM, remi tassing <ta...@gmail.com> wrote:

> Hi,
> 
> If it's form-based authentication where you need to send HTTP POST
> requests, then I would suggest you modify HttpResponse.java for that purpose.
> 
> Remi


Re: Crawling an authenticated site

Posted by remi tassing <ta...@gmail.com>.
Hi,

If it's form-based authentication where you need to send HTTP POST
requests, then I would suggest you modify HttpResponse.java for that purpose.
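
As a rough illustration of what a form login looks like on the wire, the sketch below assembles a raw HTTP POST request as a string. It is not Nutch's HttpResponse.java; the class FormPostRequest, its build method and all example values are hypothetical, and a real change would also need to handle the response, cookies and redirects.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper showing the shape of a form POST request.
class FormPostRequest {

    // Builds request line, headers and body for a URL-encoded form POST.
    static String build(String host, String path, String userAgent, String body) {
        // Content-Length counts bytes, not characters.
        int length = body.getBytes(StandardCharsets.UTF_8).length;
        StringBuilder req = new StringBuilder();
        req.append("POST ").append(path).append(" HTTP/1.1\r\n");
        req.append("Host: ").append(host).append("\r\n");
        req.append("User-Agent: ").append(userAgent).append("\r\n");
        req.append("Content-Type: application/x-www-form-urlencoded\r\n");
        req.append("Content-Length: ").append(length).append("\r\n");
        req.append("Connection: close\r\n");
        // A blank line separates the headers from the body.
        req.append("\r\n");
        req.append(body);
        return req.toString();
    }
}
```

The body passed in is expected to be URL-encoded already, for example username=u&password=p.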

Remi


On Sat, Mar 22, 2014 at 2:31 AM, John Lafitte <jl...@brandextract.com> wrote:

> I haven't done it myself but it's documented here:
> http://wiki.apache.org/nutch/HttpAuthenticationSchemes
>
> I'm not sure how you would do it with form-based auth, but if it's a
> custom app you might be able to just automatically grant it access if
> the user agent and/or IP match up.

Re: Crawling an authenticated site

Posted by John Lafitte <jl...@brandextract.com>.
I haven't done it myself but it's documented here:
http://wiki.apache.org/nutch/HttpAuthenticationSchemes

I'm not sure how you would do it with form-based auth, but if it's a
custom app you might be able to just automatically grant it access if
the user agent and/or IP match up.
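
A minimal sketch of the server-side check described above, assuming the application can see the request's User-Agent header and remote address. CrawlerAllowList and its parameters are hypothetical names. This version requires both the user agent and the IP to match, since a User-Agent header on its own can be set by any client.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical allow-list check for a custom application.
class CrawlerAllowList {
    private final String agentToken;
    private final Set<String> allowedIps;

    CrawlerAllowList(String agentToken, String... allowedIps) {
        this.agentToken = agentToken.toLowerCase(Locale.ROOT);
        this.allowedIps = new HashSet<>(Arrays.asList(allowedIps));
    }

    // True only when the user agent contains the token AND the IP is listed.
    boolean isTrustedCrawler(String userAgent, String remoteAddr) {
        if (userAgent == null || remoteAddr == null) {
            return false;
        }
        boolean agentMatches =
            userAgent.toLowerCase(Locale.ROOT).contains(agentToken);
        boolean ipMatches = allowedIps.contains(remoteAddr);
        return agentMatches && ipMatches;
    }
}
```

In a servlet application this check could run in a filter, with the two arguments taken from the incoming request.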


On Fri, Mar 21, 2014 at 12:32 PM, Laura McCord <lm...@ucmerced.edu> wrote:

> Hi,
>
> I have another question. I have an authenticated site that I want to
> crawl, and I can access it with my username/password. Is there a
> configuration step where I would add my credentials, or is this something
> that has to be customized on my end?
>
> Thanks Again,
>  Laura
>