Posted to user@nutch.apache.org by Laura McCord <lm...@ucmerced.edu> on 2014/03/21 18:32:49 UTC
Crawling an authenticated site
Hi,
I have another question. I want to crawl a site that requires
authentication, and I have a username and password for it. Is there a
configuration step where I would add my credentials, or is this something
I would have to customize on my end?
Thanks Again,
Laura
Re: Crawling an authenticated site
Posted by Laura McCord <lm...@ucmerced.edu>.
It is form-based, but it uses Jasig CAS SSO rather than a basic authentication method. The way it works is that if a request to the application lacks either a valid session or a service ticket parameter, it is redirected to the login page. What I'm trying to do is create a servlet on webserver#1 that takes the user to the login page to create a session and, once authentication succeeds, runs a Nutch script against webserver#2. However, I'm not sure that will work.
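For reference, the login flow described above (fetch the login page, post the credentials, keep the resulting session cookie) could be sketched roughly as follows. This is only a sketch under assumptions: the class name CasLoginSketch, the form field names (username, password, lt, execution, _eventId) and the login URL are hypothetical and would have to be checked against the actual CAS login page, and it uses the JDK's java.net.http client (Java 11+), not anything from Nutch. One thing it does illustrate is that the session cookie lives in whichever HTTP client performed the login, so a session created in a user's browser would generally not carry over to a separately running crawler.

```java
import java.io.IOException;
import java.net.CookieManager;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Nutch or CAS.
final class CasLoginSketch {

    /**
     * Returns the value of the input with the given name, or null if absent.
     * Assumes the name attribute appears before the value attribute.
     */
    static String hiddenField(String html, String name) {
        Pattern p = Pattern.compile(
            "<input[^>]*name=\"" + Pattern.quote(name) + "\"[^>]*value=\"([^\"]*)\"",
            Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    /** Encodes fields as application/x-www-form-urlencoded, in insertion order. */
    static String formEncode(Map<String, String> fields) {
        return fields.entrySet().stream()
            .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8) + "="
                + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
            .collect(Collectors.joining("&"));
    }

    /** Logs in and returns the client that now holds the session cookies. */
    static HttpClient login(String loginUrl, String user, String password)
            throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
            .cookieHandler(new CookieManager())          // keeps session cookies
            .followRedirects(HttpClient.Redirect.NORMAL) // follows the post-login redirect
            .build();

        // Step 1: fetch the login page to pick up any hidden form tokens.
        String page = client.send(
            HttpRequest.newBuilder(URI.create(loginUrl)).GET().build(),
            HttpResponse.BodyHandlers.ofString()).body();

        // Step 2: post the credentials together with whatever tokens were found.
        Map<String, String> form = new LinkedHashMap<>();
        form.put("username", user);       // assumed field name
        form.put("password", password);   // assumed field name
        for (String name : new String[] {"lt", "execution", "_eventId"}) {
            String value = hiddenField(page, name);
            if (value != null) {
                form.put(name, value);
            }
        }
        HttpResponse<String> response = client.send(
            HttpRequest.newBuilder(URI.create(loginUrl))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(formEncode(form)))
                .build(),
            HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() >= 400) {
            throw new IOException("Login failed with status " + response.statusCode());
        }
        return client;
    }
}
```

Only the client returned by login() can reach the protected pages, which is why the crawler itself would need to perform (or be handed) the login rather than rely on a session created elsewhere.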
Thanks
On Mar 22, 2014, at 4:35 AM, remi tassing <ta...@gmail.com> wrote:
> Hi,
>
> If it's a form-based authentication where you need to send Http POST
> requests, then I would suggest you modify HttpResponse.java for the purpose
>
> Remi
>
>
> On Sat, Mar 22, 2014 at 2:31 AM, John Lafitte <jl...@brandextract.com> wrote:
>
>> I haven't done it myself but it's documented here:
>> http://wiki.apache.org/nutch/HttpAuthenticationSchemes
>>
>> I'm not sure how you would do it with form-based auth, but if it's a
>> custom app you might be able to just automatically grant it access if
>> the user agent and/or IP match up.
>>
>>
>> On Fri, Mar 21, 2014 at 12:32 PM, Laura McCord <lm...@ucmerced.edu>
>> wrote:
>>
>>> Hi,
>>>
>>> I have another question... If I have an authenticated site that I want to
>>> crawl in which I have access with my username/password. Is there a
>>> configuration step where I would add my credentials or is this something
>>> that had to be customized on my end?
>>>
>>> Thanks Again,
>>> Laura
>>>
>>
Re: Crawling an authenticated site
Posted by remi tassing <ta...@gmail.com>.
Hi,
If it's form-based authentication, where you need to send HTTP POST
requests, then I would suggest modifying HttpResponse.java for that purpose.
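To give an idea of what sending such a POST involves, here is a minimal sketch that builds the raw text of a form-encoded POST request. It is not taken from Nutch's HttpResponse.java; the class name FormPostSketch, the method names, and the example host, path and field names are all hypothetical, and a real change would also need to handle the response and any cookies it sets.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Nutch.
final class FormPostSketch {

    /** Encodes fields as application/x-www-form-urlencoded, in iteration order. */
    static String encodeForm(Map<String, String> fields) {
        return fields.entrySet().stream()
            .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8) + "="
                + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
            .collect(Collectors.joining("&"));
    }

    /** Builds the request line, headers and body of a form POST as one string. */
    static String buildPostRequest(String host, String path, String userAgent,
                                   Map<String, String> fields) {
        String body = encodeForm(fields);
        // Content-Length counts bytes, not characters.
        int length = body.getBytes(StandardCharsets.UTF_8).length;
        return "POST " + path + " HTTP/1.0\r\n"
            + "Host: " + host + "\r\n"
            + "User-Agent: " + userAgent + "\r\n"
            + "Content-Type: application/x-www-form-urlencoded\r\n"
            + "Content-Length: " + length + "\r\n"
            + "\r\n"   // blank line separates headers from body
            + body;
    }
}
```

Compared with a GET, the differences are the request method, the Content-Type and Content-Length headers, and the encoded credentials sent as the body.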
Remi
Re: Crawling an authenticated site
Posted by John Lafitte <jl...@brandextract.com>.
I haven't done it myself but it's documented here:
http://wiki.apache.org/nutch/HttpAuthenticationSchemes
I'm not sure how you would do it with form-based auth, but if it's a
custom app you might be able to just automatically grant the crawler
access if the user agent and/or IP match up.
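On the application side, that kind of check might look something like the sketch below. It is only an illustration: the class name CrawlerAllowList, the agent token and the address are made up, and it requires both the user agent and the IP to match, because a User-Agent header on its own is easy to spoof.

```java
import java.util.Set;

// Hypothetical server-side check; not part of Nutch.
final class CrawlerAllowList {
    private final String agentToken;
    private final Set<String> trustedAddresses;

    CrawlerAllowList(String agentToken, Set<String> trustedAddresses) {
        this.agentToken = agentToken;
        this.trustedAddresses = trustedAddresses;
    }

    /**
     * Returns true only when the User-Agent header contains the expected
     * token AND the request comes from one of the trusted addresses.
     */
    boolean isTrustedCrawler(String userAgentHeader, String remoteAddr) {
        if (userAgentHeader == null || remoteAddr == null) {
            return false;
        }
        return userAgentHeader.contains(agentToken)
            && trustedAddresses.contains(remoteAddr);
    }
}
```

In a servlet filter the two arguments would typically come from request.getHeader("User-Agent") and request.getRemoteAddr(); if the application sits behind a proxy, the remote address seen there may be the proxy's rather than the crawler's.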