Posted to user@nutch.apache.org by Yoav Shapira <yo...@yoavshapira.com> on 2008/10/01 15:35:06 UTC

How do I crawl a site with a cookie for authentication?

Hi,

I would like to use Nutch to crawl and index an intranet web site for
internal use.  The site requires authentication, and stores the
credentials in a cookie.  I've got a valid login and I have the cookie
saved, no problem.  How do I tell Nutch to use it?

I did some research online before asking, but unfortunately I couldn't
find a step-by-step answer for a newbie like myself.  I see there's an
http-client plugin that can support some authentication.  Is that what
I should use for cookies?  If so, how do I configure it?

Or is there something else I should be doing?  If the documentation /
answer exists, sorry for the hassle and please just point me to it ;)

-- 
Thanks,

Yoav

Re: How do I crawl a site with a cookie for authentication?

Posted by Yoav Shapira <yo...@yoavshapira.com>.
Hi Patrick,

Thanks for your help.  I'll dig around a bit more, try the proxy
thing, maybe try the database approach, and see how it goes.  Much
appreciated,

Yoav

On Wed, Oct 1, 2008 at 1:14 PM, Patrick Markiewicz
<pm...@sim-gtech.com> wrote:
> Hi Yoav,
>        If the content is dynamic, presumably it is stored in a
> database?  I was just thinking that it might be easier to use some
> database utilities to index the information.
>
>        Do you know how to use JMeter to record the requests that a web
> browser makes?  The browser uses a particular port as a proxy.  I know
> that the JMeter cookie manager can save the cookies that are gathered as
> part of the request.
>        I'm pretty sure that nutch can use a proxy.
> http://wiki.apache.org/nutch/SetupProxyForNutch
>
> According to this page:
> http://jakarta.apache.org/jmeter/usermanual/component_reference.html#HTTP_Cookie_Manager
> you can manually add a cookie that will be used by all threads.  I am
> guessing that if you set up JMeter to act as a proxy, the proxy thread
> would be included as one of those that carry the cookie.
>
> If the proxy thread cannot have cookies added manually, then this
> strategy wouldn't work.
>
> Patrick



-- 
Thanks,

Yoav

RE: How do I crawl a site with a cookie for authentication?

Posted by Patrick Markiewicz <pm...@sim-gtech.com>.
Hi Yoav,
	If the content is dynamic, presumably it is stored in a
database?  I was just thinking that it might be easier to use some
database utilities to index the information.

	Do you know how to use JMeter to record the requests that a web
browser makes?  The browser uses a particular port as a proxy.  I know
that the JMeter cookie manager can save the cookies that are gathered as
part of the request.
	I'm pretty sure that nutch can use a proxy.
http://wiki.apache.org/nutch/SetupProxyForNutch

According to this page:
http://jakarta.apache.org/jmeter/usermanual/component_reference.html#HTTP_Cookie_Manager
you can manually add a cookie that will be used by all threads.  I am
guessing that if you set up JMeter to act as a proxy, the proxy thread
would be included as one of those that carry the cookie.

If the proxy thread cannot have cookies added manually, then this
strategy wouldn't work.
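
Since nobody in this thread has confirmed that JMeter's proxy can attach
a cookie, the same idea can be sketched as a small stand-alone forward
proxy.  This is only a rough illustration in Python's standard library,
not JMeter and not part of Nutch.  The names SESSION_COOKIE, merge_cookie,
CookieInjectingProxy and serve, the cookie value, and port 8888 are all
assumptions.  It handles plain-HTTP GET requests only, follows redirects
itself, and has not been tried against a Nutch crawl.

```python
import http.server
import urllib.error
import urllib.request

# Hypothetical value: replace with the name=value pair saved from a real login.
SESSION_COOKIE = "SESSIONID=replace-me"


def merge_cookie(existing, extra):
    """Append `extra` to an existing Cookie header value, which may be None."""
    return f"{existing}; {extra}" if existing else extra


class CookieInjectingProxy(http.server.BaseHTTPRequestHandler):
    """Forward proxy for plain-HTTP GET requests that adds SESSION_COOKIE."""

    # An empty ProxyHandler ignores system proxy settings, so the proxy
    # never routes its own upstream requests back through itself.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))

    def do_GET(self):
        try:
            # A client that is configured to use a proxy sends the absolute
            # URL in the request line, so self.path is the full target URL.
            request = urllib.request.Request(self.path)
            request.add_header(
                "Cookie", merge_cookie(self.headers.get("Cookie"), SESSION_COOKIE)
            )
            with self.opener.open(request, timeout=30) as upstream:
                status, headers, body = upstream.status, upstream.headers, upstream.read()
        except urllib.error.HTTPError as error:
            # Relay upstream error pages (403, 404, ...) instead of hiding them.
            status, headers, body = error.code, error.headers, error.read()
        except (OSError, ValueError) as error:
            self.send_error(502, f"Upstream fetch failed: {error}")
            return
        self.send_response(status)
        self.send_header(
            "Content-Type", headers.get("Content-Type", "application/octet-stream")
        )
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8888):
    """Run the proxy on localhost until interrupted."""
    server = http.server.ThreadingHTTPServer(("127.0.0.1", port), CookieInjectingProxy)
    server.serve_forever()
```

Calling serve() and then pointing the crawler's proxy host and port (see
the wiki page above) at 127.0.0.1:8888 would send every fetch through
this process.  If the session cookie expires mid-crawl, the site will
start returning its login page again, so the value has to stay fresh.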

Patrick

-----Original Message-----
From: yoavshapira@gmail.com [mailto:yoavshapira@gmail.com] On Behalf Of
Yoav Shapira
Sent: Wednesday, October 01, 2008 11:47 AM
To: nutch-user@lucene.apache.org
Subject: Re: How do I crawl a site with a cookie for authentication?

Patrick,
Thank you for the answers.  More below:

2008/10/1 Patrick Markiewicz <pm...@sim-gtech.com>:
> Is it possible for you to retrieve a resource by using the url:
> http://username:password@intranetsite/path/to/resource.htm

The system does not support HTTP Basic authentication at this time,
unfortunately.

> I'm not sure what level of authority you have with the intranet site.
> You could do a similar trick by crawling the local filesystem of that
> site, and then just having the search page edit

The site is dynamically generated.  There are no meaningful static
files on the file system.

> If you only have your own account, and can't change any other things,
> then you might be able to use JMeter to add a cookie and have nutch use
> JMeter as a proxy.  I have never

This is very intriguing.  How would I get started on this?  I've used
JMeter in the past for simple test plans, but never as an HTTP proxy.

Yoav

Re: How do I crawl a site with a cookie for authentication?

Posted by Yoav Shapira <yo...@yoavshapira.com>.
Patrick,
Thank you for the answers.  More below:

2008/10/1 Patrick Markiewicz <pm...@sim-gtech.com>:
> Is it possible for you to retrieve a resource by using the url:
> http://username:password@intranetsite/path/to/resource.htm

The system does not support HTTP Basic authentication at this time,
unfortunately.

> I'm not sure what level of authority you have with the intranet site.
> You could do a similar trick by crawling the local filesystem of that
> site, and then just having the search page edit

The site is dynamically generated.  There are no meaningful static
files on the file system.

> If you only have your own account, and can't change any other things,
> then you might be able to use JMeter to add a cookie and have nutch use
> JMeter as a proxy.  I have never

This is very intriguing.  How would I get started on this?  I've used
JMeter in the past for simple test plans, but never as an HTTP proxy.

Yoav

RE: How do I crawl a site with a cookie for authentication?

Posted by Patrick Markiewicz <pm...@sim-gtech.com>.
Is it possible for you to retrieve a resource by using the url:
http://username:password@intranetsite/path/to/resource.htm

If that works, you could temporarily give a "nutchuser" an account on the site (with as little permission as possible), then crawl the intranet site, and disable the account.  Then edit the nutch search page to strip out the "nutchusername:nutchpassword@" part of each URL when you present results to the user.  That way, only the users who previously authenticated would have access to that resource.
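
A minimal sketch of the stripping step described above, written in Python rather than inside the Nutch search page itself. The function name strip_userinfo is hypothetical, and the only thing assumed is the standard urllib.parse module.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_userinfo(url: str) -> str:
    """Return `url` without any "user:password@" prefix on the host."""
    parts = urlsplit(url)
    # netloc looks like "user:password@host:port"; keep what follows the last "@".
    # rpartition returns the whole string in slot 2 when there is no "@".
    host_and_port = parts.netloc.rpartition("@")[2]
    return urlunsplit(
        (parts.scheme, host_and_port, parts.path, parts.query, parts.fragment)
    )
```

Applied to each result URL before display, this would turn http://nutchusername:nutchpassword@intranetsite/path/to/resource.htm into http://intranetsite/path/to/resource.htm and leave URLs without credentials unchanged.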

I'm not sure what level of authority you have with the intranet site.  You could do a similar trick by crawling the local filesystem of that site, and then just having the search page edit each URL to replace the file system path with a URL path that would work for a logged in user.
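
The path rewrite described above could look roughly like the following Python sketch. The function name file_path_to_url, the document root /var/www/intranet, and the base URL are all assumptions made up for illustration, since the real layout of the site is not known here.

```python
from pathlib import PurePosixPath
from urllib.parse import quote


def file_path_to_url(file_path: str, doc_root: str, base_url: str) -> str:
    """Map a crawled file path under `doc_root` to the matching site URL."""
    # relative_to raises ValueError if file_path is not inside doc_root,
    # which guards against exposing files from outside the site.
    relative = PurePosixPath(file_path).relative_to(doc_root)
    # Percent-encode spaces and other unsafe characters; "/" is left alone.
    return base_url.rstrip("/") + "/" + quote(relative.as_posix())
```

For example, file_path_to_url("/var/www/intranet/docs/report.htm", "/var/www/intranet", "http://intranetsite/") gives http://intranetsite/docs/report.htm. This only helps when the site serves files directly from disk.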

If you only have your own account, and can't change anything else, then you might be able to use JMeter to add a cookie and have Nutch use JMeter as a proxy.  I have never done this, so I don't know whether JMeter can add a cookie to a request made by an application that it proxies.

-----Original Message-----
From: Doğacan Güney [mailto:dogacan@gmail.com] 
Sent: Wednesday, October 01, 2008 10:08 AM
To: nutch-user@lucene.apache.org
Subject: Re: How do I crawl a site with a cookie for authentication?

On Wed, Oct 1, 2008 at 4:35 PM, Yoav Shapira <yo...@yoavshapira.com> wrote:
> Hi,
>
> I would like to use Nutch to crawl and index an intranet web site for
> internal use.  The site requires authentication, and stores the
> credentials in a cookie.  I've got a valid login and I have the cookie
> saved, no problem.  How do I tell Nutch to use it?
>
> I did some research online before asking, but unfortunately I couldn't
> find a step-by-step answer for a newbie like myself.  I see there's an
> http-client plugin that can support some authentication.  Is that what
> I should use for cookies?  If so, how do I configure it?
>
> Or is there something else I should be doing?  If the documentation /
> answer exists, sorry for the hassle and please just point me to it ;)
>

Unfortunately, Nutch doesn't have such a feature yet.  (One of the problems
is that we do not have a place to store cookies in a distributed setup.)

> --
> Thanks,
>
> Yoav
>



-- 
Doğacan Güney

Re: How do I crawl a site with a cookie for authentication?

Posted by Doğacan Güney <do...@gmail.com>.
On Wed, Oct 1, 2008 at 4:35 PM, Yoav Shapira <yo...@yoavshapira.com> wrote:
> Hi,
>
> I would like to use Nutch to crawl and index an intranet web site for
> internal use.  The site requires authentication, and stores the
> credentials in a cookie.  I've got a valid login and I have the cookie
> saved, no problem.  How do I tell Nutch to use it?
>
> I did some research online before asking, but unfortunately I couldn't
> find a step-by-step answer for a newbie like myself.  I see there's an
> http-client plugin that can support some authentication.  Is that what
> I should use for cookies?  If so, how do I configure it?
>
> Or is there something else I should be doing?  If the documentation /
> answer exists, sorry for the hassle and please just point me to it ;)
>

Unfortunately, Nutch doesn't have such a feature yet.  (One of the problems
is that we do not have a place to store cookies in a distributed setup.)

> --
> Thanks,
>
> Yoav
>



-- 
Doğacan Güney