Posted to user@nutch.apache.org by Yoav Shapira <yo...@apache.org> on 2008/05/06 02:49:50 UTC

How to authenticate with cookies?

Hi,

I'm using Nutch to crawl an intranet site that is behind form
authentication.  I know Nutch doesn't support form authentication yet
(right?), but I think this site would also work with cookies.  I have
the right set of cookie names and values, at least for testing, but I
don't know how to have Nutch use these cookies with every HTTP
request during its crawl.

I saw a reference to a "protocol-httpclient" plugin.  Is that true / relevant?

Any help on configuring Nutch to use cookies for authentication would
be appreciated.
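
To make the goal concrete: at the HTTP level, "use these cookies with every request" means sending one Cookie header built from the known name/value pairs. The following is a minimal Python sketch of that idea, outside Nutch. The cookie names, values, and the intranet.example.com URL are placeholders, not details from this thread, and nothing here is a Nutch API. The request is built but never sent.

```python
from urllib.request import Request


def cookie_header(cookies):
    """Join name/value pairs into one Cookie header value ("a=1; b=2")."""
    return "; ".join(f"{name}={value}" for name, value in cookies.items())


def authenticated_request(url, cookies):
    """Build (but do not send) a request carrying the preset cookies."""
    return Request(url, headers={"Cookie": cookie_header(cookies)})


# Placeholder cookie names and values; substitute the ones the site issues.
PRESET_COOKIES = {"JSESSIONID": "abc123", "AUTH": "token-value"}

req = authenticated_request("http://intranet.example.com/start", PRESET_COOKIES)
print(req.get_header("Cookie"))  # JSESSIONID=abc123; AUTH=token-value
```

A crawler that supported preset cookies would have to attach the same header to every fetch, not just the first one.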

-- 
Thanks,

Yoav

Re: How to authenticate with cookies?

Posted by Yoav Shapira <yo...@yoavshapira.com>.
On Thu, May 8, 2008 at 11:14 AM, Andrzej Bialecki <ab...@getopt.org> wrote:
> * you have to use protocol-httpclient. There is no support for cookies in
> protocol-http.

OK, how do I make sure protocol-httpclient is used?

> * your fetchlist needs to have more than 1 url from the host - the first
> request will presumably set the cookies, if you are lucky. ;)

No, the first fetch will ask for authentication.  I want to get past
this point by supplying the cookies myself, that's why I asked the
question ;)

Thanks,

Yoav

Re: How to authenticate with cookies?

Posted by Andrzej Bialecki <ab...@getopt.org>.
POIRIER David wrote:
> Yoav,
> 
> You are right. With the help of the "protocol-httpclient" plugin you
> will be able to use cookies when crawling. There is one thing that you
> need to watch out for, though (quoting Susam Pal): "protocol-httpclient
> does this for a single fetch cycle".
> 
> To be honest I don't exactly know how to define a "fetch cycle". Based
> on my experience it seems that every time the fetcher goes one level
> deeper into a web site it starts a new cycle... or, if it doesn't, I
> still lose the cookie. It might be because of something else, but I
> don't think so.
> 
> If anybody has the answer to that, please let Yoav and me know.

This is correct. It comes from the fact that Nutch doesn't store cookies 
(that's yet another potential use for the planned HostDB functionality). 
This means that in order to accept and use cookies:

* you have to use protocol-httpclient. There is no support for cookies 
in protocol-http.

* your fetchlist needs to have more than 1 url from the host - the first 
request will presumably set the cookies, if you are lucky. ;)

* cookies are accumulated and kept in memory for the duration of the 
current crawl task.
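
On the first point: Nutch chooses its plugins through the plugin.includes property, a regular expression that is normally overridden in conf/nutch-site.xml. Switching protocols means that expression must name protocol-httpclient instead of protocol-http. The Python sketch below is only an illustration of that edit: the helper names and the starting value are hypothetical, and the real default value depends on the Nutch version, so check nutch-default.xml before copying anything.

```python
import re
from xml.sax.saxutils import escape


def use_httpclient(plugin_includes):
    """Swap protocol-http for protocol-httpclient in a plugin.includes regex."""
    # The lookahead matches "protocol-http" only as a whole alternative, so an
    # existing "protocol-httpclient" entry is left untouched.
    return re.sub(r"protocol-http(?![\w-])", "protocol-httpclient", plugin_includes)


def property_xml(name, value):
    """Render one <property> element in the style of nutch-site.xml."""
    return (
        "<property>\n"
        f"  <name>{escape(name)}</name>\n"
        f"  <value>{escape(value)}</value>\n"
        "</property>"
    )


# Hypothetical starting value, shortened for the example.
current = "protocol-http|urlfilter-regex|parse-(text|html)|index-basic"
print(property_xml("plugin.includes", use_httpclient(current)))
```

The sketch does not handle a grouped form such as protocol-(http|file); in that case the expression has to be edited by hand.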


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: How to authenticate with cookies?

Posted by Susam Pal <su...@gmail.com>.
Please see my reply inline.

On Thu, May 8, 2008 at 12:04 PM, POIRIER David
<DP...@cross-systems.com> wrote:
> Yoav,
>
>  You are right. With the help of the "protocol-httpclient" plugin you
>  will be able to use cookies when crawling. There is one thing that you
>  need to watch out for, though (quoting Susam Pal): "protocol-httpclient
>  does this for a single fetch cycle".
>
>  To be honest I don't exactly know how to define a "fetch cycle". Based
>  on my experience it seems that every time the fetcher goes one level
>  deeper into a web site it starts a new cycle... or, if it doesn't, I
>  still lose the cookie. It might be because of something else, but I
>  don't think so.

Yes, that's what I meant. For a crawl at a new depth, a new fetcher
process is invoked. The cookies are not saved between processes. So,
every time the crawl goes one level deeper, the cookies are lost.
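
The effect described above can be reproduced outside Nutch with Python's standard cookie jar. In this rough sketch each "fetcher process" is modelled as a fresh in-memory jar, so the second one starts empty. The final part shows what saving cookies to a file between runs would look like. That persistence step is an assumption added for illustration, not something Nutch does, and the cookie name, value, and host are placeholders.

```python
import os
import tempfile
from http.cookiejar import Cookie, CookieJar, MozillaCookieJar


def session_cookie(name, value, domain):
    """Build a session cookie (no expiry date) for the given host."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=False, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


# Depth 1: the first fetcher run accumulates cookies in memory only.
jar_depth1 = CookieJar()
jar_depth1.set_cookie(session_cookie("SESSIONID", "abc123", "intranet.example.com"))

# Depth 2: a new run starts with a new, empty jar, so the cookie is gone.
jar_depth2 = CookieJar()
print(len(jar_depth1), len(jar_depth2))  # 1 0

# Hypothetical fix: write the cookies to disk and reload them in the next run.
cookie_file = os.path.join(tempfile.mkdtemp(), "cookies.txt")
saved = MozillaCookieJar(cookie_file)
for cookie in jar_depth1:
    saved.set_cookie(cookie)
saved.save(ignore_discard=True)  # session cookies are skipped without this flag

restored = MozillaCookieJar(cookie_file)
restored.load(ignore_discard=True)
print([c.name for c in restored])  # ['SESSIONID']
```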

Regards,
Susam Pal

>
>  If anybody has the answer to that, please let Yoav and me know.
>
>  Thanks,
>
>  David

RE: How to authenticate with cookies?

Posted by POIRIER David <DP...@cross-systems.com>.
Yoav,

You are right. With the help of the "protocol-httpclient" plugin you
will be able to use cookies when crawling. There is one thing that you
need to watch out for, though (quoting Susam Pal): "protocol-httpclient
does this for a single fetch cycle".

To be honest I don't exactly know how to define a "fetch cycle". Based
on my experience it seems that every time the fetcher goes one level
deeper into a web site it starts a new cycle... or, if it doesn't, I
still lose the cookie. It might be because of something else, but I
don't think so.

If anybody has the answer to that, please let Yoav and me know.

Thanks,

David


 


-----Original Message-----
From: yoavshapira@gmail.com [mailto:yoavshapira@gmail.com] On Behalf Of
Yoav Shapira
Sent: Tuesday, 6 May 2008 02:50
To: nutch-user@lucene.apache.org
Subject: How to authenticate with cookies?

Hi,

I'm using Nutch to crawl an intranet site that is behind form
authentication.  I know Nutch doesn't support form authentication yet
(right?), but I think this site would also work with cookies.  I have
the right set of cookie names and values, at least for testing, but I
don't know how to have Nutch use these cookies with every HTTP
request during its crawl.

I saw a reference to a "protocol-httpclient" plugin.  Is that true /
relevant?

Any help on configuring Nutch to use cookies for authentication would
be appreciated.

-- 
Thanks,

Yoav