Posted to user@nutch.apache.org by Dan Fundatureanu <da...@gmail.com> on 2006/03/09 18:02:31 UTC

Indexing a web site over HTTPS using username/passwd

Hi,

Could you point me to where I can find some info about how I can use Nutch to
crawl a website where access is provided only via HTTPS with a
username/password?
Are there any config settings I have to make, or do I have to hack the
code to change this?

Thanks,
Dan Fundatureanu

RE: Indexing a web site over HTTPS using username/passwd

Posted by Richard Braman <rb...@bramantax.com>.
I don't know, Dan, but it's something on my list too.  I rather doubt
that this is a feature in Nutch, because this is generally thought of as
a specialized intelligent agent (IA) capability rather than general
spidering/indexing technology.  It is certainly possible to do, but
there are three problems that need to be addressed to get any IA to do
this deed for you.

First, HTTPS itself has little to do with this; it is just another
protocol that the agent has to support.  I think that Nutch does
support HTTPS, as indicated by several prior posts, but can someone
confirm?
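For what it's worth, and assuming a Nutch build that ships the
protocol-httpclient plugin (I have not checked exactly which releases do,
so treat the plugin list below as illustrative), enabling it in
conf/nutch-site.xml would look something like this:

```xml
<!-- conf/nutch-site.xml: include protocol-httpclient (often cited as the
     plugin that handles https) instead of the plain protocol-http plugin.
     The exact plugin list depends on your installation. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
</property>
```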

The following technology could probably be implemented as a plug-in to
Nutch, but some pretty substantial work would need to be done.

1.  You have to be able to address a login page for a set of content; in
other words, you have to tell your IA "this is the page where I need to
submit the login credentials to gain access."  Your IA must also be able
to look up your credentials (from a database, say) and submit those
credentials as name-value pairs to the server via an HTTP POST (in
most scenarios).

2.  You have to specify a page to begin crawling once your credentials
have been accepted.  You shouldn't rely on the redirect that takes place
to be your start page, as there is probably specific content you are
after.

3.  Your IA must be able to manage the session with the web server.  Most
authentication schemes rely on a flag on the server that indicates that
you are logged in.  If the IA does not resend the session cookie
properly, the web server will think you are logged out.
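Purely as illustration, the three steps above can be sketched in Java with
the standard java.net.http client (not anything from Nutch itself); the
URLs, form field names, and cookie value here are all hypothetical:

```java
import java.net.CookieManager;
import java.net.HttpCookie;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class LoginCrawlSketch {

    // Step 1: encode the credentials as an application/x-www-form-urlencoded body.
    static String formBody(Map<String, String> fields) {
        return fields.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    public static void main(String[] args) {
        Map<String, String> creds = new LinkedHashMap<>();
        creds.put("username", "crawler");   // hypothetical field names --
        creds.put("password", "secret");    // real login forms vary

        // Step 1: POST the name-value pairs to the login page.
        HttpRequest login = HttpRequest.newBuilder()
                .uri(URI.create("https://example.com/login"))   // hypothetical URL
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(formBody(creds)))
                .build();

        // Step 3: a CookieManager holds the session cookie between requests.
        // Wire it in with HttpClient.newBuilder().cookieHandler(cookies).build(),
        // and every subsequent fetch resends the cookie automatically.
        CookieManager cookies = new CookieManager();
        cookies.getCookieStore().add(URI.create("https://example.com/"),
                new HttpCookie("JSESSIONID", "abc123"));   // simulated session id

        // Step 2: crawl from an explicit start page, not the post-login redirect.
        HttpRequest start = HttpRequest.newBuilder()
                .uri(URI.create("https://example.com/protected/index.html"))
                .GET()
                .build();

        System.out.println(login.method() + " " + login.uri());
        System.out.println("cookies stored: " + cookies.getCookieStore().getCookies().size());
        System.out.println(start.uri());
    }
}
```

A real crawler would send the login request through the cookie-aware
client, check that the session cookie came back, and only then walk the
start page.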

Once you have the server's content indexed, when a user of your search
engine clicks on one of the links, she would have to submit her
credentials to gain access anyway.  Automatically including the
credentials originally used to access the content is possible, but it
would probably raise the ire of the sysadmin, and it would also expose
to the world the credentials you used to access the content in the
first place.

I can honestly say Nutch is not the right solution for everything.  If
you are after indexing content behind a wall, it's probably best to use
some code better suited to the task, unless someone has already made a
truly custom hack for Nutch in this area.

I recently dusted off a book I have entitled Programming Bots,
Spiders, and Intelligent Agents in Microsoft Visual C++, by David
Pallmann, which was published way back in 1999.  Surprisingly, not much
has changed since then.  It would be a good read for those aspiring to
know more about the topic.


-----Original Message-----
From: Dan Fundatureanu [mailto:dan.fundatureanu@gmail.com] 
Sent: Thursday, March 09, 2006 12:03 PM
To: nutch-user@lucene.apache.org
Subject: Indexing a web site over HTTPS using username/passwd


Hi,

Could you point me to where I can find some info about how I can use Nutch
to crawl a website where access is provided only via HTTPS
with a username/password? Are there any config settings I have to make,
or do I have to hack the code to change this?

Thanks,
Dan Fundatureanu