Posted to user@nutch.apache.org by cb...@mac.com on 2006/02/09 00:57:22 UTC

Indexing password protected content

Hi All,

I've only recently discovered Lucene / Nutch and I'm extremely impressed with its indexing ability and production of relevant results.

Can anyone give me any guidance or point me at documents that could help me index password protected content?

I have a web site, www.sheerpoetry.co.uk, which is run by four UK poets whose work is studied by children for school / college exams in the UK.  We'd like to use Nutch to index and display search results (by customising the Nutch demo web-app).  Our problem is that the web site content is protected by a Tomcat Security Realm so that only registered, logged-in users can view the content.  This means that the Nutch crawler will somehow have to log in to index the content.  If anyone has any suggestions on how to do this, I'd be most grateful to hear them.

Kind Regards,
Chris

Re: Indexing password protected content

Posted by Ravi Chintakunta <ra...@gmail.com>.
Hi Chris,

Since you maintain this site yourself, you could modify its security
settings so that only the Nutch bot may crawl the site without a
login. To identify the bot, compare the User-Agent header of each HTTP
request against the agent string your crawler sends. You should also
check that the client IP address the request comes from is the IP
address of your crawler machine.
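
The check described above might be sketched roughly as follows. This is
only an illustration, not something Nutch or Tomcat provides: the class
name CrawlerAccessCheck, the agent token "nutch" and the address
192.0.2.10 (a documentation-only address) are all assumptions, to be
replaced with whatever agent string your crawler is configured to send
and the real address of your crawler machine.

```java
import java.util.Locale;
import java.util.Set;

/**
 * Hypothetical helper: decides whether a request looks like it comes
 * from our own crawler, using both the User-Agent header and the
 * client IP address. Both must match.
 */
final class CrawlerAccessCheck {

    // Assumption: the crawler's agent string contains this token.
    private static final String CRAWLER_AGENT_TOKEN = "nutch";

    // Assumption: placeholder address of the crawler machine.
    private static final Set<String> CRAWLER_ADDRESSES = Set.of("192.0.2.10");

    private CrawlerAccessCheck() {
        // static helper, not meant to be instantiated
    }

    /**
     * @param userAgent  value of the request's User-Agent header, may be null
     * @param remoteAddr client IP address as seen by the server, may be null
     * @return true only if the agent string and the address both match
     */
    static boolean isTrustedCrawler(String userAgent, String remoteAddr) {
        if (userAgent == null || remoteAddr == null) {
            return false;
        }
        // Case-insensitive substring match on the agent string.
        boolean agentMatches =
                userAgent.toLowerCase(Locale.ROOT).contains(CRAWLER_AGENT_TOKEN);
        // Exact match on the client address.
        boolean addressMatches = CRAWLER_ADDRESSES.contains(remoteAddr.trim());
        return agentMatches && addressMatches;
    }
}
```

In a servlet-based site the two arguments would typically come from
HttpServletRequest.getHeader("User-Agent") and
HttpServletRequest.getRemoteAddr(). Note that getRemoteAddr() returns
the address of a proxy if the request passes through one, and that
where exactly such a check can be hooked in depends on how the site
enforces its login, which this sketch does not cover.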

Hope this helps.

- Ravi
