Posted to user@nutch.apache.org by Gonçalo Gaiolas <go...@outsystems.com> on 2006/02/22 18:15:17 UTC

Intranet search - some questions

Hi everyone,

I’m new to the Nutch world and I am currently implementing an Intranet
search project using it. The preliminary results have been really good, but
I do have some questions that I’d like to pose:

- Is there any way to perform form-based authentication? I know that this
  is a common request, but I haven’t found a “good-enough” answer to it. The
  only references I’ve found are about basic auth, which I’d prefer to
  avoid. I ask this because I’ve noticed that SearchBlox, which uses Nutch
  internally, has an option to support form-based auth. Was this something
  they developed on their own?

- Another issue I have is authorization support. The intranet I’m working
  on has different security profiles, with sensitive content that must be
  hidden from some users but has to be searchable by others. What is the
  best way to do this? To have an index per profile?

- What is the best reference for implementing incremental indexing? I would
  rather not rebuild my index on every crawl session; I would prefer to
  have it updated incrementally. Is this possible?

- Can the companion web app (the search web app included in the Nutch
  distribution) perform the crawling process too? I ask this because I’ve
  noticed that it includes a nutch-default.xml file. Maybe it uses Quartz
  or something similar to perform asynchronous processing?

- Can Nutch perform stemming?

 

Please feel free to answer only one of these questions at a time. (I know
there are a lot of questions.)

Thanks,

Gonçalo Gaiolas


Re: Intranet search - some questions

Posted by Stefan Groschupf <sg...@media-style.com>.
Hi,

> - Is there any way to perform form-based authentication? I know that this
>   is a common request, but I haven’t found a “good-enough” answer to it.
>   The only references I’ve found are about basic auth, which I’d prefer to
>   avoid. I ask this because I’ve noticed that SearchBlox, which uses Nutch
>   internally, has an option to support form-based auth. Was this something
>   they developed on their own?
I'm no expert on this, but I would say that today this is not possible
without hacking some code.
In general, there is an HTTP client plugin that uses Commons HttpClient.
If it is possible with HttpClient somehow, then it is possible with Nutch
somehow. :-o
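
To make the idea concrete, here is a minimal sketch of what a form-based
login amounts to at the HTTP level: a POST of URL-encoded credentials to the
login page, after which the session cookie from the response is sent with
every later request. This is not Nutch code and it does not use Commons
HttpClient; it uses the JDK's own java.net.http classes (Java 11+) purely
for illustration. The login URL and the field names "username" and
"password" are hypothetical and depend on the intranet's login form.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

class FormLoginSketch {

    // Encode form fields as application/x-www-form-urlencoded, in insertion order.
    static String encodeForm(Map<String, String> fields) {
        StringJoiner body = new StringJoiner("&");
        for (Map.Entry<String, String> field : fields.entrySet()) {
            body.add(URLEncoder.encode(field.getKey(), StandardCharsets.UTF_8)
                    + "="
                    + URLEncoder.encode(field.getValue(), StandardCharsets.UTF_8));
        }
        return body.toString();
    }

    // Build (but do not send) the POST request that submits the login form.
    static HttpRequest loginRequest(String loginUrl, Map<String, String> fields) {
        return HttpRequest.newBuilder(URI.create(loginUrl))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(encodeForm(fields)))
                .build();
    }

    public static void main(String[] args) {
        // Hypothetical field names and URL; take the real ones from the login page.
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("username", "crawler");
        fields.put("password", "p@ss word");

        HttpRequest request = loginRequest("http://intranet.example/login", fields);
        System.out.println(request.method() + " " + request.uri());
        System.out.println(encodeForm(fields));
    }
}
```

To actually log in, the request would be sent with an HttpClient that has a
cookie handler installed, so that the session cookie is kept for the
following fetches. Doing the same inside Nutch would mean changing the HTTP
protocol plugin, which is the "hacking some code" part.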
>
> - Another issue I have is authorization support. The intranet I’m working
>   on has different security profiles, with sensitive content that must be
>   hidden from some users but has to be searchable by others. What is the
>   best way to do this? To have an index per profile?
If you can extract this information from the page or derive it from a
URL pattern, I suggest implementing an indexing filter plugin that
'tags' each document with a profile, something like:
doc.add(Field.Keyword("profile", theProfile));
You also need a query filter, and then you can extend the user query with:
queryString = queryString + " profile:managers";
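
The two steps above (pick a profile per document, then restrict each query
to the user's profile) can be sketched in plain Java without any Nutch or
Lucene classes. Everything here is an assumption made for illustration: the
class name, the URL patterns, the profile names, and the "everyone"
fallback. In a real setup the first method would live in the indexing
filter and the second in the search front end.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

class ProfileFilterSketch {

    // Hypothetical URL-pattern rules, checked in insertion order.
    private static final Map<Pattern, String> RULES = new LinkedHashMap<>();
    static {
        RULES.put(Pattern.compile("^https?://[^/]+/management/.*"), "managers");
        RULES.put(Pattern.compile("^https?://[^/]+/hr/.*"), "hr");
    }

    // Profile assigned to pages that match no rule (an assumption).
    static final String DEFAULT_PROFILE = "everyone";

    // Indexing side: decide which profile value to store with the document.
    static String profileFor(String url) {
        for (Map.Entry<Pattern, String> rule : RULES.entrySet()) {
            if (rule.getKey().matcher(url).matches()) {
                return rule.getValue();
            }
        }
        return DEFAULT_PROFILE;
    }

    // Search side: append a profile clause to what the user typed.
    static String restrictQuery(String userQuery, String profile) {
        return userQuery.trim() + " profile:" + profile;
    }

    public static void main(String[] args) {
        System.out.println(profileFor("http://intranet.example/management/budget.html"));
        System.out.println(restrictQuery("travel policy", "managers"));
    }
}
```

The profile passed to restrictQuery should come from the authenticated
session on the server, not from anything the user can type, otherwise the
restriction is easy to bypass.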

>
> - What is the best reference for implementing incremental indexing? I
>   would rather not rebuild my index on every crawl session; I would
>   prefer to have it updated incrementally. Is this possible?
I'm not sure what you mean. Use the step-by-step crawl commands
instead of the crawl command and merge your indexes together; deduping
is also a good idea.
See the tutorial and wiki for more details.
>
> - Can the companion web app (the search web app included in the Nutch
>   distribution) perform the crawling process too?
No, only command-line support for now.
> I ask this because I’ve noticed that it includes a nutch-default.xml
> file. Maybe it uses Quartz or something similar to perform asynchronous
> processing?
:-) Not yet.
>
> - Can Nutch perform stemming?
Not by default, but if you know Lucene it would be easy to add.

HTH
Stefan