
Posted to user@nutch.apache.org by visava <vi...@hotmail.com> on 2007/05/03 21:05:38 UTC

Re: How to use multiple indexes

One way is to run separate crawls, build a separate index for each, and store
the indexes in different directories, e.g.
/usr/home/idxdir1
/usr/home/idxdir2

You can then use two different nutch-site.xml files (e.g. nutch-site.xml and
nutch-site2.xml). For the first search, use the default nutch-site.xml, point
it to the first index directory, and use the default search.jsp that is
provided.

To search the second index, use nutch-site2.xml and point it to the second
index directory. Then use search2.jsp, which is a copy of search.jsp with the
following modifications.

/*
 Comment out this original line of code and use the code below.
     Configuration nutchConf = NutchConfiguration.get(application);
*/

// getAttribute() returns Object, so a cast is needed.
Configuration nutchConf = (Configuration) application.getAttribute("myconfig");
if (nutchConf == null) {
    nutchConf = new Configuration();
    nutchConf.addDefaultResource("nutch-default.xml");
    nutchConf.addFinalResource("nutch-site2.xml");
    // Cache the configuration so it is built only once per web application.
    application.setAttribute("myconfig", nutchConf);
}

You can extend this idea to as many different indexes as you want. Note that I
have used this with version 0.8. If you look at the source code of
NutchConfiguration.java you will get an idea of how the code above works, and
you can do something similar in other versions if it differs there.
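
To keep the file and attribute names consistent when extending this to more
indexes, a small helper along these lines may be useful. It is only a sketch:
the class name SiteConfigNames and the "myconfigN" attribute naming are
hypothetical, and only nutch-site.xml and nutch-site2.xml come from the
description above.

```java
/** Sketch: derives per-index names following the nutch-site2.xml convention. */
final class SiteConfigNames {
    private SiteConfigNames() {}

    /** Site file for the n-th index: nutch-site.xml for 1, nutch-site2.xml for 2, ... */
    static String siteResource(int n) {
        requirePositive(n);
        return n == 1 ? "nutch-site.xml" : "nutch-site" + n + ".xml";
    }

    /** Hypothetical application attribute name for the n-th cached Configuration. */
    static String attributeName(int n) {
        requirePositive(n);
        return "myconfig" + n;
    }

    private static void requirePositive(int n) {
        if (n < 1) {
            throw new IllegalArgumentException("index number must be >= 1: " + n);
        }
    }
}
```

In a copy of search.jsp these values would take the place of the string
literals passed to addFinalResource() and getAttribute()/setAttribute().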

Maher wrote:
> 
> Hello everybody,
>  
>  I'm building a small document search engine using Nutch 0.7 and Tomcat
> 5.5.16, and I'm wondering whether it can handle 3 different indexes (db),
> one for each of the three types of documents I'm going to crawl, so that I
> can have three independent dbs and search in each of them from a
> single front-end page.
>  
>  The main problem is the path to the index in nutch-site.xml
> (searcher.dir): how can I use 3 different paths?
>  
>  Thanks
> 
> 

-- 
View this message in context: http://www.nabble.com/How-to-use-multiple-indexes-tf1884905.html#a10311229
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: How to use multiple indexes

Posted by ravi chintakunta <ra...@gmail.com>.

ravi chintakunta wrote:
> 
> My reply to this feature of searching multiple indexes with a single
> instance of Nutch has bounced because of an attachment.
> 
> To search multiple indexes with a single instance of Nutch:
> 
> - I modified web.xml to include the paths to various search indexes
> - Modified Nutch.java to read all the indexes and create IndexReaders
> - Modified IndexSearcher.java to handle multiple IndexReaders
> 
> https://issues.apache.org/jira/browse/NUTCH-480 contains the patch.
> 
> In the attached file you will find the patch to the Nutch 0.8 code
> base and also the newly added files:
> 
> - SearchServlet - a servlet that is the web interface for search. This
> is a simplified version of the JSP pages (without the i18n) and outputs
> the results in text, XML or JSON format.
> - SearchConstants - an interface for messages and constants
> 
> Please note that the patch includes the functionality for spell check
> - aka "Did you mean?"
> 
> With this implementation, you may add check boxes to the search page
> for each index that you are hosting for search, by reading the web.xml
> file. With check boxes, the user can narrow or widen the search across
> the indexes. The results page can also display the number of hits in
> each index.
> 
> Hope this helps.
> 
> - Ravi
> 
> 

In reply to another user's question about how this works and whether the patch
can be used with Nutch 1.0, I am posting my explanation here for the benefit
of other users.

I am not sure whether the patch would apply cleanly to Nutch 1.0. I have also
noticed that I wrongly included my SpellChecker code in the patch, which you
may not need. I suggest you modify the code by hand, using the patch as a
guide; I believe the approach should work with Nutch 1.0 too.

To explain how this works: Nutch uses an IndexSearcher to search the index,
and IndexSearcher uses an IndexReader to read the index. An IndexSearcher can
also be built with a MultiReader object, which wraps multiple IndexReaders.
Typically one NutchBean instance has one IndexSearcher, so you will have
multiple NutchBean instances in memory. The NutchBean instances are stored as
attributes of the servlet container's ApplicationContext, keyed by the name
of the index key combination.

You may look at the web.xml file to see how I have created context params
that map each index key name to its index location.

When a user checks a combination of checkboxes for your indexes, you build a
key by concatenating the index keys with : as a separator. You then call a
factory method to get a NutchBean instance. The factory method checks whether
a NutchBean already exists for that index key combination and returns it if
so. If not, it creates a new NutchBean instance and adds it to the
ApplicationContext.
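
The lookup described above can be sketched roughly as follows. This is not the
code from the patch: SearcherCache is a hypothetical name, the type parameter T
stands in for NutchBean, and a ConcurrentHashMap stands in for the application
attributes. Sorting the keys before joining them is my own assumption, so that
the same set of checkboxes always maps to the same instance.

```java
import java.util.Collection;
import java.util.Map;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Sketch of a cache holding one searcher per combination of index keys. */
final class SearcherCache<T> {
    // Stand-in for the application-scope attributes of the servlet container.
    private final Map<String, T> attributes = new ConcurrentHashMap<>();
    // Hypothetical callback that builds a searcher for a combined key.
    private final Function<String, T> factory;

    SearcherCache(Function<String, T> factory) {
        this.factory = factory;
    }

    /** Joins the selected index keys with ':' (sorted, duplicates dropped). */
    static String combinedKey(Collection<String> selectedKeys) {
        return String.join(":", new TreeSet<>(selectedKeys));
    }

    /** Returns the cached instance for this combination, creating it on first use. */
    T get(Collection<String> selectedKeys) {
        return attributes.computeIfAbsent(combinedKey(selectedKeys), factory);
    }
}
```

With this shape, a request for "news" and "docs" and a later request for "docs"
and "news" would both resolve to the key docs:news and share one instance.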

To create a NutchBean instance for a combination of index keys, you split
the combined key, create an IndexReader for each index key, wrap all of the
IndexReaders in a MultiReader object, and then create an IndexSearcher with
the MultiReader object. NutchBean will then work as usual on top of the newly
created MultiReader object.
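
The first half of that step, going from a combined key to the index locations,
might look like the sketch below. IndexLocations is a hypothetical name, and
the Map stands in for the web.xml context params (index key to path); the
actual parameter names are in the patch, not here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Sketch: resolves a combined key such as "docs:news" to index directories. */
final class IndexLocations {
    private IndexLocations() {}

    /**
     * Splits the combined key on ':' and looks up the directory of each part.
     * indexDirs is a stand-in for the context params read from web.xml.
     */
    static List<String> resolve(String combinedKey, Map<String, String> indexDirs) {
        List<String> paths = new ArrayList<>();
        for (String key : combinedKey.split(":")) {
            String dir = indexDirs.get(key);
            if (dir == null) {
                throw new IllegalArgumentException("No index configured for key: " + key);
            }
            paths.add(dir);
        }
        return paths;
    }
}
```

Each returned path would then be opened as an IndexReader, and the readers
passed together to a MultiReader that backs the IndexSearcher; the exact
Lucene calls depend on the version bundled with your Nutch release.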

Thanks,
Ravi Chintakunta

-- 
View this message in context: http://old.nabble.com/How-to-use-multiple-indexes-tp5152770p27090016.html

Re: How to use multiple indexes

Posted by Ravi Chintakunta <ra...@gmail.com>.

My reply to this feature of searching multiple indexes with a single
instance of Nutch has bounced because of an attachment.

To search multiple indexes with a single instance of Nutch:

- I modified web.xml to include the paths to various search indexes
- Modified Nutch.java to read all the indexes and create IndexReaders
- Modified IndexSearcher.java to handle multiple IndexReaders

https://issues.apache.org/jira/browse/NUTCH-480 contains the patch.

In the attached file you will find the patch to the Nutch 0.8 code
base and also the newly added files:

- SearchServlet - a servlet that is the web interface for search. This
is a simplified version of the JSP pages (without the i18n) and outputs
the results in text, XML or JSON format.
- SearchConstants - an interface for messages and constants

Please note that the patch includes the functionality for spell check
- aka "Did you mean?"

With this implementation, you may add check boxes to the search page
for each index that you are hosting for search, by reading the web.xml
file. With check boxes, the user can narrow or widen the search across
the indexes. The results page can also display the number of hits in
each index.

Hope this helps.

- Ravi
