Posted to user@nutch.apache.org by Juho Mäkinen <ju...@gmail.com> on 2005/06/22 08:01:48 UTC

Ideas to fetch different data and use them from single search interface

Hello,

I'm looking for ways to implement a decent search system for our company
intranet. I'd like to index different types of data and use a single
Nutch web interface to search all of them at the same time.

I'd like to index our intranet, our news server (we have a news-to-web
gateway, so the crawler could just walk through this service and thus
index the web pages that represent the news posts), our phonebook
(again, this also has a web gateway), etc.

I think that I should fetch these different types of data into
different segments, but I don't yet know how to do that. Can I name
the segments so that I can clearly see what segment holds intranet and
what holds news data, or can I store the segments in different
subdirectories?

And how should I merge/index the segments into one usable index for
the final searching? I haven't found any good source that clearly
describes what the segments hold, how the indexing system works, etc.,
so I can't design this on my own :(

I also haven't found any resource that describes whether it's
possible to mark certain keywords in search queries so that they
trigger a much higher score under some rules. An example:
The user searches for "news foobar". The word "news" in the query
would trigger a feature whereby pages whose URL matches
"^http://news.intranet.com/.*" get a much higher score, because
the user probably wants to search the news, but the search results
could also return pages from the intranet that describe how the news
system works in our company.

Hope that I'm not asking too much. I have found the current
documentation about Nutch to be quite frustrating, because there are
many important parts missing. I know I could easily write documents
into the wiki, but I don't know enough yet to write anything :(

Thanks in advance,

 Juho Mäkinen

Re: Ideas to fetch different data and use them from single search interface

Posted by Andrzej Bialecki <ab...@getopt.org>.

Juho Mäkinen wrote:
>>Yes, it's possible - you can use an indexing filter (see e.g.
>>index-more) to add special fields to documents. Then, during query
>>processing you can use a query filter to expand the query, adding
>>highly-boosted clauses that would match your "keyword" fields.
> 
> 
> Thanks for the other info, but I couldn't find any documentation
> about the query-more and index-more filters. Could you point me
> somewhere if you know of anything better? (Or explain it to me; I can
> Wiki it later if I get enough info =)

Well, there is really no documentation - I meant you should look into 
the source code for index-more and query-more plugins, they illustrate 
the principle. You could use them as skeletons for your own plugins.


>>Yes, we should improve the docs, no doubt about it...
> 
> I Wiki'ed a FAQ entry about selecting custom configs =)

Very useful, thank you!

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

Re: Ideas to fetch different data and use them from single search interface

Posted by Juho Mäkinen <ju...@gmail.com>.

> > I also haven't found any resource that describes whether it's
> > possible to mark certain keywords in search queries so that they
> > trigger a much higher score under some rules. An example:
> > The user searches for "news foobar". The word "news" in the query
> > would trigger a feature whereby pages whose URL matches
> > "^http://news.intranet.com/.*" get a much higher score, because
> > the user probably wants to search the news, but the search results
> > could also return pages from the intranet that describe how the news
> > system works in our company.
> 
> Yes, it's possible - you can use an indexing filter (see e.g.
> index-more) to add special fields to documents. Then, during query
> processing you can use a query filter to expand the query, adding
> highly-boosted clauses that would match your "keyword" fields.

Thanks for the other info, but I couldn't find any documentation
about the query-more and index-more filters. Could you point me
somewhere if you know of anything better? (Or explain it to me; I can
Wiki it later if I get enough info =)
 

> Yes, we should improve the docs, no doubt about it...
I Wiki'ed a FAQ entry about selecting custom configs =)

 - Juho Mäkinen

Re: Ideas to fetch different data and use them from single search interface

Posted by Andrzej Bialecki <ab...@getopt.org>.

Juho Mäkinen wrote:
> Hello,
> 
> I'm looking for ways to implement a decent search system for our company
> intranet. I'd like to index different types of data and use a single
> Nutch web interface to search all of them at the same time.
> 
> I'd like to index our intranet, our news server (we have a news-to-web
> gateway, so the crawler could just walk through this service and thus
> index the web pages that represent the news posts), our phonebook
> (again, this also has a web gateway), etc.
> 
> I think that I should fetch these different types of data into
> different segments, but I don't yet know how to do that. Can I name
> the segments so that I can clearly see what segment holds intranet and
> what holds news data, or can I store the segments in different
> subdirectories?

You can run several crawls, using different urlfilters, so that the 
results of each crawl will stay "focused" on one of the data sources.

Then, you can use these distinct sets of segments to provide different 
search front-ends, or you can combine them to provide a search over all 
data sources.
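
To illustrate how a urlfilter keeps one crawl "focused", here is a
minimal, self-contained Java sketch of a first-match +/- regex filter.
It is only an illustration and not Nutch's own filter class: the class
name FocusedUrlFilter, the rule strings, and the phonebook host name are
assumptions made up for this example (only news.intranet.com comes from
the question above). It also assumes that a URL matching no rule is
rejected.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Illustrative sketch only: a first-match +/- regex URL filter (not a Nutch class). */
class FocusedUrlFilter {

    /** One rule: accept or reject URLs in which the regex is found. */
    private static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    /** Each rule line is '+' (accept) or '-' (reject) followed by a regex. */
    FocusedUrlFilter(String... ruleLines) {
        for (String line : ruleLines) {
            if (line.isEmpty() || (line.charAt(0) != '+' && line.charAt(0) != '-')) {
                throw new IllegalArgumentException("Rule must start with + or -: " + line);
            }
            rules.add(new Rule(line.charAt(0) == '+', Pattern.compile(line.substring(1))));
        }
    }

    /** The first rule whose regex is found in the URL decides; no match means reject. */
    boolean accepts(String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Rules for a hypothetical "news only" crawl: accept the news host, reject the rest.
        FocusedUrlFilter newsOnly =
                new FocusedUrlFilter("+^http://news\\.intranet\\.com/", "-.");
        System.out.println(newsOnly.accepts("http://news.intranet.com/post/1"));
        System.out.println(newsOnly.accepts("http://phonebook.intranet.com/a"));
    }
}
```

Running one crawl per rule set (one for the intranet, one for news, one
for the phonebook) would then leave each set of segments holding pages
from a single data source.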

> And how should I merge/index the segments into one usable index for
> the final searching? I haven't found any good source that clearly
> describes what the segments hold, how the indexing system works, etc.,
> so I can't design this on my own :(

You could use SegmentMergeTool, or you could just copy these segments
into the searcher.dir location. (If you use a merged index, you would
then have to merge the indexes from the new segments into the master
index; see the "merge" command.)

> 
> I also haven't found any resource that describes whether it's
> possible to mark certain keywords in search queries so that they
> trigger a much higher score under some rules. An example:
> The user searches for "news foobar". The word "news" in the query
> would trigger a feature whereby pages whose URL matches
> "^http://news.intranet.com/.*" get a much higher score, because
> the user probably wants to search the news, but the search results
> could also return pages from the intranet that describe how the news
> system works in our company.

Yes, it's possible - you can use an indexing filter (see e.g. 
index-more) to add special fields to documents. Then, during query 
processing you can use a query filter to expand the query, adding 
highly-boosted clauses that would match your "keyword" fields.
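
The two steps can be sketched in plain Java, without the Nutch plugin
API, just to show the principle. Everything named here is an assumption
for illustration: the class KeywordBoostSketch, the "source" field name,
the boost value of 10, and the field:value^boost notation used to print
the expanded query. A real implementation would put the first method in
an indexing filter and the second in a query filter, using index-more
and query-more as skeletons.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Pattern;

/** Illustrative sketch only: tag documents at index time, boost the tag at query time. */
class KeywordBoostSketch {

    /** Hypothetical mapping from a trigger keyword to the URLs it stands for. */
    private static final Map<String, Pattern> SOURCES = new LinkedHashMap<>();
    static {
        SOURCES.put("news", Pattern.compile("^http://news\\.intranet\\.com/.*"));
    }

    /** Assumed boost for the extra clause; the right value needs tuning. */
    static final float BOOST = 10.0f;

    /** Index-time step: the value to store in a "source" field, or null if none applies. */
    static String sourceTagFor(String url) {
        for (Map.Entry<String, Pattern> entry : SOURCES.entrySet()) {
            if (entry.getValue().matcher(url).matches()) {
                return entry.getKey();
            }
        }
        return null;
    }

    /**
     * Query-time step: keep the user's terms and append one boosted clause
     * for every term that is a known trigger keyword.
     */
    static String expand(String userQuery) {
        String trimmed = userQuery.trim();
        StringBuilder expanded = new StringBuilder(trimmed);
        for (String term : trimmed.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (SOURCES.containsKey(term)) {
                expanded.append(" source:").append(term).append('^').append(BOOST);
            }
        }
        return expanded.toString();
    }

    public static void main(String[] args) {
        System.out.println(sourceTagFor("http://news.intranet.com/post/1"));
        System.out.println(expand("news foobar")); // prints "news foobar source:news^10.0"
    }
}
```

Because the original terms stay in the query and the boosted clause is
only added on top, intranet pages that merely describe the news system
can still match, but pages tagged as news should rank higher.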

> 
> Hope that I'm not asking too much. I have found the current
> documentation about Nutch to be quite frustrating, because there are
> many important parts missing. I know I could easily write documents
> into the wiki, but I don't know enough yet to write anything :(

Yes, we should improve the docs, no doubt about it...

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com