Posted to solr-user@lucene.apache.org by Sönke Goldbeck <go...@avail.de> on 2009/09/30 19:00:39 UTC

Adding data from nutch to a Solr index

Alright, first post to this list and I hope the question
is not too stupid or misplaced ...

What I currently have:
- a nicely working Solr 1.3 index with information about some
entities, e.g. organisations, indexed from an RDBMS. Many of these
entities have a URL pointing at further information, e.g. the
website of an institute or company.

- an installation of Nutch 0.9 with which I can crawl the
URLs that I extract from the RDBMS mentioned above and put
into a seed file

- tutorials about how to put crawled and indexed data from
Nutch 1.0 (which I could install without problems) into a separate
Solr index


What I want:
- combine the indexed information from the RDBMS and the website
in one Solr index, so that I can search both with a single query and
use all the Solr features. E.g. having the following
(example) fields in one document:

<doc>
   <name-from-RDBMS>
   <indexed-content-from-RDBMS>
   <indexed-content-from-website>
   <URL>
   <...>
</doc>

Any input appreciated!

Cheers, Sönke

Re: Adding data from nutch to a Solr index

Posted by Andrzej Bialecki <ab...@getopt.org>.

Sönke Goldbeck wrote:
> Alright, first post to this list and I hope the question
> is not too stupid or misplaced ...
> 
> what I currently have:
> - a nicely working Solr 1.3 index with information about some
> entities e.g. organisations, indexed from an RDBMS. Many of these
> entities have an URL pointing at further information, e.g. the
> website of an institute or company.
> 
> - an installation of nutch 0.9 with which I can crawl for the
> URLs that I can extract from the RDBMS mentioned above and put
> into a seed file
> 
> - tutorials about how to put crawled and indexed data from
> nutch 1.0 (which I could install w/o problems) into a separate
> Solr index
> 
> 
> what I want:
> - combine the indexed information from the RDBMS and the website
> in one Solr index so that I can search both in one and with the
> capability of using all the Solr features. E.g. having the following
> (example) fields in one document:
> 
> <doc>
>   <name-from-RDBMS>
>   <indexed-content-from-RDBMS>
>   <indexed-content-from-website>
>   <URL>
>   <...>
> </doc>

I believe that this kind of document merging is not possible (at least 
not easily) - you have to assemble the whole document before you index 
it in Solr.
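
To illustrate what "assemble the whole document before you index it" could look like, here is a minimal sketch in Python. It assumes the RDBMS row and the crawled page text are both available at indexing time and share the URL as a key. The field names (id, name, db_content, url, web_content), the sample row and the helper build_solr_add are hypothetical and would have to match your own schema.xml. The sketch only builds the XML update message; posting it to the Solr update handler is left out.

```python
import xml.etree.ElementTree as ET


def build_solr_add(db_row, page_text):
    """Combine one RDBMS row and its crawled page text into one <add> message.

    All field names here are assumptions for illustration only.
    """
    fields = {
        "id": db_row["id"],
        "name": db_row["name"],
        "db_content": db_row["description"],
        "url": db_row["url"],
        # Entities without a crawled page get an empty web_content field.
        "web_content": page_text or "",
    }
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", {"name": name})
        field.text = str(value)
    # ElementTree escapes characters such as & and < in the field text.
    return ET.tostring(add, encoding="unicode")


# Hypothetical sample data standing in for the RDBMS and the crawl output.
row = {
    "id": 42,
    "name": "Example Institute",
    "description": "Research on X & Y",
    "url": "http://example.org/",
}
crawled = {"http://example.org/": "Welcome to the Example Institute"}

message = build_solr_add(row, crawled.get(row["url"]))
print(message)
```

The point of the sketch is that the join between the two sources happens before Solr sees anything, so each entity arrives as one complete document.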

If these documents use the same primary key (I guess they do, otherwise
how would you merge them...) then you can do the merging in your
front-end application: it submits the main query to Solr and then, for
each Solr document in the result list, retrieves the matching Nutch
document (using the NutchBean API).
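
The front-end merge described above can be sketched roughly as follows, again in Python. This is only an outline of the control flow: solr_results stands in for the documents returned by the main Solr query, and fetch_nutch_doc is a hypothetical callable standing in for whatever lookup you build on top of the Nutch index (it is not a NutchBean method). The shared key is assumed to be a field named url.

```python
def merge_results(solr_docs, fetch_nutch_doc, key="url"):
    """Attach crawled content to each Solr hit, matching on a shared key.

    fetch_nutch_doc is a hypothetical lookup: it takes the key value and
    returns a dict with a "content" entry, or None if nothing was crawled.
    """
    merged = []
    for doc in solr_docs:
        combined = dict(doc)  # copy, so the original hit stays untouched
        page = fetch_nutch_doc(doc.get(key))
        if page is not None:
            combined["web_content"] = page.get("content", "")
        merged.append(combined)
    return merged


# Hypothetical stand-ins for the Solr result list and the Nutch lookup.
solr_results = [
    {"id": "1", "name": "Example Institute", "url": "http://example.org/"},
    {"id": "2", "name": "No Website Ltd"},
]
nutch_store = {"http://example.org/": {"content": "Welcome to the Example Institute"}}

merged_results = merge_results(solr_results, nutch_store.get)
for hit in merged_results:
    print(hit)
```

Note that with this approach the website text is only displayed alongside the hits; it does not take part in the Solr query itself, so it cannot influence ranking or faceting.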

(The not-so-easy way involves writing a SearchComponent that does the
Nutch lookup part of that process on the Solr side.)

-- 
Best regards,
Andrzej Bialecki     <><