Posted to solr-user@lucene.apache.org by Tkach <nt...@peapod.com> on 2008/04/08 23:01:01 UTC

Solr + Complex Legacy Schema -- Best Practices?

I was just wondering, has anybody dealt with trying to "translate" the data from a big, legacy DB schema to a Solr installation?  What I mean is, our company has (drawn from a big data warehouse) a series of 6 tables A, B, C, D, E, and F of product information that we've currently been making searchable via a home-brew setup.  Since that is kind of error-prone and involves a lot of manual data entry, we're looking at making it searchable via Solr.  I know I've seen postings here where people were dealing with large DB collections, but those were fairly simple 1-to-1 setups, whereas this setup has a bunch of foreign-key constraints and composite IDs.  Is this the kind of thing where best practice would still be to join it all together and index that, or would we be better off trying to work out a multi-core setup?  I should add, we are *not* trying to search the DB live.  We are only looking to do something like dump its contents once a day (via some external script) and index that.

The tables are pretty big (about 17k, 287k, 127k, 50k, 56k, and 29k rows respectively), so a join is not ideal, as you can imagine (~800MB actually), but we can certainly do it if we need to.
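
For illustration only: a once-a-day dump along those lines could write each joined row out in Solr's XML update format and POST it to the /update handler, followed by a <commit/> message to make the new documents visible.  The field names below are made up, since the actual columns of tables A-F aren't shown here.

  <add>
    <doc>
      <!-- one flattened document per joined product row; field names are hypothetical -->
      <field name="id">A123-B456</field>
      <field name="name">Example Camera</field>
      <field name="description">Text pulled from the related tables</field>
    </doc>
  </add>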

Re: Solr + Complex Legacy Schema -- Best Practices?

Posted by Chris Hostetter <ho...@fucit.org>.
: I just was wondering, has anybody dealt with trying to "translate" the 
: data from a big, legacy DB schema to a Solr installation?  What I mean 

there's really no general answer to that question -- it all comes down to 
what you want to query on, and what kinds of results you want to get 
out... if you want your queries to result in lists of "products" then you 
should have one Document per product -- if you want to be able to query on 
the text of user reviews then you need to flatten all the user reviews for 
each product into the Document for each product.  Sometimes you'll want 
two types of Documents ... one Document per product, containing all the 
text of all the user reviews, and one Document per user review, with the 
Product information duplicated in each so you can search for...

  q=reviewtext:solr&fq=doctype:product&fq=producttype:camera

...to get a list of all the products that are cameras that contain the 
word solr in the text of *a* review, or you can search for...

  q=reviewtext:solr&fq=doctype:review&fq=producttype:camera

...to get a list of all the reviews that contain the word solr, and 
are about products that are cameras.  
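
A minimal schema.xml sketch for that two-document-type layout, assuming the stock "string" and "text" field types and taking the field names from the queries above:

  <!-- every document carries a doctype discriminator: "product" or "review" -->
  <field name="doctype" type="string" indexed="true" stored="true"/>
  <field name="producttype" type="string" indexed="true" stored="true"/>
  <!-- on a product document this holds all of its reviews' text, flattened;
       on a review document it holds just that one review's text -->
  <field name="reviewtext" type="text" indexed="true" stored="true" multiValued="true"/>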

Your use cases and goals will be different from everyone else's.


-Hoss


Re: Solr + Complex Legacy Schema -- Best Practices?

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.
800MB does not seem that big. Since all of your 6 tables have product
information, it should not be very difficult to join them together and import
them into one Solr index. Again, all of this depends on what you're
searching on and what you want to display as results.

Have you taken a look at http://wiki.apache.org/solr/DataImportHandler ?
Although it is still in development, we've used it successfully to import
data from a large collection of tables joined together.
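
For instance, a data-config.xml can express the parent/child joins as nested entities, producing one Solr document per row of the root entity.  The driver, table, and column names below are placeholders, since the real schema of tables A-F isn't shown.

  <dataConfig>
    <dataSource driver="com.mysql.jdbc.Driver"
                url="jdbc:mysql://localhost/warehouse" user="solr" password="secret"/>
    <document>
      <!-- root entity: one Solr document per product row -->
      <entity name="product" query="select id, name, producttype from A">
        <field column="id" name="id"/>
        <field column="name" name="name"/>
        <field column="producttype" name="producttype"/>
        <!-- child entity: rows joined on the foreign key are flattened
             into the parent document -->
        <entity name="review"
                query="select reviewtext from F where product_id = '${product.id}'">
          <field column="reviewtext" name="reviewtext"/>
        </entity>
      </entity>
    </document>
  </dataConfig>

Running a full import is then just a request to /dataimport?command=full-import.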

On Wed, Apr 9, 2008 at 2:31 AM, Tkach <nt...@peapod.com> wrote:

> I was just wondering, has anybody dealt with trying to "translate" the
> data from a big, legacy DB schema to a Solr installation?  What I mean is,
> our company has (drawn from a big data warehouse) a series of 6 tables A, B,
> C, D, E, and F of product information that we've currently been making
> searchable via a home-brew setup.  Since that is kind of error-prone and
> involves a lot of manual data entry, we're looking at making it searchable
> via Solr.  I know I've seen postings here where people were dealing with
> large DB collections, but those were fairly simple 1-to-1 setups, whereas
> this setup has a bunch of foreign-key constraints and composite IDs.  Is
> this the kind of thing where best practice would still be to join it all
> together and index that, or would we be better off trying to work out a
> multi-core setup?  I should add, we are *not* trying to search the DB live.
> We are only looking to do something like dump its contents once a day (via
> some external script) and index that.
>
> The tables are pretty big (about 17k, 287k, 127k, 50k, 56k, and 29k rows
> respectively), so a join is not ideal, as you can imagine (~800MB actually),
> but we can certainly do it if we need to.
>



-- 
Regards,
Shalin Shekhar Mangar.