Posted to solr-user@lucene.apache.org by Mari Masuda <ma...@stanford.edu> on 2011/06/16 22:41:32 UTC

getting started

Hello,

I am new to Solr and am in the beginning planning stage of a large project and could use some advice so as not to make a huge design blunder that I will regret down the road.

Currently I have about 10 MySQL databases that store information about different archival collections.  For example, we have data and metadata about a political poster collection, a television program, documents and photographs of and about a famous author, etc.  My job is to work with the staff archivists to come up with a standard metadata template so the 10 databases can be consolidated into one.  

Currently the info in these databases is accessed through 10 different sets of PHP pages that were written a long time ago for PHP 4.  My plan is to write a new Java application that will handle both public display of the info as well as an administrative interface so that staff members can add or edit the records.

I have decided to use Solr as the search mechanism for this project.  Because the info in each of our 10 collections is slightly different (e.g., a record about a poster does not contain duration information, but a record about a TV show does) I was thinking it would be good to separate each collection's index into a separate Solr core so that commits coming from one collection do not bog down the other unrelated collections.  One reservation I have is that eventually we would like to be able to type in "Iraq" and find records across all of the collections at once instead of having to search each collection separately.  Although I don't know anything about it at this stage, I did Google "sharding" after reading someone's recent post on this list and it sounds like that may be a potential answer to my question.  Does anyone have any advice on how I should initially set up Solr for my situation?  I am slowly making my way through the wiki and RTFMing, but I wanted to see what the experts have to say because at this point I don't really know where to start.

Thank you very much,
Mari

Re: getting started

Posted by Jonathan Rochkind <ro...@jhu.edu>.
On 6/16/2011 4:41 PM, Mari Masuda wrote:
> One reservation I have is that eventually we would like to be able to type in "Iraq" and find records across all of the collections at once instead of having to search each collection separately.  Although I don't know anything about it at this stage, I did Google "sharding" after reading someone's recent post on this list and it sounds like that may be a potential answer to my question.

So this kind of stuff can be tricky, but with that eventual requirement
I would NOT put these in separate cores. Sharding isn't (IMO, if someone
disagrees, they will hopefully say so!) a good answer to searching
across entirely different 'schemas', or avoiding frequent-commit issues
-- sharding is really just for scaling/performance when your index gets
very very large. (Which it doesn't sound like yours will be, but you can
deal with that as a separate issue if it becomes so).

If you're going to want to search across all the collections, put them 
all in the same core.  Either in the exact same indexed fields, or using 
certain common indexed fields -- those common ones are the ones you'll 
be able to search across all collections on. It's okay if some 
collections have unique indexed fields too --- documents in the core 
that don't belong to that collection just won't have any terms in that 
indexed field that is only used by a certain collection, no problem. 
(Then you can distribute this single core into shards if you need to for 
performance reasons related to number of documents/size of index).
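
To make the single-core idea concrete, here is a minimal sketch in
plain Java (no Solr involved). The field names "id", "collection",
"title" and "duration", and the sample records, are assumptions made
up for illustration; the substring match only stands in for Solr's
real tokenized search. It shows documents from different collections
living side by side, with "duration" present only on the TV record.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SingleCoreSketch {

    // Build one document from alternating field-name / value arguments.
    static Map<String, String> doc(String... kv) {
        Map<String, String> d = new LinkedHashMap<>();
        for (int i = 0; i + 1 < kv.length; i += 2) {
            d.put(kv[i], kv[i + 1]);
        }
        return d;
    }

    // Hypothetical sample data: every record has the common fields
    // id, collection and title; "duration" exists only on TV records.
    static List<Map<String, String>> sampleCore() {
        return Arrays.asList(
            doc("id", "poster-1", "collection", "posters",
                "title", "Iraq war protest poster"),
            doc("id", "tv-7", "collection", "tv",
                "title", "Report from Iraq", "duration", "00:28:30"),
            doc("id", "author-3", "collection", "author",
                "title", "Letter to a publisher"));
    }

    // Search one field across the whole core. Pass a collection name to
    // restrict the search to it, or null to search every collection.
    static List<String> search(List<Map<String, String>> core,
                               String field, String term, String collection) {
        String needle = term.toLowerCase();
        List<String> ids = new ArrayList<>();
        for (Map<String, String> d : core) {
            boolean inCollection =
                collection == null || collection.equals(d.get("collection"));
            // Documents without the field simply contribute no match.
            String value = d.getOrDefault(field, "");
            if (inCollection && value.toLowerCase().contains(needle)) {
                ids.add(d.get("id"));
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        List<Map<String, String>> core = sampleCore();
        System.out.println(search(core, "title", "Iraq", null));     // all collections
        System.out.println(search(core, "title", "Iraq", "tv"));     // one collection
        System.out.println(search(core, "duration", "00:28", null)); // TV-only field
    }
}
```

The point of the sketch is that searching a common field with no
collection restriction covers every collection at once, while a search
on a collection-specific field just skips the documents that lack it.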

You're right to be thinking about the fact that very frequent commits
can be performance issues in Solr. But separating into different cores
is going to create more problems for yourself (if you want to be able to
search across all collections), in an attempt to solve that one.
(Among other things, not every Solr feature works in a
distributed/sharded environment; it's just a more complicated and
somewhat less mature setup for Solr).

The way I deal with the frequent-commit issue is by NOT doing frequent 
commits to my production Solr. Instead, I use Solr replication to have a 
'master' Solr index that I do commits to whenever I want, and a 'slave' 
Solr index that serves the production searches, and which only 
replicates from master periodically -- not too often to be 
too-frequent-commits.  That seems to be a somewhat common solution, if 
that use pattern works for you.
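
As a rough illustration of that pattern, here is a toy model in plain
Java. It is not Solr's replication code (in Solr this is handled by the
built-in replication feature and its configuration, not by application
code), and the class and method names are made up for the sketch. It
only shows the idea: the master takes commits at any time, while the
slave's searchable view changes only when it polls.

```java
import java.util.ArrayList;
import java.util.List;

public class ReplicationSketch {

    // Stand-in for the 'master' index: accepts commits whenever you like.
    static class Master {
        private final List<String> committed = new ArrayList<>();
        private long version = 0;

        void addAndCommit(String doc) {
            committed.add(doc);
            version++; // every commit produces a new index version
        }

        long version() {
            return version;
        }

        List<String> snapshot() {
            return new ArrayList<>(committed);
        }
    }

    // Stand-in for the 'slave' index that serves production searches.
    static class Slave {
        private List<String> searchable = new ArrayList<>();
        private long version = 0;

        // Pull a new snapshot only if the master has changed.
        // Returns true when the searchable view was replaced.
        boolean poll(Master master) {
            if (master.version() == version) {
                return false;
            }
            searchable = master.snapshot();
            version = master.version();
            return true;
        }

        int visibleDocs() {
            return searchable.size();
        }
    }

    public static void main(String[] args) {
        Master master = new Master();
        Slave slave = new Slave();

        master.addAndCommit("poster-1");
        master.addAndCommit("tv-7");

        System.out.println(slave.visibleDocs()); // commits not visible yet
        System.out.println(slave.poll(master));  // one replication picks up both
        System.out.println(slave.visibleDocs());
        System.out.println(slave.poll(master));  // nothing new on the master
    }
}
```

However many commits hit the master between polls, the slave pays the
cost of a new searchable view once per poll, which is why the polling
interval, not the commit rate, is what production searches feel.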

There are also some "near real time" features in more recent versions of 
Solr, that I'm not very familiar with. (not sure if any are included in 
the current latest release, or if they are all only still in the repo)  
My sense is that they too only work for certain use patterns, they 
aren't magic bullets for "commit whatever you want as often as you want 
to Solr".  In general Solr isn't so great at very frequent major changes 
to the index.   Depending on exactly what sort of use pattern you are 
predicting/planning for your commits, maybe people can give you advice 
on how (or if) to do it.

But I personally don't think your idea of splitting your collections
(which you'll eventually want to search across in a single search)
into shards is a good solution to frequent-commit issues. You'd be
complicating your setup and causing other problems for yourself, and not
really even entirely addressing the too-frequent-commit issue with that
setup.

Re: getting started

Posted by Sascha SZOTT <sz...@gmx.de>.
Hi Mari,

it depends ...

* How many records are stored in your MySQL databases?
* How often will updates occur?
* How many db records / index documents are changed per update?

I would suggest starting with a single Solr core. That way, you can
concentrate on the basics and do not need to deal with more advanced
things like sharding. In case you encounter performance issues later on,
you can switch to a multi-core setup.

-Sascha
