Posted to solr-user@lucene.apache.org by "Olson, Ron" <RO...@lbpc.com> on 2010/10/19 16:39:12 UTC

Documents and cores

Hi all-

I have a newbie design question about documents, especially with SQL databases. I am trying to set up Solr to go against a database that, for example, has "items" and "people". The way I see it, and I don't know if this is right or not (thus the question), both are separate documents: an item may contain a list of parts, which the user may want to search, and, as part of the "item", the user may want to view the list of people who have ordered the item.

Then there's the actual "people", who the user might want to search to find a name and, consequently, what items they ordered. To me they are both "top level" things, with some overlap of fields. If I'm searching for "people", I'm likely not going to be interested in the parts of the item, while if I'm searching for "items" the likelihood is that I may want to search for "42532" which is, in this instance, a SKU, and not get hits on the zip code section of the "people".

Does it make sense, then, to separate these two out as separate documents? I believe so because the documentation I've read suggests that a document should be analogous to a row in a table (in this case, very de-normalized). What is tripping me up is that, as far as I can tell, you can have only one document type per index, and thus one document type per core. So in this example, I would have two cores, "items" and "people". Is this correct? Should I embrace the idea of having many cores, or am I supposed to have a single, unified index with all documents (which Solr doesn't seem to support)?

The ultimate question comes down to the search interface. I don't necessarily want to have the user explicitly state which document they want to search; I'd like them to simply type "42532" and get documents from both cores, and then possibly allow for filtering results after the fact, not before. As I've only used the admin site so far (which is core-specific), does the client API allow for unified searching across all cores? Assuming it does, I'd think my idea of multiple-documents is okay, but I'd love to hear from people who actually know what they're doing. :)

Thanks,

Ron

DISCLAIMER: This electronic message, including any attachments, files or documents, is intended only for the addressee and may contain CONFIDENTIAL, PROPRIETARY or LEGALLY PRIVILEGED information. If you are not the intended recipient, you are hereby notified that any use, disclosure, copying or distribution of this message or any of the information included in or with it is unauthorized and strictly prohibited. If you have received this message in error, please notify the sender immediately by reply e-mail and permanently delete and destroy this message and its attachments, along with any copies thereof. This message does not create any contractual obligation on behalf of the sender or Law Bulletin Publishing Company.
Thank you.

Re: Documents and cores

Posted by Erick Erickson <er...@gmail.com>.

This is something most everybody has to get over when transitioning from the
DB world to Solr/Lucene. The schema describes the #possible# fields in the
document. There is absolutely no requirement that #every# document in the
index have all of these fields (unless #you# define it so with
<field ..... required="true">).

Solr will happily index documents that have fields missing, so feel free...
You should be able to define your people and parts documents as you
choose, with perhaps some common fields.
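
To make the point above concrete, here is a minimal sketch in plain Python
(it is not Solr code and does not talk to Solr). It models a schema that lists
the possible fields and marks only one of them as required. The field names
(id, doc_type, name, zip, sku, parts) and the helper problems() are
assumptions invented for illustration; only the idea of "people" and "item"
documents comes from the thread.

```python
# Hypothetical schema: every field a document MAY carry, mapped to whether it
# is required. This only mimics the idea of <field ... required="true">.
SCHEMA = {
    "id": True,         # required on every document
    "doc_type": False,  # assumed common field: "person" or "item"
    "name": False,      # used by people documents only
    "zip": False,       # used by people documents only
    "sku": False,       # used by item documents only
    "parts": False,     # used by item documents only
}


def problems(doc):
    """Return a list of problems; an empty list means the document is fine."""
    issues = [f"unknown field: {f}" for f in doc if f not in SCHEMA]
    issues += [
        f"missing required field: {f}"
        for f, required in SCHEMA.items()
        if required and f not in doc
    ]
    return issues


# Two differently shaped documents that both fit the same schema.
person = {"id": "p1", "doc_type": "person", "name": "ralph", "zip": "42532"}
item = {"id": "i1", "doc_type": "item", "sku": "42532", "parts": ["bolt", "nut"]}

print(problems(person))
print(problems(item))
print(problems({"name": "no id here"}))
```

In this sketch both documents pass even though neither carries every field;
only the document without an id is flagged.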

You'll have to take some care not to form queries like name:ralph AND
sku:12345, assuming that the name field is only in people and sku only in
parts: no single document has both fields, so such a query matches nothing.
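
The pitfall can be illustrated with a small simulation in plain Python. It
does not use Solr; the matches() helper is a deliberately naive stand-in for
exact-match field clauses, and the doc_type field is an assumed common field
that is not part of the original post. The last lines build a query string
with Solr's q and fq request parameters to show how a result set could be
narrowed to one kind of document after the fact.

```python
from urllib.parse import urlencode

# Two documents in one index: a person has no sku, an item has no name.
docs = [
    {"id": "p1", "doc_type": "person", "name": "ralph", "zip": "42532"},
    {"id": "i1", "doc_type": "item", "sku": "12345"},
]


def matches(doc, clauses, require_all):
    """Naive exact-match check of (field, value) clauses against one document."""
    hits = [doc.get(field) == value for field, value in clauses]
    return all(hits) if require_all else any(hits)


clauses = [("name", "ralph"), ("sku", "12345")]

# AND: both clauses must hold for the SAME document, which never happens here.
and_hits = [d["id"] for d in docs if matches(d, clauses, require_all=True)]
# OR: either clause is enough, so both documents come back.
or_hits = [d["id"] for d in docs if matches(d, clauses, require_all=False)]

print(and_hits)  # []
print(or_hits)   # ['p1', 'i1']

# Query string for a single search box, narrowed afterwards with a filter
# query on the hypothetical doc_type field.
params = urlencode({"q": "name:ralph OR sku:12345", "fq": "doc_type:item"})
print(params)
```

Whether a type field like doc_type is the right way to separate results is a
design choice; it is shown here only as one possible approach.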

Do continue down the path of de-normalization. That's another thing most DB
folks don't want to do. Each document you index should contain all the data
you need. The moment you find yourself asking "how do I do a join" you should
stop and consider further de-normalization.
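
As a rough sketch of what de-normalizing could look like, the Python below
flattens three relational tables into self-contained documents using an
in-memory SQLite database. The table layout, the sample rows, and the field
names (doc_type, ordered_by, ordered_skus) are all made up for illustration;
the original post does not describe its actual database. The join happens
once, at indexing time, so that no join is needed at search time.

```python
import sqlite3

# Hypothetical normalized source data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT, zip TEXT);
    CREATE TABLE items  (id INTEGER PRIMARY KEY, sku TEXT, title TEXT);
    CREATE TABLE orders (person_id INTEGER, item_id INTEGER);
    INSERT INTO people VALUES (1, 'Ralph', '60601'), (2, 'Alice', '42532');
    INSERT INTO items  VALUES (10, '42532', 'Widget'), (11, '99999', 'Gadget');
    INSERT INTO orders VALUES (1, 10), (2, 10), (2, 11);
""")


def item_docs(conn):
    """One document per item, carrying the names of everyone who ordered it."""
    docs = []
    rows = conn.execute("SELECT id, sku, title FROM items ORDER BY id").fetchall()
    for item_id, sku, title in rows:
        buyers = [r[0] for r in conn.execute(
            "SELECT p.name FROM people p JOIN orders o ON o.person_id = p.id "
            "WHERE o.item_id = ? ORDER BY p.name", (item_id,))]
        docs.append({"id": f"item-{item_id}", "doc_type": "item",
                     "sku": sku, "title": title, "ordered_by": buyers})
    return docs


def person_docs(conn):
    """One document per person, carrying the SKUs of everything they ordered."""
    docs = []
    rows = conn.execute("SELECT id, name, zip FROM people ORDER BY id").fetchall()
    for person_id, name, zip_code in rows:
        skus = [r[0] for r in conn.execute(
            "SELECT i.sku FROM items i JOIN orders o ON o.item_id = i.id "
            "WHERE o.person_id = ? ORDER BY i.sku", (person_id,))]
        docs.append({"id": f"person-{person_id}", "doc_type": "person",
                     "name": name, "zip": zip_code, "ordered_skus": skus})
    return docs


for doc in item_docs(conn) + person_docs(conn):
    print(doc)
```

Each resulting dictionary holds everything needed to display a hit, at the
cost of repeating names and SKUs across documents.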

HTH
Erick



Re: Documents and cores

Posted by Chris Hostetter <ho...@fucit.org>.

: Subject: Documents and cores
: References: <4C...@atcult.it>
:  <AA...@mail.gmail.com>
: In-Reply-To: <AA...@mail.gmail.com>

http://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to 
an existing message, instead start a fresh email.  Even if you change the 
subject line of your email, other mail headers still track which thread 
you replied to and your question is "hidden" in that thread and gets less 
attention.   It makes following discussions in the mailing list archives 
particularly difficult.
See Also:  http://en.wikipedia.org/wiki/User:DonDiego/Thread_hijacking



-Hoss