Posted to solr-user@lucene.apache.org by Joachim Martin <jm...@path-works.com> on 2006/09/19 21:57:34 UTC

relational design in solr?

I am trying to integrate solr search results with results from an RDBMS
query.  It's working OK, but it's fairly complicated due to the large size
of the results from the database, and many different sort requirements.

I know that solr/lucene was not designed to intelligently handle 
multiple document types in the same collection, i.e. provide join 
features, but I'm wondering if anyone on this list has any thoughts on 
how to do it in lucene, and how it might be integrated into a custom 
solr deployment.  I can't see going back to vanilla lucene after solr!

My basic idea is to add an objType field that would be used to define a
"table".  There would be one main objType; any related objTypes would
have a field pointing back to the main objs via id, like a foreign key.

I'd run multiple parallel searches and merge the results based on 
foreign keys, either using a Filter or just using custom code.  I'm 
anticipating that iterating through the results to retrieve the foreign 
key values will be too slow.
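
As a rough illustration of the merge step described above, here is a minimal
Python sketch. It is not Solr or Lucene code: the result lists are plain
in-memory dicts, and the names parent_id, merge_by_foreign_key and the objType
values are hypothetical, chosen only for this example.

```python
def merge_by_foreign_key(main_hits, related_hit_lists, fk_field="parent_id"):
    """Keep the main hits referenced by at least one hit in EVERY related list.

    main_hits: hits for the main objType, each with an "id".
    related_hit_lists: one list of hits per related objType search; each hit
    carries a foreign key (fk_field, a hypothetical name) to a main id.
    """
    surviving = {doc["id"] for doc in main_hits}
    for hits in related_hit_lists:
        # Intersect with the set of main ids this related search points at.
        surviving &= {doc[fk_field] for doc in hits}
    # Preserve the original ordering of the main result list.
    return [doc for doc in main_hits if doc["id"] in surviving]


# Hypothetical results from three parallel searches.
main_hits = [
    {"id": "m1", "objType": "main"},
    {"id": "m2", "objType": "main"},
]
time_hits = [{"id": "t7", "objType": "time", "parent_id": "m1"}]
place_hits = [
    {"id": "p3", "objType": "place", "parent_id": "m1"},
    {"id": "p4", "objType": "place", "parent_id": "m2"},
]

merged = merge_by_foreign_key(main_hits, [time_hits, place_hits])
merged_ids = [doc["id"] for doc in merged]
```

Only m1 survives here, because m2 has no matching "time" hit. Note that the
sketch has to read the foreign key of every related hit, which is exactly the
iteration cost anticipated above.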

Our data is highly textual, temporal and spatial, which pretty much 
correspond to the 3 tables I would have.  I can de-normalize a lot of 
the data, but the combination of times, locations and textual 
representations would be way too large to fully flatten.

I'm about to start experimenting with different strategies, and I would 
appreciate any insight anyone can provide.  Would the faceting code help 
here somehow?

Thanks --Joachim






Re: relational design in solr?

Posted by Chris Hostetter <ho...@fucit.org>.
: The best example I can think of is a resume database.  You could
: certainly just put the whole resume document into the text index and
: do full text searches.  But to answer the question of what people
: received a Harvard MBA in the last 10 years and have worked at Intel
: in the last 5 years, you have to correlate the years of attendance
: with the schoolName entry.  Otherwise you might be getting years for
: some other education/work history entry.

ah ... I see what you are describing. Yes, correlating matches on separate
fields is definitely a non-trivial problem in Solr/Lucene ...

for simple things like "has an MBA from Harvard", position gaps and
span queries can be used to distinguish people who actually have an MBA
from Harvard from those who have a BA from Harvard and an MBA from
somewhere else ... but the numeric tests start to make things tricky.  One
solution is to put all aspects of the criteria that are *not* numeric into
a dynamic field name and query on it, a la...

   school_mba_harvard:[1996 TO *]

...but that's not a super generalized solution.  It is however the best
solution I can offer :)
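
To make the dynamic field idea concrete, here is a minimal Python sketch of
how such field names could be generated at index time. It assumes the schema
declares a matching dynamic field pattern (something like school_*); the
helper names slug, education_fields and matches_range are hypothetical, and
matches_range only imitates the open-ended range query in memory.

```python
import re


def slug(text):
    """Lowercase and collapse non-alphanumerics so the value is field-name safe."""
    return re.sub(r"[^a-z0-9]+", "_", text.lower()).strip("_")


def education_fields(records):
    """Fold the non-numeric criteria (degree, school) into the field name,
    leaving only the numeric year as the field value."""
    fields = {}
    for rec in records:
        name = f"school_{slug(rec['degree'])}_{slug(rec['school'])}"
        fields.setdefault(name, []).append(rec["year"])
    return fields


def matches_range(fields, name, low):
    """In-memory stand-in for a query like name:[low TO *]."""
    return any(year >= low for year in fields.get(name, []))


doc_fields = education_fields([
    {"degree": "MBA", "school": "Harvard", "year": 1998},
    {"degree": "BA", "school": "Yale", "year": 1990},
])

harvard_mba_since_1996 = matches_range(doc_fields, "school_mba_harvard", 1996)
yale_mba_since_1996 = matches_range(doc_fields, "school_mba_yale", 1996)
```

Because the degree and school are baked into the field name, the year can
only ever be compared against the record it belongs to. The cost is one field
per degree/school combination, which is why this does not generalize well.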

: By adding an objType field and combining search results, you can be
: sure that the year/schoolName query matched a unique education record.
: The tricky bit is in getting a list of field values (e.g. foreign keys,
: which are essentially facets) for a result set very quickly.
	...
: We'll see.  I have my doubts that it will work for any but the smallest
: of collections, which ours certainly isn't.

I think you'll find that you are right about that.  What you're describing
seems almost more like a Prolog-ish, rule-based knowledge engine type of
problem.




-Hoss


Re: relational design in solr?

Posted by Joachim Martin <jm...@path-works.com>.
Chris,

I think what I am trying to do is actually much simpler than what you
are talking about here.  I do plan on returning document ids and
retrieving full entity data from the database; solr would just be used
for the search, not for results display.

The problem is that some data cannot be "flattened", for example when a
document has repeating fields that are complex types, such as address.

The best example I can think of is a resume database.  You could
certainly just put the whole resume document into the text index and do
full text searches.  But to answer the question of what people received
a Harvard MBA in the last 10 years and have worked at Intel in the last
5 years, you have to correlate the years of attendance with the
schoolName entry.  Otherwise you might be getting years for some other
education/work history entry.
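
The false match described above can be reproduced with a small in-memory
Python sketch. Nothing here is Solr API: flatten, flat_match and record_match
are hypothetical helpers, and the resume data is made up for illustration.

```python
# One person with two education entries: a Harvard BA and a Stanford MBA.
resume = {
    "id": "p1",
    "education": [
        {"school": "Harvard", "degree": "BA", "year": 1985},
        {"school": "Stanford", "degree": "MBA", "year": 2001},
    ],
}


def flatten(person):
    """Flattened form: each attribute becomes an independent multi-valued
    field, so the link between values of the same entry is lost."""
    return {
        "id": person["id"],
        "school": [e["school"] for e in person["education"]],
        "degree": [e["degree"] for e in person["education"]],
        "year": [e["year"] for e in person["education"]],
    }


def flat_match(doc, school, degree, min_year):
    """Each clause is tested against its own field, independently."""
    return (
        school in doc["school"]
        and degree in doc["degree"]
        and any(y >= min_year for y in doc["year"])
    )


def record_match(person, school, degree, min_year):
    """All clauses must hold for the SAME education entry."""
    return any(
        e["school"] == school and e["degree"] == degree and e["year"] >= min_year
        for e in person["education"]
    )


flat_hit = flat_match(flatten(resume), "Harvard", "MBA", 1996)
record_hit = record_match(resume, "Harvard", "MBA", 1996)
```

The flattened document matches "Harvard MBA since 1996" even though no single
entry satisfies all three conditions, while the per-record check correctly
rejects it.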

By adding an objType field and combining search results, you can be sure
that the year/schoolName query matched a unique education record.  The
tricky bit is in getting a list of field values (e.g. foreign keys,
which are essentially facets) for a result set very quickly.

If this can be done, figuring out a generic way of specifying multiple
searches and relationships between result sets (without reinventing SQL)
becomes the challenge.

We'll see.  I have my doubts that it will work for any but the smallest
of collections, which ours certainly isn't.

Thanks --Joachim



Re: relational design in solr?

Posted by Chris Hostetter <ho...@fucit.org>.
While it's certainly possible to "join" the results of multiple indexes, I
would do so only when absolutely necessary -- in my experience the only
time I've found that it makes sense is when one aspect of the data
changes extremely rapidly compared to everything else, making complex
reindexing a pain, while reindexing just the changed data in its own index
is a lot more feasible.

As a rule of thumb, when building "paginated" style search applications, I
would advise people to try to flatten their index as much as possible, so
that the application can do one "user query" (based on the user's input)
to get a single page of results, and then use the uniqueKeys from that
page of results to look up ancillary data from any other indexes (or
databases) that you need -- the key being that all the data you want to
search on, and all the data you need to sort on, are in the index, but other
data you need to return to the user can come from other sources.
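
The two-step pattern in the paragraph above can be sketched in a few lines of
Python. This is an in-memory stand-in, not Solr code: search_page, hydrate,
the index list and the store dict are all hypothetical, with the dict playing
the role of the external database.

```python
def search_page(index, predicate, sort_key, page, page_size):
    """Step 1: a single query against the flattened index. Matching and
    sorting use only indexed fields; only uniqueKeys are returned."""
    hits = sorted((doc for doc in index if predicate(doc)), key=sort_key)
    start = page * page_size
    return [doc["id"] for doc in hits[start:start + page_size]]


def hydrate(ids, store):
    """Step 2: look up display-only data for just this page of ids."""
    return [store[doc_id] for doc_id in ids]


# Flattened index: everything needed to match and sort, nothing more.
index = [
    {"id": "a", "text": "solr join", "year": 2004},
    {"id": "b", "text": "lucene filter", "year": 2006},
    {"id": "c", "text": "solr facet", "year": 2005},
]
# Ancillary data kept elsewhere (for example, in a database).
store = {
    "a": {"title": "Doc A"},
    "b": {"title": "Doc B"},
    "c": {"title": "Doc C"},
}

page_ids = search_page(
    index,
    predicate=lambda doc: "solr" in doc["text"],
    sort_key=lambda doc: -doc["year"],  # newest first
    page=0,
    page_size=10,
)
rows = hydrate(page_ids, store)
```

The external lookup touches only one page of ids, so its cost stays bounded
no matter how large the full result set is.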

If you find yourself wanting to "join" two indexes for the purposes of
matching or sorting, the amount of work you wind up doing tends to be
prohibitive on really large indexes -- and if your indexes aren't that
large, it would probably just be easier to put everything in one index and
rebuild it frequently.


-Hoss