You are viewing a plain text version of this content. The canonical link for it is here.

Posted to java-user@lucene.apache.org by Frank DeRose <fd...@guidewire.com> on 2012/06/29 21:13:59 UTC

Join support across multiple document types in Lucene

All,

I have a question about join support across multiple document types in Solr/Lucene. Let me lay out the use case.

Suppose I have 3 tables:

* Table A has 3 columns, id, a1, a2.

* Table B has 4 columns, id, b1, b2, and aid, which is a foreign key referencing A.id.

* Table C has 4 columns, id, c1, c2, and aid, which is a foreign key referencing A.id.

I want to be able to perform the following searches:

* Search for rows in A by specifying just values for columns in A. For example,

select * from A where A.a1 = 'value'

* Search for rows in A by specifying just values for columns in B or C or both. For example,

select A.*, B.* from A, B where B.b1 = 'value' and B.aid = A.id
select A.*, C.* from A, C where C.c1 = 'value' and C.aid = A.id
select A.*, B.*, C.* from A, B, C where B.b1 = 'value' and B.aid = A.id and C.c1 = 'value' and C.aid = A.id

Suppose that I want to store the data from A, B, and C in Solr/Lucene. How would I perform these searches in a Solr/Lucene environment?

It seems that there are two possible approaches:

1.) Denormalize all data into one document. That is, my query in data-config.xml for doing a full-import would be:

select A.id, A.a1, A.a2, B.b1, B.b2, C.c1, C.c2 from A inner join B on B.aid = A.id inner join C on C.aid = A.id

I believe this means that the number of documents in my Lucene index will be on the order of the product:

cardinality(A) * cardinality(B) * cardinality(C)

This will result in a large amount of redundant data in my index.

2.) Store the data from each table into a separate document, say, docA, docB, docC. This would require me to perform three separate searches and to join the result based on the A.id, B.aid, C.aid columns.

I am (dimly?) aware that the Solr/Lucene community is working on various solutions to this problem. For example, I've read Mike McCandless' description of the BlockJoinQuery<http://blog.mikemccandless.com/2012/01/searching-relational-content-with.html>. This approach does not seem to solve our problem since (unless I am mistaken) the query requires at least one predicate to be specified for parent entity (A in my example). We, on the other hand, want to be able to have the ability to perform searches where only predicates for the child entities (B and C in my example) are specified. To give a concrete example, Table A might be a Claim table and Table B might be a Contact table and we want to search for Claims based on Contact info, for example: search for all claims where the lastName of a Contact matches 'DeRose'. Is my analysis correct? That is, is BlockJoinQuery only unidirectional from parent to child?

On the other hand, Lucene "query time joining" discussed here<http://www.searchworkings.org/blog/-/blogs/query-time-joining-in-lucene> seems to address our problem. The following paragraph seems to imply that queries can be specified in terms of data contained in the child documents:

You could also change the example and give all articles that match with a certain comment query. In this example the multipleValuesPerDocument is set to false and the fromField (the id field) only contains one value per document. However, the example would still work if multipleValuesPerDocument variable were set to true, but it would then work in a less efficient manner.

That is, Lucene "query time joining" is bidirectional. Of course, this begs the question: How efficient are these queries. The reason why we thought about moving these queries from our RDBMS to Solr/Lucene is because executing equivalent queries in the RDBMS sometimes produced pathological worst-case behavior (queries taking 10's of minutes). Are we going to encounter the same problems in Solr/Lucene?

So, any comments on the correctness of my analysis and any pointers to applicable resources that discuss this problem are appreciated.

_________________________________________
Frank DeRose
Guidewire Software | Senior Software Engineer
Cell: 510 -589-0752
fderose@guidewire.com<ma...@guidewire.com> | www.guidewire.com<http://www.guidewire.com/>
Deliver insurance your way with flexible core systems from Guidewire.

Re: Join support across multiple document types in Lucene

Posted by Erick Erickson <er...@gmail.com>.

Performance here is an issue. The performance of Solr/Lucene's query joins is a
function of the number of _unique_ values in the fields being joined, and can
be unacceptably slow in the wild, it depends on your use-case.

denormalization is the usual  approach, but that gives DB folks the hives. But
how many documents would that amount to? You haven't said how big
things are, and
Solr handles very large numbers of documents reasonably well. There
are some other
techniques, but it largely depends on _why_ you're doing joins. What's
the use-case
here? There may be clever indexing you can do that will ameliorate the problem.

For instance, some people have used a hybrid approach, using Solr to do the
searching part on the textual data and pulling the joined bits from the DB. This
can make sense and it can be totally idiotic as a proposal It Depends (tm)....

Best
Erick

On Fri, Jun 29, 2012 at 3:13 PM, Frank DeRose <fd...@guidewire.com> wrote:
> All,
>
> I have a question about join support across multiple document types in Solr/Lucene. Let me lay out the use case.
>
> Suppose I have 3 tables:
>
>
> *         Table A has 3 columns, id, a1, a2.
>
> *         Table B has 4 columns, id, b1, b2, and aid, which is a foreign key referencing A.id.
>
> *         Table C has 4 columns, id, c1, c2, and aid, which is a foreign key referencing A.id.
>
> I want to be able to perform the following searches:
>
>
> *         Search for rows in A by specifying just values for columns in A. For example,
>
> select * from A where A.a1 = 'value'
>
>
> *         Search for rows in A by specifying just values for columns in B or C or both. For example,
>
> select A.*, B.* from A, B where B.b1 = 'value' and B.aid = A.id
> select A.*, C.* from A, C where C.c1 = 'value' and C.aid = A.id
> select A.*, B.*, C.* from A, B, C where B.b1 = 'value' and B.aid = A.id and C.c1 = 'value' and C.aid = A.id
>
> Suppose that I want to store the data from A, B, and C in Solr/Lucene. How would I perform these searches in a Solr/Lucene environment?
>
> It seems that there are two possible approaches:
>
>
> 1.)     Denormalize all data into one document. That is, my query in data-config.xml for doing a full-import would be:
>
> select A.id, A.a1, A.a2, B.b1, B.b2, C.c1, C.c2 from A inner join B on B.aid = A.id inner join C on C.aid = A.id
>
> I believe this means that the number of documents in my Lucene index will be on the order of the product:
>
> cardinality(A) * cardinality(B) * cardinality(C)
>
> This will result in a large amount of redundant data in my index.
>
>
> 2.)    Store the data from each table into a separate document, say, docA, docB, docC. This would require me to perform three separate searches and to join the result based on the A.id, B.aid, C.aid columns.
>
> I am (dimly?) aware that the Solr/Lucene community is working on various solutions to this problem. For example, I've read Mike McCandless' description of the BlockJoinQuery<http://blog.mikemccandless.com/2012/01/searching-relational-content-with.html>. This approach does not seem to solve our problem since (unless I am mistaken) the query requires at least one predicate to be specified for parent entity (A in my example). We, on the other hand, want to be able to have the ability to perform searches where only predicates for the child entities (B and C in my example) are specified. To give a concrete example, Table A might be a Claim table and Table B might be a Contact table and we want to search for Claims based on Contact info, for example: search for all claims where the lastName of a Contact matches 'DeRose'. Is my analysis correct? That is, is BlockJoinQuery only unidirectional from parent to child?
>
> On the other hand, Lucene "query time joining" discussed here<http://www.searchworkings.org/blog/-/blogs/query-time-joining-in-lucene> seems to address our problem. The following paragraph seems to imply that queries can be specified in terms of data contained in the child documents:
>
> You could also change the example and give all articles that match with a certain comment query. In this example the multipleValuesPerDocument is set to false and the fromField  (the id field) only contains one value per document. However, the example would still work if multipleValuesPerDocument  variable were set to true, but it would then work in a less efficient manner.
>
> That is, Lucene "query time joining" is bidirectional. Of course, this begs the question: How efficient are these queries. The reason why we thought about moving these queries from our RDBMS to Solr/Lucene is because executing equivalent queries in the RDBMS sometimes produced pathological worst-case behavior (queries taking 10's of minutes). Are we going to encounter the same problems in Solr/Lucene?
>
> So, any comments on the correctness of my analysis and any pointers to applicable resources that discuss this problem are appreciated.
>
> F
>
> _________________________________________
> Frank DeRose
> Guidewire Software | Senior Software Engineer
> Cell: 510 -589-0752
> fderose@guidewire.com<ma...@guidewire.com> | www.guidewire.com<http://www.guidewire.com/>
> Deliver insurance your way with flexible core systems from Guidewire.
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org