You are viewing a plain text version of this content. The canonical link for it is here.
Posted to solr-user@lucene.apache.org by Gustav <xb...@sharklasers.com> on 2013/04/25 16:17:16 UTC

What is the difference between a Join Query and Embedded Entities in Solr DIH?

Hello guys, i saw this thread on stackoverflow, but still not satisfied with
the answers. 

I am trying to index data across multiple tables using Solr's Data Import
Handler. The official wiki on the DIH suggests using embedded entities to
link multiple tables like so:

<document>
    <entity name="item" pk="id" query="SELECT * FROM item">
        <entity name="member" pk="memberid" query="SELECT * FROM member
WHERE memberid='${item.memberid}'>
        </entity>
    </entity>
</document>

Another way that works is:

<document>
    <entity name="item" pk="id" query="SELECT * FROM item INNER JOIN member
ON item.memberid=member.memberid">
    </entity>
</document>

Are these two methods functionally different? Is there a performance
difference?

Another though would be that, if using join tables in MySQL, using the SQL
query method with multiple joins could cause multiple documents to be
indexed instead of one.




--
View this message in context: http://lucene.472066.n3.nabble.com/What-is-the-difference-between-a-Join-Query-and-Embedded-Entities-in-Solr-DIH-tp4058923.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: What is the difference between a Join Query and Embedded Entities in Solr DIH?

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
I think JOIN is more performant as - by default - DIH will run an
inner query for each outer one. You can use cached source, but JOIN
will be still more efficient.

The nested entities are more useful when the sources are heterogeneous
(e.g. DB and XML) or when you need to do custom transformers in
between.

Regards,
   Alex.
Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
book)


On Thu, Apr 25, 2013 at 10:17 AM, Gustav <xb...@sharklasers.com> wrote:
> Hello guys, i saw this thread on stackoverflow, but still not satisfied with
> the answers.
>
> I am trying to index data across multiple tables using Solr's Data Import
> Handler. The official wiki on the DIH suggests using embedded entities to
> link multiple tables like so:
>
> <document>
>     <entity name="item" pk="id" query="SELECT * FROM item">
>         <entity name="member" pk="memberid" query="SELECT * FROM member
> WHERE memberid='${item.memberid}'>
>         </entity>
>     </entity>
> </document>
>
> Another way that works is:
>
> <document>
>     <entity name="item" pk="id" query="SELECT * FROM item INNER JOIN member
> ON item.memberid=member.memberid">
>     </entity>
> </document>
>
> Are these two methods functionally different? Is there a performance
> difference?
>
> Another though would be that, if using join tables in MySQL, using the SQL
> query method with multiple joins could cause multiple documents to be
> indexed instead of one.
>
>
>
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/What-is-the-difference-between-a-Join-Query-and-Embedded-Entities-in-Solr-DIH-tp4058923.html
> Sent from the Solr - User mailing list archive at Nabble.com.

Re: What is the difference between a Join Query and Embedded Entities in Solr DIH?

Posted by Shawn Heisey <so...@elyograg.org>.
On 4/25/2013 8:17 AM, Gustav wrote:
> Are these two methods functionally different? Is there a performance
> difference?
> 
> Another though would be that, if using join tables in MySQL, using the SQL
> query method with multiple joins could cause multiple documents to be
> indexed instead of one.

They may be equivalent in terms of results, but they work differently
and probably will NOT have the same performance.

When using nested entities in DIH, the main entity results in one SQL
query, but the inner entities will result in a separate SQL query for
every single item returned by the main query.  If you have exactly 1
million rows in your main table and you're using a nested config with
two entities, you will be executing 1000001 queries.  DIH will be
spending a fair amount of time doing nothing but waiting for the latency
on a million individual queries via JDBC.  It probably also results in
extra work for the database server.

With a server-side join, you're down to one query via JDBC, and the
database server is doing the work of combining your tables, normally
something it can do very efficiently.

Thanks,
Shawn


RE: What is the difference between a Join Query and Embedded Entities in Solr DIH?

Posted by "Dyer, James" <Ja...@ingramcontent.com>.
Gustav,

DIH should give you the same results in both scenarios.  The performance trade-offs depend on your data.  In your case, it looks like there is a 1-to-1 or many-to-1 relationship between "item" and "member", so use the SQL Join.  You'll get all of your data in one query and you'll be using your rbdms for what it does best.

But in the case there was a 1-to-many relationship between "item" and "member", and especially if each "item" has several "member" rows, you might get better performance using the child entity setup.  Although by default DIH is going to do an "n+1" select on member.  For every row in item, it will issue a separate query to the db.  Also, DIH does not use prepared statements, so this might be a bad choice.  

To work around this, specify "cacheImpl='SortedMapBackedCache'" on the child entity (this is the same as using CachedSqlEntityProcessor instead of SqlEntityProcessor).  Do not include a "where" clause in this child entity.  Instead, specify "cacheKey='memberId'" and "cacheLookup='item.memberId'".  DIH will now pull down your entire "member" table in 1 query and cache it in memory, then it can do fast hash joins against "item".

But if your "member" table is too big to fit into memory, then you need to use a disk-backed cache instead of SortedMapBackedCache.  For that, see 
https://issues.apache.org/jira/browse/SOLR-2948 and https://issues.apache.org/jira/browse/SOLR-2613 .

James Dyer
Ingram Content Group
(615) 213-4311


-----Original Message-----
From: Gustav [mailto:xbihydhd@sharklasers.com] 
Sent: Thursday, April 25, 2013 9:17 AM
To: solr-user@lucene.apache.org
Subject: What is the difference between a Join Query and Embedded Entities in Solr DIH?

Hello guys, i saw this thread on stackoverflow, but still not satisfied with
the answers. 

I am trying to index data across multiple tables using Solr's Data Import
Handler. The official wiki on the DIH suggests using embedded entities to
link multiple tables like so:

<document>
    <entity name="item" pk="id" query="SELECT * FROM item">
        <entity name="member" pk="memberid" query="SELECT * FROM member
WHERE memberid='${item.memberid}'>
        </entity>
    </entity>
</document>

Another way that works is:

<document>
    <entity name="item" pk="id" query="SELECT * FROM item INNER JOIN member
ON item.memberid=member.memberid">
    </entity>
</document>

Are these two methods functionally different? Is there a performance
difference?

Another though would be that, if using join tables in MySQL, using the SQL
query method with multiple joins could cause multiple documents to be
indexed instead of one.




--
View this message in context: http://lucene.472066.n3.nabble.com/What-is-the-difference-between-a-Join-Query-and-Embedded-Entities-in-Solr-DIH-tp4058923.html
Sent from the Solr - User mailing list archive at Nabble.com.