Posted to solr-user@lucene.apache.org by "Kanduru, Ajay (NIH/NLM/LHC) [C]" <ak...@mail.nih.gov> on 2011/07/18 18:48:01 UTC

Join performance?

I am trying to optimize the performance of Solr with our collection. The collection has 208M records with an index size of about 80GB. The machine has 16GB, of which I am allocating about 14GB to Solr.

I am using a self-join in a filter query like this:
q=(general search term)
fq={!join from=join_field to=join_field}(field1:(field1 search term) AND field2:(field2 search term) AND field3:(field3 search term))
...

Field definitions:
join_field: string type (Has ~27K terms)
field1: text type
field2: double type
field3: string type

The response time of an fq with the join is about ten times that of an fq without the join (~10 sec vs ~1 sec). Is this along expected lines? In general, what parameters, if any, can be tweaked? The intention is to use multiple such filter queries, hence the need for optimization. Sharding and more horsepower are obvious solutions, but I am more interested in optimizing for a given host and a given data collection.

Appreciate any insight in this regard.

-Ajay

Re: Join performance?

Posted by Yonik Seeley <yo...@lucidimagination.com>.

On Mon, Jul 18, 2011 at 12:48 PM, Kanduru, Ajay (NIH/NLM/LHC) [C]
<ak...@mail.nih.gov> wrote:
> I am trying to optimize the performance of Solr with our collection. The collection has 208M records with an index size of about 80GB. The machine has 16GB, of which I am allocating about 14GB to Solr.
>
> I am using a self-join in a filter query like this:
> q=(general search term)
> fq={!join from=join_field to=join_field}(field1:(field1 search term) AND field2:(field2 search term) AND field3:(field3 search term))
> ...
>
> Field definitions:
> join_field: string type (Has ~27K terms)
> field1: text type
> field2: double type
> field3: string type
>
> The response time of an fq with the join is about ten times that of an fq without the join (~10 sec vs ~1 sec). Is this along expected lines?

Yep... the initial join implementation is O(nterms), so it's expected
to be slow when the number of unique terms is high.
Given your index size, I would have almost expected it to be slower!
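
To illustrate why the cost scales with the number of unique terms, here is a minimal sketch of a term-enumeration join. It is a toy model, not Solr's actual code: the in-memory postings dictionaries and the helper names (build_postings, term_enumeration_join) are assumptions made up for this example. The point is only that every unique term in the join field is visited once, no matter how few documents the inner query matched.

```python
from collections import defaultdict


def build_postings(docs, field):
    """Map each term of `field` to the set of doc ids that contain it."""
    postings = defaultdict(set)
    for doc_id, doc in enumerate(docs):
        value = doc.get(field)
        if value is not None:
            postings[value].add(doc_id)
    return postings


def term_enumeration_join(from_postings, to_postings, from_matches):
    """Join by walking every term of the 'from' field.

    from_matches is the set of doc ids matched by the inner query.
    Returns the joined doc ids and the number of terms visited.
    """
    result = set()
    terms_visited = 0
    for term, from_docs in from_postings.items():
        terms_visited += 1  # one step per unique term: O(nterms)
        if from_docs & from_matches:
            # Some inner match carries this term, so every document
            # with the same term in the 'to' field joins in.
            result |= to_postings.get(term, set())
    return result, terms_visited


# Hypothetical toy index; a self join uses the same field on both sides.
docs = [
    {"join_field": "a", "field3": "x"},
    {"join_field": "a", "field3": "y"},
    {"join_field": "b", "field3": "y"},
    {"join_field": "c", "field3": "z"},
]
postings = build_postings(docs, "join_field")
inner = {i for i, d in enumerate(docs) if d["field3"] == "x"}
joined, visited = term_enumeration_join(postings, postings, inner)
```

Here the inner query matches a single document, yet all three join_field terms are still visited. Each visit also involves a set intersection, so the total work grows with both the term count and the size of the postings.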

As with faceting, I expect there to be other implementations in the
future, but nothing right now...

-Yonik
http://www.lucidimagination.com

> In general, what parameters, if any, can be tweaked? The intention is to use multiple such filter queries, hence the need for optimization. Sharding and more horsepower are obvious solutions, but I am more interested in optimizing for a given host and a given data collection.
>
> Appreciate any insight in this regard.
>
> -Ajay
>