Posted to solr-user@lucene.apache.org by Matt B <ma...@runbox.com> on 2015/03/02 21:04:42 UTC

Slow cross-core joins

I've recently inherited a Solr instance that is required to perform numerous joins between two cores, usually as filter queries, similar to the one below:

q=firstName:Matt&fq=-({!to=emailAddress toIndex=accounts type=join fromIndex=lists from=listValue}list_id:000038f2-351b-11e4-9579-001e67654bce OR {!to=emailDomain toIndex=accounts type=join fromIndex=lists from=listValue}list_id:000038f2-351b-11e4-9579-001e67654bce OR {!to=emailDomainReversed toIndex=accounts type=join fromIndex=lists from=listValue}list_id:000038f2-351b-11e4-9579-001e67654bce)

The accounts core is about 35GB with ~40,000,000 documents and the lists core is about 9 GB with 90,0000,000 documents.  There may be anywhere from one to one million documents in the lists core matching any particular list_id.  The idea is to filter a search query on the accounts core to include or exclude any documents with an email address, email domain, or reverse email domain that is found within the lists core for a particular list id.  The lists core is frequently updated on a daily basis with both additions and deletions.
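
For readers following along, the filter described above is the same join repeated once per target field. A minimal Python sketch of how such a filter query could be assembled is below. The helper names (join_clause, build_fq, build_query_string) are hypothetical; the field names, core names, and local-params layout are copied from the query in this post, and the main query is assumed to use the field:value form.

```python
from urllib.parse import urlencode

# Target fields on the accounts core, as named in the query above.
DEFAULT_TO_FIELDS = ("emailAddress", "emailDomain", "emailDomainReversed")


def join_clause(to_field, list_id, to_index="accounts",
                from_index="lists", from_field="listValue"):
    """Build one join clause using the same local-params layout as the post."""
    return (
        "{!to=%s toIndex=%s type=join fromIndex=%s from=%s}list_id:%s"
        % (to_field, to_index, from_index, from_field, list_id)
    )


def build_fq(list_id, to_fields=DEFAULT_TO_FIELDS, exclude=True):
    """OR the per-field joins together; a leading '-' excludes the matches."""
    joined = " OR ".join(join_clause(field, list_id) for field in to_fields)
    prefix = "-" if exclude else ""
    return "%s(%s)" % (prefix, joined)


def build_query_string(q, list_id, exclude=True):
    """URL-encode q and fq so the string can be appended to a /select URL."""
    return urlencode({"q": q, "fq": build_fq(list_id, exclude=exclude)})


if __name__ == "__main__":
    print(build_fq("000038f2-351b-11e4-9579-001e67654bce"))
```

Passing exclude=False drops the leading minus sign, which gives the "include" variant of the same filter.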

Not surprisingly, such queries are very slow, usually taking minutes to return any results.

Are there any possible strategies to significantly increase the performance of such queries?  The JVM max heap size is set to 16 GB and the server has 64 GB RAM.

Re: Slow cross-core joins

Posted by Mikhail Khludnev <mk...@griddynamics.com>.

Excuse me for hijacking the thread: I raised
https://issues.apache.org/jira/browse/LUCENE-6332. Please vote for it if you need it.

-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
<mk...@griddynamics.com>

Re: Slow cross-core joins

Posted by Matt B <ma...@runbox.com>.

Thanks all for the suggestions.  Regarding patch SOLR-4787, it seems like this will only work with long or int fields and not strings like email addresses.  But my coworker suggested the possibility of using a hash to generate long fields from the string fields, so I may try that out. 
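
A minimal sketch of the hashing idea, in Python, assuming the join would then run on a long field populated at index time in both cores. The function name email_to_long is hypothetical, SHA-256 is just one possible choice of hash, and lowercasing the value before hashing is an assumption about how the addresses should be normalized.

```python
import hashlib
import struct


def email_to_long(value: str) -> int:
    """Map a string to a signed 64-bit integer that fits a Java/Solr long."""
    # Assumption: values are compared case-insensitively and without
    # surrounding whitespace, so normalize before hashing.
    normalized = value.strip().lower()
    digest = hashlib.sha256(normalized.encode("utf-8")).digest()
    # ">q" unpacks the first 8 bytes as a big-endian signed 64-bit integer.
    return struct.unpack(">q", digest[:8])[0]


if __name__ == "__main__":
    print(email_to_long("matt@example.com"))
```

Two different strings can hash to the same long. With 64 bits that is unlikely but possible, so a join on the hashed field could occasionally match a document it should not.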

-Matt


On Mon, 2 Mar 2015 23:16:33 -0700, William Bell <bi...@gmail.com> wrote:

> I agree that join is slow. Adding fq on LocalParams is good. Has this been
> added to {!lucene} and other calls like join ?

Re: Slow cross-core joins

Posted by William Bell <bi...@gmail.com>.

I agree that join is slow. Adding fq on LocalParams is good. Has this been
added to {!lucene} and other calls like join?



On Mon, Mar 2, 2015 at 2:00 PM, Gopal Patwa <go...@gmail.com> wrote:

> You could give a try for this join contrib patch
>
> https://issues.apache.org/jira/browse/SOLR-4787



-- 
Bill Bell
billnbell@gmail.com
cell 720-256-8076

Re: Slow cross-core joins

Posted by Gopal Patwa <go...@gmail.com>.

You could give this join contrib patch a try:

https://issues.apache.org/jira/browse/SOLR-4787




Re: Slow cross-core joins

Posted by Mikhail Khludnev <mk...@griddynamics.com>.

On Mon, Mar 2, 2015 at 11:04 PM, Matt B <ma...@runbox.com> wrote:

> There may be anywhere from one to one million documents in the lists core
> matching any particular list_id.

Matt,

What about the reverse cardinality of this relation? I.e., for a particular
listValue term, how many list_ids are associated with it? It can be found by
faceting on the lists core:
q=list_id:000038f2-351b-11e4-9579-001e67654bce&facet=true&facet.field=listValue
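
A minimal Python sketch of running that facet request and reading the counts, assuming the default JSON response layout in which each facet field is a flat list of alternating terms and counts. The helper names (facet_query_params, parse_facet_counts) and the sample response are hypothetical; rows, facet.limit, and wt are standard Solr request parameters added here for convenience.

```python
from urllib.parse import urlencode


def facet_query_params(list_id, field="listValue", limit=100):
    """Encode the facet request above; rows=0 skips returning documents."""
    return urlencode({
        "q": "list_id:%s" % list_id,
        "rows": 0,
        "facet": "true",
        "facet.field": field,
        "facet.limit": limit,
        "wt": "json",
    })


def parse_facet_counts(response, field="listValue"):
    """Turn the flat [term, count, term, count, ...] list into a dict."""
    flat = response["facet_counts"]["facet_fields"][field]
    return dict(zip(flat[0::2], flat[1::2]))


if __name__ == "__main__":
    # Made-up response fragment, shown only to illustrate the layout.
    sample = {
        "facet_counts": {
            "facet_fields": {"listValue": ["example.com", 3, "a@example.org", 1]}
        }
    }
    print(facet_query_params("000038f2-351b-11e4-9579-001e67654bce"))
    print(parse_facet_counts(sample))
```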

-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
<mk...@griddynamics.com>