Posted to solr-user@lucene.apache.org by jo...@aol.com on 2014/04/30 22:29:01 UTC

Which Lucene search syntax is faster

Hi,


Given the following Lucene documents that I’m adding to my index (I expect to have over 10 million of them, ranging in size from 1 KB to 50 KB):


<add>
  <doc>
    <field name="doc_type">PDF</field>
    <field name="title">Some name</field>
    <field name="summary">Some summary</field>
    <field name="owner">Who owns this</field>
    <field name="price">10</field>
    <field name="isbn">1234567890</field>
  </doc>
  <doc>
    <field name="doc_type">DOC</field>
    <field name="title">Some name</field>
    <field name="summary">Some summary</field>
    <field name="owner">Who owns this</field>
    <field name="price">10</field>
    <field name="isbn">0987654321</field>
  </doc>
  <!-- and more doc's -->
</add>
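
For reference, here is a minimal Python sketch of how an update message like the one above could be generated. The helper name build_add_xml and the sample values are made up for illustration; only the <add>/<doc>/<field name="..."> layout comes from the message itself.

```python
import xml.etree.ElementTree as ET


def build_add_xml(docs):
    """Return an <add> update message for a list of {field: value} dicts.

    build_add_xml is a hypothetical helper, not part of any Solr client.
    """
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            # Each field becomes <field name="...">value</field>.
            field = ET.SubElement(doc_el, "field", {"name": name})
            field.text = str(value)
    return ET.tostring(add, encoding="unicode")


docs = [
    {"doc_type": "PDF", "title": "Some name", "isbn": "1234567890"},
    {"doc_type": "DOC", "title": "Some name", "isbn": "0987654321"},
]
add_xml = build_add_xml(docs)
print(add_xml)
```

ElementTree escapes characters such as & and < in field values, which is easy to get wrong when concatenating the XML by hand.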



My question is this: which Lucene search syntax will give me back results the fastest?  If my user is interested in finding data within the “title” and “owner” fields, only for “doc_type” “DOC”, should I build my Lucene search syntax as:
 
1) skyfall ian fleming AND doc_type:DOC
2) title:(skyfall OR ian OR fleming) owner:(skyfall OR ian OR fleming) AND doc_type:DOC
3) Something else I don't know about.


Of the 10 million documents I will be indexing, 80% will be of "doc_type" PDF, and about 10% of type DOC, so please keep that in mind as a factor (if that will mean anything in terms of which syntax I should use).


Thanks in advance,
 
- MJ 

Re: Which Lucene search syntax is faster

Posted by Shawn Heisey <so...@elyograg.org>.
On 4/30/2014 3:47 PM, johnmunir@aol.com wrote:
> Thank you Shawn and Erick for the quick response.
>
>
> A follow up question.
>
>
> Based on https://cwiki.apache.org/confluence/display/solr/Common+Query+Parameters#CommonQueryParameters-Thefq%28FilterQuery%29Parameter, I see the "fl" (field list) parameter.  Does this mean I can build my Lucene search syntax as follows:

The fl parameter determines which stored fields show up in the results. 
By default, all fields that are stored will be returned.  If you want
relevancy scores, you'd include the pseudofield named "score" --
&fl=*,score is something we see a lot.  The fl parameter does not affect
the *search* at all.
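
A small Python sketch of that distinction, using only the standard library. The helper name select_query is made up for illustration; q, fq and fl are the request parameters discussed in this thread. Note that fl only appears as a list of fields to return and plays no part in building the q string.

```python
from urllib.parse import urlencode, parse_qs


def select_query(q, fq=None, fl=None):
    """Build a query string; select_query is a hypothetical helper."""
    params = [("q", q)]
    if fq:
        params.append(("fq", fq))  # restricts the result set
    if fl:
        # fl lists the stored fields (and pseudofields such as score)
        # to return; it does not change which documents match.
        params.append(("fl", ",".join(fl)))
    return urlencode(params)


qs = select_query(
    "title:(skyfall ian fleming) owner:(skyfall ian fleming)",
    fq="doc_type:DOC",
    fl=["title", "owner", "score"],
)
print(qs)
```

To search only in title and owner, the field names have to go into q itself (or into qf with edismax), not into fl.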

>     q=skyfall OR ian OR fleming&fl=title&fl=owner&fq=doc_type:DOC
>
>
> And get the same result as (per Shawn's example, changed a bit to add OR):
>
>
>     q=title:(skyfall OR ian OR fleming) owner:(skyfall OR ian OR fleming)&fq=doc_type:DOC

Exactly right.

> Btw, my default search operator is set to AND.  My need is to find whatever the user types in both of those two fields (or maybe some other fields, which are controlled by the UI). For example, the user types "skyfall ian fleming", selects 3 fields, and wants to narrow down to doc_type DOC.

With the standard parser, you'd have to do the following.  Assume that
USERQUERY is a very basic query, perhaps a few terms, like your example
of "skyfall ian fleming".

q=field1:(USERQUERY) OR field2:(USERQUERY) OR field3:(USERQUERY)&fq=doc_type:DOC

With edismax, you'd do:

q=USERQUERY&qf=field1 field2 field3&fq=doc_type:DOC

You might also add "&pf=field1 field2 field3" ... and there are a great
many other edismax/dismax query parameters too.  The edismax parser does
some truly amazing stuff.
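
The two request shapes above could be assembled like this in Python. This is only a sketch: the function names are invented, the user input is not escaped, and defType=edismax is added on the assumption that the request handler does not already default to edismax.

```python
def standard_params(user_query, fields, doc_type):
    """Standard parser: repeat the user query once per selected field."""
    q = " OR ".join(f"{field}:({user_query})" for field in fields)
    return {"q": q, "fq": f"doc_type:{doc_type}"}


def edismax_params(user_query, fields, doc_type):
    """edismax: pass the user query as-is and list the fields in qf."""
    return {
        "defType": "edismax",  # assumption: not set in the handler defaults
        "q": user_query,
        "qf": " ".join(fields),
        "fq": f"doc_type:{doc_type}",
    }


selected = ["title", "owner"]
std = standard_params("skyfall ian fleming", selected, "DOC")
dis = edismax_params("skyfall ian fleming", selected, "DOC")
print(std["q"])
print(dis["qf"])
```

With edismax the list of fields chosen in the UI maps directly onto qf, so the query string itself never has to be rewritten.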

Echoing what both Erick and I said ... worrying about the exact syntax
is premature optimization.  10 million docs is something that Solr can
handle easily, as long as there's enough RAM.

Thanks,
Shawn


Re: Which Lucene search syntax is faster

Posted by jo...@aol.com.
Thank you Shawn and Erick for the quick response.


A follow up question.


Based on https://cwiki.apache.org/confluence/display/solr/Common+Query+Parameters#CommonQueryParameters-Thefq%28FilterQuery%29Parameter, I see the "fl" (field list) parameter.  Does this mean I can build my Lucene search syntax as follows:


    q=skyfall OR ian OR fleming&fl=title&fl=owner&fq=doc_type:DOC


And get the same result as (per Shawn's example, changed a bit to add OR):


    q=title:(skyfall OR ian OR fleming) owner:(skyfall OR ian OR fleming)&fq=doc_type:DOC


Btw, my default search operator is set to AND.  My need is to find whatever the user types in both of those two fields (or maybe some other fields, which are controlled by the UI). For example, the user types "skyfall ian fleming", selects 3 fields, and wants to narrow down to doc_type DOC.


- MJ





 

Re: Which Lucene search syntax is faster

Posted by Erick Erickson <er...@gmail.com>.
I'd add that I think you're worrying about the wrong thing. 10M
documents is not very many by modern Solr standards. I rather suspect
that you won't notice much difference in performance due to how you
construct the query.

Shawn's suggestion to use fq clauses is spot on, though. fq clauses
are re-used (see filterCache in solrconfig.xml). My rule of thumb is
to use fq clauses for most everything that does NOT contribute to
scoring...
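
A rough Python sketch of that rule of thumb: the scoring part of the request goes into q and each non-scoring restriction becomes its own fq parameter. The helper name build_params and the filter values are illustrative only.

```python
from urllib.parse import urlencode, parse_qsl


def build_params(scoring_query, filters):
    """Put scoring clauses in q and one fq per non-scoring restriction.

    build_params is a hypothetical helper; filters maps field -> value.
    """
    params = [("q", scoring_query)]
    # Sorting keeps the fq order stable from one request to the next.
    for field, value in sorted(filters.items()):
        params.append(("fq", f"{field}:{value}"))
    return params


params = build_params("title:(skyfall ian fleming)", {"doc_type": "DOC"})
qs = urlencode(params)
print(qs)
```

Because each fq is a separate parameter, a restriction such as doc_type:DOC can be served from the filterCache no matter what the user typed into q.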

Best,
Erick


Re: Which Lucene search syntax is faster

Posted by Shawn Heisey <so...@elyograg.org>.
On 4/30/2014 2:29 PM, johnmunir@aol.com wrote:
> My question is this: which Lucene search syntax will give me back results the fastest?  If my user is interested in finding data within the “title” and “owner” fields, only for “doc_type” “DOC”, should I build my Lucene search syntax as:
>  
> 1) skyfall ian fleming AND doc_type:DOC

If your default field is text, I'm fairly sure this will become
equivalent to the following which is probably NOT what you want. 
Parentheses can be very important.

text:skyfall OR text:ian OR (text:fleming AND doc_type:DOC)

> 2) title:(skyfall OR ian OR fleming) owner:(skyfall OR ian OR fleming) AND doc_type:DOC

This kind of query syntax is probably what you should shoot for.  Not
from a performance perspective -- just from the perspective of making
your queries completely correct.  Note that the +/- syntax combined with
parentheses is far more precise than using AND/OR/NOT.
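
As an illustration of the +/- style, here is a small Python sketch that spells out which clauses are required. The helper names are made up, and it assumes the default operator is OR, so the bare terms inside each field group stay optional.

```python
def required(clause):
    """Prefix a clause with + so it must match."""
    return f"+{clause}"


def prohibited(clause):
    """Prefix a clause with - so it must not match."""
    return f"-{clause}"


def any_of(clauses):
    """Group clauses in parentheses; at least one of them should match."""
    return "(" + " ".join(clauses) + ")"


terms = "skyfall ian fleming"
query = " ".join([
    required(any_of([f"title:({terms})", f"owner:({terms})"])),
    required("doc_type:DOC"),
])
print(query)
# +(title:(skyfall ian fleming) owner:(skyfall ian fleming)) +doc_type:DOC
```

Written this way, the grouping no longer depends on how AND and OR happen to bind to neighbouring terms.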

> 3) Something else I don't know about.

The edismax query parser is very powerful.  That might be something
you're interested in.

https://cwiki.apache.org/confluence/display/solr/The+Extended+DisMax+Query+Parser


> Of the 10 million documents I will be indexing, 80% will be of "doc_type" PDF, and about 10% of type DOC, so please keep that in mind as a factor (if that will mean anything in terms of which syntax I should use).

For the most part, whatever general query format you choose to use will
not matter very much.  There are exceptions, but mostly Solr (Lucene) is
smart enough to convert your query to an efficient final parsed format. 
Turn on the debugQuery parameter to see what it does with each query.
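
A minimal Python sketch for building such a debug request. The base URL below (localhost, port 8983, a core named collection1) is only an assumed example and the helper name debug_url is made up; adjust both for your installation.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumed example location of the select handler; not taken from this thread.
BASE_URL = "http://localhost:8983/solr/collection1/select"


def debug_url(q, fq=None, base_url=BASE_URL):
    """Return a select URL with debugQuery switched on."""
    params = [("q", q), ("debugQuery", "true"), ("wt", "json")]
    if fq:
        params.append(("fq", fq))
    return base_url + "?" + urlencode(params)


url = debug_url("skyfall ian fleming AND doc_type:DOC")
print(url)
```

Fetching that URL for each candidate syntax and comparing the parsed query in the debug section of the response shows whether two differently written queries end up as the same thing.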

Regardless of whether you use the standard lucene query parser or
edismax, incorporate filter queries into your query constructing logic. 
Your second example above would be better to express like this, with the
default operator set to OR.  This uses both q and fq parameters:

q=title:(skyfall ian fleming) owner:(skyfall ian fleming)&fq=doc_type:DOC

https://cwiki.apache.org/confluence/display/solr/Common+Query+Parameters#CommonQueryParameters-Thefq%28FilterQuery%29Parameter

Thanks,
Shawn